Extracting HTML from an XML file using simpleXML

Posted on 2024-10-14 00:39:21

I'm reading an xml file generated by a 3rd-party application that includes the following:

<Cell>
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>

What I'm trying to do is access the raw data from the Cell->Comment->Data element either "as is" or as an actual block of (X)HTML markup (preferably the latter).

if (isset($cell->Comment)) {
    echo 'comment found<br />';
    $commentAttributes = $cell->Comment->attributes($namespaces['ss']);
    if (isset($commentAttributes->Author)) {
        echo 'Author: ',(string)$commentAttributes->Author,'<br />';
    }
    $commentData = $cell->Comment->children($namespaces['ss']);
    var_dump($commentData);
    echo '<br />';
}

gives me:

comment found
Author: Mark Baker
object(SimpleXMLElement)#130 (2) { ["@attributes"]=> array(1) { ["Author"]=> string(10) "Mark Baker" } ["Data"]=> object(SimpleXMLElement)#129 (0) { } } 

while

if (isset($cell->Comment)) {
    echo 'comment found<br />';
    $commentAttributes = $cell->Comment->attributes($namespaces['ss']);
    if (isset($commentAttributes->Author)) {
        echo 'Author: ',(string)$commentAttributes->Author,'<br />';
    }
    $commentData = $cell->Comment->Data->children();
    var_dump($commentData);
    echo '<br />';
}

gives me:

comment found
Author: Mark Baker
object(SimpleXMLElement)#129 (2) { ["B"]=> object(SimpleXMLElement)#118 (1) { ["Font"]=> string(11) "Mark Baker:" } ["Font"]=> string(21) " Comment 1 - No align" } 

Unfortunately, simpleXML seems to be treating the whole element as a series of XML nodes. I'm sure I should be able to get this as raw data without complex looping or feeding the element to a DOM parser, perhaps by using the xmlns="http://www.w3.org/TR/REC-html40" namespace to extract it cleanly, but I can't figure out how.

Any help appreciated.

A more complex example of the XML data:

<Cell>
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40">
            <B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">&#10;</Font><B><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#000000">Rich </Font><U><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#FF0000">Text </Font></U><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#000000">Comment</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Center Aligned</Font>
        </ss:Data>
    </Comment>
</Cell>


Comments (4)

看春风乍起 2024-10-21 00:39:21

I've gone with a quick and dirty solution for the time being. In the longer term, I'll switch to using XMLReader (for all the reasons mentioned)... I just don't have the time to rewrite all the existing simpleXML code at the moment.

I've gone with:

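$node = $cell->Comment->Data->asXML();
// 49 = length of the opening <ss:Data xmlns="http://www.w3.org/TR/REC-html40"> tag,
// 10 = length of the closing </ss:Data> tag
$comment = substr($node,49,-10);
$comment = strip_tags($comment);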

While I'd prefer to keep the HTML markup, that will require additional work, so I'm simply stripping all the markup leaving me with the plain text (which is the critical element).

While this is a far from perfect solution, it does what I need it to do (for the moment), and I can move on to the next item in my "to do" list, having already added a new item of "rewrite using XMLReader" to that list.

Thanks for the help. I'll be sure to revisit this thread when I am doing that rewrite.
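
For anyone who wants to keep the markup rather than strip it, here is a minimal sketch that removes the wrapper tags with a pattern instead of the fixed offsets 49 and -10. It is an illustration under assumptions, not the code I actually used: stripWrapper() is a hypothetical helper name, and the namespace URIs on the sample <Cell> are the ones Excel 2003 XML files normally declare on the root element, which the excerpt above does not show.

```php
<?php
// Assumed namespace URI for the ss: prefix (declared on the workbook root in real files).
const SS_NS = 'urn:schemas-microsoft-com:office:spreadsheet';

// Hypothetical helper: drops the first opening tag and the last closing tag
// of a serialized element, leaving its inner markup untouched.
function stripWrapper(string $outerXml): string
{
    $withoutOpen = preg_replace('/^<[^>]+>/', '', $outerXml, 1);
    return preg_replace('/<\/[^>]+>\s*$/', '', $withoutOpen, 1);
}

$xml = <<<'XML'
<Cell xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>
XML;

$cell = simplexml_load_string($xml);

// Select <ss:Data> explicitly by namespace URI rather than relying on the prefix.
$data = $cell->Comment->children(SS_NS)->Data;

$markup = stripWrapper($data->asXML()); // inner markup, tags preserved
$text   = strip_tags($markup);          // plain text, as in the quick fix above

echo $markup, "\n";
echo $text, "\n";
```

The pattern only assumes that the serialized string starts with one opening tag and ends with one closing tag, so it keeps working if the xmlns value or the prefix changes.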

清醇 2024-10-21 00:39:21

So I know your question has come and gone, but I had the same issue and I had to figure out how I wanted to handle it as well. For future generations, here's how I got it.

If you're only accepting (x)HTML:

$data = str_replace('<?xml version="1.0"?>','',$xmlNode->asXML());

If you think someone's going to put in XML and you're OK with that, you'll only want to kill the first, automatically generated XML tag:

$data = preg_replace('/^<\?xml version="1.0"\?\>\n/', '',$xmlNode->asXML());

So your code would look like this:

if (isset($cell->Comment)) {
    echo 'comment found<br />';
    $commentAttributes = $cell->Comment->attributes($namespaces['ss']);
    if (isset($commentAttributes->Author)) {
        echo 'Author: ',(string)$commentAttributes->Author,'<br />';
    }
    $commentData = str_replace('<?xml version="1.0"?>','',$cell->Comment->Data->asXML());
    echo $commentData;
    echo '<br />';
}
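
One caveat with matching the literal '<?xml version="1.0"?>': if the source file declares an encoding, the declaration written by asXML() will not match that exact string. As far as I know, asXML() also only writes a declaration when it is called on the document's root element, so for a nested node the replacement may simply do nothing. A slightly more tolerant sketch, where stripXmlDeclaration() is a hypothetical helper name and the sample document is made up for illustration:

```php
<?php
// Hypothetical helper: removes a leading XML declaration, whatever attributes
// it carries, and returns the input unchanged when there is none.
function stripXmlDeclaration(string $xml): string
{
    return preg_replace('/^\s*<\?xml\b[^?]*\?>\s*/', '', $xml, 1);
}

$doc = simplexml_load_string(
    '<?xml version="1.0" encoding="UTF-8"?><Cell><Comment><Data><B>Mark Baker:</B></Data></Comment></Cell>'
);

// Root element: the serialized string starts with a declaration, which is removed.
$rootXml = stripXmlDeclaration($doc->asXML());

// Nested element: nothing to remove, so the markup passes through as is.
$childXml = stripXmlDeclaration($doc->Comment->Data->asXML());

echo $rootXml, "\n";
echo $childXml, "\n";
```

Note that the nested result still includes the wrapping <Data> tags; only the declaration is handled here.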
忘东忘西忘不掉你 2024-10-21 00:39:21

If the HTML inside the <ss:Data> element is meant to be a string literal, it has to be wrapped in a CDATA section, as was already hinted in the comments:

$xml = <<< XML
<Cell>
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40">
            <![CDATA[
                <B><Font html:Face="Tahoma" … html:Color="#000000">
            ]]>
        </ss:Data>
    </Comment>
</Cell>
XML;
libxml_use_internal_errors(TRUE);
$cell = simplexml_load_string($xml);
echo $cell->Comment->Data;

If it's not in a CDATA section, its contents will be parsed as nodes. You'd then be looking for the innerXml of the <ss:Data> element to get them back as raw XML. Unfortunately, neither SimpleXML nor DOM has a native way to fetch that directly, so you'd have to use a userland implementation.

Userland implementations of innerXml usually take one of three approaches: they iterate over all the child nodes and concatenate their raw XML; they dump the entire subtree and strip the root node's tags with string replacement; or they create a fragment or import the nodes into another document.
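
A minimal sketch of the first approach (iterate and concatenate), starting from a SimpleXMLElement since that is what the question uses. The name innerXml() is a hypothetical helper, and the namespace URIs on the sample <Cell> are assumptions taken from typical Excel 2003 XML files:

```php
<?php
// Hypothetical helper: serializes every child node (elements and text alike)
// of the given element and joins the pieces together.
function innerXml(SimpleXMLElement $element): string
{
    // Switch to the DOM view of the same node so text nodes are visible too.
    $node = dom_import_simplexml($element);
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$xml = <<<'XML'
<Cell xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma">Mark Baker:</Font></B><Font html:Face="Tahoma">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>
XML;

$cell = simplexml_load_string($xml);
$data = $cell->Comment->children('urn:schemas-microsoft-com:office:spreadsheet')->Data;

$inner = innerXml($data);
echo $inner, "\n";
```

Going through dom_import_simplexml() does not reparse anything; both APIs share the same underlying tree, so the existing SimpleXML code can stay as it is.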

I am not aware of any other way to do that, and I'm not sure whether it would be possible with XSLT. XMLReader does have a readInnerXml() method, though.
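
A rough sketch of the XMLReader route, assuming the same sample data as in the question with namespace declarations added to <Cell> (those URIs are assumptions; the excerpt does not show them). The exact serialization may differ from the source text, for example namespace declarations may be repeated on the child elements, so treat the result as equivalent XML rather than a byte-for-byte copy:

```php
<?php
$xml = <<<'XML'
<Cell xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma">Mark Baker:</Font></B><Font html:Face="Tahoma">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$innerXml = null;
// Walk the document node by node until the <ss:Data> start tag is reached.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'ss:Data') {
        // Returns the content of the current element, markup included.
        $innerXml = $reader->readInnerXml();
        break;
    }
}
$reader->close();

echo $innerXml, "\n";
```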

凡尘雨 2024-10-21 00:39:21

If your implementation were to use DOM, I believe you could do the following:

// given $node is the <ss:Data> DOMElement

$frag = $node->ownerDocument->createDocumentFragment();
foreach($node->childNodes as $child){
    $frag->appendChild($child->cloneNode(true));
}
$string = $node->ownerDocument->saveXML($frag);