Extracting HTML from an XML file using simpleXML
I'm reading an xml file generated by a 3rd-party application that includes the following:
<Cell>
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">
Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
What I'm trying to do is access the raw data from the Cell->Comment->Data element either "as is" or as an actual block of (X)HTML markup (preferably the latter).
if (isset($cell->Comment)) {
    echo 'comment found<br />';
    $commentAttributes = $cell->Comment->attributes($namespaces['ss']);
    if (isset($commentAttributes->Author)) {
        echo 'Author: ',(string)$commentAttributes->Author,'<br />';
    }
    $commentData = $cell->Comment->children($namespaces['ss']);
    var_dump($commentData);
    echo '<br />';
}
gives me:
comment found
Author: Mark Baker
object(SimpleXMLElement)#130 (2) { ["@attributes"]=> array(1) { ["Author"]=> string(10) "Mark Baker" } ["Data"]=> object(SimpleXMLElement)#129 (0) { } }
while
if (isset($cell->Comment)) {
    echo 'comment found<br />';
    $commentAttributes = $cell->Comment->attributes($namespaces['ss']);
    if (isset($commentAttributes->Author)) {
        echo 'Author: ',(string)$commentAttributes->Author,'<br />';
    }
    $commentData = $cell->Comment->Data->children();
    var_dump($commentData);
    echo '<br />';
}
gives me:
comment found
Author: Mark Baker
object(SimpleXMLElement)#129 (2) { ["B"]=> object(SimpleXMLElement)#118 (1) { ["Font"]=> string(11) "Mark Baker:" } ["Font"]=> string(21) " Comment 1 - No align" }
Unfortunately, simpleXML seems to be treating the whole element as a series of XML nodes. I'm sure I should be able to get this as raw data without complex looping or feeding the element to a DOM parser, perhaps by using the xmlns="http://www.w3.org/TR/REC-html40" namespace to extract it cleanly, but I can't figure out how.
Any help appreciated.
A more complex example of the XML data:
<Cell>
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40">
<B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">
</Font><B><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#000000">Rich </Font><U><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#FF0000">Text </Font></U><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#000000">Comment</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Center Aligned</Font>
</ss:Data>
</Comment>
</Cell>
Comments (4)
I've gone with a quick and dirty solution for the time being. In the longer term, I'll switch to using XMLReader (for all the reasons mentioned)... I just don't have the time to rewrite all the existing simpleXML code at the moment.
I've gone with:
While I'd prefer to keep the HTML markup, that will require additional work, so I'm simply stripping all the markup leaving me with the plain text (which is the critical element).
While this is a far from perfect solution, it does what I need it to do (for the moment), and I can move on to the next item in my "to do" list, having already added a new item of "rewrite using XMLReader" to that list.
Thanks for the help. I'll be sure to revisit this thread when I am doing that rewrite.
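The code that followed "I've gone with:" is not preserved in this copy of the thread. As a minimal sketch of the markup-stripping approach the answer describes, and not the poster's actual code, one could serialize the element with asXML() and pass the result through strip_tags(). The sample document, the SpreadsheetML namespace URI, and the variable names below are assumptions made for illustration.

```php
<?php
// Hypothetical sample input; the ss namespace URI is an assumption,
// since the original post does not show the root element.
$xml = <<<'XML'
<Cell xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
XML;

$ssNs = 'urn:schemas-microsoft-com:office:spreadsheet';
$cell = simplexml_load_string($xml);

// <ss:Data> is a prefixed element, so it is reached through children($ssNs).
$data = $cell->Comment->children($ssNs)->Data;

// Serialize the element (wrapper tag included), then drop every tag.
$plain = strip_tags($data->asXML());

echo $plain, PHP_EOL;
```

Note that strip_tags() leaves entities such as &amp;amp; encoded, so html_entity_decode() may be needed if the comment text can contain them.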
So I know your question has come and gone, but I had the same issue and I had to figure out how I wanted to handle it as well. For future generations, here's how I got it.
If you're only accepting (x)HTML:
If you think someone's going to put in XML and you're OK with that, you'll only want to kill the first, automatically generated XML tag:
So your code would look like this:
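The code blocks of this answer are missing from this copy, so the following is only one possible reading of the two cases it describes, not the answerer's code. It assumes that "only accepting (x)HTML" means whitelisting a few tags with strip_tags(), and that "the first, automatically generated XML tag" means the wrapper element emitted by asXML(). The sample document, the namespace URI, and the allowed-tag list are illustrative assumptions.

```php
<?php
// Hypothetical sample input; the ss namespace URI is an assumption.
$xml = <<<'XML'
<Cell xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
XML;

$ssNs = 'urn:schemas-microsoft-com:office:spreadsheet';
$cell = simplexml_load_string($xml);
$raw  = trim($cell->Comment->children($ssNs)->Data->asXML());

// Case 1: keep only a whitelist of HTML tags; everything else,
// including the <ss:Data> wrapper, is stripped.
$htmlOnly = strip_tags($raw, '<b><u><font>');

// Case 2: accept any inner markup and remove just the wrapper tags.
$innerOnly = preg_replace('~^<ss:Data\b[^>]*>|</ss:Data>$~', '', $raw);

echo $htmlOnly, PHP_EOL;
echo $innerOnly, PHP_EOL;
```

The whitelist variant is stricter but silently drops unknown tags; the wrapper-only variant keeps whatever markup the file contains.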
If the HTML inside the <ss:Data> element is considered to be a string literal, it has to be wrapped into a CDATA section, as was already hinted in the comments. If it's not in a CDATA section, it will be considered nodes. Then you'd be looking for the innerXml of the <ss:Data> element to get that as raw XML. Unfortunately, neither SimpleXml nor DOM has a native way to fetch that directly. You'd have to use a userland implementation.
Userland implementations of innerXml usually either iterate over all the child nodes and concatenate their raw XML, dump the entire tree and string-replace the root node, or create a fragment or import the nodes into another document.
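The first userland strategy mentioned above (iterate over the child nodes and concatenate their raw XML) might look like the sketch below. The helper name innerXml(), the sample document, and the namespace URI are assumptions for illustration; dom_import_simplexml() is used so that text nodes sitting directly inside the element are kept as well, without re-parsing the document.

```php
<?php
// Hypothetical helper: returns the serialized children of an element.
function innerXml(SimpleXMLElement $element): string
{
    // Wraps the same underlying node as a DOMElement; nothing is re-parsed.
    $node = dom_import_simplexml($element);
    $out  = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes a single node (element or text).
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

// Hypothetical sample input; the ss namespace URI is an assumption.
$xml = <<<'XML'
<Cell xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
XML;

$ssNs  = 'urn:schemas-microsoft-com:office:spreadsheet';
$cell  = simplexml_load_string($xml);
$inner = innerXml($cell->Comment->children($ssNs)->Data);

echo $inner, PHP_EOL;
```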
I am not aware of any other way to do that. Not sure if this would be possible with XSLT. XMLReader has a readInnerXML method though.
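For the XMLReader route, a rough sketch under assumed inputs could walk the stream until it reaches the <ss:Data> element and then call readInnerXml() on it. The sample document and the namespace URI are assumptions; the serialized result may carry extra namespace declarations added by libxml, so treat the exact string as implementation-dependent.

```php
<?php
// Hypothetical sample input; the ss namespace URI is an assumption.
$xml = <<<'XML'
<Cell xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
XML;

$ssNs   = 'urn:schemas-microsoft-com:office:spreadsheet';
$inner  = '';
$reader = new XMLReader();
$reader->XML($xml);

// Advance node by node until the ss:Data start tag is reached.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->localName === 'Data'
        && $reader->namespaceURI === $ssNs) {
        // Markup of the element's content, without the wrapper tag.
        $inner = $reader->readInnerXml();
        break;
    }
}
$reader->close();

echo $inner, PHP_EOL;
```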
If your implementation were to use DOM, I believe you could do the following:
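The code for this answer is not preserved in this copy. As a hedged sketch of what a DOM-only version could look like, and not the answerer's original, the following loads the file with DOMDocument, locates ss:Data with DOMXPath, and serializes each child node. The sample document, the namespace URI, and the XPath expression are assumptions.

```php
<?php
// Hypothetical sample input; the ss namespace URI is an assumption.
$xml = <<<'XML'
<Cell xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('ss', 'urn:schemas-microsoft-com:office:spreadsheet');

$comments = [];
foreach ($xpath->query('//ss:Data') as $data) {
    $markup = '';
    foreach ($data->childNodes as $child) {
        // Serialize each child (elements and text nodes) in document order.
        $markup .= $doc->saveXML($child);
    }
    $comments[] = $markup;
}

print_r($comments);
```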