Combining XMLReader and SimpleXML, with conditions
I am using a combination of XMLReader and SimpleXML to parse the posts in a WordPress export file. I realize this is a little out of the norm, but it's more of a backup project, so we can easily pull up one of these articles if we need it in the future. The WP site that they were on needs to come down.
The issue I am having is that some of the nodes in the XML file are empty or contain useless values (i.e., not full posts). I need to add some string-length conditions, but I'm not sure how to check for each one.
<?php
$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';
$reader = new XMLReader();
$reader->open($path_to_xml_file);
while($reader->read())
{
    if($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item')
    {
        $doc = new DOMDocument('1.0', 'UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(),true));
        //echo $xml->title; //or whatever
        // Take care of the articles
        $newcontent = $xml->children('http://purl.org/rss/1.0/modules/content/');
        $contentString = $newcontent->encoded;
        $titleString = $xml->title;
        echo '
        <div class="article-container" id="article-' . $xml->title . '">
        <a href="#top" class="top-link">Back to the Top</a>
        <h2>' . $xml->title . '</h2>
        <div class="articles">' . $newcontent->encoded . '</div>
        </div>';
    }
}
?>
I was able to check this successfully with SimpleXML alone, but it used too much memory on its own. This was my SimpleXML code:
<?php
$url = 'wordpress.2011.xml.gz';
$xml = new SimpleXMLElement("compress.zlib://$url", NULL, TRUE);
foreach ($xml->item as $item) :
    $newcontent = $item->children('http://purl.org/rss/1.0/modules/content/');
?>
<?php
    $contentString = $newcontent->encoded;
    $titleString = $item->title;
    if ((strlen($contentString) < 13) || (strlen($titleString) < 5)) {
        echo '';
    } else {
        echo '
        <div class="article-container" id="article-' . $item->title . '">
        <a href="#top" class="top-link">Back to the Top</a>
        <h2>' . $item->title . '</h2>
        <div class="articles">' . $newcontent->encoded . '</div>
        </div>';
    }
?>
<?php endforeach; ?>
UPDATE
With Francis' help, it is working now. Here is the code:
<?php
$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';
$reader = new XMLReader();
$reader->open($path_to_xml_file);
$contentNS = 'http://purl.org/rss/1.0/modules/content/';
while($reader->read()) {
    if($reader->nodeType == XMLReader::ELEMENT and $reader->name == 'item') {
        $doc = new DOMDocument('1.0','UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $titleString = (string) $xml->title;
        $contentString = (string) $xml->children($contentNS)->encoded;
        if (strlen($contentString) > 12 and strlen($titleString) > 4) {
            // Be careful with your output escaping!
            // This below looks like it might be wrong:
            // - $titleString for an ID (use slug)
            // - $titleString not escaped
            // - $contentString should be escaped? not sure here.
            // Have you considered using XMLWriter()?
            echo '
            <div class="article-container" id="article-' . $titleString . '">
            <a href="#top" class="top-link">Back to the Top</a>
            <h2>' . $titleString . '</h2>
            <div class="articles">' . $contentString . '</div>
            </div>';
        } else {
            echo'';
        }
        $reader->next(); //skip the subtrees, go to next item sibling
        // we already expand()ed this so we don't need to walk it.
    }
}
?>
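The comments in the code above flag two output problems: the title is used unescaped, and it is also used directly as an element ID. A minimal sketch of one way to handle both is below. The helper names article_slug() and render_article() are hypothetical (they are not part of the original code), the slug rule only handles ASCII titles, and the post body is assumed to be trusted HTML that should pass through unescaped.

```php
<?php
// Hypothetical helper (not from the original post): turn a title into an id-safe slug.
// Assumption: titles are mostly ASCII; other characters are simply dropped.
function article_slug(string $title): string
{
    $slug = strtolower(trim($title));
    // Replace every run of characters other than a-z and 0-9 with one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

// Hypothetical helper: render one article with the title escaped for HTML.
function render_article(string $title, string $content): string
{
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    return '<div class="article-container" id="article-' . article_slug($title) . '">' . "\n"
        . '<a href="#top" class="top-link">Back to the Top</a>' . "\n"
        . '<h2>' . $safeTitle . '</h2>' . "\n"
        // Assumption: content:encoded already holds trusted HTML, so it is not escaped.
        . '<div class="articles">' . $content . '</div>' . "\n"
        . '</div>';
}

echo render_article('Tom & Jerry: "Best of" 2011', '<p>Post body</p>');
```

Inside the loop, the echo could then become echo render_article($titleString, $contentString);. Two titles that reduce to the same slug would still produce duplicate IDs, so a counter or the post ID may be a better choice if that matters.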
When you say $contentString = $newcontent->encoded, the type of $contentString is not string but SimpleXMLElement. Thus strlen() is returning something nonsensical.

You need to explicitly cast SimpleXMLElements to string to get the text value of the element, for example (string) $newcontent->encoded.

As an aside, you can simplify your DOM expansion and conversion to SimpleXMLElement by using the optional argument to XMLReader::expand(). It takes the target DOMDocument, so the separate importNode() call is no longer needed.

EDIT: I added a complete example of your first code block, written to do what you want (I think?); it appears under UPDATE in the question. As you can see, all I did was take the inner loop from your second code example and put it in the inner loop of your first code example.
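The two points above (cast to string before calling strlen(), and pass a DOMDocument to XMLReader::expand()) can be sketched together as follows. This is only an illustration: the inline $sample document and the $kept array are assumptions made so the snippet runs on its own, standing in for the real gzipped export and the HTML output.

```php
<?php
// Inline sample standing in for the WordPress export (an assumption for this sketch).
$sample = <<<'XML'
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>A real post title</title>
      <content:encoded><![CDATA[<p>Long enough body text.</p>]]></content:encoded>
    </item>
    <item>
      <title>x</title>
      <content:encoded></content:encoded>
    </item>
  </channel>
</rss>
XML;

$contentNS = 'http://purl.org/rss/1.0/modules/content/';
$kept = []; // titles of the items that pass the length checks

$reader = new XMLReader();
$reader->XML($sample);

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        // Passing a DOMDocument to expand() returns a node that already belongs
        // to that document, so no importNode() call is needed.
        $doc  = new DOMDocument('1.0', 'UTF-8');
        $item = simplexml_import_dom($reader->expand($doc));

        // $item->title is a SimpleXMLElement; the cast yields its text value.
        $titleString   = (string) $item->title;
        $contentString = (string) $item->children($contentNS)->encoded;

        if (strlen($contentString) > 12 && strlen($titleString) > 4) {
            $kept[] = $titleString;
        }
    }
}
$reader->close();

print_r($kept);
```

With this sample, only the first item passes both checks; the second has a one-character title and an empty body, so it is skipped.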