Combining XMLReader and SimpleXML, with conditions

Posted on 2024-12-21 15:11:35

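I am using a combination of XMLReader and SimpleXML to parse the posts in a WordPress export file. I realize this is a little out of the norm, but it is more of a backup project, so that we can easily pull up one of these articles if we need it in the future. The WP site they were on needs to come down.

The issue I am having is that some of the nodes in the XML file are empty or contain useless values (i.e. not full posts). I need to add some string-length conditions, but I'm not sure how to check each one.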

<?php

$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';

$reader = new XMLReader();
$reader->open($path_to_xml_file);
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item') {
        $doc = new DOMDocument('1.0', 'UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(), true));
        //echo $xml->title; //or whatever

        // Take care of the articles
        $newcontent = $xml->children('http://purl.org/rss/1.0/modules/content/');
        $contentString = $newcontent->encoded;
        $titleString = $xml->title;

        echo '
    <div class="article-container" id="article-' .  $xml->title . '">
    <a href="#top" class="top-link">Back to the Top</a>
        <h2>' .  $xml->title . '</h2>
        <div class="articles">' . $newcontent->encoded . '</div>
    </div>';
    }
}

?>

I was able to check this successfully with SimpleXML alone, but on its own it used too much memory. This was my SimpleXML code:

<?php

$url = 'wordpress.2011.xml.gz';
$xml = new SimpleXMLElement("compress.zlib://$url", NULL, TRUE);

foreach ($xml->item as $item) {
    $newcontent = $item->children('http://purl.org/rss/1.0/modules/content/');
    $contentString = $newcontent->encoded;
    $titleString = $item->title;

    if ((strlen($contentString) < 13) || (strlen($titleString) < 5)) {
        echo '';
    } else {
        echo '
    <div class="article-container" id="article-' .  $item->title . '">
    <a href="#top" class="top-link">Back to the Top</a>
        <h2>' .  $item->title . '</h2>
        <div class="articles">' . $newcontent->encoded . '</div>
    </div>';
    }
}

?>

UPDATE

With Francis' help, it is working now. Here is the code:

<?php 

$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';

$reader = new XMLReader();
$reader->open($path_to_xml_file);
$contentNS = 'http://purl.org/rss/1.0/modules/content/';
while($reader->read()) {
    if($reader->nodeType == XMLReader::ELEMENT and $reader->name == 'item') {
        $doc = new DOMDocument('1.0','UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $titleString = (string) $xml->title;
        $contentString = (string) $xml->children($contentNS)->encoded;
        if (strlen($contentString) > 12 and strlen($titleString) > 4)  {
            // Be careful with your output escaping!
            // This below looks like it might be wrong:
            // - $titleString for an ID (use slug)
            // - $titleString not escaped
            // - $contentString should be escaped? not sure here.
            // Have you considered using XMLWriter()?
            echo '
<div class="article-container" id="article-' .  $titleString . '">
    <a href="#top" class="top-link">Back to the Top</a>
    <h2>' .  $titleString . '</h2>
    <div class="articles">' . $contentString . '</div>
</div>';
        } else {

        echo'';

        }

        $reader->next(); //skip the subtrees, go to next item sibling
        // we already expand()ed this so we don't need to walk it.
    }
}

?>



风柔一江水 2024-12-28 15:11:35

When you say $contentString = $newcontent->encoded, the type of $contentString is not string but SimpleXMLElement. Thus strlen() is returning something nonsensical.

You need to explicitly cast a SimpleXMLElement to string to get the text value of the element:

$contentString = (string) $newcontent->encoded;
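
To make the difference concrete, here is a minimal sketch of the cast followed by the length check from the question. The `$sample` markup is an invented stand-in for one `<item>` of a WordPress export, not data from the original file.

```php
<?php
// Hypothetical sample standing in for a single <item> of the export file.
$sample = <<<XML
<item xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <title>Hello world</title>
  <content:encoded><![CDATA[<p>A full post body.</p>]]></content:encoded>
</item>
XML;

$contentNS = 'http://purl.org/rss/1.0/modules/content/';
$item = new SimpleXMLElement($sample);

// Without a cast, the property access yields another SimpleXMLElement.
$node = $item->children($contentNS)->encoded;

// With the cast, we get plain strings that strlen() can measure directly.
$contentString = (string) $node;
$titleString   = (string) $item->title;

// Same thresholds as in the question: skip short or empty entries.
$keep = strlen($contentString) > 12 && strlen($titleString) > 4;

var_dump($node instanceof SimpleXMLElement); // bool(true)
var_dump(is_string($contentString));         // bool(true)
var_dump($keep);                             // bool(true)
```

Casting once and reusing the string variables also keeps the later echo statements free of repeated property lookups.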

As an aside, you can simplify your DOM expansion and conversion to SimpleXMLElement by using the optional argument to XMLReader::expand():

$sxe = simplexml_import_dom($reader->expand(new DOMDocument('1.0','UTF-8')));
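
A small self-contained sketch of that one-liner in use is shown here. The two-item `$feed` string is made up for illustration, and the loop uses `next('item')` to hop from one `<item>` to the next, which is a common variant rather than the exact loop from the question.

```php
<?php
// Hypothetical two-item feed used only for illustration.
$feed = '<rss><channel>'
      . '<item><title>First post</title></item>'
      . '<item><title>Second post</title></item>'
      . '</channel></rss>';

$reader = new XMLReader();
$reader->XML($feed);

// Advance to the first <item> element.
while ($reader->read() && $reader->name !== 'item') {
    // keep reading until an <item> is reached or the input ends
}

$titles = [];
while ($reader->name === 'item') {
    // expand() copies the current subtree into the DOMDocument passed to it,
    // so no separate importNode() call is needed.
    $sxe = simplexml_import_dom($reader->expand(new DOMDocument('1.0', 'UTF-8')));
    $titles[] = (string) $sxe->title;

    // Skip the subtree we just expanded and move to the next <item>, if any.
    $reader->next('item');
}
$reader->close();

print_r($titles);
```

Only one `<item>` subtree is held as a DOM/SimpleXML object at a time, which is the point of combining the two APIs for a large export file.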

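EDIT: Here is a complete example of your first code block, rewritten to do what you want (I think). As you can see, all I did was take the body of the inner loop from your second code example and put it in the inner loop of your first.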

$reader = new XMLReader();
$reader->open($path_to_xml_file);
$contentNS = 'http://purl.org/rss/1.0/modules/content/';
while($reader->read()) {
    if($reader->nodeType == XMLReader::ELEMENT and $reader->name == 'item') {
        $xml = simplexml_import_dom($reader->expand(new DOMDocument('1.0', 'UTF-8')));
        $titleString = (string) $xml->title;
        $contentString = (string) $xml->children($contentNS)->encoded;
        if (strlen($contentString) > 12 and strlen($titleString) > 4)  {
            // Be careful with your output escaping!
            // This below looks like it might be wrong:
            // - $titleString for an ID (use slug)
            // - $titleString not escaped
            // - $contentString should be escaped? not sure here.
            // Have you considered using XMLWriter()?
            echo '
<div class="article-container" id="article-' .  $titleString . '">
    <a href="#top" class="top-link">Back to the Top</a>
    <h2>' .  $titleString . '</h2>
    <div class="articles">' . $contentString . '</div>
</div>';
        }
        $reader->next(); //skip the subtrees, go to next item sibling
        // we already expand()ed this so we don't need to walk it.
    }
}
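
The comments in the code above flag the output escaping. One possible way to address them is sketched below; `slugify()` and `renderArticle()` are hypothetical helpers invented for this example, and the sketch assumes that the content:encoded body is already trusted HTML that should be emitted unchanged.

```php
<?php
// Hypothetical helper: reduce a title to lowercase letters, digits and dashes
// so that it is safe to use inside an id attribute.
function slugify(string $title): string
{
    $slug = strtolower(trim($title));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug); // collapse other characters into "-"
    return trim($slug, '-');
}

// Hypothetical helper: build the article markup with the title escaped.
function renderArticle(string $titleString, string $contentString): string
{
    $id    = 'article-' . slugify($titleString);
    $title = htmlspecialchars($titleString, ENT_QUOTES, 'UTF-8');

    // Assumption: $contentString is trusted HTML from the export, so it is not escaped.
    return '<div class="article-container" id="' . $id . '">' . "\n"
         . '    <a href="#top" class="top-link">Back to the Top</a>' . "\n"
         . '    <h2>' . $title . '</h2>' . "\n"
         . '    <div class="articles">' . $contentString . '</div>' . "\n"
         . '</div>';
}

$html = renderArticle('Tom & Jerry: "Best of" 2011', '<p>Post body.</p>');
echo $html, "\n";
```

Note that this simple slug drops every non-ASCII character, so titles written entirely in another script would produce an empty slug; a post ID from the export would be a more robust choice in that case.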

