Scraping with PHP SimpleXML... I can grab images but not raw text?

Posted 2024-07-10 06:34:07

I'm trying to grab a specific bit of raw text from a web site. Using this site and other sources, I learned how to grab specific images using SimpleXML and XPath.

However the same approach doesn't appear to be working for grabbing raw text. Here's what's NOT working right now.

// first I set the xpath of the div that contains the text I want
$xpath = '//*[@id="storyCommentCountNumber"]';

// then I create a new DOM Document
$html = new DOMDocument();

// then I fetch the file and parse it (@ suppresses warnings).
@$html->loadHTMLFile($url);

// then convert DOM to SimpleXML
$xml = simplexml_import_dom($html);   

// run an XPath query on the div I want using the previously set xpath
$commcount = $xml->xpath($xpath);
print_r($commcount);

Now when I'm grabbing an image, that $commcount object would return an array that contains the image's source in it somewhere.

In this case, I want that object to return the raw text contained in the "storyCommentCountNumber" div. But that text doesn't appear to be contained in the object, just the name of the div.

What am I doing wrong? I can kind of see that this approach is only for grabbing HTML elements and the bits inside of them, not raw text. How do I get the text inside that div?

Thanks!

Comments (5)

家住魔仙堡 2024-07-17 06:34:07

One thing to note is that when you use print_r or var_dump on SimpleXML objects, you won't necessarily see the "text" of the object (or sometimes the attributes). To see everything, output the full XML string using $variable->asXML().

To get the text, cast the SimpleXML object to a string. This automatically pulls out the inner text.

 /* remember $commcount is always an array from the xpath */
 foreach($commcount as $str)
 {
     echo (string)$str;
 }

Hopefully the above can give you a start.
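
As a minimal, self-contained sketch of the string cast described above: the `$sample` markup below is a hypothetical stand-in for the real page (the original question loads it from `$url`), and the `$texts` array is only there to collect the results.

```php
<?php
// Hypothetical sample markup standing in for the page loaded from $url.
$sample = '<html><body><div id="storyCommentCountNumber">42</div></body></html>';

// Parse the HTML and hand the DOM over to SimpleXML, as in the question.
$html = new DOMDocument();
$html->loadHTML($sample);
$xml = simplexml_import_dom($html);

// xpath() returns an array of SimpleXMLElement objects.
$commcount = $xml->xpath('//*[@id="storyCommentCountNumber"]');

$texts = [];
foreach ($commcount as $node) {
    // Casting the element to string yields its text content.
    $texts[] = trim((string) $node);
}

echo implode("\n", $texts), "\n";
```

The `trim()` call is an addition for real pages, where the text is often surrounded by whitespace and newlines.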

谁的年少不轻狂 2024-07-17 06:34:07

Can you include a sample of the HTML (including maybe a few lines before and after the element you are selecting?) and the output from print_r()?

You might try the following to see if it helps you out:

if ( count($commcount) > 0 ) {
    $divContent = $commcount[0]->asXml();
    print $divContent;
}
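
For inspection purposes, a rough sketch of what `asXML()` gives back on a matched element might look like this. The `$sample` markup (with a nested `<b>` tag) is invented for illustration, and `strip_tags()` is one possible way to reduce the serialized markup to text, not something the answer above prescribes.

```php
<?php
// Hypothetical markup with a nested element inside the target div.
$sample = '<html><body><div id="storyCommentCountNumber"><b>7</b> comments</div></body></html>';

$html = new DOMDocument();
$html->loadHTML($sample);
$xml = simplexml_import_dom($html);

$matches = $xml->xpath('//*[@id="storyCommentCountNumber"]');

$markup = null;
$text = null;
if (count($matches) > 0) {
    // asXML() serializes the element, tags included, which is handy
    // for checking what the XPath query actually matched.
    $markup = $matches[0]->asXML();
    echo $markup, "\n";

    // strip_tags() removes the markup and keeps only the text.
    $text = trim(strip_tags($markup));
    echo $text, "\n";
}
```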

掀纱窥君容 2024-07-17 06:34:07

I know you are trying to use SimpleXML, but I would think that grabbing raw text would be easier with a regular expression.
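
A regular-expression version could be sketched roughly as follows. The pattern and the `$sample` string are assumptions made for illustration; the pattern expects a double-quoted `id` attribute and no nested `<div>` inside the target, so it is more fragile than a real parser.

```php
<?php
// Hypothetical fragment standing in for the fetched page source.
$sample = '<div id="storyCommentCountNumber">42</div>';

// Match the opening div carrying the id, then lazily capture everything
// up to the first closing </div>. The /s flag lets . span newlines.
$pattern = '/<div[^>]*\bid="storyCommentCountNumber"[^>]*>(.*?)<\/div>/s';

$count = null;
if (preg_match($pattern, $sample, $m) === 1) {
    $count = trim($m[1]);
}

echo $count, "\n";
```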

神回复 2024-07-17 06:34:07

Try checking this page out.

:)

嘦怹 2024-07-17 06:34:07

The raw text inside the div is not part of the div element itself, rather it is part of the first child node of the div element. There should be a text node within the div that contains the data you are looking for.
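
To make the text-node point concrete, here is a small sketch that stays in the DOM extension and selects the text node directly with `text()`. Using `DOMXPath` instead of SimpleXML is a substitution made for this example, and `$sample` is hypothetical markup.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$sample = '<html><body><div id="storyCommentCountNumber">42</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($sample);
$xpath = new DOMXPath($dom);

// Select the text node children of the div rather than the div itself.
$textNodes = $xpath->query('//*[@id="storyCommentCountNumber"]/text()');
$first = $textNodes->item(0);
$value = $first !== null ? trim($first->nodeValue) : null;
echo $value, "\n";

// Alternatively, textContent on the element gathers the text of all
// descendant nodes in one string.
$div = $xpath->query('//*[@id="storyCommentCountNumber"]')->item(0);
$all = $div !== null ? trim($div->textContent) : null;
echo $all, "\n";
```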
