Scraping with PHP and SimpleXML: I can grab images but not raw text?
I'm trying to grab a specific bit of raw text from a website. Using this site and other sources, I learned how to grab specific images using SimpleXML and XPath.
However, the same approach doesn't appear to work for grabbing raw text. Here's what's NOT working right now.
// first I set the xpath of the div that contains the text I want
$xpath = '//*[@id="storyCommentCountNumber"]';
// then I create a new DOM Document
$html = new DOMDocument();
// then I fetch the file and parse it (@ suppresses warnings).
@$html->loadHTMLFile($url);
// then convert DOM to SimpleXML
$xml = simplexml_import_dom($html);
// run an XPath query on the div I want using the previously set xpath
$commcount = $xml->xpath($xpath);
print_r($commcount);
Now, when I'm grabbing an image, that $commcount object is an array that contains the image's source somewhere in it.
In this case, I want that object to return the raw text contained in the "storyCommentCountNumber" div. But that text doesn't appear to be in the object, just the name of the div.
What am I doing wrong? It looks to me like this approach only grabs HTML elements and their attributes, not raw text. How do I get the text inside that div?
Thanks!
Comments (5)
One thing to note is that when you use print_r or var_dump on SimpleXML objects, you won't see the "text" of the object (or sometimes the attributes). To see everything, output the full XML string using $variable->asXML().
To get the text, cast the SimpleXML object to a string. This automatically pulls out the element's inner text.
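As a minimal sketch of both suggestions, the snippet below parses a small inline HTML string instead of fetching a page. The markup and the value 42 are made up for illustration; only the id storyCommentCountNumber and the XPath come from the question.

```php
<?php
// Hypothetical markup standing in for the real page (an assumption).
$markup = '<html><body><div id="storyCommentCountNumber">42</div></body></html>';

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);

$html = new DOMDocument();
$html->loadHTML($markup);
$xml = simplexml_import_dom($html);

$matches = $xml->xpath('//*[@id="storyCommentCountNumber"]');

$text = null;
if ($matches) {
    $node = $matches[0];

    // asXML() shows the whole element, including the text print_r() hides.
    echo $node->asXML(), "\n";

    // Casting to string returns the text directly inside the element.
    $text = trim((string) $node);
    echo $text, "\n";
}
```

Note that the string cast returns only the text directly inside the matched element, not the text of nested child elements.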
Hopefully the above can give you a start.
Can you include a sample of the HTML (maybe including a few lines before and after the element you are selecting) and the output from print_r()?
You might try the following to see if it helps you out:
I know you are trying to use SimpleXML, but I would think that grabbing raw text would be easier with a regular expression.
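A rough sketch of the regular-expression route might look like the following. The inline markup is hypothetical, and the pattern assumes the element holds plain text with no nested tags and that the id attribute uses double quotes; real pages can easily break those assumptions.

```php
<?php
// Hypothetical markup standing in for the fetched page (an assumption).
$markup = '<html><body><div id="storyCommentCountNumber">42</div></body></html>';

// Match any tag carrying the id, then capture everything up to the next "<".
$pattern = '/<[^>]*\bid="storyCommentCountNumber"[^>]*>([^<]*)</';

$count = null;
if (preg_match($pattern, $markup, $m) === 1) {
    $count = trim($m[1]);
}

var_dump($count);
```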
Try checking this page out.
:)
The raw text inside the div is not part of the div element itself; rather, it is held in a child text node, normally the div's first child. There should be a text node within the div that contains the data you are looking for.
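To illustrate the text-node point, here is a small sketch that stays in the DOM extension rather than converting to SimpleXML. The inline markup and the value 42 are assumptions for demonstration; only the id comes from the question.

```php
<?php
// Hypothetical markup standing in for the real page (an assumption).
$markup = '<html><body><div id="storyCommentCountNumber">42</div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);

$xp  = new DOMXPath($doc);
$div = $xp->query('//*[@id="storyCommentCountNumber"]')->item(0);

$isText = false;
$viaChild = null;
$viaContent = null;

if ($div !== null && $div->firstChild !== null) {
    // The div's first child is the text node holding the raw text.
    $isText     = $div->firstChild instanceof DOMText;
    $viaChild   = $div->firstChild->nodeValue;

    // textContent concatenates the text of every descendant node.
    $viaContent = $div->textContent;
}

// XPath can also select the text nodes directly with text().
$textNodes = $xp->query('//*[@id="storyCommentCountNumber"]/text()');

var_dump($isText, $viaChild, $viaContent, $textNodes->length);
```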