How can I get part of a page's HTML DOM with PHP?
I'm grabbing data from a published Google spreadsheet, and all I want is the information inside the content div (<div id="content">...</div>).
I know that the content starts with <div id="content"> and ends with </div><div id="footer">.
What's the best / most efficient way to grab the part of the DOM that is inside that div? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure it is that efficient...
header('Content-type: text/plain');
$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');
$start = '<div id="content">';
$end = '<div id="footer">';
$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);
echo $foo;
UPDATE
I guess my other question is whether it is simpler and easier to use a regex with start and end points than to parse a DOM that might contain errors and then extract the piece I need. Regex seems like the way to go, but I would love to hear your opinions.
Comments (3)
Try changing your regex to
$foo = preg_replace("#$start(.*?)$end#s",'$1',$foo);
The s modifier makes . match newlines as well. As it stands, your regex only matches if all the content between the tags is on a single line. If your HTML page is any more complex than that, regex probably won't cut it and you'd need to look into a parser like DOMDocument or Simple HTML DOM.
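As a rough illustration of the s modifier in action, here is a minimal sketch. It assumes the goal is to return only the captured part, so it uses preg_match rather than preg_replace (preg_replace would strip the two markers but leave the rest of the page in place). The helper name extractBetween and the $sample string are hypothetical stand-ins, not part of the original question.

```php
<?php
// Hypothetical helper: returns the text between two literal markers,
// or null when the markers are not found in that order.
function extractBetween(string $html, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters in the markers ('#' is the delimiter).
    // The trailing "s" modifier lets "." match newline characters too.
    $pattern = '#' . preg_quote($start, '#') . '(.*?)' . preg_quote($end, '#') . '#s';

    return preg_match($pattern, $html, $matches) === 1 ? $matches[1] : null;
}

// Stand-in for the fetched page; note the newlines inside the content div.
$sample = "<html><body><div id=\"header\">h</div><div id=\"content\">\n"
        . "<table><tr><td>A1</td></tr></table>\n"
        . "</div><div id=\"footer\">f</div></body></html>";

$inner = extractBetween($sample, '<div id="content">', '</div><div id="footer">');
echo $inner;
```

Without the s modifier the same pattern finds nothing in this sample, because the content spans several lines.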
If you have a lot of this to do, I would recommend taking a look at http://simplehtmldom.sourceforge.net
It's really good for this sort of thing.
Do not use regex; it can fail.
Use PHP's built-in DOM parser:
http://php.net/manual/en/class.domdocument.php
You can easily traverse the document and extract the relevant content.
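A minimal sketch of the DOMDocument approach might look like the following. The helper name innerHtmlById and the $sample markup are hypothetical; the sketch also assumes the id value contains no double quotes, since it is interpolated into an XPath expression. libxml_use_internal_errors(true) is used so that imperfect markup does not flood the output with warnings.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <div> with the
// given id, or null if there is no such element.
function innerHtmlById(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumption: $id contains no double quotes.
    $xpath = new DOMXPath($doc);
    $node = $xpath->query('//div[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child node to get the element's inner HTML.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Stand-in for the fetched page.
$sample = '<html><body><div id="content"><p>Hello</p><p>World</p></div>'
        . '<div id="footer">f</div></body></html>';

echo innerHtmlById($sample, 'content');
```

Unlike the marker-based regex, this does not depend on the footer div immediately following the content div, so it should keep working if the surrounding layout changes.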