Extracting the body of an HTML document using PHP
I know it's better to use DOM for this purpose but let's try to extract the text in this way:
<?php
$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;
preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
if (empty($matches))
    exit;
$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];
$index_of_body_end_tag = strpos($html, '</body>');
$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);
echo $body;
The result can be seen here: http://ideone.com/vH2FZ
As you can see, I am getting more text than expected.
There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:
$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
I don't see anything wrong with this formula.
Could somebody kindly suggest where the problem is?
Many thanks to you all.
EDIT:
Thank you very much to all of you. The bug was just in my head. After reading your answers, I now understand what the problem is. The length should be either:
$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));
Or:
$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
Comments (4)
The problem is that your string contains new lines, and by default . in the pattern does not match newline characters. You need to add the s modifier to make . match across lines.
Here is my solution, I prefer it this way.
Edit: I am updating my answer to give you a better explanation of why your code fails.
You have this string (the $html heredoc from your question).
Everything seems fine with it, but there are actually non-printing characters (line breaks) at the end of each line.
You have 55 printable characters and 6 line breaks (each line break is actually 2 characters here, \r\n).
When you reach this part of the code:
$index_of_body_end_tag = strpos($html, '</body>');
You get the correct position of </body> (starting at position 51), but this count includes the line breaks.
So when you reach this line of code:
$index_of_body_start_tag + strlen($matched_body_start_tag),
It is evaluated to 31 (line breaks included), and:
$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
It is evaluated to 51 - 25 + 6 = 32 (characters to read), but there are only 16 printable characters of text between <body> and </body>, plus 4 non-printable characters (the line break after <body> and the line break before </body>). And here is the problem: you have to group the calculation (prioritize) like so:
$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
which evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).
:) Hope this helps you understand why grouping matters. (Sorry for misleading you about newlines; that point is only relevant to the regex example I gave above.)
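The arithmetic above can be checked with a short script. This is only a sketch: it assumes the document uses "\r\n" line endings, which is what the quoted positions 25 and 51 imply, and the short variable names ($tag, $start, $end, $wrong_length, $right_length) are illustrative rather than taken from the question.

```php
<?php
// Assumption: "\r\n" line endings, matching the positions 25 and 51 quoted above.
// With "\n" endings the positions differ, but the reasoning is the same.
$html = implode("\r\n", [
    '<html>',
    '<head>',
    '</head>',
    '<body>',
    '<p>Some text</p>',
    '</body>',
    '</html>',
]);

preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
$tag   = $matches[0][0];            // the matched opening tag, "<body>"
$start = $matches[0][1];            // offset of "<body>"
$end   = strpos($html, '</body>');  // offset of "</body>"

// Without parentheses the tag length is added instead of subtracted.
$wrong_length = $end - $start + strlen($tag);
// Grouped: distance from the end of "<body>" to the start of "</body>".
$right_length = $end - ($start + strlen($tag));

$body = substr($html, $start + strlen($tag), $right_length);
var_dump($start, $end, $wrong_length, $right_length, $body);
```

Under the stated assumption, the ungrouped length overshoots the grouped one by exactly twice the length of the opening tag, which is why the original output ran past </body>.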
Personally, I wouldn't use regex.
returns
<h1>foobar</h1>
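One regex-free way to get a result like the one above is PHP's DOM extension. The sketch below is an assumption about what such an approach might look like, not the answerer's original code; the helper name body_inner_html and the sample input are hypothetical.

```php
<?php
// Hypothetical helper (not from the original answer):
// returns the inner HTML of the first <body> element, or '' if there is none.
function body_inner_html(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> and concatenate the results.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

echo body_inner_html('<html><head></head><body><h1>foobar</h1></body></html>');
```

The parser handles attributes on the body tag and uneven whitespace, which the offset arithmetic has to account for by hand.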
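The problem is in your substr computation of the length (the third argument). You should subtract all the way:
$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
But you are doing:
$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
That said, it seems a little overkill, considering you can do it using preg_match only. You just need to make sure you match across new lines, using the s modifier.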
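A minimal sketch of the preg_match-only approach might look like the following. The pattern is an assumption chosen for illustration, not the answerer's original: it captures everything between the opening and closing body tags, and the s modifier is what lets . match line breaks.

```php
<?php
$html = <<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;

// Illustrative pattern (an assumption, not the original answer's code):
//   <body[^>]*>  the opening tag, with any attributes
//   (.*?)        lazily captures everything up to the first </body>
//   s            lets . match newlines; i ignores tag-name case
if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $matches)) {
    $body = $matches[1];
    echo $body;
}
```

The captured group still includes the line breaks right after the opening tag and before the closing tag; wrap it in trim() if those are unwanted.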
Somebody has probably already found your error; I didn't read all the replies.
The algebra is wrong.
code is here
Btw, this is my first time seeing ideone.com, and that's pretty cool.
or ..