Extracting the body of an HTML document with PHP

Posted on 2024-10-16 05:17:27


I know it's better to use DOM for this purpose but let's try to extract the text in this way:

<?php

$html = <<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;

preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

if (empty($matches)) {
    exit;
}

$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];

$index_of_body_end_tag = strpos($html, '</body>');

// The third argument is the length formula this question is about.
$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);

echo $body;

The result can be seen here: http://ideone.com/vH2FZ

As you can see, I am getting more text than expected.

There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don't see anything wrong with this formula.

Could somebody kindly suggest where the problem is?

Many thanks to you all.

EDIT:

Thank you very, very much to all of you. There was just a bug in my brain. After reading your answers, I now understand what the problem is. The length should be either:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

Or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
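To make the corrected arithmetic concrete, here is a minimal sketch of the same script with the parenthesized length. The names $start and $length are introduced here purely for illustration, and the search for </body> is started after the opening tag, which is an extra safety assumption not present in the original code.

```php
<?php

$html = <<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;

// Locate the opening <body ...> tag together with its byte offset.
if (!preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE)) {
    exit('No <body> tag found');
}

$matched_body_start_tag  = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];

// Offset of the first character after the opening tag.
$start = $index_of_body_start_tag + strlen($matched_body_start_tag);

// Look for </body> only after the opening tag (illustrative safety check).
$index_of_body_end_tag = strpos($html, '</body>', $start);
if ($index_of_body_end_tag === false) {
    exit('No </body> tag found');
}

// substr() expects a length, not an end index: end offset minus start offset.
$length = $index_of_body_end_tag - $start;

$body = substr($html, $start, $length);

// trim() drops the line breaks that surround the paragraph.
echo trim($body);
```

Run as-is, this prints <p>Some text</p>. Computing $start once and reusing it for both arguments is what makes the sign error hard to repeat.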


初心 2024-10-23 05:17:27


The problem is that your string contains newlines, and by default . in the pattern does not match newline characters. You need to add the s modifier so that . matches across lines.

Here is my solution; I prefer it this way.

<?php

$html=<<<EOD
<html>
<head>
</head>
<body buu="grger"     ga="Gag">
<p>Some text</p>
</body>
</html>
EOD;

// Get anything between <body> and </body>, where <body can="have_as many" attributes="as required">
if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
    $body = $matches[1];
}
// Output all matches for debugging purposes
var_dump($matches);
?>

Edit: I am updating my answer to give you a better explanation of why your code fails.

You have this string:

<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>

Everything seems fine with it, but there are also non-printing newline characters at the end of each line.
You have 55 printable characters and 6 line breaks (each line break is actually 2 characters, \r\n, in this case).

When you reach this part of the code:

$index_of_body_end_tag = strpos($html, '</body>');

You get the correct position of </body> (starting at position 51), but this count includes the newline characters.

So when you reach this line of code:

$index_of_body_start_tag + strlen($matched_body_start_tag)

It evaluates to 31 (newlines included), and:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

evaluates to 51 - 25 + 6 = 32 (the number of characters that will be read), but there are only 16 printable characters of text between <body> and </body>, plus 4 non-printing characters (the line break after <body> and the line break before </body>). And here is the problem: you have to group the calculation so that the addition happens first, like so:

$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))

This evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).

:) Hope this helps you understand why grouping matters. (Sorry for misleading you about newlines; that point applies only to the regex example I gave above.)
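The offsets 25 and 51 used in this explanation correspond to \r\n line breaks; with plain \n the offsets are smaller, but the ungrouped formula overshoots by the same amount either way. The sketch below prints both calculations for both line-ending styles. The helper name body_offsets is hypothetical, and it assumes a bare <body> tag with no attributes.

```php
<?php

// Hypothetical helper: returns the tag offsets and both length formulas.
function body_offsets(string $html): array
{
    $tag   = '<body>';
    $open  = strpos($html, $tag);
    $close = strpos($html, '</body>');

    return [
        'open'    => $open,
        'close'   => $close,
        'wrong'   => $close - $open + strlen($tag),    // formula from the question
        'correct' => $close - ($open + strlen($tag)),  // grouped version
    ];
}

$lines = [
    '<html>', '<head>', '</head>', '<body>',
    '<p>Some text</p>', '</body>', '</html>',
];

// Build the same document once with LF and once with CRLF line breaks.
foreach (["\n" => 'LF', "\r\n" => 'CRLF'] as $eol => $label) {
    $r = body_offsets(implode($eol, $lines));
    printf(
        "%-4s open=%d close=%d wrong=%d correct=%d\n",
        $label, $r['open'], $r['close'], $r['wrong'], $r['correct']
    );
}
```

With CRLF this reports open=25, close=51, wrong=32, correct=20, matching the numbers above; with LF it reports open=22, close=46, wrong=30, correct=18. In both cases the wrong length exceeds the correct one by twice the length of the opening tag (12 characters).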


桃扇骨 2024-10-23 05:17:27

Personally, I wouldn't use regex.

<?php

$html = <<<EOD

<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>foobar</h1>
    </body>
</html>

EOD;

$s = strpos($html, '<body>') + strlen('<body>');
$f = '</body>';

echo trim(substr($html, $s, strpos($html, $f) - $s));

?>

This outputs <h1>foobar</h1>.
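Note that the snippet above searches for the literal string <body>, so it would miss an opening tag that carries attributes. Here is a minimal sketch of one way to relax that while still avoiding regex. The function name extract_body is hypothetical, and the code assumes no > character appears inside an attribute value.

```php
<?php

// Hypothetical helper: returns the text between <body ...> and </body>,
// or null when either tag is missing.
function extract_body(string $html): ?string
{
    // Case-insensitive search for the start of the opening tag.
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }

    // The opening tag ends at the next '>' (assumes no '>' inside attributes).
    $tagEnd = strpos($html, '>', $open);
    if ($tagEnd === false) {
        return null;
    }
    $start = $tagEnd + 1;

    // Only look for the closing tag after the opening tag.
    $close = stripos($html, '</body>', $start);
    if ($close === false) {
        return null;
    }

    // Length is end offset minus start offset.
    return trim(substr($html, $start, $close - $start));
}

$html = <<<EOD
<html>
    <head>
        <title>Example</title>
    </head>
    <body class="home" id="top">
        <h1>foobar</h1>
    </body>
</html>
EOD;

echo extract_body($html);
```

For this input the sketch prints <h1>foobar</h1>, the same result as the original snippet gives for a bare <body> tag.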


仅冇旳回忆 2024-10-23 05:17:27

The problem is in your computation of the length passed to substr. You should subtract all the way:

$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)

But you are doing:

+ strlen($matched_body_start_tag)

That said, this seems like overkill, considering you can do it with preg_match alone. You just need to make sure you match across newlines, using the s modifier:

preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $matches);
echo $matches[1];

Outputs:

<p>Some text</p>
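To see why the s modifier matters here, the following sketch runs the same pattern with and without it on the document from the question. The variable names $without, $with, $m1 and $m2 are introduced only for this illustration.

```php
<?php

$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";

// Without the s modifier, "." does not match "\n", so the pattern
// cannot span the line breaks between <body> and </body>.
$without = preg_match('/<body[^>]*>(.*?)<\/body>/', $html, $m1);

// With the s modifier (PCRE_DOTALL), "." also matches newlines.
$with = preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $m2);

var_dump($without);        // int(0)
var_dump($with);           // int(1)
echo trim($m2[1]), "\n";   // <p>Some text</p>
```

The captured group itself still contains the line breaks on either side of the paragraph, which is why the sketch trims it before printing.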


凤舞天涯 2024-10-23 05:17:27

Somebody has probably already found your error; I didn't read all the replies.
The algebra is wrong.

Btw, this is my first time seeing ideone.com; that's pretty cool.

The code is here:

$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
);

Or:

$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
);
