PHP scraper - regular expressions

Posted on 2025-01-08 05:09:23

I'm trying to follow a tutorial for web scraping with PHP.

I understand roughly what's going on, but I don't get how to filter what has been scraped to get exactly what I want. For example:

<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>

I see that (.*) will retrieve everything between the title tags. Can I use regular expressions to get specific info? Say the title contained Welcome visitor #100: how would I get the number that comes after the hash?

Or do I have to retrieve everything between the tags and then manipulate it later?

紅太極 2025-01-15 05:09:23

Given the title "Welcome visitor #100" and the fact that a <title> tag occurs no more than once, the expression should be:

preg_match('~<title>Welcome visitor #(\d+)</title>~', ...);

A lot of people on SO would argue to never use regular expressions to parse (X)HTML; for this task, however, the above should suffice.
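A minimal sketch of that pattern in use. The inline `$file_string` is a hypothetical stand-in for the scraped page, and `$visitor_number` is a name introduced here for illustration:

```php
<?php
// Hypothetical page content standing in for file_get_contents('page_to_scrape.html').
$file_string = '<html><head><title>Welcome visitor #100</title></head><body></body></html>';

// Capture only the digits that follow the hash inside the title.
if (preg_match('~<title>Welcome visitor #(\d+)</title>~', $file_string, $m)) {
    $visitor_number = (int) $m[1];
} else {
    $visitor_number = null; // title missing or in a different format
}

var_dump($visitor_number); // int(100)
```

Checking the return value of preg_match before reading `$m[1]` avoids an undefined-index notice when the title does not have the expected shape.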

Although - as mentioned before - a <title> tag (should) occur no more than once, the pattern

<title>(.*)</title>

would also match this:

<title>Welcome visitor <title>#<title>100blafoobar</title>

The (.*) part is what allows this. As soon as the page you're scraping your data from changes, the regex might stop working.
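To make the difference concrete, here is a small sketch that runs both patterns against the malformed title above. The variable names are assumptions made for the example:

```php
<?php
// The malformed markup from the example above.
$malformed = '<title>Welcome visitor <title>#<title>100blafoobar</title>';

// The loose pattern accepts it and captures the stray tags too.
preg_match('~<title>(.*)</title>~', $malformed, $loose);
echo $loose[1], "\n"; // Welcome visitor <title>#<title>100blafoobar

// The specific pattern rejects it: preg_match returns 0 (no match).
$strict = preg_match('~<title>Welcome visitor #(\d+)</title>~', $malformed, $m);
var_dump($strict); // int(0)
```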

EDIT: A method to correctly sift out multiple elements and their attributes:

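$dom = new DOMDocument;
$dom->loadHTML($page_content); // $page_content holds the fetched HTML

// Collect every <a> element in the document
$elements = $dom->getElementsByTagName('a');

for ($n = 0; $n < $elements->length; $n++) {
    $item = $elements->item($n);
    $href = $item->getAttribute('href'); // empty string if the attribute is missing
}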
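The same DOM approach can be applied to the title from the question. The following is only a sketch: it assumes the DOM extension is available, uses a hypothetical inline `$page_content`, and silences libxml warnings because real pages are often not valid HTML:

```php
<?php
// Hypothetical page content; in practice this would come from file_get_contents().
$page_content = '<html><head><title>Welcome visitor #100</title></head><body><p>Hi</p></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($page_content);
libxml_clear_errors();

// Take the text of the first <title> element, if there is one.
$titles = $dom->getElementsByTagName('title');
$title_text = $titles->length > 0 ? $titles->item(0)->textContent : '';

// Only now apply a small regex to the plain title text.
$visitor_number = preg_match('/#(\d+)/', $title_text, $m) ? (int) $m[1] : null;

var_dump($title_text);     // string(20) "Welcome visitor #100"
var_dump($visitor_number); // int(100)
```

Letting the parser find the element and keeping the regex for the plain text means the pattern no longer has to cope with tags at all.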

夜司空 2025-01-15 05:09:23

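You simply need to change your regex to match whatever you need. If you are going to use the title more than once, it may be better to save the whole thing and manipulate it later; otherwise, just grab what you need.

/<title>.*((?<=#)\d*).*<\/title>/i

will specifically match the digits after a hash. It will not match digits that are not preceded by a hash.

There are many ways to write a regular expression; it depends on how generic or specific you want it to be.

You could also write it like this to get the first run of digits, whatever precedes it:

/<title>.*?(\d+).*<\/title>/i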
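A short sketch of both ideas, the lookbehind on the hash and the "any digits" variant. The sample titles are hypothetical, and the patterns here are anchored on <title> and use \d+ so that at least one digit is required:

```php
<?php
$with_hash_title = '<head><title>Welcome visitor #100</title></head>';
$no_hash_title   = '<head><title>Room 42, visitor 7</title></head>';

// (?<=#) is a lookbehind: the captured digits must directly follow a hash.
$hash_pattern = '/<title>.*?(?<=#)(\d+).*?<\/title>/i';
preg_match($hash_pattern, $with_hash_title, $after_hash);
echo $after_hash[1], "\n"; // 100

// No hash in the title, so the lookbehind pattern does not match at all.
$matched = preg_match($hash_pattern, $no_hash_title);
var_dump($matched); // int(0)

// Without the lookbehind, the first run of digits in the title is captured.
preg_match('/<title>.*?(\d+).*?<\/title>/i', $no_hash_title, $any_digits);
echo $any_digits[1], "\n"; // 42
```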

°如果伤别离去 2025-01-15 05:09:23

I would first fetch the title tag and then process the title further. The other answers contain perfectly valid solutions for this task.

Some further notes:

Please use DOMDocument for such things, since it is much safer (your regular expression might break on some specific HTML pages)

Please use the non-greedy version of .*: .*?, otherwise you will run into funny things like:

<html>
    <head>
        <title>a</title>
    </head>
    <body>
        <title>test</title> <!-- not allowed in HTML, but since when do web pages actually care about that? -->
    </body>
</html>

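You will now match everything from the a after the first <title> up to the test before the last </title>, including all the markup in between. (This assumes the regex can cross line breaks, i.e. the markup is on a single line or the s modifier is used, since . does not match a newline by default.)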
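A small sketch of the greedy versus non-greedy behaviour, assuming the markup above collapsed onto a single line so that . can reach the second title:

```php
<?php
// Invalid but possible markup: a second <title> inside <body>, all on one line.
$html = '<html><head><title>a</title></head><body><title>test</title></body></html>';

// Greedy: .* runs to the LAST </title> in the string.
preg_match('/<title>(.*)<\/title>/i', $html, $greedy);
echo $greedy[1], "\n"; // a</title></head><body><title>test

// Non-greedy: .*? stops at the FIRST </title>.
preg_match('/<title>(.*?)<\/title>/i', $html, $lazy);
echo $lazy[1], "\n"; // a
```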
