PHP scraper - regular expressions

Posted on 2025-01-08 05:09:23

I'm trying to follow a tutorial for web scraping with PHP.

I understand roughly what's going on, but I don't get how to filter what has been scraped to get exactly what I want. For example:

<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>

I see that the (.*) will retrieve everything in between the title tags. Can I use regular expressions to get specific info? Say the title contained Welcome visitor #100: how would I get the number that comes after the hash?

Or do I have to retrieve everything between the tags then manipulate it later?

Comments (3)

紅太極 2025-01-15 05:09:23

Given the title "Welcome visitor #100" and the fact that a <title> tag occurs no more than once, the expression should be:

preg_match('~<title>Welcome visitor #(\d+)</title>~', ...);

A lot of people on SO would argue to never use regular expressions to parse (X)HTML; for this task, however, the above should suffice.
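As a minimal sketch of how that call might be completed, assuming the page has already been read into a string (the sample `$html` and the variable names are made up for illustration):

```php
<?php
// Hypothetical sample input standing in for the scraped page.
$html = '<html><head><title>Welcome visitor #100</title></head><body></body></html>';

// Capture one or more digits that follow the hash inside the title.
if (preg_match('~<title>Welcome visitor #(\d+)</title>~', $html, $matches) === 1) {
    $visitor_number = (int) $matches[1];
} else {
    // No match: the title does not have the expected wording.
    $visitor_number = null;
}

var_dump($visitor_number); // int(100)
```

Using ~ as the delimiter means the slash in </title> does not need escaping.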

Although - as mentioned before - a <title> tag (should) occur no more than once, the pattern

<title>(.*)</title>

would also match this:

<title>Welcome visitor <title>#<title>100blafoobar</title>

The (.*) is the part that allows this. Also keep in mind that as soon as the page you're scraping your data from changes, the regex might stop working.
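To make the difference concrete, here is a small sketch that runs the permissive pattern and the specific one against that malformed input (variable names are illustrative only):

```php
<?php
// Malformed input with stray opening <title> tags.
$html = '<title>Welcome visitor <title>#<title>100blafoobar</title>';

// The permissive pattern accepts it and captures the stray tags too.
$loose_matched = preg_match('~<title>(.*)</title>~', $html, $loose);

// The specific pattern rejects it, because the text between the tags
// is not exactly "Welcome visitor #" followed by digits.
$strict_matched = preg_match('~<title>Welcome visitor #(\d+)</title>~', $html, $strict);

var_dump($loose_matched, $loose[1] ?? null, $strict_matched);
```

The permissive pattern reports a match whose capture still contains the extra tags, while the specific pattern reports no match at all.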


EDIT: A method to correctly sift out multiple elements and their attributes:

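// $page_content is assumed to hold the HTML of the page as a string.
$dom = new DOMDocument;
$dom->loadHTML($page_content);

// Collect every <a> element in the document.
$elements = $dom->getElementsByTagName('a');

for ($n = 0; $n < $elements->length; $n++) {
    $item = $elements->item($n);
    // $href holds the link target of the current element.
    $href = $item->getAttribute('href');
}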
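
Applied to the original question, a sketch along the same lines could read the title through DOMDocument and then pick the number out of the plain text. The sample markup and the names `$title_text` and `$visitor_number` are assumptions for illustration:

```php
<?php
// Hypothetical sample input; in practice this would be the fetched page.
$page_content = '<html><head><title>Welcome visitor #100</title></head>'
              . '<body><a href="/home">Home</a></body></html>';

// Collect parser warnings instead of printing them for imperfect markup.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($page_content);
libxml_clear_errors();

// Read the text of the first <title> element, if there is one.
$titles = $dom->getElementsByTagName('title');
$title_text = $titles->length > 0 ? $titles->item(0)->textContent : '';

// Pull the digits after the hash out of the plain title text.
$visitor_number = preg_match('/#(\d+)/', $title_text, $m) === 1 ? (int) $m[1] : null;

var_dump($title_text, $visitor_number);
```

Here the regex only has to deal with the title text, not with the surrounding markup.
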
夜司空 2025-01-15 05:09:23

You would just need to change the regex to match whatever you need. If you are going to use the title more than once, it's better to save the whole title and manipulate it later; otherwise just get what you need.

/<title>.*((?<=#)\d*).*<\/title>/i

This specifically matches a number after a hash. It does not match a number without a hash.
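
A quick way to check this behaviour is to run the pattern against one title with a hash and one without. The two sample titles are made up for illustration:

```php
<?php
$pattern = '/<title>.*((?<=#)\d*).*<\/title>/i';

// Hypothetical sample titles, one with a hash and one without.
$with_hash = '<title>Welcome visitor #100</title>';
$without_hash = '<title>Welcome visitor 100</title>';

// The lookbehind (?<=#) only lets the group start right after a '#'.
$matched_with = preg_match($pattern, $with_hash, $m1);
$matched_without = preg_match($pattern, $without_hash, $m2);

var_dump($matched_with, $m1[1] ?? null, $matched_without);
```

Because \d* also accepts zero digits, a title with a hash that is not followed by a number would still match with an empty capture; \d+ would be the stricter choice if that matters.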

There are many ways to write a regex; it depends on how general or specific you want to be.

You could also write it like this to get any number:

/<title>.*?(\d+).*<\/title>/i
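
A detail worth checking with patterns of this kind is where a greedy .* ends up. The sketch below, on a made-up title, compares a greedy .* placed before an optional (\d)* group with a lazy .*? placed before a required (\d+):

```php
<?php
// Hypothetical sample title.
$html = '<title>Welcome visitor #100</title>';

// Greedy .* followed by an optional group: the match succeeds,
// but .* has already consumed the digits, so group 1 captures nothing.
preg_match('/<title>.*(\d)*.*<\/title>/i', $html, $greedy);

// Lazy .*? followed by a required \d+: the group captures the whole number.
preg_match('/<title>.*?(\d+).*<\/title>/i', $html, $lazy);

var_dump($greedy[1] ?? null, $lazy[1] ?? null);
```

Both calls report a match, but only the second one leaves the number in its capture group.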

°如果伤别离去 2025-01-15 05:09:23

I would first fetch the title tag and then process the title further. The other answers contain perfectly valid solutions for this task.

Some further notes:

  • Please use DOMDocument for such things, since it is much safer (your regular expression might break on some specific HTML pages)
  • Please use the non-greedy version of .*: .*?, otherwise you will run into funny things like:

    <html>
        <head>
            <title>a</title>
        </head>
        <body>
            <title>test</title> <!-- not allowed in HTML, but since when do web pages actually care about that? -->
        </body>
    </html>
    

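With a greedy .*, the match can run from the first <title> to the last </title>, so the capture contains a</title> ... <title>test and everything in between. Across line breaks this only happens if the pattern uses the s modifier, since . does not match a newline by default.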
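
The following sketch shows the effect on a page like the one above. The sample string and the use of the s modifier are assumptions made so that the pattern can span several lines:

```php
<?php
// Hypothetical page with a second, invalid <title> in the body.
$html = "<html>\n<head>\n<title>a</title>\n</head>\n"
      . "<body>\n<title>test</title>\n</body>\n</html>";

// The s modifier lets . match newlines, so the greedy .* can run
// from the first <title> all the way to the last </title>.
preg_match('~<title>(.*)</title>~is', $html, $greedy);

// The non-greedy .*? stops at the first closing tag it can reach.
preg_match('~<title>(.*?)</title>~is', $html, $lazy);

var_dump($greedy[1], $lazy[1]);
```

The greedy capture contains the first title, the markup in between and the second title, while the non-greedy capture is just "a".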
