PHP scraper - regular expressions
I'm trying to follow a tutorial for web scraping with PHP.
I understand roughly what's going on, but I don't get how to filter what has been scraped to get exactly what I want. For example:
<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>
I see that the (.*) will retrieve everything between the title tags. Can I use regular expressions to get specific info? Say the title contained "Welcome visitor #100": how would I get the number that comes after the hash?
Or do I have to retrieve everything between the tags and then manipulate it later?
Given the title "Welcome visitor #100" and the fact that a <title> tag occurs no more than once, the expression should be:
A lot of people on SO would argue that you should never use regular expressions to parse (X)HTML; for this task, however, the above should suffice.
Although, as mentioned before, a <title> tag (should) occur no more than once, the pattern would also match this:
(.*) being the part that allows this. As soon as the page you're scraping your data from changes, the regex might stop working.
EDIT: A method to correctly sift out multiple elements and their attributes:
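As a rough sketch of what a parser-based approach could look like (an assumption for illustration, not necessarily the code this answer had in mind), PHP's DOM extension can read elements and attributes without a regex over the markup. The sample HTML and all variable names below are made up for the example.

```php
<?php
// Hypothetical sample page; in practice this would come from file_get_contents().
$html = '<html><head><title>Welcome visitor #100</title></head>'
      . '<body><a href="/a" class="nav">First</a><a href="/b">Second</a></body></html>';

// Collect libxml warnings internally, since real-world HTML is rarely perfect.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The first <title> element, if there is one.
$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode !== null ? $titleNode->textContent : '';

// Every link together with its href attribute.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = ['text' => $a->textContent, 'href' => $a->getAttribute('href')];
}

// With the plain title text in hand, a small regex is enough for the number.
$number = preg_match('/#(\d+)/', $title, $m) ? (int) $m[1] : null;
```

Here the regex only ever sees the extracted title text, so changes to the surrounding markup are less likely to break it.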
You would just need to change the regex to match whatever you need. If you are going to use the title more than once, it's better to save the whole title and manipulate it later; otherwise just get what you need.
/<title>.*((?<=#)\d*).*<\/title>/i
This would specifically match a number after a hash. It would not match a number without a hash.
There are many ways to write a regex; it depends on how general or specific you want to be.
You could also write it like this to get any number:
/<title>.*?(\d+).*<\/title>/i
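To make the two options concrete, here is a small sketch that extracts the number either in a single pattern or by saving the title first and processing it afterwards. The `$html` string is an assumption standing in for the scraped page, and the patterns are one possible way to write this rather than the only one.

```php
<?php
// Sample input standing in for the scraped page (an assumption for this sketch).
$html = '<html><head><title>Welcome visitor #100</title></head></html>';

// Option 1: capture the digits that directly follow a hash inside <title>.
// [^<]* keeps the match inside the tag instead of running across the document.
$visitor = null;
if (preg_match('/<title>[^<]*#(\d+)[^<]*<\/title>/i', $html, $matches)) {
    $visitor = (int) $matches[1];
}

// Option 2: save the whole title first, then pick the number out of it.
$titleNumber = null;
if (preg_match('/<title>(.*?)<\/title>/is', $html, $t)
    && preg_match('/#(\d+)/', $t[1], $n)) {
    $titleNumber = (int) $n[1];
}
```

Option 2 costs a second `preg_match` call but leaves the full title in `$t[1]` for any later use.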
I would first fetch the title tag and then process the title further. The other answers contain perfectly valid solutions for this task.
Some further notes:
Please use the non-greedy version of .*, namely .*?, otherwise you will run into funny things like:
You will now match everything from <title>a</title>... up to <title>test</title>, including everything in between.
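The difference is easy to see on a small input. The snippet below is a minimal sketch with a made-up string containing two title elements; only the quantifier differs between the two calls.

```php
<?php
// Hypothetical input with two <title> elements, to show the difference.
$html = '<title>a</title><p>unrelated</p><title>test</title>';

// Greedy: .* runs on to the LAST </title> in the string.
preg_match('/<title>(.*)<\/title>/i', $html, $greedy);
$greedyTitle = $greedy[1]; // 'a</title><p>unrelated</p><title>test'

// Non-greedy: .*? stops at the FIRST </title>.
preg_match('/<title>(.*?)<\/title>/i', $html, $lazy);
$lazyTitle = $lazy[1]; // 'a'
```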