如何“刮”?来自页面源的内容?

发布于 2024-12-03 08:00:10 字数 1435 浏览 0 评论 0原文

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

傲世九天 2024-12-10 08:00:10

我尝试使用简单的 HTML DOM PHP 库抓取多个网站,该库可以在此处获取:http://simplehtmldom.sourceforge.net/

然后使用这样的代码:

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach($html->find('h2') as $heading) { //for each heading
        //find all spans with a inside then echo the found text out
        echo preg_replace($pat, $rep, $heading->find('span a', 0)->plaintext) . "\n"; 
}
?>

这会导致类似的结果:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents

I have tried scraping multiple sites using the simple HTML DOM PHP library, which can be obtained here: http://simplehtmldom.sourceforge.net/

Then using code like this:

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach($html->find('h2') as $heading) { //for each heading
        //find all spans with a inside then echo the found text out
        echo preg_replace($pat, $rep, $heading->find('span a', 0)->plaintext) . "\n"; 
}
?>

This results in something like:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents
夜访吸血鬼 2024-12-10 08:00:10

这不是最好的解决方案,但它有效:

$page = file_get_contents('http://example.com/page.html');
preg_match_all('#<strong>([^<]+)</strong><br />\s*([^<]+)<#', $page, 
                                             $result, PREG_SET_ORDER);
foreach ($result as $row) {
    echo "<p><b>$row[1]</b> $row[2]</p>\n";
}

如果需要转义更复杂的内容,请考虑 DOM文档

This isn't the best solution, but it works:

$page = file_get_contents('http://example.com/page.html');
preg_match_all('#<strong>([^<]+)</strong><br />\s*([^<]+)<#', $page, 
                                             $result, PREG_SET_ORDER);
foreach ($result as $row) {
    echo "<p><b>$row[1]</b> $row[2]</p>\n";
}

If need to scape something more complex, consider DOMDocument.

南汐寒笙箫 2024-12-10 08:00:10

您可以使用正则表达式

编辑

正则表达式并不是解决大问题的最佳解决方案,但对于具有标准格式的简单页面,正则表达式通常最容易使用。

You can use Regular Expressions.

Edit

Regex isn't the best solution for large problems, but for simple pages with a standard format, regex is often simplest to use.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文