How do I "scrape" content from a page's source?

Posted on 2024-12-03 08:00:10

傲世九天 2024-12-10 08:00:10

I have tried scraping multiple sites using the Simple HTML DOM PHP library, which is available here: http://simplehtmldom.sourceforge.net/

I then use code like this:

<?php
include_once 'simple_html_dom.php';

$url = 'http://slashdot.org/';
$html = file_get_html($url);
if (!$html) {
    exit("Could not load $url\n");
}

// Patterns that trim leading/trailing whitespace and collapse inner runs to one space
$pat = ['/^\s+/', '/\s{2,}/', '/\s+$/'];
$rep = ['', ' ', ''];

foreach ($html->find('h2') as $heading) {
    // First <a> nested inside a <span> within this heading, or null if absent
    $link = $heading->find('span a', 0);
    if ($link !== null) {
        echo preg_replace($pat, $rep, $link->plaintext) . "\n";
    }
}
?>

This results in something like:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents

夜访吸血鬼 2024-12-10 08:00:10

This isn't the best solution, but it works:

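$page = file_get_contents('http://example.com/page.html');
if ($page === false) {
    exit("Could not fetch the page\n");
}
// Capture each <strong>label</strong><br /> pair and the text that follows it
preg_match_all('#<strong>([^<]+)</strong><br />\s*([^<]+)<#', $page, $result, PREG_SET_ORDER);
foreach ($result as $row) {
    // $row[1] is the bold label, $row[2] the text after the line break
    echo "<p><b>$row[1]</b> $row[2]</p>\n";
}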

If you need to scrape something more complex, consider DOMDocument.
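For comparison, here is a rough sketch of the same extraction done with DOMDocument and DOMXPath instead of a regex. The sample markup and the function name extractPairs are made up for illustration; they only assume the page has the same <strong>label</strong><br />text shape that the regex above expects.

```php
<?php
// Hypothetical sample markup, shaped like what the regex above targets.
$page = <<<HTML
<div>
  <strong>Title one</strong><br />
  First description
  <strong>Title two</strong><br />
  Second description
</div>
HTML;

// Hypothetical helper: returns one ['label' => ..., 'text' => ...] entry per <strong>.
function extractPairs(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pairs = [];
    foreach ($xpath->query('//strong') as $strong) {
        // First text node after the <br> that follows this <strong> element
        $text = $xpath->query(
            'following-sibling::br[1]/following-sibling::text()[1]',
            $strong
        )->item(0);
        $pairs[] = [
            'label' => trim($strong->textContent),
            'text'  => $text !== null ? trim($text->nodeValue) : '',
        ];
    }
    return $pairs;
}

foreach (extractPairs($page) as $pair) {
    echo "{$pair['label']}: {$pair['text']}\n";
}
```

For a live page, the string returned by file_get_contents could be passed to extractPairs in place of the sample. Unlike the regex, the XPath query does not depend on the exact spacing or attribute order of the tags.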

南汐寒笙箫 2024-12-10 08:00:10

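You can use regular expressions.

Edit

Regular expressions are not the best solution for big problems, but for simple pages with a standard format they are usually the easiest tool to use.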
