<?php
include_once 'simple_html_dom.php';

$url = 'http://slashdot.org/';
$html = file_get_html($url);
if (!$html) {
    // file_get_html() returns false when the page cannot be loaded
    exit("Could not load $url\n");
}

// Strip leading whitespace, collapse internal runs, strip trailing whitespace
$pat = array('/^\s+/', '/\s{2,}/', '/\s+$/');
$rep = array('', ' ', '');

foreach ($html->find('h2') as $heading) { // for each heading
    // Take the first link nested in a span; skip headings that have none
    $link = $heading->find('span a', 0);
    if ($link === null) {
        continue;
    }
    echo preg_replace($pat, $rep, $link->plaintext) . "\n";
}
?>
This results in something like:
5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents
I have tried scraping multiple sites using the simple HTML DOM PHP library, which can be obtained here: http://simplehtmldom.sourceforge.net/
Then using code like this:
This results in something like:
This isn't the best solution, but it works:
If you need to scrape something more complex, consider DOMDocument.
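As a rough sketch of what the DOMDocument route might look like, the snippet below parses a small hard-coded HTML string and pulls out the link text under each h2 with DOMXPath. The sample markup, the function name extractHeadlines, and the XPath expression are illustrative assumptions, not taken from this answer; a real page would need its own selector.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$markup = '<html><body>'
    . '<h2><span><a href="/a">  First   story </a></span></h2>'
    . '<h2><span>No link here</span></h2>'
    . '<h2><span><a href="/b">Second story</a></span></h2>'
    . '</body></html>';

// Hypothetical helper: returns the cleaned text of every link under an h2 span.
function extractHeadlines(string $markup): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world HTML is rarely well formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = array();
    foreach ($xpath->query('//h2//span//a') as $anchor) {
        // Collapse whitespace runs to one space, then trim both ends.
        $titles[] = trim(preg_replace('/\s+/', ' ', $anchor->textContent));
    }
    return $titles;
}

print_r(extractHeadlines($markup));
```

Headings without a link are simply not matched by the XPath query, so no null check is needed here.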
You can use Regular Expressions.
Edit
Regex isn't the best solution for large problems, but for simple pages with a standard format, regex is often simplest to use.
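For a page whose headings follow one fixed pattern, a regex version could be sketched roughly as follows. The sample markup, the function name extractHeadlinesWithRegex, and the pattern itself are assumptions made for illustration; the pattern only matches an h2 that directly wraps a span and a link, and it will break if the page structure changes.

```php
<?php
// Hypothetical sample markup; real pages will differ.
$markup = '<h2><span><a href="/a">First story</a></span></h2>' . "\n"
    . '<h2><span><a href="/b">Second &amp; third</a></span></h2>';

// Hypothetical helper: captures the link text inside <h2><span><a>...</a>.
function extractHeadlinesWithRegex(string $markup): array
{
    // i = case-insensitive tags, s = let "." span newlines inside a link
    $pattern = '~<h2[^>]*>\s*<span[^>]*>\s*<a[^>]*>(.*?)</a>~is';
    preg_match_all($pattern, $markup, $matches);

    $titles = array();
    foreach ($matches[1] as $raw) {
        // Drop any nested tags, decode entities such as &amp;, trim the ends.
        $titles[] = trim(html_entity_decode(strip_tags($raw)));
    }
    return $titles;
}

print_r(extractHeadlinesWithRegex($markup));
```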