Scraping the HN Front Page - Handling Simple HTML DOM Errors
I'm using 'Simple HTML Dom' to scrape the HN Front Page (news.ycombinator.com), which works great most of the time.
However, every now and then they promote a job/company that lacks the elements that the scraper is looking for, i.e. score, username and number of comments.
This, of course, breaks the array and thus the output of my script:
<?php
// 2012-02-12 Maximilian (Extract news.ycombinator.com's Front Page)
// Set the header during development
//header ("content-type: text/xml");
// Call the external PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm)
include('lib/simple_html_dom.php');
date_default_timezone_set('Europe/Berlin');
// Download 'news.ycombinator.com' content
//$tmp = file_get_contents('http://news.ycombinator.com');
//file_put_contents('get.tmp', $tmp);
// Retrieve the content
$html = file_get_html('tc.tmp');
// Set the extraction pattern for each item
$title = $html->find("tr td table tr td.title a");
$score = $html->find("tr td.subtext span");
$user = $html->find("tr td.subtext a[href^=user]");
$link = $html->find("tr td table tr td.title a");
$time = $html->find("tr td.subtext");
$additionals = $html->find("tr td.subtext a[href^=item?id]");
// Construct the feed by looping through the items
$result = '';
for ($i = 0; $i < 29; $i++) {
    // Check if the item points to an external website
    if (!strstr($link[$i]->href, 'http')) {
        $url = 'http://news.ycombinator.com/'.$link[$i]->href;
        $description = "Join the discussion on Hacker News.";
    } else {
        $url = $link[$i]->href;
        // Getting content here
        if (empty($abstract)) {
            $description = "Failed to load any relevant content. Please try again later.";
        } else {
            $description = $abstract;
        }
    }
    // Put all the items together
    $result .= '<item><id>f'.$i.'</id>'
        .'<title>'.htmlspecialchars(trim($title[$i]->plaintext)).'</title>'
        .'<description><![CDATA['.$description.']]></description>'
        .'<pubDate>'.str_replace(' | '.$additionals[$i]->plaintext, '', str_replace($score[$i]->plaintext.' by '.$user[$i]->plaintext.' ', '', $time[$i]->plaintext)).'</pubDate>'
        .'<score>'.$score[$i]->plaintext.'</score>'
        .'<user>'.$user[$i]->plaintext.'</user>'
        .'<comments>'.$additionals[$i]->plaintext.'</comments>'
        .'<id>'.substr($additionals[$i]->href, 8).'</id>'
        .'<discussion>http://news.ycombinator.com/'.$additionals[$i]->href.'</discussion>'
        .'<link>'.htmlspecialchars($url).'</link></item>';
}
$output = '<rss><channel><id>news.ycombinator.com Frontpage</id><buildDate>'.date('Y-m-d H:i:s').'</buildDate>'.$result.'</channel></rss>';
file_put_contents('tc.xml', $output);
?>
Here's an example of the correct output
<item>
<id>f0</id>
<title>Show HN: Bootswatch, free swatches for your Bootstrap site</title>
<description><![CDATA[Easy to Install Simply download the CSS file from the swatch of your choice and replace the one in Bootstrap. No messing around with hex values. Whole New Feel We've all been there with the black bar and blue buttons. See how a splash of color and typography can transform the feel of your site. Modular Changes are contained in just two LESS files, enabling modification and ensuring forward compatibility.]]></description>
<pubDate>3 hours ago</pubDate>
<score>196 points</score>
<user>parkov</user>
<comments>30 comments</comments>
<id>3594540</id>
<discussion>http://news.ycombinator.com/item?id=3594540</discussion>
<link>http://bootswatch.com</link>
</item>
<item>
<id>f1</id>
<title>Louis CK inspires Jim Gaffigan to sell comedy special for $5 online</title>
<description><![CDATA[Dear Internet Friends,Inspired by the brilliant Louis CK, I have decided to debut my all-new hour stand-up special on my website, Jimgaffigan.com.Beginning sometime in April, “Jim Gaffigan: Mr. Universe” will be available exclusively for download for only $5. A dollar from each download will go directly to The Bob Woodruff Foundation; a charity dedicated to serving injured Veterans and their families.I am confident that the low price of my new comedy special and the fact that 20% of each $5 download will be donated to this very noble cause will prevent people from stealing it. Maybe I’m being naïve, but I trust you guys.]]></description>
<pubDate>57 minutes ago</pubDate>
<score>25 points</score>
<user>rkudeshi</user>
<comments>4 comments</comments>
<id>3595285</id>
<discussion>http://news.ycombinator.com/item?id=3595285</discussion>
<link>http://www.whosay.com/jimgaffigan/content/218011</link>
</item>
And here's an example of incorrect output. Note that the elements are not empty, so I can't simply catch the error and skip to the next item. Everything past the promoted post breaks:
<item>
<id>f14</id>
<title>Build the next Legos: We're hiring an iOS Developer & Web Developer (YC S11)</title>
<description><![CDATA[Interested in building the next generation of toys on digital devices such as the iPad? That’s what we’re doing here at Launchpad Toys with apps like Toontastic (Named one of the “Top 10 iPad Apps of 2011” by the New York Times and was recently added to the iTunes Hall of Fame) and an awesom]]><![CDATA[e suite of others we have under development. We’re looking for creative and playful coders that have made games or highly visual apps/sites in the past for our two open development positions. As a kid, you probably played with Legos endlessly and grew up to be a hacker because you still love building things. Sounds like you? Email us at [email protected] with a couple links to some projects and code that we can look at along with your resume.]]></description>
<pubDate>2 hours ago</pubDate>
<score>14 points</score>
<user>bproper</user>
<comments>7 comments</comments>
<id>3594944</id>
<discussion>http://news.ycombinator.com/item?id=3594944</discussion>
<link>http://launchpadtoys.com/blog/2012/02/iosdeveloper-webdeveloper/</link>
</item>
<item>
<id>f15</id>
<title>SOPA foe Fred Wilson supports a blacklist on pirate sites</title>
<description><![CDATA[VC Fred Wilson says Google, Bing, Facebook, and Twitter should warn people when they try to log in at known pirate sites: "We don't need legislation." Fred Wilson says: If they try to pass antipiracy legislation, it will once again be 'war.' (Credit: Greg Sandoval/CNET) Fred Wilson, a well-known ven]]><![CDATA[ture capitalist from New York, says he's in favor of creating a blacklist for Web sites found to traffic in pirated films, music, and other intellectual property. The co-founder of Union Square Ventures told a gathering of media executives at the Paley Center for Media yesterday that he believes a good antipiracy measure would be for Google, Twitter, Facebook, and other major sites to issue warnings to people when they try to connect with a known pirate site. Fred Wilson, a co-founder of Union Square Ventures, says 'Our children have been taught to steal.' (Credit: Union Square Ventures) Wilson favors establishing an independent group to create a "black and white list." "The blacklist are those sites we all know are bad news," he told the audience in New York.]]></description>
<pubDate>14 points by bproper 2 hours ago | 7 comments</pubDate>
<score>24 points</score>
<user>andrewcross</user>
<comments>12 comments</comments>
<id>3594558</id>
<discussion>http://news.ycombinator.com/item?id=3594558</discussion>
<link>http://news.cnet.com/8301-31001_3-57377862-261/post-sopa-influential-tech-investor-favors-blacklisting-pirate-sites/</link>
</item>
So here's my question: How can I handle a situation where a particular element is missing and find() doesn't throw an error? Do I have to start from scratch, or is there a better approach to scraping the HN front page?
For anyone curious, here's the whole XML file: http://thequeue.org/api/tc.xml
2 Answers
You have to work in chunks to handle that; there seems to be a dummy spacer element that can help you with it:
Then use subselectors:
And for the other elements you use the same selectors.
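The answer's original snippets weren't captured in this archive. Here is a minimal sketch of the chunk/subselector idea, assuming HN's markup of the time (a `td.title`/`td.subtext` row pair per story); the selector strings are illustrative, not taken from the original answer:

```php
<?php
include('lib/simple_html_dom.php');

$html = file_get_html('tc.tmp');

// Iterate per story instead of building six parallel arrays. With a
// subselector, a missing piece comes back as NULL for THIS story only,
// rather than shifting the indexes of every later story.
foreach ($html->find('td.subtext') as $subtext) {
    $scoreEl = $subtext->find('span', 0);             // e.g. "196 points"
    $userEl  = $subtext->find('a[href^=user]', 0);    // e.g. "parkov"
    $itemEl  = $subtext->find('a[href^=item?id]', 0); // e.g. "30 comments"

    // Promoted job posts lack score, user and comments - skip them.
    if ($scoreEl === null || $userEl === null || $itemEl === null) {
        continue;
    }
    // ... build the <item> element from $scoreEl, $userEl, $itemEl ...
}
?>
```

With `find($selector, 0)`, Simple HTML DOM returns the first match or NULL, which gives you the per-item "is it missing?" check that the flat parallel-array approach cannot provide.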
Well, thanks to Ivan's train of thought, I am now splitting the initially scraped HTML into an array, with each node representing a post. Then, going through every single post in a loop, I check whether the up-vote arrow image exists. If it doesn't, I don't add the post to the result. In the end everything is stitched back together and the sponsored post is left out. Here's the code:
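The snippet itself didn't survive in this archive. Below is a reconstruction of the approach described, assuming the spacer row (`<tr style="height:5px">`) and the up-vote arrow image name (`grayarrow.gif`) that HN served at the time; both are assumptions, not taken from the original post:

```php
<?php
// Split the scraped front page into one chunk per story. The spacer
// row between stories and the grayarrow.gif up-vote image are assumed
// details of HN's 2012 markup.
$raw = file_get_contents('tc.tmp');
$chunks = explode('<tr style="height:5px">', $raw);

$clean = '';
foreach ($chunks as $chunk) {
    // Promoted job posts have no voting arrow - leave them out.
    if (strpos($chunk, 'grayarrow.gif') === false) {
        continue;
    }
    $clean .= $chunk;
}
// $clean now holds only regular stories and can be handed to
// str_get_html() for the same per-item extraction as before.
?>
```

Filtering on the raw string before parsing keeps the existing extraction code unchanged: once the promoted rows are gone, the parallel arrays line up again.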