Remove a cell from a scrape
Hey all,
I've built a website scraper that pulls the top 40 from the record industry's chart site, but one of the columns in the table I'm scraping is sometimes missing. Basically, I need a way to remove every instance of this from my scrape:
<td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>
Here's what I've got from a tutorial so far.
$url = "http://www.ariacharts.com.au/pages/charts_display_singles.asp?chart=1U50";
$raw = file_get_contents($url);

// Strip tabs, line breaks, double spaces and other control characters
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// Cut the chart table out of the page (8 is the length of '</table>')
$start = strpos($content, '<table class="chartTable"');
$end = strpos($content, '</table>', $start) + 8;
$table = substr($content, $start, $end - $start);

// Split the table into rows, then each non-header row into cells
preg_match_all("|<tr(.*)</tr>|U", $table, $rows);
foreach ($rows[0] as $row) {
    if (strpos($row, '<th') === false) {
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        $number = strip_tags($cells[0][1]);
        $name = strip_tags($cells[0][5]);
        $artist = strip_tags($cells[0][6]);
        // Normalise the title to "Title Case"
        $name = strtolower($name);
        $name = ucwords($name);
        echo "{$artist} - {$name} - Number {$number} <br>\n";
    }
}
Comments (2)
Try using the PHP Simple HTML DOM Parser instead of complex regex: http://simplehtmldom.sourceforge.net/
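To give a feel for the DOM-based approach, here is a rough sketch. Note that it swaps in PHP's bundled DOMDocument and DOMXPath classes rather than Simple HTML DOM, so that it runs without any download. The sample markup and its column order (position, optional red-dot cell, title, artist) are assumptions for illustration; the real chart table has more columns, so the cell indexes would need adjusting.

```php
<?php
// Hypothetical sample rows standing in for the real chart table.
// Row 1 carries the optional red-dot cell, row 2 does not.
$html = '<table class="chartTable">'
    . '<tr><th>Pos</th><th></th><th>Title</th><th>Artist</th></tr>'
    . '<tr><td>1</td>'
    . '<td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>'
    . '<td>SONG ONE</td><td>Artist A</td></tr>'
    . '<tr><td>2</td><td>SONG TWO</td><td>Artist B</td></tr>'
    . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // scraped HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Remove every <td> that holds the red-dot image so the remaining cells line up.
$redDots = $xpath->query('//td[img[contains(@src, "bullet_red.gif")]]');
foreach (iterator_to_array($redDots) as $cell) {
    $cell->parentNode->removeChild($cell);
}

// Walk the data rows (rows with <td> cells, which skips the header row).
$lines = [];
foreach ($xpath->query('//table[@class="chartTable"]//tr[td]') as $row) {
    $cells  = $row->getElementsByTagName('td');
    $number = trim($cells->item(0)->textContent);
    $name   = ucwords(strtolower(trim($cells->item(1)->textContent)));
    $artist = trim($cells->item(2)->textContent);
    $lines[] = "{$artist} - {$name} - Number {$number}";
}

echo implode("<br>\n", $lines), "\n";
```

Because the red-dot cell is deleted from the tree before any row is read, every row has the same number of cells and fixed indexes are safe again.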
For the quick and dirty method you want, put this code before you declare the "start" variable:
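The snippet this answer refers to is not included here. As a rough guess at what such a step could look like (an assumption, not the answerer's original code), one option is to delete the red-dot cell from `$content` with `preg_replace` before the table is cut out. The sample `$content` below is a hypothetical stand-in for the downloaded page.

```php
<?php
// Hypothetical stand-in for the page; in the question's script $content
// comes from file_get_contents() followed by str_replace().
$content = '<table class="chartTable"><tr>'
    . '<td>1</td>'
    . '<td><img src="/images/bullet_red.gif" width="8" height="8" title="Red Dot" /></td>'
    . '<td>Some Song</td>'
    . '</tr></table>';

// Delete every <td> whose only content is the red-dot image.
// Matching on the file name instead of the full tag tolerates small
// differences in attribute order or spacing.
$content = preg_replace(
    '~<td[^>]*>\s*<img[^>]*bullet_red\.gif[^>]*>\s*</td>~i',
    '',
    $content
);

echo $content, "\n";
```

With the cell gone before `$start` is computed, rows with and without the red dot end up with the same cell count, so the existing `$cells[0][...]` indexes point at the same columns on every row. The indexes themselves may need to shift by one if they were chosen with the red-dot cell present.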