Streamlining code to speed up a PHP scraper

Posted on 2024-12-09 01:15:04

The code simply dips into a page, gets all the content from the specified table, inserts it into my DB, and echoes it.

It's doing this very slowly. I need ideas to streamline it so it works faster.

<?php

// set up the loop

$pagenumber = 1001;

while ($pagenumber <= 5000) {

// get the content

$url = "http://www.example.com/info.php?num=$pagenumber";
$raw = file_get_contents($url);

$newlines = array("\t","\n","\r","&nbsp;","\0","\x0B");
$content = str_replace($newlines, '', $raw);

$start = strpos($content,'>Details<');
$end = strpos($content,'</table>',$start);
$table1 = substr($content,$start,$end-$start);
// $table1 = strip_tags($table1);

// get the first name

$start = strpos($table1,'<td');
$end = strpos($table1,'<br />',$start);
$fnames = substr($table1,$start,$end-$start);
$fnames = strip_tags($fnames);
$fnames = preg_replace('/\s\s+/', '', $fnames);

// get the surname

$start = strpos($table1,'<br />');
$end = strpos($table1,'</td>',$start);
$lnames = substr($table1,$start,$end-$start);
$lnames = strip_tags($lnames);
$lnames = preg_replace('/\s\s+/', '', $lnames);

// get the phone

$start = strpos($table1,'Phone:');
$end = strpos($table1,'</td>              </tr>              <tr>',$start);
$phone = substr($table1,$start,$end-$start);
$phone = strip_tags($phone);
$phone = str_replace("Phone:", "" ,$phone);
$phone = preg_replace('/\s\s+/', '', $phone);

// get the address

$start = strpos($table1,'Address:');
$end = strpos($table1,'</td>              </tr>              <tr>',$start);
$ad = substr($table1,$start,$end-$start);
$ad = strip_tags($ad);
$ad = str_replace("Address:", "" ,$ad);
$ad = preg_replace('/\s\s+/', '', $ad);

// get the apartment number

$start = strpos($table1,'Apt:');
$end = strpos($table1,'</td>              </tr>              <tr>',$start);
$apt = substr($table1,$start,$end-$start);
$apt = strip_tags($apt);
$apt = str_replace("Apt:", "" ,$apt);
$apt = preg_replace('/\s\s+/', '', $apt);

// get the country

$start = strpos($table1,'Country:');
$end = strpos($table1,'</td>              </tr>              <tr>',$start);
$country = substr($table1,$start,$end-$start);
$country = strip_tags($country);
$country = str_replace("Country:", "" ,$country);
$country = preg_replace('/\s\s+/', '', $country);

// get the city

$start = strpos($table1,'City:<br />                 State/Province:');
$end = strpos($table1,'</td>              </tr>              <tr>',$start);
$city = substr($table1,$start,$end-$start);
$city = strip_tags($city);
$city = str_replace("City:                 State/Province:", "" ,$city);
$city = preg_replace('/\s\s+/', '', $city);

// get the zip

$start = strpos($table1,'Zip:');
$end = strpos($table1,'</td>              </tr>              <tr>',$start);
$zip = substr($table1,$start,$end-$start);
$zip = strip_tags($zip);
$zip = str_replace("Zip:", "" ,$zip);
$zip = preg_replace('/\s\s+/', '', $zip);

// get the email

$start = strpos($table1,'email:');
$end = strpos($table1,'</td>              </tr>',$start);
$email = substr($table1,$start,$end-$start);
$email = strip_tags($email);
$email = str_replace("email:", "" ,$email);
$email = preg_replace('/\s\s+/', '', $email);

// echo the row

echo "<tr>
<td><a href='http://www.example.com/info.php?num=$pagenumber'>link</a></td>
<td>$fnames</td>
<td>$lnames</td>
<td>$phone</td>
<td>$ad</td>
<td>$apt</td>
<td>$country</td>
<td>$city</td>
<td>$zip</td>
<td>$email</td>
</tr>";

// include the DB info

include("inf.php");
$tablename = 'list';

$fnames = mysql_real_escape_string($fnames);
$lnames = mysql_real_escape_string($lnames);
$phone = mysql_real_escape_string($phone);
$ad = mysql_real_escape_string($ad);
$apt = mysql_real_escape_string($apt);
$country = mysql_real_escape_string($country);
$city = mysql_real_escape_string($city);
$zip = mysql_real_escape_string($zip);
$email = mysql_real_escape_string($email);

// insert the row into the DB

$query = "INSERT INTO $tablename VALUES('', '$pagenumber', '$fnames', '$lnames', '$phone', '$ad', '$apt', '$country', '$city', '$zip', '$email')";
mysql_query($query) or die(mysql_error());

// move on to the next page

$pagenumber = $pagenumber + 1;
}

?>

心舞飞扬 2024-12-16 01:15:04

Don't use regular expressions on HTML. You should use XPath; for PHP specifically, DOMXPath.
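As a rough illustration of that suggestion, here is a minimal sketch using DOMDocument and DOMXPath. The sample markup and the helper name extractField are assumptions made up for this example; the real page's structure is not shown in the question, so the XPath query would need adjusting to match it.

```php
<?php
// Hypothetical sample markup, loosely modelled on the labels in the question.
$html = <<<HTML
<html><body>
<table>
  <tr><td>Details</td></tr>
  <tr><td>Phone: 555-0100</td></tr>
  <tr><td>Zip: 12345</td></tr>
  <tr><td>email: jane@example.com</td></tr>
</table>
</body></html>
HTML;

/**
 * Hypothetical helper: returns the text that follows $label in the first
 * table cell whose text starts with that label, or null if none is found.
 */
function extractField(DOMXPath $xpath, string $label): ?string
{
    // starts-with() and normalize-space() are standard XPath 1.0 functions.
    $query = sprintf('//td[starts-with(normalize-space(.), "%s")]', $label);
    $cells = $xpath->query($query);
    if ($cells === false || $cells->length === 0) {
        return null;
    }
    $text = trim($cells->item(0)->textContent);
    return trim(substr($text, strlen($label)));
}

libxml_use_internal_errors(true); // real-world HTML is rarely well-formed
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$fields = [];
foreach (['Phone:', 'Zip:', 'email:'] as $label) {
    $fields[$label] = extractField($xpath, $label);
}
print_r($fields);
```

Parsing the document once and querying it per field avoids the repeated strpos/substr/strip_tags passes, and it keeps working if the whitespace between tags changes.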

烟酉 2024-12-16 01:15:04

You could take a look at cURL:

http://nl2.php.net/manual/en/book.curl.php

After grabbing the page(s), you could use a single pattern to grab all the required fields.
Matching can be done with preg_match_all.
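A minimal sketch of the single-pattern idea follows. The sample fragment, the pattern, and the function name extractLabelledFields are assumptions for illustration only; the pattern assumes each value runs from its "Label:" up to the next tag, which may not hold for the real page.

```php
<?php
// Hypothetical fragment modelled on the labels used in the question.
$table = '<tr><td>Phone: 555-0100</td></tr>'
       . '<tr><td>Address: 1 Main St</td></tr>'
       . '<tr><td>Zip: 12345</td></tr>';

/**
 * Hypothetical helper: collects every "Label: value" pair in one pass.
 */
function extractLabelledFields(string $html): array
{
    // Group 1 is the label, group 2 is everything up to the next "<".
    $pattern = '/(Phone|Address|Apt|Country|Zip|email):\s*([^<]*)/';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $fields = [];
    foreach ($matches as $match) {
        $fields[$match[1]] = trim($match[2]);
    }
    return $fields;
}

$fields = extractLabelledFields($table);
print_r($fields);
```

With PREG_SET_ORDER each element of $matches holds one complete match, so a single call replaces the per-field strpos/substr blocks.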

Also, is there no XML/RSS feed available for the data you are after?
See if you can show more results per page on your example site; this would reduce the number of pages you need to crawl.

Edit:
As requested, a simple example.

Make sure you have cURL enabled on your server:

    echo 'cURL is ' . (function_exists('curl_init') ? '' : ' not') . ' enabled';

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, 'http://example.com');
    curl_setopt($ch, CURLOPT_REFERER, 'http://example.com');
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');  // accept compressed responses
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);                // give up after 5 seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);         // return the body instead of printing it

    $page = curl_exec($ch);
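One way this could be applied to the loop in the question is to create the handle once and reuse it for every page, which lets cURL keep the connection open between requests. The sketch below is only an assumption about how the pieces might fit together; the function names createScraperHandle, buildUrl, fetchPage, and scrapeRange are hypothetical, and the URL pattern is the one from the question.

```php
<?php
/**
 * Hypothetical helper: one handle carrying a subset of the options shown above.
 */
function createScraperHandle()
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_ENCODING       => 'gzip,deflate',
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 5,
        CURLOPT_RETURNTRANSFER => true,
    ]);
    return $ch;
}

/** Builds the page URL used in the question for a given page number. */
function buildUrl(int $pageNumber): string
{
    return 'http://www.example.com/info.php?num=' . $pageNumber;
}

/** Fetches one URL on an existing handle; returns null on failure. */
function fetchPage($ch, string $url): ?string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

/** Fetches a range of pages, reusing a single handle for all of them. */
function scrapeRange(int $first, int $last): array
{
    $ch = createScraperHandle();
    $pages = [];
    for ($n = $first; $n <= $last; $n++) {
        $pages[$n] = fetchPage($ch, buildUrl($n));
    }
    return $pages;
}
```

Calling scrapeRange(1001, 5000) would cover the same page numbers as the original while loop; each returned body could then be passed to whatever parsing step is used.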
