Best way to read a website?

I'm trying to create a program that grabs data from a website x amount of times and I'm looking for a way to go about doing so without huge delays in the process.

Currently I use the following code, and it's rather slow (even though it is only grabbing 4 people's names, I'm expecting to do about 100 at a time):

$skills = array(
    "overall", "attack", "defense", "strength", "constitution", "ranged",
    "prayer", "magic", "cooking", "woodcutting", "fletching", "fishing",
    "firemaking", "crafting", "smithing", "mining", "herblore", "agility",
    "thieving", "slayer", "farming", "runecrafting", "hunter", "construction",
    "summoning", "dungeoneering"
);

$participants = array("Zezima", "Allar", "Foot", "Arma150", "Green098", "Skiller 703", "Quuxx");//explode("\r\n", $_POST['names']);

$skill = isset($_GET['skill']) ? array_search($_GET['skill'], $skills) : 0; // needle first, then haystack

display($participants, $skills, $skill);

function getAllStats($participants) {
    $stats = array();
    for ($i = 0; $i < count($participants); $i++) {
        $stats[] = getStats($participants[$i]);
    }
    return $stats;
}

function display($participants, $skills, $stat) {
    $all = getAllStats($participants);
    for ($i = 0; $i < count($participants); $i++) {
        $rank = getSkillData($all[$i], 0, $stat);
        $level = getSkillData($all[$i], 1, $stat);
        $experience = getSkillData($all[$i], 2, $stat); // xp is the third field (index 2)
    }
}

function getStats($username) {
    $curl = curl_init("http://hiscore.runescape.com/index_lite.ws?player=" . urlencode($username));
    curl_setopt ($curl, CURLOPT_CONNECTTIMEOUT, 5); // was an undefined $timeout
    curl_setopt ($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
    curl_setopt ($curl, CURLOPT_HEADER, 0); // was an undefined $header
    curl_setopt ($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt ($curl, CURLOPT_VERBOSE, 1);
    $output = curl_exec($curl);
    $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE); // must be read after curl_exec()
    curl_close ($curl);
    if (strstr($output, "<html><head><title>")) {
        return false;
    }
    return $output;
}

function getSkillData($stats, $row, $skill) {
    $stats = explode("\n", $stats);
    $levels = explode(",", $stats[$skill]);
    return $levels[$row];
}

When I benchmarked this it took about 5 seconds, which isn't too bad, but imagine if I was doing this 93 more times. I understand it won't be instant, but I'd like to shoot for under 30 seconds. I know it's possible because I've seen websites which do something similar and they act within a 30 second time period.

I've read about caching the data, but that won't work because, simply put, it will be old. I'm using a database (further on; I haven't gotten to that part yet) to store old data and retrieve new data, which will be real time (what you see below).

Is there a way to achieve doing something like this without massive delays (and possibly overloading the server I am reading from)?

P.S.: The website I am reading from is just text; it doesn't have any HTML to parse, which should reduce the loading time. Here's an example of what a page looks like (they're all the same, just different numbers; each line is rank,level,xp for one skill, in the same order as $skills):
69,2496,1285458634
10982,99,33055154
6608,99,30955066
6978,99,40342518
12092,99,36496288
13247,99,21606979
2812,99,13977759
926,99,36988378
415,99,153324269
329,99,59553081
472,99,40595060
2703,99,28297122
281,99,36937100
1017,99,19418910
276,99,27539259
792,99,34289312
3040,99,16675156
82,99,39712827
80,99,104504543
2386,99,21236188
655,99,28714439
852,99,30069730
29,99,200000000
3366,99,15332729
2216,99,15836767
154,120,200000000
-1,-1
-1,-1
-1,-1
-1,-1
-1,-1
30086,2183
54640,1225
89164,1028
123432,1455
-1,-1
-1,-1

My previous benchmark with this method vs. curl_multi_exec:

function getTime() { 
    $timer = explode(' ', microtime()); 
    $timer = $timer[1] + $timer[0]; 
    return $timer; 
}

function benchmarkFunctions() {
    $start = getTime();
    old_f();
    $end = getTime();
    echo 'function old_f() took ' . round($end - $start, 4) . ' seconds to complete<br><br>';
    $startt = getTime();
    new_f();
    $endd = getTime();
    echo 'function new_f() took ' . round($endd - $startt, 4) . ' seconds to complete';
}

function old_f() {
    $test = array("A E T", "Ts Danne", "Funkymunky11", "Fast993", "Fast99Three", "Jeba", "Quuxx");
    getAllStats($test);
}

function new_f() {
    $test = array("A E T", "Ts Danne", "Funkymunky11", "Fast993", "Fast99Three", "Jeba", "Quuxx");
    $curl_arr = array();
    $master = curl_multi_init();

    $amt = count($test);
    for ($i = 0; $i < $amt; $i++) {
        $curl_arr[$i] = curl_init('http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($test[$i])); // encode spaces in names
        curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($master, $curl_arr[$i]);
    }

    do {
        curl_multi_exec($master, $running);
        curl_multi_select($master); // block until there is activity instead of busy-looping
    } while ($running > 0);

    for ($i = 0; $i < $amt; $i++) {
        // curl_exec() here would re-issue each request serially;
        // curl_multi_getcontent() returns what the multi handle already downloaded
        $results = curl_multi_getcontent($curl_arr[$i]);
        curl_multi_remove_handle($master, $curl_arr[$i]);
    }
    curl_multi_close($master);
}


Comments (4)

耳钉梦 2024-12-15 12:07:56

When you are doing a bunch of network requests like this, you are at the mercy of the network and the remote server regarding how much time they take to respond.

Because of this, the best way to make all of your requests complete in the shortest amount of time is probably to do them all at once. Spawn a new thread for each one. For the size of the data you're working with, it's probably quite possible to do them literally all at once, but if that's a problem then maybe try 20 or so at a time.

EDIT: I just realized you're using PHP which doesn't have threads. Well, probably a poor choice of language, for starters. But you might be able to emulate threads by forking new processes. This might be a wreck, though, if PHP is running inside the web server process, since it would clone the whole server. I'll look into whether PHP offers some sort of asynchronous web requests that could give a similar effect.

EDIT 2:

Here is a page discussing how to launch an HTTP request in the background with PHP:

http://w-shadow.com/blog/2007/10/16/how-to-run-a-php-script-in-the-background/

However, this is "fire and forget," it doesn't let you pick up the response to your request and do something with it. One approach you could take with this, though, would be to use this method to fire off many requests to a different page on your own server, and have each one of those pages make a single request to the remote server. (Or, each worker request could process a batch of requests if you don't want to start too many requests at once.)
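Here is a minimal sketch of that fire-and-forget trigger, along the lines of the linked article. The worker.php name, its player parameter, and the localhost target are all made up for illustration; the real worker script would need to call ignore_user_abort(true) so it keeps running after we disconnect:

function triggerWorker($player) {
    // Open a plain socket to our own web server, send a request,
    // then close without reading the response body.
    $errno = 0;
    $errstr = '';
    $fp = fsockopen('localhost', 80, $errno, $errstr, 5);
    if (!$fp) {
        return false;
    }
    $path = '/worker.php?player=' . urlencode($player); // hypothetical worker script
    fwrite($fp, "GET " . $path . " HTTP/1.1\r\n");
    fwrite($fp, "Host: localhost\r\n");
    fwrite($fp, "Connection: Close\r\n\r\n");
    fclose($fp); // the worker keeps running server-side (with ignore_user_abort(true))
    return true;
}

foreach (array("Zezima", "Allar", "Foot") as $player) {
    triggerWorker($player); // returns immediately; the fetches happen in parallel
}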

You would still need a way to assemble all the results, and a way to detect when the whole procedure is complete so you can display the results. I would probably use either the database or the filesystem to coordinate between the different processes.

(Again, choosing a more powerful language for this task would probably be helpful. In the realm of languages similar to PHP, I know Perl would handle this problem very easily with "use threads", and I imagine Python or Ruby would as well.)

EDIT 3:

Another solution, this one using the UNIX shell to get around PHP's limitations by doing the work in separate processes. You can do a command something like this:

echo "$urlList" | xargs -P 10 -r -n1 wget

You would probably want to play with the wget options a bit, such as specifying the output file explicitly, but this is the general idea. In place of wget you could also use curl, or even just call a PHP script that's designed to be run from the command line if you want complete control over the job of fetching the pages.
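For example, here is a sketch of driving that pipeline from PHP; the /tmp/stats output directory and the player list are made up for illustration:

// Build a newline-separated URL list and fetch 10 at a time with wget.
$names = array("Zezima", "Allar", "Foot", "Arma150");
$urls = array();
foreach ($names as $name) {
    $urls[] = 'http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($name);
}

// xargs -P 10 runs up to 10 wget processes at once, -n1 gives each one URL;
// wget -P /tmp/stats saves the responses into that directory.
$cmd = 'echo ' . escapeshellarg(implode("\n", $urls)) . ' | xargs -P 10 -r -n1 wget -q -P /tmp/stats';
shell_exec($cmd);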

Again, with this solution you still have the problem of recognizing when the job is done so you can show the results.

I got the idea for this approach from this page:

http://www.commandlinefu.com/commands/view/3269/parallel-file-downloading-with-wget

笑着哭最痛 2024-12-15 12:07:56

You can reuse curl connections. Also, I changed your code to check the httpCode instead of using strstr. Should be quicker.

Also, you can set up curl to do it in parallel, which I've never tried; a sketch of the idea follows. See http://www.php.net/manual/en/function.curl-multi-exec.php
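Something like the following, going by the manual page above (untested, so treat it as a starting point rather than working code):

function getStatsParallel($usernames) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($usernames as $name) {
        $ch = curl_init('http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($name));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        curl_multi_add_handle($mh, $ch);
        $handles[$name] = $ch;
    }

    // Drive all transfers at once; curl_multi_select() blocks until there
    // is activity, so the loop doesn't spin the CPU.
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh);
        }
    } while ($running > 0);

    $results = array();
    foreach ($handles as $name => $ch) {
        if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
            $results[$name] = curl_multi_getcontent($ch);
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}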

An improved getStats() with a reused curl handle:

function getStats($curl, $username) {
    curl_setopt($curl, CURLOPT_URL, "http://hiscore.runescape.com/index_lite.ws?player=" . urlencode($username));
    $output = curl_exec($curl);
    if (curl_getinfo($curl, CURLINFO_HTTP_CODE) != 200) { // anything but 200 means the lookup failed
        return null;
    }
    return $output;
}

Usage:

$participants = array("Zezima", "Allar", "Foot", "Arma150", "Green098", "Skiller 703", "Quuxx");

$curl = curl_init();
curl_setopt ($curl, CURLOPT_CONNECTTIMEOUT, 0); //dangerous! will wait indefinitely
curl_setopt ($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
curl_setopt ($curl, CURLOPT_HEADER, false);
curl_setopt ($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt ($curl, CURLOPT_VERBOSE, 1);
//try:
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
    'Connection: Keep-Alive',
    'Keep-Alive: 300'
));


header('Content-type:text/plain');
foreach ($participants as $user) {
    $stats =  getStats($curl, $user);
    if($stats!==null) {
        echo $stats."\r\n";
    }
}

curl_close($curl);
老娘不死你永远是小三 2024-12-15 12:07:56

Since you are making multiple requests to the same host, you can re-use the curl handle, and if the site supports keep-alive requests, it could speed up your process a good bit over many requests.

You can change your function like this:

function getStats($username) {
    static $curl = null;

    if ($curl == null) {
        $curl = curl_init();
    }

    curl_setopt($curl, CURLOPT_URL, "http://hiscore.runescape.com/index_lite.ws?player=" . urlencode($username));
    curl_setopt ($curl, CURLOPT_HTTPHEADER, array('Connection: Keep-Alive'));

    //...

    // remove curl_close($curl)
}

Doing this will make it so you don't have to close and re-establish the socket for every user request. It will use the same connection for all the requests.
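For instance, a usage sketch, assuming the //... above keeps the rest of the original function (including returning false on failure):

$participants = array("Zezima", "Allar", "Foot");
foreach ($participants as $user) {
    $stats = getStats($user); // reuses the same handle and connection each call
    if ($stats !== false) {
        echo $stats . "\n";
    }
}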

兲鉂ぱ嘚淚 2024-12-15 12:07:56

curl is a very good way to read the content of a website - I suppose your problem is because of the time required to download ONE page. If you can get all 100 pages in parallel then you would probably have it all processed in under 10 seconds.

In order to avoid working with threads, locks, semaphores, and all the other challenging stuff that comes with threads, read this article and find out a way to make your application parallel almost for free.
