当前位置：文江博客话题详情

设计网络爬虫

发布于 2024-11-03 23:53:33 字数 272 浏览 8 评论 0原文

我遇到过一个面试问题“如果你设计一个网络爬虫，你会如何避免陷入无限循环？”我正在尝试回答它。

这一切是如何从头开始的。假设谷歌从一些中心页面开始，说有数百个（这些中心页面最初是如何找到的是一个不同的子问题）。当 Google 跟踪某个页面的链接等时，它是否会不断创建哈希表以确保它不会跟踪较早访问的页面。

如果同一个页面有 2 个名称（URL），那么现在我们有 URL 缩短器等。

我以 Google 为例。虽然谷歌没有透露其网络爬虫算法和页面排名等如何工作，但你有什么猜测吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

ぃ弥猫深巷。 2024-11-10 23:53:33

如果您想获得详细答案，请查看本文第 3.8 节，它描述了现代抓取工具的 URL-seen 测试：

在提取链接的过程中，任何
网络爬虫会遇到多种
指向同一文档的链接。为了避免
下载和处理文档
多次，URL-seen 测试必须
对每个提取的链接执行
在将其添加到 URL 边界之前。
（另一种设计是
相反，执行 URL-seen 测试时
URL 已从边界中删除，
但这种方法会导致
更大的边界。）
执行
URL-seen 测试，我们存储所有
墨卡托在规范中看到的 URL
形成一个名为 URL 的大表
放。再次，条目太多
让它们都适合记忆，所以就像
文档指纹集、URL
set 主要存储在磁盘上。
保存
空间，我们不存储文本
URL中每个URL的表示
集，而是一个固定大小的
校验和。与指纹不同
呈现给内容可见测试
文档指纹集、流
针对 URL 集测试的 URL 的数量
相当重要的局部性。到
减少操作次数
备份磁盘文件，因此我们保留
流行 URL 的内存缓存。
这个缓存的直觉是
某些 URL 的链接很常见，
所以将流行的缓存在内存中
将导致较高的内存命中率
率。
事实上，使用内存中
2^18 个条目的缓存和类似 LRU 的缓存
时钟更换政策，我们实现
内存中的总体命中率
缓存66.2%，命中率9.5%
在最近添加的 URL 表格中，
净命中率为75.7%。而且，
错过的请求中有 24.3%
热门 URL 的缓存和
最近添加的 URL 表，关于
1=3 在我们的缓冲区上产生命中
随机访问文件实现，
它也驻留在用户空间中。这
所有这些缓冲的最终结果是
我们执行的每次会员资格测试
URL 集结果的平均值
0.16 寻道和 0.17 读取内核
调用（其中一部分是
由内核的文件系统提供服务
缓冲区）。所以，每个 URL 设置成员资格
测试产生六分之一的内核
调用作为会员资格测试
文档指纹集。这些
节省纯粹是由于金额
URL 局部性（即重复
流行的 URL）是流中固有的
抓取过程中遇到的 URL 数量。

基本上，它们使用散列函数对所有 URL 进行散列，该散列函数保证每个 URL 的唯一散列，并且由于 URL 的局部性，因此很容易找到 URL。 Google 甚至开源了他们的哈希函数：CityHash

警告！< br>
他们也可能在谈论机器人陷阱！机器人陷阱是页面的一部分，它不断生成具有唯一 URL 的新链接，通过跟踪该页面提供的链接，您实际上会陷入“无限循环”。这并不完全是一个循环，因为循环可能是访问同一 URL 的结果，但它是一个无限的 URL 链，您应该避免对其进行爬网。

更新 12/13/2012- 世界应该结束后的第二天:)

根据 Fr0zenFyr 的评论：如果有人使用 AOPIC 算法来选择页面，那么很容易避免无限循环类型的机器人陷阱。以下是 AOPIC 工作原理的总结：

获取一组 N 个种子页面。
为每个页面分配X数量的信用，使得每个页面在爬行开始之前具有X/N信用（即等量的信用）。
选择一个页面P，其中P具有最高的信用量（或者如果所有页面具有相同的信用量，则抓取随机页面）。
抓取页面P（假设P被抓取时有100积分）。
提取页面 P 中的所有链接（假设有 10 个）。
将 P 的积分设置为 0。
取 10% 的“税”并将其分配给 Lambda 页面。
从 P 的原始信用额 - 税收中为页面 P 上找到的每个链接分配等量的信用额：因此（100（P 信用额）- 10（10% 税额））/10（链接）= 每个链接 9 个信用额。
从步骤3开始重复。

由于Lambda页面不断收税，最终它将成为信用金额最大的页面，我们必须“爬行”它。我在引号中说“爬行”，因为我们实际上并不对 Lambda 页面发出 HTTP 请求，我们只是获取其积分并将其平均分配给数据库中的所有页面。

由于机器人陷阱只提供内部链接信用，而很少从外部获得信用，因此它们会不断地将信用（来自税收）泄漏到 Lambda 页面。 Lambda 页面会将信用值均匀地分配给数据库中的所有页面，并且在每个周期中，机器人陷阱页面都会失去越来越多的信用值，直到它的信用值非常少，以至于几乎再也不会被爬行。对于好的页面来说，这种情况不会发生，因为它们经常从其他页面上找到的反向链接中获得积分。这也会导致动态页面排名，您会注意到，每当您拍摄数据库快照时，根据页面拥有的积分数量对页面进行排序，那么它们很可能会根据其 进行大致排序真实页面排名。

这只能避免无限循环类型的机器人陷阱，但还有许多其他机器人您应该提防的陷阱，并且也有方法可以绕过它们。

If you want to get a detailed answer take a look at section 3.8 this paper, which describes the URL-seen test of a modern scraper:

In the course of extracting links, any
Web crawler will encounter multiple
links to the same document. To avoid
downloading and processing a document
multiple times, a URL-seen test must
be performed on each extracted link
before adding it to the URL frontier.
(An alternative design would be to
instead perform the URL-seen test when
the URL is removed from the frontier,
but this approach would result in a
much larger frontier.)
To perform the
URL-seen test, we store all of the
URLs seen by Mercator in canonical
form in a large table called the URL
set. Again, there are too many entries
for them all to fit in memory, so like
the document fingerprint set, the URL
set is stored mostly on disk.
To save
space, we do not store the textual
representation of each URL in the URL
set, but rather a fixed-sized
checksum. Unlike the fingerprints
presented to the content-seen test’s
document fingerprint set, the stream
of URLs tested against the URL set has
a non-trivial amount of locality. To
reduce the number of operations on the
backing disk file, we therefore keep
an in-memory cache of popular URLs.
The intuition for this cache is that
links to some URLs are quite common,
so caching the popular ones in memory
will lead to a high in-memory hit
rate.
In fact, using an in-memory
cache of 2^18 entries and the LRU-like
clock replacement policy, we achieve
an overall hit rate on the in-memory
cache of 66.2%, and a hit rate of 9.5%
on the table of recently-added URLs,
for a net hit rate of 75.7%. Moreover,
of the 24.3% of requests that miss in
both the cache of popular URLs and the
table of recently-added URLs, about
1=3 produce hits on the buffer in our
random access file implementation,
which also resides in user-space. The
net result of all this buffering is
that each membership test we perform
on the URL set results in an average
of 0.16 seek and 0.17 read kernel
calls (some fraction of which are
served out of the kernel’s file system
buffers). So, each URL set membership
test induces one-sixth as many kernel
calls as a membership test on the
document fingerprint set. These
savings are purely due to the amount
of URL locality (i.e., repetition of
popular URLs) inherent in the stream
of URLs encountered during a crawl.

Basically they hash all of the URLs with a hashing function that guarantees unique hashes for each URL and due to the locality of URLs, it becomes very easy to find URLs. Google even open-sourced their hashing function: CityHash

WARNING!
They might also be talking about bot traps!!! A bot trap is a section of a page that keeps generating new links with unique URLs and you will essentially get trapped in an "infinite loop" by following the links that are being served by that page. This is not exactly a loop, because a loop would be the result of visiting the same URL, but it's an infinite chain of URLs which you should avoid crawling.

Update 12/13/2012- the day after the world was supposed to end :)

Per Fr0zenFyr's comment: if one uses the AOPIC algorithm for selecting pages, then it's fairly easy to avoid bot-traps of the infinite loop kind. Here is a summary of how AOPIC works:

Get a set of N seed pages.
Allocate X amount of credit to each page, such that each page has X/N credit (i.e. equal amount of credit) before crawling has started.
Select a page P, where the P has the highest amount of credit (or if all pages have the same amount of credit, then crawl a random page).
Crawl page P (let's say that P had 100 credits when it was crawled).
Extract all the links from page P (let's say there are 10 of them).
Set the credits of P to 0.
Take a 10% "tax" and allocate it to a Lambda page.
Allocate an equal amount of credits each link found on page P from P's original credit - the tax: so (100 (P credits) - 10 (10% tax))/10 (links) = 9 credits per each link.
Repeat from step 3.

Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we'll have to "crawl" it. I say "crawl" in quotes, because we don't actually make an HTTP request for the Lambda page, we just take its credits and distribute them equally to all of the pages in our database.

Since bot traps only give internal links credits and they rarely get credit from the outside, they will continually leak credits (from taxation) to the Lambda page. The Lambda page will distribute that credits out to all of the pages in the database evenly and upon each cycle the bot trap page will lose more and more credits, until it has so little credits that it almost never gets crawled again. This will not happen with good pages, because they often get credits from back-links found on other pages. This also results in a dynamic page rank and what you will notice is that any time you take a snapshot of your database, order the pages by the amount of credits they have, then they will most likely be ordered roughly according to their true page rank.

This only avoid bot traps of the infinite-loop kind, but there are many other bot traps which you should watch out for and there are ways to get around them too.

回复收藏 0 原文

双手揣兜 2024-11-10 23:53:33

虽然这里的每个人都已经建议了如何创建网络爬虫，但以下是 Google 如何对页面进行排名。

Google 根据回调链接的数量（其他网站上有多少链接指向特定网站/页面）为每个页面提供排名。这称为相关性得分。这是基于以下事实：如果一个页面有许多其他页面链接到它，则它可能是一个重要页面。

每个站点/页面都被视为图表中的一个节点。到其他页面的链接是有向边。顶点的度定义为传入边的数量。具有较多传入边的节点排名较高。

以下是 PageRank 的确定方法。假设页面 Pj 有 Lj 链接。如果这些链接之一是指向 Pi 页面，则 Pj 会将其重要性的 1/Lj 传递给 Pi。 Pi 的重要性排名是链接到它的页面所做的所有贡献的总和。因此，如果我们用 Bi 表示链接到 Pi 的页面集，则我们有以下公式：

Importance(Pi)= sum( Importance(Pj)/Lj ) for all links from Pi to Bi

排名放置在称为超链接矩阵的矩阵中： H[i,j]

该矩阵中的一行要么是 0，要么是 1/Lj如果有从 Pi 到 Bi 的链接。该矩阵的另一个属性是，如果我们将一列中的所有行相加，我们将得到 1。

现在我们需要将此矩阵乘以一个名为 I（特征值为 1）的特征向量，这样：

I = H*I

现在我们开始迭代： I H, IIH, IIIH .... I^k *H 直到解收敛。即我们在步骤 k 和 k+1 中得到的矩阵中的数字几乎相同。

现在 I 向量中剩下的就是每个页面的重要性。

有关简单的课堂作业示例，请参阅 http://www.math .cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html

至于解决面试问题中的重复问题，请对整个页面进行校验和，并使用该校验和或校验和的 bash 作为地图中的密钥来跟踪访问的页面。

While everybody here already suggested how to create your web crawler, here is how how Google ranks pages.

Google gives each page a rank based on the number of callback links (how many links on other websites point to a specific website/page). This is called relevance score. This is based on the fact that if a page has many other pages link to it, it's probably an important page.

Each site/page is viewed as a node in a graph. Links to other pages are directed edges. A degree of a vertex is defined as the number of incoming edges. Nodes with a higher number of incoming edges are ranked higher.

Here's how the PageRank is determined. Suppose that page Pj has Lj links. If one of those links is to page Pi, then Pj will pass on 1/Lj of its importance to Pi. The importance ranking of Pi is then the sum of all the contributions made by pages linking to it. So if we denote the set of pages linking to Pi by Bi, then we have this formula:

Importance(Pi)= sum( Importance(Pj)/Lj ) for all links from Pi to Bi

The ranks are placed in a matrix called hyperlink matrix: H[i,j]

A row in this matrix is either 0, or 1/Lj if there is a link from Pi to Bi. Another property of this matrix is that if we sum all rows in a column we get 1.

Now we need multiply this matrix by an Eigen vector, named I (with eigen value 1) such that:

I = H*I

Now we start iterating: IH, IIH, IIIH .... I^k *H until the solution converges. ie we get pretty much the same numbers in the matrix in step k and k+1.

Now whatever is left in the I vector is the importance of each page.

For a simple class homework example see http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html

As for solving the duplicate issue in your interview question, do a checksum on the entire page and use either that or a bash of the checksum as your key in a map to keep track of visited pages.

回复收藏 0 原文

永不分离 2024-11-10 23:53:33

取决于他们的问题的意图有多深。如果他们只是想避免来回访问相同的链接，那么对 URL 进行哈希处理就足够了。

如果内容实际上有数千个指向相同内容的 URL，该怎么办？就像 QueryString 参数一样，它不会影响任何内容，但可以有无限次迭代。我想您也可以对页面的内容进行哈希处理并比较 URL，以查看它们是否类似于捕获由多个 URL 标识的内容。例如，请参阅@Lirik 的帖子中提到的机器人陷阱。

回复收藏 0 原文

天涯沦落人 2024-11-10 23:53:33

您必须有某种哈希表来存储结果，您只需在每个页面加载之前检查它即可。

回复收藏 0 原文

冰雪之触 2024-11-10 23:53:33

这里的问题不是抓取重复的 URL，这是通过使用从 url 获得的哈希值的索引来解决的。问题是抓取重复的内容。 “爬虫陷阱”的每个 url 都是不同的（年、日、SessionID...）。

没有“完美”的解决方案...但您可以使用其中一些策略：

• 在网站内保留一个 url 级别的字段。对于从页面获取 url 的每个循环，增加级别。它将像一棵树一样。你可以在某个级别停止爬行，比如10（我认为谷歌使用这个）。

• 您可以尝试创建一种可以比较的HASH，以查找相似的文档，因为您无法与数据库中的每个文档进行比较。谷歌有 SimHash，但我找不到任何可以使用的实现。然后我创建了自己的。我的哈希对 html 代码中的低频和高频字符进行计数，并生成一个 20 字节的哈希，将其与 AVLTree 中最后抓取的页面的小型缓存进行比较，并使用具有一定容差（大约 2）的近邻搜索。您不能使用对此哈希中的字符位置的任何引用。 “识别”陷阱后，您可以记录重复内容的 url 模式，并开始忽略具有该模式的页面。

• 与谷歌一样，您可以为每个网站创建排名，并比其他网站更“信任”其中一个网站。

回复收藏 0 原文

思念绕指尖 2024-11-10 23:53:33

网络爬虫是一种计算机程序，用于从给定的网站 URL 收集/爬行以下关键值（HREF 链接、图像链接、元数据等）。它被设计成智能地跟踪已经从前一个 URL 获取的不同 HREF 链接，这样，爬虫就可以从一个网站跳转到其他网站。通常，它被称为网络蜘蛛或网络机器人。该机制始终充当网络搜索引擎的支柱。

请从我的技术博客中找到源代码 - http://www.algonuts.info/how-to-built-a-simple-web-crawler-in-php.html

<?php
class webCrawler
{
    public $siteURL;
    public $error;

    function __construct()
    {
        $this->siteURL = "";
        $this->error = "";
    }

    function parser()   
    {
        global $hrefTag,$hrefTagCountStart,$hrefTagCountFinal,$hrefTagLengthStart,$hrefTagLengthFinal,$hrefTagPointer;
        global $imgTag,$imgTagCountStart,$imgTagCountFinal,$imgTagLengthStart,$imgTagLengthFinal,$imgTagPointer;
        global $Url_Extensions,$Document_Extensions,$Image_Extensions,$crawlOptions;

        $dotCount = 0;
        $slashCount = 0;
        $singleSlashCount = 0;
        $doubleSlashCount = 0;
        $parentDirectoryCount = 0;

        $linkBuffer = array();

        if(($url = trim($this->siteURL)) != "")
        {
            $crawlURL = rtrim($url,"/");
            if(($directoryURL = dirname($crawlURL)) == "http:")
            {   $directoryURL = $crawlURL;  }
            $urlParser = preg_split("/\//",$crawlURL);

            //-- Curl Start --
            $curlObject = curl_init($crawlURL);
            curl_setopt_array($curlObject,$crawlOptions);
            $webPageContent = curl_exec($curlObject);
            $errorNumber = curl_errno($curlObject);
            curl_close($curlObject);
            //-- Curl End --

            if($errorNumber == 0)
            {
                $webPageCounter = 0;
                $webPageLength = strlen($webPageContent);
                while($webPageCounter < $webPageLength)
                {
                    $character = $webPageContent[$webPageCounter];
                    if($character == "")
                    {   
                        $webPageCounter++;  
                        continue;
                    }
                    $character = strtolower($character);
                    //-- Href Filter Start --
                    if($hrefTagPointer[$hrefTagLengthStart] == $character)
                    {
                        $hrefTagLengthStart++;
                        if($hrefTagLengthStart == $hrefTagLengthFinal)
                        {
                            $hrefTagCountStart++;
                            if($hrefTagCountStart == $hrefTagCountFinal)
                            {
                                if($hrefURL != "")
                                {
                                    if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1)
                                    {
                                        if($doubleSlashCount >= 1)
                                        {   $hrefURL = "http://".$hrefURL;  }
                                        else if($parentDirectoryCount >= 1)
                                        {
                                            $tempData = 0;
                                            $tempString = "";
                                            $tempTotal = count($urlParser) - $parentDirectoryCount;
                                            while($tempData < $tempTotal)
                                            {
                                                $tempString .= $urlParser[$tempData]."/";
                                                $tempData++;
                                            }
                                            $hrefURL = $tempString."".$hrefURL;
                                        }
                                        else if($singleSlashCount >= 1)
                                        {   $hrefURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$hrefURL;  }
                                    }
                                    $host = "";
                                    $hrefURL = urldecode($hrefURL);
                                    $hrefURL = rtrim($hrefURL,"/");
                                    if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true)
                                    {   
                                        $dump = parse_url($hrefURL);
                                        if(isset($dump["host"]))
                                        {   $host = trim(strtolower($dump["host"]));    }
                                    }
                                    else
                                    {
                                        $hrefURL = $directoryURL."/".$hrefURL;
                                        if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true)
                                        {   
                                            $dump = parse_url($hrefURL);    
                                            if(isset($dump["host"]))
                                            {   $host = trim(strtolower($dump["host"]));    }
                                        }
                                    }
                                    if($host != "")
                                    {
                                        $extension = pathinfo($hrefURL,PATHINFO_EXTENSION);
                                        if($extension != "")
                                        {
                                            $tempBuffer ="";
                                            $extensionlength = strlen($extension);
                                            for($tempData = 0; $tempData < $extensionlength; $tempData++)
                                            {
                                                if($extension[$tempData] != "?")
                                                {   
                                                    $tempBuffer = $tempBuffer.$extension[$tempData];
                                                    continue;
                                                }
                                                else
                                                {
                                                    $extension = trim($tempBuffer);
                                                    break;
                                                }
                                            }
                                            if(in_array($extension,$Url_Extensions))
                                            {   $type = "domain";   }
                                            else if(in_array($extension,$Image_Extensions))
                                            {   $type = "image";    }
                                            else if(in_array($extension,$Document_Extensions))
                                            {   $type = "document"; }
                                            else
                                            {   $type = "unknown";  }
                                        }
                                        else
                                        {   $type = "domain";   }

                                        if($hrefURL != "")
                                        {
                                            if($type == "domain" && !in_array($hrefURL,$this->linkBuffer["domain"]))
                                            {   $this->linkBuffer["domain"][] = $hrefURL;   }
                                            if($type == "image" && !in_array($hrefURL,$this->linkBuffer["image"]))
                                            {   $this->linkBuffer["image"][] = $hrefURL;    }
                                            if($type == "document" && !in_array($hrefURL,$this->linkBuffer["document"]))
                                            {   $this->linkBuffer["document"][] = $hrefURL; }
                                            if($type == "unknown" && !in_array($hrefURL,$this->linkBuffer["unknown"]))
                                            {   $this->linkBuffer["unknown"][] = $hrefURL;  }
                                        }
                                    }
                                }
                                $hrefTagCountStart = 0;
                            }
                            if($hrefTagCountStart == 3)
                            {
                                $hrefURL = "";
                                $dotCount = 0;
                                $slashCount = 0;
                                $singleSlashCount = 0;
                                $doubleSlashCount = 0;
                                $parentDirectoryCount = 0;
                                $webPageCounter++;
                                while($webPageCounter < $webPageLength)
                                {
                                    $character = $webPageContent[$webPageCounter];
                                    if($character == "")
                                    {   
                                        $webPageCounter++;  
                                        continue;
                                    }
                                    if($character == "\"" || $character == "'")
                                    {
                                        $webPageCounter++;
                                        while($webPageCounter < $webPageLength)
                                        {
                                            $character = $webPageContent[$webPageCounter];
                                            if($character == "")
                                            {   
                                                $webPageCounter++;  
                                                continue;
                                            }
                                            if($character == "\"" || $character == "'" || $character == "#")
                                            {   
                                                $webPageCounter--;  
                                                break;  
                                            }
                                            else if($hrefURL != "")
                                            {   $hrefURL .= $character; }
                                            else if($character == "." || $character == "/")
                                            {
                                                if($character == ".")
                                                {
                                                    $dotCount++;
                                                    $slashCount = 0;
                                                }
                                                else if($character == "/")
                                                {
                                                    $slashCount++;
                                                    if($dotCount == 2 && $slashCount == 1)
                                                    $parentDirectoryCount++;
                                                    else if($dotCount == 0 && $slashCount == 1)
                                                    $singleSlashCount++;
                                                    else if($dotCount == 0 && $slashCount == 2)
                                                    $doubleSlashCount++;
                                                    $dotCount = 0;
                                                }
                                            }
                                            else
                                            {   $hrefURL .= $character; }
                                            $webPageCounter++;
                                        }
                                        break;
                                    }
                                    $webPageCounter++;
                                }
                            }
                            $hrefTagLengthStart = 0;
                            $hrefTagLengthFinal = strlen($hrefTag[$hrefTagCountStart]);
                            $hrefTagPointer =& $hrefTag[$hrefTagCountStart];
                        }
                    }
                    else
                    {   $hrefTagLengthStart = 0;    }
                    //-- Href Filter End --
                    //-- Image Filter Start --
                    if($imgTagPointer[$imgTagLengthStart] == $character)
                    {
                        $imgTagLengthStart++;
                        if($imgTagLengthStart == $imgTagLengthFinal)
                        {
                            $imgTagCountStart++;
                            if($imgTagCountStart == $imgTagCountFinal)
                            {
                                if($imgURL != "")
                                {
                                    if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1)
                                    {
                                        if($doubleSlashCount >= 1)
                                        {   $imgURL = "http://".$imgURL;    }
                                        else if($parentDirectoryCount >= 1)
                                        {
                                            $tempData = 0;
                                            $tempString = "";
                                            $tempTotal = count($urlParser) - $parentDirectoryCount;
                                            while($tempData < $tempTotal)
                                            {
                                                $tempString .= $urlParser[$tempData]."/";
                                                $tempData++;
                                            }
                                            $imgURL = $tempString."".$imgURL;
                                        }
                                        else if($singleSlashCount >= 1)
                                        {   $imgURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$imgURL;    }
                                    }
                                    $host = "";
                                    $imgURL = urldecode($imgURL);
                                    $imgURL = rtrim($imgURL,"/");
                                    if(filter_var($imgURL,FILTER_VALIDATE_URL) == true)
                                    {   
                                        $dump = parse_url($imgURL); 
                                        $host = trim(strtolower($dump["host"]));
                                    }
                                    else
                                    {
                                        $imgURL = $directoryURL."/".$imgURL;
                                        if(filter_var($imgURL,FILTER_VALIDATE_URL) == true)
                                        {   
                                            $dump = parse_url($imgURL); 
                                            $host = trim(strtolower($dump["host"]));
                                        }   
                                    }
                                    if($host != "")
                                    {
                                        $extension = pathinfo($imgURL,PATHINFO_EXTENSION);
                                        if($extension != "")
                                        {
                                            $tempBuffer ="";
                                            $extensionlength = strlen($extension);
                                            for($tempData = 0; $tempData < $extensionlength; $tempData++)
                                            {
                                                if($extension[$tempData] != "?")
                                                {   
                                                    $tempBuffer = $tempBuffer.$extension[$tempData];
                                                    continue;
                                                }
                                                else
                                                {
                                                    $extension = trim($tempBuffer);
                                                    break;
                                                }
                                            }
                                            if(in_array($extension,$Url_Extensions))
                                            {   $type = "domain";   }
                                            else if(in_array($extension,$Image_Extensions))
                                            {   $type = "image";    }
                                            else if(in_array($extension,$Document_Extensions))
                                            {   $type = "document"; }
                                            else
                                            {   $type = "unknown";  }
                                        }
                                        else
                                        {   $type = "domain";   }

                                        if($imgURL != "")
                                        {
                                            if($type == "domain" && !in_array($imgURL,$this->linkBuffer["domain"]))
                                            {   $this->linkBuffer["domain"][] = $imgURL;    }
                                            if($type == "image" && !in_array($imgURL,$this->linkBuffer["image"]))
                                            {   $this->linkBuffer["image"][] = $imgURL; }
                                            if($type == "document" && !in_array($imgURL,$this->linkBuffer["document"]))
                                            {   $this->linkBuffer["document"][] = $imgURL;  }
                                            if($type == "unknown" && !in_array($imgURL,$this->linkBuffer["unknown"]))
                                            {   $this->linkBuffer["unknown"][] = $imgURL;   }
                                        }
                                    }
                                }
                                $imgTagCountStart = 0;
                            }
                            if($imgTagCountStart == 3)
                            {
                                $imgURL = "";
                                $dotCount = 0;
                                $slashCount = 0;
                                $singleSlashCount = 0;
                                $doubleSlashCount = 0;
                                $parentDirectoryCount = 0;
                                $webPageCounter++;
                                while($webPageCounter < $webPageLength)
                                {
                                    $character = $webPageContent[$webPageCounter];
                                    if($character == "")
                                    {   
                                        $webPageCounter++;  
                                        continue;
                                    }
                                    if($character == "\"" || $character == "'")
                                    {
                                        $webPageCounter++;
                                        while($webPageCounter < $webPageLength)
                                        {
                                            $character = $webPageContent[$webPageCounter];
                                            if($character == "")
                                            {   
                                                $webPageCounter++;  
                                                continue;
                                            }
                                            if($character == "\"" || $character == "'" || $character == "#")
                                            {   
                                                $webPageCounter--;  
                                                break;  
                                            }
                                            else if($imgURL != "")
                                            {   $imgURL .= $character;  }
                                            else if($character == "." || $character == "/")
                                            {
                                                if($character == ".")
                                                {
                                                    $dotCount++;
                                                    $slashCount = 0;
                                                }
                                                else if($character == "/")
                                                {
                                                    $slashCount++;
                                                    if($dotCount == 2 && $slashCount == 1)
                                                    $parentDirectoryCount++;
                                                    else if($dotCount == 0 && $slashCount == 1)
                                                    $singleSlashCount++;
                                                    else if($dotCount == 0 && $slashCount == 2)
                                                    $doubleSlashCount++;
                                                    $dotCount = 0;
                                                }
                                            }
                                            else
                                            {   $imgURL .= $character;  }
                                            $webPageCounter++;
                                        }
                                        break;
                                    }
                                    $webPageCounter++;
                                }
                            }
                            $imgTagLengthStart = 0;
                            $imgTagLengthFinal = strlen($imgTag[$imgTagCountStart]);
                            $imgTagPointer =& $imgTag[$imgTagCountStart];
                        }
                    }
                    else
                    {   $imgTagLengthStart = 0; }
                    //-- Image Filter End --
                    $webPageCounter++;
                }
            }
            else
            {   $this->error = "Unable to proceed, permission denied";  }
        }
        else
        {   $this->error = "Please enter url";  }

        if($this->error != "")
        {   $this->linkBuffer["error"] = $this->error;  }

        return $this->linkBuffer;
    }   
}
?>

The web crawler is a computer program which used to collect/crawling following key values(HREF links, Image links, Meta Data .etc) from given website URL. It is designed like intelligent to follow different HREF links which are already fetched from the previous URL, so in this way, Crawler can jump from one website to other websites. Usually, it called as a Web spider or Web Bot. This mechanism always acts as a backbone of the Web search engine.

Please find the source code from my tech blog - http://www.algonuts.info/how-to-built-a-simple-web-crawler-in-php.html

<?php
class webCrawler
{
    public $siteURL;
    public $error;

    function __construct()
    {
        $this->siteURL = "";
        $this->error = "";
    }

    function parser()   
    {
        global $hrefTag,$hrefTagCountStart,$hrefTagCountFinal,$hrefTagLengthStart,$hrefTagLengthFinal,$hrefTagPointer;
        global $imgTag,$imgTagCountStart,$imgTagCountFinal,$imgTagLengthStart,$imgTagLengthFinal,$imgTagPointer;
        global $Url_Extensions,$Document_Extensions,$Image_Extensions,$crawlOptions;

        $dotCount = 0;
        $slashCount = 0;
        $singleSlashCount = 0;
        $doubleSlashCount = 0;
        $parentDirectoryCount = 0;

        $linkBuffer = array();

        if(($url = trim($this->siteURL)) != "")
        {
            $crawlURL = rtrim($url,"/");
            if(($directoryURL = dirname($crawlURL)) == "http:")
            {   $directoryURL = $crawlURL;  }
            $urlParser = preg_split("/\//",$crawlURL);

            //-- Curl Start --
            $curlObject = curl_init($crawlURL);
            curl_setopt_array($curlObject,$crawlOptions);
            $webPageContent = curl_exec($curlObject);
            $errorNumber = curl_errno($curlObject);
            curl_close($curlObject);
            //-- Curl End --

            if($errorNumber == 0)
            {
                $webPageCounter = 0;
                $webPageLength = strlen($webPageContent);
                while($webPageCounter < $webPageLength)
                {
                    $character = $webPageContent[$webPageCounter];
                    if($character == "")
                    {   
                        $webPageCounter++;  
                        continue;
                    }
                    $character = strtolower($character);
                    //-- Href Filter Start --
                    if($hrefTagPointer[$hrefTagLengthStart] == $character)
                    {
                        $hrefTagLengthStart++;
                        if($hrefTagLengthStart == $hrefTagLengthFinal)
                        {
                            $hrefTagCountStart++;
                            if($hrefTagCountStart == $hrefTagCountFinal)
                            {
                                if($hrefURL != "")
                                {
                                    if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1)
                                    {
                                        if($doubleSlashCount >= 1)
                                        {   $hrefURL = "http://".$hrefURL;  }
                                        else if($parentDirectoryCount >= 1)
                                        {
                                            $tempData = 0;
                                            $tempString = "";
                                            $tempTotal = count($urlParser) - $parentDirectoryCount;
                                            while($tempData < $tempTotal)
                                            {
                                                $tempString .= $urlParser[$tempData]."/";
                                                $tempData++;
                                            }
                                            $hrefURL = $tempString."".$hrefURL;
                                        }
                                        else if($singleSlashCount >= 1)
                                        {   $hrefURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$hrefURL;  }
                                    }
                                    $host = "";
                                    $hrefURL = urldecode($hrefURL);
                                    $hrefURL = rtrim($hrefURL,"/");
                                    if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true)
                                    {   
                                        $dump = parse_url($hrefURL);
                                        if(isset($dump["host"]))
                                        {   $host = trim(strtolower($dump["host"]));    }
                                    }
                                    else
                                    {
                                        $hrefURL = $directoryURL."/".$hrefURL;
                                        if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true)
                                        {   
                                            $dump = parse_url($hrefURL);    
                                            if(isset($dump["host"]))
                                            {   $host = trim(strtolower($dump["host"]));    }
                                        }
                                    }
                                    if($host != "")
                                    {
                                        $extension = pathinfo($hrefURL,PATHINFO_EXTENSION);
                                        if($extension != "")
                                        {
                                            $tempBuffer ="";
                                            $extensionlength = strlen($extension);
                                            for($tempData = 0; $tempData < $extensionlength; $tempData++)
                                            {
                                                if($extension[$tempData] != "?")
                                                {   
                                                    $tempBuffer = $tempBuffer.$extension[$tempData];
                                                    continue;
                                                }
                                                else
                                                {
                                                    $extension = trim($tempBuffer);
                                                    break;
                                                }
                                            }
                                            if(in_array($extension,$Url_Extensions))
                                            {   $type = "domain";   }
                                            else if(in_array($extension,$Image_Extensions))
                                            {   $type = "image";    }
                                            else if(in_array($extension,$Document_Extensions))
                                            {   $type = "document"; }
                                            else
                                            {   $type = "unknown";  }
                                        }
                                        else
                                        {   $type = "domain";   }

                                        if($hrefURL != "")
                                        {
                                            if($type == "domain" && !in_array($hrefURL,$this->linkBuffer["domain"]))
                                            {   $this->linkBuffer["domain"][] = $hrefURL;   }
                                            if($type == "image" && !in_array($hrefURL,$this->linkBuffer["image"]))
                                            {   $this->linkBuffer["image"][] = $hrefURL;    }
                                            if($type == "document" && !in_array($hrefURL,$this->linkBuffer["document"]))
                                            {   $this->linkBuffer["document"][] = $hrefURL; }
                                            if($type == "unknown" && !in_array($hrefURL,$this->linkBuffer["unknown"]))
                                            {   $this->linkBuffer["unknown"][] = $hrefURL;  }
                                        }
                                    }
                                }
                                $hrefTagCountStart = 0;
                            }
                            if($hrefTagCountStart == 3)
                            {
                                $hrefURL = "";
                                $dotCount = 0;
                                $slashCount = 0;
                                $singleSlashCount = 0;
                                $doubleSlashCount = 0;
                                $parentDirectoryCount = 0;
                                $webPageCounter++;
                                while($webPageCounter < $webPageLength)
                                {
                                    $character = $webPageContent[$webPageCounter];
                                    if($character == "")
                                    {   
                                        $webPageCounter++;  
                                        continue;
                                    }
                                    if($character == "\"" || $character == "'")
                                    {
                                        $webPageCounter++;
                                        while($webPageCounter < $webPageLength)
                                        {
                                            $character = $webPageContent[$webPageCounter];
                                            if($character == "")
                                            {   
                                                $webPageCounter++;  
                                                continue;
                                            }
                                            if($character == "\"" || $character == "'" || $character == "#")
                                            {   
                                                $webPageCounter--;  
                                                break;  
                                            }
                                            else if($hrefURL != "")
                                            {   $hrefURL .= $character; }
                                            else if($character == "." || $character == "/")
                                            {
                                                if($character == ".")
                                                {
                                                    $dotCount++;
                                                    $slashCount = 0;
                                                }
                                                else if($character == "/")
                                                {
                                                    $slashCount++;
                                                    if($dotCount == 2 && $slashCount == 1)
                                                    $parentDirectoryCount++;
                                                    else if($dotCount == 0 && $slashCount == 1)
                                                    $singleSlashCount++;
                                                    else if($dotCount == 0 && $slashCount == 2)
                                                    $doubleSlashCount++;
                                                    $dotCount = 0;
                                                }
                                            }
                                            else
                                            {   $hrefURL .= $character; }
                                            $webPageCounter++;
                                        }
                                        break;
                                    }
                                    $webPageCounter++;
                                }
                            }
                            $hrefTagLengthStart = 0;
                            $hrefTagLengthFinal = strlen($hrefTag[$hrefTagCountStart]);
                            $hrefTagPointer =& $hrefTag[$hrefTagCountStart];
                        }
                    }
                    else
                    {   $hrefTagLengthStart = 0;    }
                    //-- Href Filter End --
                    //-- Image Filter Start --
                    if($imgTagPointer[$imgTagLengthStart] == $character)
                    {
                        $imgTagLengthStart++;
                        if($imgTagLengthStart == $imgTagLengthFinal)
                        {
                            $imgTagCountStart++;
                            if($imgTagCountStart == $imgTagCountFinal)
                            {
                                if($imgURL != "")
                                {
                                    if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1)
                                    {
                                        if($doubleSlashCount >= 1)
                                        {   $imgURL = "http://".$imgURL;    }
                                        else if($parentDirectoryCount >= 1)
                                        {
                                            $tempData = 0;
                                            $tempString = "";
                                            $tempTotal = count($urlParser) - $parentDirectoryCount;
                                            while($tempData < $tempTotal)
                                            {
                                                $tempString .= $urlParser[$tempData]."/";
                                                $tempData++;
                                            }
                                            $imgURL = $tempString."".$imgURL;
                                        }
                                        else if($singleSlashCount >= 1)
                                        {   $imgURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$imgURL;    }
                                    }
                                    $host = "";
                                    $imgURL = urldecode($imgURL);
                                    $imgURL = rtrim($imgURL,"/");
                                    if(filter_var($imgURL,FILTER_VALIDATE_URL) == true)
                                    {   
                                        $dump = parse_url($imgURL); 
                                        $host = trim(strtolower($dump["host"]));
                                    }
                                    else
                                    {
                                        $imgURL = $directoryURL."/".$imgURL;
                                        if(filter_var($imgURL,FILTER_VALIDATE_URL) == true)
                                        {   
                                            $dump = parse_url($imgURL); 
                                            $host = trim(strtolower($dump["host"]));
                                        }   
                                    }
                                    if($host != "")
                                    {
                                        $extension = pathinfo($imgURL,PATHINFO_EXTENSION);
                                        if($extension != "")
                                        {
                                            $tempBuffer ="";
                                            $extensionlength = strlen($extension);
                                            for($tempData = 0; $tempData < $extensionlength; $tempData++)
                                            {
                                                if($extension[$tempData] != "?")
                                                {   
                                                    $tempBuffer = $tempBuffer.$extension[$tempData];
                                                    continue;
                                                }
                                                else
                                                {
                                                    $extension = trim($tempBuffer);
                                                    break;
                                                }
                                            }
                                            if(in_array($extension,$Url_Extensions))
                                            {   $type = "domain";   }
                                            else if(in_array($extension,$Image_Extensions))
                                            {   $type = "image";    }
                                            else if(in_array($extension,$Document_Extensions))
                                            {   $type = "document"; }
                                            else
                                            {   $type = "unknown";  }
                                        }
                                        else
                                        {   $type = "domain";   }

                                        if($imgURL != "")
                                        {
                                            if($type == "domain" && !in_array($imgURL,$this->linkBuffer["domain"]))
                                            {   $this->linkBuffer["domain"][] = $imgURL;    }
                                            if($type == "image" && !in_array($imgURL,$this->linkBuffer["image"]))
                                            {   $this->linkBuffer["image"][] = $imgURL; }
                                            if($type == "document" && !in_array($imgURL,$this->linkBuffer["document"]))
                                            {   $this->linkBuffer["document"][] = $imgURL;  }
                                            if($type == "unknown" && !in_array($imgURL,$this->linkBuffer["unknown"]))
                                            {   $this->linkBuffer["unknown"][] = $imgURL;   }
                                        }
                                    }
                                }
                                $imgTagCountStart = 0;
                            }
                            if($imgTagCountStart == 3)
                            {
                                $imgURL = "";
                                $dotCount = 0;
                                $slashCount = 0;
                                $singleSlashCount = 0;
                                $doubleSlashCount = 0;
                                $parentDirectoryCount = 0;
                                $webPageCounter++;
                                while($webPageCounter < $webPageLength)
                                {
                                    $character = $webPageContent[$webPageCounter];
                                    if($character == "")
                                    {   
                                        $webPageCounter++;  
                                        continue;
                                    }
                                    if($character == "\"" || $character == "'")
                                    {
                                        $webPageCounter++;
                                        while($webPageCounter < $webPageLength)
                                        {
                                            $character = $webPageContent[$webPageCounter];
                                            if($character == "")
                                            {   
                                                $webPageCounter++;  
                                                continue;
                                            }
                                            if($character == "\"" || $character == "'" || $character == "#")
                                            {   
                                                $webPageCounter--;  
                                                break;  
                                            }
                                            else if($imgURL != "")
                                            {   $imgURL .= $character;  }
                                            else if($character == "." || $character == "/")
                                            {
                                                if($character == ".")
                                                {
                                                    $dotCount++;
                                                    $slashCount = 0;
                                                }
                                                else if($character == "/")
                                                {
                                                    $slashCount++;
                                                    if($dotCount == 2 && $slashCount == 1)
                                                    $parentDirectoryCount++;
                                                    else if($dotCount == 0 && $slashCount == 1)
                                                    $singleSlashCount++;
                                                    else if($dotCount == 0 && $slashCount == 2)
                                                    $doubleSlashCount++;
                                                    $dotCount = 0;
                                                }
                                            }
                                            else
                                            {   $imgURL .= $character;  }
                                            $webPageCounter++;
                                        }
                                        break;
                                    }
                                    $webPageCounter++;
                                }
                            }
                            $imgTagLengthStart = 0;
                            $imgTagLengthFinal = strlen($imgTag[$imgTagCountStart]);
                            $imgTagPointer =& $imgTag[$imgTagCountStart];
                        }
                    }
                    else
                    {   $imgTagLengthStart = 0; }
                    //-- Image Filter End --
                    $webPageCounter++;
                }
            }
            else
            {   $this->error = "Unable to proceed, permission denied";  }
        }
        else
        {   $this->error = "Please enter url";  }

        if($this->error != "")
        {   $this->linkBuffer["error"] = $this->error;  }

        return $this->linkBuffer;
    }   
}
?>

回复收藏 0 原文

莫言歌 2024-11-10 23:53:33

Web 基本上是一个有向图，因此您可以根据 url 构建一个图，然后在标记访问的节点时进行 BFS 或 DFS 遍历，这样您就不会两次访问同一页面。

回复收藏 0 原文

耶耶耶 2024-11-10 23:53:33

这是一个网络爬虫示例。可用于收集 mac 地址以进行 mac 欺骗。

#!/usr/bin/env python

import sys
import os
import urlparse
import urllib
from bs4 import BeautifulSoup

def mac_addr_str(f_data):
global fptr
global mac_list
word_array = f_data.split(" ")

    for word in word_array:
        if len(word) == 17 and ':' in word[2] and ':' in word[5] and ':' in word[8] and ':' in word[11] and ':' in word[14]:
            if word not in mac_list:
                mac_list.append(word)
                fptr.writelines(word +"\n")
                print word



url = "http://stackoverflow.com/questions/tagged/mac-address"

url_list = [url]
visited = [url]
pwd = os.getcwd();
pwd = pwd + "/internet_mac.txt";

fptr = open(pwd, "a")
mac_list = []

while len(url_list) > 0:
    try:
        htmltext = urllib.urlopen(url_list[0]).read()
    except:
        url_list[0]
    mac_addr_str(htmltext)
    soup = BeautifulSoup(htmltext)
    url_list.pop(0)
    for tag in soup.findAll('a',href=True):
        tag['href'] = urlparse.urljoin(url,tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            url_list.append(tag['href'])
            visited.append(tag['href'])

更改网址以抓取更多网站......祝你好运

This is a web crawler example. Which can be used to collect mac Addresses for mac spoofing.

#!/usr/bin/env python

import sys
import os
import urlparse
import urllib
from bs4 import BeautifulSoup

def mac_addr_str(f_data):
global fptr
global mac_list
word_array = f_data.split(" ")

    for word in word_array:
        if len(word) == 17 and ':' in word[2] and ':' in word[5] and ':' in word[8] and ':' in word[11] and ':' in word[14]:
            if word not in mac_list:
                mac_list.append(word)
                fptr.writelines(word +"\n")
                print word



url = "http://stackoverflow.com/questions/tagged/mac-address"

url_list = [url]
visited = [url]
pwd = os.getcwd();
pwd = pwd + "/internet_mac.txt";

fptr = open(pwd, "a")
mac_list = []

while len(url_list) > 0:
    try:
        htmltext = urllib.urlopen(url_list[0]).read()
    except:
        url_list[0]
    mac_addr_str(htmltext)
    soup = BeautifulSoup(htmltext)
    url_list.pop(0)
    for tag in soup.findAll('a',href=True):
        tag['href'] = urlparse.urljoin(url,tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            url_list.append(tag['href'])
            visited.append(tag['href'])

Change the url to crawl more sites......good luck

回复收藏 0 原文

~没有更多了~