Advice on avoiding scraping duplicate products

Posted 2024-12-05 10:08:14

I have written a very basic crawler which scrapes product information from websites to be put into a database.

It all works well except that some sites seem to have distinct URLs for different parts of the same page. For example, a product URL might be:

http://www.example.com/product?id=52

It might then have another URL for different sections such as comments, etc.:

http://www.example.com/product?id=52&revpage=1

My crawler sees each of these as a distinct URL. I've found some sites where one product has hundreds of distinct URLs. I've already added logic to ignore anything after a hash in the URL to avoid anchors, but does anyone have suggestions for avoiding this problem? There could be a simple solution I'm not seeing.

At the moment it slows down the crawl/scrape process: a site might have only 100 products, yet thousands of URLs get added.

I thought about ignoring the query string, or even certain parts of it, but the product ID is usually located in the query string, so I couldn't figure out a way to do this without writing an exception for each site's URL structure.
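
To make that idea concrete, here is a minimal sketch of what "ignoring certain parts of the query string" could look like, assuming you can list which parameters actually identify the product on a given site. The helper name canonicalizeUrl and the ['id'] whitelist below are only illustrative, not something I have working:

// Canonicalize a URL before deduplication: drop the fragment and keep only a
// whitelist of query parameters. The whitelist still has to be chosen per site.
function canonicalizeUrl($url, array $keepParams = array('id')) {
    $parts = parse_url($url);
    $query = '';
    if (!empty($parts['query'])) {
        parse_str($parts['query'], $params);
        // keep only the parameters that identify the product
        $params = array_intersect_key($params, array_flip($keepParams));
        if (!empty($params)) {
            $query = '?' . http_build_query($params);
        }
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    return $parts['scheme'] . '://' . $parts['host'] . $path . $query;
}

// Both of these collapse to http://www.example.com/product?id=52
// canonicalizeUrl('http://www.example.com/product?id=52');
// canonicalizeUrl('http://www.example.com/product?id=52&revpage=1');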

Comments (2)

甜宝宝 2024-12-12 10:08:14

To elaborate on my comment...

You could include the following code

// $producturl is the URL where you first found a product to scrape
// $nexturl    is the next URL you plan to crawl
if (strpos($nexturl, $producturl) === false) {
    // $nexturl is not part of a product page you already scraped, so crawl it
    crawl($nexturl); // pseudocode: your crawl routine
}
// loop back to the next URL...

I am guessing you are crawling in sequence... meaning you find a page and crawl all the links from that page... then you go back one level and repeat... If you are not crawling in sequence, you could store all the pages where you found a product and use that to check whether the new page you plan to crawl starts with a URL you have already crawled. If it does, you don't crawl the new page.
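
A rough sketch of that "store the pages where you found a product" variant might look like this; the array $productUrls and the helper alreadyCovered are just illustrative names:

$productUrls = array(); // product page URLs you have already scraped

// true if $nexturl starts with a product URL that was already scraped
function alreadyCovered($nexturl, array $productUrls) {
    foreach ($productUrls as $productUrl) {
        if (strpos($nexturl, $productUrl) === 0) {
            return true;
        }
    }
    return false;
}

// in the crawl loop:
//   if (!alreadyCovered($nexturl, $productUrls)) { crawl($nexturl); }
// and whenever a page turns out to be a product page:
//   $productUrls[] = $nexturl;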

I hope this helps. Good luck!

温柔少女心 2024-12-12 10:08:14

You could use a database and set a unique constraint on the ID or the name, so that if your crawler tries to add the same data again an exception is raised.
The simplest unique constraint would be a primary key.
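
As a minimal sketch, assuming MySQL through PDO and a hypothetical products table (table and column names are only illustrative):

$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// the primary key doubles as the unique constraint
$pdo->exec('CREATE TABLE IF NOT EXISTS products (
    product_id INT NOT NULL,
    name VARCHAR(255),
    PRIMARY KEY (product_id)
)');

try {
    $stmt = $pdo->prepare('INSERT INTO products (product_id, name) VALUES (?, ?)');
    $stmt->execute(array(52, 'Example product'));
} catch (PDOException $e) {
    // duplicate key: this product was scraped before, so skip it
}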

Edit for url param solution:

If you have problems fetching the right parameters from your URL, maybe a snippet from the Facebook API could help.

protected function getCurrentUrl($noQuerys = false) {
  $protocol = isset($_SERVER['HTTPS']) && $_SERVER['HTTPS'] == 'on'
    ? 'https://'
    : 'http://';
  $currentUrl = $protocol . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
  $parts = parse_url($currentUrl); // http://de.php.net/manual/en/function.parse-url.php

  // drop known fb params
  $query = '';
  if (!empty($parts['query'])) {
    $params = array();
    parse_str($parts['query'], $params);
    foreach(self::$DROP_QUERY_PARAMS as $key) { // self::$DROP_QUERY_PARAMS is a list of params you dont want to have in your url
      unset($params[$key]);
    }
    if (!empty($params)) {
      $query = '?' . http_build_query($params, null, '&');
    }
  }

  // use port if non default
  $port =
    isset($parts['port']) &&
    (($protocol === 'http://' && $parts['port'] !== 80) ||
     ($protocol === 'https://' && $parts['port'] !== 443))
    ? ':' . $parts['port'] : '';


  // rebuild
  if ($noQuerys) {
      // return URL without parameters aka querys
      return $protocol . $parts['host'] . $port . $parts['path'];
  } else {
      // return full URL
      return $protocol . $parts['host'] . $port . $parts['path'] . $query;
  }
}