Price comparison service: how to use sitemap files effectively?

Posted on 2024-11-15 02:33:57

Many online shops provide a sitemap file that lists their product pages in the following form:

...
    <url>
          <loc>http://blabla.com/tbcart/pc/-DOOR-GYM-Full-Body-Exerciser-256p34168.htm</loc>
          <lastmod>2010-11-26</lastmod>
          <changefreq>weekly</changefreq>
    </url>
...

But for an online price comparison service to work, it needs the actual product prices in addition to the URLs. Assuming a typical online shop's sitemap contains 20,000 URLs, how would you go about getting the actual price of each product? Is this how a sitemap is supposed to be used for getting product prices?

Performing 20,000 HTTP GET requests would very likely cause the online shop to block the crawler's IP :)

Thanks,

PS: How would this scale? Take a sitemap with 50,000 links and assume it has to be reindexed every Sunday. That means sending roughly one request every 2 seconds for the whole day (86,400 s / 50,000 ≈ 1.7 s). How can one avoid getting blocked in that situation?
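
For reference, reading the sitemap itself is the cheap part. Below is a minimal sketch of extracting the `<loc>` and `<lastmod>` values, assuming the file uses the standard sitemaps.org namespace. The names `parse_sitemap` and `urls_to_fetch` are made up for this illustration. Skipping pages whose `lastmod` predates the last crawl is only a heuristic: `lastmod` is optional, and nothing guarantees a shop updates it when a price changes.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (url, lastmod) pairs; lastmod is a date, or None if absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:url", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue
        raw = node.findtext("sm:lastmod", default="", namespaces=NS).strip()
        # lastmod may be a plain date or a full timestamp; keep the date part.
        lastmod = date.fromisoformat(raw[:10]) if raw else None
        entries.append((loc, lastmod))
    return entries


def urls_to_fetch(entries, last_crawl):
    """Heuristic (an assumption): refetch pages with no lastmod or a recent one."""
    return [url for url, lastmod in entries
            if lastmod is None or lastmod >= last_crawl]
```

Whether this reduces the request count in practice depends entirely on how carefully the shop maintains `lastmod`.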

触ぅ动初心 2024-11-22 02:33:57

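You will have to perform a GET on every URL and then parse the HTML to extract the price. You are right that if you hit a site for all of its products at once, they may ban you, so you will need some clever logic to spread the load so that it does not impact the shop too much. Then, if you want to get fancy, you can work out which products change price more frequently and rescan those more often.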
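
A rough sketch of what that could look like, using only the standard library. Everything named here is an assumption made for illustration: the `<span class="price">` markup (every shop is different, so the extraction has to be written per site), the user-agent string, and the helper names. The pacing is the simplest possible one: spread the requests evenly over the time window.

```python
import re
import time
import urllib.request
from decimal import Decimal

# Hypothetical markup: assumes prices look like <span class="price">$12.99</span>.
PRICE_RE = re.compile(
    r'<span class="price">\s*\$?([0-9]+(?:\.[0-9]{1,2})?)\s*</span>')


def request_interval(num_urls, window_seconds):
    """Seconds between requests when spreading them evenly over a window."""
    return window_seconds / num_urls


def extract_price(html):
    """Return the first price found as a Decimal, or None if nothing matches."""
    match = PRICE_RE.search(html)
    return Decimal(match.group(1)) if match else None


def fetch(url, timeout=10):
    """GET a page, identifying the crawler with a (made-up) user agent."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "price-bot-example/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_prices(urls, interval, fetch_page=fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing `interval` seconds between requests."""
    prices = {}
    for i, url in enumerate(urls):
        if i:
            sleep(interval)
        try:
            prices[url] = extract_price(fetch_page(url))
        except OSError:  # covers urllib's URLError/HTTPError and timeouts
            prices[url] = None
    return prices


def count_changes(observed):
    """Number of times consecutive observed prices differ."""
    return sum(1 for a, b in zip(observed, observed[1:]) if a != b)


def rescan_order(history):
    """URLs ordered so that the most frequently changing prices come first."""
    return sorted(history, key=lambda u: count_changes(history[u]), reverse=True)
```

With the numbers from the question, `request_interval(50_000, 24 * 3600)` comes out at about 1.7 seconds. `rescan_order` takes a dict mapping each URL to the list of prices seen on earlier crawls, so the volatile products can be refetched more often and the stable ones less often.
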
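Also worth noting: not all sites provide a sitemap, in which case you will have to crawl the site and parse the HTML for product URLs (just like a search engine would).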
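
A minimal sketch of that fallback, again standard library only. It collects the links on one page and keeps those that stay on the same host and look like product pages. The `"/pc/"` path test is purely an assumption borrowed from the example URL in the question, and `LinkCollector` and `product_links` are hypothetical names. A real crawler would also need a queue of pages to visit and the same request pacing as for the sitemap case.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def looks_like_product(url):
    """Assumption: product pages live under a /pc/ path, as in the example URL."""
    return "/pc/" in urlparse(url).path


def product_links(html, base_url, is_product=looks_like_product):
    """Return same-host product URLs found in `html`, without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for link in collector.links:
        if urlparse(link).netloc == host and is_product(link) and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```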
