Price comparison service: how to make effective use of sitemap files?
Many online shops provide a sitemap file which contains their product information in the form of:
...
<url>
<loc>http://blabla.com/tbcart/pc/-DOOR-GYM-Full-Body-Exerciser-256p34168.htm</loc>
<lastmod>2010-11-26</lastmod>
<changefreq>weekly</changefreq>
</url>
...
But for an online price comparison service to work, it needs the actual product prices in addition to the URLs. Assuming that a typical sitemap for an online shop contains 20,000 URLs, how would you go about getting the actual price of each product? Is this how the sitemap should be used for getting product prices?
Performing 20,000 HTTP GET requests would very likely cause the online shop to block the crawler's IP :)
Thanks,
PS - How would this scale? Take a sitemap with 50,000 links: if one needs to reindex every Sunday, that implies sending roughly 1 request every 2 seconds for the whole day. How can one avoid getting blocked in this situation?
Comments (2)
You would have to execute a GET against every URL and then parse the HTML to pull out the price. You are correct that if you hit a site for all of its products at once they may ban you, so you need to include some logic to spread the load so that it doesn't affect the shop too much. If you want to get clever, you can also work out which products change price more frequently and then re-scan just those products' prices.
It is also worth noting that not all sites supply a sitemap, in which case you have to crawl the site and parse the HTML for product URLs as well (just like search engines do).
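To make the steps above a bit more concrete, here is a minimal Python sketch of the three pieces: reading `<loc>`/`<lastmod>` pairs out of a sitemap, pulling a price out of a product page, and working out how far apart requests must be to spread them over a time window. The function names are made up for illustration, and the price regex assumes a hypothetical `class="price"` element; real shops use different markup, so that part would need adapting per site.

```python
import re
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace prefix such as '{http://...}url' -> 'url'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return a list of (url, lastmod) pairs; lastmod is None when absent."""
    entries = []
    for node in ET.fromstring(xml_text).iter():
        if _local(node.tag) != "url":
            continue
        fields = {_local(child.tag): (child.text or "").strip() for child in node}
        if fields.get("loc"):
            entries.append((fields["loc"], fields.get("lastmod") or None))
    return entries


# ASSUMPTION: the shop renders prices as <... class="price">$19.99</...>.
# This pattern is hypothetical and will differ from shop to shop.
PRICE_RE = re.compile(r'class="price"[^>]*>\s*\$?([0-9]+(?:\.[0-9]{1,2})?)')


def extract_price(html):
    """Return the first price found as a float, or None if nothing matches."""
    match = PRICE_RE.search(html)
    return float(match.group(1)) if match else None


def request_interval(n_urls, window_seconds):
    """Seconds to wait between requests so n_urls fit evenly in the window."""
    return window_seconds / n_urls
```

For the numbers in the question, `request_interval(50000, 24 * 3600)` comes out at about 1.7 seconds, which matches the "1 request every 2 seconds" estimate.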
Do you really need to reindex the site every Sunday? There seems to be a lastmod tag set in your example, so you could crawl the whole website once as a baseline and then check for modified pages throughout the week (not just on one day). If a page has changed, you could recrawl it and then set the delay before fetching the next page on that domain to the value in robots.txt (if set) or to several seconds (5 might already be OK).
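As a rough sketch of that idea in Python: the first function keeps only the sitemap entries whose lastmod is on or after the last crawl date, and the second reads a Crawl-delay from robots.txt text using the standard library's `urllib.robotparser`, falling back to 5 seconds. The function names, the `(url, lastmod)` tuple layout, and the choice to recrawl pages that have no lastmod at all are assumptions made for this example.

```python
from datetime import date
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 5  # fallback when robots.txt sets no Crawl-delay


def pages_to_recrawl(entries, last_crawl):
    """Pick URLs whose lastmod is on or after the date of the last crawl.

    entries: iterable of (url, lastmod), lastmod being 'YYYY-MM-DD...' or None.
    Pages without a lastmod are always included, since nothing is known
    about when they last changed.
    """
    due = []
    for url, lastmod in entries:
        # Only the date part is compared; any time component is ignored.
        if lastmod is None or date.fromisoformat(lastmod[:10]) >= last_crawl:
            due.append(url)
    return due


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay for user_agent, or the default if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return delay if delay is not None else DEFAULT_DELAY_SECONDS
```

A crawler loop would then sleep for `crawl_delay_for(...)` seconds between successive requests to the same domain.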
However, this only works if the shop owner actually updates the lastmod tag when a price changes (and not only when description texts change). If lastmod is not updated, you have to take Haukman's approach and measure the average time between changes on the page: if you recrawl a page and the price has not changed, delay the next visit; if it has changed, try again a bit sooner next time.
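One simple way to approximate this back-off behaviour, sketched in Python: keep a revisit interval per page, shorten it when the price changed, and lengthen it when it did not. The halving and 1.5x factors and the 6-hour and 14-day bounds are arbitrary assumptions for illustration, not values from the discussion above.

```python
MIN_INTERVAL_HOURS = 6         # assumed floor, so a shop is never hammered
MAX_INTERVAL_HOURS = 24 * 14   # assumed ceiling, so no page goes stale forever


def next_interval(current_hours, price_changed):
    """Return the revisit interval to use after the latest crawl of a page."""
    if price_changed:
        proposed = current_hours / 2    # price moved: come back sooner
    else:
        proposed = current_hours * 1.5  # nothing changed: wait longer
    # Clamp the result into the allowed range.
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, proposed))
```

Over time, pages with volatile prices settle near the minimum interval while stable pages drift toward the maximum, which reduces the total number of requests sent to the shop.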