Writing a C# program that scans e-commerce websites and extracts product images, prices, and descriptions
I'm developing an e-commerce search engine that lets you search for products across many e-commerce websites.
How should I approach this?
I need an application that can scan websites, parse their HTML, and determine which images on a site are product images, which text is a product description, and which is a product price.
I'd be happy to hear any ideas or examples.
Thanks in advance.
Edit:
My question is not how to get the HTML from the websites (which is called screen scraping), but rather how to parse that information and work out which parts of the HTML contain the actual data I am looking for and which do not.
Comments (2)
You may find this thread helpful in your quest; I outlined the basic steps there. Here's the link to all questions tagged "screen-scraping" on SO. There is also plenty of material on the web: search Google.
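On the edit in the question (working out which part of the HTML holds the product data), one possible starting point is to look for structured data before attempting any layout heuristics. The sketch below is only an illustration, not something taken from the linked thread. It assumes the target pages embed schema.org `Product` markup as JSON-LD inside `<script type="application/ld+json">` tags, which some shops do and others do not. The `Product` record, the `ProductExtractor` class, and the sample HTML are hypothetical names made up for this example, and the regex is a shortcut; a real crawler would more likely use a proper HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical container for the extracted fields (names chosen for this sketch).
public record Product(string Name, string Image, string Description, string Price);

public static class ProductExtractor
{
    // Finds <script type="application/ld+json"> ... </script> blocks.
    // Assumption: a regex is good enough for a sketch; it is not a full HTML parser.
    private static readonly Regex JsonLdBlock = new Regex(
        @"<script[^>]*type\s*=\s*[""']application/ld\+json[""'][^>]*>(.*?)</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one Product per JSON-LD object whose "@type" is exactly "Product".
    public static List<Product> Extract(string html)
    {
        var products = new List<Product>();
        foreach (Match m in JsonLdBlock.Matches(html))
        {
            try
            {
                using JsonDocument doc = JsonDocument.Parse(m.Groups[1].Value);
                JsonElement root = doc.RootElement;
                // Only top-level objects are handled; arrays and @graph are ignored here.
                if (root.ValueKind != JsonValueKind.Object) continue;
                if (Text(root, "@type") != "Product") continue;

                // The price is read from a single "offers" object when present.
                string price = "";
                if (root.TryGetProperty("offers", out JsonElement offers)
                    && offers.ValueKind == JsonValueKind.Object)
                {
                    price = Text(offers, "price");
                }

                products.Add(new Product(
                    Text(root, "name"),
                    Text(root, "image"),        // empty if "image" is an array or object
                    Text(root, "description"),
                    price));
            }
            catch (JsonException)
            {
                // Malformed JSON-LD block: skip it and keep scanning.
            }
        }
        return products;
    }

    // Reads a string or number property as text; returns "" for anything else.
    private static string Text(JsonElement obj, string name)
    {
        if (!obj.TryGetProperty(name, out JsonElement v)) return "";
        return v.ValueKind switch
        {
            JsonValueKind.String => v.GetString() ?? "",
            JsonValueKind.Number => v.GetRawText(),
            _ => ""
        };
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample page for demonstration only.
        string html =
            "<html><head><script type=\"application/ld+json\">" +
            "{\"@type\":\"Product\",\"name\":\"Mug\"," +
            "\"image\":\"https://example.com/mug.jpg\"," +
            "\"description\":\"A ceramic mug\"," +
            "\"offers\":{\"@type\":\"Offer\",\"price\":\"9.99\"}}" +
            "</script></head><body></body></html>";

        foreach (Product p in ProductExtractor.Extract(html))
            Console.WriteLine($"{p.Name} | {p.Image} | {p.Description} | {p.Price}");
    }
}
```

Pages without such markup would need per-site rules or heuristics (for example, text near a currency symbol), which this sketch does not attempt.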
Most of the sites you'd be scraping (more correctly, web scraping) have partner APIs for "reseller"-type deals. If you circumvent those with screen scraping, you will quickly find your IP blocked by their traffic servers, and you could end up in legal trouble.
This is ethically dubious at best.