Need help building a "bot" that extracts data from HTTP requests
I am building a web site in ASP.NET and C#. One of its components involves logging in, on behalf of the user, to a website where the user has an account (for example, a cellular phone company), taking information from that site, and storing it in our database.
I think this action is called "scraping".
Are there any existing products that do this which I can integrate with my software?
I don't need standalone software that does it; I need some sort of SDK that I can integrate with my C# code.
Thanks,
Koby
3 Answers
Use the HtmlAgilityPack to parse the HTML that you get from a web request once you've logged in.
See here for logging in: Login to website, via C#
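A minimal sketch of that approach, assuming a plain form-based login: the URLs, the "username"/"password" field names, and the XPath selector are placeholders for whatever the target site actually uses.

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    class ScrapeExample
    {
        static async Task Main()
        {
            var cookies = new CookieContainer();                 // keeps the session cookie set by the login
            using var handler = new HttpClientHandler { CookieContainer = cookies };
            using var client = new HttpClient(handler);

            // 1. Log in on behalf of the user (form field names are assumptions).
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["username"] = "user@example.com",
                ["password"] = "secret"
            });
            var loginResponse = await client.PostAsync("https://example.com/login", form);
            loginResponse.EnsureSuccessStatusCode();

            // 2. Request the page that holds the data and parse it with HtmlAgilityPack.
            string html = await client.GetStringAsync("https://example.com/account/usage");
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // 3. Pull out the value you care about via XPath (selector is an assumption).
            var cell = doc.DocumentNode.SelectSingleNode("//td[@class='minutes-used']");
            Console.WriteLine(cell?.InnerText.Trim() ?? "not found");
        }
    }

The CookieContainer is what carries the authenticated session from the login POST to the later GET; without it each request would be anonymous.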
I haven't found any product that does this well so far.
One way to handle this (a minimal sketch follows below) is to
- make the requests yourself
- use http://htmlagilitypack.codeplex.com/ to extract the important information from the downloaded HTML
- save the extracted information yourself
The thing is, depending on the context, there are so many things to tune/configure that you would need a very large product, and it still wouldn't match the performance/accuracy of a custom solution:
a) multithreading control
b) extraction rules
c) persistence control
d) web spidering (i.e. how the next link to parse is chosen)
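A sketch of that "request, extract, save" loop, with the extraction rules kept as configurable XPath expressions. The URLs, the rule set, and the SaveRecord stand-in are assumptions for illustration, not a real product API.

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    class ExtractionPipeline
    {
        // Extraction rules: field name -> XPath. These would differ for every target site.
        static readonly Dictionary<string, string> Rules = new()
        {
            ["plan"]    = "//span[@id='plan-name']",
            ["balance"] = "//span[@id='current-balance']"
        };

        static async Task Main()
        {
            using var client = new HttpClient();
            string html = await client.GetStringAsync("https://example.com/account"); // 1. do the request yourself

            var doc = new HtmlDocument();                                             // 2. extract with HtmlAgilityPack
            doc.LoadHtml(html);

            var record = new Dictionary<string, string>();
            foreach (var (field, xpath) in Rules)
            {
                var node = doc.DocumentNode.SelectSingleNode(xpath);
                record[field] = node?.InnerText.Trim() ?? string.Empty;
            }

            SaveRecord(record);                                                       // 3. persist it yourself
        }

        // Placeholder for whatever persistence you use (e.g. an INSERT into your database).
        static void SaveRecord(Dictionary<string, string> record)
        {
            foreach (var (field, value) in record)
                Console.WriteLine($"{field}: {value}");
        }
    }

Keeping the rules as data rather than code is one way to cope with point b) above: when a site changes its layout, only the XPath strings need updating.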
Check the Web Scraping Wikipedia entry.
However, I would say that since what we need to acquire via web scraping is application-specific, most of the time it may be more efficient to scrape whatever you need directly from the web response stream.
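A minimal sketch of pulling one value straight from the response stream without a full HTML parser; the URL and the regex pattern are assumptions for illustration.

    using System;
    using System.IO;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class StreamScrape
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            using var stream = await client.GetStreamAsync("https://example.com/account");
            using var reader = new StreamReader(stream);

            // Scan line by line for the one fragment we care about instead of parsing the whole page.
            var pattern = new Regex(@"Remaining minutes:\s*(\d+)");
            string? line;
            while ((line = await reader.ReadLineAsync()) != null)
            {
                var match = pattern.Match(line);
                if (match.Success)
                {
                    Console.WriteLine($"Minutes left: {match.Groups[1].Value}");
                    break;
                }
            }
        }
    }

This works well when you only need one or two known fragments; for anything more structured, an HTML parser such as HtmlAgilityPack is usually more robust.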