Are you just looking for suggestions on how to accomplish this? The HTML Agility Pack is probably going to be your best bet for DOM parsing in general. There may be a good bit of tinkering and trial and error to maintain your screen scrape (there usually is for that sort of thing), but that library is pretty good for parsing HTML.
Technically, any XML parsing (even native LINQ to XML) should do the trick, but websites have a nasty habit of not being well-formed so you may run into small headaches here and there.
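For example, here is a minimal Html Agility Pack sketch (assuming the HtmlAgilityPack NuGet package); the URL and XPath are placeholders you would swap for your target page and elements:

```csharp
// Minimal Html Agility Pack sketch (HtmlAgilityPack NuGet package).
// The URL and XPath are placeholders for whatever page/elements you need.
using System;
using HtmlAgilityPack;

class AgilityPackExample
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/page.html");

        // XPath queries against the DOM; the parser tolerates malformed HTML.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows != null)
        {
            foreach (HtmlNode row in rows)
            {
                Console.WriteLine(row.InnerText.Trim());
            }
        }
    }
}
```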
In recent projects, I successfully used the WebRequest and related classes to download the HTML from a URL, and then the SgmlReader parser to actually get access to the structured content.
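A rough sketch of that approach, assuming the SgmlReader NuGet package (Sgml namespace); the URL is a placeholder:

```csharp
// Sketch of the WebRequest + SgmlReader approach, assuming the SgmlReader
// NuGet package (Sgml namespace). The URL is a placeholder.
using System;
using System.IO;
using System.Net;
using System.Xml.Linq;
using Sgml;

class SgmlScrapeExample
{
    static void Main()
    {
        WebRequest request = WebRequest.Create("http://example.com/page.html");

        using (WebResponse response = request.GetResponse())
        using (var htmlReader = new StreamReader(response.GetResponseStream()))
        using (var sgmlReader = new SgmlReader())
        {
            // SgmlReader exposes the (often malformed) HTML as well-formed XML.
            sgmlReader.DocType = "HTML";
            sgmlReader.CaseFolding = CaseFolding.ToLower; // normalize tag names
            sgmlReader.InputStream = htmlReader;

            XDocument doc = XDocument.Load(sgmlReader);

            // Once it is proper XML, LINQ to XML queries work as usual.
            foreach (XElement link in doc.Descendants("a"))
            {
                Console.WriteLine((string)link.Attribute("href"));
            }
        }
    }
}
```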
If the page comments and table layout code are the same whenever called, I would pull the page into a string and use a series of .IndexOf and .Substring calls to parse out the data. Use IndexOf to find the starting and ending indexes of each field, then use those indexes in Substring to grab the data. It's not pretty, but it gets the job done.
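A small sketch of that idea; the marker strings below are hypothetical and would need to match the literal HTML around the field you want:

```csharp
// Sketch of the IndexOf/Substring approach. The marker strings are
// hypothetical -- replace them with the literal HTML surrounding the field
// you want on the real page.
using System;

class SubstringScrapeExample
{
    static void Main()
    {
        string html = "<td class=\"price\">$19.99</td>";

        string startMarker = "<td class=\"price\">";
        string endMarker = "</td>";

        // Find where the field starts and ends, then cut it out.
        int start = html.IndexOf(startMarker);
        if (start >= 0)
        {
            start += startMarker.Length;
            int end = html.IndexOf(endMarker, start);
            if (end >= start)
            {
                string price = html.Substring(start, end - start);
                Console.WriteLine(price); // prints: $19.99
            }
        }
    }
}
```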
HtmlDocument can be used to process HTML documents. See the following examples:
http://weblogs.asp.net/grantbarrington/archive/2009/10/15/screen-scraping-in-c.aspx
http://www.stupidiocy.com/development/web-scraping-using-c/
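For instance, assuming HtmlDocument here refers to the Html Agility Pack's HtmlDocument class, a minimal sketch with the markup already in a string looks like this:

```csharp
// Minimal sketch, assuming HtmlDocument means HtmlAgilityPack.HtmlDocument
// and that the markup is already in a string.
using System;
using HtmlAgilityPack;

class HtmlDocumentExample
{
    static void Main()
    {
        string html = "<html><body><h1>Title</h1><p>Some text</p></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Query the parsed DOM with XPath.
        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1");
        Console.WriteLine(heading?.InnerText);
    }
}
```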
If you have the HTML stored in a string you can always use Regular Expressions with capture groups to parse the information you need.
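A minimal sketch of that approach; the pattern below is a hypothetical example for simple anchor tags, and a real page would usually need something more robust:

```csharp
// Sketch of the regex-with-capture-groups approach. The pattern below is a
// hypothetical example for simple <a href="...">text</a> links; anything more
// complicated is usually better handled by a real HTML parser.
using System;
using System.Text.RegularExpressions;

class RegexScrapeExample
{
    static void Main()
    {
        string html = "<a href=\"/one\">First</a> <a href=\"/two\">Second</a>";

        // Named capture groups pull out the href and the link text.
        var pattern = new Regex("<a href=\"(?<url>[^\"]+)\">(?<text>[^<]+)</a>",
                                RegexOptions.IgnoreCase);

        foreach (Match match in pattern.Matches(html))
        {
            Console.WriteLine($"{match.Groups["text"].Value} -> {match.Groups["url"].Value}");
        }
    }
}
```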