Using generics to do HTML scraping. Right or wrong?
My requirement is to download and scrape various HTML pages, extracting lists of objects from the markup on each page depending on what object type we are looking for on that page. For example, one page might contain an embedded list of doctors' surgeries, another might contain a list of primary trusts, and so on. I have to go through the pages one by one and end up with lists of the appropriate object types.
The way I have chosen to do this is to have a generic class called HTMLParser<T> where T : IEntity, new(). IEntity is the interface that all the object types that can be scraped will implement, though I haven't figured out yet what the interface members will be.
So you will effectively be able to say:
HTMLParser<Surgery> parser = new HTMLParser<Surgery>(url, xsdSchemaDoc);
IList<Surgery> results = parser.Parse();
Parse() will validate that the HTML string downloaded from the URL contains a block that conforms to the XSD document provided, then will somehow use this template to extract a List<Surgery>, with each Surgery object corresponding to an XML block in the HTML string.
The problems I have are:
I'm not sure how to specify the template for each object type in a nice way, other than
HTMLParser<Surgery> parser = new HTMLParser<Surgery>(new Uri("...."), Surgery.Template);
which is a bit clunky. Can anyone suggest a better way using .NET 3.0/4.0?
I'm not sure how, in a generic way, I can take the HTML string, take an XSD or XML template document, and return a generic list of constructed objects of the generic type. Can anyone suggest how to do this?
Finally, I'm not convinced generics are the right solution to this problem as it's starting to seem very convoluted. Would you agree with or condemn my choice of solution here and if not, what would you do instead?
Comments (2)
I'm not convinced that generics are the right solution, either. I implemented something very similar to this using good old inheritance, and I still think that's the right tool for the job.
Generics are useful when you want to perform the same operations on different types. Collections are a good example of where generics are very handy.
Inheritance, on the other hand, is useful when you want an object to inherit common functionality, but then extend and/or modify that functionality. Doing that with generics is messy.
My scraper base class looks something like this:
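The listing that followed this sentence is not reproduced on this page. As a stand-in, here is a minimal sketch of what an inheritance-based scraper base class could look like. The names `ScraperBase`, `BlogPost`, `Download`, `Parse` and `Scrape` are assumptions made for illustration, not the answerer's actual code; `WebClient` is used only because it is available in the .NET 3.0/4.0 timeframe the question mentions.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical result type; the answer does not say what its scrapers return.
public class BlogPost
{
    public string Title { get; set; }
    public string Url { get; set; }
}

// Hypothetical base class: the shared workflow lives here, once.
public abstract class ScraperBase
{
    // Shared step: fetch the raw HTML. Virtual so a subclass can add
    // caching, retries, or a different transport.
    protected virtual string Download(Uri url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Site-specific step: each subclass knows its own markup.
    // Public so it can also be run on HTML that was already downloaded.
    public abstract IList<BlogPost> Parse(string html);

    // Template method: download, then hand the HTML to the subclass.
    public IList<BlogPost> Scrape(Uri url)
    {
        string html = Download(url);
        return string.IsNullOrEmpty(html) ? new List<BlogPost>() : Parse(html);
    }
}
```

The point of the shape is that `Scrape` never changes, while `Parse` (and optionally `Download`) is where each derived scraper extends or modifies the behaviour.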
There's some other stuff in there for logging, error reporting, etc., but that's the gist of it.
Now, let's say I have a WordPress blog scraper:
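The derived-class listing is likewise missing here. A sketch of how such a subclass might look follows, assuming a hypothetical `ScraperBase` (repeated in trimmed form so the snippet compiles on its own) and assuming the blog theme wraps each post title in `<h2 class="entry-title"><a href="...">`. Real themes vary, so the pattern is illustrative only, and a proper HTML parser would be more robust than a regular expression.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public class BlogPost
{
    public string Title { get; set; }
    public string Url { get; set; }
}

// Trimmed copy of the hypothetical base class.
public abstract class ScraperBase
{
    protected virtual string Download(Uri url)
    {
        using (var client = new WebClient()) { return client.DownloadString(url); }
    }

    public abstract IList<BlogPost> Parse(string html);

    public IList<BlogPost> Scrape(Uri url) { return Parse(Download(url)); }
}

public class WordPressScraper : ScraperBase
{
    // Assumed markup: <h2 class="entry-title"><a href="URL">TITLE</a></h2>
    private static readonly Regex PostLink = new Regex(
        @"<h2[^>]*class=""entry-title""[^>]*>\s*<a[^>]*href=""([^""]*)""[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Only the site-specific extraction is overridden; downloading is inherited.
    public override IList<BlogPost> Parse(string html)
    {
        var posts = new List<BlogPost>();
        foreach (Match m in PostLink.Matches(html))
        {
            posts.Add(new BlogPost
            {
                Url = WebUtility.HtmlDecode(m.Groups[1].Value),
                Title = WebUtility.HtmlDecode(m.Groups[2].Value.Trim())
            });
        }
        return posts;
    }
}
```

Calling `new WordPressScraper().Scrape(someUri)` would then reuse the inherited download step and apply only the WordPress-specific extraction.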
And I can do the same thing to write a Blogspot scraper, or a custom scraper for any page, site, or class of data.
I actually tried to do something similar, but rather than using inheritance I used a scraper callback function. Something like:
You can then write:
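Neither the callback class nor its usage survives on this page. The sketch below shows one way the callback idea could be expressed, assuming a hypothetical `CallbackScraper` that takes the page-specific parsing logic as a `Func<string, IList<string>>` delegate. The optional download delegate and the `Example` helper are also assumptions, added so the snippet can be tried without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical scraper that is configured with a callback instead of subclassed.
public class CallbackScraper
{
    private readonly Func<string, IList<string>> _parse;
    private readonly Func<Uri, string> _download;

    public CallbackScraper(Func<string, IList<string>> parse,
                           Func<Uri, string> download = null)
    {
        if (parse == null) throw new ArgumentNullException("parse");
        _parse = parse;
        // Fall back to a real HTTP download when no delegate is supplied.
        if (download == null) download = DownloadWithWebClient;
        _download = download;
    }

    private static string DownloadWithWebClient(Uri url)
    {
        using (var client = new WebClient()) { return client.DownloadString(url); }
    }

    // Same workflow as the inheritance version, but the parse step is a delegate.
    public IList<string> Scrape(Uri url)
    {
        return _parse(_download(url));
    }
}

public static class Example
{
    // Page-specific logic is just a function; no new class per scraper.
    public static IList<string> ExtractHeadings(string html)
    {
        var headings = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<h1[^>]*>(.*?)</h1>",
                     RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            headings.Add(m.Groups[1].Value.Trim());
        }
        return headings;
    }

    // Usage: build a scraper from the callback and run it.
    public static IList<string> Run(Uri url)
    {
        var scraper = new CallbackScraper(ExtractHeadings);
        return scraper.Scrape(url);
    }
}
```

The limitation mentioned in the answer shows up quickly in this shape: as soon as a scraper needs its own state, configuration, or extra steps, a single delegate stops being enough and a class per scraper becomes the more natural fit.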
That works reasonably well, and eliminates having to create a new class for every scraper type. However, I found that in my situation it was too limiting. I ended up needing a different class for almost every type of scraper, so I just went ahead and used inheritance.
I would rather focus on your parser/verifier classes, as designing them properly will be crucial to the ease of future usage. I think it's more important to decide how the mechanism will determine which parser/verifier to use based on the input.
Also, what happens when you're told you need to parse yet another type of website, say for Invoice entities? Will you be able to extend your mechanism in two easy steps in order to handle such a requirement?
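One way to picture the selection mechanism this answer asks about is a small registry that maps some property of the input to a parser. The sketch below keys on the URL host, which is purely an assumption; `IPageParser`, `ParserRegistry`, `Invoice` and `InvoiceParser` are hypothetical names that do not come from the question or the answers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contract every page parser implements.
public interface IPageParser
{
    IList<object> Parse(string html);
}

// Hypothetical registry: decides which parser handles a given URL.
public class ParserRegistry
{
    private readonly Dictionary<string, IPageParser> _byHost =
        new Dictionary<string, IPageParser>(StringComparer.OrdinalIgnoreCase);

    public void Register(string host, IPageParser parser)
    {
        if (parser == null) throw new ArgumentNullException("parser");
        _byHost[host] = parser;
    }

    public IPageParser Resolve(Uri url)
    {
        IPageParser parser;
        if (!_byHost.TryGetValue(url.Host, out parser))
        {
            throw new NotSupportedException("No parser registered for " + url.Host);
        }
        return parser;
    }
}

// Step 1 of supporting a new entity type: write its parser.
public class Invoice
{
    public string Number { get; set; }
}

public class InvoiceParser : IPageParser
{
    public IList<object> Parse(string html)
    {
        // Placeholder: real extraction logic for invoices would go here.
        return new List<object>();
    }
}
```

Under these assumptions, step 2 is a single line such as `registry.Register("invoices.example.com", new InvoiceParser());`, and nothing else in the pipeline has to change.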