Using generics to do HTML scraping: right or wrong?

Posted 2024-12-23 14:44:51

My requirement is to download and scrape various HTML pages, extracting lists of objects from the code on each page depending on what object type we are looking for on that page. For example, one page might contain an embedded list of doctors' surgeries, another might contain a list of primary trusts, etc. I have to go through the pages one by one and end up with lists of the appropriate object types.

The way I have chosen to do this is to have a generic class called HTMLParser<T> where T : IEntity, new()

IEntity is the interface that all the object types that can be scraped will implement, though I haven't figured out yet what the interface members will be.

So you will effectively be able to say

HTMLParser<Surgery> parser = new HTMLParser<Surgery>(URL, XSD SCHEMA DOC);
IList<Surgery> results = parser.Parse();

Parse() will validate that the HTML string downloaded from the URL contains a block that conforms to the XSD document provided, then will somehow use this template to extract a List<Surgery> of Surgery objects, each one corresponding to an XML block in the HTML string.

The problems I have are:

1. I'm not sure how to specify the template for each object type in a nice way, other than HTMLParser<Surgery> parser = new HTMLParser<Surgery>(new URI("...."), Surgery.Template); which is a bit clunky. Can anyone suggest a better way using .NET 3.0/4.0?
2. I'm not sure how, in a generic way, I can take the HTML string and an XSD or XML template document and return a list of constructed objects of the generic type. Can anyone suggest how to do this?
3. Finally, I'm not convinced generics are the right solution to this problem, as it's starting to seem very convoluted. Would you agree with or condemn my choice of solution here, and if you disagree, what would you do instead?

从﹋此江山别 2024-12-30 14:44:51

I'm not convinced that generics are the right solution, either. I implemented something very similar to this using good old inheritance, and I still think that's the right tool for the job.

Generics are useful when you want to perform the same operations on different types. Collections are a good example of where generics are very handy.

Inheritance, on the other hand, is useful when you want an object to inherit common functionality, but then extend and/or modify that functionality. Doing that with generics is messy.

My scraper base class looks something like this:

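using System.Collections.Generic;
using System.Net;

// Must be declared abstract because it contains an abstract method.
public abstract class ScraperBase
{
    // Common methods for making web requests, etc.

    // When you want to download and scrape a page, you call this:
    public List<string> DownloadAndScrape(string url)
    {
        // Make the request and download the page.
        // (WebClient is shown here as one possible way to do it.)
        string pageText = new WebClient().DownloadString(url);
        // Then call Scrape ...
        return Scrape(pageText);
    }

    // And an abstract Scrape method that returns a List<string>.
    // Inheritors implement this method.
    public abstract List<string> Scrape(string pageText);
}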

There's some other stuff in there for logging, error reporting, etc., but that's the gist of it.

Now, let's say I have a WordPress blog scraper:

public class WordpressBlogScraper : ScraperBase
{
    // Just implement the Scrape method.
    public override List<string> Scrape(string pageText)
    {
        // Do WordPress-specific parsing and return data.
        // Placeholder so the class compiles; real parsing goes here.
        return new List<string>();
    }
}

And I can do the same thing to write a Blogspot scraper, or a custom scraper for any page, site, or class of data.
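As a rough illustration of what one concrete inheritor could look like, here is a minimal, self-contained sketch. The `HeadingScraper` class, the protected `Download` method, and the choice of extracting `<h2>` text with a regular expression are all assumptions made for this example and are not part of the scraper described above. A regular expression is fragile on real-world HTML, so a proper HTML parser (for instance HtmlAgilityPack) would likely be a better fit in practice.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public abstract class ScraperBase
{
    // Shared workflow: download the page, then hand the text to the inheritor.
    public List<string> DownloadAndScrape(string url)
    {
        string pageText = Download(url);
        return Scrape(pageText);
    }

    // Assumption: a plain WebClient download; virtual so it can be replaced.
    protected virtual string Download(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Inheritors supply the page-specific parsing.
    public abstract List<string> Scrape(string pageText);
}

// Hypothetical inheritor: collects the inner text of every <h2> element.
public class HeadingScraper : ScraperBase
{
    private static readonly Regex Heading = new Regex(
        @"<h2[^>]*>(.*?)</h2>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public override List<string> Scrape(string pageText)
    {
        var results = new List<string>();
        foreach (Match m in Heading.Matches(pageText))
        {
            results.Add(m.Groups[1].Value.Trim());
        }
        return results;
    }
}
```

Because `Scrape` takes the page text directly, the parsing step can be exercised on a literal string without touching the network, e.g. `new HeadingScraper().Scrape("<h2>First</h2><p>x</p><h2> Second </h2>")`.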

I actually tried to do something similar, but rather than using inheritance I used a scraper callback function. Something like:

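public delegate List<string> PageScraperDelegate(string pageText);

public class PageScraper
{
    public List<string> DownloadAndScrape(string url, PageScraperDelegate callback)
    {
        // Download the data to pageText
        // (WebClient from System.Net is shown as one possible way to do it).
        string pageText = new System.Net.WebClient().DownloadString(url);
        return callback(pageText);
    }
}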

You can then write:

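var myScraper = new PageScraper();
List<string> results = myScraper.DownloadAndScrape("http://example.com/index.html", ScrapeExample);

private List<string> ScrapeExample(string pageText)
{
    // Do the scraping here and return a List<string>.
    // Placeholder so the method compiles; real parsing goes here.
    return new List<string>();
}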

That works reasonably well, and eliminates having to create a new class for every scraper type. However, I found that in my situation it was too limiting. I ended up needing a different class for almost every type of scraper, so I just went ahead and used inheritance.
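For comparison, the callback variant can also be written without a custom delegate type, using the built-in `Func<string, List<string>>` and a lambda. This is only a sketch under assumptions: the injected `download` function and the `Demo` class are hypothetical additions so that the example runs without a network connection, and they are not part of the code described above.

```csharp
using System;
using System.Collections.Generic;

public class PageScraper
{
    // Hypothetical: the download step is injected so it can be stubbed.
    private readonly Func<string, string> download;

    public PageScraper(Func<string, string> download)
    {
        this.download = download;
    }

    // The callback receives the page text and returns the scraped items.
    public List<string> DownloadAndScrape(string url, Func<string, List<string>> callback)
    {
        string pageText = download(url);
        return callback(pageText);
    }
}

public static class Demo
{
    public static void Main()
    {
        // Stub downloader that returns fixed text instead of fetching the URL.
        var scraper = new PageScraper(url => "alpha\nbeta");

        // The scraping logic is passed inline as a lambda: one item per line.
        List<string> items = scraper.DownloadAndScrape(
            "http://example.com/index.html",
            text => new List<string>(text.Split('\n')));

        foreach (string item in items)
        {
            Console.WriteLine(item);
        }
    }
}
```

This keeps the "no new class per scraper" benefit, but it runs into the same limitation mentioned above once a scraper needs its own state or helper methods.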
