Parsing an HTML page in WinForms, C#

Posted 2024-12-28 17:31:22

I am using the Html Agility Pack to parse an HTML page. I can locate the section that holds the data I need; it is a table, and I have to parse its tr rows.
I have two questions.

  1. When I load a page in the parser, it takes around 20-30 seconds to load into memory, and there are around 4738 web pages to parse. I want to reduce that. Can I use a delegate to call the method in a loop to cut the delay, or is there a more efficient way to do this? Please guide me through it.

  2. I am getting each row as "\r\n\t\t\t\t<td style=\"width:20%;\">110001</td><td style=\"width:25%;\">New Delhi</td><td style=\"width:25%;\">Delhi</td><td style=\"width:30%;\">Baroda House</td>\r\n\t\t\t". From this I have to extract 110001, New Delhi, Delhi and Baroda House. I have a class Pincodes with the properties Pincode, Area, State and District, so I need a regex or some other way to put these values into the class.

Finally, I have to push these records to my database, where I am using Linq2Sql. Keeping all of this in mind, please suggest a solution. Any reference or link would be a great help.

My code:

  var url = @"http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode01";
  var web = new HtmlWeb();
  var doc = web.Load(url);
  //doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]");//("/html/body/div[2]/form/div/div[2]/table/tbody/tr/td/table/tbody/tr/td/div/table/tbody/tr/td/table/tbody/tr/td[2]/div/input");
  //System.Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]").Id);
  var htmlNode =
      doc.DocumentNode.SelectSingleNode(
          "//*[@id=\"ctl00_uxContentPlaceHolder_ResourceAndGuideUserControl1_ResourceAndGuideGrid_myGridView_mainGridView\"]");

Thanks in advance

Comments (1)

咋地 2025-01-04 17:31:22

It doesn't look like there is a pattern to the URLs, ids or anything else on that page; it was poorly designed. If there were a clean pattern (such as a page number in the URL for each page of results), the pages could perhaps be fetched in parallel. Since there isn't, you have to do it sequentially, because there is no reliable way (that I can see) to get the URL of the next page other than reading it from the current one.

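using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

var url = "http://eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode01";
var web = new HtmlWeb();
var results = new List<Pincode>();
while (!String.IsNullOrWhiteSpace(url))
{
    var doc = web.Load(url);

    // SelectNodes returns null when nothing matches, so guard before using LINQ.
    var rows = doc.DocumentNode
        .SelectNodes("//div[@class='Search']/div[3]//tr");
    if (rows == null)
    {
        break;
    }

    var query = rows
        .Skip(1) // skip the header row
        .Select(row => row.SelectNodes("td"))
        .Where(cells => cells != null && cells.Count >= 4)
        .Select(cells => new Pincode
        {
            // Decode HTML entities and trim the whitespace around each cell.
            PinCode = HtmlEntity.DeEntitize(cells[0].InnerText).Trim(),
            District = HtmlEntity.DeEntitize(cells[1].InnerText).Trim(),
            State = HtmlEntity.DeEntitize(cells[2].InnerText).Trim(),
            Area = HtmlEntity.DeEntitize(cells[3].InnerText).Trim(),
        });
    results.AddRange(query);

    // Follow the last footer link only if it is the "Next" link.
    var next = doc.DocumentNode
        .SelectSingleNode("//div[@class='slistFooter']//a[last()]");
    if (next != null && next.InnerText.Trim() == "Next")
    {
        // The href may be relative, so resolve it against the current page.
        var href = HtmlEntity.DeEntitize(next.GetAttributeValue("href", ""));
        url = href.Length > 0 ? new Uri(new Uri(url), href).AbsoluteUri : null;
    }
    else
    {
        url = null;
    }
}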
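
The question also asked for a regex that turns one raw row string into an object. If you only have the row's inner HTML as a string, something like the following might work. It is a minimal sketch, not a general HTML parser: it assumes every cell is a plain `<td ...>text</td>` with no nested tags, and it assumes the same column order as the loop above (pincode, district, state, area). `RowParser` and its `Parse` method are hypothetical names made up for this example, and the `Pincode` class is redeclared here only to keep the sketch self-contained.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Same shape as the Pincode class used in the loop above (assumed).
public class Pincode
{
    public string PinCode { get; set; }
    public string District { get; set; }
    public string State { get; set; }
    public string Area { get; set; }
}

// Hypothetical helper, not part of Html Agility Pack.
public static class RowParser
{
    // Captures the text between <td ...> and </td>.
    // Assumption: cells hold plain text and no nested tags.
    private static readonly Regex CellPattern = new Regex(
        @"<td\b[^>]*>(.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns null when the row has fewer than four cells.
    public static Pincode Parse(string rowHtml)
    {
        var cells = CellPattern.Matches(rowHtml)
            .Cast<Match>()
            .Select(m => WebUtility.HtmlDecode(m.Groups[1].Value).Trim())
            .ToArray();

        if (cells.Length < 4)
        {
            return null;
        }

        // Assumed column order: pincode, district, state, area.
        return new Pincode
        {
            PinCode = cells[0],
            District = cells[1],
            State = cells[2],
            Area = cells[3],
        };
    }
}
```

Calling `RowParser.Parse` on the row string from the question should give an object whose `PinCode` is "110001". Since Html Agility Pack already hands you the `td` nodes, reading `InnerText` as in the loop above is usually the more robust route; the regex is only worth it when the row arrives as a bare string.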