C# - Problems scraping with a WebBrowser control

Posted on 2025-02-02 07:54:06

I have been trying to make a web scraper tool that uses the Bing search engine to collect all the Pastebin URLs.

I managed to do that by using a WebBrowser control, letting the JavaScript run, and then scraping the page source.

string attempt = "";

^ I have two problems. The first is that if I don't call MessageBox.Show(this.attempt), the variable is empty for some reason. The other is that right now I only get 9 links, and the other pages are never downloaded the way they should be. I think it's all because of the MessageBox.Show(this.attempt) thing.

I know my code is not the best and there are probably much better ways to do this, but I would like help understanding what is going on here.

Thank you very much.

Here is my code:

private void Scan(Label pages)
{
    string regex = @"https:\/\/pastebin.com\/[a-zA-Z0-9]+";
    for (int i = 1; i <= Config.Amount_Of_Pages; i++)
    {
        Parse(i);
        MatchCollection matches = Regex.Matches(this.attempt, regex);
        MessageBox.Show(this.attempt);
        foreach (Match match in matches)
        {
            Config.List_Of_Urls.Add(match.Value.ToString());
            Config.List_Of_Urls = Config.List_Of_Urls.Distinct().ToList();
        }

        Config.Amount_Of_Pages_Scanned++;
        pages.Invoke(new MethodInvoker(delegate { pages.Text = Config.Amount_Of_Pages_Scanned.ToString(); }));
        Files.Write_Urls(Config.List_Of_Urls);
    }

    MessageBox.Show("Done");
}

private void Parse(int i)
{
    WebBrowser wb = new WebBrowser();
    wb.DocumentCompleted += Wb_DocumentCompleted;
    wb.ScriptErrorsSuppressed = true;
    wb.Navigate("https://www.bing.com/search?q=site%3apastebin.com++email%3apassword&first=" + i);
}

private void Wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var wb = (WebBrowser)sender;

    var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;

    this.attempt = html.ToString();
    /* ... */
}
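
For what it's worth, a likely reading of the first symptom (my assumption, not something the poster confirmed): WebBrowser.Navigate returns immediately, and DocumentCompleted only fires once the thread pumps Windows messages, so Regex.Matches usually runs before this.attempt has been set; MessageBox.Show runs a modal message loop while it is open, which would explain why the variable is only filled in when that line is present. A minimal sketch of waiting explicitly instead, assuming Scan runs on an STA thread that is allowed to pump messages (the helper name NavigateAndGetHtml is mine):

// Sketch only: navigate and pump the message loop until DocumentCompleted fires,
// instead of relying on MessageBox.Show to pump messages as a side effect.
private string NavigateAndGetHtml(string url)
{
    using (var wb = new WebBrowser())
    {
        bool done = false;
        wb.ScriptErrorsSuppressed = true;
        wb.DocumentCompleted += (s, e) => done = true;
        wb.Navigate(url);

        // Process pending Windows messages so navigation can progress; give up after ~15 s.
        var deadline = DateTime.UtcNow.AddSeconds(15);
        while (!done && DateTime.UtcNow < deadline)
        {
            Application.DoEvents();
            Thread.Sleep(50);
        }

        return wb.Document?.GetElementsByTagName("HTML")[0].OuterHtml ?? string.Empty;
    }
}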



Comments (1)

隐诗 2025-02-09 07:54:06
  1. I prefer to use Selenium and suggest it for you.
  2. If you want to get distinct URLs, you should use a HashSet instead of a List.
  3. You should add the optional part (www\.)? to the regex.
  4. To handle a retry policy, I prefer to use Polly.

The resulting code is:

// Requires the Polly and Selenium.WebDriver NuGet packages (and a matching ChromeDriver).
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using Polly;
using Polly.Retry;

RetryPolicy retryPolicy = Policy.Handle<Exception>()
    .WaitAndRetry(new[]
    {
        TimeSpan.FromSeconds(5),
        TimeSpan.FromSeconds(10),
        TimeSpan.FromSeconds(30)
    });

string regex = @"https:\/\/(www\.)?pastebin.com\/[a-zA-Z0-9]+";
HashSet<string> sites = new HashSet<string>();

retryPolicy.Execute(() =>
{
    using (IWebDriver driver = new ChromeDriver())
    {
        driver.Navigate().GoToUrl("https://www.bing.com/search?q=site%3apastebin.com++email%3apassword&first=1");
        // Wait until the page has downloaded and rendered in the browser.
        Thread.Sleep(1000);
        foreach (Match match in Regex.Matches(driver.PageSource, regex))
        {
            sites.Add(match.Value);
        }
    }
});
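
As a follow-up sketch of my own (not part of the answer above): that snippet only fetches the first result page and sleeps for a fixed second. One way to cover several pages is to loop over Bing's first offset and replace the sleep with a WebDriverWait (namespace OpenQA.Selenium.Support.UI), reusing the regex and sites variables from above and the poster's Config.Amount_Of_Pages. The assumptions here are that first advances by 10 results per page and that organic results are li.b_algo elements; both are worth verifying against the live page.

// Sketch only: page through the results and wait for them to render,
// instead of sleeping a fixed amount. The "first" offsets and the CSS
// selector below are assumptions about Bing's result markup.
using (IWebDriver driver = new ChromeDriver())
{
    for (int page = 0; page < Config.Amount_Of_Pages; page++)
    {
        int first = page * 10 + 1; // 1, 11, 21, ...
        driver.Navigate().GoToUrl(
            "https://www.bing.com/search?q=site%3apastebin.com++email%3apassword&first=" + first);

        // Wait up to 10 seconds for at least one organic result to appear.
        new WebDriverWait(driver, TimeSpan.FromSeconds(10))
            .Until(d => d.FindElements(By.CssSelector("li.b_algo")).Count > 0);

        foreach (Match match in Regex.Matches(driver.PageSource, regex))
        {
            sites.Add(match.Value);
        }
    }
}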