How do I scrape text from an HTML page using C#?

Posted 2024-12-03 16:42:20

I have a web page that, when navigated to, returns only a simple text value, such as the number 100. I need to grab that value from the page so I can use it in my application. The application is a simple Windows Forms app with a WebBrowser control on it.

I have tried numerous things, but none of them grabs the text; it is as if it doesn't exist. Yet if I right-click and view source, it's there.

This can't be that difficult... It's just some text.

Just to clarify: the document contains no HTML, only a number. When I use WebClient or WebRequest, the value is not returned.

    private void RegisterWindow_Load(object sender, EventArgs e)
    {
        // Subscribe before navigating so the completion event cannot be missed.
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
        webBrowser1.Navigate("MYURL");
    }

    void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Check and see if we have navigated to the final page.
        string registeredUrl = "MYURL";
        string currentPage = webBrowser1.Url.ToString();
        string response = string.Empty;

        if (currentPage == registeredUrl)
        {
            // Request the same page again with WebClient and show what comes back.
            response = GetWebRequest(currentPage);
            MessageBox.Show(response);
        }
    }

    /// <summary>
    /// Sends a web request and returns the response body as a string.
    /// The response can be a URL, a simple text value, or HTML.
    /// </summary>
    /// <param name="url">The address to download.</param>
    /// <returns>The response body.</returns>
    public string GetWebRequest(string url)
    {
        // Dispose the client once the download has finished.
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

峩卟喜欢 2024-12-10 16:42:20

If the document contains only a number, with no HTML, this should work:

public string GetWebRequest()
{
    return webBrowser1.Document.Body.InnerText;
}

淤浪 2024-12-10 16:42:20

You should be able to do something as straightforward as:

var client = new WebClient();
var content = client.DownloadString("<YOUR URL>");
var number = Int32.Parse(content);

MSDN documentation for DownloadString(string).
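
The parse step above throws if the response is empty or contains anything other than a number, which is exactly the situation the question describes. A minimal sketch of a more defensive parse, assuming the page really does return a bare integer; the `NumberResponse` class and `TryReadInt` method are hypothetical names chosen for illustration:

```csharp
using System;
using System.Globalization;

public static class NumberResponse
{
    // Hypothetical helper: returns true and sets 'value' only when the
    // response body is a plain integer (surrounding whitespace is ignored).
    public static bool TryReadInt(string body, out int value)
    {
        value = 0;

        // An empty or whitespace-only body means the server sent no number.
        if (string.IsNullOrWhiteSpace(body))
        {
            return false;
        }

        // Parse with the invariant culture so the result does not depend
        // on the regional settings of the machine.
        return int.TryParse(body.Trim(), NumberStyles.Integer,
                            CultureInfo.InvariantCulture, out value);
    }
}
```

It could be called on the string returned by `DownloadString`; a `false` result suggests the server sent something other than the expected number, which is worth inspecting before going further.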

み青杉依旧 2024-12-10 16:42:20

I wrote a blog post on Web scraping in .NET several years ago. You could try the techniques there. Hopefully they're not obsolete.

For example:

// url, _UserAgent and cookies are assumed to be defined elsewhere.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.UserAgent = _UserAgent;     // send the same User-Agent a browser would
request.CookieContainer = cookies;  // optional

string html;
using (WebResponse response = request.GetResponse())
{
  using (Stream responseStream = response.GetResponseStream())
  {
    using (StreamReader reader = new StreamReader(responseStream))
    {
      // Read the whole response body as text.
      html = reader.ReadToEnd();
    }
  }
}

Remember that your browser is sending a User-Agent header, may be sending cookies, may be going through a configured proxy server, etc. Particularly for secured or intranet sites, a simple WebClient call may be insufficient. You may need to do some checking with Fiddler as @SLaks suggested.
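
As a rough sketch of the point above, the request setup can be separated from the download so that the browser-like settings are easy to inspect and adjust. The `BrowserLikeFetcher` class, its method names, and the `useWindowsCredentials` parameter are hypothetical; which settings a given site actually needs is an assumption to verify with a tool such as Fiddler.

```csharp
using System;
using System.IO;
using System.Net;

public static class BrowserLikeFetcher
{
    // Builds a request carrying the headers and cookies a browser would send.
    // Nothing goes over the network until ReadBody is called.
    public static HttpWebRequest CreateRequest(
        string url,
        string userAgent,
        CookieContainer cookies,
        bool useWindowsCredentials = false)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;
        request.CookieContainer = cookies;

        // Intranet sites may expect the caller's Windows identity.
        request.UseDefaultCredentials = useWindowsCredentials;
        return request;
    }

    // Sends the request and returns the response body as text.
    public static string ReadBody(HttpWebRequest request)
    {
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A caller would pass the same User-Agent string the browser reports, for example `BrowserLikeFetcher.ReadBody(BrowserLikeFetcher.CreateRequest(url, userAgent, new CookieContainer()))`, and compare the result with what the WebBrowser control displays.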
