How do I scrape text from an HTML page using C#?
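I have a web page that, when navigated to, returns only a simple text value, such as the number 100. I need to grab that value from the page so I can use it in my application. The application is a simple Windows Forms app with a WebBrowser control on it.
I have tried numerous things, but nothing grabs the text; it is as if it doesn't exist. Yet if I right-click and view source, it's there.
This can't be that difficult... it's just some text.
Just to clarify: the document contains no HTML, only a number. When I use WebClient or WebRequest, the value is not returned.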
// Requires: using System; using System.Net; using System.Windows.Forms;
private void RegisterWindow_Load(object sender, EventArgs e)
{
    // Subscribe before navigating so the completion event cannot be missed.
    webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
    webBrowser1.Navigate("MYURL");
}

void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Check whether we have navigated to the final page.
    string registeredUrl = "MYURL";
    string currentPage = webBrowser1.Url.ToString();
    if (currentPage == registeredUrl)
    {
        // Request the same URL again with WebClient and show the response body.
        string response = GetWebRequest(currentPage);
        MessageBox.Show(response);
    }
}

/// <summary>
/// Sends a web request and returns the response body as a string.
/// The response can be a valid URL, a simple text response, or
/// an HTML response.
/// </summary>
/// <param name="url">The address to download.</param>
/// <returns>The response body.</returns>
public string GetWebRequest(string url)
{
    // WebClient is disposable; release it once the download finishes.
    using (var client = new WebClient())
    {
        return client.DownloadString(url);
    }
}
Comments (4)
If the document contains only a number and no HTML, this should work:
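A minimal sketch of one way to read a plain-text page through the WebBrowser control, and not necessarily the approach this answer had in mind. It assumes the control wraps the plain text in a document body, so `Document.Body.InnerText` holds the number, and it falls back to `DocumentText` otherwise. The class name `ScrapeForm` and the URL are hypothetical.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical form that loads a page and shows its text once loading completes.
class ScrapeForm : Form
{
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };

    public ScrapeForm(string url)
    {
        Controls.Add(browser);
        // Subscribe first, then navigate.
        browser.DocumentCompleted += OnDocumentCompleted;
        browser.Navigate(url);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        HtmlDocument doc = browser.Document;
        // Prefer the rendered body text; fall back to the raw document text.
        string text = (doc != null && doc.Body != null)
            ? doc.Body.InnerText
            : browser.DocumentText;
        MessageBox.Show((text ?? string.Empty).Trim());
    }

    [STAThread]
    static void Main()
    {
        Application.EnableVisualStyles();
        // Placeholder URL (an assumption); replace with the real page address.
        Application.Run(new ScrapeForm("http://example.com/value"));
    }
}
```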
You should be able to do something as straightforward as:
See the MSDN documentation for DownloadString(string).
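As a rough illustration of the `DownloadString(string)` approach mentioned above, the sketch below wraps the call in a small helper. The names `TextFetcher` and `FetchText` are made up for this example, and trimming the result is an assumption about how the value would be used.

```csharp
using System;
using System.Net;

static class TextFetcher
{
    // Downloads the resource at the given address and returns its text,
    // with surrounding whitespace (such as a trailing newline) removed.
    public static string FetchText(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url).Trim();
        }
    }
}
```

It could be called as `string value = TextFetcher.FetchText("http://example.com/value");`, where the URL is a placeholder. If the call returns nothing for the real page while a browser shows the number, the request itself probably differs from what the browser sends.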
I wrote a blog post on web scraping in .NET several years ago. You could try the techniques there; hopefully they're not obsolete.
For example:
Remember that your browser sends a User-Agent header, may be sending cookies, may be going through a configured proxy server, and so on. Particularly for secured or intranet sites, a simple WebClient call may be insufficient. You may need to do some checking with Fiddler, as @SLaks suggested.
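To make the point about headers, cookies, and proxies concrete, here is a hedged sketch of a WebClient configured to look a little more like a browser. The helper name `BrowserLikeClient`, its parameters, and the choice to use default credentials and the system proxy are all assumptions for illustration. Which of these settings a given site actually needs is best checked with a tool such as Fiddler.

```csharp
using System;
using System.Net;

static class BrowserLikeClient
{
    // Builds a WebClient that sends a User-Agent, an optional Cookie header,
    // the current Windows credentials, and uses the system proxy settings.
    public static WebClient Create(string userAgent, string cookieHeader)
    {
        var client = new WebClient();
        client.Headers[HttpRequestHeader.UserAgent] = userAgent;
        if (!string.IsNullOrEmpty(cookieHeader))
        {
            // Example format (hypothetical): "sessionId=abc123"
            client.Headers[HttpRequestHeader.Cookie] = cookieHeader;
        }
        client.UseDefaultCredentials = true;           // useful for intranet sites
        client.Proxy = WebRequest.GetSystemWebProxy(); // follow the configured proxy
        return client;
    }
}
```

A caller would create the client, call `DownloadString` on it with the page address, and dispose it afterwards.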
You can load the page's HTML or text content into a string, then use a string function to extract the number.
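One possible way to do the extraction step, assuming the value is an integer and that the first run of digits in the content is the one wanted. The helper name `NumberExtractor` is hypothetical, and a regular expression is used here in place of a plain string function.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class NumberExtractor
{
    // Returns the first integer found in the content, or null if there is none
    // (or if the digits do not fit in an int).
    public static int? ExtractFirstInteger(string content)
    {
        if (string.IsNullOrEmpty(content))
        {
            return null;
        }

        // Optional minus sign followed by one or more ASCII digits.
        Match match = Regex.Match(content, "-?[0-9]+");
        if (!match.Success)
        {
            return null;
        }

        int value;
        bool parsed = int.TryParse(match.Value, NumberStyles.AllowLeadingSign,
                                   CultureInfo.InvariantCulture, out value);
        return parsed ? value : (int?)null;
    }
}
```

With this helper, `NumberExtractor.ExtractFirstInteger("100\r\n")` yields 100 whether the content is bare text or wrapped in markup such as `<html><body>100</body></html>`.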