Any ideas on why I can't scrape a site?
I am building a site that needs to scrape information from a partner site. My scraping code works great with other sites, but not with this one. It is a regular .html site. My thought is that it might be generated somehow with PHP (the site is built with PHP).
I am only guessing about the generated part, and I would need your expert help on this. In case it matters, here is the code I use. The HtmlDocument is from htmlAgilityPack, but that has nothing to do with it. The result is null on the site I am trying.
    string result;
    var objRequest = System.Net.HttpWebRequest.Create(strUrl);
    var objResponse = objRequest.GetResponse();
    using (var sr = new StreamReader(objResponse.GetResponseStream()))
    {
        result = sr.ReadToEnd();
        sr.Close();
        var doc = new HtmlDocument();
        doc.LoadHtml(result);
        foreach (var c in doc.DocumentNode.SelectNodes("//a[@href]"))
        {
            litStatus.Text += c.Attributes["href"].Value + "<br />";
        }
    }
EDIT:
This is from the W3C validator; might it have something to do with this?
Sorry, I am unable to validate this document because on line 422 it contained one or
more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid
values in the specified Character Encoding). Please check both the content of the file and
the character encoding indication.
The error was: utf8 "\xA9" does not map to Unicode
I would start by seeing what response I get from something simple like wget, or by using a tool like Fiddler (http://www.fiddler2.com/fiddler2/) to test the response and check any headers you are getting back.
Sometimes sites return different responses for different user-agent strings and so on, so you may need to adjust your request headers and masquerade as a different browser to get the data you are looking for. If you are using Fiddler on the same machine that is running the script, you should be able to see exactly what is different between a request for the page from your browser and a request for the page from your script.
There may even be a simple 302 redirect or something like that going on that your code isn't following.
If you can access the page with a browser then you will definitely be able to access it by sending exactly the same request as your browser would send.
Edit: Fiddler is slightly trickier to use from your own code because it behaves as a proxy. It sets itself up with regular browsers, but you would have to manually tell your code to run through a proxy on 127.0.0.1 port 8888 in order for Fiddler to see your results.
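The request-side suggestions above (a browser-like agent string, following redirects, routing through Fiddler) could be sketched roughly like this. It is only a sketch under assumptions: `CreateBrowserLikeRequest` is a hypothetical helper, the User-Agent value is a placeholder rather than a string known to work with the partner site, and 127.0.0.1:8888 is the Fiddler proxy address mentioned above.

```csharp
using System;
using System.Net;

public static class ScrapeRequest
{
    // Hypothetical helper: builds a request that looks more like a browser's.
    public static HttpWebRequest CreateBrowserLikeRequest(string url, bool viaFiddler)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // Placeholder agent string (an assumption); copy the real one
        // from the browser request that Fiddler shows.
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

        // Follow 301/302 redirects. This is already the default for
        // HttpWebRequest; it is set here only to make it explicit.
        request.AllowAutoRedirect = true;

        if (viaFiddler)
        {
            // Route the script's traffic through Fiddler so its request
            // can be compared side by side with the browser's.
            request.Proxy = new WebProxy("127.0.0.1", 8888);
        }

        return request;
    }
}
```

Calling `ScrapeRequest.CreateBrowserLikeRequest(strUrl, true).GetResponse()` in place of the original `Create`/`GetResponse` pair should make the script's request appear in Fiddler, provided Fiddler is running.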
To troubleshoot, check the value of objResponse.StatusCode and objResponse.StatusDescription:
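The snippet that originally followed this line is missing, so the following is only a sketch of what such a check might look like, not the answerer's code. Two things to keep in mind: `WebRequest.GetResponse()` returns a `WebResponse`, so it has to be cast to `HttpWebResponse` before `StatusCode` is available, and it throws a `WebException` for 4xx/5xx responses. `StatusCheck`, `Describe` and `Fetch` are hypothetical names.

```csharp
using System;
using System.Net;

public static class StatusCheck
{
    // Hypothetical helper: formats the two values worth inspecting.
    public static string Describe(HttpStatusCode code, string description)
    {
        return (int)code + " " + description;
    }

    // Requests the URL and reports the HTTP status, including error statuses.
    public static string Fetch(string url)
    {
        var request = WebRequest.Create(url);
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return Describe(response.StatusCode, response.StatusDescription);
            }
        }
        catch (WebException ex) when (ex.Response is HttpWebResponse)
        {
            // For 4xx/5xx the status is carried on the exception's response.
            using (var error = (HttpWebResponse)ex.Response)
            {
                return Describe(error.StatusCode, error.StatusDescription);
            }
        }
    }
}
```

Writing the result of `StatusCheck.Fetch(strUrl)` to the page or a log would show whether the partner site answers with 200 or with something else, such as a 403.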
The problem appears to be the character in the comment on line 421:
which is outside of the declared character encoding iso-8859-1:
You might try running the parsed document string through a filter to convert or remove the offending characters in the string before evaluating it with the htmlAgilityPack LoadHtml().
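As an alternative to filtering the string afterwards, the bytes could be decoded with the declared encoding in the first place. `StreamReader` assumes UTF-8 unless told otherwise, and the byte 0xA9 from the validator error is not valid UTF-8 on its own, while in ISO-8859-1 it is the copyright sign. The sketch below assumes the page really is ISO-8859-1; `PageDecoder.Decode` is a hypothetical helper, and whether this resolves the null result on the partner site is untested.

```csharp
using System;
using System.IO;
using System.Text;

public static class PageDecoder
{
    // Hypothetical helper: decodes raw page bytes with an explicitly named
    // encoding instead of relying on StreamReader's UTF-8 default.
    public static string Decode(byte[] rawBytes, string encodingName)
    {
        var encoding = Encoding.GetEncoding(encodingName);
        using (var reader = new StreamReader(new MemoryStream(rawBytes), encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the question's code the same idea would amount to passing the encoding to the reader, for example `new StreamReader(objResponse.GetResponseStream(), Encoding.GetEncoding("iso-8859-1"))`, before handing the string to `LoadHtml()`.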