Extracting the HTML source code of a web page
I use this code to extract the HTML source code of a web page:
using System.IO;
using System.Net;

private static string GetWebText(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = "A .NET Web Crawler";
    // Dispose the response, stream, and reader so the connection is released.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}
After I use this code to fetch the source of this page (in order to extract the hrefs):
http://www.bayt.com/en/qatar/jobs/
the HTML source I get is missing a lot of content, such as:
"Cost Optimization Adviser"
Is it because this is server-side generated content?
How can I change my code so that it returns the HTML source with the server-side generated content included?
Comments (2)
It's not likely to be a server-side cause: by the time you view the source, the server-side work is already complete and rendered. That said, they may have scripts that dynamically change the contents of the page after the load has completed. If the content changes after the load, then as far as the source goes you're stuck with what was originally rendered, not anything added afterwards. Some browsers can keep up with the rendering and manipulation as it happens, but a lot goes on inside the browser application to make that possible, and that part is not in any way related to JavaScript.
Another plausible explanation or two: they may have something built into the site for XSS detection of some sort, so that if a referring script isn't on the domain itself, it sees whatever they want it to see. Or your script is suffering from the "same-origin policy", which could be mucking it up somehow. Or, like someone else said, you're not parsing the DOM right.
You could try loading the page in an iframe on your site, then getting the innerHTML of the iframe and dumping that into a file or something.
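One way to check the script hypothesis above is to look at the raw response itself: see whether the expected text is present, and whether the page carries script tags that could be adding content after load. The following is only a rough sketch; the class name RawHtmlCheck and both method names are made up for illustration, and counting script tags is a hint, not proof, that content is injected later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for inspecting raw HTML; not part of the original post.
public static class RawHtmlCheck
{
    // True if the phrase occurs anywhere in the raw HTML (case-insensitive).
    public static bool ContainsText(string html, string phrase)
    {
        return html.IndexOf(phrase, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Counts opening <script> tags. Closing tags are not matched because
    // they start with "</" rather than "<script".
    public static int CountScriptTags(string html)
    {
        return Regex.Matches(html, @"<script\b", RegexOptions.IgnoreCase).Count;
    }
}
```

Passing the string returned by GetWebText to RawHtmlCheck.ContainsText(html, "Cost Optimization Adviser") would show whether the text is in the response at all. If it is absent while the script count is high, the content is probably being filled in by scripts after the page loads.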
They may just be adding certain content based on the browser ID (User-Agent) string.
Try setting your user agent to this
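The exact user-agent string suggested here is not preserved in this copy of the answer, so the sketch below takes it as a parameter rather than guessing a value. It is a minimal variation on the asker's GetWebText, assuming the only change needed is the header; the class name CrawlerRequest and the Create helper are hypothetical names introduced for illustration.

```csharp
using System.IO;
using System.Net;

// Hypothetical wrapper; mirrors the asker's GetWebText with a configurable User-Agent.
public static class CrawlerRequest
{
    // Builds the request without sending it, so the header can be inspected.
    public static HttpWebRequest Create(string url, string userAgent)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent; // caller supplies a browser's string
        return request;
    }

    // Sends the request and returns the response body as text.
    public static string GetWebText(string url, string userAgent)
    {
        HttpWebRequest request = Create(url, userAgent);
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

To try it, copy the User-Agent value your own browser sends (visible in its developer tools) and pass it as the second argument, then compare the result with what the crawler's original string returned.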