C# web crawling

Posted on 2024-10-09 07:59:30

I have been given the task of crawling, parsing, and indexing the available books on many library web pages. I usually use HTML Agility Pack and C# to parse web site content. One of the sites is the following:

http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB

If you search for a * (all books) it will return many lists of books, paginated by 10 books per page.

The typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all the links on the page and generate the POST/GET variables needed to produce the results dynamically. I haven't been able to get this working either, mostly because of some 404 errors that I get (although I am certain that the generated links are correct).

The site relies on JavaScript to generate content, and uses a mixed mode of GET and POST variable submission.

Comments (5)

雾里花 2024-10-16 07:59:30

I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler and then you can base your crawling off of those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling or whatever, later.
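As a rough illustration of basing a crawl on captured requests, the sketch below rebuilds a request from the method, URL, form fields, and referer that a tool such as Fiddler would show. The class name `CapturedRequestReplayer`, the method names, and every field value are hypothetical; the real values have to be copied from the captured traffic. The cookie container is included on the assumption that the site keeps session state in cookies, which this thread does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;

// Hypothetical helper: rebuilds requests that were observed in a traffic capture.
public static class CapturedRequestReplayer
{
    // One client with a cookie container, so cookies set by an earlier
    // response are sent back on later requests (assumed to matter here).
    public static HttpClient CreateClient()
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = new CookieContainer(),
            UseCookies = true,
            AllowAutoRedirect = true
        };
        return new HttpClient(handler);
    }

    // Builds a GET or POST request from captured values.
    // formFields and referer may be null when the capture had none.
    public static HttpRequestMessage BuildRequest(
        string method,
        string url,
        IDictionary<string, string> formFields,
        string referer)
    {
        var request = new HttpRequestMessage(new HttpMethod(method), url);

        if (formFields != null && formFields.Count > 0)
        {
            // Sends the fields as application/x-www-form-urlencoded.
            request.Content = new FormUrlEncodedContent(formFields);
        }

        if (!string.IsNullOrEmpty(referer))
        {
            request.Headers.Referrer = new Uri(referer);
        }

        return request;
    }
}
```

A request built this way would be sent with `await client.SendAsync(request)` on the client returned by `CreateClient()`, reusing that one client for the whole crawl so that the cookies persist between pages.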

Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.

Good luck.

您的好友蓝忘机已上羡 2024-10-16 07:59:30

FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.

Using SHDocVw is faster, but is also semaphore limited.

Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs Username/Password: Public (doesn't have the request/rendering limitations that the other two have when run out of process...)

This is headless, so none of the controls are rendered. (Faster).

Thanks,
Mike

或十年 2024-10-16 07:59:30

If you use the WebBrowser control in a Windows Forms application to open the page then you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.

As for the links that are generated through Javascript, you might look at the ObjectForScripting property which should allow you to interface with the HTML page through Javascript. The rest then becomes a Javascript problem, but it should (in theory) be solvable. I haven't tried this so I can't say.
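A minimal sketch of the first half of this idea, assuming a Windows Forms project: it loads the page in a `WebBrowser` control and lists the links found in the resulting `HtmlDocument`. The class name `RenderedLinkDump` and its `FilterLinks` helper are hypothetical. Links that scripts add after `DocumentCompleted` has fired would not be captured by this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper: lists the links of a page loaded in a WebBrowser control.
public static class RenderedLinkDump
{
    // Keeps only absolute http/https links and drops duplicates,
    // so javascript: pseudo-links and empty hrefs are skipped.
    public static List<string> FilterLinks(IEnumerable<string> hrefs)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (var href in hrefs)
        {
            Uri uri;
            if (Uri.TryCreate(href, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
                && seen.Add(uri.AbsoluteUri))
            {
                result.Add(uri.AbsoluteUri);
            }
        }
        return result;
    }

    // Call this from a Main method marked [STAThread];
    // the WebBrowser control requires a single-threaded apartment.
    public static void Run(string startUrl)
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // The event also fires for frames; wait for the top-level document.
            if (e.Url != browser.Url) return;

            var hrefs = new List<string>();
            foreach (HtmlElement link in browser.Document.Links)
            {
                hrefs.Add(link.GetAttribute("href"));
            }

            foreach (var url in FilterLinks(hrefs))
            {
                Console.WriteLine(url);
            }

            Application.ExitThread(); // stop the message loop started below
        };

        browser.Navigate(startUrl);
        Application.Run(); // pump messages so the control can load the page
    }
}
```

`Run` would be called with the library URL from the question. The filtering is kept in a separate method so it can be checked without opening a browser.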

我只土不豪 2024-10-16 07:59:30

If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.

Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.

混浊又暗下来 2024-10-16 07:59:30

AbotX does JavaScript rendering for you. It's not free, though.
