How to do truly multithreaded web mining with IE/.NET/C#?

Posted 2024-09-14 16:54:33

I want to mine large amounts of data from the web using the IE browser. However, spawning many instances of IE via WatiN crashes the system. Is there a better way of doing this? Note that I can't simply issue WebRequests: I really need the browser, because I have to interact with JS-driven behavior on the site.


4 Comments

仅此而已 2024-09-21 16:54:33

I am mining a lot of pages with WatiN; actually 30+ at this moment. Of course it takes a lot of resources (about 2.5 GB of RAM), but it is almost impossible to do the same with WebRequest. I can't imagine finishing such a job in a reasonable time that way; with WatiN it takes a couple of hours.

I don't know if it helps you, but I am using the WebBrowser control to do that, with every instance in a separate process. What I think is more important to you, though: I once tried to reduce the amount of memory used by doing all of it in a single process. It's possible to create separate AppDomains instead of processes and force them to share the same DLLs (especially Microsoft.mshtml.dll) instead of loading the same DLL separately for each new app domain. I can't remember exactly how to do that now, but it's not hard to google. What I remember is that everything worked fine and RAM usage decreased significantly, so I think it's worth trying.
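The answer doesn't show the AppDomain setup it alludes to. A minimal sketch of the idea, assuming the .NET Framework `LoaderOptimization` API (the scraping callback and domain names here are hypothetical placeholders), might look like:

```csharp
using System;

static class Program
{
    // MultiDomainHost shares the code of strong-named assemblies (such as
    // Microsoft.mshtml.dll once it is in the GAC) across all AppDomains in
    // the process, instead of loading a private copy per domain.
    [LoaderOptimization(LoaderOptimization.MultiDomainHost)]
    static void Main()
    {
        for (int i = 0; i < 10; i++)
        {
            AppDomain worker = AppDomain.CreateDomain("miner-" + i);
            // Run the scraping code inside the new domain; the entry point
            // must be a static method reachable from that domain.
            worker.DoCallBack(RunMiner);
            AppDomain.Unload(worker);
        }
    }

    static void RunMiner()
    {
        // ... create the WebBrowser/WatiN instance and scrape here ...
        Console.WriteLine("mining in " + AppDomain.CurrentDomain.FriendlyName);
    }
}
```

Whether the shared loading actually helps depends on which assemblies are strong-named and GAC-installed on the machine; it is a sketch of the technique the answer describes, not a measured result.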

べ繥欢鉨o。 2024-09-21 16:54:33

What about launching multiple instances of the WebBrowser control (it's IE anyway) in a .NET app to process the data-mining jobs asynchronously?

If performance is a problem, splitting the job up and pushing it to the cloud might also help.
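For the multiple-WebBrowser idea above, one constraint is that each control needs its own STA thread with a running message loop. A minimal sketch under that assumption (the URLs are placeholders):

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

class MultiBrowser
{
    // Each WebBrowser control lives on its own STA thread;
    // Application.Run provides the message loop it requires.
    static void StartWorker(string url)
    {
        var t = new Thread(() =>
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (s, e) =>
            {
                // ... scrape browser.Document here ...
                Application.ExitThread();   // stop this worker's message loop
            };
            browser.Navigate(url);
            Application.Run();              // pump messages for this browser
        });
        t.SetApartmentState(ApartmentState.STA);
        t.Start();
    }

    static void Main()
    {
        foreach (var url in new[] { "http://example.com/a", "http://example.com/b" })
            StartWorker(url);
    }
}
```

Note that this keeps all browsers in one process, so a single unmanaged crash still takes everything down, which is exactly the risk the next answer raises.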

趁年轻赶紧闹 2024-09-21 16:54:33

The best way is to actually create one process per web-browser instance. This is because the web browser is not managed code, it's COM, and there are cases where an unmanaged exception cannot be handled in managed code, so the application will certainly crash.

Better still, create a process host that spawns multiple worker processes; if you need the processes to communicate, you can use named pipes, sockets, or WCF.

Best of all, create a small embedded SQL database in which you queue your jobs. The mining processes can fetch a new request from it and post results back, and the database can be used to synchronize everything.
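A minimal sketch of the process-host-plus-named-pipes part, using `System.IO.Pipes` (the `Worker.exe` name, pipe names, and URLs are hypothetical; a real host would also monitor and restart crashed workers):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Pipes;

class ProcessHost
{
    // Spawn one worker process per browser instance and hand each a job
    // over a named pipe, so a crash in one worker cannot kill the host.
    static void Main()
    {
        for (int i = 0; i < 4; i++)
        {
            string pipeName = "miner-pipe-" + i;
            var server = new NamedPipeServerStream(pipeName, PipeDirection.Out);

            // The worker receives the pipe name and connects back to it.
            Process.Start("Worker.exe", pipeName);

            server.WaitForConnection();
            using (var writer = new StreamWriter(server))
                writer.WriteLine("http://example.com/page" + i);  // the job
        }
    }
}
```

The embedded-database variant replaces the pipe with a jobs table that workers poll, which also gives you durable state across crashes.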

枯寂 2024-09-21 16:54:33

I had a project where I scraped on the order of 45 million requests (with form submissions) over an extended period. On a sustained basis I was scraping with about 20 simultaneous clients, and my pipe was the bottleneck.

I settled on Selenium Remote Control after experimenting with writing my own WebClient, with WaTiN/WaTiR, and with Microsoft's UI Automation API.

Selenium RC lets you choose your browser; I used Firefox. Setting up the initial scraping scripts took about an hour of experimentation and tuning. Selenium was vastly faster than writing my own code and, for very little investment, a lot more robust. Great tool.

To scale the process I tried a few different approaches, but what ultimately worked best was sticking each SRC instance in its own stripped-down VM and then spawning as many of those as the workstation had RAM to support. An equivalent number of SRC instances running natively in the host, instead of in VMs, inevitably ground to a halt once I got past 10 instances. The VM approach required more overhead and setup time before a scraping run, but it would then run strongly for days, uninterrupted.

Another consideration: tune your Firefox preferences down so that no homepage is loaded, and turn off everything non-essential (spoofing checks, cookies if not required for your scrape, images, plus Adblock and Flashblock, etc.).
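The preference tuning the answer describes could be captured in a `user.js` dropped into the scraping profile. These are standard Firefox preference names, though which ones matter (and their effect) varies by Firefox version, so treat this as a starting point rather than a definitive list:

```javascript
// user.js for a stripped-down scraping profile
user_pref("browser.startup.page", 0);              // 0 = blank page, no homepage
user_pref("permissions.default.image", 2);         // 2 = never load images
user_pref("browser.safebrowsing.enabled", false);  // skip phishing/spoofing checks
user_pref("browser.safebrowsing.malware.enabled", false);
user_pref("network.cookie.cookieBehavior", 2);     // 2 = reject all cookies
```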
