How can I extract HTML content efficiently with Perl?
I am writing a crawler in Perl, which has to extract the contents of web pages that reside on the same server. I am currently using the HTML::Extract module to do the job, but I found the module a bit slow, so I looked into its source code and found out it does not use any connection cache for LWP::UserAgent.

My last resort is to grab HTML::Extract's source code and modify it to use a cache, but I really want to avoid that if I can. Does anyone know any other module that can perform the same job better? I basically just need to grab all the text in the <body> element with the HTML tags removed.
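For reference, a minimal sketch of the connection caching in question: passing keep_alive to LWP::UserAgent makes it reuse connections via LWP::ConnCache (the URL handling here is only illustrative).

use strict;
use warnings;
use LWP::UserAgent;

# keep_alive => N tells LWP::UserAgent to keep up to N persistent
# connections open (via LWP::ConnCache), so repeated requests to the
# same server skip the TCP handshake each time.
my $ua = LWP::UserAgent->new(keep_alive => 10);

for my $url (@ARGV) {
    my $res = $ua->get($url);
    print $res->decoded_content if $res->is_success;
}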
4 Answers
I use pQuery for my web scraping. But I've also heard good things about Web::Scraper.

Both of these, along with other modules, have appeared in SO answers to questions similar to yours.
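A minimal pQuery sketch of the body-text extraction the question asks for, assuming pQuery's jQuery-style find() and text() methods; the URL is a placeholder.

use strict;
use warnings;
use pQuery;

# Fetch the page and take the text of <body> with the tags stripped,
# jQuery-style. pQuery also accepts a string of HTML instead of a URL.
my $text = pQuery("http://example.com/page.html")->find("body")->text;
print $text;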
HTML::Extract's features look very basic and uninteresting. If the modules that draegfun mentioned don't interest you, you could do everything that HTML::Extract does using LWP::UserAgent and HTML::TreeBuilder yourself, without requiring very much code at all, and then you would be free to work in caching on your own terms.
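A sketch of that approach, combined with the keep_alive connection cache the question mentions; the body_text() helper and the command-line URL handling are illustrative, not part of either module.

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# The connection cache HTML::Extract lacks: keep_alive reuses sockets.
my $ua = LWP::UserAgent->new(keep_alive => 10);

# Fetch a page and return the text of its <body>, tags removed.
sub body_text {
    my ($url) = @_;
    my $res = $ua->get($url);
    die "$url: ", $res->status_line, "\n" unless $res->is_success;

    my $tree = HTML::TreeBuilder->new_from_content($res->decoded_content);
    my $body = $tree->look_down(_tag => 'body');
    my $text = $body ? $body->as_text : '';
    $tree->delete;    # TreeBuilder trees are not freed automatically
    return $text;
}

print body_text($_), "\n" for @ARGV;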
I've been using Web::Scraper for my scraping needs. It's very nice indeed for extracting data, and because you can call ->scrape($html, $originating_uri) then it's very easy to cache the result you need as well.
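A minimal Web::Scraper sketch for the same body-text job, using the module's scraper/process declaration style; the URL is a placeholder.

use strict;
use warnings;
use URI;
use Web::Scraper;

# Declare what to extract: the text content of <body>, tags removed.
my $extractor = scraper {
    process 'body', text => 'TEXT';
};

# scrape() can fetch a URI itself, or you can hand it HTML you fetched
# (and cached) yourself, as the answer notes:
#     $extractor->scrape($html, $originating_uri)
my $result = $extractor->scrape(URI->new("http://example.com/"));
print $result->{text};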
Do you need to do this in real-time? How does the inefficiency affect you? Are you doing the task serially so that you have to extract one page before you move onto the next one? Why do you want to avoid a cache?
Can your crawler download the pages and pass them off to something else? Perhaps your crawler can even run in parallel, or in some distributed manner.
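One hedged sketch of the parallel idea, using Parallel::ForkManager; the worker count and per-child fetch logic are assumptions to illustrate the shape, not a prescribed design.

use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(5);    # run up to 5 children at once

URL: for my $url (@ARGV) {
    $pm->start and next URL;               # parent forks a child, moves on

    # Each child gets its own UserAgent, with its own keep-alive cache.
    my $ua  = LWP::UserAgent->new(keep_alive => 2);
    my $res = $ua->get($url);
    warn "$url: ", $res->status_line, "\n" unless $res->is_success;
    # ...hand the content off to a separate extraction step here...

    $pm->finish;                           # child exits
}
$pm->wait_all_children;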