Crawling the Internet

I want to crawl for specific things. Specifically, events that are taking place, like concerts, movies, art gallery openings, etc. Anything that one might spend time going to.

How do I implement a crawler?

I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/)

Are there others?

What opinions does everyone have?

-Jason

最单纯的乌龟 2024-07-23 11:08:07

An excellent introductory text for that topic is Introduction to Information Retrieval (full text available online). It has a chapter on Web crawling, but perhaps more importantly, it provides a basis for the things you want to do with the crawled documents.

Introduction to Information Retrieval
(source: stanford.edu)

望她远 2024-07-23 11:08:07

Whatever you do, please be a good citizen and obey the robots.txt file. You might want to check the references on the Wikipedia page on focused crawlers. I just realized that I know one of the authors of Topical Web Crawlers: Evaluating Adaptive Algorithms. Small world.
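If you're crawling from Python, the standard library already ships a robots.txt parser, so being a good citizen costs almost nothing. A minimal sketch; the user-agent string and URLs are made-up placeholders:

```python
# Check robots.txt before fetching, using only the standard library.
from urllib import robotparser

USER_AGENT = "my-event-crawler"  # placeholder user-agent

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

url = "https://example.com/events/calendar"
if rp.can_fetch(USER_AGENT, url):
    print("OK to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)
```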

无声静候 2024-07-23 11:08:07

Check out Scrapy. It's an open source web crawling framework written in Python (I've heard it's similar to Django except instead of serving pages it downloads them). It's easily extensible, distributed/parallel and looks very promising.

I'd use Scrapy, because that way I could save my energy for the more mundane work, like extracting the correct data from the scraped content and inserting it into a database.
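For a flavor of what Scrapy code looks like, here is a minimal spider sketch; the site URL and the CSS selectors are hypothetical placeholders for an events listing page:

```python
import scrapy


class EventSpider(scrapy.Spider):
    name = "events"
    start_urls = ["https://example.com/events"]  # hypothetical listing page

    def parse(self, response):
        # One item per event entry; the selectors are assumptions
        # about the page's markup.
        for event in response.css("div.event"):
            yield {
                "title": event.css("h2::text").get(),
                "date": event.css("span.date::text").get(),
                "url": response.urljoin(event.css("a::attr(href)").get() or ""),
            }
        # Follow pagination; Scrapy handles scheduling and deduplication.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

You'd run it with `scrapy runspider events_spider.py -o events.json` and get the yielded items back as JSON.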

你是年少的欢喜 2024-07-23 11:08:07

I think the web crawler part will be the easiest part of the task. The hard part will be deciding which sites to visit and how to discover events on the sites you want to visit. Maybe you want to look at using either the Google or Yahoo API to get the data you want. They've already done the work of crawling a lot of pages on the internet; you can focus on the (to my mind, anyway) much tougher problem of sifting the data to get the events you're looking for.

执手闯天涯 2024-07-23 11:08:07

Actually writing a directed crawler that scales is quite a challenging task. I implemented one at work and maintained it for quite a while. There are a lot of problems that you don't know exist until you write one and hit them, particularly around dealing with CDNs and friendly crawling of sites. Adaptive algorithms are very important, or you will trip DoS filters; in fact, if your crawl is big enough, you will anyway without knowing it.

Things to think about:

  • What's acceptable throughput?
  • How do you deal with site outages?
  • What happens if you are blocked?
  • Do you want to engage in stealth crawling (controversial, and actually quite hard to get right)?

I have actually written some notes on crawler construction that I might put online if I ever get around to it, since building a proper one is much tougher than people will tell you. Most of the open source crawlers work well enough for most people, so if you can, I recommend you use one of those. Which one is a feature/platform choice.
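To make the point about adaptive algorithms concrete, here is one illustrative sketch of a per-host politeness throttle that backs off on errors. The structure and the numbers are my own assumptions, not a known production design:

```python
# Per-host politeness throttle: pace requests per host and back off
# exponentially when a host returns errors (timeouts, 429s, 5xxs).
import time
from collections import defaultdict


class HostThrottle:
    """Track a crawl delay per host, slowing down when a host struggles."""

    def __init__(self, base_delay=2.0, max_delay=300.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = defaultdict(lambda: base_delay)  # seconds per host
        self.next_ok = defaultdict(float)             # earliest next fetch

    def wait(self, host):
        # Sleep until this host's politeness window has elapsed.
        now = time.monotonic()
        if now < self.next_ok[host]:
            time.sleep(self.next_ok[host] - now)
        self.next_ok[host] = time.monotonic() + self.delay[host]

    def record(self, host, ok):
        if ok:
            # Relax slowly toward the base delay after successes.
            self.delay[host] = max(self.base_delay, self.delay[host] * 0.9)
        else:
            # Back off exponentially on failures, up to a ceiling.
            self.delay[host] = min(self.max_delay, self.delay[host] * 2)
```

The crawl loop would call `wait(host)` before each fetch and `record(host, ok)` after, so hosts that start failing automatically get visited less often.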

淡笑忘祈一世凡恋 2024-07-23 11:08:07

If you find that crawling the internet becomes too daunting a task, you may want to consider building an RSS aggregator and subscribing to RSS feeds for popular event sites like craigslist and upcoming.org.

Each of these sites provides localized, searchable events. RSS gives you a (few) standardized formats, instead of having to parse all the malformed HTML that makes up the web...

There are open source libraries like ROME (Java) that may help with the consumption of RSS feeds.
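ROME is for Java; on the Python side, the third-party feedparser package (pip install feedparser) does the same job. A minimal sketch, with a placeholder feed URL:

```python
import feedparser

# Parse a feed and walk its items; most feeds expose at least
# a title and a link per entry.
feed = feedparser.parse("https://example.com/events.rss")
for entry in feed.entries:
    print(entry.get("title"), "->", entry.get("link"))
```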

各空 2024-07-23 11:08:07

Is there a language-specific requirement?

I spent some time a while back playing around with the Chilkat Spider libs for .NET, for personal experimentation.

Last I checked, their spider libs are licensed as freeware (although not open source, as far as I know :( ).

It seems they have Python libs too.

http://www.example-code.com/python/pythonspider.asp #Python
http://www.example-code.com/csharp/spider.asp #.Net

够运 2024-07-23 11:08:07

Following on Kevin's suggestion of RSS feeds, you might want to check out Yahoo Pipes. I haven't tried them yet, but I think they let you process several RSS feeds and generate web pages or more RSS feeds.
