How do craigslist mashups get data?
I'm doing some research work into content aggregators, and I'm curious how some of the current craigslist aggregators get data into their mashups.
For example, www.housingmaps.com and the now-closed www.chicagocrime.org.
If there is a URL that can be used for reference, that would be perfect!
Comments (8)
For AdRavage.com, I use a combination of Magpie RSS (to extract the data returned from searches) and a custom screen-scraping class to populate the city/category information used when building searches.
For example, to extract the categories you could:
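A minimal sketch of category scraping in Python, not the AdRavage.com code. It assumes category links are plain anchors whose `href` starts with a hypothetical `/search/` prefix; the class name, the prefix, and the sample markup are all illustrative and would need adjusting to the real page layout.

```python
from html.parser import HTMLParser


class CategoryLinkParser(HTMLParser):
    """Collect (code, label) pairs from anchors whose href starts with a prefix."""

    def __init__(self, path_prefix="/search/"):
        super().__init__()
        self.path_prefix = path_prefix  # assumed URL layout, not confirmed
        self.categories = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.path_prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            code = self._href[len(self.path_prefix):].strip("/")
            label = "".join(self._text).strip()
            self.categories.append((code, label))
            self._href = None


def extract_categories(html, path_prefix="/search/"):
    """Return every (code, label) category pair found in an HTML string."""
    parser = CategoryLinkParser(path_prefix)
    parser.feed(html)
    parser.close()
    return parser.categories


# Illustrative markup only; the real page structure may differ.
SAMPLE_HTML = """
<ul id="categories">
  <li><a href="/search/apa">apartments / housing</a></li>
  <li><a href="/search/sof">software jobs</a></li>
  <li><a href="/about/help">help</a></li>
</ul>
"""
```

Calling `extract_categories(SAMPLE_HTML)` keeps the two `/search/` links and skips the help link, since its path does not match the prefix.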
An alternative to scraping (and getting blocked), using frames, or using Google search is to use a data broker or data exchange service.
3taps is a beta service that provides a developer API to many services, including Craigslist. Their team also built Craiggers to demonstrate a use case for this API. Founder Greg Kidd told me that 3taps harvests Craigslist data from non-Craigslist sources where it is already indexed and cached, so it doesn't put any strain on Craigslist. Other 3taps data sources are also listed, but these stats make it unclear whether they're currently supported. Their goal is to Democratize the Exchange of Data.
80legs is a crawling service that provides a less real-time but potentially more comprehensive option. Their data-dump-style service includes crawl packages for hundreds of sites, including Amazon, Facebook, and Zillow (but not, I believe, Craigslist at present). Their newer effort, Datafiniti, provides a search engine over this type of data.
Another option is to use YQL or Yahoo Pipes to gather the results.
Craiglook and HousingMaps use them to gather results.
The problem with any scraping solution for craigslist is that they automatically block any IP address that accesses them 'too much', which usually means more than a few hundred times a day. So as soon as your tool gets any kind of popularity, it will be shut down.
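To make the constraint concrete, here is a rough sketch of a per-day request budget that a scraper running from a single IP might check before each fetch. The limit of 300 is only a stand-in for "a few hundred"; the real threshold is not documented here, and `DailyRequestBudget` is a hypothetical helper.

```python
import datetime


class DailyRequestBudget:
    """Allow at most `limit` requests per calendar day."""

    def __init__(self, limit=300):  # 300 is an assumed value, not a known threshold
        self.limit = limit
        self._day = None   # the day the current count belongs to
        self._used = 0     # requests already spent on that day

    def try_acquire(self, today=None):
        """Return True and spend one request if today's budget is not exhausted."""
        today = today or datetime.date.today()
        if today != self._day:
            # A new day has started, so the counter resets.
            self._day = today
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

A budget like this only delays the problem: once enough users share the same server IP, the daily allowance runs out, which is the point the answer makes about popular tools.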
That's why the only craigslist search sites that have lasted use either frames (like searchtempest.com and crazedlist.org) or Google (like allofcraigs.com).
What 3taps does is gather craigslist listings from third-party sources 'in the wild', such as the Google and Bing caches.
Edit: this answer is no longer up to date. Most classifieds search engines that include results from craigslist now use Google Custom Search or similar solutions from Yahoo or Bing. SearchTempest uses both. Allofcraigs is now adhuntr and uses Google. Crazedlist has shut down.
I've done a lot of data aggregation from sites like eBay, Craigslist, and Zillow. Each source requires a different method to aggregate the data.
For Craigslist, I got the data using RSS feeds. I only wanted specific data in specific categories in specific cities, and the RSS feeds worked fine for me. If you're trying to get all the data and you overuse the RSS feeds, Craigslist will likely ban you. Also, you won't be able to get all the data from the Craigslist feeds, because they show most of the listings but not all of them. If you don't need 100% coverage, RSS is the easiest way to do it.
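As a rough illustration of the RSS approach, the sketch below parses a feed document with Python's standard library and pulls out each item's title, link, and description. The sample feed and its `example.org` URLs are made up. Matching on local tag names is an assumption made so that the same code tolerates both plain and namespaced RSS dialects; downloading the feed is left out.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, so '{ns}item' and 'item' both become 'item'."""
    return tag.rsplit("}", 1)[-1]


def parse_feed_items(xml_text):
    """Return a list of {'title', 'link', 'description'} dicts, one per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        fields = {"title": "", "link": "", "description": ""}
        for child in element:
            name = local_name(child.tag)
            if name in fields:
                fields[name] = (child.text or "").strip()
        items.append(fields)
    return items


# Made-up feed for illustration; real feeds carry more fields.
SAMPLE_FEED = """
<rss version="2.0"><channel>
  <title>sample housing feed</title>
  <item>
    <title>2br apartment</title>
    <link>https://example.org/apa/1.html</link>
    <description>Sunny unit near the park</description>
  </item>
  <item>
    <title>Studio</title>
    <link>https://example.org/apa/2.html</link>
  </item>
</channel></rss>
"""
```

Items with a missing field come back with an empty string for it, so callers can filter by city or category without checking for absent keys.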
I'm guessing screen scraping.
I don't think there is a craigslist API yet, and I don't think they will release one.
So the only way to go is to scrape the data. You could use the cURL library and heavy regex to scrape the data you want from a page.
If you see a link, access that page, scrape it, get the data, and show it or store it,
and so on.
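The loop described above can be sketched in Python as follows, with `urllib` standing in for cURL. The two regular expressions are deliberately simple placeholders: the page title stands in for "the data you want", and `scrape`, `fetch`, and `max_pages` are hypothetical names. Regex-based HTML parsing is fragile, so treat this as an outline of the control flow only.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Placeholder patterns; a real scraper needs patterns tuned to the target pages.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def fetch(url):
    """Download a page and return its text (the cURL step)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(start_url, fetch_page=fetch, max_pages=10):
    """Visit pages breadth-first: store each page's title, then queue its links."""
    seen = set()
    queue = [start_url]
    results = {}
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch_page(url)
        match = TITLE_RE.search(html)
        results[url] = match.group(1).strip() if match else ""
        # Resolve relative links against the current page before queueing them.
        for href in LINK_RE.findall(html):
            queue.append(urljoin(url, href))
    return results
```

Passing a different `fetch_page` callable makes it possible to try the crawl against canned pages without touching the network, and `max_pages` keeps a run from wandering indefinitely.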
I just made one:
http://cdn.javascriptmvc.com/videos/jobs/craigslist.js
That produces:
http://cdn.javascriptmvc.com/videos/jobs/craigslist.html
It must be run in Rhino.
While continuing to research this area, I found an awesome site that does part of what I'm interested in:
Crazedlist
It uses the HTTP Referer of the client browser, which is interesting but not ideal. The author of the site also claims to have royally ticked off CL, which I understand. It also gives a clear example of a business need similar to mine, and of why I'm interested in this topic.