
How do craigslist mashups get their data?

Posted 2024-07-08 10:24:10

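I am doing some research on content aggregators, and I'm curious how some of the current craigslist aggregators get data into their mashups.

For example, www.housingmaps.com and the now-defunct www.chicagocrime.org.

A URL for reference would be perfect!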


╰沐子 2024-07-15 10:24:10

For AdRavage.com I use a combination of Magpie RSS (to extract the data returned from searches) and a custom screen scraping class to properly populate the city/category information used when building searches.

For example, to extract the categories you could:

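// Scrape category data. This fragment sits inside a function, hence the return.
// "http" is a custom cached-fetch class, not part of PHP itself.
$h = new http();
$h->dir = "../cache/";
$url = "http://craigslist.org/";

// Fetch the page (the second argument is presumably a cache lifetime in seconds)
if (!$h->fetch($url, 300)) {
  echo "<h2>There is a problem with the http request!</h2>";
  exit();
}

// The data looks like: <option value="ccc">community
// Group 1 captures the abbreviation ("ccc"), group 2 the display name ("community")
preg_match_all('/<option value="([^"]*)">([^<\n]*)/', $h->body, $categoryTemp);

// Return the array of category names, or an empty array if nothing matched
$catNames = isset($categoryTemp[2]) ? array_map('trim', $categoryTemp[2]) : array();
return $catNames;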
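
As a rough illustration of the same extraction without a regular expression, here is a minimal Python sketch that collects `<option>` values and labels with the standard-library HTML parser. The class name `OptionCollector`, the function `extract_categories`, and the sample markup are all hypothetical; the markup is only shaped like the comment in the PHP snippet and is not taken from the live site.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect (value, label) pairs from <option> tags."""

    def __init__(self):
        super().__init__()
        self.options = []    # finished (value, label) pairs
        self._value = None   # value attribute of the option being read
        self._text = []      # text fragments seen inside that option

    def _flush(self):
        # Store the option currently being read, if any, then reset.
        if self._value is not None:
            self.options.append((self._value, "".join(self._text).strip()))
        self._value = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._flush()  # the previous <option> may have had no closing tag
            self._value = dict(attrs).get("value") or ""

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)


def extract_categories(html):
    """Return a list of (abbreviation, name) tuples found in the markup."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing option with no closing tag
    return parser.options


# Hypothetical sample markup, not real site output.
SAMPLE = """<select>
<option value="ccc">community
<option value="sss">for sale
<option value="hhh">housing
</select>"""

categories = extract_categories(SAMPLE)
print(categories)
```

A parser-based approach tends to survive small markup changes (attribute order, closing tags) better than a pattern tied to line endings, though it is still screen scraping and breaks when the page layout changes.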


神爱温柔 2024-07-15 10:24:10

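An alternative to scraping (and getting blocked), or to using frames or Google search, is to use a data broker or data exchange service.

3taps is a beta service that offers a developer API for many services, including Craigslist. Their team also built Craiggers to demonstrate a use case for this API. Founder Greg Kidd told me that 3taps collects Craigslist data from non-Craigslist sources where it has already been indexed and cached, so it puts no load on Craigslist. Other 3taps data sources are listed as well, but it is not clear from those statistics whether they are currently supported. Their goal is to democratize the exchange of data.

80legs is a crawling service that offers a less real-time but potentially more comprehensive option. Their data-dump style service includes crawl packages for hundreds of websites, including Amazon, Facebook, and Zillow (but not, I believe, Craigslist at the moment). Their latest effort, Datafiniti, offers a search engine for this kind of data.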


你爱我像她 2024-07-15 10:24:10

Another option is to use YQL or Yahoo Pipes to collect the results.

Craiglook and HousingMaps use them to collect their results.


不可一世的女人 2024-07-15 10:24:10

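The problem with any scraping solution for craigslist is that they automatically block any IP address that accesses them "too much", which generally means more than a few hundred times a day. So as soon as your tool gains any sort of popularity, it gets shut down.

That is why the only craigslist search sites that have lasted either use frames (like searchtempest.com and crazedlist.org) or use Google (like allofcraigs.com).

What 3taps does is collect craigslist listings "in the wild" from third-party sources such as the Google and Bing caches.

Edit: This answer is no longer up to date. Most classified search engines that include craigslist results now use Google Custom Search or similar solutions from Yahoo or Bing. SearchTempest uses both. Allofcraigs is now adhuntr and uses Google. Crazedlist has shut down.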


蔚蓝源自深海 2024-07-15 10:24:10

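I do a lot of data aggregation from sites such as eBay, Craigslist, and Zillow. Each source requires a different approach to aggregating the data.

For Craigslist, I get the data from its RSS feeds. I only want specific data for specific categories in specific cities, and the RSS feeds work well for that. If you try to get all of the data and hit the RSS feeds too heavily, Craigslist will probably ban you. Also, you cannot get all of the data from the Craigslist feeds, because the feeds show most of the data but not all of it. If you don't need 100% reliability, RSS is the easiest way to go.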
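
To make the RSS approach concrete, here is a minimal Python sketch that pulls the title, link, and description out of each feed item using only the standard library. The feed text, URLs, and listings are made up for illustration, and the helper names `local_name` and `parse_feed` are hypothetical. It assumes nothing about the exact feed format beyond `item` elements with `title`, `link`, and `description` children; element names are matched without their XML namespace so that both plain RSS 2.0 and namespaced RSS 1.0 style feeds are handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{uri}item' down to 'item'."""
    return tag.rsplit("}", 1)[-1]


def parse_feed(xml_text):
    """Return a list of dicts with the title, link and description of each item."""
    root = ET.fromstring(xml_text)
    items = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        # Map each child element's bare name to its trimmed text.
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in element}
        items.append({
            "title": fields.get("title", ""),
            "link": fields.get("link", ""),
            "description": fields.get("description", ""),
        })
    return items


# Hypothetical sample feed; the URLs and listings are invented.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>example listings</title>
    <item>
      <title>2br apartment near park</title>
      <link>http://example.org/listing/1</link>
      <description>Sunny unit, available now.</description>
    </item>
    <item>
      <title>studio downtown</title>
      <link>http://example.org/listing/2</link>
    </item>
  </channel>
</rss>"""

listings = parse_feed(SAMPLE_FEED)
for listing in listings:
    print(listing["title"], "->", listing["link"])
```

In a real aggregator the feed text would come from an HTTP request for one city and category, ideally cached between runs so the feed is not fetched more often than necessary.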
