Web scraping real estate ads with R

Published 2024-11-07 04:29:56


As an intern in an economic research team, I was given the task to find a way to automatically collect specific data on a real estate ad website, using R.

I assume that the relevant packages are XML and RCurl, but my understanding of how they work is very limited.

Here is the main page of the website: http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/?f=a&th=1&zz=59000
Ideally, I'd like to construct my database so that each row corresponds to an ad.

Here is the detail of an ad: http://www.leboncoin.fr/ventes_immobilieres/197284216.htm?ca=17_s
My variables are: the price ("Prix"), the city ("Ville"), the surface ("Surface"), the "GES", the "Classe énergie" and the number of rooms ("Pièces"), as well as the number of pictures shown in the ad. I would also like to export the text into a character vector over which I would perform a text mining analysis later on.

I'm looking for any help, link to a tutorial or How-to that would give me a lead over the path to follow.


Answers (2)

对你的占有欲 2024-11-14 04:29:56


You can use the XML package in R to scrape this data. Here is a piece of code that should help.

# DEFINE UTILITY FUNCTIONS

# Function to get links to ads on a given results page
get_ad_links = function(page){
  require(XML)
  # construct the URL of the results page
  url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
  url      = paste(url_base, "?o=", page, "&zz=", 59000, sep = "")
  doc      = htmlTreeParse(url, useInternalNodes = TRUE)

  # extract links to ads on the page
  xp_exp   = "//td/a[contains(@href, 'ventes_immobilieres')]"
  ad_links = xpathSApply(doc, xp_exp, xmlGetAttr, "href")
  return(ad_links)
}

# Function to get ad details from an ad URL
get_ad_details = function(ad_url){
  require(XML)
  # parse the ad URL into an HTML tree
  doc = htmlTreeParse(ad_url, useInternalNodes = TRUE)

  # extract labels and values using XPath expressions
  labels  = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
  values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
  values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
  values  = c(values1, values2)

  # convert to a one-row data frame and attach the labels as column names
  mydf        = as.data.frame(t(values))
  names(mydf) = labels
  return(mydf)
}

Here is how you would use these functions to extract information into a data frame.

# grab ad links from page 1
ad_links = get_ad_links(page = 1)

# grab ad details for first 5 links from page 1
require(plyr)
ad_details = ldply(ad_links[1:5], get_ad_details, .progress = 'text')

This returns the following output:

Prix :     Ville :  Frais d'agence inclus :  Type de bien :  Pièces :  Surface :  Classe énergie :          GES : 
469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
170 000 € 59000 Lille                     <NA>     Appartement      <NA>      50 m2  D (de 151 à 230) D (de 21 à 35)
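The scraped fields arrive as display strings ("469 000 €", "250 m2"), so a small cleaning step is useful before any analysis. A minimal base-R sketch (`clean_number` is my own helper, not part of the answer above): it keeps the leading run of digits and spaces, then drops the spaces.

```r
# Pull the leading number out of fields like "469 000 \u20ac" or "250 m2":
# keep the initial digits-and-spaces run, then strip the spaces.
clean_number = function(x) {
  as.numeric(gsub(" ", "", sub("^([0-9 ]+).*", "\\1", x)))
}

clean_number("469 000 \u20ac")  # 469000
clean_number("250 m2")          # 250
clean_number("8")               # 8
```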

You can easily use the apply family of functions to loop over multiple pages and get the details of all ads. There are two things to be mindful of: one is the legality of scraping the website; the other is to use Sys.sleep in your looping function so that the servers are not bombarded with requests.
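Such a polite multi-page loop could be sketched like this. `scrape_pages`, the mock fetcher, and the 1-second default delay are my own illustrations, not part of the answer; the mock stands in for get_ad_links/get_ad_details so the pattern runs without network access.

```r
# Politely loop over result pages: fetch each page, pause between
# requests, then stack the per-page results into one data frame.
scrape_pages = function(pages, fetch_page, delay = 1) {
  results = lapply(pages, function(p) {
    res = fetch_page(p)
    Sys.sleep(delay)   # pause so the server is not bombarded
    res
  })
  do.call(rbind, results)
}

# usage with a mock fetcher (one-row data frame per page):
mock_fetch = function(p) data.frame(page = p, Prix = "469 000 \u20ac",
                                    stringsAsFactors = FALSE)
ads = scrape_pages(1:3, mock_fetch, delay = 0.01)
nrow(ads)  # 3
```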

Let me know how this works.

停顿的约定 2024-11-14 04:29:56


That's quite a big question, so you need to break it down into smaller ones, and see which bits you get stuck on.

Is the problem with retrieving a web page? (Watch out for proxy server issues.) Or is the tricky bit accessing the useful bits of data from it? (You'll probably need to use XPath for this.)
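For the XPath part, here is a self-contained illustration with the XML package, parsing an inline HTML snippet rather than a live page (the snippet and the `price` class name are made up for the example, and it assumes the XML package is installed):

```r
library(XML)

# Parse an HTML string directly (asText = TRUE) and query it with an
# XPath expression, just as you would query a downloaded ad page.
html  = '<html><body><span class="price"><strong>469 000 \u20ac</strong></span></body></html>'
doc   = htmlParse(html, asText = TRUE)
price = xpathSApply(doc, "//span[@class='price']/strong", xmlValue)
```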

Take a look at the web-scraping example on Rosetta Code and browse these SO questions for more information.
