Web scraping of real estate ads using R
As an intern in an economic research team, I was given the task of finding a way to automatically collect specific data from a real estate ad website, using R.
I assume that the relevant packages are XML and RCurl, but my understanding of how they work is very limited.
Here is the main page of the website: http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/?f=a&th=1&zz=59000
Ideally, I'd like to construct my database so that each row corresponds to an ad.
Here is the detail of an ad: http://www.leboncoin.fr/ventes_immobilieres/197284216.htm?ca=17_s
My variables are: the price ("Prix"), the city ("Ville"), the surface area ("surface"), the "GES", the "Classe énergie" and the number of rooms ("Pièces"), as well as the number of pictures shown in the ad. I would also like to export the ad text into a character vector, on which I would perform a text mining analysis later on.
I'm looking for any help, or a link to a tutorial or how-to, that would point me in the right direction.
Comments (2)
You can use the XML package in R to scrape this data. Here is a piece of code that should help.
Here is how you would use these functions to extract information into a data frame.
This returns the following output.
You can easily use the apply family of functions to loop over multiple pages to get details of all ads. Two things to be mindful of: one is the legality of scraping from the website; two is to use Sys.sleep in your looping function so that the servers are not bombarded with requests. Let me know how this works.
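The code from this answer is not preserved on this page, so what follows is only a minimal sketch of the approach it describes, not the answerer's original functions. It assumes a simplified, hypothetical ad layout (a table with "Prix", "Ville" and "Surface" labels) embedded as strings so the example runs offline; `get_field` and `parse_ad` are made-up helper names, and the XPath expressions would need to be adapted to the real page markup.

```r
library(XML)

# Hypothetical ad markup standing in for downloaded pages. The real site
# layout is an assumption and must be inspected before reusing these XPaths.
ad_pages <- c(
  '<html><body><table>
     <tr><th>Prix</th><td>120 000</td></tr>
     <tr><th>Ville</th><td>Lille</td></tr>
     <tr><th>Surface</th><td>65</td></tr>
   </table></body></html>',
  '<html><body><table>
     <tr><th>Prix</th><td>95 500</td></tr>
     <tr><th>Ville</th><td>Roubaix</td></tr>
   </table></body></html>'
)

# Return the text of the <td> in the row whose <th> matches `label`,
# or NA when the ad does not contain that field.
get_field <- function(doc, label) {
  xpath <- sprintf("//tr[th[normalize-space(.)='%s']]/td", label)
  value <- xpathSApply(doc, xpath, xmlValue)
  if (length(value) == 0) NA_character_ else trimws(value[1])
}

# Parse one ad page into a one-row data frame.
parse_ad <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  data.frame(
    # Strip spaces and other non-digits before converting to a number.
    prix    = as.numeric(gsub("[^0-9]", "", get_field(doc, "Prix"))),
    ville   = get_field(doc, "Ville"),
    surface = as.numeric(get_field(doc, "Surface")),
    stringsAsFactors = FALSE
  )
}

# Loop over the pages, pausing between iterations so that a real server
# would not be flooded with requests.
rows <- lapply(ad_pages, function(page) {
  Sys.sleep(0.1)
  parse_ad(page)
})

# One row per ad.
ads <- do.call(rbind, rows)
print(ads)
```

With real pages, `ad_pages` would hold URLs or downloaded HTML instead of literal strings, and the pause passed to `Sys.sleep` would likely be longer than the token value used here.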
That's quite a big question, so you need to break it down into smaller ones, and see which bits you get stuck on.
Is the problem with retrieving a web page? (Watch out for proxy server issues.) Or is the tricky bit accessing the useful bits of data from it? (You'll probably need to use XPath for this.)
Take a look at the web-scraping example on Rosetta Code and browse these SO questions for more information.
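For the second part, accessing the useful bits of data, here is a small sketch of how XPath could be used from the XML package to count the pictures in an ad and pull out its description text. The markup, the `images` and `content` class names, and the sample text are all hypothetical; only the XPath technique carries over to the real page.

```r
library(XML)

# Hypothetical fragment of an ad page (class names are assumptions).
html <- '<html><body>
  <div class="images">
    <img src="a.jpg"/><img src="b.jpg"/><img src="c.jpg"/>
  </div>
  <div class="content">Maison lumineuse, proche du centre.</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Number of pictures: count the <img> nodes inside the image container.
n_pictures <- length(getNodeSet(doc, "//div[@class='images']//img"))

# Ad text as a character vector, ready for later text mining.
ad_text <- trimws(xpathSApply(doc, "//div[@class='content']", xmlValue))

print(n_pictures)
print(ad_text)
```

Keeping retrieval and extraction as separate steps like this makes it easier to tell whether a failure comes from the network (for example a proxy) or from an XPath expression that no longer matches the page.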