How do I scrape "table-like" data from a stackexchange home page? (in R)

Posted 2024-09-15 07:11:47


I wish to scrape the home page of one of the new stackexchange websites: https://webapps.stackexchange.com/ (just once, and only a few pages, nothing that should bother the servers). If I wanted the data from stackoverflow, I know there is a database dump, but dumps for the new stackexchange sites don't exist yet.

Here is what I want to do.

Step 1: choose URL

library(XML)  # provides readHTMLTable() and htmlTreeParse()
URL <- "https://webapps.stackexchange.com/"

Step 2: read the table

readHTMLTable(URL)  # oops, doesn't work - gives NULL

Step 2 (second attempt): this time, let's try parsing the page tree

htmlTreeParse(URL) # o.k, this reads the data - but it is all in <div> - now what?

So I was able to read the page, but the content is laid out in divs rather than in a table. How can I turn it into a data frame like the one readHTMLTable() would return?
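To make the goal concrete, here is a minimal sketch of the kind of result I am after, using XPath queries from the XML package. The HTML below is a made-up stand-in for the page, and the class names (question-summary, votes, question-hyperlink) are assumptions for illustration only; the real markup would have to be inspected first.

```r
library(XML)

# Hypothetical markup standing in for the real page.
# The class names are assumptions, not taken from the actual site.
html <- '
<html><body>
  <div class="question-summary">
    <div class="votes">3</div>
    <h3><a class="question-hyperlink" href="/questions/1/a">First question</a></h3>
  </div>
  <div class="question-summary">
    <div class="votes">0</div>
    <h3><a class="question-hyperlink" href="/questions/2/b">Second question</a></h3>
  </div>
</body></html>'

# Parse the string as HTML (asText = TRUE means "this is content, not a file name").
doc <- htmlParse(html, asText = TRUE)

# One XPath query per column; each returns one value per question block.
link_path <- "//div[@class='question-summary']//a[@class='question-hyperlink']"
titles <- xpathSApply(doc, link_path, xmlValue)
links  <- xpathSApply(doc, link_path, xmlGetAttr, "href")
votes  <- xpathSApply(doc, "//div[@class='question-summary']/div[@class='votes']", xmlValue)

# Bind the columns into a data frame, similar in shape to readHTMLTable() output.
questions <- data.frame(title = titles,
                        link  = links,
                        votes = as.integer(votes),
                        stringsAsFactors = FALSE)
print(questions)
```

This column-by-column approach only lines up if every question block contains each element exactly once; a block with a missing element would shift the rows.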


一萌ing 2024-09-22 07:11:47


You can do this with the overflowr package, which uses the StackExchange API. Just call the get.questions() function and supply the site prefix. The package is not on CRAN because it isn't complete, but you can download it and build it yourself.

library(overflowr)
questions <- get.questions(50)

For the statistics site, the top 5 most recent questions:

questions <- get.questions(top.n=5, site="stats.stackexchange")

Incidentally, I'd be happy to have more people work on this project, because I don't have any more time to spend on it. Three of the moderators from Stats.Exchange are currently working on it.
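If you would rather not build the package, the same kind of data can in principle be requested from the StackExchange API directly. The following is only a rough sketch and not how overflowr works internally: build_questions_url() is a hypothetical helper, the API version ("2.3") and query parameters are assumptions to check against the API documentation, and the JSON string is an illustrative stand-in for a response, trimmed to three fields, so that the parsing step runs without network access.

```r
library(jsonlite)

# Hypothetical helper: build a request URL for the public StackExchange API.
# The version ("2.3") and the parameter choices are assumptions.
build_questions_url <- function(site, n = 5) {
  sprintf("https://api.stackexchange.com/2.3/questions?order=desc&sort=creation&site=%s&pagesize=%d",
          site, as.integer(n))
}

# Illustrative response body (made up), shaped like the API's "items" wrapper.
sample_json <- '{"items":[
  {"title":"First question","link":"https://webapps.stackexchange.com/questions/1","score":3},
  {"title":"Second question","link":"https://webapps.stackexchange.com/questions/2","score":0}
]}'

# fromJSON() simplifies an array of objects into a data frame.
parsed <- fromJSON(sample_json)
questions <- parsed$items[, c("title", "link", "score")]

print(build_questions_url("webapps", 5))
print(questions)
```

For a live request you would pass the built URL to fromJSON() instead of the sample string; that needs network access and is subject to the API's rate limits.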
