HTML web scraping using a loop / automation
I am performing web scraping in R (using rvest) for a specific set of data on various webpages. All of the webpages are formatted the same, so I can extract the targeted data from its placement on each page, using the correct node, with no problem. However, there are 100 different webpages, all with the same URL except for the very end. Is there a way to use a loop to perform the process automatically?
I am using the following code:
webpage_urls <- paste0("https://exampleurl=", endings)
where endings is a vector of the 100 endings that give the separate webpages. I then run:
htmltemplate <- read_html(webpage_urls)
However, I then receive: Error: `x` must be a string of length 1
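(For reference, the error occurs because read_html() accepts only a single URL string, not a character vector, so passing all 100 URLs at once fails. A minimal check, assuming webpage_urls is built as above:

```r
library(rvest)

# read_html() parses exactly one page, so index a single URL at a time:
htmltemplate <- read_html(webpage_urls[1])  # works: one string of length 1
```

This is why the vectorized call needs to be wrapped in some form of iteration.)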
After this step, I would like to perform the following extraction:
webscraping <- htmltemplate %>%
html_nodes("td") %>%
html_text()
nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}
result <- nth_element(webscraping, 10, 5)
All of the extraction code works when I run it manually for each webpage, but I cannot repeat the process automatically for each webpage.
I am rather unfamiliar with loops/iteration and how to code them. Is there a way to run this extraction process for each webpage, and then store the result of each extraction in a separate vector, so that I can compile them into a table? If not a loop, is there another way to automate the process so that I can get past the error demanding a single string?
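A sketch of one way to automate this with lapply(), reusing the selectors and helper from the question (the names results and result_table are illustrative, and the cbind step assumes every page yields a vector of the same length):

```r
library(rvest)

nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

# Iterate over the URL vector: read each page, pull the <td> text,
# and keep every 5th element starting at position 10.
results <- lapply(webpage_urls, function(url) {
  page  <- read_html(url)
  cells <- page %>% html_nodes("td") %>% html_text()
  nth_element(cells, 10, 5)
})

# Compile into a table: one column per webpage.
result_table <- do.call(cbind, results)
```

Each element of results is the extraction vector for one webpage, so the list itself already serves as the "separate vectors" described above before they are bound into a table.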