HTML web scraping using a loop / automation
I am performing web scraping in R (using rvest) for a specific set of data on various webpages. All of the webpages are formatted the same, so I can extract the targeted data from its placement on each page, using the correct node, with no problem. However, there are 100 different webpages, all with the same URL except for the very end. Is there a way to use a loop to perform the process automatically?
I am using the following code:
webpage_urls <- paste0("https://exampleurl=", endings)
where endings is a vector of the 100 endings that give the separate webpages. I then run:
htmltemplate <- read_html(webpage_urls)
However, I then receive: Error: `x` must be a string of length 1
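(For reference, the error occurs because read_html() accepts only a single URL string, not a character vector, so passing all 100 URLs at once fails. A minimal check, assuming webpage_urls is built as above:

```r
library(rvest)

# read_html() parses exactly one page, so index a single URL at a time:
htmltemplate <- read_html(webpage_urls[1])  # works: one string of length 1
```

This is why the vectorized call needs to be wrapped in some form of iteration.)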
After this step, I would like to perform the following extraction:
webscraping <- htmltemplate %>%
html_nodes("td") %>%
html_text()
nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}
result <- nth_element(webscraping, 10, 5)
All of the extraction code works when I run it manually for each webpage, but I cannot repeat the process automatically for each webpage.
I am rather unfamiliar with loops/iteration and how to code them. Is there a way to run this extraction process for each webpage, and then store the result of each extraction in a separate vector, so that I can compile them into a table? If not a loop, is there another way to automate the process so that I can get past the error demanding a single string?
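A sketch of one way to automate this with lapply(), reusing the selectors and helper from the question (the names results and result_table are illustrative, and the cbind step assumes every page yields a vector of the same length):

```r
library(rvest)

nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

# Iterate over the URL vector: read each page, pull the <td> text,
# and keep every 5th element starting at position 10.
results <- lapply(webpage_urls, function(url) {
  page  <- read_html(url)
  cells <- page %>% html_nodes("td") %>% html_text()
  nth_element(cells, 10, 5)
})

# Compile into a table: one column per webpage.
result_table <- do.call(cbind, results)
```

Each element of results is the extraction vector for one webpage, so the list itself already serves as the "separate vectors" described above before they are bound into a table.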