Get elements (links) inside a Google Drive iframe

Posted on 2025-01-11 20:53:33

I am trying to programmatically download the two zip files on this page:

https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/siren

The two zip files are actually on separate pages, but the hrefs to those pages are inside this page. So, what I want to do is:

  1. get the links to the pages where each of the two zip files resides (they are on a public Google Drive)
  2. download the two zip files to my computer

(yes, I know I can download them manually, but there are more pages I need to download from, so I would like to automate this process)

Unfortunately, I can't even get the first step going. I start by loading the page into rvest and then try to get the element div.flip-entry-info, but this yields no results. I believe this is because it is part of an iframe inside this page. So, how do I access the elements that contain the hrefs pointing to the actual location of these files?

For the second step, I need to find a way to download the data from Google Drive.

For example, one of these two zip files is available at: https://drive.google.com/file/d/1BFN_1n-5EZ3rLrqrqWsAsBR9exjXuUKF/view.

But I have absolutely no clue how to download the file from there. The 'inspect' option in Chrome doesn't work on this page, and selectorgadget doesn't reveal anything useful either.

Can anyone help me to download these files through R? I am totally stuck.


Comments (1)

弥繁 2025-01-18 20:53:33

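The links can be read from inside the iframe: first extract the iframe's src from the outer page, then load that URL as a page of its own.

For background, you can refer to this tutorial: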

https://github.com/yusuzech/r-web-scraping-cheat-sheet/blob/master/README.md#rvest7.2

library(rvest)
library(magrittr)

# Read the outer page and extract the src URL of the embedded iframe
link <- 'https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/siren' %>%
  read_html() %>%
  html_nodes("iframe") %>%
  html_attr("src")

# Load the iframe's own page and get the links to both files
link %>%
  read_html() %>%
  html_nodes(".flip-entry-info") %>%
  html_nodes("a") %>%
  html_attr("href")
#> [1] "https://drive.google.com/file/d/1cio3RzjDO6e78PKdEFSPgdw4tCJ7_VUi/view?usp=drive_web"
#> [2] "https://drive.google.com/file/d/1BFN_1n-5EZ3rLrqrqWsAsBR9exjXuUKF/view?usp=drive_web"
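The links returned above point to Drive viewer pages, and the part that identifies each file is the ID between `/file/d/` and `/view`. As a small sketch, that ID can be pulled out with base R. The helper name `drive_file_id` is made up for illustration, and the pattern assumes links of the `/file/d/<id>/` form shown in the output above.

```r
# Hypothetical helper: extract the file ID from Google Drive "view" links.
# Assumes URLs of the form https://drive.google.com/file/d/<id>/view...
drive_file_id <- function(urls) {
  pattern <- "/file/d/([A-Za-z0-9_-]+)"
  m <- regmatches(urls, regexec(pattern, urls))
  # Each match is c(full_match, id); return NA when the pattern is absent
  vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_, character(1))
}

links <- c(
  "https://drive.google.com/file/d/1cio3RzjDO6e78PKdEFSPgdw4tCJ7_VUi/view?usp=drive_web",
  "https://drive.google.com/file/d/1BFN_1n-5EZ3rLrqrqWsAsBR9exjXuUKF/view?usp=drive_web"
)
ids <- drive_file_id(links)
print(ids)
```

This is only needed when working with bare IDs, since `as_id()` from googledrive also accepts the full URL.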

To download the files, we can use the googledrive package.

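library(googledrive)

# Download the first zip to a temporary file; as_id() accepts the full Drive URL
temp <- tempfile(fileext = ".zip")
drive_download(
  as_id("https://drive.google.com/file/d/1cio3RzjDO6e78PKdEFSPgdw4tCJ7_VUi/view?usp=drive_web"),
  path = temp, overwrite = TRUE
)

# Extract the archive and read the CSV it contains
out <- unzip(temp, exdir = tempdir())
df <- read.csv(out, sep = ",")
str(df)
#> 'data.frame':   44 obs. of  45 variables:
#>  $ ï..: int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ X1 : int  0 1 1 1 1 1 1 1 1 1 ...
#>  $ X2 : int  1 0 0 0 0 0 0 0 0 0 ...
#>  $ X3 : int  1 0 0 0 0 1 0 0 0 0 ...
# Note: the odd first column name (ï..) most likely comes from a byte-order mark
# in the CSV; passing fileEncoding = "UTF-8-BOM" to read.csv() usually avoids it.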
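
Since the question mentions more pages to process, the single download above can be wrapped in a loop. The following is a sketch only: `download_drive_zips` and its `downloader` argument are hypothetical names, not part of googledrive. The default downloader assumes the googledrive package is installed and that the files are publicly shared, in which case `drive_deauth()` should let the request go through without signing in.

```r
# Sketch: download several Drive links into one directory.
# `download_drive_zips` and `downloader` are hypothetical names for illustration.
download_drive_zips <- function(urls, dest_dir = tempdir(),
                                downloader = function(url, path) {
                                  # Assumes googledrive is installed and the file is public
                                  googledrive::drive_deauth()
                                  googledrive::drive_download(
                                    googledrive::as_id(url),
                                    path = path, overwrite = TRUE
                                  )
                                }) {
  # One numbered target path per link
  paths <- file.path(dest_dir, sprintf("drive_file_%d.zip", seq_along(urls)))
  for (i in seq_along(urls)) {
    downloader(urls[i], paths[i])
  }
  paths
}

# Example call (needs network access), using the two links found earlier:
# zips  <- download_drive_zips(c(
#   "https://drive.google.com/file/d/1cio3RzjDO6e78PKdEFSPgdw4tCJ7_VUi/view?usp=drive_web",
#   "https://drive.google.com/file/d/1BFN_1n-5EZ3rLrqrqWsAsBR9exjXuUKF/view?usp=drive_web"
# ))
# files <- unlist(lapply(zips, unzip, exdir = tempdir()))
```

Passing the download step in as an argument keeps the loop easy to try out without network access, for example by substituting a function that just writes a placeholder file.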