Using R: I want to extract some tabular data from a website

Posted 2025-02-04 07:46:51 · 481 characters · 2 views · 0 comments

I'm having some problems scraping data from a website. I do not have a lot of experience with web scraping. My plan is to use R to scrape some data from the following website: https://www.fatf-gafi.org/countries/

More precisely, I want to extract the list of countries that are under some sort of sanctions.

library(XML)
url <- "https://www.fatf-gafi.org/countries/"
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")

But this doesn't return the information I want, because it is not in a table but in nested divs.

Comments (2)

誰ツ都不明白 2025-02-11 07:46:51

This was mainly a test of how JavaScript evaluation works with the V8 package, an embedded JavaScript and WebAssembly engine for R:
https://cran.r-project.org/web/packages/V8/vignettes/v8_intro.html

Create a V8 context, evaluate the downloaded JavaScript in it, and get the value of the countries variable back from V8. It arrives as a nested data frame, hence the unnest(). The last row is filled with NAs, hence the filter().

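library(httr)
library(V8)
library(dplyr)
library(tidyr)

# The country data is served as a separate JavaScript file
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js_content <- content(GET(url), 'text')

# Create a V8 context and evaluate the downloaded script in it
ct <- v8()
ct$eval(js_content)

# Pull the countries variable back into R and tidy it up
ct$get("countries") %>% 
  unnest(cols = c(groups)) %>%   # flatten the nested groups column
  select(c(1:2,4:14,16)) %>%     # keep the 14 columns shown below, by position
  filter(!is.na(name))           # drop the last row, which is all NA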

#> # A tibble: 209 × 14
#>    name       code  FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF
#>    <chr>      <chr> <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>   
#>  1 Afghanist… AF    ""    "mbr" ""    "obs" ""      ""    ""      ""    ""      
#>  2 Albania    AL    ""    ""    ""    ""    ""      ""    ""      ""    ""      
#>  3 Algeria    DZ    ""    ""    ""    ""    ""      ""    ""      ""    "mbr"   
#>  4 Andorra    AD    ""    ""    ""    ""    ""      ""    ""      ""    ""      
#>  5 Angola     AO    ""    ""    ""    ""    "mbr"   ""    ""      ""    ""      
#>  6 Anguilla   AI    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
#>  7 Antigua a… AG    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
#>  8 Argentina  AR    "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"   
#>  9 Armenia    AM    ""    ""    ""    "obs" ""      ""    ""      ""    ""      
#> 10 Aruba Kin… AW    "els" ""    "mbr" ""    ""      ""    ""      ""    ""      
#> # … with 200 more rows, and 3 more variables: MONEYVAL <chr>,
#> #   jurisdiction <chr>, id <chr>
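
To see why the unnest() step is needed without touching the network, here is a minimal offline sketch. The js_sample string is a hypothetical two-country stand-in for the downloaded file; its shape (a groups object nested inside each country) is an assumption inferred from the code above, and the values are taken from the printed output.

```r
library(V8)

# Hypothetical stand-in for the downloaded script (assumed structure)
js_sample <- 'var countries = [
  {"name": "Algeria",   "code": "DZ", "groups": {"FATF": "",    "MENAFATF": "mbr"}},
  {"name": "Argentina", "code": "AR", "groups": {"FATF": "mbr", "MENAFATF": "non"}}
];'

# Evaluate the script in a fresh context and read the variable back into R
ct <- v8()
ct$eval(js_sample)
sample_countries <- ct$get("countries")

# groups comes back as a data frame stored inside a single column
str(sample_countries)

# Base R equivalent of unnest(cols = c(groups)) for this simple case
flat <- cbind(sample_countries[c("name", "code")], sample_countries$groups)
flat
```

Each nested object becomes one data frame column, which is why the result has to be flattened before it can be filtered or selected like an ordinary table.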

浪漫人生路 2025-02-11 07:46:51

This is a tricky parsing job. The information you need is not in the HTML that readLines returns. Instead, the page loads it dynamically with an XHR request. Often an XHR request like this returns a JSON string, but in this case it returns JavaScript in which the information is stored as a variable containing an array of JSON snippets, one per country. You can get at it with some string manipulation and JSON parsing:

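library(httr)
library(jsonlite)

# Download the JavaScript file that holds the country data
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js <- content(GET(url), 'text')

# Keep the text that follows the variable assignment
vars <- strsplit(js, 'var countries = ')[[1]][2]
# Split the array into one string per country and restore the braces
vars <- paste0("{", sub("^\\[\\{", "", strsplit(vars, '\\},\\{')[[1]]), "}")
# Parse the first 209 elements (one per country) and row-bind them
countries <- do.call(rbind, lapply(vars[1:209], 
                      function(x) as.data.frame(jsonlite::parse_json(x))))
# Keep the name and the group columns, selected by position
countries <- countries[c(1, 4:13)]
# Strip the prefix before the dot that flattening added to the column names
names(countries) <- sub('^.*\\.', '', names(countries))

dplyr::tibble(countries)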
#> # A tibble: 209 x 11
#>   name     FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF MONEYVAL
#>   <chr>    <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>    <chr>   
#> 1 Afghani~ ""    "mbr" ""    "obs" ""      ""    ""      ""    ""       ""      
#> 2 Albania  ""    ""    ""    ""    ""      ""    ""      ""    ""       "mbr"   
#> 3 Algeria  ""    ""    ""    ""    ""      ""    ""      ""    "mbr"    ""      
#> 4 Andorra  ""    ""    ""    ""    ""      ""    ""      ""    ""       "mbr"   
#> 5 Angola   ""    ""    ""    ""    "mbr"   ""    ""      ""    ""       ""      
#> 6 Anguilla ""    ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> 7 Antigua~ ""    ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> 8 Argenti~ "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"    "non"   
#> 9 Armenia  ""    ""    ""    "obs" ""      ""    ""      ""    ""       "mbr"   
#> 10 Aruba K~ "els" ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> # ... with 199 more rows
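
As a variation on the string handling above (not the method used in this answer), the whole array literal can be pulled out with one regular expression and parsed in a single call. This is only a sketch: js_sample is a hypothetical two-country stand-in for the downloaded file, and it assumes the array is strict JSON with a nested groups object, which may not hold for the real script.

```r
library(jsonlite)

# Hypothetical stand-in for the downloaded script (assumed structure)
js_sample <- paste0(
  'var countries = [',
  '{"name":"Algeria","code":"DZ","groups":{"FATF":"","MENAFATF":"mbr"}},',
  '{"name":"Argentina","code":"AR","groups":{"FATF":"mbr","MENAFATF":"non"}}',
  '];'
)

# Greedy match: everything from the first "[" to the last "]"
json_text <- regmatches(js_sample, regexpr("\\[.*\\]", js_sample))

# flatten = TRUE turns the nested groups object into groups.* columns
countries <- fromJSON(json_text, flatten = TRUE)
names(countries) <- sub('^.*\\.', '', names(countries))

# Example filter: rows whose FATF column holds "mbr" (presumably "member")
fatf_members <- countries$name[countries$FATF == "mbr"]
fatf_members
```

The same kind of filter can be applied to any of the group columns in the full table once it is in a flat data frame.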