
Using R: I want to extract some tabular data from a website

Posted on 2025-02-04 07:46:51


I'm having some problems scraping data from a website. I do not have a lot of experience with web scraping. My plan is to use R to scrape some data from the following website: https://www.fatf-gafi.org/countries/

More precisely, I want to extract the list of countries that are subject to some sort of sanctions.

library(XML)
url <- "https://www.fatf-gafi.org/countries/"
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")

But this doesn't bring up the intended information, because the data is not in a table; it sits in nested divs.


誰ツ都不明白 2025-02-11 07:46:51


Just to test how JavaScript evaluation works with V8, the embedded JavaScript and WebAssembly engine for R:
https://cran.r-project.org/web/packages/V8/vignettes/v8_intro.html

Create a context, evaluate the downloaded JavaScript, and get the value of the countries variable from V8. It comes back as a nested data frame, hence the unnest(); the last row is filled with NAs, hence the filter.

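library(httr)
library(V8)
library(dplyr)
library(tidyr)
# Script that defines the country data used by the page
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js_content <- content(GET(url), 'text')

# Evaluate the script in a fresh JavaScript context
ct <- v8()
ct$eval(js_content)
# Pull `countries` back into R, flatten the nested `groups` column,
# keep a subset of columns by position and drop the all-NA last row
ct$get("countries") %>% 
  unnest(cols = c(groups)) %>%
  select(c(1:2,4:14,16)) %>%
  filter(!is.na(name))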

#> # A tibble: 209 × 14
#>    name       code  FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF
#>    <chr>      <chr> <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>   
#>  1 Afghanist… AF    ""    "mbr" ""    "obs" ""      ""    ""      ""    ""      
#>  2 Albania    AL    ""    ""    ""    ""    ""      ""    ""      ""    ""      
#>  3 Algeria    DZ    ""    ""    ""    ""    ""      ""    ""      ""    "mbr"   
#>  4 Andorra    AD    ""    ""    ""    ""    ""      ""    ""      ""    ""      
#>  5 Angola     AO    ""    ""    ""    ""    "mbr"   ""    ""      ""    ""      
#>  6 Anguilla   AI    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
#>  7 Antigua a… AG    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
#>  8 Argentina  AR    "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"   
#>  9 Armenia    AM    ""    ""    ""    "obs" ""      ""    ""      ""    ""      
#> 10 Aruba Kin… AW    "els" ""    "mbr" ""    ""      ""    ""      ""    ""      
#> # … with 200 more rows, and 3 more variables: MONEYVAL <chr>,
#> #   jurisdiction <chr>, id <chr>
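
To see what the V8 round trip does without touching the network, here is a minimal sketch on a made-up two-country script. The sample data and the js_sample name are assumptions for illustration only; the real file has more fields and more rows.

```r
library(V8)

# Hypothetical stand-in for the downloaded script: it only defines a
# global `countries` array, which is what the answer reads back from V8.
js_sample <- 'var countries = [
  {"name": "Albania",   "code": "AL", "groups": {"FATF": "",    "MONEYVAL": "mbr"}},
  {"name": "Argentina", "code": "AR", "groups": {"FATF": "mbr", "MONEYVAL": "non"}}
];'

ct <- v8()                        # new JavaScript context
ct$eval(js_sample)                # run the script so `countries` exists in it
countries <- ct$get("countries")  # convert the JavaScript value to an R object

# `groups` arrives as a data frame nested inside the outer data frame,
# which is why the answer calls unnest() on it.
str(countries)
```

The nested groups column is the reason for the unnest() step above: each country object carries its memberships as a sub-object, and the conversion to R preserves that nesting.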


浪漫人生路 2025-02-11 07:46:51


This is a tricky parsing job. The information you need is not in the HTML you are getting from readLines. Instead, the page loads it dynamically using an XHR request. Often an XHR request like this returns a JSON string, but in your case it returns JavaScript in which the information is stored as a variable containing an array of JSON snippets, one for each country. You can get at it with some string manipulation and JSON parsing:

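library(httr)
library(rvest)

# Download the script that holds the country data
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js <- content(GET(url), 'text')

# Keep everything after the variable assignment
vars <- strsplit(js, 'var countries = ')[[1]][2]
# Split into one JSON object per country and restore the braces lost in the split
vars <- paste0("{", sub("^\\[\\{", "", strsplit(vars, '\\},\\{')[[1]]), "}")
# Parse the first 209 objects and stack them into one data frame
countries <- do.call(rbind, lapply(vars[1:209], 
                      function(x) as.data.frame(jsonlite::parse_json(x))))
countries <- countries[c(1, 4:13)]
# Strip the prefix up to the last dot from the column names
names(countries) <- sub('^.*\\.', '', names(countries))

dplyr::tibble(countries)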
#> # A tibble: 209 x 11
#>   name     FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF MONEYVAL
#>   <chr>    <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>    <chr>   
#> 1 Afghani~ ""    "mbr" ""    "obs" ""      ""    ""      ""    ""       ""      
#> 2 Albania  ""    ""    ""    ""    ""      ""    ""      ""    ""       "mbr"   
#> 3 Algeria  ""    ""    ""    ""    ""      ""    ""      ""    "mbr"    ""      
#> 4 Andorra  ""    ""    ""    ""    ""      ""    ""      ""    ""       "mbr"   
#> 5 Angola   ""    ""    ""    ""    "mbr"   ""    ""      ""    ""       ""      
#> 6 Anguilla ""    ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> 7 Antigua~ ""    ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> 8 Argenti~ "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"    "non"   
#> 9 Armenia  ""    ""    ""    "obs" ""      ""    ""      ""    ""       "mbr"   
#> 10 Aruba K~ "els" ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> # ... with 199 more rows
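
As a variation on the string-splitting idea, the sketch below parses the whole array in one call instead of one object at a time. It is not the method used above, and it runs on a made-up sample rather than the real file; it assumes the text after the assignment is strict JSON followed only by a semicolon, which may not hold for the actual script.

```r
library(jsonlite)

# Hypothetical sample standing in for the text of the downloaded .js file.
js <- paste0(
  'var countries = [',
  '{"name":"Albania","code":"AL","groups":{"FATF":"","MONEYVAL":"mbr"}},',
  '{"name":"Argentina","code":"AR","groups":{"FATF":"mbr","MONEYVAL":"non"}}',
  '];'
)

# Drop the assignment prefix and the trailing semicolon, leaving a JSON array.
json_txt <- sub("^\\s*var countries = ", "", js)
json_txt <- sub(";\\s*$", "", json_txt)

# flatten = TRUE expands the nested `groups` object into columns
# named groups.FATF, groups.MONEYVAL, and so on.
countries <- fromJSON(json_txt, flatten = TRUE)
names(countries) <- sub("^groups\\.", "", names(countries))

# Example query: names of the FATF members in the sample.
fatf_members <- countries$name[countries$FATF == "mbr"]
print(fatf_members)
```

If the real script contains anything that is not valid JSON, this shortcut fails and evaluating the script in a JavaScript engine is the more robust route.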


About the author: 世俗缘
