Web scraping with CSS selectors in R yields more data than needed in the nodes
I'm trying to scrape https://nomics.com/ for asset and exchange data. I want to get the rank, name, price, etc. for each of i pages (100 rows per page). I've successfully been able to do the same for all of the exchanges listed there. I'm using the CSS Selector tool in the Chrome (and Brave) browser to obtain the node IDs.
Exchanges MWE
# libraries & dependencies
library(rvest)
library(dplyr)
# website url
base_url <- "https://nomics.com/exchanges/"
# empty list to store page results
datalist <- list()
for (i in 1:6) {
  new_url <- paste0(base_url, i)
  page <- read_html(new_url)
  # one CSS selector per column of the exchanges table
  rank <- page %>% html_nodes(".n-pv12.f6-ns") %>% html_text()
  name <- page %>% html_nodes(".fw5.nowrap.truncate") %>% html_text()
  impact_score <- page %>% html_nodes(".n-ph6.f6") %>% html_text()
  volume <- page %>% html_nodes(".f6-ns .mono") %>% html_text()
  volume_percent <- page %>% html_nodes(".mono.n-dtc-1120") %>% html_text()
  rating <- page %>% html_nodes(".n-dtc-1120+ td") %>% html_text()
  trades <- page %>% html_nodes(".n-pv18.n-dtc-650") %>% html_text()
  trades_percent <- page %>% html_nodes(".mono.n-dtc-1280") %>% html_text()
  pairs <- page %>% html_nodes(".n-pv18.n-dtc-768") %>% html_text()
  fiat <- page %>% html_nodes(".mono.n-dtc-1024") %>% html_text()
  datalist[[i]] <- data.frame(rank, name, impact_score, volume, volume_percent, rating, trades, trades_percent, pairs, fiat)
}
# combine results and store as tibble
big_data <- do.call(rbind, datalist)
tibble(big_data)
When I run this, I get a nice tibble with everything I could wish for.
Cryptocurrency Assets MWE
Now, when I try to do the same for the asset data on the homepage itself, I can't seem to select the correct nodes with the CSS selector tool. I've tried different CSS selector tools, went into the developer tools in Chrome, and tried different browsers. The nodes I select seem to include more data than I need, so I have to wrangle with the results.
For the rank variable, the output is messed up: sometimes another number is appended BEHIND the actual rank itself. For the other variables, I could filter out the unwanted rows.
# base url
base_url <- "https://nomics.com/"
page <- read_html(base_url)
### extract html nodes ###
# extract rank, output is messed up on the numbering
rank <- page %>% html_nodes(".flex-column.n-pl6") %>% html_text() %>% tibble(rank = .)
# extract names
name <- page %>% html_nodes(".overflow-visible") %>% html_text() %>% tibble(name = .)
name <- name[201:300,] # output is messy, filtering required rows
# extract price
price <- page %>% html_nodes(".n-dark-gray.f7-s.fw5") %>% html_text() %>% tibble(price = .)
price <- price[1:100,] # output is messy, filtering required rows
df <- rank %>% bind_cols(name, price)
It seems that only the rank variable is messed up, and I'm at a loss on how to correct this. I guess I could use regex or subsetting to fix it, but I'm hoping there is another way to approach this.
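For illustration, a minimal sketch of the subsetting route, assuming the scraped rows keep their on-page order (an assumption, not verified against the live site), so the rank can be recomputed from the row position instead of parsed out of the messy text:
library(dplyr)
# hypothetical cleanup: overwrite the garbled rank with the row position,
# which matches the on-page rank if the rows arrive in display order
df <- df %>% mutate(rank = row_number())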
Tibbles were just easier for me to wrangle; when I try to loop this the way I did for the exchanges, it obviously messes up the results.
Does anyone here have any idea why the results are messed up on the homepage, while on /exchanges it's a breeze?
Update 1
Trying to loop according to the first answer. Without a loop, it works fine for each separate page. The loop ran but didn't subset or store/append the individual pages in the list for rbind: the list was initialised as datalist while the loop body filled listings. The version below uses one name throughout.
# base url
base_url <- "https://nomics.com/"
# create empty list to store pages in
datalist <- list()
# create loop for i pages
for (i in 1:5) {
  # build the url for page i
  new_url <- paste0(base_url, i)
  message("Retrieving page ", i)
  # pull the JSON payload embedded in the page and keep the first 25 columns
  data <- read_html(new_url) %>%
    html_element('#__NEXT_DATA__') %>%
    html_text() %>%
    jsonlite::parse_json(simplifyVector = TRUE)
  datalist[[i]] <- data$props$pageProps$data$currenciesTicker[, 1:25]
}
big_data <- do.call(rbind, datalist)
tibble(big_data)
View(big_data)
That data is stored in a script tag. You could extract the JSON from the script tag and have a dataframe off the bat.
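A sketch of that idea for a single page, reusing the #__NEXT_DATA__ tag and the currenciesTicker path that also appear in the question's Update 1 (the site's embedded JSON layout may of course change):
library(rvest)
# the Next.js page state is embedded as JSON in a script tag
page_data <- read_html("https://nomics.com/") %>%
  html_element('#__NEXT_DATA__') %>%
  html_text() %>%
  jsonlite::parse_json(simplifyVector = TRUE)
# dataframe off the bat
df <- page_data$props$pageProps$data$currenciesTicker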
You can turn the above into a function and use map_dfr to generate one big dataframe from the desired input URLs. Then subset for whichever columns you want.
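For instance, a minimal sketch along those lines (get_listings is a hypothetical helper name; the URL pattern and the 1:25 column subset mirror the question's Update 1):
library(rvest)
library(purrr)
# hypothetical helper: pull the embedded JSON from one page and
# return the currenciesTicker listings as a data frame
get_listings <- function(url) {
  read_html(url) %>%
    html_element('#__NEXT_DATA__') %>%
    html_text() %>%
    jsonlite::parse_json(simplifyVector = TRUE) %>%
    pluck("props", "pageProps", "data", "currenciesTicker")
}
urls <- paste0("https://nomics.com/", 1:5)
big_data <- map_dfr(urls, get_listings)  # one big dataframe
big_data <- big_data[, 1:25]             # then subset the columns you want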