gsub() only works after copying the vector from dput() output

Posted 2025-01-13 02:35:49 · 1275 characters · 0 views · 0 comments

I have the following problem: I scraped prices from multiple webpages. For some webpages the price is scraped with html_text(), so it contains things such as the currency or ".-" after the price.

Now if I try to remove these things from the price itself using gsub(), it doesn't fully work. Also, if I then try to convert the prices to integers using as.integer(), it gives me just NAs for every price.

The strange thing is that if I use dput() to get the content of the vector shown in the console, and then copy that content and save it as a new vector (like vec <- c("5.-","10.-","9.-")), it suddenly works and I can properly use gsub() and as.integer(). Does anyone know why this could be happening?
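A likely explanation (hedged, since it depends on what the page actually serves) is that the scraped strings contain look-alike Unicode characters, such as an en dash (–, U+2013) instead of the plain ASCII hyphen you type by hand. They print identically in the console, so the vector you retype after copying from dput() is subtly different from the scraped one. A minimal sketch of the effect, with made-up price strings:

```r
scraped <- "5.\u2013"   # "5.-" with an en dash, as a scraper might return it
typed   <- "5.-"        # plain ASCII hyphen, as typed after copying from dput()

gsub(".-", "", scraped, fixed = TRUE)   # pattern not found, string unchanged
gsub(".-", "", typed,   fixed = TRUE)   # works: "5"

as.integer(gsub(".-", "", scraped, fixed = TRUE))  # NA: the dash survived
as.integer(gsub(".-", "", typed,   fixed = TRUE))  # 5
```

Both strings render as "5.-"-ish text on screen, which is why the retyped vector behaves differently from the scraped one.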

The code I use to scrape the prices is:

library(rvest)

input_galaxus2 <- paste0('https://www.galaxus.ch/', input_galaxus$`Galaxus Artikel`)

vec_galaxus <- character(length(input_galaxus2))  # pre-allocate the result vector
i <- 0                                            # initialise the counter

sess <- session(input_galaxus2[1])            # to start the session
for (j in input_galaxus2) {
  sess <- sess %>% session_jump_to(j)         # jump to URL
  i <- i + 1
  try(vec_galaxus[i] <- read_html(sess) %>%   # can read directly from sess
        html_nodes('.sc-algx62-1.cwhzPP') %>%
        html_text())
  Sys.sleep(runif(1, min = 1, max = 2))       # polite random delay between requests
}
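Whatever invisible characters come back from html_text(), a defensive way to clean the scraped vector is to delete everything that is not an ASCII digit before converting. A sketch with made-up values (note this also removes decimal points, so it suits whole-franc prices only):

```r
# Hypothetical scraped values, including a non-breaking space and an en dash
vec_galaxus <- c("CHF\u00a0385.\u2013", "10.-", "9.-")

# Strip every non-digit character, then convert to integer
prices <- as.integer(gsub("[^0-9]", "", vec_galaxus))
prices
#> [1] 385  10   9
```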

The j inside the code refers to the product number that is pasted directly after the base URL, for example 14513912, 14513929 or 8606656.

Edit: so the product links are, for example: https://www.galaxus.ch/14513912, https://www.galaxus.ch/14513929 and https://www.galaxus.ch/8606656



Answer from 棒棒糖 (2025-01-20 02:35:49):
library(tidyverse)
library(rvest)
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

"https://www.galaxus.ch/8606656" %>%
  read_html() %>%
  html_nodes('.sc-algx62-1.cwhzPP') %>%
  html_text() %>%
  str_extract("[0-9]+") %>%
  as.integer()
#> [1] 385


Created on 2022-03-09 by the reprex package (v2.0.0)

To capture the cents too, use as.numeric() together with the wider pattern [0-9.,].
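That variant can be sketched offline as follows (the price strings below are made up; a live Galaxus page would be fed through read_html() as above):

```r
library(stringr)

# Hypothetical scraped strings, one of them with cents
prices_raw <- c("CHF 385.95", "10.\u2013", "9.-")

# Extract the numeric part including "." and ",", then convert
prices <- as.numeric(str_extract(prices_raw, "[0-9.,]+"))
```

This yields 385.95, 10 and 9. One caution: if a price carries a thousands separator (e.g. "1,299.95"), the comma is captured as well and as.numeric() returns NA, so such separators should be stripped first.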
