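gsub() only works after copying the vector from dput() output
I have the following problem: I scraped prices from multiple webpages. For some pages the price is scraped with html_text(), so it contains extra text such as the currency or ".-" after the price.
When I try to remove these parts from the price with gsub(), it doesn't fully work. And if I then try to convert the prices to integers with as.integer(), I get NA for every price.
The strange thing is that if I use dput() to print the vector in the console, copy that output, and save it as a new vector (like vec <- c("5.-", "10.-", "9.-")), it suddenly works, and I can use gsub() and as.integer() properly.
Does anyone know why this could be happening?
The code I use to scrape the prices is: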
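library(rvest)  # session(), session_jump_to(), html_nodes(), html_text(), %>%

input_galaxus2 <- paste0('https://www.galaxus.ch/', input_galaxus$`Galaxus Artikel`)
vec_galaxus <- character(length(input_galaxus2))  # one slot per product URL
i <- 0                                            # index into vec_galaxus
sess <- session(input_galaxus2[1])  # start the session
for (j in input_galaxus2) {
  sess <- sess %>% session_jump_to(j)  # jump to the product URL
  i <- i + 1
  try(vec_galaxus[i] <- read_html(sess) %>%  # can read directly from sess
        html_nodes('.sc-algx62-1.cwhzPP') %>%
        html_text())
  Sys.sleep(runif(1, min = 1, max = 2))  # pause 1 to 2 seconds between requests
}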
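The j inside the code is the product URL, that is, the base URL with a product number pasted right after it, for example 14513912, 14513929 or 8606656.
Edit: so the product links are, for example, https://www.galaxus.ch/14513912, https://www.galaxus.ch/14513929 and https://www.galaxus.ch/8606656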
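One thing worth checking, as an assumption rather than a confirmed diagnosis, is whether the scraped strings contain look-alike or invisible characters (for example an en dash or a non-breaking space) that differ from what gets typed or pasted back in. The sketch below uses made-up values, not actual Galaxus output, and compares code points with utf8ToInt().

```r
# Hypothetical values for illustration only (not real scraped output).
scraped <- "5.\u2013"  # ends in an en dash (U+2013) that looks like "-"
typed   <- "5.-"       # typed by hand with an ASCII hyphen

# The two strings look alike on screen but are not the same.
same <- identical(scraped, typed)  # FALSE

# Inspect the Unicode code points of each string.
scraped_codes <- utf8ToInt(scraped)  # 53 46 8211
typed_codes   <- utf8ToInt(typed)    # 53 46 45

# A pattern written with the ASCII hyphen only matches the typed version.
stripped_scraped <- gsub(".-", "", scraped, fixed = TRUE)  # unchanged
stripped_typed   <- gsub(".-", "", typed, fixed = TRUE)    # "5"
```

If the code points of an element of vec_galaxus differ from those of the hand-typed version, that would explain why the same gsub() call behaves differently on the two vectors.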
Created on 2022-03-09 by the reprex package (v2.0.0)
Use as.numeric and [0-9.,] to get the cents, too.
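A minimal sketch of what the note about as.numeric and [0-9.,] could look like in practice. The input values are hypothetical, and treating a comma as a decimal separator is an assumption; this is not the answer's original code.

```r
# Hypothetical scraped values; the second contains a non-breaking space
# (U+00A0) and an en dash (U+2013) instead of a plain space and hyphen.
scraped <- c("5.-", "CHF\u00a010.\u2013", "9.95")

# Keep only digits, dots and commas, whatever the other characters are.
cleaned <- gsub("[^0-9.,]", "", scraped)

# Drop a trailing separator left over from ".-" style prices.
cleaned <- sub("[.,]$", "", cleaned)

# Assumption: a remaining comma is a decimal separator.
cleaned <- sub(",", ".", cleaned, fixed = TRUE)

prices <- as.numeric(cleaned)  # numeric, so cents are preserved
```

Because the pattern keeps an allow-list of characters instead of removing specific ones, it does not depend on exactly which dash or space the page used.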