In R, how can I combine two sets of data and, after parsing, put each variable into its own column?

Posted on 2025-01-13 04:43:56

library(rvest)
library(dplyr)  # group_by(), mutate(), if_else(), row_number()
library(tidyr)  # pivot_wider()

link1 <- "https://somon.tj/adv/7866644_5-komn-kvartira-3-etazh-79-m2-a-sino/"
link2 <- "https://somon.tj/adv/7985721_2-komn-dom-grandzavod/"

house_link <- c(link1, link2)

house_features = lapply(house_link, function(link) {
  page_data <- 
    tryCatch({
      read_html(link)
      pricing = page_data %>% html_nodes("h1") %>% html_text(trim = T)}, 
    error = function(e) e, 
    warning = function(w) w)

  
  if(!inherits(page_data, "error")) {
    data.frame(
      link = link,
      parameters = page_data %>% html_nodes(".label") %>% html_text(trim = TRUE),
      values = page_data %>% html_nodes(".info") %>% html_text(trim = TRUE)
    )
    list(
      pricing = page_data %>% html_nodes("h1") %>% html_text(trim = T)
    )
  } else {
    NULL
  }
})

But when I use do.call(rbind), it produces an error.

do.call(rbind, house_features) %>% 
  group_by(link, parameters) %>%
  mutate(parameters = if_else(row_number() > 1, paste(parameters,row_number()), parameters)) %>% 
  pivot_wider(id_cols = link, names_from = parameters, values_from = values)

One of the links has 19 variables, while the second contains only 5, so the two results do not line up. How can I put every variable into its own column? Where a listing has no value for a variable (the 14 extra ones, say), I want NA as the value. How should I accomplish this, peeps?

Comments (2)

江湖正好 2025-01-20 04:43:56

Try this approach:

  1. Gather the house features in a list
house_features = lapply(house_link, function(link) {
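  # On failure, tryCatch() returns the condition object instead of a parsed page
  page_data <- tryCatch(read_html(link), error = function(e) e, warning = function(w) w)

  # Skip the link if either an error or a warning was caught
  if(!inherits(page_data, "condition")) {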
    data.frame(
      link = link,
      parameters = page_data %>% html_nodes(".label") %>% html_text(trim = TRUE),
      values = page_data %>% html_nodes(".info") %>% html_text(trim = TRUE)
    )
  } else {
    NULL
  }
})
  2. Bind them with do.call(rbind, ...), make the parameter names unique (they are not; for example, link1 has two parameters called Floor), and then call pivot_wider
do.call(rbind,house_features) %>% 
  group_by(link, parameters) %>%
  mutate(parameters = if_else(row_number()>1, paste(parameters,row_number()), parameters)) %>% 
  pivot_wider(id_cols = link, names_from=parameters,values_from=values)

Output:

  link   `Type of offer` Category House  Floor Area  Condition Internet Toilet Gas   `Front door` Parking Furniture `Floor 2` `Ceiling height` Security Other `Possibility of…
  <chr>  <chr>           <chr>    <chr>  <chr> <chr> <chr>     <chr>    <chr>  <chr> <chr>        <chr>   <chr>     <chr>     <chr>            <chr>    <chr> <chr>           
1 https… from owner      elite    monol… 9 fl… 107 … european… optics   2 bat… trunk armored      parking fully fu… laminate  3 m.             bars on… plas… no              
2 https… from agent      NA       panel… NA    255 … NA        NA       NA     NA    NA           NA      NA        NA        NA               NA       NA    NA              
# … with 4 more variables: Possibility of getting a mortgage <chr>, Possibility of exchange <chr>, Number of floors <chr>, Heating <chr>
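
The reshaping step can also be tried offline on a toy table. This is only a minimal sketch, assuming dplyr and tidyr are installed; the links "a" and "b" and their values are made up for illustration, and the helper column param_name is an assumption that stands in for overwriting parameters directly.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: listing "a" has four parameters
# (with "Floor" appearing twice), listing "b" has only two.
toy <- data.frame(
  link       = c("a", "a", "a", "a", "b", "b"),
  parameters = c("House", "Floor", "Area", "Floor", "House", "Area"),
  values     = c("monolith", "9 floor", "107 m2", "laminate", "panel", "255 m2"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  group_by(link, parameters) %>%
  # Suffix repeated names within a listing: the second "Floor" becomes "Floor 2"
  mutate(param_name = if_else(row_number() > 1,
                              paste(parameters, row_number()),
                              parameters)) %>%
  ungroup() %>%
  # Parameters that a listing lacks are filled with NA by pivot_wider()
  pivot_wider(id_cols = link, names_from = param_name, values_from = values)

print(wide)
```

Listing "b" has no Floor or Floor 2 row, so both cells come out as NA, which is the behaviour asked for in the question.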
相权↑美人 2025-01-20 04:43:56
house_data <- do.call(rbind, house_features) %>% 
  group_by(link, parameters) %>%
  mutate(parameters = if_else(row_number() > 1, paste(parameters,row_number()), parameters)) %>% 
  pivot_wider(
    id_cols = c(link, pricing), names_from = parameters, values_from = values)

What did I find?
The pricing variable is repeated on every row of a listing, as you will see, which adds some redundancy to the data frame. Still, and somewhat surprisingly, lapply ran noticeably faster for me than a traditional for-loop!

You've got a whole ball of wax, I mean. Thanks @langtang :)
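
For id_cols = c(link, pricing) to work, each per-link data frame needs a pricing column of its own. The following is a rough sketch of one way to build it, not the code actually used above: the helper name parse_listing, the example.com link and the inline HTML are all made up, and only the selectors "h1", ".label" and ".info" come from the question.

```r
library(rvest)

# Hypothetical helper: turn one parsed page into a long data frame that
# repeats the link and the <h1> text (stored as `pricing`) on every row.
parse_listing <- function(page, link) {
  data.frame(
    link       = link,
    pricing    = page %>% html_node("h1") %>% html_text(trim = TRUE),
    parameters = page %>% html_nodes(".label") %>% html_text(trim = TRUE),
    values     = page %>% html_nodes(".info") %>% html_text(trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Made-up HTML standing in for a downloaded listing page
page <- read_html('
  <html><body>
    <h1> 5-room apartment </h1>
    <span class="label">House</span><span class="info">panel</span>
    <span class="label">Area</span><span class="info">79 m2</span>
  </body></html>
')

listing <- parse_listing(page, "https://example.com/listing-1")
print(listing)
```

Data frames built this way can be stacked with do.call(rbind, ...) and then reshaped with pricing as a second id column. Note that data.frame() stops with an error if a page has a different number of ".label" and ".info" nodes.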
