使用“ DataTable”嵌入在渲染的HTML文件中的编程提取/scrape .CSV表

发布于 2025-02-07 04:45:24 字数 1116 浏览 0 评论 0原文

是否可以从.html使用.html文件中删除dataTable 的内容或嵌入式.csv文件？例如，使用dt :: datatable（）与选项button = c（'csv'）内部.rmd file> file。

我的第一种方法是使用rvest（例如rvest :: read_html（x ='example..html'）％＆gt;％rvest :: html_table（）找不到任何表，我认为这是因为DataTable不使用HTML表元素显示数据。

以下是.rmd文件的最小示例，我从其中编织了一个'.html'文件，其中包括两个我想刮擦的数据列表。

示例rmd

    ---
    title: "Example for scraping"
    output: html_document
    ---
    
    # Example 1
    
    ```{r}
    tbl1 <- mtcars[1:20, 1:4]
    DT::datatable(tbl1, extensions = 'Buttons', options = list(dom = 'Blfrtip',buttons = c('csv'),paging=FALSE))
    ```
    
    # Example 2
   
    ```{r}
    tbl2 <- mtcars[21:32, 7:11]
    DT::datatable(tbl2, extensions = 'Buttons', options = list(dom = 'Blfrtip',buttons = c('csv'),paging=FALSE))
    ```

这可以在rstudio中编织，也可以使用rmarkDown :: render（input ='example.rmd'），在example.html <中/代码>我想从中刮擦。

谢谢你！

原文

Is it possible to scrape the contents of a datatable, or the embedded .csv file, from a .html file produced using knitr? E.g. using DT::datatable() with the option buttons = c('csv') inside a .Rmd file.

My first approach is to use rvest (e.g. rvest::read_html(x = 'example.html') %>% rvest::html_table()), but it doesn't find any tables, I assume that's because datatable doesn't use the html table element to display data.

Below is a minimal example of an .Rmd file that I knit a '.html' file from that includes two datatables I'd like to scrape.

example.Rmd

    ---
    title: "Example for scraping"
    output: html_document
    ---
    
    # Example 1
    
    ```{r}
    tbl1 <- mtcars[1:20, 1:4]
    DT::datatable(tbl1, extensions = 'Buttons', options = list(dom = 'Blfrtip',buttons = c('csv'),paging=FALSE))
    ```
    
    # Example 2
   
    ```{r}
    tbl2 <- mtcars[21:32, 7:11]
    DT::datatable(tbl2, extensions = 'Buttons', options = list(dom = 'Blfrtip',buttons = c('csv'),paging=FALSE))
    ```

This can be knitted within RStudio, or using rmarkdown::render(input = 'example.Rmd'), resulting in the example.html I'd like to scrape from.

Thank you!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

月牙弯弯 2025-02-14 04:45:24

htmlwidgets将其呈现到htmltables中，将数据存储为JSON，您可以访问和使用以重建数据范围。通过这种方式，您还将解决各种分页的设置。我在下面进行了一个说明性的示例：

library(tidyverse)
library(rvest)
library(jsonlite)

# Get the JSON that is used to create the table
json_data <-
  read_html("example.html") |> 
  html_elements(css = "#example-1") |>
  html_elements("script") |>
  html_text() |>
  jsonlite::fromJSON()

# Extract data and reshape it
table <-
  json_data$x[["data"]] |>
  as_tibble() |>
  t() |> 
  as_tibble()

# Extract variable names from html stored in the json
variable_names <-
  minimal_html(json_data$x[["container"]]) |>
  html_table() %>%
  .[[1]] |>
  colnames()

# Add the correct column names
colnames(table) <- variable_names

table

输出：

# A tibble: 20 × 5
   ``                  mpg   cyl   disp  hp   
   <chr>               <chr> <chr> <chr> <chr>
 1 Mazda RX4           21    6     160   110  
 2 Mazda RX4 Wag       21    6     160   110  
 3 Datsun 710          22.8  4     108   93   
 4 Hornet 4 Drive      21.4  6     258   110  
 5 Hornet Sportabout   18.7  8     360   175  
 6 Valiant             18.1  6     225   105  
 7 Duster 360          14.3  8     360   245  
 8 Merc 240D           24.4  4     146.7 62   
 9 Merc 230            22.8  4     140.8 95   
10 Merc 280            19.2  6     167.6 123  
11 Merc 280C           17.8  6     167.6 123  
12 Merc 450SE          16.4  8     275.8 180  
13 Merc 450SL          17.3  8     275.8 180  
14 Merc 450SLC         15.2  8     275.8 180  
15 Cadillac Fleetwood  10.4  8     472   205  
16 Lincoln Continental 10.4  8     460   215  
17 Chrysler Imperial   14.7  8     440   230  
18 Fiat 128            32.4  4     78.7  66   
19 Honda Civic         30.4  4     75.7  52   
20 Toyota Corolla      33.9  4     71.1  65

更新。
一种使用相同方法获取所有桌子的方法，不是最美丽的代码：

# Get the html of the the scripts
html_data <-
  read_html("test2.html") |> 
  html_elements(css = "script[type='application/json']") |>
  html_text() 

get_table <- function(json) {

  # Read the JSON
  json_data <- jsonlite::fromJSON(json)
  
  # Extract data and reshape it
  table <-
    json_data$x[["data"]] |>
    as_tibble() |>
    t() |> 
    as_tibble()
  
  # Extract variable names from html
  variable_names <-
    minimal_html(json_data$x[["container"]]) |>
    html_table() %>%
    .[[1]] |>
    colnames()
  
  # Add column names
  colnames(table) <- variable_names
  
  table
  
}

tables <- lapply(html_data, get_table)

tables

The htmlwidgets, that renders into htmltables, stores the data as JSON that you could access and use to rebuilt dataframes. In this way you would also get around all kinds of settings for the pagination. I've made an illustrative example of this approach below:

library(tidyverse)
library(rvest)
library(jsonlite)

# Get the JSON that is used to create the table
json_data <-
  read_html("example.html") |> 
  html_elements(css = "#example-1") |>
  html_elements("script") |>
  html_text() |>
  jsonlite::fromJSON()

# Extract data and reshape it
table <-
  json_data$x[["data"]] |>
  as_tibble() |>
  t() |> 
  as_tibble()

# Extract variable names from html stored in the json
variable_names <-
  minimal_html(json_data$x[["container"]]) |>
  html_table() %>%
  .[[1]] |>
  colnames()

# Add the correct column names
colnames(table) <- variable_names

table

Output:

# A tibble: 20 × 5
   ``                  mpg   cyl   disp  hp   
   <chr>               <chr> <chr> <chr> <chr>
 1 Mazda RX4           21    6     160   110  
 2 Mazda RX4 Wag       21    6     160   110  
 3 Datsun 710          22.8  4     108   93   
 4 Hornet 4 Drive      21.4  6     258   110  
 5 Hornet Sportabout   18.7  8     360   175  
 6 Valiant             18.1  6     225   105  
 7 Duster 360          14.3  8     360   245  
 8 Merc 240D           24.4  4     146.7 62   
 9 Merc 230            22.8  4     140.8 95   
10 Merc 280            19.2  6     167.6 123  
11 Merc 280C           17.8  6     167.6 123  
12 Merc 450SE          16.4  8     275.8 180  
13 Merc 450SL          17.3  8     275.8 180  
14 Merc 450SLC         15.2  8     275.8 180  
15 Cadillac Fleetwood  10.4  8     472   205  
16 Lincoln Continental 10.4  8     460   215  
17 Chrysler Imperial   14.7  8     440   230  
18 Fiat 128            32.4  4     78.7  66   
19 Honda Civic         30.4  4     75.7  52   
20 Toyota Corolla      33.9  4     71.1  65

Update.
A way to get all the tables using the same approach, i.e. not the most beautiful code:

# Get the html of the the scripts
html_data <-
  read_html("test2.html") |> 
  html_elements(css = "script[type='application/json']") |>
  html_text() 

get_table <- function(json) {

  # Read the JSON
  json_data <- jsonlite::fromJSON(json)
  
  # Extract data and reshape it
  table <-
    json_data$x[["data"]] |>
    as_tibble() |>
    t() |> 
    as_tibble()
  
  # Extract variable names from html
  variable_names <-
    minimal_html(json_data$x[["container"]]) |>
    html_table() %>%
    .[[1]] |>
    colnames()
  
  # Add column names
  colnames(table) <- variable_names
  
  table
  
}

tables <- lapply(html_data, get_table)

tables

回复收藏 0 原文

~没有更多了~