Extracting text before a table

Posted on 2025-02-08 11:52:52

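I want to extract the one- or two-line subheading that sits above a table in an XML file. For example, on this web page: https://en.wikipedia.org/wiki/Cost_database

There are several tables, and I can extract their headers using library(XML) and the R code provided at https://rud.is/b/2015/08/23/ususe-r-t-tem-te-data-data-und- .

Now I would like to index the line just above each table and get the corresponding text. Is there a good way to do this?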

热风软妹 2025-02-15 11:52:52

You could do this using the rvest package to get the first paragraph in the web page:

SelectorGadget can help identify the right CSS selector for an element on the HTML page.

library(rvest)

read_html("https://en.wikipedia.org/wiki/Cost_database") |> 
  html_element("p:nth-child(1)") |> 
  html_text2()

#> [1] "A cost database is a computerized database of cost estimating information, which is normally used with construction estimating software to support the formation of cost estimates. A cost database may also simply be an electronic reference of cost data."

Created on 2022-06-18 by the reprex package (v2.0.1)
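
The selector above returns the first paragraph of the page, not the text directly above a given table. As a minimal sketch of the latter, one option is to select every table and then step to its nearest preceding sibling element with XPath. The inline HTML below is a hypothetical stand-in so the example runs without network access. On a real page such as the Wikipedia article, the element above a table may be a heading wrapper or something else, so the result should be inspected before it is relied on.

```r
library(rvest)

# Hypothetical page used only for illustration
page <- read_html('
<html><body>
  <h2>Section A</h2>
  <p>Caption for the first table.</p>
  <table><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <h2>Section B</h2>
  <table><tr><th>y</th></tr><tr><td>2</td></tr></table>
</body></html>')

# Select every table on the page
tables <- html_elements(page, "table")

# For each table, take the nearest preceding sibling element and read its text.
# html_element() returns one result per table (missing if there is no sibling).
text_above <- tables |>
  html_element(xpath = "preceding-sibling::*[1]") |>
  html_text2()

text_above
```

Because `html_element()` keeps one entry per table, `text_above` lines up with the output of `html_table(tables)` and the two can be paired by position.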

To get the first paragraph of a Word document:

library(tidyverse)
library(officer)

# Create table
df <- tribble(
  ~col1, ~col2,
  "a", 1,
  "b", 2
)

# Create Word doc with para and table
example_doc <- read_docx() |> 
  body_add_par("Some text.") |> 
  body_add_table(df, style = "table_template")

# Save Word doc
print(example_doc, target = "example.docx")

# Read the doc
content <- read_docx("example.docx") |> 
  docx_summary()

# Get the text before the table
content |> 
  filter(doc_index == 1) |> 
  select(text)
#>         text
#> 1 Some text.

Created on 2022-06-18 by the reprex package (v2.0.1)
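
The filter on `doc_index == 1` works because the example document has exactly one paragraph before the table. A sketch that does not hard-code the position, assuming `docx_summary()` labels table rows with `content_type == "table cell"` and numbers body elements in document order, could locate the first table and take the last paragraph before it. The document contents and the temporary file path here are made up for illustration.

```r
library(officer)

# Hypothetical document: two paragraphs followed by a table
df <- data.frame(col1 = c("a", "b"), col2 = c(1, 2))
doc <- read_docx() |>
  body_add_par("Intro text.") |>
  body_add_par("Text right above the table.") |>
  body_add_table(df, style = "table_template")

path <- tempfile(fileext = ".docx")
print(doc, target = path)

content <- docx_summary(read_docx(path))

# Position of the first table in the document body
table_index <- min(content$doc_index[content$content_type == "table cell"])

# Paragraphs that come before that table
before <- content[content$content_type == "paragraph" &
                    content$doc_index < table_index, ]

# The closest one is the paragraph with the largest doc_index
text_above <- before$text[which.max(before$doc_index)]
text_above
```

For a document with several tables, the same lookup could be repeated for each distinct `doc_index` among the table rows.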
