Extracting text before a table

Posted 2025-02-08 11:52:52

I would like to extract the sub-heading that sits one or two lines above a table in an XML file. As an example, on this webpage: https://en.wikipedia.org/wiki/Cost_database

There are several tables, which I was able to extract along with their headers using library(xml) and the R code provided by https://rud.is/b/2015/08/23/using-r-to-get-data-out-of-word-docs/

Now, I would like to index just one row above the table and get the corresponding text. Is there a good way to do this?

热风软妹 2025-02-15 11:52:52

You could do this using the rvest package to get the first paragraph in the web page:

SelectorGadget can help identify the right CSS selector for an element on the HTML page.

library(rvest)

read_html("https://en.wikipedia.org/wiki/Cost_database") |> 
  html_element("p:nth-child(1)") |> 
  html_text2()

#> [1] "A cost database is a computerized database of cost estimating information, which is normally used with construction estimating software to support the formation of cost estimates. A cost database may also simply be an electronic reference of cost data."

Created on 2022-06-18 by the reprex package (v2.0.1)
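
The question asks for the sub-heading above each table rather than the page's first paragraph. Here is a minimal sketch of one way to get it, assuming the sub-headings are `h2` or `h3` elements. XPath's `preceding::` axis walks backwards in document order, so `[1]` selects the nearest matching heading before each table. The inline HTML is a made-up stand-in for a real page, not the actual Wikipedia markup.

```r
library(rvest)

# Hypothetical page fragment: headings, paragraphs, and two tables
page <- minimal_html('
  <h2>Overview</h2>
  <p>Intro text.</p>
  <h2>Unit costs</h2>
  <p>Costs per unit.</p>
  <table><tr><th>item</th></tr><tr><td>brick</td></tr></table>
  <h3>Labour rates</h3>
  <table><tr><th>trade</th></tr><tr><td>mason</td></tr></table>
')

tables <- html_elements(page, "table")

# For each table, find the closest h2/h3 that precedes it in the document
headings <- tables |>
  html_element(xpath = "preceding::*[self::h2 or self::h3][1]") |>
  html_text2()

headings
```

On a real page, replace `minimal_html(...)` with `read_html(url)`. Which heading levels to match is an assumption that should be checked against the page source.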

To get the first paragraph of a Word document:

library(tidyverse)
library(officer)

# Create table
df <- tribble(
  ~col1, ~col2,
  "a", 1,
  "b", 2
)

# Create Word doc with para and table
example_doc <- read_docx() |> 
  body_add_par("Some text.") |> 
  body_add_table(df, style = "table_template")

# Save Word doc
print(example_doc, target = "example.docx")

# Read the doc
content <- read_docx("example.docx") |> 
  docx_summary()

# Get the text before the table
content |> 
  filter(doc_index == 1) |> 
  select(text)
#>         text
#> 1 Some text.

Created on 2022-06-18 by the reprex package (v2.0.1)
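
The example above hard-codes `doc_index == 1`. Here is a sketch of a more general approach, assuming every table cell returned by `docx_summary()` carries the `doc_index` of its table and that the line of interest is the paragraph directly above it. The document contents and the names `path`, `table_starts` and `above` are made up for illustration.

```r
library(officer)
library(dplyr)

# Build a throwaway document with two captioned tables (illustrative content)
path <- tempfile(fileext = ".docx")
read_docx() |>
  body_add_par("Intro paragraph.") |>
  body_add_par("Caption for table one.") |>
  body_add_table(data.frame(x = 1:2), style = "table_template") |>
  body_add_par("Caption for table two.") |>
  body_add_table(data.frame(y = 3:4), style = "table_template") |>
  print(target = path)

content <- docx_summary(read_docx(path))

# Position of each table in the document body
table_starts <- content |>
  filter(content_type == "table cell") |>
  distinct(doc_index) |>
  pull(doc_index)

# Paragraphs located immediately before a table
above <- content |>
  filter(content_type == "paragraph", doc_index %in% (table_starts - 1)) |>
  select(doc_index, text)

above
```

To look two lines up instead, subtract 2 from `table_starts`. Empty paragraphs between a caption and its table would shift the offset, so inspect `content` first.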
