R: Data structure for ontology and web extraction
I want to extract information from a large website and generate an ontology, something that can be processed with description logic.
What data structure is advisable for the extracted HTML data?
My ideas so far:
- Data frames / table structures
- Sets and relations (e.g. the sets and relations packages)
- Graphs
In the end I want to export the data, and I plan to process it with predicate logic (or description logic) in another programming language.
I want to use R to extract information from HTML pages. But as far as I understand, there is no direct support in R (or its packages) for predicate logic or RDF/OWL.
So I need to do the extraction, use some data structure in the process, and export the data.
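Of the three options, a plain data frame with one row per (subject, predicate, object) triple is arguably the most convenient: it maps one-to-one onto RDF, is easy to fill incrementally while scraping, and serializes trivially. A minimal sketch, assuming the rvest package and a hypothetical page whose PDF links should become PDFDocument instances (the URL and selector are placeholders):

```r
library(rvest)  # HTML scraping; re-exports read_html() from xml2

# One row per triple; the same shape as the example data below.
triples <- data.frame(subject   = character(),
                      predicate = character(),
                      object    = character(),
                      stringsAsFactors = FALSE)

add_triple <- function(df, s, p, o) {
  rbind(df, data.frame(subject = s, predicate = p, object = o,
                       stringsAsFactors = FALSE))
}

# Hypothetical page: every linked PDF becomes a PDFDocument instance
# used at DepartmentA.
page <- read_html("http://example.org/departmentA/index.html")
pdfs <- html_attr(html_elements(page, "a[href$='.pdf']"), "href")

for (href in pdfs) {
  doc     <- basename(href)
  triples <- add_triple(triples, doc, "rdf:type", "PDFDocument")
  triples <- add_triple(triples, doc, "isUsedAt", "DepartmentA")
}
```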
Example data:

```
SomeDocument  rdf:type         PDFDocument
PDFDocument   rdfs:subClassOf  Document
SomeDocument  isUsedAt         DepartmentA
DepartmentA   rdf:type         Department
PersonA       rdf:type         Person
PersonA       headOf           DepartmentA
PersonA       hasName          "John"
```
Where the instance data is "SomeDocument", "DepartmentA" and "PersonA".
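For the export step, N-Triples is a convenient target: one triple per line, no nesting, and it is read by essentially every RDF/OWL toolkit (Jena, the OWL API, SWI-Prolog's semweb library). A sketch that serializes exactly the example above; the http://example.org/onto# namespace is a made-up placeholder for your own terms:

```r
# The example data as a subject/predicate/object data frame.
triples <- data.frame(
  subject   = c("SomeDocument", "PDFDocument", "SomeDocument",
                "DepartmentA", "PersonA", "PersonA", "PersonA"),
  predicate = c("rdf:type", "rdfs:subClassOf", "isUsedAt",
                "rdf:type", "rdf:type", "headOf", "hasName"),
  object    = c("PDFDocument", "Document", "DepartmentA",
                "Department", "Person", "DepartmentA", "\"John\""),
  stringsAsFactors = FALSE)

# N-Triples needs full URIs; expand rdf:/rdfs: to their real
# namespaces and put everything else into our placeholder namespace.
prefixes <- c(rdf  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
              rdfs = "http://www.w3.org/2000/01/rdf-schema#")

expand <- function(term) {
  m <- regmatches(term, regexec("^(rdf|rdfs):(.*)$", term))[[1]]
  if (length(m) == 3) paste0("<", prefixes[[m[2]]], m[3], ">")
  else                paste0("<http://example.org/onto#", term, ">")
}

to_nt <- function(s, p, o) {
  # Quoted values such as "John" stay literals; the rest become URIs.
  obj <- if (grepl('^".*"$', o)) o else expand(o)
  paste(expand(s), expand(p), obj, ".")
}

writeLines(mapply(to_nt, triples$subject, triples$predicate,
                  triples$object),
           "triples.nt")
```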
If it makes sense, some sort of reasoning (but probably not in R):

```
AccessedOften(SomeDocument) => ImportantDocument(SomeDocument)
```
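For what it's worth, in description logic that rule is just the class inclusion AccessedOften ⊑ ImportantDocument, so any real reasoner handles it for free. A single rule like this can also be emulated directly on the triple data frame as one forward-chaining step, if you ever want a quick check inside R:

```r
# AccessedOften(x) => ImportantDocument(x): every instance typed
# AccessedOften additionally gets typed ImportantDocument.
hits <- triples$subject[triples$predicate == "rdf:type" &
                        triples$object    == "AccessedOften"]

if (length(hits) > 0) {
  triples <- rbind(triples,
                   data.frame(subject   = hits,
                              predicate = "rdf:type",
                              object    = "ImportantDocument",
                              stringsAsFactors = FALSE))
}
```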
2 Answers
Most important is: what does your website data look like? For instance, if it already has RDFa in it, you would use an RDFa distiller to get the RDF out; simple; done. Then you could shove the RDF into a triple store. You could augment the website's data by creating your own ontology, which you would query using SPARQL; if your ontology makes equivalent classes to the data you found on your website, then you are golden. Many triple stores can be queried as SPARQL endpoints via URLs alone and return results as XML, so even if R has no SPARQL or OWL ontology packages per se, it doesn't mean you can't query the data at all.
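To that last point: a SPARQL endpoint is plain HTTP returning XML, so base R plus xml2 is enough to query one. A rough sketch against the public DBpedia endpoint (substitute your own triple store's URL and query; the format= parameter is a Virtuoso convenience, the standards-based route is an Accept header):

```r
library(xml2)

endpoint <- "https://dbpedia.org/sparql"   # any SPARQL endpoint
query <- "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/Person> } LIMIT 5"

url <- paste0(endpoint,
              "?query=", URLencode(query, reserved = TRUE),
              "&format=", URLencode("application/sparql-results+xml",
                                    reserved = TRUE))

res <- read_xml(url)

# Bindings live in the SPARQL results XML namespace.
ns   <- c(sr = "http://www.w3.org/2005/sparql-results#")
uris <- xml_text(xml_find_all(res, "//sr:binding[@name = 's']/sr:uri", ns))
print(uris)
```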
If it requires a lot of pages to be downloaded, I would use WGET to download those. To process the files, I would use a Perl script to transform the data into a more readable format, e.g. comma-separated. Then I would turn to some programming language to combine the data in the way you describe; however, I would not go for R in this matter.
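That said, if you would rather keep even the download step in one environment, the WGET part has a direct base-R equivalent; a minimal sketch, with placeholder URLs:

```r
# Fetch a list of pages to disk, one file per URL (the WGET step).
urls <- c("http://example.org/a.html",   # placeholder URLs
          "http://example.org/b.html")

dir.create("pages", showWarnings = FALSE)
for (u in urls) {
  try(download.file(u, destfile = file.path("pages", basename(u)),
                    quiet = TRUE))
}
```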