R: Data structure for ontology and web extraction
I want to extract information from a large website and generate an ontology, something that can be processed with description logic.
What data structure is advisable for the extracted HTML data?
My ideas so far:
- Data frames / table structures
- Sets and relations (e.g. the sets and relations packages)
- Graphs
In the end I want to export the data, and I plan to process it with predicate logic (or description logic) in another programming language.
I want to use R to extract information from HTML pages. But as far as I understand, there is no direct support in R (or its packages) for predicate logic or RDF/OWL.
So I need to do the extraction, use some data structure in the process, and export the data.
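Of the three options, a plain data frame with one row per (subject, predicate, object) triple is arguably the most convenient: it maps one-to-one onto RDF, is easy to fill incrementally while scraping, and serializes trivially. A minimal sketch, assuming the rvest package and a hypothetical page whose PDF links should become PDFDocument instances (the URL and selector are placeholders):

```r
library(rvest)  # HTML scraping; re-exports read_html() from xml2

# One row per triple; the same shape as the example data below.
triples <- data.frame(subject   = character(),
                      predicate = character(),
                      object    = character(),
                      stringsAsFactors = FALSE)

add_triple <- function(df, s, p, o) {
  rbind(df, data.frame(subject = s, predicate = p, object = o,
                       stringsAsFactors = FALSE))
}

# Hypothetical page: every linked PDF becomes a PDFDocument instance
# used at DepartmentA.
page <- read_html("http://example.org/departmentA/index.html")
pdfs <- html_attr(html_elements(page, "a[href$='.pdf']"), "href")

for (href in pdfs) {
  doc     <- basename(href)
  triples <- add_triple(triples, doc, "rdf:type", "PDFDocument")
  triples <- add_triple(triples, doc, "isUsedAt", "DepartmentA")
}
```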
Example data:

```
SomeDocument  rdf:type         PDFDocument
PDFDocument   rdfs:subClassOf  Document
SomeDocument  isUsedAt         DepartmentA
DepartmentA   rdf:type         Department
PersonA       rdf:type         Person
PersonA       headOf           DepartmentA
PersonA       hasName          "John"
```
Where the instance data is "SomeDocument", "DepartmentA" and "PersonA".
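For the export step, N-Triples is a convenient target: one triple per line, no nesting, and it is read by essentially every RDF/OWL toolkit (Jena, the OWL API, SWI-Prolog's semweb library). A sketch that serializes exactly the example above; the http://example.org/onto# namespace is a made-up placeholder for your own terms:

```r
# The example data as a subject/predicate/object data frame.
triples <- data.frame(
  subject   = c("SomeDocument", "PDFDocument", "SomeDocument",
                "DepartmentA", "PersonA", "PersonA", "PersonA"),
  predicate = c("rdf:type", "rdfs:subClassOf", "isUsedAt",
                "rdf:type", "rdf:type", "headOf", "hasName"),
  object    = c("PDFDocument", "Document", "DepartmentA",
                "Department", "Person", "DepartmentA", "\"John\""),
  stringsAsFactors = FALSE)

# N-Triples needs full URIs; expand rdf:/rdfs: to their real
# namespaces and put everything else into our placeholder namespace.
prefixes <- c(rdf  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
              rdfs = "http://www.w3.org/2000/01/rdf-schema#")

expand <- function(term) {
  m <- regmatches(term, regexec("^(rdf|rdfs):(.*)$", term))[[1]]
  if (length(m) == 3) paste0("<", prefixes[[m[2]]], m[3], ">")
  else                paste0("<http://example.org/onto#", term, ">")
}

to_nt <- function(s, p, o) {
  # Quoted values such as "John" stay literals; the rest become URIs.
  obj <- if (grepl('^".*"$', o)) o else expand(o)
  paste(expand(s), expand(p), obj, ".")
}

writeLines(mapply(to_nt, triples$subject, triples$predicate,
                  triples$object),
           "triples.nt")
```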
If it makes sense, some sort of reasoning (but probably not in R):

```
AccessedOften(SomeDocument) => ImportantDocument(SomeDocument)
```
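For what it's worth, in description logic that rule is just the class inclusion AccessedOften ⊑ ImportantDocument, so any real reasoner handles it for free. A single rule like this can also be emulated directly on the triple data frame as one forward-chaining step, if you ever want a quick check inside R:

```r
# AccessedOften(x) => ImportantDocument(x): every instance typed
# AccessedOften additionally gets typed ImportantDocument.
hits <- triples$subject[triples$predicate == "rdf:type" &
                        triples$object    == "AccessedOften"]

if (length(hits) > 0) {
  triples <- rbind(triples,
                   data.frame(subject   = hits,
                              predicate = "rdf:type",
                              object    = "ImportantDocument",
                              stringsAsFactors = FALSE))
}
```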
2 Answers
Most important is: what does your website data look like? For instance, if it already has RDFa in it, you would use an RDFa distiller to get the RDF out; simple; done. Then you could shove the RDF into a triple store. You could augment the website's data by creating your own ontology, which you would query using SPARQL; if your ontology makes equivalent classes to the data you found on your website, then you are golden. Many triple stores can be queried as SPARQL endpoints via URLs alone and return results as XML, so even if R has no SPARQL or OWL ontology packages per se, it doesn't mean you can't query the data at all.
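To that last point: a SPARQL endpoint is plain HTTP returning XML, so base R plus xml2 is enough to query one. A rough sketch against the public DBpedia endpoint (substitute your own triple store's URL and query; the format= parameter is a Virtuoso convenience, the standards-based route is an Accept header):

```r
library(xml2)

endpoint <- "https://dbpedia.org/sparql"   # any SPARQL endpoint
query <- "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/Person> } LIMIT 5"

url <- paste0(endpoint,
              "?query=", URLencode(query, reserved = TRUE),
              "&format=", URLencode("application/sparql-results+xml",
                                    reserved = TRUE))

res <- read_xml(url)

# Bindings live in the SPARQL results XML namespace.
ns   <- c(sr = "http://www.w3.org/2005/sparql-results#")
uris <- xml_text(xml_find_all(res, "//sr:binding[@name = 's']/sr:uri", ns))
print(uris)
```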
If it requires a lot of pages to be downloaded, I would use WGET to download those. To process the files, I would use a Perl script to transform the data into a more readable format, e.g. comma-separated. Then I would turn to some programming language to combine the data in the way you describe; however, I would not go for R in this matter.
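That said, if you would rather keep even the download step in one environment, the WGET part has a direct base-R equivalent; a minimal sketch, with placeholder URLs:

```r
# Fetch a list of pages to disk, one file per URL (the WGET step).
urls <- c("http://example.org/a.html",   # placeholder URLs
          "http://example.org/b.html")

dir.create("pages", showWarnings = FALSE)
for (u in urls) {
  try(download.file(u, destfile = file.path("pages", basename(u)),
                    quiet = TRUE))
}
```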