Parsing large RDF in Python

Posted 2024-09-26 13:59:17 · 101 words · 5 views · 0 comments

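I'd like to parse a very large (about 200MB) RDF file in Python. Should I be using SAX or some other library? I'd appreciate some very basic code that I can build on, say to retrieve a tag.

Thanks in advance.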

汹涌人海 2024-10-03 13:59:17

If you are looking for fast performance, I'd recommend using Raptor with the Redland Python bindings. Raptor is written in C and performs much better than RDFLib, and the Python bindings let you use it without dealing with C yourself.

Another tip for improving performance: forget about parsing RDF/XML and go with another flavor of RDF such as Turtle or NTriples. Parsing ntriples in particular is much faster than parsing RDF/XML, because the ntriples syntax is simpler.

You can transform your RDF/XML into ntriples using rapper, a tool that comes with raptor:

rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples
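
If you would rather drive the conversion from a Python script, a minimal sketch is below. It only wraps the exact rapper command shown above with the standard subprocess module; the function names are my own, and it assumes the rapper executable is installed and on your PATH.

```python
import subprocess


def rapper_command(source):
    """Build the argv list for converting an RDF/XML file to N-Triples."""
    return ["rapper", "-i", "rdfxml", "-o", "ntriples", str(source)]


def convert_to_ntriples(source, target):
    """Run rapper and write its standard output to `target`.

    Assumes the `rapper` executable is installed and on PATH;
    raises CalledProcessError if rapper exits with an error.
    """
    with open(target, "wb") as out:
        subprocess.run(rapper_command(source), stdout=out, check=True)


# Example call (hypothetical file names, same as in the command above):
# convert_to_ntriples("YOUR_FILE.rdf", "YOUR_FILE.ntriples")
```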

The ntriples file will contain triples like:

<s1> <p> <o> .
<s2> <p2> "literal" .

Parsers tend to handle this structure very efficiently. It is also more memory-efficient than RDF/XML because each triple stands alone on its own line, so a parser can process the file one line at a time.
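
To illustrate why this line-per-triple layout is easy to stream, here is a rough pure-Python sketch that reads triples one line at a time. It is not Raptor's parser and not a complete N-Triples implementation: the regular expression and the function name are my own, and it leaves escape sequences undecoded. It only shows that memory use stays flat regardless of file size.

```python
import re

# Simplified pattern for one N-Triples statement. The subject is an IRI or
# blank node, the predicate is an IRI, and the object is an IRI, a blank
# node, or a quoted literal with an optional language tag or datatype.
TRIPLE_RE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+'
    r'(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
    r'\s*\.$'
)


def iter_ntriples(lines):
    """Yield (subject, predicate, object) strings, one per N-Triples line."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(stripped)
        if match is None:
            raise ValueError(f"unsupported N-Triples line: {line!r}")
        yield match.groups()


# A file object iterates line by line, so the whole file is never in memory:
# with open("YOUR_FILE.ntriples", encoding="utf-8") as f:
#     for s, p, o in iter_ntriples(f):
#         print(s, p, o)
```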

The code below is a simple example using the redland python bindings:

import RDF

# The parser name can be "ntriples", "turtle", "rdfxml", ...
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
# Parse the file into the model: file URI first, then the base URI
parser.parse_into_model(model, "file://file_path", "http://your_base_uri.org")
for triple in model:
    print(triple.subject, triple.predicate, triple.object)

The base URI is the URI against which relative URIs inside your RDF document are resolved. You can check the documentation for the Python Redland bindings API here

If you don't care much about performance, use RDFLib. It is simple and easy to use.

如梦亦如幻 2024-10-03 13:59:17

I second the suggestion that you try out rdflib. It's nice for quick prototyping, and the BerkeleyDB backend store scales pretty well into the millions of triples if you don't want to load the whole graph into memory.

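import rdflib

# "Sleepycat" is the store name from the original answer; as far as I know,
# newer rdflib releases register this on-disk store as "BerkeleyDB" instead.
graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("big.rdf")

# print out all the triples in the graph
for subject, predicate, obj in graph:
    print(subject, predicate, obj)

# release the on-disk store when done
graph.close()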

屋檐 2024-10-03 13:59:17

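In my experience, SAX performs very well but is a pain to write. Unless I'm running into problems, I tend to avoid programming with it.

"Very large" depends on the machine's RAM. Assuming your computer has more than 1GB of memory, lxml, pyxml, or some other library will be fine for a 200MB file.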
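
For a sense of what the SAX route involves, here is a minimal sketch that retrieves the text of one tag using the standard library's xml.sax. The handler class, the helper function, and the sample document are my own illustration. It treats RDF/XML as plain XML (it does not build triples), matches tags by their qualified name as written in the file, and assumes the tag is not nested inside itself.

```python
import io
import xml.sax


class TagCollector(xml.sax.ContentHandler):
    """Collect the text content of every element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.values = []
        self._buffer = None  # text chunks while inside a matching element

    def startElement(self, name, attrs):
        if name == self.tag:
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.tag and self._buffer is not None:
            self.values.append("".join(self._buffer))
            self._buffer = None


def collect_tag(source, tag):
    """Parse `source` (a filename or binary file object) and return the texts."""
    handler = TagCollector(tag)
    xml.sax.parse(source, handler)
    return handler.values


# Hypothetical sample document, small enough to read at a glance
SAMPLE = (
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    b' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<rdf:Description rdf:about="http://example.org/a">'
    b'<dc:title>First</dc:title></rdf:Description>'
    b'<rdf:Description rdf:about="http://example.org/b">'
    b'<dc:title>Second</dc:title></rdf:Description>'
    b'</rdf:RDF>'
)

print(collect_tag(io.BytesIO(SAMPLE), "dc:title"))
```

For a real file you would pass the filename instead of the BytesIO object. The state kept in `_buffer` is the part that gets painful as soon as you need more than one tag or any nesting.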

久随 2024-10-03 13:59:17

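I'm not sure SAX is the best solution, but IBM seems to think it works for high-performance XML parsing with Python: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/. Their example RDF dwarfs yours in size (200MB vs. 1.9GB), so their solution should work for you.

The article's examples start out quite basic and pick up quickly.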
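
The article's own code is not reproduced here. As a rough sketch of the general streaming idea, the example below uses the standard library's ElementTree.iterparse, which is my choice for illustration. The function name and sample document are hypothetical, and it assumes a flat RDF/XML layout where each rdf:Description holds only simple property elements, which real RDF/XML does not guarantee.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def iter_descriptions(source):
    """Yield (about, {property tag: text}) for each rdf:Description element."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{RDF_NS}}}Description":
            about = elem.get(f"{{{RDF_NS}}}about")
            props = {child.tag: child.text for child in elem}
            yield about, props
            # Drop the children already processed so memory stays low
            elem.clear()


# Hypothetical sample document, small enough to read at a glance
SAMPLE = (
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    b' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<rdf:Description rdf:about="http://example.org/a">'
    b'<dc:title>First</dc:title></rdf:Description>'
    b'<rdf:Description rdf:about="http://example.org/b">'
    b'<dc:title>Second</dc:title></rdf:Description>'
    b'</rdf:RDF>'
)

for about, props in iter_descriptions(io.BytesIO(SAMPLE)):
    print(about, props)
```

ElementTree reports tags in `{namespace}localname` form, so the property keys here are full namespace-qualified names. The cleared elements still remain attached to the root as empty shells, so for very large files you may also want to prune them from the root.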

尝蛊 2024-10-03 13:59:17

A very fast library for parsing RDF files is LightRdf. It can be installed via pip. Code examples can be found on the project page.

If you want to parse triples from a gzipped RDF file, you can do it like this:

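import lightrdf
import gzip

RDF_FILENAME = 'data.rdf.gz'

# open the gzipped file in binary mode and close it when done
with gzip.open(RDF_FILENAME, 'rb') as f:
    doc = lightrdf.RDFDocument(f, parser=lightrdf.xml.PatternParser)
    # None acts as a wildcard, so this matches every triple
    for (s, p, o) in doc.search_triples(None, None, None):
        print(s, p, o)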

难如初 2024-10-03 13:59:17

For RDF processing in Python, consider using an RDF library such as RDFLib. If you also need a triplestore, more heavyweight solutions are available as well, but may not be needed here (PySesame, neo4jrdf with neo4jpy).

Before writing your own SAX parser for RDF, check out rdfxml.py:

import rdfxml

# read the whole document, closing the file afterwards
with open('data.rdf', 'r') as f:
    data = f.read()
rdfxml.parseRDF(data)

About the author

冰雪梦之恋
