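How do I parse big datasets with RDFLib?
I'm trying to parse several big graphs with RDFLib 3.0. Apparently it handles the first one and dies on the second with a MemoryError. It also looks like MySQL is no longer supported as a store. Can you suggest a way to parse these graphs?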
Traceback (most recent call last):
  File "names.py", line 152, in <module>
    main()
  File "names.py", line 91, in main
    locals()[graphname].parse(filename, format="nt")
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 938, in parse
    location=location, file=file, data=data, **args)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 757, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/nt.py", line 24, in parse
    parser.parse(f)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 124, in parse
    self.line = self.readline()
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 151, in readline
    m = r_line.match(self.buffer)
MemoryError
Comments (1)
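How many triples are in those RDF files? I have tested rdflib, and it won't scale much beyond a few tens of thousands of triples, if you are lucky. There is no way it performs well on files with millions of triples.
The best parser out there is rapper, from the Redland libraries. My first advice is to not use RDF/XML and to go for N-Triples, which is a lighter format than RDF/XML. You can convert RDF/XML to N-Triples with rapper:
rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples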
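When several files need the same conversion, the rapper call above can be scripted. The following is a minimal sketch, assuming rapper is installed and on the PATH and that Python 3 is available. The helper names rapper_command and convert are made up for illustration and are not part of Redland.

```python
import subprocess


def rapper_command(src, in_format="rdfxml", out_format="ntriples"):
    """Build the argument list for the rapper call quoted in the answer."""
    return ["rapper", "-i", in_format, "-o", out_format, src]


def convert(src, dst):
    """Run rapper and write its standard output straight into dst.

    The output goes directly to the file, so the converted data is
    never held in Python's memory.
    """
    with open(dst, "wb") as out:
        subprocess.check_call(rapper_command(src), stdout=out)
```

Calling convert("YOUR_FILE.rdf", "YOUR_FILE.ntriples") would then be equivalent to the shell redirect shown above.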
If you like Python, you can use the Redland Python bindings.
I have parsed fairly big files (a couple of gigabytes) with the Redland libraries with no problem.
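Independently of which library does the parsing, the reason large files stay manageable is that triples are handled one at a time instead of being loaded into an in-memory graph. The sketch below illustrates that idea in plain Python 3. It uses neither rdflib nor Redland, and its regular expression is a deliberately simplified stand-in for a real N-Triples parser: it assumes one triple per line and does not validate escapes or literal syntax. The names iter_triples and objects_of are hypothetical.

```python
import re

# Simplified N-Triples line: subject, predicate, object, closing dot.
# Assumption: one triple per line; terms are returned as raw strings.
TRIPLE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$')


def iter_triples(lines):
    """Yield (subject, predicate, object) strings, one line at a time."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match:
            yield match.groups()


def objects_of(path, predicate):
    """Stream a file and yield only the objects of one predicate."""
    with open(path, encoding="utf-8") as handle:
        for _subject, pred, obj in iter_triples(handle):
            if pred == predicate:
                yield obj
```

Because both functions are generators, memory use depends on the longest line rather than on the size of the file. Lines the pattern does not recognise are silently skipped, which a production parser should not do.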
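Eventually, if you are handling big datasets, you might need to load your data into a scalable triple store. The one I normally use is 4store, which internally uses Redland to parse RDF files. In the long term, I think going for a scalable triple store is what you'll have to do. With it you'll be able to use SPARQL to query your data and SPARQL/Update to insert and delete triples.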
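To give a rough idea of what SPARQL and SPARQL/Update requests look like, here is a small Python 3 sketch that builds a query string and batched INSERT DATA requests from triples whose terms are already serialized in N-Triples syntax. It only produces the request text. How the text is sent to a store (endpoint URL, HTTP parameters) depends on the store and is left out. The names batches, insert_data and SELECT_NAMES, the foaf:name predicate and the example.org IRIs are illustrative assumptions.

```python
from itertools import islice

# Example query: the first ten (person, name) pairs in the store.
SELECT_NAMES = """
SELECT ?person ?name WHERE {
  ?person <http://xmlns.com/foaf/0.1/name> ?name .
} LIMIT 10
"""


def batches(items, size):
    """Group any iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def insert_data(triples):
    """Render one SPARQL Update request for already-serialized terms."""
    body = "\n".join("  %s %s %s ." % tuple(triple) for triple in triples)
    return "INSERT DATA {\n%s\n}" % body


if __name__ == "__main__":
    demo = [
        ("<http://example.org/a>", "<http://xmlns.com/foaf/0.1/name>", '"Alice"'),
        ("<http://example.org/b>", "<http://xmlns.com/foaf/0.1/name>", '"Bob"'),
    ]
    for chunk in batches(demo, 1):
        print(insert_data(chunk))
```

Sending the data in batches keeps each request small, so a large file never has to be turned into a single huge update.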