How to extract triples from the Freebase dump?
I would like to collect a large knowledge base of triples (subject, predicate, object), so I downloaded the Freebase dump from the developers page. It contains triples in RDF format, and I want to decode it into a readable format. How can I achieve this?
Currently I am following nchah's GitHub repository and running the shell script s0-run-parse-extract-triples.sh on Ubuntu in VirtualBox. The script should clean the RDF input data by removing the URLs but keeping the IDs. As the argument I am passing freebase-triples.txt, which is a sample of 100 rows from the 30 GB freebase-rdf-latest.gz.
You can find the code here.
Note that I was getting the message No such file in directory, so I removed line 8 and used $1 instead of $INPUT_FILE in line 17, which took care of this message. In line 21 I also removed the # sign and changed gsed to sed, and I added echo messages to do some tracing.
This is how I am running it: sh s0-run-parse-extract-triples.sh freebase-triples.txt
Check the error that I am getting here.
I am getting the output file fb-rdf-s01-c01, but it still contains the URLs and is unchanged from my input. I am also getting the other file, fb-rdf-s01-c02, but it is empty.
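For reference, the cleaning step I am after (remove the URLs, keep the IDs) could be sketched in Python roughly as follows. This is not the code from nchah's repository, only a minimal sketch under some assumptions: each line of the dump holds subject, predicate and object separated by tabs and ends with a tab and a period, and URIs are wrapped in angle brackets. The names URI_RE, parse_line and iter_triples, and the ID m.0abc in the usage note, are made up for illustration.

```python
import gzip
import re

# Assumption: URIs look like <http://rdf.freebase.com/ns/m.0abc>.
# The pattern keeps only the last segment after the final "/" or "#".
URI_RE = re.compile(r"<https?://[^>]*[/#]([^/#>]+)>")

# Matches the line terminator (optional whitespace, a period, end of string).
TERMINATOR_RE = re.compile(r"\s*\.\s*$")


def parse_line(line):
    """Return (subject, predicate, object) with URI prefixes removed.

    Returns None when the line does not have three tab-separated fields.
    """
    fields = line.rstrip("\n").split("\t", 2)
    if len(fields) < 3:
        return None
    subj, pred, obj = fields
    # Drop the trailing "<tab>." that closes each statement.
    obj = TERMINATOR_RE.sub("", obj)
    return tuple(URI_RE.sub(r"\1", field) for field in (subj, pred, obj))


def iter_triples(path):
    """Stream cleaned triples from a gzip-compressed dump, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            triple = parse_line(line)
            if triple is not None:
                yield triple
```

With a line such as <http://rdf.freebase.com/ns/m.0abc>, <http://rdf.freebase.com/ns/type.object.name> and "Example"@en separated by tabs, parse_line would give back m.0abc, type.object.name and "Example"@en. Reading the .gz file line by line avoids having to unpack the 30 GB archive first.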