Split an RDF dataset into two random datasets
I have an RDF dataset with 100M triples from the WatDiv RDF benchmark. How can I split it into two smaller, randomly sampled datasets of about 50M triples each? It is fine if some triples appear in both datasets.
One idea I have is to sort the triples by predicate, then shuffle and sample triples within each predicate group.
Comments (1)
Since your dataset appears to be in a format with one triple per line, you can simply iterate through the file and keep each line with 50% probability. This yields a random subset containing approximately half of the triples.
For example, here is how to do it with AWK:
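The original command did not survive on this page. A minimal sketch consistent with the explanation that follows might look like this; the function name sample_half, the optional seed argument, and the file names in the usage note are assumptions made for illustration, not part of the original answer.

```shell
# sample_half INPUT [SEED]
# Prints each line of INPUT with probability of about 0.5.
# SEED is optional; when given, the run is reproducible.
sample_half() {
  awk -v seed="${2:-}" '
    BEGIN { if (seed != "") srand(seed); else srand() }  # seed the generator once
    int(2 * rand())                                      # 0 or 1; non-zero prints the line
  ' "$1"
}
```

For example, `sample_half dataset.nt > subset.nt` would write roughly half of the lines of a (hypothetical) dataset.nt to subset.nt. The subset size is only approximately 50% because each line is decided independently.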
Explanation: in the BEGIN block, initialize the random number generator by calling srand(). When called without an argument, it uses the current date and time as the seed; if you want reproducible results, pass a fixed seed instead. Then, for each line, generate a random integer that is either 0 or 1, and print the current line if the value is non-zero (true).
If you want two such random subsets where each triple may appear in both, just run the command twice.
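One caveat when running the command twice: in many AWK implementations, srand() without an argument seeds from the clock at one-second resolution, so two runs started within the same second (for example, in parallel) can select exactly the same lines. A hedged sketch that avoids this by using two distinct fixed seeds is shown here; the function name overlapping_samples and the seed values 1 and 2 are arbitrary choices for illustration.

```shell
# overlapping_samples INPUT OUT1 OUT2
# Draws two independent ~50% samples of INPUT.
# A given line may end up in both outputs, in one, or in neither.
overlapping_samples() {
  # Different seeds keep the two passes from picking identical lines.
  awk -v seed=1 'BEGIN { srand(seed) } int(2 * rand())' "$1" > "$2"
  awk -v seed=2 'BEGIN { srand(seed) } int(2 * rand())' "$1" > "$3"
}
```

A call such as `overlapping_samples dataset.nt subset1.nt subset2.nt` (file names hypothetical) reads the input twice. On average about a quarter of the triples land in both subsets and about a quarter in neither.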
If you want two disjoint random subsets, where each triple is in exactly one of them, you can do it like this:
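The command for this case is also missing from the page. One way to sketch it, assuming a single pass that routes each line to one of two output files, is given here; the function name disjoint_split and the file names in the usage note are hypothetical.

```shell
# disjoint_split INPUT OUT1 OUT2
# Sends each line of INPUT to exactly one of OUT1 or OUT2, chosen with
# probability of about 0.5, so the two outputs partition the input.
disjoint_split() {
  awk -v a="$2" -v b="$3" '
    BEGIN { srand() }                      # pass a fixed seed for reproducibility
    { if (rand() < 0.5) { print > a }      # first half of the coin flip
      else              { print > b } }    # everything else
  ' "$1"
}
```

For example, `disjoint_split dataset.nt part1.nt part2.nt` would produce two files whose line counts add up to the size of the input. Because the split is a per-line coin flip, the two parts are close to 50M triples each for a 100M input, but not exactly equal.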