How to select random DBpedia nodes from SPARQL?
How can I select a random sample from DBpedia using the SPARQL endpoint?
This query
SELECT ?s WHERE { ?s ?p ?o . FILTER ( 1 > bif:rnd (10, ?s, ?p, ?o) ) } LIMIT 10
(found here) seems to work OK on most SPARQL endpoints, but on http://dbpedia.org/sparql it gets cached (so it always returns the same 10 nodes).
If I try it from Jena, I get the following exception:
Unresolved prefixed name: bif:rnd
And I can't find out what the 'bif' namespace is.
Any ideas on how to solve this?
Mulone
Comments (7)
In SPARQL 1.1 you can do:
I don't know offhand how many stores will optimise, or even implement, this yet though.
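A minimal sketch of what the SPARQL 1.1 approach probably looks like, assuming the idea is to order the matches by the standard RAND() function and keep the first few rows. The helper name random_sample_query is hypothetical, and whether a given store re-evaluates RAND() for every row is implementation-dependent.

```python
def random_sample_query(limit: int = 10) -> str:
    """Build a SPARQL 1.1 query that orders matches by RAND().

    Hypothetical helper: the name and the open ?s ?p ?o pattern are
    assumptions made for illustration.
    """
    if limit < 1:
        raise ValueError("limit must be positive")
    return f"SELECT ?s WHERE {{ ?s ?p ?o }} ORDER BY RAND() LIMIT {limit}"


print(random_sample_query(10))
```

The resulting string can be sent to any SPARQL 1.1 endpoint with whatever client you already use.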
bif:rnd is not standard SPARQL and is therefore not portable across SPARQL endpoints. You can use LIMIT, ORDER BY and OFFSET to simulate a random sample with a standard query. Something like ... where some_random_number is a number generated by your application. This should avoid the caching problem, but the query is quite expensive anyway and I don't know whether public endpoints will support it.
Try to avoid completely open patterns like ?s ?p ?o and your query will be much more efficient.
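A rough sketch of the LIMIT / ORDER BY / OFFSET idea, with the application generating some_random_number. The helper name offset_sample_query and the max_offset parameter (an upper bound on the number of matches that your application would have to know or estimate) are assumptions, not part of the original answer.

```python
import random


def offset_sample_query(max_offset: int, limit: int = 10,
                        pattern: str = "?s ?p ?o", rng=random) -> str:
    """Build a standard query that starts reading at a random offset.

    max_offset is an assumed upper bound on the number of matching rows;
    rng is anything with a randrange() method (the random module by default).
    """
    some_random_number = rng.randrange(max_offset)
    return (
        f"SELECT ?s WHERE {{ {pattern} }} "
        f"ORDER BY ?s OFFSET {some_random_number} LIMIT {limit}"
    )


print(offset_sample_query(1000))
```

Because the offset changes on every call, the query text changes too, which is what sidesteps the cache. Note that this returns a contiguous window of the ordered results starting at a random position, not independently drawn rows.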
bif:rnd is a Virtuoso-specific extension and will thus only work against Virtuoso SPARQL endpoints.
bif is the prefix for Virtuoso Built-In Functions, which lets any Virtuoso function be called from SPARQL; rnd is a Virtuoso function that returns random numbers.
I encountered the same problem and none of the solutions here addressed my issue. Here is my solution; it was non-trivial and is quite a hack. It works for DBpedia as of now and may work for other SPARQL endpoints, but it is not guaranteed to work in future releases.
DBpedia uses Virtuoso, which supports an undocumented argument to the RAND function; the argument effectively specifies the range to use for the PRNG. The game is to trick Virtuoso into believing that the input argument cannot be statically evaluated before each result row is computed, forcing the program to evaluate RAND() for every binding.
The magic happens in rand(1 + strlen(str(?s))*0), which generates the equivalent of rand() but forces it to run on every match by exploiting the fact that the program cannot predict the value of an expression that involves some variable (in this case, we just compute the length of the IRI as a string). The actual expression is not important, since we multiply it by 0 to ignore it completely, then add 1 to make rand execute normally.
This only works because the developers did not go this far in their static evaluation of expressions. They could have easily written a branch for "multiply by zero", but alas they did not :)
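A sketch of how the rand(1 + strlen(str(?s))*0) expression might be wired into a full query, assuming the value is bound to a variable (called ?rid here) and used for ordering. The helper name and the example pattern are hypothetical; only the rand(...) expression itself comes from the answer.

```python
def virtuoso_random_order_query(pattern: str, var: str = "?s",
                                limit: int = 10) -> str:
    """Build a Virtuoso-oriented query that orders rows by a per-row rand().

    The expression depends on `var`, so (per the answer above) Virtuoso
    cannot fold it into a constant; multiplying by 0 and adding 1 leaves
    the argument equal to 1 for every row.
    """
    return (
        "SELECT * WHERE { "
        f"{pattern} "
        f"BIND(rand(1 + strlen(str({var}))*0) AS ?rid) "
        "} "
        f"ORDER BY ?rid LIMIT {limit}"
    )


# "?s a ?type ." is only a placeholder; substitute your own pattern.
print(virtuoso_random_order_query("?s a ?type ."))
```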
None of the above methods work with Jena/Fuseki, so I did it another way:
Obviously this doesn't select random triples, but the set of the first k MD5-ordered subjects should have the relevant features of a statistically significant sample (i.e. the sample is representative of the entire population and there is no particular selection bias).
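A sketch of the MD5-ordering idea, assuming the query orders subjects by MD5(STR(?s)) and keeps the first k. Both helper names are hypothetical, and the DISTINCT is my assumption. The second function emulates the same ordering client-side with hashlib, which makes it easy to see that the sample is deterministic.

```python
import hashlib


def md5_sample_query(k: int) -> str:
    """Build a standard SPARQL 1.1 query returning the first k MD5-ordered subjects."""
    return (
        "SELECT DISTINCT ?s WHERE { ?s ?p ?o } "
        f"ORDER BY MD5(STR(?s)) LIMIT {k}"
    )


def md5_sample_local(subjects, k: int):
    """Emulate the same selection on a local collection of IRI strings."""
    def md5_hex(iri: str) -> str:
        return hashlib.md5(iri.encode("utf-8")).hexdigest()

    return sorted(set(subjects), key=md5_hex)[:k]


print(md5_sample_query(100))
```

Since the hash of an IRI never changes, repeated runs over the same data return the same k subjects; the selection is pseudo-random across subjects rather than across runs.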
After much experimentation I ended up with the following solution, which combines a hash, to avoid RAND() being statically evaluated, with RAND(), to avoid the selection bias caused by using only a hash.
Here it is used to select a random valley glacier from Wikidata:
Try it (the service caches responses; you can bypass this by adding a new comment to the query before running it)
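A sketch of how the hash and RAND() might be combined, assuming the random value is concatenated with the item IRI and hashed with the standard SHA512 function, then used for ordering. The helper name, the ?random variable and the nonce comment are my assumptions. The class identifier in the usage line is a placeholder, not a real Wikidata ID; look up the ID for "valley glacier" yourself.

```python
import uuid


def hashed_random_query(pattern: str, var: str = "?item") -> str:
    """Build a query that orders rows by SHA512(RAND() + IRI) and keeps one.

    A fresh '#' comment is prepended on each call so that a caching
    service sees a different query text every time.
    """
    nonce = uuid.uuid4().hex
    return (
        f"# cache-buster {nonce}\n"
        f"SELECT {var} WHERE {{\n"
        f"  {pattern}\n"
        f"  BIND(SHA512(CONCAT(STR(RAND()), STR({var}))) AS ?random)\n"
        "}\n"
        "ORDER BY ?random LIMIT 1"
    )


# wd:Q_PLACEHOLDER is NOT a real identifier; replace it with the class you want.
print(hashed_random_query("?item wdt:P31 wd:Q_PLACEHOLDER ."))
```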
How about this one?
<SHORT_OR_LONG::bif:rnd> may be better than <bif:rnd>.
(http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideRandomSampleAllTriples)
You simply bind a random id (?rid) to each row of bindings (?s ?p ?o), then order the results by the random id.
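A sketch of the bind-then-order description above, assuming the <SHORT_OR_LONG::bif:rnd> function is called with the same argument shape as the bif:rnd(10, ?s, ?p, ?o) call in the question. The helper name and the rnd_range default are assumptions, and this has not been checked against a live Virtuoso endpoint.

```python
def virtuoso_rid_query(limit: int = 10, rnd_range: int = 1000000) -> str:
    """Build a Virtuoso-only query that binds a random id per row and sorts by it.

    Writing the function as a full IRI in angle brackets means a client-side
    parser has no 'bif' prefix to resolve.
    """
    return (
        "SELECT ?s ?p ?o WHERE { "
        "?s ?p ?o . "
        f"BIND(<SHORT_OR_LONG::bif:rnd>({rnd_range}, ?s, ?p, ?o) AS ?rid) "
        "} "
        f"ORDER BY ?rid LIMIT {limit}"
    )


print(virtuoso_rid_query())
```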