How do I select random DBpedia nodes with SPARQL?

Posted on 2024-11-01 14:41:19

How can I select a random sample from DBpedia using the SPARQL endpoint?

This query

SELECT ?s WHERE { ?s ?p ?o . FILTER ( 1 > bif:rnd (10, ?s, ?p, ?o) ) } LIMIT 10

(found here) seems to work OK on most SPARQL endpoints, but on http://dbpedia.org/sparql it gets cached (so it always returns the same 10 nodes).

If I try it from Jena, I get the following exception:

Unresolved prefixed name: bif:rnd

And I can't find out what the 'bif' namespace is.

Any idea on how to solve this?

Mulone


Comments (7)

岁月打碎记忆 2024-11-08 14:41:19

In SPARQL 1.1 you can do:

SELECT ?s
WHERE {
  ?s ?p ?o
}
ORDER BY RAND()
LIMIT 10

I don't know offhand how many stores optimise this, or even implement it yet.

[see comment below, this doesn't quite work]

An alternative is:

SELECT (SAMPLE(?s) AS ?ss)
WHERE { ?s ?p ?o }
GROUP BY ?s

But I'd think that's even less likely to be optimised.

空心↖ 2024-11-08 14:41:19

bif:rnd is not part of the SPARQL standard and is therefore not portable across SPARQL endpoints. You can use ORDER BY, OFFSET and LIMIT to simulate a random sample with a standard query. Something like ...

SELECT * WHERE { ?s ?p ?o } 
ORDER BY ?s OFFSET $some_random_number$ LIMIT 10

Here some_random_number is a number generated by your application. This should avoid the caching problem, but the query is quite expensive anyway, and I don't know whether public endpoints will support it.

Try to avoid completely open patterns like ?s ?p ?o and your query will be much more efficient.
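
A minimal sketch of the application-side part, in Python. The helper name random_offset_query and the total_rows parameter are illustrative assumptions, not part of the answer above: total_rows stands for the (possibly estimated) number of rows the pattern matches, which you would have to obtain separately, for example with a COUNT query.

```python
import random

QUERY_TEMPLATE = (
    "SELECT * WHERE {{ ?s ?p ?o }}\n"
    "ORDER BY ?s OFFSET {offset} LIMIT {limit}"
)

def random_offset_query(total_rows, limit=10, rng=random):
    """Return the query text with a randomly chosen OFFSET.

    total_rows is an assumption: the (estimated) number of rows the
    pattern matches. The offset is capped so that a full page of
    `limit` rows still fits whenever total_rows >= limit.
    """
    max_offset = max(total_rows - limit, 0)
    offset = rng.randint(0, max_offset)  # inclusive on both ends
    return QUERY_TEMPLATE.format(offset=offset, limit=limit)
```

Calling random_offset_query(1_000_000) yields a different query string on most calls, which is also what sidesteps the endpoint's cache. Note that the rows returned are 10 consecutive rows in ?s order, not 10 independent draws.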

2024-11-08 14:41:19

bif:rnd is a Virtuoso-specific extension and will thus only work against Virtuoso SPARQL endpoints.

bif is the prefix for Virtuoso Built In Functions which enable any Virtuoso function to be called in SPARQL, with rnd being a Virtuoso function for returning random numbers.

风渺 2024-11-08 14:41:19

I encountered the same problem and none of the solutions here addressed my issue. Here is my solution; it was non-trivial and quite a hack. This works for DBpedia as of now, and may work for other SPARQL endpoints, but it is not guaranteed to work for future releases.

DBpedia uses Virtuoso, which supports an undocumented argument to the RAND function; the argument effectively specifies the range to use for the PRNG. The game is to trick Virtuoso into believing that the input argument cannot be statically evaluated before each result row is computed, forcing the program to evaluate RAND() for every binding:

select * {
    ?s dbo:isPartOf ?o .  # Whatever your pattern is
    bind(rand(1 + strlen(str(?s))*0) as ?rid)
} order by ?rid

The magic happens in rand(1 + strlen(str(?s))*0) which generates the equivalent of rand(); but forces it to run on every match by exploiting the fact that the program cannot predict the value of an expression that involves some variable (in this case, we just compute the length of the IRI as a string). The actual expression is not important, since we multiply it by 0 to ignore it completely, then add 1 to make rand execute normally.

This only works because the developers did not go this far in their static-code-evaluation of expressions. They could have easily written a branch for "multiply by zero", but alas they did not :)
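
To see why per-row evaluation matters, here is a small Python simulation. It is not Virtuoso code, and the function names are made up for illustration. If the random value is computed once and reused as the sort key for every row, all keys are equal and the sort leaves the rows in their original order; if it is computed per row, the order is shuffled.

```python
import random

def order_by_constant_rand(rows, rng):
    """Mimic an engine that evaluates RAND() once for the whole query."""
    key = rng.random()  # one value shared by every row
    # All keys are equal, and Python's sort is stable, so order is unchanged.
    return sorted(rows, key=lambda _row: key)

def order_by_per_row_rand(rows, rng):
    """Mimic an engine that evaluates RAND() for every binding."""
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _key, row in keyed]

rows = [f"node{i}" for i in range(10)]
unchanged = order_by_constant_rand(rows, random.Random(42))
shuffled = order_by_per_row_rand(rows, random.Random(42))
```

Here unchanged equals rows, while shuffled contains the same items in a different order. The first case is the "always the same nodes" behaviour, and the second is what the rand(1 + strlen(str(?s))*0) trick aims to force.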

临风闻羌笛 2024-11-08 14:41:19

None of the above methods works with Jena/Fuseki, so I've done it in another way:

SELECT DISTINCT ?s ?p ?o
{
  ?s ?p ?o.
  BIND ( MD5 ( STR ( ?s ) ) AS ?rnd)
}
ORDER BY ?rnd ?p
LIMIT 100

Obviously this doesn't select random triples, but the set of the first k MD5-ordered subjects should have relevant features of a statistically significant sample (i.e. the sample is representative of the entire population, there is no particular selection bias).
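
The same idea can be sketched outside SPARQL. The Python below is only an illustration (the helper name md5_ordered_sample is hypothetical): it orders distinct subject IRIs by their MD5 hex digest and keeps the first k. It also makes one property of the approach easy to see: the ordering depends only on the IRIs, so the same data always yields the same sample.

```python
import hashlib

def md5_ordered_sample(subjects, k):
    """Return the k distinct subjects whose MD5 hex digests sort first."""
    def digest(iri):
        return hashlib.md5(iri.encode("utf-8")).hexdigest()
    return sorted(set(subjects), key=digest)[:k]

subjects = [f"http://example.org/resource/{i}" for i in range(1000)]
sample = md5_ordered_sample(subjects, 5)
```

The example.org IRIs are placeholders. Running the function twice, or on the same IRIs in a different input order, returns the identical sample.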

城歌 2024-11-08 14:41:19

After much experimentation I have ended up with the following solution, a combination of using a hash to avoid RAND() being statically-evaluated and RAND() to avoid the selection biases caused by only using a hash.

SELECT ?s WHERE {
  ?s ?p ?o .
  BIND(SHA512(CONCAT(STR(RAND()), STR(?s))) AS ?random) .
} ORDER BY ?random
LIMIT 1
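
As a rough Python analogue of the BIND above (a sketch with hypothetical helper names, assuming a single random salt per query), the salt plays the role of STR(RAND()) and the SHA-512 of salt plus IRI is the sort key. Because the key depends on the subject, it differs per row even if the engine computes the random value only once; because it depends on the salt, the ordering changes from run to run.

```python
import hashlib
import random

def salted_hash_order(subjects, salt):
    """Order subjects by SHA-512 of salt + subject."""
    def key(iri):
        return hashlib.sha512((salt + iri).encode("utf-8")).hexdigest()
    return sorted(subjects, key=key)

def random_pick(subjects, rng=random):
    """Pick one subject: the first in a freshly salted hash ordering."""
    salt = str(rng.random())  # stands in for STR(RAND())
    return salted_hash_order(subjects, salt)[0]

subjects = [f"http://example.org/resource/{i}" for i in range(100)]
picked = random_pick(subjects)
```

With a fixed salt the ordering is reproducible, which is handy for debugging; with a fresh salt per call, the pick varies.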

Here it is used to select a random valley glacier from Wikidata:

SELECT ?item ?itemLabel ?random WHERE {
  ?item wdt:P31 wd:Q11762356 .
  BIND(SHA512(CONCAT(STR(RAND()), STR(?item))) AS ?random) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en" . }
} ORDER BY ?random
LIMIT 1

Try it (the service caches responses; you can bypass this by adding a new comment to the query before running it).
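
If you send the query from code, the "new comment" trick can be automated. A minimal sketch in Python, assuming the cache is keyed on the query text (the helper name with_cache_buster is hypothetical): it prefixes the query with a SPARQL comment line containing a fresh UUID, so every request has different text while the query itself is unchanged.

```python
import uuid

def with_cache_buster(query):
    """Prefix the query with a unique SPARQL comment line ('#' starts a comment)."""
    return f"# cache-buster {uuid.uuid4()}\n{query}"

query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
first = with_cache_buster(query)
second = with_cache_buster(query)
```

Here first and second differ only in their comment line.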

复古式 2024-11-08 14:41:19

SELECT ?s WHERE { 
    ?s ?p ?o . 
    bind(<SHORT_OR_LONG::bif:rnd> (10, ?s, ?p, ?o) as ?rid)
}
ORDER BY ?rid
LIMIT 10

How about this one?

<SHORT_OR_LONG::bif:rnd> may be better than <bif:rnd>.
(http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideRandomSampleAllTriples)

You simply bind a random id (?rid) to each solution row (?s ?p ?o), then order the results by that id.
