Strange spatial gaps in a DBpedia query using the Python SPARQL-Wrapper

Posted on 2025-01-14 14:17:56

I'm trying to query all Wikipedia articles about places in the United Kingdom (they have to be geolocated). I'm using the SPARQL wrapper for Python to access the coordinates, article link, class hierarchy and other metadata. The query looks like this:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?id ?label ?link ?lat ?long ?cat_lab ?cat_lab2 ?nchar
WHERE  {  

?uri a dbo:Place .        

?uri rdfs:label ?label . FILTER(lang(?label) = 'en') .

?uri dbo:wikiPageID ?id .

?uri rdf:type ?cat . FILTER (?cat LIKE <http://dbpedia.org/ontology/%>). 
?cat rdfs:subClassOf ?cat2 . FILTER (?cat2 LIKE <http://dbpedia.org/ontology/%> AND
                                     ! ?cat2 LIKE <http://dbpedia.org/ontology/Place> AND
                                     ! ?cat2 LIKE <http://dbpedia.org/ontology/Location>) .
?cat rdfs:label ?cat_lab . FILTER(lang(?cat_lab) = 'en')
?cat2 rdfs:label  ?cat_lab2 . FILTER(lang(?cat_lab2) = 'en')

?uri geo:lat ?lat . 

?uri geo:long ?long . 

?uri dbo:wikiPageLength ?nchar .

?uri prov:wasDerivedFrom ?link .

FILTER(?long >= -1.1 AND ?long <= 1.8 AND ?lat >= 51.1 AND ?lat <= 54.27)

} 

LIMIT 10000
OFFSET 0

I page through the results by increasing the offset of my query in steps of 10,000 (because of the query limit of 10,000 records per query) and then append the pages to a single data frame. This works fine, though I get a lot of duplicate records, but that's another issue.
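
The paging described above can be sketched roughly as follows. This is an illustrative reconstruction, not the exact script used: the helper names (`build_page_query`, `fetch_all_pages`, `run_dbpedia_query`) are made up for this example, the endpoint URL `https://dbpedia.org/sparql` is an assumption, and `base_query` is assumed to be the query above without its LIMIT and OFFSET lines.

```python
def build_page_query(base_query, limit, offset):
    """Append LIMIT/OFFSET to a query that does not contain them yet."""
    return f"{base_query.rstrip()}\nLIMIT {limit}\nOFFSET {offset}"


def fetch_all_pages(run_query, base_query, page_size=10000, max_pages=1000):
    """Collect rows page by page until a page comes back short."""
    rows = []
    for page in range(max_pages):
        offset = page * page_size
        batch = run_query(build_page_query(base_query, page_size, offset))
        rows.extend(batch)
        if len(batch) < page_size:
            break  # last (partial or empty) page reached
    return rows


def run_dbpedia_query(query, endpoint="https://dbpedia.org/sparql"):
    """Run one query with SPARQLWrapper and flatten the JSON bindings."""
    # Third-party package, imported here so the helpers above work without it.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper(endpoint)  # endpoint URL is an assumption
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # Each binding maps a variable name to {"type": ..., "value": ...}.
    return [{name: cell["value"] for name, cell in row.items()} for row in bindings]
```

With these helpers, `rows = fetch_all_pages(run_dbpedia_query, base_query)` yields a list of dicts that can be loaded into a data frame. Note that the query as posted has no ORDER BY clause, so a loop like this relies on whatever row order the endpoint happens to return from one page to the next.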

However, when I plot the data on a map, the records appear to be incomplete: there are two very distinct stripes devoid of any records across the whole study area. This is unlikely to be the real spatial distribution of the data, so I suspect it has to do with the way the database is queried.

[Figure: study area with the two stripes of missing data; each dot is a geolocated Wikipedia article]

If I shrink the queried spatial bounds, the stripes persist but appear in a different place; sometimes there is only one stripe. As I'm quite inexperienced with SPARQL, I'm out of ideas as to how these strange results can occur. Maybe one of you can give me a hint as to why the data might look like this.
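
One way to pin down where the stripes sit, rather than judging them by eye on the map, is to bin one coordinate and list the runs of empty bins. The following is a minimal sketch, assuming the latitudes (or longitudes) are available as a plain list of floats; `empty_bands` and its parameters are hypothetical names, and the bin width is an arbitrary choice.

```python
def empty_bands(values, lower, upper, bin_width):
    """Return (start, end) intervals inside [lower, upper) that hold no values."""
    n_bins = max(1, round((upper - lower) / bin_width))
    counts = [0] * n_bins
    for v in values:
        if lower <= v < upper:
            # Clamp the index so rounding at the upper edge cannot overflow.
            idx = min(int((v - lower) / bin_width), n_bins - 1)
            counts[idx] += 1

    bands, start = [], None
    for i, count in enumerate(counts):
        if count == 0 and start is None:
            start = i  # an empty run begins here
        elif count != 0 and start is not None:
            bands.append((lower + start * bin_width, lower + i * bin_width))
            start = None
    if start is not None:
        bands.append((lower + start * bin_width, upper))  # run reaches the edge
    return bands
```

For the bounding box in the query, a call such as `empty_bands(lats, 51.1, 54.27, 0.01)` would list the latitude ranges without any records. Comparing the reported bands after changing the bounds or the page size might help to tell whether the gaps follow the coordinates or the way the results are paged.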

Cheers!
