Avoiding an HTTP "Too Many Requests" error when using SPARQLWrapper with Wikidata



I have a list of approximately 6k Wikidata instance IDs (beginning with Q#####) whose human-readable labels I want to look up. I am not too familiar with SPARQL, but by following some guides I have managed to find a query that works for a single ID.

from SPARQLWrapper import SPARQLWrapper, JSON

# Look up the English label for a single entity (Q##### is a placeholder ID)
query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wd: <http://www.wikidata.org/entity/>
    SELECT *
    WHERE {
            wd:Q##### rdfs:label ?label .
            FILTER (langMatches( lang(?label), "EN" ) )
          }
    LIMIT 1
    """

sparql = SPARQLWrapper("http://query.wikidata.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
output = sparql.query().convert()

I had hoped that iterating over a list of IDs would be as simple as putting the IDs in a dataframe and using the apply function...

ids_DF['label'] = ids_DF['instance_id'].apply(my_query_function)
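
For context, my_query_function is essentially the single-ID query above wrapped in a function. A simplified sketch (the exact body is paraphrased here for illustration):

from SPARQLWrapper import SPARQLWrapper, JSON

def my_query_function(instance_id):
    # Substitute a single instance ID (e.g. "Q42") into the label query
    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX wd: <http://www.wikidata.org/entity/>
        SELECT *
        WHERE {
                wd:%s rdfs:label ?label .
                FILTER (langMatches( lang(?label), "EN" ) )
              }
        LIMIT 1
        """ % instance_id
    sparql = SPARQLWrapper("http://query.wikidata.org/sparql")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    output = sparql.query().convert()
    bindings = output["results"]["bindings"]
    return bindings[0]["label"]["value"] if bindings else None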

However, when I do that, it errors out with an "HTTPError: Too Many Requests" warning. Looking into the documentation, specifically the query limits section, I found the following:

Query limits

There is a hard query deadline configured which is set to 60 seconds. There are also following limits:

  • One client (user agent + IP) is allowed 60 seconds of processing time each 60 seconds

  • One client is allowed 30 error queries per minute

I'm unsure how to go about resolving this. Am I looking to run 6k error queries (I'm not sure what an error query even is)? In which case I presumably need to run them in batches to stay under the 30-errors-per-minute limit.
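
If batching is the right idea, I'm guessing the IDs could be sent in groups with a VALUES clause, so that 6k IDs become a dozen or so requests rather than 6k. A rough, untested sketch of what I mean (batch_query is a hypothetical helper, and the 500-ID batch size is a guess):

from SPARQLWrapper import SPARQLWrapper, JSON

ids = list(ids_DF['instance_id'])  # e.g. ["Q42", "Q64", ...]

def batch_query(batch):
    # Fetch labels for a whole group of IDs in a single request
    values = " ".join("wd:%s" % i for i in batch)
    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX wd: <http://www.wikidata.org/entity/>
        SELECT ?item ?label
        WHERE {
                VALUES ?item { %s }
                ?item rdfs:label ?label .
                FILTER (langMatches( lang(?label), "EN" ) )
              }
        """ % values
    sparql = SPARQLWrapper("http://query.wikidata.org/sparql")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

results = []
for start in range(0, len(ids), 500):
    results.append(batch_query(ids[start:start + 500]))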

My first attempt to resolve this was to put a delay of 2 seconds after each query (see the third-from-last line below). I noticed that each instance ID was taking approximately 1 second to return a value, so my thinking was that the delay would bring the time per request up to about 3 seconds (which should comfortably keep me within the limit). However, that still returns the same error. I've tried extending the sleep period as well, with the same results.

import time

from SPARQLWrapper import SPARQLWrapper, JSON

query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wd: <http://www.wikidata.org/entity/>
    SELECT *
    WHERE {
            wd:Q##### rdfs:label ?label .
            FILTER (langMatches( lang(?label), "EN" ) )
          }
    LIMIT 1
    """

sparql = SPARQLWrapper("http://query.wikidata.org/sparql")
sparql.setQuery(query)
time.sleep(2)  # pause before each request to stay under the rate limit
sparql.setReturnFormat(JSON)
output = sparql.query().convert()
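
Written out as an explicit loop over the IDs (reusing the my_query_function sketch from above), the pacing I was aiming for looks like this:

import time

labels = []
for instance_id in ids_DF['instance_id']:
    labels.append(my_query_function(instance_id))  # ~1 s per request
    time.sleep(2)  # plus a 2 s pause, i.e. ~3 s per ID; still errors out
ids_DF['label'] = labels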

A similar question on this topic was asked here, but I've not been able to follow the advice given.
