Retrieve a Wikipedia page in another language

Posted 2024-10-02 05:08:14

Task: We have an English Wikipedia page and need to retrieve the address of the same page in Russian.

I know the Semantic Web solution, a simple query to DBpedia, but I am curious whether there is a traditional solution. I asked the same question on semanticoverflow.com, where Toby Inkster suggested parsing the result of http://en.wikipedia.org/wiki/Colugo?action=raw (the links to other languages are at the bottom), but that approach is too inefficient. Are there other ways, or is DBpedia the only real option?

Comments (2)

也只是曾经 2024-10-09 05:08:15

Wikipedia has an extensive API that can provide language-link information, among other things. In this particular case, you're looking for api.php?action=query&prop=langlinks&titles=.... See here for example.
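As a rough illustration of the langlinks query described above, here is a minimal sketch using only the Python standard library. The function names, the `USER_AGENT` string, and the choice of the `lllang`, `llprop=url`, `redirects` and `formatversion=2` parameters are my own assumptions for the example, not something the answer specifies.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical User-Agent string; replace it with one identifying your own tool.
USER_AGENT = "langlink-example/0.1 (contact: you@example.org)"


def langlink_query_url(title, lang):
    """Build the API URL asking for the `lang` language link of `title`."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": lang,      # only return the link for this language
        "llprop": "url",     # include the full URL of the linked page
        "redirects": 1,      # resolve redirects on the source title
        "format": "json",
        "formatversion": 2,  # pages are returned as a list
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_langlink(data, lang):
    """Return the linked page URL for `lang` from a parsed response, or None."""
    for page in data.get("query", {}).get("pages", []):
        for link in page.get("langlinks", []):
            if link.get("lang") == lang:
                return link.get("url")
    return None


def fetch_langlink(title, lang):
    """Query the live API and return the linked page URL, or None if absent."""
    request = urllib.request.Request(
        langlink_query_url(title, lang), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_langlink(json.load(response), lang)
```

Calling `fetch_langlink("Colugo", "ru")` should return the address of the Russian page if one is linked, and `None` otherwise. Building the URL and parsing the response are kept separate from the network call so that each part can be checked on its own.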

执妄 2024-10-09 05:08:15

Sometimes the page does not exist yet in the target language. For example, to find the Japanese (ja) equivalent of the title of https://en.wikipedia.org/wiki/Aframomum_corrorima, you can query Wikidata:

import json  # used further down to pretty-print the sitelinks
import requests

site = "enwiki"  # source wiki; "enwiki" is English Wikipedia
page = "Aframomum_corrorima"
trg_lang = "ja"

# Look up the Wikidata entity linked from the source page.
# `languages` limits the returned labels to the target language.
url = f"https://www.wikidata.org/w/api.php?action=wbgetentities&sites={site}&titles={page}&languages={trg_lang}&format=json"

result = requests.get(url).json()

# One dict of labels per matching entity.
translations = [entity['labels'] for entity in result['entities'].values()]
print(translations)

[out]:

[{'ja': {'language': 'ja', 'value': 'コロリマ'}}]

You'll then find that https://ja.wikipedia.org/w/index.php?title=コロリマ has not been written yet, but the Wikidata API is still able to find the right label for the entity.

To extract all the available links, do something like:

# `props=sitelinks` limits the response to the per-wiki page links.
url = f"https://www.wikidata.org/w/api.php?action=wbgetentities&sites={site}&titles={page}&props=sitelinks&format=json"

result = requests.get(url).json()

# One dict of sitelinks per matching entity, keyed by site ID (e.g. "enwiki").
links = [entity['sitelinks'] for entity in result['entities'].values()]

print(json.dumps(links, indent=4))

[out]:

[
    {
        "amwiki": {
            "site": "amwiki",
            "title": "\u12ae\u1228\u122a\u121b",
            "badges": []
        },
        "cebwiki": {
            "site": "cebwiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "commonswiki": {
            "site": "commonswiki",
            "title": "Category:Aframomum corrorima",
            "badges": []
        },
        "elwiki": {
            "site": "elwiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "enwiki": {
            "site": "enwiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "eswiki": {
            "site": "eswiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "frwiki": {
            "site": "frwiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "kowiki": {
            "site": "kowiki",
            "title": "\ucf54\ub7ec\ub9ac\ub9c8",
            "badges": []
        },
        "lawiki": {
            "site": "lawiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "specieswiki": {
            "site": "specieswiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "svwiki": {
            "site": "svwiki",
            "title": "Korarima",
            "badges": []
        },
        "ukwiki": {
            "site": "ukwiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "viwiki": {
            "site": "viwiki",
            "title": "Aframomum corrorima",
            "badges": []
        },
        "warwiki": {
            "site": "warwiki",
            "title": "Aframomum corrorima",
            "badges": []
        }
    }
]
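
To get back to the original question (the address of the Russian page), the same Wikidata call can be narrowed to a single target wiki. The sketch below is only an illustration using the standard library: the function names and the `USER_AGENT` string are hypothetical, and it assumes the `sitefilter` and `props=sitelinks/urls` options of `wbgetentities` to return just one sitelink together with its URL.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
# Hypothetical User-Agent string; replace it with one identifying your own tool.
USER_AGENT = "sitelink-example/0.1 (contact: you@example.org)"


def sitelink_query_url(source_site, title, target_site):
    """Build a wbgetentities URL that returns only the `target_site` sitelink."""
    params = {
        "action": "wbgetentities",
        "sites": source_site,                # e.g. "enwiki"
        "titles": title,
        "props": "sitelinks|sitelinks/urls",  # sitelinks plus their full URLs
        "sitefilter": target_site,            # e.g. "ruwiki"
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def extract_sitelink_url(data, target_site):
    """Return the URL of the `target_site` page from a parsed response, or None."""
    for entity in data.get("entities", {}).values():
        link = entity.get("sitelinks", {}).get(target_site)
        if link:
            return link.get("url")
    return None


def fetch_sitelink_url(source_site, title, target_site):
    """Query the live API and return the target page URL, or None if absent."""
    request = urllib.request.Request(
        sitelink_query_url(source_site, title, target_site),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_sitelink_url(json.load(response), target_site)
```

For example, `fetch_sitelink_url("enwiki", "Aframomum_corrorima", "ruwiki")` would return `None` if the entity has no `ruwiki` sitelink, as in the output above, where no Russian entry is listed.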