Has anyone parsed Wiktionary?

Posted on 2024-09-12 00:14:34

Wiktionary is a wiki dictionary that covers many languages; it even has translations. I would be interested in parsing it and playing with the data. Has anyone done anything like this before? Is there a library I can use? (Preferably in Python.)


Comments (11)

皓月长歌 2024-09-19 00:14:34

I once downloaded a Wiktionary dump, trying to gather together words and definitions for Slavic languages. I approached it using ElementTree to go through the XML file that is the dump. I would avoid trying to scrape or crawl the site, and just download the XML dump that Wikimedia provides for Wiktionary. Go to the Wikimedia downloads, look for the English Wiktionary dumps (enwiktionary), and go to the most recent dump. You'll probably want the pages-articles.xml.bz2 file, which is just the article content, with no history or comments. Parse this with whatever XML processing library you prefer in Python; I personally prefer ElementTree. Good luck.
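A minimal sketch of that approach, using the standard library's `xml.etree.ElementTree` in streaming mode. The export namespace URI (version 0.10 here) and the dump file name are assumptions that vary between dump versions, so check them against the actual file:

```python
import bz2
import io
import xml.etree.ElementTree as ET

# MediaWiki export namespace; the version suffix (0.10 here) varies by dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(xml_stream):
    """Yield (title, wikitext) for each <page> in a pages-articles stream."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            yield title, text
            elem.clear()  # free memory; the dump is several GB uncompressed

# Tiny invented sample in the export format, just to show the shape:
sample = (b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
          b'<page><title>dog</title>'
          b'<revision><text>==English==</text></revision></page>'
          b'</mediawiki>')
pages = list(iter_pages(io.BytesIO(sample)))
print(pages)  # [('dog', '==English==')]

# Against the real dump it would be (not run here):
# with bz2.open("enwiktionary-latest-pages-articles.xml.bz2", "rb") as f:
#     for title, text in iter_pages(f):
#         ...
```

Calling `elem.clear()` after each page is what keeps memory flat; without it, `iterparse` still builds the whole tree.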

镜花水月 2024-09-19 00:14:34

Wiktionary runs on MediaWiki, which has an API.

One of the subpages of the API documentation is Client code, which lists some Python libraries.
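For a sense of what using that API looks like without any client library, here is a sketch that builds an `action=parse` request for one entry's raw wikitext with only the standard library (the parameter names are the action API's; the entry title is just an example):

```python
from urllib.parse import urlencode

API = "https://en.wiktionary.org/w/api.php"

def wikitext_url(title):
    """Build an action=parse request URL for one entry's raw wikitext."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    }
    return API + "?" + urlencode(params)

url = wikitext_url("dictionary")
print(url)

# Fetching it (needs network access) would look like:
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url))
# wikitext = data["parse"]["wikitext"]
```

For bulk work the dump is still the polite option; the API suits looking up a modest number of entries.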

夏至、离别 2024-09-19 00:14:34

Wordnik has done a good job of parsing out definitions, etc., and they have a great API.

As others have mentioned, Wiktionary is a formatting disaster and was not built to be computer-readable.

柏林苍穹下 2024-09-19 00:14:34

Yes, many people have parsed Wiktionary. You can usually find past experiences in the Wiktionary-l mailing list archives.

A project not mentioned in the other answers is DBpedia's Wiktionary RDF extraction.

Dozens of other research projects have parsed Wiktionary: you can find some examples in a recent Wiktionary special and in other issues of the Wikimedia research newsletter.

Recently someone also made an English Wiktionary REST API which includes an unspecified subset of the Wiktionary data; future plans for it are not known yet.

故事和酒 2024-09-19 00:14:34

I had a crack at parsing the German Wiktionary. I ended up writing it off as too difficult, but I put my (not at all tidied up) code up at https://github.com/benreynwar/wiktionary-parser before I gave up. Although there are conventions used by the editors, they are not enforced by anything other than peer oversight. The diversity of templates used, along with all the typos in the pages, makes the parsing quite challenging.

I think the problem is that they've used the same system as for Wikipedia, which is great for ease of use by the editors, but is not appropriate for the much more structured content of Wiktionary. It's a shame, because if Wiktionary could be easily parsed it would be a hugely useful resource.
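To make the template problem concrete, here is a rough illustration of the kind of scanning such a parser ends up doing: a brace-depth walk that pulls out top-level `{{...}}` template names. The sample snippet is invented for illustration, and real pages defeat simple rules like this with nesting, parser functions, and typos, which is exactly the difficulty described above:

```python
def top_level_templates(wikitext):
    """Return the names of top-level {{...}} templates in a wikitext string.
    A crude scanner that tracks brace depth; real pages nest templates,
    use parser functions, and contain typos that defeat simple rules."""
    names, depth, start = [], 0, None
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i + 2
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            if depth == 0 and start is not None:
                body = wikitext[start:i]
                names.append(body.split("|", 1)[0].strip())
                start = None
            i += 2
        else:
            i += 1
    return names

# Invented snippet loosely imitating a German Wiktionary entry:
sample = "== Hund ==\n{{Wortart|Substantiv|Deutsch}}\n{{Bedeutungen|[1] dog}}"
print(top_level_templates(sample))  # ['Wortart', 'Bedeutungen']
```

Every language edition has its own template inventory, so even a correct scanner like this only gets you to the point where the per-template interpretation work begins.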

浅沫记忆 2024-09-19 00:14:34

I just made a word list from the German dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words

幸福丶如此 2024-09-19 00:14:34

You are welcome to play with the MySQL parsed Wiktionary database.
There are two databases (English Wiktionary and Russian Wiktionary) created by a parser written in Java: http://wikokit.googlecode.com

If you like PHP, then you are welcome to play with piwidict, a PHP API to this machine-readable Wiktionary.

開玄 2024-09-19 00:14:34

You may be interested in the dbnary project; it's not Python, but it's interesting.
It claims support for parsing 21 languages, and it powers wikdict.

追我者格杀勿论 2024-09-19 00:14:34

There is also JWKTL, which does a good job at parsing and extracting structured data from Wiktionary. It is written in Java and supports the English, German, and Russian editions.

嗼ふ静 2024-09-19 00:14:34

It depends on how thoroughly you need to parse it. If you just need to get all the content for a word in a language (definition, etymology, pronunciation, conjugation, etc.), then it's pretty easy. I have done this before, although in Java using jsoup.

However, if you need to parse it down to the different components of the content (e.g. just getting the definitions of a word), then it will be much more challenging. A Wiktionary entry for a word in a language has no pre-defined template, so a header can be anything from <h3> to <h6>, the order of the sections may be jumbled, they may be repetitive, etc.
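The answerer used jsoup in Java; a hypothetical Python analogue with the standard library's `html.parser` would collect section headers regardless of level, precisely because the level is not fixed. The HTML snippet is invented to imitate a rendered entry's structure:

```python
from html.parser import HTMLParser

class HeaderCollector(HTMLParser):
    """Collect (level, text) for every <h3>..<h6>, since Wiktionary
    entries do not use one fixed header level per section type."""
    def __init__(self):
        super().__init__()
        self.headers = []
        self._level = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "h4", "h5", "h6"):
            self._level, self._buf = tag, []

    def handle_data(self, data):
        if self._level:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._level:
            self.headers.append((tag, "".join(self._buf).strip()))
            self._level = None

# Invented snippet imitating the rendered page structure:
html = "<h3>Etymology</h3><p>...</p><h4>Noun</h4><h3>Pronunciation</h3>"
p = HeaderCollector()
p.feed(html)
print(p.headers)  # [('h3', 'Etymology'), ('h4', 'Noun'), ('h3', 'Pronunciation')]
```

Matching on header text ("Noun", "Etymology", ...) rather than level is what makes this tolerable; the jumbled and repeated sections the answer mentions still need handling downstream.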

烟织青萝梦 2024-09-19 00:14:34

I wrote a primitive parser for the German Wiktionary dump in Java that only extracts nouns and their articles, plus their Arabic translations, without any dependencies. Execution takes a long time, so be warned. If there's interest in or need for parsing more or other data, please tell me; I might look into it as time permits.
