There's a list of corpora at http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html
The CoNLL 2003 corpus on that list is free: the annotations are available from http://www.cnts.ua.ac.be/conll2003/ner/ and the underlying text from NIST.
The Python NLTK has access to the nltk.corpus.conll2000 corpus. Calling conll2000.iob_words() returns a list of (word, part-of-speech, IOB) triples, where IOB is a tag in the Inside/Outside/Beginning format. There are about 250k total words of newswire-style text.
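For example, a minimal sketch of reading the corpus with NLTK (assuming NLTK is installed; the download call fetches the corpus data on first use):

```python
import nltk
from nltk.corpus import conll2000

nltk.download("conll2000")  # fetch the corpus data if not already present

# Each item is a (word, part-of-speech, IOB tag) triple.
for word, pos, iob in conll2000.iob_words()[:10]:
    print(word, pos, iob)
```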
dbPedia is open and free. It is built from Wikipedia, so it is a very large corpus. One approach is to build a Lucene index over the rdfs:label triples in the dbPedia titles dump, as sketched below.
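A minimal sketch of that indexing step, using Whoosh as a pure-Python stand-in for Lucene; the file name labels_en.nt and the index directory are assumptions, not part of the original answer:

```python
import os
import re

from whoosh import index
from whoosh.fields import ID, TEXT, Schema

# Matches N-Triples lines from the labels dump, e.g.:
# <http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .
LABEL_TRIPLE = re.compile(
    r'<([^>]+)> <http://www\.w3\.org/2000/01/rdf-schema#label> "(.*)"@en \.'
)

schema = Schema(uri=ID(stored=True, unique=True), label=TEXT(stored=True))

os.makedirs("labels_index", exist_ok=True)  # assumed index directory
ix = index.create_in("labels_index", schema)

writer = ix.writer()
with open("labels_en.nt", encoding="utf-8") as dump:  # assumed local dump file
    for line in dump:
        match = LABEL_TRIPLE.match(line)
        if match:
            uri, label = match.groups()
            writer.add_document(uri=uri, label=label)
writer.commit()
```

Searching that index then gives a fast lookup from a surface string to a dbPedia URI, which is one way to use the dump for gazetteer-style NER.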