I have text files in multiple languages. How can I selectively delete one language in NLTK?
Maybe this is just impossible and I should give up all hope. Or maybe there's a really clever way to do it that I haven't thought of.
Here are two examples of what I've got:
يَبِسَ - يَيْبَسُ (yabisa,
yaybasu)[y-b-s][ي-ب-س] (To become dry,
stiff, rigid) 20:77 yabasan = dry.
يَسَّرَ - يُيَسِّرُ (yassara,
yuyassiru)[y-s-r][ي-س-ر] (To
facilitate, make it easy) 92:7
nuyassiruhuu = We will ease him.
and
Zu Hülfe! zu Hülfe! Help! Help!
Sonst bin ich verloren! Otherwise I am
lost! Zu Hülfe! Zu Hülfe! Help!
Help! Sonst bin ich
verloren! Otherwise I am lost! Der
listigen Schlange zum Opfer erkoren,
Selected as offering to the cunning
snake, Barmherzigige Götter! Merciful
Gods! Schon nahet sie sich, Already it
gets closer, Schon nahet sie
sich, Already it gets closer,
... it would be really annoying to go through and delete one language in order to further process these lines of text.
One way I was thinking this could be done in NLTK was to split the text into tokens, have some way of knowing the provenance of each token based on a small corpus, and then ask NLTK to 'reconstitute' only the tokens of my choosing. Is this just a wild fantasy?
Comments (1)
You can use nltk.NaiveBayesClassifier to do this, much as you described above.
The following link should help:
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
It has an example of using nltk.NaiveBayesClassifier for gender identification. You can use the same approach for language identification.
The first example you quoted should work well with nltk.NaiveBayesClassifier, since the two languages use completely different Unicode character sets.
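For that first example, the difference in scripts may even be enough on its own, without training anything. Here is a minimal sketch, assuming the goal is to drop the Arabic-script characters and keep the Latin transliteration and English gloss. It uses only Python's standard unicodedata module; the helper names is_arabic_char and strip_arabic are made up for illustration.

```python
import unicodedata

def is_arabic_char(ch):
    # Arabic letters and diacritics all have Unicode names that begin
    # with "ARABIC" (e.g. "ARABIC LETTER YEH", "ARABIC FATHA").
    return unicodedata.name(ch, "").startswith("ARABIC")

def strip_arabic(text):
    # Drop every Arabic-script character, then collapse leftover whitespace.
    without = "".join(ch for ch in text if not is_arabic_char(ch))
    return " ".join(without.split())

line = "يَبِسَ - يَيْبَسُ (yabisa, yaybasu) (To become dry, stiff, rigid) 20:77 yabasan = dry."
print(strip_arabic(line))
```

Punctuation that sat between Arabic words, such as the hyphens inside the bracketed root, is left behind and would need a separate cleanup pass.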
In the second example, some words, such as proper nouns, may be spelled the same in both languages, which could cause some errors in language identification.
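For the second example, a rough sketch of the token-by-token idea could look like the following. The assumptions are mine: the feature set (two-letter prefix, suffix and character bigrams) is just one plausible choice, the training words are lifted from the quoted text and are far too few for real use, and word_features and keep_language are hypothetical helper names. Only nltk.NaiveBayesClassifier.train and classify come from NLTK.

```python
import string
import nltk

def word_features(token):
    # Character-level features for one token (an illustrative feature set).
    word = token.strip(string.punctuation).lower()
    feats = {"prefix2": word[:2], "suffix2": word[-2:]}
    for i in range(len(word) - 1):
        feats["has_" + word[i:i + 2]] = True
    return feats

# Tiny hand-made training lists; a real run would need a much larger corpus.
GERMAN = ("zu hülfe sonst bin ich verloren der listigen schlange zum "
          "opfer erkoren barmherzige götter schon nahet sie sich").split()
ENGLISH = ("help otherwise i am lost selected as offering to the cunning "
           "snake merciful gods already it gets closer").split()

train_set = [(word_features(w), "de") for w in GERMAN]
train_set += [(word_features(w), "en") for w in ENGLISH]
classifier = nltk.NaiveBayesClassifier.train(train_set)

def keep_language(text, lang):
    # Classify each whitespace-separated token and keep only those
    # labelled with the requested language, in their original order.
    tokens = text.split()
    kept = [t for t in tokens if classifier.classify(word_features(t)) == lang]
    return " ".join(kept)

sample = "Sonst bin ich verloren! Otherwise I am lost!"
print(keep_language(sample, "en"))
```

With this little training data, misclassified tokens are to be expected, and words shared by both languages will land on whichever side the features happen to favour.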