I have text files in multiple languages. How can I selectively delete one language in NLTK?

Posted on 2024-09-15 21:03:42

Maybe this is just impossible and I should give up all hope. Or maybe there's a really clever way to do it that I haven't thought of.

Here are two examples of what I've got:

يَبِسَ - يَيْبَسُ (yabisa,
yaybasu)[y-b-s][ي-ب-س] (To become dry,
stiff, rigid) 20:77 yabasan = dry.
يَسَّرَ - يُيَسِّرُ (yassara,
yuyassiru)[y-s-r][ي-س-ر] (To
facilitate, make it easy) 92:7
nuyassiruhuu = We will ease him.

and

Zu Hülfe! zu Hülfe! Help! Help!
Sonst bin ich verloren! Otherwise I am
lost! Zu Hülfe! Zu Hülfe! Help!
Help! Sonst bin ich
verloren! Otherwise I am lost! Der
listigen Schlange zum Opfer erkoren,
Selected as offering to the cunning
snake, Barmherzigige Götter! Merciful
Gods! Schon nahet sie sich, Already it
gets closer, Schon nahet sie
sich, Already it gets closer,

... it would be really annoying to go through and delete one language in order to further process these lines of text.

One way I was thinking this could be done in NLTK was to split the text into tokens, have some way of knowing the provenance of each token based on a small corpus, and then ask NLTK to 'reconstitute' only the tokens of my choosing. Is this just a wild fantasy?
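The tokenise / label / reconstitute idea can be sketched in plain Python before bringing NLTK into it. This is only a rough illustration: the word sets `GERMAN` and `ENGLISH` are hypothetical stand-ins for the "small corpus", and the rule that unknown tokens (punctuation, out-of-vocabulary words) inherit the label of the preceding token is an assumption made for the sketch, not something NLTK provides.

```python
import re

# Hypothetical mini vocabularies standing in for a small per-language corpus.
GERMAN = {"zu", "hülfe", "sonst", "bin", "ich", "verloren"}
ENGLISH = {"help", "otherwise", "i", "am", "lost"}

def label_token(token):
    """Return 'de', 'en', or None when the token is in neither vocabulary."""
    word = token.lower()
    if word in GERMAN:
        return "de"
    if word in ENGLISH:
        return "en"
    return None

def keep_only(text, lang):
    """Tokenise, label each token, and rejoin only the tokens labelled `lang`."""
    # Words become one token each; every punctuation mark is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = []
    previous = None
    for tok in tokens:
        label = label_token(tok)
        if label is None:
            # Assumption: unknown tokens belong to the same language as the
            # token before them.
            label = previous
        if label == lang:
            kept.append(tok)
        previous = label
    return " ".join(kept)

print(keep_only("Sonst bin ich verloren! Otherwise I am lost!", "en"))
# prints: Otherwise I am lost !
```

Rejoining with single spaces loses the original spacing around punctuation; keeping character offsets for each token would be needed to restore it exactly.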


Comments (1)

慢慢从新开始 2024-09-22 21:03:42

You can use nltk.NaiveBayesClassifier to do the job exactly as you said above.

The following link should help:
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

It has an example of using nltk.NaiveBayesClassifier for gender identification. You can use the same approach for language identification.
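A minimal sketch of what that could look like for the German/English example, assuming NLTK is installed. The feature function `char_features` (character bigrams) and the tiny word lists are my own assumptions for illustration; the book chapter uses name suffixes for gender, and a real training set would need to be much larger.

```python
import nltk

def char_features(word):
    """Character-bigram presence features, padded to mark word start and end."""
    padded = "^" + word.lower() + "$"
    return {"bigram(" + padded[i:i + 2] + ")": True
            for i in range(len(padded) - 1)}

# Tiny illustrative training lists taken from the sample text (assumption:
# a real corpus would be far larger).
german_words = ["ich", "bin", "verloren", "sonst", "schlange", "götter",
                "schon", "nahet", "sie", "sich", "zum", "opfer", "der",
                "listigen"]
english_words = ["help", "otherwise", "am", "lost", "selected", "as",
                 "offering", "to", "the", "cunning", "snake", "merciful",
                 "gods", "already", "it", "gets", "closer"]

train = ([(char_features(w), "de") for w in german_words] +
         [(char_features(w), "en") for w in english_words])
classifier = nltk.NaiveBayesClassifier.train(train)

def keep_language(text, lang):
    """Keep only whitespace-separated tokens the classifier labels as `lang`."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation before extracting features.
        word = token.strip(".,!?;:")
        if classifier.classify(char_features(word)) == lang:
            kept.append(token)
    return " ".join(kept)

print(keep_language("Sonst bin ich verloren! Otherwise I am lost!", "en"))
```

With so little training data the per-token labels will be unreliable for unseen words, and a token with no known bigrams simply falls back to the more frequent label in the training set.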

The first example you quoted will work well with nltk.NaiveBayesClassifier, since the two languages use completely different sets of Unicode characters.
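When the scripts differ this much, a classifier may not even be needed. Here is a rough sketch that strips Arabic-script characters with a regular expression; the code point ranges listed are the Arabic blocks I am aware of, and the helper name `drop_arabic` is made up for this example.

```python
import re

# Arabic, Arabic Supplement, Arabic Extended-A, and the two Arabic
# Presentation Forms blocks (assumption: these cover the input text).
ARABIC_RUN = re.compile(
    r"[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]+"
)

def drop_arabic(text):
    """Delete runs of Arabic-script characters, then collapse whitespace."""
    stripped = ARABIC_RUN.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()

sample = "يَبِسَ - يَيْبَسُ (yabisa, yaybasu)[y-b-s][ي-ب-س] (To become dry, stiff, rigid)"
print(drop_arabic(sample))
```

Punctuation that sat between Arabic letters, such as the hyphens inside the bracketed root, is left behind and would need a separate clean-up pass.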

In the second example, some words, such as proper nouns, may be spelled the same in both languages, which could cause some errors in language identification.
