How can I detect Farsi web pages with Tika?
I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit:
LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
String language = identifier.getLanguage();
I have downloaded the Apache Tika JAR files and added them to the classpath, but this code gives an error for Farsi while it works for English.
How can I add Farsi to the LanguageIdentifier package of Tika?
Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box.
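If you want to see exactly which profiles your installation has loaded, the supported codes can be listed at runtime. This is a minimal sketch, assuming the Tika 1.x org.apache.tika.language API and its static LanguageIdentifier.getSupportedLanguages() helper; the class name is only for illustration:

import org.apache.tika.language.LanguageIdentifier;

public class ListProfiles {
    public static void main(String[] args) {
        // Prints the ISO 639-1 codes of the language profiles found on the classpath.
        // With a stock Tika 1.0 these are the 27 built-in languages, and "fa" is not among them.
        System.out.println(LanguageIdentifier.getSupportedLanguages());
    }
}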
In your example the input is misdetected as li (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022. See the source code for more information on the inner workings of LanguageIdentifier. Farsi (Persian, ISO 639-1 two-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first.
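You can observe the misdetection yourself: isReasonablyCertain() compares the best matching distance against that certainty threshold, so it comes back false for the Farsi sample. A minimal sketch, reusing the string from the question; the class name is illustrative:

import org.apache.tika.language.LanguageIdentifier;

public class DetectFarsi {
    public static void main(String[] args) {
        // Same call as in the question: profile the input and pick the closest
        // of the bundled language profiles.
        LanguageIdentifier identifier = new LanguageIdentifier("فارسی");

        // Without a Farsi profile this prints one of the 27 built-in codes
        // (here "li" for Lithuanian) rather than "fa".
        System.out.println("detected: " + identifier.getLanguage());

        // False here, because the best distance (about 0.41) is above the
        // certainty threshold of 0.022, so the guess should not be trusted.
        System.out.println("certain:  " + identifier.isReasonablyCertain());
    }
}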
For this, the following steps are necessary:

1. Find a text corpus for your language. I found the Hamshahri Collection; this should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.

2. Create an n-gram file for the language identifier. This can be done using the Tika CLI:

java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt

This will create a file called fa.ngp which contains the n-grams.

3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles(), or put a property file named tika.language.override.properties on the classpath. Make sure the n-gram file is on the classpath as well (see the sketch after these steps).

If you now run Tika, it should correctly detect your language.
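To illustrate the programmatic half of step 3, here is a sketch that registers a Farsi profile at runtime. Instead of loading the generated fa.ngp, it builds the profile straight from the plain-text corpus, since LanguageProfile has a public constructor for raw text. This assumes the Tika 1.x org.apache.tika.language API (LanguageProfile(String), LanguageIdentifier.addProfile(...)); the file name, sample sentence and class name are only illustrative:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.language.LanguageProfile;

public class RegisterFarsiProfile {
    public static void main(String[] args) throws Exception {
        // Read the plain-text corpus prepared in step 1 (same file as used with the CLI).
        String corpus = new String(
                Files.readAllBytes(Paths.get("fa-corpus.txt")), StandardCharsets.UTF_8);

        // Build an n-gram profile from the corpus and register it under the code "fa".
        LanguageProfile farsi = new LanguageProfile(corpus);
        LanguageIdentifier.addProfile("fa", farsi);

        // The detection from the question now has a Farsi profile to match against.
        LanguageIdentifier identifier = new LanguageIdentifier("این یک متن فارسی است");
        System.out.println(identifier.getLanguage());          // should now be "fa"
        System.out.println(identifier.isReasonablyCertain());  // more reliable on longer text
    }
}

If you prefer the tika.language.override.properties route instead, the safest way to get the key names right is to copy them from the tika.language.properties file that ships inside tika-core and add an fa entry there.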
Update: detailed the steps necessary to create a language profile.