How can I detect Farsi web pages with Tika?
I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit:
LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
String language = identifier.getLanguage();
I have downloaded the Apache Tika JAR files and added them to the classpath, but this code gives an error for Farsi while it works for English.
How can I add Farsi to the LanguageIdentifier package of Tika?
Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box.
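If you want to see exactly which profiles your installation has loaded, the supported codes can be listed at runtime. This is a minimal sketch, assuming the Tika 1.x org.apache.tika.language API and its static LanguageIdentifier.getSupportedLanguages() helper; the class name is only for illustration:

import org.apache.tika.language.LanguageIdentifier;

public class ListProfiles {
    public static void main(String[] args) {
        // Prints the ISO 639-1 codes of the language profiles found on the classpath.
        // With a stock Tika 1.0 these are the 27 built-in languages, and "fa" is not among them.
        System.out.println(LanguageIdentifier.getSupportedLanguages());
    }
}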
In your example the input is misdetected as li (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022. See the source code for more information on the inner workings of LanguageIdentifier. Farsi (Persian, ISO 639-1 two-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first.
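You can observe the misdetection yourself: isReasonablyCertain() compares the best matching distance against that certainty threshold, so it comes back false for the Farsi sample. A minimal sketch, reusing the string from the question; the class name is illustrative:

import org.apache.tika.language.LanguageIdentifier;

public class DetectFarsi {
    public static void main(String[] args) {
        // Same call as in the question: profile the input and pick the closest
        // of the bundled language profiles.
        LanguageIdentifier identifier = new LanguageIdentifier("فارسی");

        // Without a Farsi profile this prints one of the 27 built-in codes
        // (here "li" for Lithuanian) rather than "fa".
        System.out.println("detected: " + identifier.getLanguage());

        // False here, because the best distance (about 0.41) is above the
        // certainty threshold of 0.022, so the guess should not be trusted.
        System.out.println("certain:  " + identifier.isReasonablyCertain());
    }
}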
For this, the following steps are necessary:

1. Find a text corpus for your language. I found the Hamshahri Collection; this should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.

2. Create an n-gram file for the language identifier. This can be done using the Tika CLI:

java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt

This will create a file called fa.ngp which contains the n-grams.

3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles(), or put a property file named tika.language.override.properties on the classpath. Make sure the n-gram file is on the classpath as well (see the sketch after these steps).

If you now run Tika, it should correctly detect your language.
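To illustrate the programmatic half of step 3, here is a sketch that registers a Farsi profile at runtime. Instead of loading the generated fa.ngp, it builds the profile straight from the plain-text corpus, since LanguageProfile has a public constructor for raw text. This assumes the Tika 1.x org.apache.tika.language API (LanguageProfile(String), LanguageIdentifier.addProfile(...)); the file name, sample sentence and class name are only illustrative:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.language.LanguageProfile;

public class RegisterFarsiProfile {
    public static void main(String[] args) throws Exception {
        // Read the plain-text corpus prepared in step 1 (same file as used with the CLI).
        String corpus = new String(
                Files.readAllBytes(Paths.get("fa-corpus.txt")), StandardCharsets.UTF_8);

        // Build an n-gram profile from the corpus and register it under the code "fa".
        LanguageProfile farsi = new LanguageProfile(corpus);
        LanguageIdentifier.addProfile("fa", farsi);

        // The detection from the question now has a Farsi profile to match against.
        LanguageIdentifier identifier = new LanguageIdentifier("این یک متن فارسی است");
        System.out.println(identifier.getLanguage());          // should now be "fa"
        System.out.println(identifier.isReasonablyCertain());  // more reliable on longer text
    }
}

If you prefer the tika.language.override.properties route instead, the safest way to get the key names right is to copy them from the tika.language.properties file that ships inside tika-core and add an fa entry there.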
Update: detailed the steps necessary to create a language profile.