Adding language profiles to Apache Tika
Could anybody who has managed to do this please explain how? :-)
Do I need to get n-gram files for the language I want to add?
Is it a matter of creating tika.language.override.properties, adding some other language codes, and putting a lang-code.ngp n-gram file on the classpath? In that case, where do I get such a file, and why doesn't Tika support more languages if that is all it takes?
These languages are currently supported for language detection:
da,de,et,el,en,es,fi,fr,hu,is,it,lt,nl,no,pl,pt,ru,sv,th
and Tika uses traditional n-gram notation:
er_ 132232
_de 103517
en_ 82666
et_ 80661
for 65286
_fo 57945
de_ 51382
der 44049
at_ 41915
det 41381
_og 40344
_at 39482
ing 38707
den 36795
og_ 36577
_me 34924
nde 34528
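For illustration, here is a minimal sketch of how counts in this notation could be produced from plain text. It is not Tika's or Nutch's actual code: the function names `trigram_profile` and `format_profile` are hypothetical, and the tokenization (lowercasing, keeping letters only, padding each word with `_`) is an assumption inferred from the sample above.

```python
from collections import Counter


def trigram_profile(text, top=None):
    """Count underscore-padded character trigrams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        # Assumption: keep letters only; digits and punctuation are dropped.
        letters = "".join(ch for ch in word if ch.isalpha())
        if not letters:
            continue
        # Mark word boundaries with "_", as in the sample profile.
        padded = "_" + letters + "_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top] if top else ranked


def format_profile(ranked):
    """Render (ngram, count) pairs as 'ngram count' lines."""
    return "\n".join(f"{gram} {count}" for gram, count in ranked)


if __name__ == "__main__":
    print(format_profile(trigram_profile("der det der og det", top=5)))
```

A real profile would need a large corpus for the target language, and the result should be compared against an existing .ngp file before relying on it.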
Another language detection application currently supports the languages below, but its n-gram files are somewhat different:
af bg cs de en fa fr he hr id ja ko ml ne no pl ro sk sq sw te tl uk vi zh-tw ar bn da el es fi gu hi hu it kn mk mr nl pa pt ru so sv ta th tr ur zh-cn
They are in JSON notation:
{"freq":{"D":9246,"E":2445,"F":2510,"G":3299,"A":6930,"B":3706,"C":2451,"L":2519,"M":3951,"N":3334,"O":2514,"H" ....
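To make the difference between the two notations concrete, the sketch below rewrites a JSON profile of this shape as `ngram count` lines. It rests on assumptions: that the JSON object has a `freq` map from n-gram to count (as the excerpt shows), that a space inside a key marks a word boundary, and that only three-character entries are wanted. The function name `json_profile_to_lines` is hypothetical. Whether Tika would score text correctly against output converted this way is untested.

```python
import json


def json_profile_to_lines(raw, n=3):
    """Convert a JSON n-gram profile into 'ngram count' lines."""
    profile = json.loads(raw)
    freq = profile["freq"]
    rows = [
        # Assumption: a space in the JSON key stands for a word boundary.
        (gram.replace(" ", "_"), count)
        for gram, count in freq.items()
        if len(gram) == n  # drop the 1- and 2-character entries
    ]
    # Most frequent first, ties broken alphabetically.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return [f"{gram} {count}" for gram, count in rows]


if __name__ == "__main__":
    sample = '{"freq": {"D": 9246, " th": 70, "he ": 60, "the": 50}}'
    print("\n".join(json_profile_to_lines(sample)))
```

The counts in the two kinds of files come from different corpora and different tokenization rules, so a converted file is at best a starting point.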
Comments (1)
It looks like, as of TIKA-490, it should be possible to add new language profiles. TIKA-546 seems to indicate it isn't yet as easy as it might be, and in the meantime you'll need to start with Nutch's NGramProfile tool and tweak the output.
I'd suggest you try using the Nutch tool to generate the files, then look at the comments on TIKA-490 for details on how to use them.