Increasing the Java heap space for the language-identifier plugin in Nutch
I am trying to add a new language to Apache Tika's automatic language detection tool. Adding a language requires building a language profile, so I am using the Nutch language-identifier plugin to build it.
The command is the following:
bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.NGramProfile -create ./language-detection-profile/jp ./language-detection-profile/japanese4ngram-1.txt utf-8
where ./language-detection-profile/japanese4ngram-1.txt is the corpus for the new language.
I tested this on a small corpus (1 MB) and everything worked: the profile was created as expected.
However, when the corpus is large (> 1 GB), the process runs out of memory (heap space):
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at org.apache.nutch.analysis.lang.NGramProfile.create(NGramProfile.java:374)
    at org.apache.nutch.analysis.lang.NGramProfile.main(NGramProfile.java:484)
    ... 5 more
Does anyone know how to specify the heap size for a Nutch plugin? Thanks.
Edit:
With help from Mikaveli, this is what worked on Ubuntu. Set:
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH -Xmx2048m"
fi
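One thing to watch in the snippet above: the -Xmx flag sits inside the JAVA_LIBRARY_PATH check, so it is only added when that variable is set. Below is a minimal sketch of the same idea with the heap flag appended unconditionally. It assumes the launcher passes $NUTCH_OPTS on to the java command, as the snippet implies; the function name build_nutch_opts and the HEAP_MB variable are hypothetical names used only for illustration and are not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): assembles the JVM option string.
# Assumption: the launcher script hands $NUTCH_OPTS to the java command.
build_nutch_opts() {
  opts="$NUTCH_OPTS"

  # Add the native library path only when JAVA_LIBRARY_PATH is set
  if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
    opts="$opts -Djava.library.path=$JAVA_LIBRARY_PATH"
  fi

  # Append the heap limit regardless of JAVA_LIBRARY_PATH.
  # HEAP_MB is an assumed variable; it defaults to 2048 as in the edit above.
  opts="$opts -Xmx${HEAP_MB:-2048}m"

  # Drop the leading space left behind when NUTCH_OPTS started out empty
  printf '%s\n' "${opts# }"
}
```

It could be used as NUTCH_OPTS="$(build_nutch_opts)" before the launcher builds its java command line, with HEAP_MB exported beforehand to request a larger heap.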
Comments (1)
Assuming you're developing on a Windows box, edit nutch.bat and add the following after the
rem NUTCH_OPTS
line:

Obviously, set the amount of RAM within the physical limit of your machine. Note that Nutch can easily require 4g, depending on what you're doing with it.