Increasing Java heap space for the language-identifier plugin in Nutch

Posted on 2024-11-30 05:23:48

I am trying to add a new language to Apache Tika, the automatic language detection tool. Adding a language requires building a language profile, so I am using the Nutch language-identifier plugin to build it.

The command is the following:

bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.NGramProfile -create ./language-detection-profile/jp ./language-detection-profile/japanese4ngram-1.txt utf-8

Here ./language-detection-profile/japanese4ngram-1.txt is the corpus for the new language.

I tested with a small corpus (1 MB) and everything worked: the profile was created as expected.

However, when the corpus is large (> 1 GB), the run fails with an out-of-memory (heap space) error:

Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at org.apache.nutch.analysis.lang.NGramProfile.create(NGramProfile.java:374)
    at org.apache.nutch.analysis.lang.NGramProfile.main(NGramProfile.java:484)
    ... 5 more

Does anyone know how to specify the heap size for a Nutch plugin? Thanks.

Edit:
With help from Mikaveli, this works on Ubuntu. Set:

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH -Xmx2048m"
fi
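
Note that in the snippet above, -Xmx2048m sits inside the JAVA_LIBRARY_PATH check, so the heap limit is only added when that variable is set. A minimal sketch of one way to apply it unconditionally, assuming the launcher script builds NUTCH_OPTS as shown above and passes it to java; NUTCH_MAX_HEAP is a hypothetical variable introduced here for illustration, not something Nutch defines:

```shell
# Sketch only: NUTCH_MAX_HEAP is a hypothetical name used for illustration.
# Default to 2048m, the value used in the edit above, unless overridden.
NUTCH_MAX_HEAP="${NUTCH_MAX_HEAP:-2048m}"

# Keep the native-library option conditional, as in the original snippet.
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi

# Append the heap limit outside the check so it applies
# even when JAVA_LIBRARY_PATH is unset.
NUTCH_OPTS="$NUTCH_OPTS -Xmx$NUTCH_MAX_HEAP"
```

The stack trace shows the failure while appending to a StringBuilder in NGramProfile.create, so the limit that works will likely depend on the corpus size; 2048m may not be enough for every corpus.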


Comments (1)

盗心人 2024-12-07 05:23:48

Assuming you're developing on a Windows box, edit nutch.bat and add the following after the rem NUTCH_OPTS line:

set NUTCH_OPTS=%NUTCH_OPTS% -Xmx1024m

Obviously, keep the amount of RAM within your machine's physical limit. Note that Nutch can easily require 4 GB, depending on what you're doing with it.
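
For the Ubuntu case mentioned in the question, a rough sketch of checking physical memory before choosing an -Xmx value. It is Linux-only and reads /proc/meminfo; the "half of physical RAM" rule is an assumption made for illustration, not a Nutch recommendation:

```shell
# Sketch, Linux only: MemTotal in /proc/meminfo is reported in kB.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
total_mb=$((total_kb / 1024))

# Assumption for illustration: cap the JVM heap at half of physical RAM.
heap_mb=$((total_mb / 2))

echo "Physical RAM: ${total_mb} MB, suggested cap: -Xmx${heap_mb}m"
```

The resulting value can then be used in place of the fixed 1024m or 2048m figures above.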
