N-gram modeling with a Java HashMap


I need to model a collection of n-grams (sequences of n words) and their contexts (words that appear near the n-gram, along with their frequency). My idea was this:

public class Ngram {

    private String[] words;
    private HashMap<String, Integer> contextCount = new HashMap<String, Integer>();
}

Then, for the count of all the different n-grams, I use another HashMap, like

HashMap<String, Ngram> ngrams = new HashMap<String, Ngram>();

and I add to it while receiving text. The problem is that when the number of n-grams surpasses 10,000 or so, the JVM heap fills up (it's set to a max of 1.5 GB) and everything slows down really badly.

Is there a better way to do this, so as to avoid such memory consumption? Also, the contexts should be easily comparable between the n-grams, which I'm not sure is possible with my solution.
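For concreteness, here is a compilable sketch of the structure described above, showing one way it might be filled in while scanning text. The constructor, the addContextWord helper, and the space-joined key convention are assumptions added for illustration, not part of the original post.

import java.util.HashMap;

public class Ngram {

    private String[] words;                       // the n words of this n-gram
    private HashMap<String, Integer> contextCount =
            new HashMap<String, Integer>();       // context word -> frequency near this n-gram

    public Ngram(String[] words) {
        this.words = words;
    }

    // Record one more occurrence of a word seen near this n-gram.
    public void addContextWord(String word) {
        Integer old = contextCount.get(word);
        contextCount.put(word, old == null ? 1 : old + 1);
    }
}

Updating the outer map could then be a single helper, where windowWords and neighbourWord are assumed to come from whatever code scans the incoming text:

static void record(HashMap<String, Ngram> ngrams, String[] windowWords, String neighbourWord) {
    String key = String.join(" ", windowWords);   // the n words joined with spaces
    Ngram entry = ngrams.get(key);
    if (entry == null) {
        entry = new Ngram(windowWords);
        ngrams.put(key, entry);
    }
    entry.addContextWord(neighbourWord);
}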


2 Answers

听,心雨的声音 2024-11-12 22:10:56


You can make use of Hadoop MapReduce for a huge database (it is normally used for big data). Use a Mapper to split the input into n-grams, and a Combiner and a Reducer to do whatever you want to do with those n-grams.

Hadoop works with <key, value> pairs, much like the way you want to process things with a HashMap.

I guess this is something like a classification problem, so it is a good fit, but it does require a cluster.

If possible, start with Hadoop: The Definitive Guide (O'Reilly).
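To make that concrete, here is a minimal sketch of an n-gram counting job written against Hadoop's org.apache.hadoop.mapreduce API. The class names, the fixed n-gram order N, and the whitespace tokenization are assumptions made for illustration; the answer itself does not specify them.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NgramCount {

    // Emits (n-gram, 1) for every window of N consecutive tokens in a line.
    public static class NgramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int N = 3;                      // assumed n-gram order
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ngram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().trim().split("\\s+");
            for (int i = 0; i + N <= tokens.length; i++) {
                StringBuilder sb = new StringBuilder(tokens[i]);
                for (int j = 1; j < N; j++) {
                    sb.append(' ').append(tokens[i + j]);
                }
                ngram.set(sb.toString());
                context.write(ngram, ONE);
            }
        }
    }

    // Sums the counts for each n-gram; the same class doubles as the combiner.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram count");
        job.setJarByClass(NgramCount.class);
        job.setMapperClass(NgramMapper.class);
        job.setCombinerClass(SumReducer.class);   // partial sums on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the combiner pre-sums counts on each mapper before the shuffle, the amount of data sent across the cluster stays close to the number of distinct n-grams rather than the number of tokens.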

美胚控场 2024-11-12 22:10:56


Maybe you have already found a solution to your problem, but there is a very nice approach to large-scale language models in this paper:

Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap

http://acl.ldc.upenn.edu/D/D07/D07-1049.pdf
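The key idea in that paper is to store quantized n-gram statistics in a Bloom filter, so memory use is fixed up front instead of growing with the number of distinct n-grams. As a much simplified illustration of the same trade-off (not the paper's log-frequency encoding scheme), a small counting Bloom filter in Java might look like this:

import java.nio.charset.StandardCharsets;
import java.util.Random;

// Approximate n-gram counts in a fixed-size array of counters.
// Collisions can inflate an estimate but never lose a count.
public class CountingBloomFilter {

    private final int[] cells;     // one small counter per cell
    private final int numHashes;   // number of hash functions per key
    private final int[] seeds;

    public CountingBloomFilter(int numCells, int numHashes) {
        this.cells = new int[numCells];
        this.numHashes = numHashes;
        this.seeds = new int[numHashes];
        Random r = new Random(42);
        for (int i = 0; i < numHashes; i++) {
            seeds[i] = r.nextInt();
        }
    }

    // Simple seeded polynomial hash mapped onto a cell index.
    private int index(String key, int seed) {
        int h = seed;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return Math.floorMod(h, cells.length);
    }

    // Record one more occurrence of this n-gram.
    public void add(String ngram) {
        for (int i = 0; i < numHashes; i++) {
            cells[index(ngram, seeds[i])]++;
        }
    }

    // The true count is at most this value; collisions can only push it up.
    public int estimateCount(String ngram) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < numHashes; i++) {
            min = Math.min(min, cells[index(ngram, seeds[i])]);
        }
        return min;
    }
}

The memory footprint here is chosen at construction time (numCells counters), and no String keys are retained at all, which is what lets structures like this scale far beyond an in-memory HashMap at the price of approximate counts.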
