N-gram modeling with a Java HashMap
I need to model a collection of n-grams (sequences of n words) and their contexts (words that appear near the n-gram, along with their frequencies). My idea was this:
public class Ngram {
    private String[] words;
    private HashMap<String, Integer> contextCount = new HashMap<String, Integer>();
}
Then, for the counts of all the different n-grams, I use another HashMap, like
HashMap<String, Ngram> ngrams = new HashMap<String, Ngram>();
and I add to it while receiving text. The problem is, when the number of n-grams surpasses 10,000 or so, the JVM heap fills up (it's set to a max of 1.5 GB), and everything slows down badly.
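Roughly, the update loop looks something like the following self-contained sketch (the Ngram constructor, the addContext helper, and the one-word context window on each side are simplified assumptions for illustration, not the exact original code):

import java.util.HashMap;

class Ngram {
    private String[] words;
    private HashMap<String, Integer> contextCount = new HashMap<String, Integer>();

    Ngram(String[] words) {
        this.words = words;
    }

    // Increment the frequency of a word seen near this n-gram.
    void addContext(String word) {
        Integer old = contextCount.get(word);
        contextCount.put(word, old == null ? 1 : old + 1);
    }
}

class NgramCollector {
    private final int n;
    private final HashMap<String, Ngram> ngrams = new HashMap<String, Ngram>();

    NgramCollector(int n) {
        this.n = n;
    }

    // Slide a window of n tokens over the sentence, creating or reusing the
    // Ngram entry for each window and counting the immediately adjacent
    // words as its context.
    void addSentence(String[] tokens) {
        for (int i = 0; i + n <= tokens.length; i++) {
            String[] windowWords = new String[n];
            System.arraycopy(tokens, i, windowWords, 0, n);
            String key = join(windowWords);

            Ngram entry = ngrams.get(key);
            if (entry == null) {
                entry = new Ngram(windowWords);
                ngrams.put(key, entry);
            }
            if (i > 0) {
                entry.addContext(tokens[i - 1]);
            }
            if (i + n < tokens.length) {
                entry.addContext(tokens[i + n]);
            }
        }
    }

    private static String join(String[] words) {
        StringBuilder sb = new StringBuilder(words[0]);
        for (int i = 1; i < words.length; i++) {
            sb.append(' ').append(words[i]);
        }
        return sb.toString();
    }
}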
Is there a better way to do this, so as to avoid such memory consumption? Also, the contexts should be easily comparable between the n-grams, which I'm not sure is possible with my solution.
You can make use of Hadoop MapReduce for a huge database (normally used for big data). Use a mapper to split the input into n-grams, and a combiner and reducer to do whatever you want with those n-grams.
I guess it's something like classification, so it suits well, but it requires a cluster.
If possible, you'd better start with Hadoop: The Definitive Guide (O'Reilly publications).
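A minimal sketch of the kind of job this answer suggests, assuming the standard org.apache.hadoop.mapreduce API: the mapper emits each n-gram with a count of 1, and the same summing reducer is reused as the combiner. The class names, the n-gram order, and the whitespace tokenization are illustrative assumptions, not part of the answer.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NgramCountJob {
    static final int N = 3; // n-gram order, chosen arbitrarily for the example

    public static class NgramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit each n-gram in the line as a key with a count of 1.
            String[] tokens = value.toString().split("\\s+");
            for (int i = 0; i + N <= tokens.length; i++) {
                StringBuilder gram = new StringBuilder(tokens[i]);
                for (int j = 1; j < N; j++) {
                    gram.append(' ').append(tokens[i + j]);
                }
                context.write(new Text(gram.toString()), ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for one n-gram.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram count");
        job.setJarByClass(NgramCountJob.class);
        job.setMapperClass(NgramMapper.class);
        job.setCombinerClass(SumReducer.class); // combiner pre-aggregates on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}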
Maybe you already found the solution to your problem, but there is a very nice approach to large-scale language models in this paper:
Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap
http://acl.ldc.upenn.edu/D/D07/D07-1049.pdf
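For a rough idea of what the paper proposes, here is a simplified sketch of storing approximate n-gram counts in a Bloom-filter-style bit array instead of a HashMap, trading a small false-positive rate for a large memory saving. The hashing scheme and the plain (unquantized) count encoding below are simplifications assumed here, not the paper's actual construction (the paper quantizes counts logarithmically).

import java.util.BitSet;

public class BloomNgramCounts {
    private static final int MAX_COUNT = 1 << 20; // safety cap on count probes

    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomNgramCounts(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit position for a key via simple double hashing.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force the second hash to be odd
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Record that this n-gram was seen 'count' times by inserting the keys
    // "ngram#1" ... "ngram#count" into the filter.
    public void add(String ngram, int count) {
        for (int c = 1; c <= count; c++) {
            String key = ngram + "#" + c;
            for (int i = 0; i < numHashes; i++) {
                bits.set(position(key, i));
            }
        }
    }

    // Return an (over-)estimate of the count: the largest c such that
    // "ngram#c" appears to be in the filter.
    public int estimateCount(String ngram) {
        int c = 0;
        while (c < MAX_COUNT && contains(ngram + "#" + (c + 1))) {
            c++;
        }
        return c;
    }

    private boolean contains(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}

The memory use is fixed up front by numBits, independently of how many distinct n-grams are inserted, which is the main reason this approach scales where a per-entry HashMap does not; the cost is that estimateCount can only over-estimate, never under-estimate, a count.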