N-gram modeling with a Java HashMap
I need to model a collection of n-grams (sequences of n words) and their contexts (words that appear near the n-gram, along with their frequencies). My idea was this:
public class Ngram {
    private String[] words;
    private HashMap<String, Integer> contextCount = new HashMap<String, Integer>();
}
Then, for the counts of all the different n-grams, I use another HashMap, like
HashMap<String, Ngram> ngrams = new HashMap<String, Ngram>();
and I add to it while receiving text. The problem is, when the number of n-grams surpasses 10,000 or so, the JVM heap fills up (it's set to a max of 1.5 GB), and everything slows down badly.
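Roughly, the update loop looks something like the following self-contained sketch (the Ngram constructor, the addContext helper, and the one-word context window on each side are simplified assumptions for illustration, not the exact original code):

import java.util.HashMap;

class Ngram {
    private String[] words;
    private HashMap<String, Integer> contextCount = new HashMap<String, Integer>();

    Ngram(String[] words) {
        this.words = words;
    }

    // Increment the frequency of a word seen near this n-gram.
    void addContext(String word) {
        Integer old = contextCount.get(word);
        contextCount.put(word, old == null ? 1 : old + 1);
    }
}

class NgramCollector {
    private final int n;
    private final HashMap<String, Ngram> ngrams = new HashMap<String, Ngram>();

    NgramCollector(int n) {
        this.n = n;
    }

    // Slide a window of n tokens over the sentence, creating or reusing the
    // Ngram entry for each window and counting the immediately adjacent
    // words as its context.
    void addSentence(String[] tokens) {
        for (int i = 0; i + n <= tokens.length; i++) {
            String[] windowWords = new String[n];
            System.arraycopy(tokens, i, windowWords, 0, n);
            String key = join(windowWords);

            Ngram entry = ngrams.get(key);
            if (entry == null) {
                entry = new Ngram(windowWords);
                ngrams.put(key, entry);
            }
            if (i > 0) {
                entry.addContext(tokens[i - 1]);
            }
            if (i + n < tokens.length) {
                entry.addContext(tokens[i + n]);
            }
        }
    }

    private static String join(String[] words) {
        StringBuilder sb = new StringBuilder(words[0]);
        for (int i = 1; i < words.length; i++) {
            sb.append(' ').append(words[i]);
        }
        return sb.toString();
    }
}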
Is there a better way to do this, so as to avoid such memory consumption? Also, the contexts should be easily comparable between the n-grams, which I'm not sure is possible with my solution.
You can make use of Hadoop MapReduce for a huge database (normally used for big data). Use a mapper to split the input into n-grams, and a combiner and reducer to do whatever you want with those n-grams.
I guess it's something like classification, so it suits well, but it requires a cluster.
If possible, you'd better start with Hadoop: The Definitive Guide (O'Reilly publications).
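A minimal sketch of the kind of job this answer suggests, assuming the standard org.apache.hadoop.mapreduce API: the mapper emits each n-gram with a count of 1, and the same summing reducer is reused as the combiner. The class names, the n-gram order, and the whitespace tokenization are illustrative assumptions, not part of the answer.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NgramCountJob {
    static final int N = 3; // n-gram order, chosen arbitrarily for the example

    public static class NgramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit each n-gram in the line as a key with a count of 1.
            String[] tokens = value.toString().split("\\s+");
            for (int i = 0; i + N <= tokens.length; i++) {
                StringBuilder gram = new StringBuilder(tokens[i]);
                for (int j = 1; j < N; j++) {
                    gram.append(' ').append(tokens[i + j]);
                }
                context.write(new Text(gram.toString()), ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for one n-gram.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram count");
        job.setJarByClass(NgramCountJob.class);
        job.setMapperClass(NgramMapper.class);
        job.setCombinerClass(SumReducer.class); // combiner pre-aggregates on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}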
Maybe you already found the solution to your problem, but there is a very nice approach to large-scale language models in this paper:
Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap
http://acl.ldc.upenn.edu/D/D07/D07-1049.pdf
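For a rough idea of what the paper proposes, here is a simplified sketch of storing approximate n-gram counts in a Bloom-filter-style bit array instead of a HashMap, trading a small false-positive rate for a large memory saving. The hashing scheme and the plain (unquantized) count encoding below are simplifications assumed here, not the paper's actual construction (the paper quantizes counts logarithmically).

import java.util.BitSet;

public class BloomNgramCounts {
    private static final int MAX_COUNT = 1 << 20; // safety cap on count probes

    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomNgramCounts(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit position for a key via simple double hashing.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force the second hash to be odd
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Record that this n-gram was seen 'count' times by inserting the keys
    // "ngram#1" ... "ngram#count" into the filter.
    public void add(String ngram, int count) {
        for (int c = 1; c <= count; c++) {
            String key = ngram + "#" + c;
            for (int i = 0; i < numHashes; i++) {
                bits.set(position(key, i));
            }
        }
    }

    // Return an (over-)estimate of the count: the largest c such that
    // "ngram#c" appears to be in the filter.
    public int estimateCount(String ngram) {
        int c = 0;
        while (c < MAX_COUNT && contains(ngram + "#" + (c + 1))) {
            c++;
        }
        return c;
    }

    private boolean contains(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}

The memory use is fixed up front by numBits, independently of how many distinct n-grams are inserted, which is the main reason this approach scales where a per-entry HashMap does not; the cost is that estimateCount can only over-estimate, never under-estimate, a count.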