C++ Qt WordCount and large data sets
I want to count word occurrences in a set of plain text files, just like in the Qt example here: http://doc.trolltech.com/4.5/qtconcurrent-wordcount-main-cpp.html
The problem is that I need to process a very large number of plain text files, so the result stored in a QMap cannot fit into memory.
I googled for external-memory (file-based) merge sort algorithms, but I would rather not implement one myself. So I want to divide the result set into portions small enough that each fits into memory, store those portions in files on disk, then call a magic function mergeSort(QList, result_file) and have the final result in result_file.
Does anyone know of a Qt-compatible implementation of this algorithm?
In short, I'm looking for an analog of Python's heapq.merge (http://docs.python.org/library/heapq.html#heapq.merge), but for Qt containers.
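For what it's worth, the two steps described above (write sorted runs, then k-way merge them) can be sketched in plain standard C++ with a `std::priority_queue`, which is the same idea as `heapq.merge`. This is only a minimal sketch, not a known Qt implementation: the function names `writeRun` and `mergeRuns` and the `"word count"` line format are my own, and the standard-library streams stand in for `QFile`/`QTextStream` (the merge logic itself is container-agnostic, so the same structure would work with `QMap`).

```cpp
#include <fstream>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Write one in-memory portion to disk as a sorted run of
// "word count" lines. A std::map is already sorted by key.
void writeRun(const std::map<std::string, long long>& portion,
              const std::string& runFile)
{
    std::ofstream out(runFile);
    for (const auto& kv : portion)
        out << kv.first << ' ' << kv.second << '\n';
}

// One pending entry from one run file.
struct Entry {
    std::string word;
    long long count;
    std::size_t run;  // index of the run file this entry came from
};

// Order the heap by word, smallest first (min-heap).
struct ByWordGreater {
    bool operator()(const Entry& a, const Entry& b) const {
        return a.word > b.word;
    }
};

// k-way merge of sorted run files into resultFile, summing the
// counts of words that occur in more than one run.
void mergeRuns(const std::vector<std::string>& runFiles,
               const std::string& resultFile)
{
    std::vector<std::ifstream> runs;
    for (const auto& f : runFiles)
        runs.emplace_back(f);

    // Seed the heap with the first entry of every run.
    std::priority_queue<Entry, std::vector<Entry>, ByWordGreater> heap;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        Entry e; e.run = i;
        if (runs[i] >> e.word >> e.count)
            heap.push(e);
    }

    std::ofstream out(resultFile);
    std::string current;
    long long total = 0;
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        if (e.word != current) {      // new word: flush the previous one
            if (!current.empty())
                out << current << ' ' << total << '\n';
            current = e.word;
            total = 0;
        }
        total += e.count;
        Entry next; next.run = e.run; // refill from the run we consumed
        if (runs[e.run] >> next.word >> next.count)
            heap.push(next);
    }
    if (!current.empty())
        out << current << ' ' << total << '\n';
}
```

Only one entry per run lives in memory at a time, so memory use is proportional to the number of runs, not to the total number of distinct words.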
Comments (2)
You might want to check out this one:
http://stxxl.sourceforge.net/
It's not exactly what you are looking for (close enough, though), but I guess you will not find exactly what you want working with Qt lists. Since you are implementing the algorithm that creates this list, changing its type shouldn't be a problem. As far as I remember, you can use the standard STL sorting algorithms on those lists. The only remaining concern is performance.
I presume that the map contains the association between each word and its number of occurrences. In that case, why do you say you have such significant memory consumption? How many distinct words and word forms could you have, and what is the average memory consumption per word?
Considering 1,000,000 words, with 1 KB of memory consumption per word (including the word text and the QMap-specific storage), that would lead to (approximately) 1 GB of memory, which... doesn't seem like much to me.