Fastest, most memory-efficient way to access a large amount of data in Python?
So I have a text file full of words with corresponding vectors. The file is ~4GB (1.9m entries), and I need to quickly access the vector for a given word using python. I have some code that iterates through this file and retrieves the word as a string and the vector as a numpy array for each line. At the moment I'm using this to generate a dictionary, which I'm then pickling.
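For reference, this is roughly what my current code does (a simplified sketch; the file names are placeholders, and I'm assuming here that each line is a word followed by space-separated floats):

```python
import pickle
import numpy as np

# Build a word -> vector dictionary from the text file.
# Assumes each line looks like: "word 0.12 -0.98 ... 0.05"
vectors = {}
with open("vectors.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split()
        word = parts[0]
        vectors[word] = np.array(parts[1:], dtype=np.float32)

# Persist the whole dictionary for later runs.
with open("vectors.pkl", "wb") as f:
    pickle.dump(vectors, f, protocol=pickle.HIGHEST_PROTOCOL)
```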
When the program is run, the pickle file is loaded back into Python as a dictionary so I can query it, which works fine and is reasonably fast once it's loaded.
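The lookup side is then just unpickling and indexing into the dict, roughly:

```python
import pickle

# Load the previously pickled dictionary back into memory.
with open("vectors.pkl", "rb") as f:
    vectors = pickle.load(f)

# Fast O(1) lookup once the dictionary is in memory.
vec = vectors.get("example")
```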
However, it's using ~3GB of RAM, and around 4.5GB while initially generating and pickling the dict, which isn't ideal. So would using e.g. sqlite improve anything? I've done a lot of digging on this, but other answers don't quite fit what I'm trying to do. It seems like a dictionary would be faster for smaller amounts of data, but it's unclear for a file of this size.
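If it helps, the sqlite approach I have in mind would look something like this (just a sketch, not something I've tried; the table and file names are made up, and the vectors would be stored as raw float32 blobs):

```python
import sqlite3
import numpy as np

conn = sqlite3.connect("vectors.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, vec BLOB)")

def insert(word, vec):
    # Store each vector as raw float32 bytes keyed by the word.
    conn.execute("INSERT OR REPLACE INTO vectors VALUES (?, ?)",
                 (word, np.asarray(vec, dtype=np.float32).tobytes()))

def lookup(word):
    # Fetch a single vector without holding the whole dataset in RAM.
    row = conn.execute("SELECT vec FROM vectors WHERE word = ?", (word,)).fetchone()
    return np.frombuffer(row[0], dtype=np.float32) if row else None

# After bulk-inserting, conn.commit() is needed so the data is persisted.
```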
Am I right to use a dictionary, or should I use SQL instead? Or a completely different thing?
Thanks for any help