Alternatives to keeping large lists in memory (Python)

Posted 2024-08-16 04:47:52

Comments (9)

眼睛会笑 2024-08-23 04:47:52

If your "numbers" are simple enough (signed or unsigned integers of up to 4 bytes each, or floats of 4 or 8 bytes each), I recommend the standard library array module as the best way to keep a few million of them in memory (the "tip" of your "virtual array"), with a binary file (open for binary read/write) backing the rest of the structure on disk. array.array has very fast fromfile and tofile methods to move data back and forth.

I.e., basically, assuming for example unsigned-long numbers, something like:

import array
import os

# no more than 100 million items in memory at a time
MAXINMEM = int(1e8)

class bigarray(object):
  def __init__(self):
    # binary read/write mode; plain 'w+' would open the file in text mode
    self.f = open('afile.dat', 'w+b')
    self.a = array.array('L')
  def append(self, n):
    self.a.append(n)
    # spill exactly MAXINMEM items, so pop() can read back whole chunks
    if len(self.a) >= MAXINMEM:
      self.a.tofile(self.f)
      del self.a[:]
  def pop(self):
    if not len(self.a):
      # the in-memory tip is empty: reload the last chunk from disk
      try: self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
      except IOError: return self.a.pop()  # ensure normal IndexError &c
      try: self.a.fromfile(self.f, MAXINMEM)
      except EOFError: pass
      # drop the chunk just read from the end of the file
      self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
      self.f.truncate()
    return self.a.pop()

Of course you can add other methods as necessary (e.g. keep track of the overall length, add extend, whatever), but if pop and append are indeed all you need this should serve.
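
To see the spill-and-reload behaviour without writing 100 million items, here is a minimal sketch of the same idea with a configurable chunk size and a length counter. The class name SpillStack, the chunk parameter and the on_disk attribute are my own names for illustration, not part of the array module.

```python
import array
import os
import tempfile


class SpillStack:
    """Stack of unsigned ints that keeps fewer than `chunk` items in memory."""

    def __init__(self, path, chunk=4, typecode='L'):
        self.chunk = chunk
        self.f = open(path, 'w+b')
        self.a = array.array(typecode)
        self.on_disk = 0  # number of items currently stored in the file

    def __len__(self):
        return self.on_disk + len(self.a)

    def append(self, n):
        self.a.append(n)
        if len(self.a) >= self.chunk:
            # spill one full chunk to the end of the file
            self.f.seek(0, os.SEEK_END)
            self.a.tofile(self.f)
            self.on_disk += len(self.a)
            del self.a[:]

    def pop(self):
        if not self.a and self.on_disk:
            # the file always holds a whole number of chunks
            start = (self.on_disk - self.chunk) * self.a.itemsize
            self.f.seek(start)
            self.a.fromfile(self.f, self.chunk)
            self.f.truncate(start)
            self.on_disk -= self.chunk
        return self.a.pop()  # raises IndexError when the stack is empty

    def close(self):
        self.f.close()


path = os.path.join(tempfile.mkdtemp(), 'stack.dat')
stack = SpillStack(path, chunk=4)
for i in range(10):
    stack.append(i)
popped = [stack.pop() for _ in range(10)]
stack.close()
```

Tracking on_disk explicitly avoids relying on a failed seek to detect an empty file.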

柠檬心 2024-08-23 04:47:52

There are probably dozens of ways to store your list data in a file instead of in memory. How you choose to do it will depend entirely on what sort of operations you need to perform on the data. Do you need random access to the Nth element? Do you need to iterate over all elements? Will you be searching for elements that match certain criteria? What form do the list elements take? Will you only be inserting at the end of the list, or also in the middle? Is there metadata you can keep in memory with the bulk of the items on disk? And so on and so on.

One possibility is to structure your data relationally, and store it in a SQLite database.
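
As a rough sketch of the SQLite route, assuming the list holds integers and you only need append and pop: the table name items, its columns, and the helper functions below are made up for illustration.

```python
import os
import sqlite3
import tempfile

# hypothetical database file; any writable path works
db_path = os.path.join(tempfile.mkdtemp(), 'items.db')
conn = sqlite3.connect(db_path)
conn.execute('CREATE TABLE items (pos INTEGER PRIMARY KEY, value INTEGER)')


def append(conn, value):
    # pos is assigned automatically, one past the current maximum
    conn.execute('INSERT INTO items (value) VALUES (?)', (value,))


def pop(conn):
    row = conn.execute(
        'SELECT pos, value FROM items ORDER BY pos DESC LIMIT 1').fetchone()
    if row is None:
        raise IndexError('pop from empty list')
    conn.execute('DELETE FROM items WHERE pos = ?', (row[0],))
    return row[1]


def length(conn):
    return conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]


for v in (10, 20, 30):
    append(conn, v)
conn.commit()
last = pop(conn)
remaining = length(conn)
conn.commit()
```

Because pos is the primary key, finding the last row is an index lookup rather than a table scan.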

薔薇婲 2024-08-23 04:47:52

The answer is very much "it depends".

What are you storing in the lists? Strings? Integers? Objects?

How often is the list written to compared with being read? Are items only appended on the end, or can entries be modified or inserted in the middle?

If you are only appending to the end then writing to a flat file may be the simplest thing that could possibly work.

If you are storing objects of variable size such as strings then maybe keep an in-memory index of the start of each string, so you can read it quickly.
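
A minimal sketch of that in-memory index idea, assuming the strings are stored UTF-8 encoded back to back in one file (the class name StringFile is hypothetical):

```python
import os
import tempfile


class StringFile:
    """Append-only list of strings on disk with an in-memory offset index."""

    def __init__(self, path):
        self.f = open(path, 'w+b')
        self.offsets = []  # byte offset where each string starts
        self.end = 0       # byte offset just past the last string

    def __len__(self):
        return len(self.offsets)

    def append(self, s):
        data = s.encode('utf-8')
        self.f.seek(self.end)
        self.f.write(data)
        self.offsets.append(self.end)
        self.end += len(data)

    def __getitem__(self, i):
        n = len(self.offsets)
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError('index out of range')
        start = self.offsets[i]
        # a string ends where the next one starts (or at the end of the data)
        stop = self.offsets[i + 1] if i + 1 < n else self.end
        self.f.seek(start)
        return self.f.read(stop - start).decode('utf-8')

    def close(self):
        self.f.close()


sf = StringFile(os.path.join(tempfile.mkdtemp(), 'strings.dat'))
for word in ('alpha', 'beta', 'gamma'):
    sf.append(word)
second = sf[1]
```

Only one integer per string stays in memory; the string data itself lives in the file.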

If you want dictionary behaviour then look at the db modules - dbm, gdbm, bsddb, etc.
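
For instance, with the standard library dbm package (in Python 3, gdbm is available as dbm.gnu where installed, and bsddb is no longer in the standard library). This is only a sketch; the file name store and the keys are made up.

```python
import dbm
import os
import tempfile

# hypothetical location for the database file(s)
path = os.path.join(tempfile.mkdtemp(), 'store')

# 'c' opens the database for reading and writing, creating it if needed
with dbm.open(path, 'c') as db:
    db[b'answer'] = b'42'
    db[b'name'] = b'bigarray'

# reopen read-only; keys and values come back as bytes
with dbm.open(path, 'r') as db:
    answer = db[b'answer']
    has_name = b'name' in db
```

Keys and values are plain bytes, so anything richer has to be serialized first.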

If you want random-access writing, then a SQL database may be better.

Whatever you do, going to disk is going to be orders of magnitude slower than in-memory, but without knowing how the data is going to be used it is impossible to be more specific.

edit:
From your updated requirements I would go with a flat file and keep an in-memory buffer of the last N elements.

萌酱 2024-08-23 04:47:52

Well, if you are looking for speed and your data is numerical in nature, you could consider using numpy and PyTables or h5py. From what I remember, the interface is not as nice as simple lists, but the scalability is fantastic!

逆流 2024-08-23 04:47:52

Did you check the shelve Python module, which is based on pickle?

http://docs.python.org/library/shelve.html
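
A minimal sketch of how shelve is used; the file name and keys here are made up for illustration:

```python
import os
import shelve
import tempfile

# hypothetical location for the shelf file(s)
path = os.path.join(tempfile.mkdtemp(), 'myshelf')

# a shelf behaves like a dict with string keys and picklable values
with shelve.open(path) as shelf:
    shelf['primes'] = [2, 3, 5, 7]
    shelf['config'] = {'chunk': 1000}

# reopen later: the values are unpickled on access
with shelve.open(path) as shelf:
    primes = shelf['primes']
    keys = sorted(shelf.keys())
```

Note that mutating a stored value in place (for example shelf['primes'].append(11)) is not written back unless the shelf is opened with writeback=True; otherwise reassign the whole value.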

夏夜暖风 2024-08-23 04:47:52

Modern operating systems will handle this for you without you having to worry about it. It's called virtual memory.

跨年 2024-08-23 04:47:52

You might want to consider a different kind of structure: not a list, but figuring out how to do (your task) with a generator or a custom iterator.
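
As a toy illustration (the task here, summing squares, is just a stand-in for yours), a generator produces values one at a time, so the full sequence never has to exist in memory:

```python
def squares(n):
    """Yield the squares of 0..n-1 one at a time."""
    for i in range(n):
        yield i * i


# consumes a million values without ever building a million-item list
total = sum(squares(1_000_000))

# generators can be chained; each stage still handles one item at a time
even_squares = (x for x in squares(10) if x % 2 == 0)
first_three = [next(even_squares) for _ in range(3)]
```

This only helps if the task can be done in a single pass, without random access to earlier items.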

抚笙 2024-08-23 04:47:52

You can try blist:
https://pypi.python.org/pypi/blist/

The blist is a drop-in replacement for the Python list that provides better performance when modifying large lists.

是你 2024-08-23 04:47:52

What about a document-oriented database?
There are several alternatives; I think the best-known one currently is CouchDB, but you can also go for Tokyo Cabinet or MongoDB. The last one has the advantage of Python bindings that come directly from the main project, without requiring any additional module.
