Large dataframes over large binary files with Python, pandas, dask, numpy

I have timeseries data in sequential (packed C-struct) format in very large files. Each structure contains K fields of different types, in some order; the file is essentially a row-wise array of these structures. I would like to be able to mmap the file and map each field to a numpy array (or another form) that recognizes a stride (the size of the struct), so the fields can be aliased as columns in a dataframe.

An example struct might be:

struct {
   int32_t a;
   double b;
   int16_t c;
}

Such a file of records could be generated with Python as:

from struct import pack

# '<idh' = little-endian int32, float64, int16: 14 bytes per record, no padding
db = open("binarydb", "wb")
for i in range(1, 1000):
    packed = pack('<idh', i, i * 3.14, i * 2)
    db.write(packed)
db.close()

The question is then how to view such a file efficiently as a dataframe. If we assume the file is hundreds of millions of rows long, one would need a mem-mapped solution.

Using memmap, how can I map a numpy array (or alternative array structure) to the sequence of integers for column a? It seems to me that one would need to be able to indicate a stride (14 bytes, since pack('<idh', ...) writes 4 + 8 + 2 bytes per record with no padding) and an offset (0 in this case) for the int32 series "a", an offset of 4 for the float64 series "b", and an offset of 12 for the int16 series "c".

I have seen that one can easily create a numpy array against an mmap'ed file if the file contains a single dtype. Is there a way to pull out the different series in this file by indicating a type, offset, and stride? With this approach one could present mmapped columns to pandas or another dataframe implementation.
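For illustration, this kind of typed, offset, strided view can be expressed in NumPy with a structured dtype that names a single field and sets an explicit offset and itemsize (a sketch based on the 14-byte packed layout above, not part of the original question):

import numpy as np

# One dtype per column: a single named field at a given byte offset, with
# itemsize equal to the full 14-byte record, i.e. the stride.
col_a = np.dtype({'names': ['a'], 'formats': ['<i4'], 'offsets': [0], 'itemsize': 14})
col_b = np.dtype({'names': ['b'], 'formats': ['<f8'], 'offsets': [4], 'itemsize': 14})
col_c = np.dtype({'names': ['c'], 'formats': ['<i2'], 'offsets': [12], 'itemsize': 14})

a = np.memmap("binarydb", dtype=col_a, mode='r')['a']  # int32 view, stride 14, offset 0
b = np.memmap("binarydb", dtype=col_b, mode='r')['b']  # float64 view, stride 14, offset 4
c = np.memmap("binarydb", dtype=col_c, mode='r')['c']  # int16 view, stride 14, offset 12

Field access on a structured array returns a view, so no data is copied; each of a, b, c reads through the mmap with the record size as its stride.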

Even better, is there a simple way to integrate a custom mem-mapped format into Dask, so as to get the benefits of lazy paging into the file?
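As a sketch of one common Dask pattern (not from the original post; the chunk size and helper name are arbitrary), the file can be split into delayed per-partition reads, so each partition opens the memmap and only faults in its own pages:

import os
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

dt = np.dtype([('a', '<i4'), ('b', '<f8'), ('c', '<i2')])   # 14-byte records
nrows = os.path.getsize("binarydb") // dt.itemsize
chunk_rows = 1_000_000  # arbitrary partition size

@delayed
def load_chunk(start, stop):
    # Each task opens its own memmap, so only the pages backing
    # rows [start, stop) are ever touched.
    mm = np.memmap("binarydb", dtype=dt, mode='r')
    return pd.DataFrame(mm[start:stop])

parts = [load_chunk(i, min(i + chunk_rows, nrows))
         for i in range(0, nrows, chunk_rows)]
meta = pd.DataFrame(np.empty(0, dtype=dt))  # schema, so dask needn't inspect data
ddf = dd.from_delayed(parts, meta=meta)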

2 Answers

Answer 1 (Jérôme Richard):

You can use numpy.memmap to do that. Since your records are not a single native type, you need to use a structured NumPy dtype. Note that you need the size of the array ahead of time, since NumPy supports fixed-size arrays rather than unbounded streams.

import numpy as np

size = 999

# Matches the packed '<idh' records: itemsize is 4 + 8 + 2 = 14 bytes, no
# padding; field byte order is native, which matches '<' on little-endian machines.
datatype = np.dtype([('a', np.int32), ('b', np.float64), ('c', np.int16)])

# The final memory-mapped array ('write' is numpy's long name for mode 'w+')
data = np.memmap("binarydb", dtype=datatype, mode='write', shape=size)

for i in range(1, 1 + size):
    data[i - 1]['a'] = i      # i - 1: valid indices run 0..size-1
    data[i - 1]['b'] = i * 3.14
    data[i - 1]['c'] = i * 2

Note that vectorized operations are generally much faster than direct indexing in NumPy. Numba can also be used to speed up direct indexing if the operation cannot be vectorized.
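For example (a sketch, reusing size and data from the snippet above), the per-row loop can be replaced with one whole-column assignment per field:

import numpy as np

i = np.arange(1, size + 1)   # all row values at once
data['a'] = i                # one vectorized write per column
data['b'] = i * 3.14
data['c'] = i * 2            # numpy casts int64 -> int16 on assignment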

Note that in NumPy a memory-mapped array can be flushed, but there is not yet a way to explicitly close it.

Answer 2:

Extrapolating from @Jérôme Richard's answer above, here is code to read from a binary sequence of records:

import numpy as np

size = 999
# 'c' must be np.int16 to match the '<idh' records written above; np.int32
# would give a 16-byte itemsize and misread the file.
datatype = np.dtype([('a', np.int32), ('b', np.float64), ('c', np.int16)])

# The final memory-mapped array ('readonly' is numpy's long name for mode 'r')
data = np.memmap("binarydb", dtype=datatype, mode='readonly', shape=size)

Each series can then be pulled out as:

data['a']
data['b']
data['c']
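If a pandas frame is wanted on top of these, note that building one will generally copy the columns into memory (a hypothetical snippet; for true out-of-core access, keep working with the memmap views directly):

import pandas as pd

# pandas copies the strided memmap views into contiguous columns here.
df = pd.DataFrame({name: data[name] for name in data.dtype.names})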