Memory-efficient lookup table of DataFrames in Python

Posted on 2025-01-12 20:12:40

In a previous question I asked, a responder suggested I organize my data as a DataFrame of DataFrames.

import pandas as pd

df = pd.DataFrame({'Form': {0:'SUV', 1:'Truck', 2:'SUV', 3:'Sedan', 4:'SUV', 5:'Truck'},
                   'Make': {0:'Ford', 1:'Toyota', 2:'Honda', 3:'Ford', 4:'Honda', 5:'Toyota'},
                   'Color': {0:'White', 1:'Black', 2:'Gray', 3:'White', 4:'White', 5:'Black'},
                   'Driver age': {0:25, 1:37, 2:21, 3:54, 4:50, 5:67},
                   'Data': {0: pd.DataFrame([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]]), 
                            1: pd.DataFrame([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]]), 
                            2: pd.DataFrame([[0, 0], [0.24, 1.2], [1.3, 1.6], [4.1, 3.9]]), 
                            3: pd.DataFrame([[0, 0], [0.45, 1.6], [1.8, 1.8], [4.2, 4.6]]), 
                            4: pd.DataFrame([[0, 0], [0.85, 1.9], [1.5, 1.7], [4.5, 4.3]]), 
                            5: pd.DataFrame([[0, 0], [0.35, 1.8], [1.5, 1.8], [4.6, 4.1]])} })

This DataFrame of DataFrames lets me conditionally select groups of data, e.g. df[(df['Form'] == 'SUV') & (df['Driver age'] <= 40)]['Data']. The trouble is that when each row's data is itself a large .csv, the whole thing becomes hard to load into memory.
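As a minimal sketch of that kind of selection: the frame below is a trimmed stand-in for the one above, with short lists in place of the nested DataFrames (an assumption made to keep the example small). pandas combines boolean masks with the element-wise & operator, each comparison in its own parentheses, and the column names must match the frame ('Form', 'Driver age').

```python
import pandas as pd

# Trimmed stand-in for the frame above; 'Data' holds short lists here
# instead of nested DataFrames.
df = pd.DataFrame({
    'Form': ['SUV', 'Truck', 'SUV', 'Sedan', 'SUV', 'Truck'],
    'Driver age': [25, 37, 21, 54, 50, 67],
    'Data': [[0.25, 1.7], [0.15, 1.3], [0.24, 1.2],
             [0.45, 1.6], [0.85, 1.9], [0.35, 1.8]],
})

# Element-wise AND of two boolean masks; the Python keyword `and`
# would raise a ValueError on Series operands.
mask = (df['Form'] == 'SUV') & (df['Driver age'] <= 40)
selected = df.loc[mask, 'Data']

print(list(selected.index))  # [0, 2]
```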

I'm looking for a module like h5py that can "stream"/read in specific portions of data (which permits specifying a key, e.g. df = pd.read_hdf('large_data.hdf', 'SUV-Ford-White-25')). Rather than a nested dictionary, though, I would prefer a table that permits filtering, e.g. df = module.read(large_data.some_ext, Form == 'SUV', 20 <= age <= 40). Does xarray or pandas have something built in for this?

手长情犹 2025-01-19 20:12:40

Like h5py, PyTables (aka tables) can also create and read HDF5 files. Pandas uses PyTables "under the hood" to create and read HDF5 files. PyTables has some useful search features that do exactly what you want. For completeness, I included a brief summary at the end of this answer that compares the two packages.
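Because pandas sits on top of PyTables, a filtered read is also possible from pandas itself. The following is only a sketch under stated assumptions: the nested data is flattened into a long table (one row per data point, with the descriptive columns repeated), PyTables is installed, and the file name, key, and the x/y column names are made up for illustration. It relies on to_hdf with format='table' and data_columns, and on the where argument of pd.read_hdf.

```python
import os
import tempfile

import pandas as pd

# Long-form layout (an assumption, not the asker's layout): one row per
# data point, with the descriptive columns repeated on every row.
records = [
    ('SUV', 'Ford', 25, [[0, 0], [0.25, 1.7]]),
    ('Truck', 'Toyota', 37, [[0, 0], [0.15, 1.3]]),
    ('SUV', 'Honda', 21, [[0, 0], [0.24, 1.2]]),
    ('SUV', 'Honda', 50, [[0, 0], [0.85, 1.9]]),
]
rows = [
    {'Form': form, 'Make': make, 'Driver_age': age, 'x': x, 'y': y}
    for form, make, age, data in records
    for x, y in data
]
long_df = pd.DataFrame(rows)

path = os.path.join(tempfile.mkdtemp(), 'cars_long.h5')  # hypothetical file name

# format='table' makes the stored frame queryable; data_columns lists the
# columns that may appear in a `where` expression.
long_df.to_hdf(path, key='cars', format='table',
               data_columns=['Form', 'Make', 'Driver_age'])

# Only the rows matching the condition are returned.
subset = pd.read_hdf(
    path, key='cars',
    where="(Form == 'SUV') & (Driver_age >= 20) & (Driver_age <= 40)",
)
print(subset)
```

The trade-off of this layout is that the descriptive columns are repeated for every data point, which costs disk space but keeps everything in one queryable table.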

Here is an example I created to demonstrate the search behavior using your dataframe (dictionary) data.

To create the HDF5 file:
Note: most of the "work" in creating the HDF5 file is (re)organizing your dictionary data into a NumPy recarray. The process can be simplified if the data structure is modified (shifting dictionary key/value levels), assuming the structure isn't fixed yet.
Summary of steps:

1. Create a np.dtype that defines the fields (columns) of data.
2. Determine the number of recarray rows by counting the dictionary items associated with each primary key.
3. Create a recarray of zeros using 1 and 2 above.
4. Loop through the dictionary and map keys and values to the appropriate row and field (column) name.

Code below:

import tables as tb
import numpy as np

data_dict = {'Form': {0:'SUV', 1:'Truck', 2:'SUV', 3:'Sedan', 4:'SUV', 5:'Truck'},
                   'Make': {0:'Ford', 1:'Toyota', 2:'Honda', 3:'Ford', 4:'Honda', 5:'Toyota'},
                   'Color': {0:'White', 1:'Black', 2:'Gray', 3:'White', 4:'White', 5:'Black'},
                   'Driver_age': {0:25, 1:37, 2:21, 3:54, 4:50, 5:67},
                   'Data': {0: np.array([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]]), 
                            1: np.array([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]]), 
                            2: np.array([[0, 0], [0.24, 1.2], [1.3, 1.6], [4.1, 3.9]]), 
                            3: np.array([[0, 0], [0.45, 1.6], [1.8, 1.8], [4.2, 4.6]]), 
                            4: np.array([[0, 0], [0.85, 1.9], [1.5, 1.7], [4.5, 4.3]]), 
                            5: np.array([[0, 0], [0.35, 1.8], [1.5, 1.8], [4.6, 4.1]])} }

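# One field per column. 'Driver_age' uses an underscore so the name is a valid
# identifier and can be used directly in query conditions.
recarr_dt = np.dtype([('Form', 'S10'), ('Make', 'S10'), ('Color', 'S10'),
                      ('Driver_age', int), ('Data', float, (4, 2))])

# Number of rows = largest number of items under any top-level key.
nrows = max(len(d) for d in data_dict.values())

# Zero-filled structured array, one row per record.
recarr = np.zeros(shape=(nrows,), dtype=recarr_dt)

# Map each inner key to a row and each outer key to a field (column).
for field, values in data_dict.items():
    for row, value in values.items():
        recarr[row][field] = value

# Write the array to a new HDF5 file as the table '/test'.
with tb.open_file('SO_71388372.h5', 'w') as h5w:
    h5w.create_table('/', 'test', obj=recarr)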
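The note above says the restructuring gets simpler if the data is organized differently. As a sketch of one such layout (an assumption about how the data could be rearranged, not the asker's actual structure): if each record is kept as a tuple with its fields in dtype order, NumPy can build the structured array in a single call, with no zero-filled array and no nested loops. Only the first three records are shown.

```python
import numpy as np

recarr_dt = np.dtype([('Form', 'S10'), ('Make', 'S10'), ('Color', 'S10'),
                      ('Driver_age', int), ('Data', float, (4, 2))])

# Hypothetical row-wise layout: one tuple per record, fields in dtype order.
records = [
    ('SUV', 'Ford', 'White', 25,
     [[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]]),
    ('Truck', 'Toyota', 'Black', 37,
     [[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]]),
    ('SUV', 'Honda', 'Gray', 21,
     [[0, 0], [0.24, 1.2], [1.3, 1.6], [4.1, 3.9]]),
]

# Each tuple becomes one row; the nested list fills the (4, 2) 'Data' field.
recarr = np.array(records, dtype=recarr_dt)

print(recarr.shape)          # (3,)
print(recarr['Data'].shape)  # (3, 4, 2)
```

The resulting array can be passed to create_table(..., obj=recarr) in the same way as the one built with the loops.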

To open and search the HDF5 file:
This example demonstrates 2 searches using the Table.read_where(condition) method. It shows the syntax for multiple search conditions. Some things to watch for:

- Parentheses are required around each condition when several are combined.
- Chained comparisons are not supported: 20 <= Driver_age <= 40 must be written as 2 conditions.
- Strings are entered as b"text", because these string columns are stored as bytes ('S10'), not Unicode.

Code below:

import tables as tb

conditions = [
    '(Form == b"SUV") & (20 <= Driver_age) & (Driver_age <= 40)',
    '(Form == b"SUV") & (Make == b"Honda")',
]

with tb.open_file('SO_71388372.h5', 'r') as h5r:
    data_tbl = h5r.root.test

    for condition in conditions:
        # read_where() returns the matching rows as a NumPy structured array.
        data_arr = data_tbl.read_where(condition)
        print(f'\nFor search condition: {condition}')
        print(f'# of rows found: {len(data_arr)}')
        for row in data_arr:
            print(row)
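Two variations on the search above may help when memory is the concern. This is a sketch, not part of the original answer: it builds its own small table in a temporary file (hypothetical file name, shortened 'Data' arrays), passes the search values through the condvars argument instead of writing them into the condition string, uses field='Data' so that only that column is returned, and uses Table.where() to iterate over matches one row at a time.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Small stand-in table so the example runs on its own.
dt = np.dtype([('Form', 'S10'), ('Driver_age', int), ('Data', float, (2, 2))])
recarr = np.array([
    ('SUV', 25, [[0, 0], [0.25, 1.7]]),
    ('Truck', 37, [[0, 0], [0.15, 1.3]]),
    ('SUV', 21, [[0, 0], [0.24, 1.2]]),
    ('SUV', 50, [[0, 0], [0.85, 1.9]]),
], dtype=dt)

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')  # hypothetical file name
with tb.open_file(path, 'w') as h5w:
    h5w.create_table('/', 'test', obj=recarr)

with tb.open_file(path, 'r') as h5r:
    tbl = h5r.root.test

    # Column names resolve automatically; form, lo and hi come from condvars.
    condition = '(Form == form) & (lo <= Driver_age) & (Driver_age <= hi)'
    condvars = {'form': b'SUV', 'lo': 20, 'hi': 40}

    # field='Data' returns only that column for the matching rows.
    data_only = tbl.read_where(condition, condvars=condvars, field='Data')

    # where() yields matching rows one by one instead of one result array.
    ages = [int(row['Driver_age'])
            for row in tbl.where(condition, condvars=condvars)]

print(data_only.shape)  # (2, 2, 2)
print(ages)             # [25, 21]
```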

Here is a summary of each package, extracted from their respective FAQ pages.

PyTables (from PyTables FAQ):
Builds an additional abstraction layer on top of HDF5 and NumPy. Has an engine to enable complex queries, an efficient computational kernel, and advanced indexing capabilities. Has a custom system to represent data types available in the HDF5 library but not in NumPy.

h5py (from h5py FAQ):
Attempts to map the HDF5 feature set to NumPy as closely as possible. Also provides access to nearly all of the HDF5 C API. The high-level type system uses NumPy dtype objects exclusively, and method and attribute naming follows Python and NumPy conventions for dictionary and array access.
