Read an HDF5 file created with h5py using pandas
I have a bunch of hdf5 files, and I want to turn some of the data in them into parquet files. I'm struggling to read them into pandas/pyarrow though. Which I think is related to the way that the files were originally created.
If I open the file using h5py the data looks exactly how I would expect.
import h5py
file_path = "/data/some_file.hdf5"
hdf = h5py.File(file_path, "r")
print(list(hdf.keys()))
gives me
>>> ['foo', 'bar', 'baz']
In this case I'm interested in the group "bar", which has 3 items in it.
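For reference, this is roughly how I've been inspecting the group with h5py (just listing whatever it contains):
import h5py

file_path = "/data/some_file.hdf5"
with h5py.File(file_path, "r") as hdf:
    # the group behaves like a dict; each item is a dataset or a subgroup
    for name, item in hdf["bar"].items():
        print(name, item)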
If I try to read the data in using HDFStore, I am unable to access any of the groups.
import pandas as pd
file_path = "/data/some_file.hdf5"
store = pd.HDFStore(file_path, "r")
Then the HDFStore object has no keys or groups.
assert not store.groups()
assert not store.keys()
And if I try to access the data I get the following error
bar = store.get("/bar")
TypeError: cannot create a storer if the object is not existing nor a value are passed
Similarly, if I try to use pd.read_hdf, it looks like the file is empty.
import pandas as pd
file_path = "/data/some_file.hdf"
df = pd.read_hdf(file_path, mode="r")
ValueError: Dataset(s) incompatible with Pandas data types, not table, or no datasets found in HDF5 file.
and
import pandas as pd
file_path = "/data/some_file.hdf5"
pd.read_hdf(file_path, key="/interval", mode="r")
TypeError: cannot create a storer if the object is not existing nor a value are passed
Based on this answer I'm assuming that the problem is related to the fact that Pandas expects a very particular hierarchical structure, which is different from the one the actual hdf5 file has.
Is there a straightforward way to read an arbitrary hdf5 file into pandas or pytables? I can load the data using h5py if I need to. But the files are large enough that I'd like to avoid loading them fully into memory if I can. So ideally I'd like to work in pandas and pyarrow as much as possible.
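For what it's worth, the kind of chunked h5py-to-parquet conversion I'd like to avoid hand-rolling would look roughly like this (only a sketch; it assumes "bar" contains nothing but 1-D datasets of equal length):
import h5py
import pyarrow as pa
import pyarrow.parquet as pq

file_path = "/data/some_file.hdf5"
out_path = "/data/some_file.parquet"
chunk_size = 1_000_000

with h5py.File(file_path, "r") as hdf:
    group = hdf["bar"]
    columns = list(group.keys())
    n_rows = group[columns[0]].shape[0]

    writer = None
    try:
        for start in range(0, n_rows, chunk_size):
            stop = min(start + chunk_size, n_rows)
            # h5py only reads the requested slice from disk
            table = pa.table({name: group[name][start:stop] for name in columns})
            if writer is None:
                writer = pq.ParquetWriter(out_path, table.schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()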
Comments (1)
I had a similar problem with not being able to read hdf5 into pandas df. With this post I made a script that turns the hdf5 into a dictionary and then the dictionary into a pandas df, like this:
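A minimal sketch of that approach, assuming every top-level key is a 1-D dataset and all datasets have the same length:
import h5py
import pandas as pd

file_path = "/data/some_file.hdf5"

with h5py.File(file_path, "r") as f:
    # read each top-level dataset fully into memory and collect into a dict
    data = {key: f[key][()] for key in f.keys()}

df = pd.DataFrame(data)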
This works as long as each of the hdf5 keys (f.keys()) is simply the name of a column you want to use in the pandas df and not a group name; a group is a more complicated hierarchical structure that can exist in hdf5 but not in pandas. If a group appears in the hierarchy above the keys, e.g. with the name data_group, what worked for me as an alternative solution was to substitute f.keys() with f['data_group'].keys() and f[key] with f['data_group'][key].
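In code, that alternative looks roughly like this (data_group here is just an example name):
import h5py
import pandas as pd

file_path = "/data/some_file.hdf5"

with h5py.File(file_path, "r") as f:
    group = f["data_group"]  # the group sitting above the column datasets
    data = {key: group[key][()] for key in group.keys()}

df = pd.DataFrame(data)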