"Reading in" a large text file into HDF5 via PyTables or PyHDF?
I'm attempting some statistics using SciPy, but my input dataset is quite large (~1.9GB) and in dbf format.
The file is large enough that Numpy returns an error message when I try to create an array with genfromtxt. (I've got 3GB ram, but running win32).
i.e.:
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    ind_sum = numpy.genfromtxt(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf", dtype = (int, int, int, float, float, int), names = True, usecols = (5))
  File "C:\Python26\ArcGIS10.0\lib\site-packages\numpy\lib\npyio.py", line 1335, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
MemoryError
From other posts, I see that the chunked arrays provided by PyTables could be useful, but my problem is reading this data in the first place. Or in other words, PyTables or PyHDF can easily create the desired HDF5 output, but what should I do to get my data into an array first?
For instance:
import numpy, scipy, tables
h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode = "w", title = "Diversity Index Results")
group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")
and then I could create either a table or an array (I imagine continuing with something like the sketch below), but how do I refer back to the original dbf data? In the description?
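Here is roughly what I have in mind (the array name, title, and sample values are just illustrative):

atom = tables.Float64Atom()
# an extendable, chunked array: starts empty and grows as chunks are appended
earray = h5file.createEArray(group, "values", atom, shape=(0,), title="Column 5 values")
earray.append(numpy.array([0.1, 0.2, 0.3]))  # append parsed chunks here
h5file.close()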
Thanks for any thoughts you might have!
2 Answers
If the data is too big to fit in memory, you can work with a memory-mapped file (it behaves like a numpy array but is stored on disk; see the numpy.memmap documentation), though you may be able to get similar results using HDF5, depending on what operations you need to perform on the array. Obviously this will make many operations slower, but it is better than not being able to do them at all.
Because you are hitting a memory limit, I think you cannot use genfromtxt. Instead, you should iterate through your text file one line at a time, and write the data to the relevant position in the memmap/hdf5 object.
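A rough sketch of that approach with numpy.memmap (assuming the file really is delimited text, since genfromtxt was being applied to it; the row count and output filename here are hypothetical placeholders):

import numpy

n_rows = 10000000  # hypothetical: count the file's lines beforehand
out = numpy.memmap("ind_sum_col5.dat", dtype="float64", mode="w+", shape=(n_rows,))

fh = open(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")
fh.readline()  # skip the header row (names=True in the genfromtxt call)
for i, line in enumerate(fh):
    out[i] = float(line.split()[5])  # keep only the column of interest
fh.close()

out.flush()  # push everything through to disk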
It is not clear what you mean by "referring back to the original dbf data". Obviously you can just store the filename it came from somewhere. HDF5 objects have "attributes" which are designed to store exactly this kind of metadata.
Also, I have found that using h5py is a much simpler and cleaner way to access hdf5 files than pytables, though this is largely a matter of preference.
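For example, with h5py the source filename can be stored as an attribute on a resizable dataset (a sketch only; the dataset and attribute names are my own choices):

import h5py, numpy

f = h5py.File("HET_IND_SUM2.h5", "w")
# a chunked dataset that starts empty and can grow without bound
dset = f.create_dataset("IND_SUM", shape=(0,), maxshape=(None,), dtype="float64", chunks=True)
dset.attrs["source_file"] = r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf"  # provenance
chunk = numpy.array([0.1, 0.2, 0.3])  # stand-in for a parsed chunk of rows
dset.resize((dset.shape[0] + chunk.shape[0],))
dset[-chunk.shape[0]:] = chunk
f.close()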
If the data is in a dbf file, you might try my dbf package -- it only keeps the records in memory that are being accessed, so you should be able to cycle through the records pulling out the data that you need:
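Something along these lines (a sketch, assuming the Table API of the dbf package; depending on the version, open() may need a mode argument such as dbf.READ_ONLY):

import dbf

table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")
table.open()  # only the record being accessed is held in memory
try:
    for record in table:
        value = record[5]  # the sixth field, matching usecols in the question
        # ... accumulate statistics or append value to the HDF5 output here ...
finally:
    table.close()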