"Reading in" a large text file into HDF5 via PyTables or PyHDF?

Posted 2024-10-11 02:49:49


I'm attempting some statistics using SciPy, but my input dataset is quite large (~1.9GB) and in dbf format.
The file is large enough that Numpy returns an error message when I try to create an array with genfromtxt. (I've got 3GB ram, but running win32).

i.e.:

Traceback (most recent call last):

  File "<pyshell#5>", line 1, in <module>
    ind_sum = numpy.genfromtxt(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf", dtype = (int, int, int, float, float, int), names = True, usecols = (5))

File "C:\Python26\ArcGIS10.0\lib\site-packages\numpy\lib\npyio.py", line 1335, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):

MemoryError

From other posts, I see that the chunked arrays provided by PyTables could be useful, but my problem is reading this data in the first place. In other words, PyTables or PyHDF can easily create the desired HDF5 output, but what should I do to get my data into an array first?

For instance:

import numpy, scipy, tables

h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode = "w", title = "Diversity Index Results")

group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")

and then I could either create a table or array, but how do I refer back to the original dbf data? In the description?

Thanks for any thoughts you might have!


吃素的狼 2024-10-18 02:49:49


If the data is too big to fit in memory, you can work with a memory-mapped file (it's like a numpy array but stored on disk - see the numpy.memmap documentation), though you may be able to get similar results using HDF5, depending on what operations you need to perform on the array. Obviously this will make many operations slower, but that is better than not being able to do them at all.

Because you are hitting a memory limit, I think you cannot use genfromtxt. Instead, you should iterate through your text file one line at a time, and write the data to the relevant position in the memmap/hdf5 object.

It is not clear what you mean by "referring back to the original dbf data". Obviously you can just store, somewhere, the filename it came from. HDF5 objects have "attributes" which are designed to store exactly this kind of metadata.

Also, I have found that using h5py is a much simpler and cleaner way to access hdf5 files than pytables, though this is largely a matter of preference.

指尖上得阳光 2024-10-18 02:49:49


If the data is in a dbf file, you might try my dbf package -- it keeps in memory only the records being accessed, so you should be able to cycle through the records, pulling out the data that you need:

import dbf

table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")

sums = [0, 0, 0, 0.0, 0.0, 0]

for record in table:
    for index in range(len(sums)):
        sums[index] += record[index]
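The same streaming idea works for any row source, not just dbf tables: accumulate per-column totals while each row passes through, so memory use stays constant regardless of row count (a sketch; `column_sums` is a hypothetical helper, and the lists here stand in for dbf records):

```python
def column_sums(rows, n_cols):
    # Running per-column totals; each row is consumed and then
    # discarded, so only one row is ever resident at a time.
    sums = [0.0] * n_cols
    for row in rows:
        for index in range(n_cols):
            sums[index] += row[index]
    return sums
```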