"Reading in" a large text file into HDF5 via PyTables or PyHDF?

Posted 2024-10-11 02:49:49


I'm attempting some statistics using SciPy, but my input dataset is quite large (~1.9GB) and in dbf format.
The file is large enough that Numpy returns an error message when I try to create an array with genfromtxt. (I've got 3GB ram, but running win32).

i.e.:

Traceback (most recent call last):

  File "<pyshell#5>", line 1, in <module>
    ind_sum = numpy.genfromtxt(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf", dtype = (int, int, int, float, float, int), names = True, usecols = (5))

File "C:\Python26\ArcGIS10.0\lib\site-packages\numpy\lib\npyio.py", line 1335, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):

MemoryError

From other posts, I see that the chunked arrays provided by PyTables could be useful, but my problem is reading this data in the first place. In other words, PyTables or PyHDF can easily create the desired HDF5 output, but what should I do to get my data into an array first?

For instance:

import numpy, scipy, tables

h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode = "w", title = "Diversity Index Results")

group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")

and then I could either create a table or array, but how do I refer back to the original dbf data? In the description?

Thanks for any thoughts you might have!


吃素的狼 2024-10-18 02:49:49


If the data is too big to fit in memory, you can work with a memory-mapped file (it's like a numpy array but stored on disk - see the numpy.memmap documentation), though you may be able to get similar results using HDF5, depending on what operations you need to perform on the array. Obviously this will make many operations slower, but that is better than not being able to do them at all.

Because you are hitting a memory limit, I think you cannot use genfromtxt. Instead, you should iterate through your text file one line at a time, and write the data to the relevant position in the memmap/hdf5 object.

It is not clear what you mean by "referring back to the original dbf data". Obviously you can just store, somewhere, the filename it came from. HDF5 objects have "attributes" which are designed to store exactly this kind of metadata.

Also, I have found that using h5py is a much simpler and cleaner way to access hdf5 files than pytables, though this is largely a matter of preference.

指尖上得阳光 2024-10-18 02:49:49


If the data is in a dbf file, you might try my dbf package -- it keeps in memory only the records being accessed, so you should be able to cycle through the records, pulling out the data that you need:

import dbf

table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")

sums = [0, 0, 0, 0.0, 0.0, 0]

for record in table:
    for index in range(len(sums)):
        sums[index] += record[index]
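The same streaming idea works for any row source, not just dbf tables: accumulate per-column totals while each row passes through, so memory use stays constant regardless of row count (a sketch; `column_sums` is a hypothetical helper, and the lists here stand in for dbf records):

```python
def column_sums(rows, n_cols):
    # Running per-column totals; each row is consumed and then
    # discarded, so only one row is ever resident at a time.
    sums = [0.0] * n_cols
    for row in rows:
        for index in range(n_cols):
            sums[index] += row[index]
    return sums
```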