Python File Slurp 带字节序转换
最近有人问如何在 python 中执行文件slurp,接受的答案建议如下:
with open('x.txt') as x: f = x.read()
如何我会这样做来读入文件并转换数据的字节序表示吗?
例如,我有一个 1GB 的二进制文件,它只是一堆包装为大端的单精度浮点数,我想将其转换为小端并转储到 numpy 数组中。下面是我为完成此任务而编写的函数以及调用它的一些实际代码。我使用 struct.unpack 进行字节序转换,并尝试使用 mmap 来加快速度。
我的问题是,我是否正确地将 slurp 与 mmap
和 struct.unpack
一起使用?有没有更干净、更快的方法来做到这一点?现在我所拥有的有效,但我真的很想学习如何做得更好。
提前致谢!
#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np
def mmapChannel(arrayName, fileName, channelNo, line_count, sample_count):
"""
We need to read in the asf internal file and convert it into a numpy array.
It is stored as a single row, and is binary. Thenumber of lines (rows), samples (columns),
and channels all come from the .meta text file
Also, internal format files are packed big endian, but most systems use little endian, so we need
to make that conversion as well.
Memory mapping seemed to improve the ingestion speed a bit
"""
# memory-map the file, size 0 means whole file
# length = line_count * sample_count * arrayName.itemsize
print "\tMemory Mapping..."
with open(fileName, "rb") as f:
map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
map.seek(channelNo*line_count*sample_count*arrayName.itemsize)
for i in xrange(line_count*sample_count):
arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize) )[0]
# Same method as above, just more verbose for the maintenance programmer.
# for i in xrange(line_count*sample_count): #row
# be_float = map.read(arrayName.itemsize) # arrayName.itemsize should be 4 for float32
# le_float = unpack('>f', be_float)[0] # > for big endian, < for little endian
# arrayName[0, i]= le_float
map.close()
return arrayName
print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1, line_count*sample_count), dtype='float32')
HHphase = np.ones((1, line_count*sample_count), dtype='float32')
HVamp = np.ones((1, line_count*sample_count), dtype='float32')
HVphase = np.ones((1, line_count*sample_count), dtype='float32')
print "Ingesting HH_Amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img', 0, line_count, sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img', 1, line_count, sample_count)
print "Ingesting HV_AMP..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img', 2, line_count, sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img', 3, line_count, sample_count)
print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)
It was recently asked how to do a file slurp in python, and the accepted answer suggested something like:
with open('x.txt') as x: f = x.read()
How would I go about doing this to read the file in and convert the endian representation of the data?
For example, I have a 1GB binary file that's just a bunch of single precision floats packed as a big endian and I want to convert it to little endian and dump into a numpy array. Below is the function I wrote to accomplish this and some real code that calls it. I use struct.unpack
do the endian conversion and tried to speed everything up by using mmap
.
My question then is, am I using the slurp correctly with mmap
and struct.unpack
? Is there a cleaner, faster way to do this? Right now what I have works, but I'd really like to learn how to do this better.
Thanks in advance!
#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np
def mmapChannel(arrayName, fileName, channelNo, line_count, sample_count):
"""
We need to read in the asf internal file and convert it into a numpy array.
It is stored as a single row, and is binary. Thenumber of lines (rows), samples (columns),
and channels all come from the .meta text file
Also, internal format files are packed big endian, but most systems use little endian, so we need
to make that conversion as well.
Memory mapping seemed to improve the ingestion speed a bit
"""
# memory-map the file, size 0 means whole file
# length = line_count * sample_count * arrayName.itemsize
print "\tMemory Mapping..."
with open(fileName, "rb") as f:
map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
map.seek(channelNo*line_count*sample_count*arrayName.itemsize)
for i in xrange(line_count*sample_count):
arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize) )[0]
# Same method as above, just more verbose for the maintenance programmer.
# for i in xrange(line_count*sample_count): #row
# be_float = map.read(arrayName.itemsize) # arrayName.itemsize should be 4 for float32
# le_float = unpack('>f', be_float)[0] # > for big endian, < for little endian
# arrayName[0, i]= le_float
map.close()
return arrayName
print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1, line_count*sample_count), dtype='float32')
HHphase = np.ones((1, line_count*sample_count), dtype='float32')
HVamp = np.ones((1, line_count*sample_count), dtype='float32')
HVphase = np.ones((1, line_count*sample_count), dtype='float32')
print "Ingesting HH_Amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img', 0, line_count, sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img', 1, line_count, sample_count)
print "Ingesting HV_AMP..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img', 2, line_count, sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img', 3, line_count, sample_count)
print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
稍作修改@Alex Martelli 的答案:
Slightly modified @Alex Martelli's answer:
在速度和简洁性方面很难被击败;-)。对于 byteswap,请参阅 here(
True
参数的意思是“就地执行”);对于 fromfile,请参阅此处。这与小端机器上一样工作(因为数据是大端机器,所以需要字节交换)。您可以测试是否是有条件地执行 byteswap 的情况,将最后一行从无条件调用 byteswap 更改为,例如:
即,在测试小尾数法时有条件调用 byteswap。
Pretty hard to beat for speed AND conciseness;-). For byteswap see here (the
True
argument means, "do it in place"); for fromfile see here.This works as is on little-endian machines (since the data are big-endian, the byteswap is needed). You can test if that is the case to do the byteswap conditionally, change the last line from an unconditional call to byteswap into, for example:
i.e., a call to byteswap conditional on a test of little-endianness.
您可以拼凑出基于 ASM 的解决方案使用 CorePy。但我想知道,您是否能够从算法的其他部分获得足够的性能。无论您采用哪种方式切片,对 1GB 数据块的 I/O 和操作都将花费一段时间。
您可能会发现有用的另一件事是,一旦您在 python 中构建了算法原型,就切换到 C。我曾经这样做是为了对全球 DEM(高度)数据集进行操作。一旦我摆脱了解释的脚本,整个事情就变得更容易忍受了。
You could coble together an ASM based solution using CorePy. I wonder though, if you might be able to gain enough performance from the some other part of your algorithm. I/O and manipulations on 1GB chunks of data are going to take a while which ever way you slice it.
One other thing you might find helpful would be to switch to C once you have prototyped the algorithm in python. I did this for manipulations on a whole-world DEM (height) data set one time. The whole thing was much more tolerable once I got away from the interpreted script.
我希望这样的东西会更快
请不要使用
map
作为变量名I'd expect something like this to be faster
Please don't use
map
as a variable name