Fast conversion of large (2.1 GB+) binary files (latitude/longitude/altitude to ECEF)


Right now, I'm trying to convert a large quantity of binary files of points in latitude/longitude/altitude format to a text-based ECEF Cartesian format (x, y, z). The problem right now is that the process is very, very, very slow.

I have over 100 gigabytes of this stuff to run through, and more data could be coming in. I would like to make this bit of code as fast as possible.

Right now my code looks something like this:

import mmap
import sys
import struct
import time

pointSize = 41      # each record is 4 doubles + 9 single bytes = 41 bytes

def getArguments():
    if len(sys.argv) != 2:
        print """Not enough arguments.
        example:
            python tllargbin_reader.py input_filename.tllargbin output_filename
        """
        return None
    else:
        return sys.argv

print getArguments()

def read_tllargbin(filename, outputCallback):
    f = open(filename, "r+")
    map = mmap.mmap(f.fileno(), 0)
    t = time.clock()
    if (map.size() % pointSize) != 0:
        print "File size not aligned."
        #return
    for i in xrange(0, map.size(), pointSize):
        # Unpack one 41-byte record; fields 1-3 are the values written out.
        data_list = struct.unpack('=4d9B', map[i:i+pointSize])
        writeStr = formatString(data_list)
        outputCallback(writeStr)            # write the formatted point via the callback
        if i % (pointSize * 1000) == 0:
            print "%d/%d points processed" % (i, map.size())
    print "Time elapsed: %f" % (time.clock() - t)
    map.close()


def generate_write_xyz(filename):
    # Returns a closure that writes formatted strings to a buffered output file.
    f = open(filename, 'w', 128*1024)
    def write_xyz(writeStr):
        f.write(writeStr)
    return write_xyz

def formatString(data_list):
    return "%f %f %f" % (data_list[1], data_list[2], data_list[3])

args = getArguments()
if args is not None:
    read_tllargbin(args[1], generate_write_xyz("out.xyz"))

convertXYZ() is basically the conversion formula here:
http://en.wikipedia.org/wiki/Geodetic_system
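For reference, the conversion itself is short. A minimal sketch of a convertXYZ() under the usual assumptions (WGS-84 ellipsoid constants, latitude/longitude in degrees, altitude in metres) might look like this; the actual function isn't shown above, so treat the names and units here as placeholders:

import math

# WGS-84 ellipsoid constants (assumed; the real convertXYZ may use a different datum)
WGS84_A  = 6378137.0                    # semi-major axis, metres
WGS84_F  = 1.0 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)    # first eccentricity squared

def convertXYZ(lat_deg, lon_deg, alt_m):
    """Convert geodetic latitude/longitude/altitude to ECEF x, y, z (metres)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    sin_lat = math.sin(lat)
    # Prime vertical radius of curvature at this latitude
    N = WGS84_A / math.sqrt(1.0 - WGS84_E2 * sin_lat * sin_lat)
    x = (N + alt_m) * math.cos(lat) * math.cos(lon)
    y = (N + alt_m) * math.cos(lat) * math.sin(lon)
    z = (N * (1.0 - WGS84_E2) + alt_m) * sin_lat
    return x, y, z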

I was wondering if it would be faster to read things in chunks of ~4MB with one thread, put them in a bounded buffer, have a different thread for conversion to string format, and have a final thread write the string back into a file on a different harddisk. I might be jumping the gun though...
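If I do try that, a rough two-stage sketch of the bounded-buffer idea (a reader thread feeding a converter/writer thread) might look like the following. convert_chunk is just a placeholder for the unpack-plus-format step, and the filenames are stand-ins; note that with CPython's GIL a second thread mostly overlaps reading/writing with the conversion rather than parallelizing the formatting itself:

import threading
import Queue    # named 'queue' in Python 3

POINT_SIZE = 41
RECORDS_PER_CHUNK = 100000          # ~4.1 MB per chunk, a whole number of 41-byte records
BUFFER_SLOTS = 8                    # bound on chunks in flight

def reader(in_path, work_q):
    f = open(in_path, 'rb')
    while True:
        chunk = f.read(POINT_SIZE * RECORDS_PER_CHUNK)
        if not chunk:
            break
        work_q.put(chunk)           # blocks when the bounded buffer is full
    f.close()
    work_q.put(None)                # sentinel: no more data

def converter_writer(work_q, out_path):
    out = open(out_path, 'w', 128 * 1024)
    while True:
        chunk = work_q.get()
        if chunk is None:
            break
        out.write(convert_chunk(chunk))   # convert_chunk: unpack + format a whole chunk (placeholder)
    out.close()

work_q = Queue.Queue(maxsize=BUFFER_SLOTS)
t1 = threading.Thread(target=reader, args=("in.tllargbin", work_q))
t2 = threading.Thread(target=converter_writer, args=(work_q, "out.xyz"))
t1.start(); t2.start()
t1.join(); t2.join()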

I'm using Python right now for testing, but I wouldn't be opposed to switching if I can work through these files faster.

Any suggestions would be great. Thanks

EDIT:

I have profiled the code with cProfile again, this time splitting the string formatting from the I/O. It seems I'm actually being killed by the string formatting... Here's the profiler report:

         20010155 function calls in 548.993 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  548.993  548.993 <string>:1(<module>)
        1    0.016    0.016  548.991  548.991 tllargbin_reader.py:1(<module>)
        1   24.018   24.018  548.955  548.955 tllargbin_reader.py:20(read_tllargbin)
        1    0.000    0.000    0.020    0.020 tllargbin_reader.py:36(generate_write_xyz)
 10000068  517.233    0.000  517.233    0.000 tllargbin_reader.py:42(formatString)
        2    0.000    0.000    0.000    0.000 tllargbin_reader.py:8(getArguments)
 10000068    6.684    0.000    6.684    0.000 {_struct.unpack}
        1    0.002    0.002  548.993  548.993 {execfile}
        2    0.000    0.000    0.000    0.000 {len}
        1    0.065    0.065    0.065    0.065 {method 'close' of 'mmap.mmap' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of 'file' objects}
    10003    0.955    0.000    0.955    0.000 {method 'size' of 'mmap.mmap' objects}
        2    0.020    0.010    0.020    0.010 {open}
        2    0.000    0.000    0.000    0.000 {time.clock}            

Is there a faster way to format strings?

Comments (3)

呆萌少年 2024-12-17 14:43:13

To more precisely attack the problem, I suggest measuring the file-read operation by making 'convertXYZ' a no-op function and timing the result, and measuring the convert function by changing the 'read' to always return a simple point while still calling the conversion and output the same number of times as if you were really reading the file. (And probably another run where the final post-conversion output is a no-op.) Depending on where the time is going, it may make a lot more sense to attack one or the other.
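A quick, rough way to do that isolation with the code as posted (run from the bottom of the same file, after the function definitions; the stub names and the input filename are placeholders, formatString is temporarily swapped for a do-nothing version on the first pass, and the output callback discards everything in both passes):

import time

def _discard(s):
    pass    # no-op output callback: the write path is not measured

def _time_pass(label):
    start = time.clock()
    read_tllargbin("input.tllargbin", _discard)
    print "%s: %.1f s" % (label, time.clock() - start)

_real_format = formatString

formatString = lambda data_list: "x"   # stub formatting: measures read + unpack only
_time_pass("read + unpack only")

formatString = _real_format            # restore: measures read + unpack + format
_time_pass("read + unpack + format")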

You might be able to get the local OS to do some interleaving for you by writing the output to Python's stdout and having the shell do the actual file I/O, and similarly by streaming the file into stdin (e.g., cat oldformat | python conversion.py > outputfile).

What sort of storage are the input and output files on? The storage characteristics may have a lot more to do with the performance than the Python code.

Update: Given that the output is the slowest part, and that your storage is pretty slow and shared between reads and writes, try adding some buffering. From the Python docs, you should be able to add some buffering by passing a third argument to the open call. Try something pretty large, like 128*1024?
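For what it's worth, a sketch of the pipe-through variant (reads whole 41-byte records from stdin and writes the formatted text to stdout, so the shell and OS handle file placement, buffering, and any interleaving; "conversion.py" is a stand-in name):

import struct
import sys

POINT_SIZE = 41
record = struct.Struct('=4d9B')    # same record layout as the original code

# usage: cat oldformat | python conversion.py > outputfile
while True:
    buf = sys.stdin.read(POINT_SIZE)
    if len(buf) < POINT_SIZE:      # end of input (or a truncated trailing record)
        break
    fields = record.unpack(buf)
    sys.stdout.write("%f %f %f\n" % (fields[1], fields[2], fields[3]))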

dawn曙光 2024-12-17 14:43:13

Given that formatString is the slowest operation, try this:

def formatString(data_list):
    return " ".join((str(data_list[1]), str(data_list[2]), str(data_list[3])))

狼亦尘 2024-12-17 14:43:13

2.1 GB of data should take between 21 (@ 100 MB/s) and 70 (@ 30 MB/s) seconds just to read. You're then formatting and writing data that is perhaps five times as large. That means a total of 13 GB to read and write, requiring 130-420 seconds.

Your sampling shows that reading takes 24 seconds. Writing should therefore require about two minutes. The reading and writing times can be improved by using an SSD, for example.

When I convert files (using programs I write in C), I assume that a conversion should take no more time than it takes to read the data itself; usually a lot less is possible. Overlapped reads and writes can also reduce the I/O time. I write my own custom formatting routines, since printf is usually far too slow.

How much is 24 seconds? On a modern CPU, at least 40 billion instructions. That means that in that time you can process every single byte of data with at least 19 instructions. Easily doable for a C program, but not for an interpreted language (Python, Java, C#, VB).

The remaining 525 seconds of processing (549 - 24) indicate that Python is spending at least 875 billion instructions on processing, or 415 instructions per byte of data read. That comes out to 22 to 1: not an uncommon ratio between interpreted and compiled languages. A well-constructed C program should be down around ten instructions per byte or less.
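If it has to stay in Python, one way to chip away at that per-record overhead is to batch the work: read a few megabytes at a time, format the whole chunk, and write it out in one call, so the interpreter loop and the write call run once per chunk rather than once per point. A rough sketch (the chunk size and filenames are just placeholders):

import struct

POINT_SIZE = 41
RECORDS_PER_CHUNK = 100000                  # ~4.1 MB of input per chunk
record = struct.Struct('=4d9B')

def convert_chunk(chunk):
    # Format every 41-byte record in the chunk and join them into one string.
    pieces = []
    append = pieces.append                  # avoid repeated attribute lookups in the hot loop
    unpack = record.unpack
    for off in xrange(0, len(chunk), POINT_SIZE):
        d = unpack(chunk[off:off + POINT_SIZE])
        append("%f %f %f\n" % (d[1], d[2], d[3]))
    return "".join(pieces)

src = open("input.tllargbin", "rb")
dst = open("out.xyz", "w", 1024 * 1024)     # large write buffer
while True:
    chunk = src.read(POINT_SIZE * RECORDS_PER_CHUNK)
    if not chunk:
        break
    dst.write(convert_chunk(chunk))
src.close()
dst.close()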
