Fast conversion of large (2.1 GB+) binary files (latitude/longitude/altitude to ECEF)


Right now, I'm trying to convert a large quantity of binary files of points in latitude/longitude/altitude format to a text-based ECEF Cartesian format (x, y, z). The problem right now is that the process is very, very, very slow.

I have over 100 gigabytes of this stuff to run through, and more data could be coming in. I would like to make this bit of code as fast as possible.

Right now my code looks something like this:

import mmap
import sys
import struct
import time

pointSize = 41      # each record is 4 doubles + 9 single bytes = 41 bytes

def getArguments():
    if len(sys.argv) != 2:
        print """Not enough arguments.
        example:
            python tllargbin_reader.py input_filename.tllargbin output_filename
        """
        return None
    else:
        return sys.argv

print getArguments()

def read_tllargbin(filename, outputCallback):
    f = open(filename, "r+")
    map = mmap.mmap(f.fileno(), 0)
    t = time.clock()
    if (map.size() % pointSize) != 0:
        print "File size not aligned."
        #return
    for i in xrange(0, map.size(), pointSize):
        # Unpack one 41-byte record; fields 1-3 are the values written out.
        data_list = struct.unpack('=4d9B', map[i:i+pointSize])
        writeStr = formatString(data_list)
        outputCallback(writeStr)            # write the formatted point via the callback
        if i % (pointSize * 1000) == 0:
            print "%d/%d points processed" % (i, map.size())
    print "Time elapsed: %f" % (time.clock() - t)
    map.close()


def generate_write_xyz(filename):
    # Returns a closure that writes formatted strings to a buffered output file.
    f = open(filename, 'w', 128*1024)
    def write_xyz(writeStr):
        f.write(writeStr)
    return write_xyz

def formatString(data_list):
    return "%f %f %f" % (data_list[1], data_list[2], data_list[3])

args = getArguments()
if args is not None:
    read_tllargbin(args[1], generate_write_xyz("out.xyz"))

convertXYZ() is basically the conversion formula here:
http://en.wikipedia.org/wiki/Geodetic_system
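For reference, the conversion itself is short. A minimal sketch of a convertXYZ() under the usual assumptions (WGS-84 ellipsoid constants, latitude/longitude in degrees, altitude in metres) might look like this; the actual function isn't shown above, so treat the names and units here as placeholders:

import math

# WGS-84 ellipsoid constants (assumed; the real convertXYZ may use a different datum)
WGS84_A  = 6378137.0                    # semi-major axis, metres
WGS84_F  = 1.0 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)    # first eccentricity squared

def convertXYZ(lat_deg, lon_deg, alt_m):
    """Convert geodetic latitude/longitude/altitude to ECEF x, y, z (metres)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    sin_lat = math.sin(lat)
    # Prime vertical radius of curvature at this latitude
    N = WGS84_A / math.sqrt(1.0 - WGS84_E2 * sin_lat * sin_lat)
    x = (N + alt_m) * math.cos(lat) * math.cos(lon)
    y = (N + alt_m) * math.cos(lat) * math.sin(lon)
    z = (N * (1.0 - WGS84_E2) + alt_m) * sin_lat
    return x, y, z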

I was wondering if it would be faster to read things in chunks of ~4MB with one thread, put them in a bounded buffer, have a different thread for conversion to string format, and have a final thread write the string back into a file on a different harddisk. I might be jumping the gun though...
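If I do try that, a rough two-stage sketch of the bounded-buffer idea (a reader thread feeding a converter/writer thread) might look like the following. convert_chunk is just a placeholder for the unpack-plus-format step, and the filenames are stand-ins; note that with CPython's GIL a second thread mostly overlaps reading/writing with the conversion rather than parallelizing the formatting itself:

import threading
import Queue    # named 'queue' in Python 3

POINT_SIZE = 41
RECORDS_PER_CHUNK = 100000          # ~4.1 MB per chunk, a whole number of 41-byte records
BUFFER_SLOTS = 8                    # bound on chunks in flight

def reader(in_path, work_q):
    f = open(in_path, 'rb')
    while True:
        chunk = f.read(POINT_SIZE * RECORDS_PER_CHUNK)
        if not chunk:
            break
        work_q.put(chunk)           # blocks when the bounded buffer is full
    f.close()
    work_q.put(None)                # sentinel: no more data

def converter_writer(work_q, out_path):
    out = open(out_path, 'w', 128 * 1024)
    while True:
        chunk = work_q.get()
        if chunk is None:
            break
        out.write(convert_chunk(chunk))   # convert_chunk: unpack + format a whole chunk (placeholder)
    out.close()

work_q = Queue.Queue(maxsize=BUFFER_SLOTS)
t1 = threading.Thread(target=reader, args=("in.tllargbin", work_q))
t2 = threading.Thread(target=converter_writer, args=(work_q, "out.xyz"))
t1.start(); t2.start()
t1.join(); t2.join()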

I'm using Python right now for testing, but I wouldn't be opposed to switching if I can work through these files faster.

Any suggestions would be great. Thanks

EDIT:

I have profiled the code with cProfile again, this time splitting the string formatting from the I/O. It seems I'm actually being killed by the string formatting... Here's the profiler report:

         20010155 function calls in 548.993 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  548.993  548.993 <string>:1(<module>)
        1    0.016    0.016  548.991  548.991 tllargbin_reader.py:1(<module>)
        1   24.018   24.018  548.955  548.955 tllargbin_reader.py:20(read_tllargbin)
        1    0.000    0.000    0.020    0.020 tllargbin_reader.py:36(generate_write_xyz)
 10000068  517.233    0.000  517.233    0.000 tllargbin_reader.py:42(formatString)
        2    0.000    0.000    0.000    0.000 tllargbin_reader.py:8(getArguments)
 10000068    6.684    0.000    6.684    0.000 {_struct.unpack}
        1    0.002    0.002  548.993  548.993 {execfile}
        2    0.000    0.000    0.000    0.000 {len}
        1    0.065    0.065    0.065    0.065 {method 'close' of 'mmap.mmap' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of 'file' objects}
    10003    0.955    0.000    0.955    0.000 {method 'size' of 'mmap.mmap' objects}
        2    0.020    0.010    0.020    0.010 {open}
        2    0.000    0.000    0.000    0.000 {time.clock}            

Is there a faster way to format strings?

Comments (3)

呆萌少年 2024-12-17 14:43:13

To more precisely attack the problem, I suggest measuring the file-read operation by making 'convertXYZ' a no-op function and timing the result, and measuring the convert function by changing the 'read' to always return a simple point while still calling the conversion and output the same number of times as if you were really reading the file. (And probably another run where the final post-conversion output is a no-op.) Depending on where the time is going, it may make a lot more sense to attack one or the other.
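A quick, rough way to do that isolation with the code as posted (run from the bottom of the same file, after the function definitions; the stub names and the input filename are placeholders, formatString is temporarily swapped for a do-nothing version on the first pass, and the output callback discards everything in both passes):

import time

def _discard(s):
    pass    # no-op output callback: the write path is not measured

def _time_pass(label):
    start = time.clock()
    read_tllargbin("input.tllargbin", _discard)
    print "%s: %.1f s" % (label, time.clock() - start)

_real_format = formatString

formatString = lambda data_list: "x"   # stub formatting: measures read + unpack only
_time_pass("read + unpack only")

formatString = _real_format            # restore: measures read + unpack + format
_time_pass("read + unpack + format")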

You might be able to get the local OS to do some interleaving for you by writing the output to Python's stdout and having the shell do the actual file I/O, and similarly by streaming the file into stdin (e.g., cat oldformat | python conversion.py > outputfile).

What sort of storage are the input and output files on? The storage characteristics may have a lot more to do with the performance than the Python code.

Update: Given that the output is the slowest part, and that your storage is pretty slow and shared between reads and writes, try adding some buffering. From the Python docs, you should be able to add some buffering by passing a third argument to the open call. Try something pretty large, like 128*1024?
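For what it's worth, a sketch of the pipe-through variant (reads whole 41-byte records from stdin and writes the formatted text to stdout, so the shell and OS handle file placement, buffering, and any interleaving; "conversion.py" is a stand-in name):

import struct
import sys

POINT_SIZE = 41
record = struct.Struct('=4d9B')    # same record layout as the original code

# usage: cat oldformat | python conversion.py > outputfile
while True:
    buf = sys.stdin.read(POINT_SIZE)
    if len(buf) < POINT_SIZE:      # end of input (or a truncated trailing record)
        break
    fields = record.unpack(buf)
    sys.stdout.write("%f %f %f\n" % (fields[1], fields[2], fields[3]))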

dawn曙光 2024-12-17 14:43:13

Given that formatString is the slowest operation, try this:

def formatString(data_list):
    return " ".join((str(data_list[1]), str(data_list[2]), str(data_list[3])))

狼亦尘 2024-12-17 14:43:13

2.1 GB of data should take between 21 (@ 100 MB/s) and 70 (@ 30 MB/s) seconds just to read. You're then formatting and writing data that is perhaps five times as large. That means a total of 13 GB to read and write, requiring 130-420 seconds.

Your sampling shows that reading takes 24 seconds. Writing should therefore require about two minutes. The reading and writing times can be improved by using an SSD, for example.

When I convert files (using programs I write in C), I assume that a conversion should take no more time than it takes to read the data itself; usually a lot less is possible. Overlapped reads and writes can also reduce the I/O time. I write my own custom formatting routines, since printf is usually far too slow.

How much is 24 seconds? On a modern CPU, at least 40 billion instructions. That means that in that time you can process every single byte of data with at least 19 instructions. Easily doable for a C program, but not for an interpreted language (Python, Java, C#, VB).

The remaining 525 seconds of processing (549 - 24) indicate that Python is spending at least 875 billion instructions on processing, or 415 instructions per byte of data read. That comes out to 22 to 1: not an uncommon ratio between interpreted and compiled languages. A well-constructed C program should be down around ten instructions per byte or less.
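If it has to stay in Python, one way to chip away at that per-record overhead is to batch the work: read a few megabytes at a time, format the whole chunk, and write it out in one call, so the interpreter loop and the write call run once per chunk rather than once per point. A rough sketch (the chunk size and filenames are just placeholders):

import struct

POINT_SIZE = 41
RECORDS_PER_CHUNK = 100000                  # ~4.1 MB of input per chunk
record = struct.Struct('=4d9B')

def convert_chunk(chunk):
    # Format every 41-byte record in the chunk and join them into one string.
    pieces = []
    append = pieces.append                  # avoid repeated attribute lookups in the hot loop
    unpack = record.unpack
    for off in xrange(0, len(chunk), POINT_SIZE):
        d = unpack(chunk[off:off + POINT_SIZE])
        append("%f %f %f\n" % (d[1], d[2], d[3]))
    return "".join(pieces)

src = open("input.tllargbin", "rb")
dst = open("out.xyz", "w", 1024 * 1024)     # large write buffer
while True:
    chunk = src.read(POINT_SIZE * RECORDS_PER_CHUNK)
    if not chunk:
        break
    dst.write(convert_chunk(chunk))
src.close()
dst.close()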
