Fast way to add line/row numbers to a text file

Posted on 2024-08-02 00:47:05

I have a file which has about 12 million lines. Each line looks like this:

0701648016480002020000002030300000200907242058CRLF

What I'm trying to accomplish is adding row numbers before the data. The numbers should have a fixed length.

The idea behind this is to be able to do a bulk insert of this file into a SQL Server table, and then perform certain operations on it that require each line to have a unique identifier. I've tried doing this on the database side, but I haven't been able to get good performance (under 4 minutes at least; under 1 minute would be ideal).

Right now I'm trying a solution in Python that looks something like this:

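# Read the whole input into memory (about 12 million lines).
file = open('file.cas', 'r')
lines = file.readlines()
file.close()
# Prefix each line with its index; each line keeps its own line ending.
text = ['%d %s' % (i, line) for i, line in enumerate(lines)]
output = open("output.cas", "w")
output.writelines(text)
output.close()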

I don't know if this will work, but it will help me get an idea of how it performs and what the side effects are before I keep trying new things. I also thought about doing it in C so that I have better control over memory.

Would it help to do this in a low-level language? Does anyone know a better way to do this? I'm pretty sure it has been done before, but I haven't been able to find anything.

Thanks

Comments (3)

花落人断肠 2024-08-09 00:47:05

Oh god no, don't read all 12 million lines in at once! If you're going to use Python, at least do it this way:

file = open('file.cas', 'r')
try:
    output = open('output.cas', 'w')
    try:
        output.writelines('%d %s' % tpl for tpl in enumerate(file))
    finally:
        output.close()
finally:
    file.close()

That uses a generator expression which runs through the file processing one line at a time.
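
The question also asks for numbers of a fixed length, which a plain '%d' does not give. A minimal sketch of the same streaming idea with zero-padded numbers follows. The function name number_lines, the width of 8 digits (enough for up to 99,999,999 lines), and the start value are assumptions for illustration, not part of the original answer. Opening both files with newline='' is meant to leave the CRLF line endings untouched.

```python
def number_lines(src_path, dst_path, width=8, start=1):
    """Copy src_path to dst_path, prefixing each line with a zero-padded number.

    Hypothetical helper: the name, the default width and the start value
    are illustrative assumptions.
    """
    written = 0
    # newline='' disables newline translation, so '\r\n' is passed through as-is.
    with open(src_path, 'r', newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        # Iterating over the file object reads one line at a time,
        # so memory use stays flat regardless of file size.
        for number, line in enumerate(src, start):
            dst.write(f'{number:0{width}d} {line}')
            written += 1
    return written
```

Calling number_lines('file.cas', 'output.cas') would then produce lines such as "00000001 0701648016480002020000002030300000200907242058". Passing start=0 reproduces the numbering that enumerate gives by default.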

酷遇一生 2024-08-09 00:47:05

Why don't you try cat -n?

指尖上的星空 2024-08-09 00:47:05

Stefano is right:

$ time cat -n file.cas > output.cas

Use time just so you can see how fast it is. It will most likely be faster than Python, since cat is compiled C code.
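
To compare a Python version against that timing, a rough sketch like the one below could be used. It works on raw bytes to skip text decoding and times the run with time.perf_counter. The names number_lines_binary and timed and the default width of 8 are assumptions for illustration, and whether this gets close to cat can only be settled by measuring on the real file. One thing worth checking in the cat output: on GNU coreutils, at least, cat -n right-aligns the number in a six-character field followed by a tab, so with 12 million lines the prefix would not stay at a fixed length.

```python
import time


def number_lines_binary(src_path, dst_path, width=8):
    """Prefix each line with a zero-padded number, working on raw bytes.

    Hypothetical helper: the name and the default width are assumptions.
    """
    written = 0
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        # In binary mode lines are split on b'\n' only, so a trailing
        # b'\r\n' is carried over to the output unchanged.
        for number, line in enumerate(src, 1):
            prefix = (b'%d' % number).zfill(width)
            dst.write(prefix + b' ' + line)
            written += 1
    return written


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    begin = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - begin
```

For example, count, seconds = timed(number_lines_binary, 'file.cas', 'output.cas') gives the number of lines written and the wall-clock time to set against the output of time cat -n.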
