将行/行号添加到文本文件的快速方法
我有一个大约有 1200 万行的文件,每行如下所示:
0701648016480002020000002030300000200907242058CRLF
我想要完成的是在数据之前添加行号,这些数字应该有固定的长度。
其背后的想法是能够将此文件批量插入到 SQLServer 表中,然后用它执行某些要求每行都有唯一标识符的操作。 我已经尝试在数据库端执行此操作,但无法实现良好的性能(至少在 4' 以下,而在 1' 以下是理想的)。
现在我正在尝试使用 python 的解决方案,看起来像这样。
file=open('file.cas', 'r')
lines=file.readlines()
file.close()
text = ['%d %s' % (i, line) for i, line in enumerate(lines)]
output = open("output.cas","w")
output.writelines(str("".join(text)))
output.close()
我不知道这是否会起作用,但它会帮助我在继续尝试新事物之前了解它的性能和副作用,我也认为用 C 语言来做,这样我就有更好的内存控制。
用低级语言来做这件事会有帮助吗? 有谁知道更好的方法来做到这一点,我很确定它已经完成,但我无法找到任何东西。
谢谢
I have a file wich has about 12 millon lines, each line looks like this:
0701648016480002020000002030300000200907242058CRLF
What I'm trying to accomplish is adding a row numbers before the data, the numbers should have a fixed length.
The idea behind this is able to do a bulk insert of this file into a SQLServer table, and then perform certain operations with it that require each line to have a unique identifier. I've tried doing this in the database side but I haven´t been able to accomplish a good performance (under 4' at least, and under 1' would be ideal).
Right now I'm trying a solution in python that looks something like this.
file=open('file.cas', 'r')
lines=file.readlines()
file.close()
text = ['%d %s' % (i, line) for i, line in enumerate(lines)]
output = open("output.cas","w")
output.writelines(str("".join(text)))
output.close()
I don't know if this will work, but it'll help me having an idea of how will it perform and side effects before I keep on trying new things, I also thought doing it in C so I have a better memory control.
Will it help doing it in a low level language? Does anyone know a better way to do this, I'm pretty sure it has being done but I haven't being able to find anything.
thanks
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
天哪,不,不要一次读完所有 1200 万行! 如果您要使用 Python,至少要这样做:
使用生成器表达式,该表达式一次运行一次处理一行的文件。
oh god no, don't read all 12 million lines in at once! If you're going to use Python, at least do it this way:
That uses a generator expression which runs through the file processing one line at a time.
你为什么不尝试 cat -n ?
Why don't you try cat -n ?
Stefano 是对的:
$ time cat -n file.cas > output.cas
使用时间只是为了看看它有多快。 它会比 python 更快,因为 cat 是纯 C 代码。
Stefano is right:
$ time cat -n file.cas > output.cas
Use time just so you can see how fast it is. It'll be faster than python since cat is pure C code.