How to process a file in PowerShell line-by-line as a stream
I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff, just parsing each line and pulling out some data, then storing it in a database.
Unfortunately, `get-content | %{ whatever($_) }` appears to keep the entire set of lines at this stage of the pipe in memory. It's also surprisingly slow, taking a very long time to actually read it all in.
So my question is in two parts:
- How can I make it process the stream line by line and not keep the entire thing buffered in memory? I would like to avoid using up several gigs of RAM for this purpose.
- How can I make it run faster? PowerShell iterating over `get-content` appears to be 100x slower than a C# script.
I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize parameter or something...
4 Answers
If you are really about to work with multi-gigabyte text files then do not use PowerShell. Even if you find a way to read it faster, processing a huge number of lines will be slow in PowerShell anyway, and you cannot avoid this. Even simple loops are expensive; say, for 10 million iterations (quite realistic in your case) we have:
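A minimal sketch of the kind of loop timing meant here, using Measure-Command (illustrative, not the answer's original measurements):

```powershell
# An "empty" loop of 10 million iterations already costs seconds:
Measure-Command { for ($i = 0; $i -lt 10000000; ++$i) {} }

# Even trivial per-iteration work multiplies that cost:
Measure-Command { for ($i = 0; $i -lt 10000000; ++$i) { $i.ToString() -match '1' } }
```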
UPDATE: If you are still not scared then try to use the .NET reader:
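A minimal sketch of such a reader loop, assuming a placeholder path my.log and placeholder per-line work:

```powershell
# Open the file with a raw .NET StreamReader and read line by line.
$reader = [System.IO.File]::OpenText("my.log")
try {
    for () {
        $line = $reader.ReadLine()
        if ($null -eq $line) { break }
        # Process the line here, e.g. parse fields and queue a database insert.
        $line
    }
}
finally {
    # Always release the file handle, even if processing throws.
    $reader.Close()
}
```

This reads one line at a time, so memory use stays flat regardless of file size.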
UPDATE 2
There are comments about possibly better / shorter code. There is nothing wrong with the original code with `for`, and it is not pseudo-code. But the shorter (shortest?) variant of the reading loop is:
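For instance (same placeholder path; the assignment and the null test are folded into the while condition):

```powershell
$reader = [System.IO.File]::OpenText("my.log")
while ($null -ne ($line = $reader.ReadLine())) {
    # Process $line here.
    $line
}
$reader.Close()
```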
System.IO.File.ReadLines() is perfect for this scenario. It returns all the lines of a file, but lets you begin iterating over the lines immediately, which means it does not have to store the entire contents in memory. Requires .NET 4.0 or higher.
http://msdn.microsoft.com/en-us/library/dd383503.aspx
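Called from PowerShell it looks like this (a sketch; the path and the per-line work are placeholders):

```powershell
# ReadLines returns a lazy IEnumerable[string], so iteration starts
# immediately and only one line is held at a time.
foreach ($line in [System.IO.File]::ReadLines("C:\data\huge.txt")) {
    # Process $line here.
    $line
}
```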
If you want to use straight PowerShell, check out the code below.
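A sketch of one such streaming approach, assuming Get-Content's -ReadCount parameter to batch lines and keep pipeline overhead down (path and processing are placeholders):

```powershell
# -ReadCount 1000 sends arrays of 1000 lines down the pipe instead of
# one object per line, which greatly reduces per-object overhead.
Get-Content "C:\data\huge.txt" -ReadCount 1000 | ForEach-Object {
    foreach ($line in $_) {
        # Process $line here.
    }
}
```

Because the pipeline streams, nothing forces the whole file into memory at once.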
For those interested...
A bit of perspective on this, since I had to work with very large files.
Below are the results on a 39 GB XML file containing 56 million lines/records. The lookup text is a 10-digit number.
Methods 1 and 6 are the clear winners, but I have gone with 1.
PS. The test was conducted on a Windows Server 2012 R2 server with PS 5.1. The server has 16 vCPUs and 64 GB of memory, but only 1 CPU was utilised for this test, and the PS process memory footprint stayed minimal, since the tests above use very little memory.
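As an illustration of how one such lookup can be timed (hypothetical path and pattern, not the original test code):

```powershell
# Time a lazy line-by-line scan for a 10-digit number.
$pattern = '\b\d{10}\b'
Measure-Command {
    foreach ($line in [System.IO.File]::ReadLines("D:\data\big.xml")) {
        if ($line -match $pattern) { $line }
    }
}
```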