How to process a file line by line as a stream in PowerShell

I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff, just parsing each line and pulling out some data, then storing it in a database.

Unfortunately, get-content | %{ whatever($_) } appears to keep the entire set of lines at this stage of the pipe in memory. It's also surprisingly slow, taking a very long time to actually read it all in.

So my question is two parts:

  1. How can I make it process the stream line by line and not keep the entire thing buffered in memory? I would like to avoid using up several gigs of RAM for this purpose.
  2. How can I make it run faster? PowerShell iterating over get-content appears to be 100x slower than a C# script.

I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize parameter or something...
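
There is no -LineBufferSize parameter, but Get-Content does have a -ReadCount parameter that controls how many lines are emitted per pipeline object; a rough sketch of using it, with the path and the per-line work as placeholders:

# -ReadCount 1000 sends arrays of up to 1000 lines down the pipeline,
# so $_ is a string array and the inner loop walks it one line at a time.
Get-Content .\big.log -ReadCount 1000 | ForEach-Object {
    foreach ($line in $_) {
        # parse $line and queue it for the database insert (placeholder)
    }
}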

Comments (4)

好久不见√ 2024-10-10 09:18:29

If you really are going to work with multi-gigabyte text files, then do not use PowerShell. Even if you find a way to read them faster, processing a huge number of lines will be slow in PowerShell anyway, and you cannot avoid that. Even simple loops are expensive; say, for 10 million iterations (quite realistic in your case) we have:

# "empty" loop: takes 10 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) {} }

# "simple" job, just output: takes 20 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i } }

# "more real job": 107 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i.ToString() -match '1' } }

UPDATE: If you are still not scared, then try the .NET reader:

$reader = [System.IO.File]::OpenText("my.log")
try {
    for() {
        $line = $reader.ReadLine()
        if ($line -eq $null) { break }
        # process the line
        $line
    }
}
finally {
    $reader.Close()
}

UPDATE 2

There have been comments about possibly better / shorter code. There is nothing wrong with the original for-based code, and it is not pseudo-code. But a shorter (shortest?) variant of the reading loop is:

$reader = [System.IO.File]::OpenText("my.log")
while($null -ne ($line = $reader.ReadLine())) {
    $line
}
乖乖公主 2024-10-10 09:18:29

System.IO.File.ReadLines() is perfect for this scenario. It returns all the lines of a file, but lets you begin iterating over them immediately, which means it does not have to store the entire contents in memory.

Requires .NET 4.0 or higher.

foreach ($line in [System.IO.File]::ReadLines($filename)) {
    # do something with $line
}

http://msdn.microsoft.com/en-us/library/dd383503.aspx
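
One caveat worth noting: static .NET methods such as ReadLines resolve relative paths against the process working directory, not the current PowerShell location, so it is safest to pass a full path; a small sketch, with the file name as a placeholder:

# Convert-Path expands a PowerShell-relative path into a full filesystem path
$fullPath = Convert-Path .\my.log
foreach ($line in [System.IO.File]::ReadLines($fullPath)) {
    # do something with $line
}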

蓝眸 2024-10-10 09:18:29

If you want to use straight PowerShell, check out the code below.

$content = Get-Content C:\Users\You\Documents\test.txt
foreach ($line in $content)
{
    Write-Host $line
}
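
One thing to keep in mind with this approach: assigning Get-Content to a variable reads the whole file into memory before the loop starts, which is what the question is trying to avoid. Piping it instead keeps the processing streaming; a sketch, with the path as a placeholder:

# Each line flows through the pipeline as it is read,
# instead of being collected into $content first.
Get-Content C:\Users\You\Documents\test.txt | ForEach-Object {
    Write-Host $_
}
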
非要怀念 2024-10-10 09:18:29

For those interested...

A bit of perspective on this, since I had to work with very large files.

Below are the results on a 39 GB XML file containing 56 million lines/records. The lookup text is a 10-digit number.

1) GC -rc 1000 | % -match -> 183 seconds
2) GC -rc 100 | % -match  -> 182 seconds
3) GC -rc 1000 | % -like  -> 840 seconds
4) GC -rc 100 | % -like   -> 840 seconds
5) sls -simple            -> 730 seconds
6) sls                    -> 180 seconds (sls default uses regex, but pattern in my case is passed as literal text)
7) Switch -file -regex    -> 258 seconds
8) IO.File.Readline       -> 250 seconds

1 and 6 are the clear winners, but I have gone with 1.

PS. The test was conducted on a Windows Server 2012 R2 server with PS 5.1. The server has 16 vCPUs and 64 GB of memory, but for this test only 1 CPU was utilised, and the PS process memory footprint stayed minimal because the tests above use very little memory.
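
For reference, options 1 and 6 written out in full might look roughly like this (the path and the 10-digit pattern are placeholders; sls is the alias for Select-String):

# Option 1: read 1000 lines per pipeline object and regex-match within each chunk
Get-Content .\huge.xml -ReadCount 1000 | ForEach-Object { $_ -match '1234567890' }

# Option 6: Select-String scans the file directly
Select-String -Path .\huge.xml -Pattern '1234567890'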
