Optimizing a simple search script in PowerShell

Posted on 2024-10-11 13:43:30


I need to create a script to search through just below a million files of text, code, etc. to find matches and then output all hits on a particular string pattern to a CSV file.

So far I have this:

$location = 'C:\Work*'

$arr = "foo", "bar" #Where "foo" and "bar" are string patterns I want to search for (separately)

for($i=0;$i -lt $arr.length; $i++) {
Get-ChildItem $location -recurse | select-string -pattern $($arr[$i]) | select-object Path | Export-Csv "C:\Work\Results\$($arr[$i]).txt"
}

This gives me a CSV file named "foo.txt" listing every file that contains the word "foo", and a file named "bar.txt" listing every file that contains the word "bar".

Can anyone think of a way to optimize this script so that it runs faster? Or does anyone have ideas for an entirely different but equivalent script that simply runs faster?

All input appreciated!


演出会有结束 2024-10-18 13:43:30

If your files are not huge and can be read into memory, then this version should run noticeably faster (my quick and dirty local test seems to confirm that):

$location = 'C:\ROM'
$arr = "Roman", "Kuzmin"

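# remove output files left over from a previous run
# (-Confirm prompts before each deletion; drop it for unattended runs)
foreach($test in $arr) {
    Remove-Item ".\$test.txt" -ErrorAction SilentlyContinue -Confirm
}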

Get-ChildItem $location -Recurse | .{process{ if (!$_.PSIsContainer) {
    # read all text once
    $content = [System.IO.File]::ReadAllText($_.FullName)
    # test patterns and output paths once
    foreach($test in $arr) {
        if ($content -match $test) {
            $_.FullName >> ".\$test.txt"
        }
    }
}}}

Notes: 1) mind the changed paths and patterns in the example; 2) the output files are not CSV but plain text. There is little reason to use CSV if you are interested only in paths; plain text files with one path per line will do.
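
As a variation on the same read-once idea, the matching paths can be collected in memory and each output file written a single time, rather than appended to with `>>` on every hit. This is only a sketch under assumptions that are not in the answer above: the temp folder, the three sample files, and the `.result` file names are made up for illustration, and whether it is actually faster on a real tree has not been measured here.

```powershell
# Hypothetical sandbox: a temp folder with three small files stands in for C:\ROM.
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("search-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $root | Out-Null
Set-Content -Path (Join-Path $root 'a.txt') -Value 'Roman wrote this'
Set-Content -Path (Join-Path $root 'b.txt') -Value 'Kuzmin and Roman'
Set-Content -Path (Join-Path $root 'c.txt') -Value 'nothing relevant'

$arr = 'Roman', 'Kuzmin'

# One in-memory list of matching paths per pattern.
$hits = @{}
foreach ($test in $arr) {
    $hits[$test] = New-Object 'System.Collections.Generic.List[string]'
}

Get-ChildItem $root -Recurse | Where-Object { -not $_.PSIsContainer } | ForEach-Object {
    # Read the file once, then test every pattern against the same string.
    $content = [System.IO.File]::ReadAllText($_.FullName)
    foreach ($test in $arr) {
        if ($content -match $test) {
            $hits[$test].Add($_.FullName)
        }
    }
}

# Write each result file once, after the scan, instead of appending per match.
foreach ($test in $arr) {
    Set-Content -Path (Join-Path $root "$test.result") -Value $hits[$test].ToArray()
}
```

The trade-off is memory: all matching paths are held until the scan finishes, which is usually fine for path strings but means nothing is written if the script is interrupted midway.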


小梨窩很甜 2024-10-18 13:43:30

Let's suppose that 1) the files are not too big and you can load each one into memory, and 2) you really just want the path of each matching file (not the matching line etc.).

I tried reading each file only once and then iterating through the regexes. There is some gain (it is faster than the original solution), but the final result will depend on other factors such as file sizes, number of files, etc.

Removing 'ignorecase' also makes it a little faster.

$res = @{}
$arr | % { $res[$_] = @() }

Get-ChildItem $location -recurse | 
  ? { !$_.PsIsContainer } |
  % { $file = $_
      $text = [Io.File]::ReadAllText($file.FullName)
      $arr | 
        % { $regex = $_
            if ([Regex]::IsMatch($text, $regex, 'ignorecase')) {
              $res[$regex] += $file.FullName   # append; plain '=' would keep only the last match
            }
        }
  }
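$res.GetEnumerator() | % {
  $out = "d:\temp\so-res$($_.Key).txt"
  # wrap each path in an object so the CSV gets a Path column, not the string's Length
  $_.Value | Select-Object @{ Name = 'Path'; Expression = { $_ } } | Export-Csv $out -NoTypeInformation
}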
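
The remark about 'ignorecase' can be taken one step further by constructing each `Regex` object once, up front, with default (case-sensitive) options, and reusing it for every file. The following is a minimal sketch, not the answer's own code: the temp folder and its two sample files are invented stand-ins for `$location`, and the patterns are the ones from the question.

```powershell
# Hypothetical inputs: a temp folder with two files stands in for $location.
$location = Join-Path ([System.IO.Path]::GetTempPath()) ("regex-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $location | Out-Null
Set-Content -Path (Join-Path $location 'one.txt') -Value 'foo here'
Set-Content -Path (Join-Path $location 'two.txt') -Value 'FOO and bar'
$arr = 'foo', 'bar'

# Build each Regex once; the single-argument constructor uses default options,
# so matching is case-sensitive.
$regexes = @{}
$res = @{}
foreach ($pattern in $arr) {
    $regexes[$pattern] = New-Object System.Text.RegularExpressions.Regex -ArgumentList $pattern
    $res[$pattern] = @()
}

$files = Get-ChildItem $location -Recurse | Where-Object { -not $_.PSIsContainer }
foreach ($file in $files) {
    # Read the file once and run every prebuilt regex against the same text.
    $text = [System.IO.File]::ReadAllText($file.FullName)
    foreach ($pattern in $arr) {
        if ($regexes[$pattern].IsMatch($text)) {
            $res[$pattern] += $file.FullName
        }
    }
}
```

Note that case-sensitive matching changes the results as well as the speed: in this sample, `two.txt` contains only the uppercase `FOO`, so it is not reported for the pattern `foo`.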
