Need help speeding up my Perl program

Posted 2024-11-01 17:47:37


OK, so I am working on an exploit finder to run against change roots. My issue is that when searching for a large number of strings in a large number of files (i.e. htdocs), it is taking longer than I would like. I'm positive some advanced Perl writers out there can help me speed things up a bit. Here is the part of my program I would like to improve.

sub sStringFind {
  if (-B $_ ) {
  }else{
   open FH, '<', $_ ;
   my @lines = <FH>;
   foreach $fstring(@lines) {
    if ($fstring =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/) {
     push(@huhFiles, "$_");
   }
  }
 }
}
#End suspicious string find.
find(\&sStringFind, "$cDir/www/htdocs");
for(@huhFiles) {
 print "$_\n";
}

Perhaps some hashing? Not sure, I'm not great with Perl at the moment. Any help is appreciated, thanks guys.


Comments (9)

不爱素颜 2024-11-08 17:47:37


So by "hashing" I presume you mean doing a checksum at the file or line level so you don't have to check it again?

The basic problem is, checksum or not, you still have to read every line of every file either to scan it or to hash it. So this doesn't fundamentally change your algorithm, it just pushes around the constants.

If you have a lot of duplicate files, checksumming at the file level might save you a lot of time. If you don't, it will waste a lot of time.

cost = (checksum_cost * num_files) + (regex_cost * lines_per(unique_files))

Checksumming at the line level is a toss-up between the cost of the regex and the cost of the checksum. If there aren't many duplicate lines, you lose. If your checksum is too expensive, you lose. You can write it out like so:

cost = (checksum_cost * total_lines) + (regex_cost * (total_lines - duplicate_lines))

I'd start by figuring out what percentage of the files and lines are duplicates. That's as simple as:

$line_frequency{ checksum($line) }++

and then looking at the percentage where the frequency is >= 2. That percentage is the maximum performance increase you will see by checksumming. If it's 50%, you will only ever see an increase of 50%. That assumes the checksum cost is 0, which it isn't, so you're going to see less. If the checksum costs half what the regex costs, then you'll only see 25%.
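
As a rough illustration, here is a minimal sketch of that measurement, with Digest::MD5 standing in for checksum() and the directory taken from the command line; the counter names are made up:

use strict;
use warnings;
use File::Find;
use Digest::MD5 'md5_hex';

my $dir = shift || '.';          # e.g. "$cDir/www/htdocs" from the question
my %line_frequency;
my ($total_lines, $duplicate_lines) = (0, 0);

find(sub {
    return unless -f && -T $_;                       # plain text files only
    open my $fh, '<', $_ or return;
    while (my $line = <$fh>) {
        $total_lines++;
        # the second and later sightings of a checksum are the duplicates
        $duplicate_lines++ if $line_frequency{ md5_hex($line) }++;
    }
}, $dir);

printf "%d of %d lines (%.1f%%) are duplicates\n",
    $duplicate_lines, $total_lines,
    $total_lines ? 100 * $duplicate_lines / $total_lines : 0;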

This is why I recommend grep. It will iterate through files and lines faster than Perl can, attacking the fundamental problem: you have to read every file and every line.

What you can do is not look at every file every time. A simple thing to do is remember the last time you scanned and look at the modification time of each file. If it hasn't changed, and your regex hasn't changed, don't check it again. A more robust version would be to store a checksum of each file, in case a file's contents changed but its modification time didn't (a sketch of that variant follows the code below). If your files aren't changing very often, that will be a big win.

# Write a timestamp file at the top of the directory you're scanning
sub set_last_scan_time {
    my $dir = shift;

    my $file = "$dir/.last_scan";
    open my $fh, ">", $file or die "Can't open $file for writing: $!";
    print $fh time;

    return
}

# Read the timestamp file
sub get_last_scan_time {
    my $dir = shift;

    my $file = "$dir/.last_scan";

    return 0 unless -e $file;

    open my $fh, "<", $file or die "Can't open $file: $!";
    my $time = <$fh>;
    chomp $time;

    return $time;
}

use File::Find;
use File::Slurp 'read_file';
use File::stat;

my $last_scan_time = get_last_scan_time($dir);

# Place the regex outside the routine just to make things tidier.
my $regex = qr{this|that|blah|...};
my @huhFiles;
sub scan_file {
    # Only scan text files
    return unless -T $_;

    # Don't bother scanning if it hasn't changed
    return if stat($_)->mtime < $last_scan_time;

    push(@huhFiles, $_) if read_file($_) =~ $regex;
}

# Set the scan time to before you start so if anything is edited
# while you're scanning you'll catch it next time.
set_last_scan_time($dir);

find(\&scan_file, $dir);
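
And a minimal sketch of the "more robust version" mentioned above, which caches a per-file checksum so a file whose mtime changed but whose contents did not is still skipped; the cache file name and the $regex placeholder are assumptions, not from the original post:

use strict;
use warnings;
use File::Find;
use Digest::MD5 'md5_hex';
use Storable qw(retrieve nstore);

my $dir        = shift || '.';
my $cache_file = "$dir/.scan_cache";                 # assumed location for the checksum cache
my $cache      = -e $cache_file ? retrieve($cache_file) : {};
my $regex      = qr{this|that|blah};                 # put the real pattern here
my @huhFiles;

find(sub {
    return unless -f && -T $_;
    open my $fh, '<', $_ or return;
    my $contents = do { local $/; <$fh> };           # slurp the whole file

    my $sum  = md5_hex($contents);
    my $name = $File::Find::name;
    return if defined $cache->{$name} && $cache->{$name} eq $sum;   # unchanged since last scan

    $cache->{$name} = $sum;                          # remember this version of the file
    push @huhFiles, $name if $contents =~ $regex;
}, $dir);

nstore($cache, $cache_file);
print "$_\n" for @huhFiles;
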
胡渣熟男 2024-11-08 17:47:37


You're not doing anything that will cause an obvious performance problem, so you will have to look outside Perl. Use grep. It should be much faster.

open my $grep, "-|", "grep", "-l", "-P", "-I", "-r", $regex, $dir;
my @files = <$grep>;
chomp @files;

-l will return just the filenames that match. -P will use Perl-compatible regular expressions. -r will make it recurse through directories. -I will ignore binary files. Make sure your system's grep has all those options.
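
For completeness, a minimal sketch of how that might be wired up, assuming the keyword list is joined into a plain pattern string (grep takes a string, not a qr// object) and reusing $cDir from the question:

# a few of the question's keywords, just to keep the example short
my @keywords = ('sendraw', 'portscan', 'stunshell', 'irc.byroe.net', 'milw0rm', 'tcpflooder');
my $regex    = join '|', map { quotemeta } @keywords;
my $dir      = "$cDir/www/htdocs";

open my $grep, "-|", "grep", "-l", "-P", "-I", "-r", $regex, $dir
    or die "Can't run grep: $!";
my @huhFiles = <$grep>;
chomp @huhFiles;
close $grep;

print "$_\n" for @huhFiles;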

逆蝶 2024-11-08 17:47:37


Contrary to the other answers, I would suggest performing the regex once on each entire file, not once per line.

use File::Slurp 'read_file';
        ...
    if (-B $_ ) {
    }else{
        if ( read_file("$_") =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/) {
            push(@huhFiles, $_);
        }
    }

Make sure you are using at least perl 5.10.1; its regex engine compiles a long alternation of literal strings like this into a trie, which makes the single whole-file match much faster.

夜夜流光相皎洁 2024-11-08 17:47:37


There are a number of things I would do to improve performance.

First, you should be precompiling your regex. In general, I do it like this:
my @items = qw(foo bar baz);                     # usually I pull this from a config file
my $regex = '^(?:' . join('|', @items) . ')$';   # as an example. I do a lot of capturing, too.
$regex = qr($regex)i;

Second, as mentioned, you should be reading the files a line at a time. Most performance problems I've seen come from running out of RAM, not CPU.

Third, if you are maxing out one CPU and have a lot of files to work through, split the app into a caller and receivers using fork() so that you can process multiple files at a time on more than one CPU. You could write to a common file and, when done, parse that.
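
As a rough sketch of that idea (the worker count, the per-child temp files used here instead of one common file, and the way @files and $regex get populated are all assumptions, not something from the original answer):

my @files   = @ARGV;                                # assumption: the list of files to scan
my $regex   = qr/sendraw|portscan|milw0rm/;         # trimmed-down version of the question's pattern
my $workers = 4;                                    # how many child processes to run

my @buckets;
push @{ $buckets[ $_ % $workers ] }, $files[$_] for 0 .. $#files;

my @pids;
for my $i (0 .. $workers - 1) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                                # child: scan its share of the files
        open my $out, '>', "matches.$i.tmp" or die "Can't write matches.$i.tmp: $!";
        for my $file (@{ $buckets[$i] || [] }) {
            open my $fh, '<', $file or next;
            while (my $line = <$fh>) {
                if ($line =~ $regex) { print {$out} "$file\n"; last }
            }
        }
        exit 0;
    }
    push @pids, $pid;                               # parent: remember the child and keep going
}
waitpid $_, 0 for @pids;                            # wait for all children, then merge the .tmp files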

Finally, watch your memory usage -- a lot of the time, appending results to a file lets you keep what is in memory a lot smaller.

I have to process large data dumps using 5.8 and 5.10, and this works for me.

素染倾城色 2024-11-08 17:47:37


I'm not sure if this will help, but when you read <FH> into @lines you're reading the entire file into a Perl array all at once. You might get better performance by opening the file and reading it line by line, rather than loading the entire file into memory before processing it. However, if your files are small, your current method might actually be faster...

See this page for an example: http://www.perlfect.com/articles/perlfile.shtml

It might look something like this (note the scalar $line variable - not an array):

open FH, '<', $_;

while ($line = <FH>)
{
    # do something with line
}

close FH;
夜访吸血鬼 2024-11-08 17:47:37


As written, your script reads the entire contents of each file into @lines, then scans every line. That suggests two improvements: Reading a line at a time, and stopping as soon as a line matches.

Some additional improvements: The if (-B $_) {} else { ... } is odd - if you only want to process text files, use the -T test. You should always check the return value of open(). And there's a useless use of quotes in your push(). Taken all together:

sub sStringFind {
    if (-T $_) {
        # Always - yes, ALWAYS check for failure on open()
        open(my $fh, '<', $_) or die "Could not open $_: $!";

        while (my $fstring = <$fh>) {
            if ($fstring =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/) {
                push(@huhFiles, $_);
                last; # No need to keep checking once this file's been flagged
            }
        }
    }
}
故乡的云 2024-11-08 17:47:37


Just to add something else.

If you're assembling your regexp from a list of search terms, then Regexp::Assemble::Compressed can be used to fold your search terms into a shorter regular expression:

use Regexp::Assemble::Compressed;

my @terms = qw(sendraw portscan stunshell Bruteforce fakeproc sub google sub alltheweb sub uol sub bing sub altavista sub ask sub yahoo virgillio filestealth IO::Socket::INET /usr/sbin/bjork /usr/local/apache/bin/httpd /sbin/syslogd /sbin/klogd /usr/sbin/acpid /usr/sbin/cron /usr/sbin/httpd irc.byroe.net milw0rm tcpflooder);

my $ra = Regexp::Assemble::Compressed->new;
$ra->add("\Q${_}\E") for @terms;
my $re = $ra->re;
print $re."\n";

print "matched" if 'blah blah yahoo' =~ m{$re};

This produces:

(?-xism:(?:\/(?:usr\/(?:sbin\/(?:(?:acpi|http)d|bjork|cron)|local\/apache\/bin\/httpd)|sbin\/(?:sys|k)logd)|a(?:l(?:ltheweb|tavista)|sk)|f(?:ilestealth|akeproc)|s(?:tunshell|endraw|ub)|(?:Bruteforc|googl)e|(?:virgilli|yaho)o|IO::Socket::INET|irc\.byroe\.net|tcpflooder|portscan|milw0rm|bing|uol))
matched

This may be of benefit for very long lists of search terms, particularly for Perl pre 5.10.
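
To tie this back to the original program, a minimal sketch of using the assembled pattern in the scan; the -T check, @huhFiles and "$cDir/www/htdocs" are carried over from the question and the other answers, not from Regexp::Assemble itself:

use File::Find;

my @huhFiles;
find(sub {
    return unless -f && -T $_;                      # plain text files only
    open my $fh, '<', $_ or return;
    while (my $line = <$fh>) {
        if ($line =~ $re) {                         # $re is the assembled pattern from above
            push @huhFiles, $File::Find::name;
            last;                                   # one hit is enough to flag the file
        }
    }
}, "$cDir/www/htdocs");
print "$_\n" for @huhFiles;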

何以笙箫默 2024-11-08 17:47:37


Just working from your code:

#!/usr/bin/perl

# it looks awesome to use strict
use strict;
# using warnings is beyond awesome
use warnings;
use File::Find;

# $cDir comes from the larger script in the question; declare it here so this compiles under strict
my $cDir = shift || '.';

my $keywords = qr[sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder];

my @huhfiles;

find sub {
        return unless -f;
        my $file = $File::Find::name;

        open my $fh, '<', $file or die "$!\n";
        local $/ = undef;
        my $contents = <$fh>;
        # modern Perl handles this but it's a good practice
        # to close the file handle after usage
        close $fh;

        if ($contents =~ $keywords) {
                push @huhfiles, $file;
        }
}, "$cDir/www/htdocs";

if (@huhfiles) {
        print join "\n", @huhfiles;
} else {
        print "No vulnerable files found\n";
}
滴情不沾 2024-11-08 17:47:37


Don't read all of the lines at once. Read one line at a time, and when you find a match in the file, break out of the loop and stop reading from that file.

Also, don't interpolate when you don't need to. Instead of

push(@huhFiles, "$_");

do

push(@huhFiles, $_);

This won't be a speed issue, but it's better coding style.
