Which cases can benefit from Perl's study?

Posted on 2024-12-19 08:21:34

I'm playing around with study, a Perl feature to examine a string to make subsequent regular expressions potentially much speedier:

while( <> ) {
    study;
    $count++ if /PATTERN/;
    $count++ if /OTHER/;
    $count++ if /PATTERN2/;
    }

There's not much said about which situations will benefit from this. A few things you can tease out of the docs:

  • Patterns with constant strings
  • Multiple patterns
  • Shorter target strings might be better (takes less time to study)

I'm looking for concrete cases where I not only can demonstrate a big advantage, but also cases that I can slightly tweak to lose that advantage. One of the warnings in the docs is that you should benchmark individual cases. I want to find some of the edge cases where a small difference in a string (or pattern) makes a big difference in performance.

If you haven't used study, please don't answer. I'd rather have well-formed, correct answers instead of fast guesses. There's no urgency here, and this isn't holding up any work.

And, as a bonus, I've been playing with a benchmarking tool comparing two NYTProf runs, which I'd rather use than the usual benchmarking tool. If I come up with a way to automate that, I'll share that too.


Comments (4)

孤千羽 2024-12-26 08:21:34

Google turned up this lovely test scenario:

#!/usr/bin/perl
# 
#  Exercise 7.8 
# 
# This is a more difficult exercise. The study function in Perl may speed up searches 
# for motifs in DNA or protein. Read the Perl documentation on this function. Its use 
# is simple: given some sequence data in a variable $sequence, type:
# 
# study $sequence;
# 
# before doing the searches. Do you think study will speed up searches in DNA or 
# protein, based on what you've read about it in the documentation?
# 
# For lots of extra credit! Now read the Perl documentation on the standard module 
# Benchmark. (Type perldoc Benchmark, or visit the Perl home page at
# http://www.perl.com.) See if your guess is right by writing a program that benchmarks motif 
# searches of DNA and of protein, with and without study.
#
# Answer to Exercise 7.8

use strict;
use warnings;

use Benchmark;

my $dna = join ('', qw(
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta
));

my $protein = join('', qw(
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
));

my $count = 1000;

print "DNA pattern matches without 'study' function:\n";
timethis($count,
    ' for(my $i=1 ; $i < 10000; ++$i) {
        $dna =~ /aggtc/;
        $dna =~ /aatggccgt/;
        $dna =~ /gatcgatcagctagcat/;
        $dna =~ /gtatgaac/;
        $dna =~ /[ac][cg][gt][ta]/;
        $dna =~ /ccccccccc/;
    } '
);

print "\nDNA pattern matches with 'study' function:\n";
timethis($count,
    ' study $dna;
    for(my $i=1 ; $i < 10000; ++$i) {
        $dna =~ /aggtc/;
        $dna =~ /aatggccgt/;
        $dna =~ /gatcgatcagctagcat/;
        $dna =~ /gtatgaac/;
        $dna =~ /[ac][cg][gt][ta]/;
        $dna =~ /ccccccccc/;
    } '
);

print "\nProtein pattern matches without 'study' function:\n";
timethis($count,
    ' for(my $i=1 ; $i < 10000; ++$i) {
        $protein =~ /PH.EI/;
        $protein =~ /KFTEQGESMRLY/;
        $protein =~ /[YAL][NVP][ISV][KQE]/;
        $protein =~ /DKKQIR/;
        $protein =~ /[MD][VT][HQ][ER]/;
        $protein =~ /NVPISVKQEITFTDVSEQL/;
    } '
);

print "\nProtein pattern matches with 'study' function:\n";
timethis($count,
    ' study $protein;
    for(my $i=1 ; $i < 10000; ++$i) {
        $protein =~ /PH.EI/;
        $protein =~ /KFTEQGESMRLY/;
        $protein =~ /[YAL][NVP][ISV][KQE]/;
        $protein =~ /DKKQIR/;
        $protein =~ /[MD][VT][HQ][ER]/;
        $protein =~ /NVPISVKQEITFTDVSEQL/;
    } '
);

Note that the reported gain is only about 2% for the most profitable case (protein matches):

#  $ perl exer07.08
# On my computer, this is the output I get: your results probably vary.

#  DNA pattern matches without 'study' function:
#  timethis 1000: 29 wallclock secs (29.25 usr +  0.00 sys = 29.25 CPU) @ 34.19/s (n=1000)
#  
#  DNA pattern matches with 'study' function:
#  timethis 1000: 30 wallclock secs (29.21 usr +  0.15 sys = 29.36 CPU) @ 34.06/s (n=1000)
#  
#  Protein pattern matches without 'study' function:
#  timethis 1000: 32 wallclock secs (29.47 usr +  0.04 sys = 29.51 CPU) @ 33.89/s (n=1000)
#  
#  Protein pattern matches with 'study' function:
#  timethis 1000: 30 wallclock secs (28.97 usr +  0.02 sys = 28.99 CPU) @ 34.49/s (n=1000)
#  
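
One caveat that may be worth checking before reading much into those numbers: as far as I can tell, string code handed to timethis is compiled inside Benchmark, so file-scoped lexicals such as `my $dna` are not visible to it, and the loops above may be matching against an undefined package variable. A minimal sketch of the same comparison with code references, which close over the lexicals, could look like the following. The repeated sequence, the shortened motif list, and the helper name `count_hits` are assumptions for illustration, not part of the exercise.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-in data: the first line of the exercise's DNA sequence, repeated.
my $seq     = 'agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg' x 20;
my $plain   = $seq;    # never passed to study
my $studied = $seq;    # passed to study before matching

# A few of the exercise's motifs, precompiled.
my @motifs = ( qr/aggtc/, qr/aatggccgt/, qr/[ac][cg][gt][ta]/, qr/ccccccccc/ );

# Hypothetical helper: count how many motifs occur in the referenced string.
sub count_hits {
    my ($ref) = @_;
    my $hits = 0;
    for my $re (@motifs) {
        $hits++ if $$ref =~ $re;
    }
    return $hits;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    without_study => sub { count_hits( \$plain ) },
    with_study    => sub { study $studied; count_hits( \$studied ) },
} );
```

Two separate scalars are used so that studying one cannot influence matches against the other. On a perl where study is a no-op, the two rows should differ only by noise.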
娇女薄笑 2024-12-26 08:21:34

I'm going to leave notes as an answer, and later I'll develop it into an actual answer:

In pp.c's PP(pp_study), it has these curious lines (minus a comment):

if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
    RETPUSHNO;
}

It looks like scalars with the UTF8 flag set aren't studied at all.
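
One way to observe that guard from Perl space is through study's return value. This is a small sketch, assuming the return value mirrors the quoted C code: true when the scalar passes the guard, false when it is skipped. The variable names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

my $bytes = "agatggcggcgctgagg";    # plain string, UTF8 flag off
my $chars = $bytes;
utf8::upgrade($chars);              # same characters, UTF8 flag on

# Record whether study reported success for each scalar.
my %studied = (
    bytes => study($bytes) ? 1 : 0,
    chars => study($chars) ? 1 : 0,
);

for my $pair ( [ bytes => $bytes ], [ chars => $chars ] ) {
    my ( $name, $value ) = @$pair;
    printf "%-5s utf8=%d studied=%d\n",
        $name, ( utf8::is_utf8($value) ? 1 : 0 ), $studied{$name};
}
```

If the guard behaves as quoted, the UTF8-flagged copy should be reported as not studied even though both strings hold the same characters.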

梦幻的味道 2024-12-26 08:21:34

None. Since 2012, study does nothing.

Currently the code has

if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
    /* Historically, study was skipped in these cases. */
    SETs(&PL_sv_no);
    return NORMAL;
}

/* Make study a no-op. It's no longer useful and its existence
   complicates matters elsewhere. */
SETs(&PL_sv_yes);
return NORMAL;

which means that study returns true in the case where it would formerly have done something, and false otherwise -- but it never actually does anything.

西瓜 2024-12-26 08:21:34

Not really. If you search for it, most of the results are in the Perl test suite, which suggests that almost nobody uses it. Also, because of a bug, you could only notice a speed benefit on global variables. It actually did bring some speed improvement when dealing with English text (sometimes even 2 times faster), but you had to make the variable global.

It also sometimes caused infinite loops or false positives (study could add bugs to your program, even though it was only supposed to make it faster), and because of that it was removed (or rather, made a no-op) in Perl 5.16; nobody wanted to maintain a part nobody cared about anyway.
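
For anyone trying this on an older perl, a sketch along these lines might help. It only calls study where it could still do work (before 5.16, per the above) and keeps the text in a package variable, since the speed-up reportedly showed up only for globals. The helper name `study_does_work`, the sample text, and the pattern are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true only on perls older than 5.16, where study
# reportedly still did real work.
sub study_does_work { return $] < 5.016 ? 1 : 0 }

# Package (global) variable holding some English text.
our $text = "the quick brown fox jumps over the lazy dog " x 50;

study $text if study_does_work();

# Count every occurrence of either word.
my $hits = () = $text =~ /\b(?:quick|lazy)\b/g;

printf "perl %s: study is %s, %d hits\n",
    $], ( study_does_work() ? 'active' : 'a no-op' ), $hits;
```

The match count is the same either way; only the timing could differ, and only on a perl old enough for study to be live.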
