Why is Perl used so widely in biology research?

Posted 2024-08-26 22:38:01


Comments (12)

您的好友蓝忘机已上羡 2024-09-02 22:38:01

Lincoln Stein highlighted some of the saving graces of Perl for bioinformatics in his article:
How Perl Saved the Human Genome Project.

From his analysis:


I think several factors are responsible:

  1. Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text. Although the biological sciences do involve a good deal of numeric analysis now, most of the primary data is still text: clone names, annotations, comments, bibliographic references. Even DNA sequences are textlike. Interconverting incompatible data formats is a matter of text mangling combined with some creative guesswork. Perl's powerful regular expression matching and string manipulation operators simplify this job in a way that isn't equalled by any other modern language.

  2. Perl is forgiving. Biological data is often incomplete, fields can be missing, or a field that is expected to be present once occurs several times (because, for example, an experiment was run in duplicate), or the data was entered by hand and doesn't quite fit the expected format. Perl doesn't particularly mind if a value is empty or contains odd characters. Regular expressions can be written to pick up and correct a variety of common errors in data entry. Of course this flexibility can also be a curse. I talk more about the problems with Perl below.

  3. Perl is component-oriented. Perl encourages people to write their software in small modules, either using Perl library modules or with the classic Unix tool-oriented approach. External programs can easily be incorporated into a Perl script using a pipe, system call or socket. The dynamic loader introduced with Perl5 allows people to extend the Perl language with C routines or to make entire compiled libraries available for the Perl interpreter. An effort is currently under way to gather all the world's collected wisdom about biological data into a set of modules called "bioPerl" (discussed at length in an article to be published later in the Perl Journal).

  4. Perl is easy to write and fast to develop in. The interpreter doesn't require you to declare all your function prototypes and data types in advance, new variables spring into existence as needed, calls to undefined functions only cause an error when the function is needed. The debugger works well with Emacs and allows a comfortable interactive style of development.

  5. Perl is a good prototyping language. Because Perl is quick and dirty, it often makes sense to prototype new algorithms in Perl before moving them to a fast compiled language.
    Sometimes it turns out that Perl is fast enough that the algorithm doesn't have to be ported; more frequently one can write a small core of the algorithm in C, compile it as a dynamically loaded module or external executable, and leave the rest of the application in Perl (for an example of a complex genome mapping application implemented in this way, see http://waldo.wi.mit.edu/ftp/distribution/software/rhmapper/).

  6. Perl is a good language for Web CGI scripting, and is growing in importance as more labs turn to the Web for publishing their data.
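
As a rough illustration of points 1 and 2 above (regular expressions plus tolerance for messy input), here is a minimal sketch. It is written in Python rather than Perl, and the record format, the field names, and the `parse_record` helper are all hypothetical; none of them come from Stein's article.

```python
import re

# Hypothetical hand-entered records: spacing and case vary,
# and some fields are missing, empty, or repeated.
RAW_RECORDS = [
    "clone=RP11-12A4; length=1500; note=partial",
    "Clone = rp11-7b2 ;length= 980",
    "clone=RP11-3C9; length=; length=2100",
]

# One "key = value" pair; the value runs up to the next semicolon.
FIELD = re.compile(r"(\w+)\s*=\s*([^;]*)")


def parse_record(line):
    """Return a dict of fields, tolerating missing, empty, and repeated values."""
    record = {}
    for key, value in FIELD.findall(line):
        value = value.strip()
        if not value:
            continue  # skip empty values instead of failing
        record[key.lower()] = value  # a later duplicate overrides an earlier one
    length = record.get("length", "")
    return {
        "clone": record.get("clone", "").upper() or None,  # normalize case
        "length": int(length) if length.isdigit() else None,
        "note": record.get("note"),
    }


parsed = [parse_record(line) for line in RAW_RECORDS]
for row in parsed:
    print(row)
```

The same leniency cuts both ways, as the quote notes: silently skipping an empty `length` keeps the script running, but it can also hide a genuine data-entry problem.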

神爱温柔 2024-09-02 22:38:01

The real answer probably has less to do with Perl than you think. Many of the things that happen are accidents of history. At the time, way back when, Perl was pretty popular, Java was getting more popular, not too many people were paying attention to Python, and Ruby was just getting started.

The people who needed to get work done used Perl and made some libraries in Perl, and other people started using those libraries. Once people start using something that is moderately useful to them, they tend not to switch (economists call those "switching costs"). From there, even more people start using it because a lot of other people are using it.

The same evolution might not happen today. I'd say that Perl, Python, and Ruby are all completely adequate and up to the task. All the things that mobrule quotes from Lincoln Stein could apply to any of the three today. If everyone had to start from scratch today, any one of those languages could be the one that everyone uses.

I've noticed, from my own client base though (a very small and unrepresentative sample of biotech), that the people pushing the programming for a lot of the biological stuff seemed to be at least part-time sysadmins who were supporting scientists. The scientists worried about the science and did some light programming, but the IT support people were doing a lot of the heavy lifting for the non-science parts. Perl is very well positioned as a sysadmin tool since it's the duct-tape of the internet.

离笑几人歌 2024-09-02 22:38:01

Probably because Perl is good at manipulating strings, and much research in genetics involves the manipulation of veeery long "ACTGCATG..." strings. Just guessing...
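
To make "manipulating strings of bases" concrete, here is a small sketch of two common operations, the reverse complement and the GC fraction. It is written in Python for illustration, and the function names are my own choices, not from any particular library.

```python
# Translation table mapping each base to its complement (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of bases that are G or C; 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


seq = "ACTGCATG"
print(reverse_complement(seq))  # CATGCAGT
print(gc_content(seq))          # 0.5
```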

橘味果▽酱 2024-09-02 22:38:01

I use lots of Perl for dealing with qualitative and quantitative data in social science research. In terms of getting things done (largely with text) quickly, finding libraries on CPAN (nice central location), and generally just getting things done quickly, it can't be surpassed.

Perl is also excellent glue, so if you have some instrument records and you need to glue them to data analysis routines, then Perl is your language.

墨洒年华 2024-09-02 22:38:01

Perl seems to be the language of choice for bioinformatics - there's even an O'Reilly title on just this subject: Beginning Perl for Bioinformatics.

半窗疏影 2024-09-02 22:38:01

Perl is very powerful when it comes to dealing with text, and it's present in almost every Linux/Unix distribution. In bioinformatics, not only are sequence data very easy to manipulate with Perl, but most bioinformatics algorithms also output some kind of text results.

Then, the biggest bioinformatics centers like the EBI had that great guy, Ewan Birney, who was leading the BioPerl project. That library has lots of parsers for the results of every kind of popular bioinformatics algorithm, and for manipulating the different sequence formats used in major sequence databases.

Nowadays, however, Perl is not the only language used by bioinformaticians: along with sequence data, labs produce more and more different kinds of data types and other languages are more often used in those areas.

The R statistics programming language, for example, is widely used for statistical analysis of microarray and qPCR data (among others). Again, why are we using it so much? Because it has great libraries for that kind of data (see the Bioconductor project).

Now when it comes to web development, CGI is not really state of the art today, but people who know Perl may stick to it. In my company though it is no longer used...

I hope this helps.

小瓶盖 2024-09-02 22:38:01

Perl basically forces very short development cycles. That's the kind of development that gets stuff done.

It's enough to outweigh Perl's disadvantages.

面犯桃花 2024-09-02 22:38:01

Bioinformatics deals primarily in text parsing, and Perl is the best programming language for the job because it was made for string parsing. As the O'Reilly book (Beginning Perl for Bioinformatics) says, "With [Perl's] highly developed capacity to detect patterns in data, Perl has become one of the most popular languages for biological data analysis."

夜未央樱花落 2024-09-02 22:38:01

This seems to be a pretty comprehensive response. Perhaps one thing missing, however, is that most biologists (until recently, perhaps) don't have much programming experience at all. The learning curve for Perl is much lower than for compiled languages (like C or Java), and yet Perl still provides a ton of features when it comes to text processing. So what if it takes longer to run? Biologists can definitely handle that. Lab experiments routinely take an hour or more to finish, so waiting a few extra minutes for the data processing to finish isn't going to kill them!

Just note that I am talking here about biologists that program out of necessity. I understand that there are some very skilled programmers and computer scientists out there that use Perl as well, and these comments may not apply to them.

时光礼记 2024-09-02 22:38:01

People missed out DBI, the Perl abstract database interface that makes it really easy to work with bioinformatic databases.

There is also the one-liner angle. You can write something to reformat data in a single line in Perl and just use the -pe flag to embed that at the command line. Many people using AWK and sed moved to Perl. Even in full programs, file I/O is incredibly easy and quick to write, and text transformation is expressive at a high level compared to any engineering language around. People who use Java or even Python for one-off text transformation are just too lazy to learn another language. Java especially has a high dependence on the JVM implementation and its I/O performance.

At least you know how fast or slow Perl will be everywhere, slightly slower than C I/O. Don't learn grep, cut, sed, or AWK; just learn Perl as your command line tool, even if you don't produce large programs with it. Regarding CGI, Perl has plenty of better web frameworks such as Catalyst and Mojolicious, but the mindshare definitely came from CGI and bioinformatics being one of the earliest heavy users of the Internet.

我很OK 2024-09-02 22:38:01

Perl is very easy to learn compared with other languages. It can fully exploit biological data, which is becoming big data. It handles large datasets and performs well for data manipulation, data curation, and all kinds of DNA programming; automating biology has become easy thanks to languages like Perl, Python, and Ruby. It is very approachable for people who know biology but do not know how to program in other languages.

柠栀 2024-09-02 22:38:01

Personally, and I know this will date me, it's because I learned Perl first. I was being asked to take FASTA files and mix them with other FASTA files. Perl was the recommended tool when I asked around.

At the time I'd been through a few computer science classes, but I didn't really know programming all that well.

Perl proved fairly easy to learn. Once I'd gotten regular expressions into my head I was parsing and making new FASTA files within a day.

As has been suggested, I was not a programmer. I was a biochemistry graduate working in a lab, and I'd made the mistake of setting up a Linux server where everyone could see me. This was back in the day when that was an all-day project.

Anyway, Perl became my go-to for anything I needed to do around the lab. It was awesome, easy to use, and super flexible, and the other Perl guys in other labs were a lot like me.

So, to cut it short, Perl is easy to learn, flexible and forgiving, and it did what I needed.

Once I really got into bioinformatics I picked up R, Python, and even Java. Perl is not that great at helping to create maintainable code, mostly because it is so flexible. Now I just use the language for the job, but Perl is still one of my favorite languages, like a first kiss or something.

To reiterate, most bioinformatics folks learned coding by just kluging stuff together, and most of the time you're just trying to get an answer for the principal investigator (PI), so you can't spend days on code design. Perl is superb at just getting an answer. It probably won't work a second time, and you will not understand anything in your own code if you see it six months later; BUT if you need something now, it is a good choice, even though I mostly use Python now.

I hope that gives you an answer from someone who lived it.
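
For readers who have not met the format, the kind of FASTA parsing and mixing described above can be sketched in a few lines. This is an illustrative Python version, not the author's original Perl; the sample data, the function names, and the rule of keeping the first copy of a duplicate ID are all assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence ID -> sequence string."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            # Use the first word of the header as the ID.
            current = name.split()[0] if name else "unnamed"
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def format_fasta(records, width=60):
    """Write records back out, wrapping sequence lines at `width` characters."""
    lines = []
    for seq_id, seq in records.items():
        lines.append(f">{seq_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Made-up sample inputs standing in for two FASTA files.
FILE_A = ">seq1 sample A\nACTG\nCATG\n>seq2\nGGCC\n"
FILE_B = ">seq3\nTTAA\n>seq1 duplicate id\nAAAA\n"

merged = parse_fasta(FILE_A)
for seq_id, seq in parse_fasta(FILE_B).items():
    merged.setdefault(seq_id, seq)  # assumed rule: keep the first copy of an ID

print(format_fasta(merged), end="")
```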
