I can personally vouch for Clojure as a great tool for this kind of work. (I believe Scala would be great too; I just have less experience with it.)
My personal research is in the field of predictive modelling / machine learning and is very computationally intensive - so I think it has many parallels with bioinformatics or biostatistics.
My personal approach / setup includes:
Incanter used primarily as a data visualisation tool. Great for producing quick visualisations which are usually just 1-liners at the REPL. There are also lots of statistical and numerical processing tools which I believe use the Colt library under the hood. I'm not an expert in R but I understand that Incanter is roughly "R translated to Clojure/Lisp".
Exploiting quite a few Java libraries as needed. Some of these are my own, for example algorithms that I have written in Java in order to get the best possible fine-tuned performance out of the JVM. But you could just as easily use any of the other great Java libraries available, as calling Java from Clojure is very simple: (.methodName object param1 param2).
Quite a lot of higher order functions to automate my workflow. For example I have a higher order function that will run an optimisation algorithm of any kind in a loop for a specified amount of time and then produce an Incanter graph of the improvement on each iteration. Not rocket science, but really easy to code up in a few lines of Clojure.
Never really having to worry about performance. You can make Clojure go pretty fast if you want to (e.g. with type hints, primitive arithmetic support etc.) but normally it's irrelevant as you're going to spend 99%+ of your cycles in well-optimised library code anyway. Hence a bit of overhead in the "glue" code is negligible - I feel I gain much more in terms of personal productivity by having a dynamic, high-level, functional language to work in.
Major use of Clojure's concurrency features - this has to be one of Clojure's strongest features. I tend to use the STM to code concurrent processes with transactions that can't interfere with each other, then kick off long-running calculations in a future so that I can get on with other tasks and wait for notification of the result.
A slowly growing collection of macros to "extend the language" when needed. I actually use macros less than I thought I would (higher order functions are often a better choice). But when you need them they are invaluable - this is where you really appreciate the value of a homoiconic language. Since they effectively allow you to add new syntax to the language itself, they are very powerful when used correctly to build the DSL that you need.
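As a rough illustration of the higher-order workflow function mentioned in the list above (this is a sketch, not the answerer's actual code): the names `run-for`, `step` and `score` are hypothetical, and the toy "optimiser" simply halves a number. It also shows Java interop in passing, through the static call `(System/currentTimeMillis)`.

```clojure
;; Hypothetical helper: apply `step` repeatedly for about `millis` ms,
;; recording (score state) after every iteration.
(defn run-for
  [step score init millis]
  (let [deadline (+ (System/currentTimeMillis) millis)] ; Java static call
    (loop [state  init
           scores [(score init)]]
      (if (< (System/currentTimeMillis) deadline)
        (let [next-state (step state)]
          (recur next-state (conj scores (score next-state))))
        {:state state :scores scores}))))

;; Toy optimiser: minimise x^2 by halving x on each step.
(def result
  (run-for #(/ % 2.0)   ; step:  state -> state
           #(* % %)     ; score: state -> number (lower is better)
           10.0         ; initial state
           5))          ; time budget in milliseconds
```

The `:scores` vector could then be handed to a plotting function (for example Incanter's `xy-plot`, assuming Incanter is on the classpath) to graph the improvement per iteration. Because `step` and `score` are plain arguments, any optimisation routine can be plugged in.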
In short - I don't think you can go wrong with Clojure as a researcher.
The one thing I probably wouldn't use it for (yet) is actually writing a new numerical library - this would probably be better done in Scala or pure Java as you would probably want to adopt a more imperative / OOP style.
I am not sure about bioinformatics and biostatistics per se, but I do scientific data analysis frequently and I appreciate that Scala allows me to write as-fast-as-Java code with relative ease. I believe that it is often possible in Clojure now, but I haven't seen the benchmarks to back that up. For the time being, I think the prudent thing to assume is that they do not perform equally well. See, for example, the Computer Language Benchmarks Game, where Scala is faster than Clojure in every single test. (Ignore the horrible "pidigits" result for Clojure: Scala and Java are calling the GMP library written in C, which Clojure could also do, but a technical detail requires a different wrapping for the library, and that wrapping isn't presently allowed in the game.) Looking at multicore comparisons doesn't improve Clojure's showing, and note that the Clojure code is no shorter for these sorts of lowish-level algorithmic tasks.
Clojure is ahead for the time being with parallel collections, though the upcoming 2.9 release of Scala should make up much of the difference. Neither has a gentle learning curve when coming from C++; Scala is maybe a little easier given that the syntax outwardly looks a little more familiar. I believe there are good materials for learning each of them.
Edit: P.S. You can call R from Java (and therefore from either Clojure or Scala) using rJava (specifically the JRI interface). Edit to edit: and, these days, rScala.
Edit #2: Scala was faster than Clojure in everything at the time of writing; as of this edit, Clojure's a little ahead in one (at the cost of a huge amount of code)--but anyway, the overall point stands. (And the Scala implementation on that one test could be sped up.)
Scala is geared toward being syntactically easy for people coming from Java, which was itself intended to be syntactically easy for people coming from C, though with two levels of indirection like this the advantage may be lost.
Clojure is getting a lot of traction in the Big Data space and maps very well onto Hadoop jobs for Huge Data. I think this would be a big advantage in the bioinformatics world.
Really, these things are largely personal taste, so try both and see what makes you happy :)
If you are looking to get a feel for Clojure without a lot of "intellectual overhead", may I suggest using Leiningen to get a test project started quickly?
I don't know Scala, so I can't offer a comparison, but I am actively using Clojure in bioinformatics projects.
The Java integration is excellent, and I have had no problem making use of the BioJava libraries.
Where Clojure's concurrency model shines is in the immutable default data types and functional programming with the seq abstraction.
In my bioinformatics work I very often find myself with a lot of input data (say, gene sequences) which all need to be subjected to the same analysis. Once I have my analysis function I can map it over a sequence of inputs (with the results lazily generated). I have gotten full utilization of a large 48-core server simply by changing that map to a pmap.
Large-scale parallelization with a single-character change is hard to beat!
Of course pmap isn't a magic bullet and only helps when the analysis function computationally dominates, but the fact that map and pmap can just be plugged in and out shows the elegance and simplicity enabled by Clojure's design.
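To make the map/pmap swap concrete, here is a minimal sketch. The analysis function `gc-content` and the sample sequences are made up for illustration; a real analysis would need to be far more expensive per item for `pmap` to pay off.

```clojure
;; Hypothetical analysis function: fraction of G/C bases in a DNA string.
(defn gc-content [s]
  (/ (count (filter #{\G \C} s))
     (double (count s))))

(def sequences ["ATGC" "GGCC" "ATAT"])

;; Serial version: lazy, so doall forces the results.
(def serial   (doall (map  gc-content sequences)))

;; Parallel version: same call shape, one extra character.
;; pmap returns results in the same order as the input.
(def parallel (doall (pmap gc-content sequences)))
```

One practical note: `pmap` runs on the same thread pool as futures, so a standalone script may want to call `(shutdown-agents)` at the end to let the JVM exit promptly.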
I am only passingly familiar with Scala, so the best I can do is evangelize a bit for Clojure. It's a great language, but take all this advice with a grain of salt as it's coming from an enthusiast.
If you are looking for concurrency, Clojure is fantastic both for ease of programming and for performance. The immutable data structures mean that it's trivial to work with a coherent snapshot of the world without any manual and error-prone locking; the STM makes it fairly simple to change data in a thread-sensitive way without breaking anyone else's snapshots.
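A small sketch of what the snapshot and STM claims look like in practice. The refs `samples-pending` and `samples-done` and the function `mark-done!` are hypothetical names chosen for the example.

```clojure
;; Two refs that must always change together.
(def samples-pending (ref #{:s1 :s2 :s3}))
(def samples-done    (ref #{}))

(defn mark-done!
  "Moves id from pending to done in a single STM transaction, so no
  reader ever sees it in both sets or in neither."
  [id]
  (dosync
    (alter samples-pending disj id)
    (alter samples-done    conj id)))

;; Dereferencing gives an immutable value: a snapshot that later
;; transactions cannot modify.
(def snapshot @samples-pending)

(mark-done! :s1)              ; on the current thread
@(future (mark-done! :s2))    ; on another thread; deref waits for it
```

After this runs, `snapshot` still holds all three ids, while the refs reflect both updates. No locks appear anywhere in the code.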
My understanding is that Scala has a lot of the nice functional tools that Clojure does, but Clojure will always win syntactically by virtue of being a Lisp. If you're looking to do some specialized bioinformatics stuff, Clojure is able to hide the bits of Lisp that you don't want, and raise your own constructs to the same level as the built-in language constructs. I can't find the reference right now, but there's some well-known quote about Lisp that goes like:
Lisp is not the perfect language for any program. But it is the perfect language for building the perfect language for every program.
That's horribly paraphrased, but in my experience it has been true. It looks like you'll want a fairly specialized set of tools, and no language will make those feel as natural as a Lisp.
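As a tiny illustration of raising your own construct to the level of the built-in ones, here is a sketch of a macro. `when-long` is a hypothetical name, not part of Clojure or of any library.

```clojure
;; Hypothetical construct: bind a sequence to a name and evaluate the
;; body only if the sequence is longer than min-len.
(defmacro when-long
  [[sym s] min-len & body]
  `(let [~sym ~s]
     (when (> (count ~sym) ~min-len)
       ~@body)))

;; Used exactly like a built-in form such as when-let:
(def long-result  (when-long [s "ATGCATGC"] 4 (count s)))  ; body runs
(def short-result (when-long [s "AT"]       4 (count s)))  ; body skipped, nil
```

A plain function would work here too if the body were wrapped in a function; the macro only buys the nicer call syntax, which is why higher-order functions are often tried first.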
You have to ask yourself how important functional programming is for you. You know C++, so you probably know OO. I would say it's easier to do FP in Clojure (because you can't really drop back to OO style); in Scala you will probably end up dropping FP and writing in a more OO style.
I can't really say anything about your application space.
Since you mentioned R, there is an R-like Clojure library for statistics called Incanter. I don't know about other existing projects in your application space.
There is a lot of information about both languages, so that should not be a problem. The learning curve is kind of steep with both languages. Clojure is a much smaller language, and since you already know some Lisp it should not be too hard to learn the important stuff. Scala has a type system that will be hard to pick up, especially since your main experience is with C/C++.
Both languages have great concurrency models and you will probably be happy with both.
I have some experience in Scala and only a little knowledge of Clojure, but I programmed Lisp many years ago.
Lisp is a beautiful language, but it never made it to the world, because it was too limited. I believe you need a statically typed language to develop robust systems. Scala's type system is not difficult to learn well enough to benefit from it. If you want to do very advanced things with it to make your libraries idiot-proof, you can, but then you will need to study the type system a little more.
Scala favours immutable types, but you can use mutables without any problem, which you sometimes do need. Concurrency in Scala is very well implemented and frameworks like akka extend and enhance these possibilities.
Scala stands a better chance of becoming a mainstream language since it's a fuller language. I'm afraid that Clojure is too much like Lisp (but reimplemented on the JVM). I liked Lisp a lot, but it had too many disadvantages for real-life programs. With Scala I think we have the best of both worlds (OO and functional) in a clean marriage. On top of that, Scala seems to be really catching on in the market.
We have been working on some experimental code in the Rudolf/BioClojure project on GitHub. Also, look at Jan Aert's BioClojure project, which is more structured.
Additionally, there is a BioCaml project in the works...
If you like R, give Incanter a try! It's R for Clojure.
To build on Rex's answer I would like to add some Scala libraries/products that may be of interest to you: