How to tag a scientific data processing tool to ensure reproducibility

Posted on 2024-09-09 01:34:24, 470 words, 9 views, 0 comments

We develop a data processing tool to extract some scientific results out of a given set of raw data. In data science it is very important that you can re-obtain your results and repeat the calculations that led to a result set.

Since the tool is evolving, we need a way to find out which revision/build of our tool generated a given result set and how to find the corresponding source from which the tool was built.

The tool is written in C++ and Python, with the C++ parts glued in using Boost::Python. We use CMake as a build system, generating Makefiles for Linux. Currently the project is stored in a Subversion repo, but some of us already use git or hg, and we are planning to migrate the whole project to one of them in the very near future.

What are the best practices in a scenario like this to get a unique mapping between source code, binary and result set?

Ideas we are already discussing:

  • Somehow injecting the global revision number
  • Using a build number generator
  • Storing the whole source code inside the executable itself

Comments (2)

芸娘子的小脾气 2024-09-16 01:34:24

This is a problem I spend a fair amount of time working on. To what @VonC has already written, let me add a few thoughts.

I think that the topic of software configuration management is well understood and often carefully practiced in commercial environments. This general approach is often lacking, however, in scientific data processing environments, many of which either remain in, or have grown out of, academia. Still, if you are in such a working environment, there are readily available sources of information and advice and lots of tools to help. I won't expand on this further.

I don't think that your suggestion of including the whole source code in an executable is, even if feasible, necessary. Indeed, if you get SCM right, then one of the essential tests that you have done so, and continue to do so, is your ability to rebuild 'old' executables on demand. You should also be able to determine which revision of the sources was used in each executable and version. These ought to make including the source code in an executable unnecessary.

The topic of tying result sets in to computations is also, as you say, essential. Here are some of the components of the solution that we are building:

We are moving away from the traditional unstructured text file that is characteristic of the output of a lot of scientific programs and towards structured files (in our case we're looking at HDF5 and XML) in which both the data of interest and the metadata are stored. The metadata includes the identification of the program (and version) which was used to produce the results, the identification of the input data sets, job parameters and a bunch of other stuff.
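As a rough illustration of that idea (not our actual file layout), here is a minimal Python sketch that stores data and metadata together in one XML document. The element names, the tool name `extract_results`, the revision string `r1234` and the choice of SHA-256 to identify inputs are all assumptions made for the example.

```python
import hashlib
import xml.etree.ElementTree as ET


def sha256_of_bytes(data):
    """Identify an input data set by the SHA-256 digest of its content."""
    return hashlib.sha256(data).hexdigest()


def build_result_document(program, version, inputs, parameters, values):
    """Return an XML tree holding the results plus the metadata describing them."""
    root = ET.Element("result_set")

    # Metadata: which program and version ran, on which inputs, with which parameters.
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "program", name=program, version=version)
    for name, content in inputs.items():
        ET.SubElement(meta, "input", name=name, sha256=sha256_of_bytes(content))
    for key, value in parameters.items():
        ET.SubElement(meta, "parameter", name=key, value=str(value))

    # The data of interest lives in the same file as its metadata.
    data = ET.SubElement(root, "data")
    for value in values:
        ET.SubElement(data, "value").text = repr(value)
    return root


doc = build_result_document(
    program="extract_results",                 # hypothetical tool name
    version="r1234",                           # hypothetical revision identifier
    inputs={"run_001.raw": b"\x00\x01\x02"},   # stand-in for real raw data
    parameters={"threshold": 0.5},
    values=[1.25, 2.5],
)
xml_text = ET.tostring(doc, encoding="unicode")
```

The same fields could equally be stored as attributes on an HDF5 file; XML is used here only because it needs nothing beyond the standard library.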

We looked at using a DBMS to store our results; we'd like to go this way but we don't have the resources to do it this year, probably not next either. But businesses use DBMSs for a variety of reasons, and one of the reasons is their ability to roll-back, to provide an audit trail, that sort of thing.

We're also looking closely at which result sets need to be stored. A nice approach would be only ever to store original data sets captured from our field sensors. Unfortunately, some of our computations take 1000s of CPU-hours to produce, so it is infeasible to reproduce them ab initio on demand. However, we will be storing far fewer intermediate data sets in future than we have in the past.

We are also making it much harder (I'd like to think impossible, but am not sure we are there yet) for users to edit result sets directly. Once someone does that, all the provenance information in the world is wrong and useless.
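Preventing edits is largely a matter of permissions and process, but detecting them is easy to sketch. The following Python example is an assumption on my part rather than a description of what we run: it records a SHA-256 digest when a result file is written and later checks the file against it. The function names and the file name `result.dat` are made up for the illustration.

```python
import hashlib
import os
import tempfile


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path, recorded_digest):
    """True if the file still matches the digest recorded when it was produced."""
    return file_sha256(path) == recorded_digest


with tempfile.TemporaryDirectory() as workdir:
    result_path = os.path.join(workdir, "result.dat")  # hypothetical result file
    with open(result_path, "wb") as handle:
        handle.write(b"1.25 2.50\n")

    # Record the digest at the moment the tool writes the result.
    recorded = file_sha256(result_path)
    untouched_ok = is_unmodified(result_path, recorded)

    # Simulate a user editing the result set by hand.
    with open(result_path, "ab") as handle:
        handle.write(b"9.99\n")
    edited_ok = is_unmodified(result_path, recorded)
```

The recorded digest has to live somewhere users cannot casually change it, otherwise it protects nothing.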

Finally, if you want to read more about the topic, try Googling for 'scientific workflow', 'data provenance' and similar topics.

EDIT: It's not clear from what I wrote above, but we have modified our programs so that they contain their own identification (we use Subversion's keyword capabilities for this with an extension or two of our own) and write this into any output that they produce.
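For a project that has moved to git, as the question anticipates, a comparable effect can be sketched in Python as below. This is not our Subversion keyword setup: it assumes the tool runs from a git checkout with the `git` command available, and the function names, the `# generated-by:` header format and the tool name `extract_results` are invented for the example.

```python
import io
import subprocess


def tool_revision(repo_dir="."):
    """Return the output of `git describe --always --dirty`, or "unknown" on failure."""
    try:
        completed = subprocess.run(
            ["git", "describe", "--always", "--dirty"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or repo_dir is not inside a git work tree.
        return "unknown"
    return completed.stdout.strip() or "unknown"


def write_results(stream, values, revision):
    """Write a header identifying the producing tool, then one value per line."""
    stream.write(f"# generated-by: extract_results {revision}\n")
    for value in values:
        stream.write(f"{value}\n")


buffer = io.StringIO()
write_results(buffer, [1.25, 2.5], tool_revision())
header_line = buffer.getvalue().splitlines()[0]
```

The `--dirty` flag marks builds made from a work tree with uncommitted changes, which are exactly the results that cannot be traced back to a committed revision. For an installed tool it is probably better to capture the revision at build time than to query git at run time.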

私藏温柔 2024-09-16 01:34:24

You need to consider git submodules or hg subrepos.

The best practice in this scenario is to have a parent repo which will reference:

  • the sources of the tool
  • the result set generated from that tool
  • ideally the C++ compiler (won't evolve every day)
  • ideally the Python distribution (won't evolve every day)

Each of those is a component, that is, an independent repository (Git or Mercurial).
One precise revision of each component will be referenced by the parent repository.

The whole process is representative of a component-based approach, and is key to using an SCM (here Software Configuration Management) to its fullest.
