How to tag a scientific data processing tool to ensure repeatability

Posted 2024-09-09 01:34:24

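We develop a data processing tool that extracts scientific results from a given set of raw data. In data science it is very important to be able to re-obtain your results and repeat the calculations that led to a result set.

Since the tool is constantly evolving, we need a way to find out which revision/build of the tool generated a given result set, and how to find the corresponding source from which that build was made.

The tool is written in C++ and Python; the C++ parts are glued together using Boost::Python. We use CMake as the build system, generating Makefiles for Linux. The project is currently stored in a Subversion repository, but some of us already use git or hg, and we plan to migrate the whole project to one of them in the near future.

What are the best practices in a scenario like this for getting a unique mapping between source code, binary, and result set?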

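Ideas we are already discussing:

Somehow injecting the global revision number
Using a build number generator
Storing the whole source code inside the executable itself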

芸娘子的小脾气 2024-09-16 01:34:24

This is a problem I spend a fair amount of time working on. To what @VonC has already written let me add a few thoughts.

I think that the topic of software configuration management (SCM) is well understood and often carefully practiced in commercial environments. This general approach is often lacking in scientific data processing environments, many of which either remain in, or have grown out of, academia. Still, if you are in such a working environment, there are readily available sources of information and advice and lots of tools to help. I won't expand on this further.

I don't think that your suggestion of including the whole source code in an executable is, even if feasible, necessary. Indeed, if you get SCM right, then one of the essential tests that you have done so, and continue to do so, is your ability to rebuild 'old' executables on demand. You should also be able to determine which revision of the sources was used in each executable and version. These ought to make including the source code in an executable unnecessary.

The topic of tying result sets in to computations is also, as you say, essential. Here are some of the components of the solution that we are building:

We are moving away from the traditional unstructured text file that is characteristic of the output of a lot of scientific programs towards structured files (in our case we're looking at HDF5 and XML) in which both the data of interest and the meta-data are stored. The meta-data includes the identification of the program (and version) which was used to produce the results, the identification of the input data sets, job parameters and a bunch of other stuff.
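To make the idea of keeping meta-data next to the data concrete, here is a minimal sketch in Python using only the standard library. The element names, the `build_result_document` function, and the choice of a SHA-256 digest as the "identification" of an input data set are all my own assumptions for illustration, not the layout we actually use.

```python
import hashlib
import xml.etree.ElementTree as ET


def sha256_of_bytes(data: bytes) -> str:
    """Identify an input data set by the digest of its content (an assumption)."""
    return hashlib.sha256(data).hexdigest()


def build_result_document(tool_name, tool_revision, input_sets, parameters, values):
    """Return an XML string holding both the results and their provenance.

    input_sets maps a data set name to its raw bytes; parameters maps a
    job parameter name to its value. All names here are hypothetical.
    """
    root = ET.Element("result_set")
    meta = ET.SubElement(root, "metadata")

    # Which program, at which revision, produced this result set.
    ET.SubElement(meta, "program", name=tool_name, revision=tool_revision)

    # Which inputs were used, identified by content digest.
    inputs = ET.SubElement(meta, "inputs")
    for name, content in sorted(input_sets.items()):
        ET.SubElement(inputs, "input", name=name, sha256=sha256_of_bytes(content))

    # Which job parameters were in effect.
    params = ET.SubElement(meta, "parameters")
    for key, value in sorted(parameters.items()):
        ET.SubElement(params, "parameter", name=key, value=str(value))

    # The data of interest, stored in the same file as the meta-data.
    data = ET.SubElement(root, "data")
    for v in values:
        ET.SubElement(data, "value").text = repr(v)

    return ET.tostring(root, encoding="unicode")
```

A call such as `build_result_document("extractor", "r1234", {"raw.dat": raw_bytes}, {"threshold": 0.5}, results)` would then give one self-describing document per run. The same fields could just as well be written as attributes in an HDF5 file.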

We looked at using a DBMS to store our results; we'd like to go this way but we don't have the resources to do it this year, probably not next either. But businesses use DBMSs for a variety of reasons, and one of the reasons is their ability to roll-back, to provide an audit trail, that sort of thing.

We're also looking closely at which result sets need to be stored. A nice approach would be only ever to store original data sets captured from our field sensors. Unfortunately some of our computations take thousands of CPU-hours to run, so it is infeasible to reproduce them ab initio on demand. However, we will be storing far fewer intermediate data sets in future than we have in the past.

We are also making it much harder (I'd like to think impossible but am not sure we are there yet) for users to edit result sets directly. Once someone does that all the provenance information in the world is wrong and useless.
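One simple way to at least notice a direct edit is to store a digest of the result set alongside it and check the digest before trusting the provenance. This is a sketch of that idea only, not a description of what we have built; the `seal` and `is_unmodified` names and the JSON layout are hypothetical.

```python
import hashlib
import json


def _digest(payload: dict) -> str:
    # Serialize with sorted keys so the same content always hashes the same.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def seal(payload: dict) -> dict:
    """Wrap a result set together with a digest of its content."""
    return {"payload": payload, "sha256": _digest(payload)}


def is_unmodified(sealed: dict) -> bool:
    """Return True if the payload still matches the digest recorded at seal time."""
    return _digest(sealed["payload"]) == sealed["sha256"]
```

A plain digest only catches accidental or careless edits, since anyone who edits the data can also recompute the digest. Making edits truly impossible would need something stronger, such as keeping the digests somewhere users cannot write to.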

Finally, if you want to read more about the topic, try Googling for 'scientific workflow', 'data provenance' and similar topics.

EDIT: It's not clear from what I wrote above, but we have modified our programs so that they contain their own identification (we use Subversion's keyword capabilities for this with an extension or two of our own) and write this into any output that they produce.
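As a rough illustration of the keyword approach, here is a Python sketch. Subversion expands the literal `$Revision$` in a file to something like `$Revision: 1234 $` when the file's `svn:keywords` property includes `Revision`; the revision value below is made up, and the `parse_revision` and `identification_header` helpers are hypothetical names, not our actual code.

```python
import re

# Subversion rewrites this literal when svn:keywords includes "Revision".
# The number shown here is a made-up example value.
SVN_REVISION_KEYWORD = "$Revision: 1234 $"

# Matches an expanded keyword; an unexpanded "$Revision$" does not match.
_KEYWORD_RE = re.compile(r"^\$Revision:\s*(\d+)\s*\$$")


def parse_revision(keyword: str) -> str:
    """Extract 'r<number>' from an expanded keyword, or 'unknown' if unexpanded."""
    match = _KEYWORD_RE.match(keyword)
    return "r" + match.group(1) if match else "unknown"


def identification_header(program_name: str, keyword: str = SVN_REVISION_KEYWORD) -> str:
    """Build the identification line a program writes at the top of its output."""
    return f"# generated by {program_name} {parse_revision(keyword)}"
```

Keep in mind that the keyword reflects the last revision in which that particular file changed, not the state of the whole working copy, which is one reason extensions on top of the plain keyword mechanism can be useful.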
