A Python solution for managing a scientific data dependency graph via specification values

Posted 2024-09-06 11:00:16

I have a scientific data management problem which seems general, but I can't find an existing solution or even a description of it, which I have long puzzled over. I am about to embark on a major rewrite (python) but I thought I'd cast about one last time for existing solutions, so I can scrap my own and get back to the biology, or at least learn some appropriate language for better googling.

The problem:
I have expensive (hours to days to calculate) and big (GBs) data attributes that are typically built as transformations of one or more other data attributes. I need to keep track of exactly how this data is built so I can reuse it as input for another transformation if it fits the problem (built with the right specification values) or construct new data as needed. Although it shouldn't matter, I typically start with 'value-added', somewhat heterogeneous molecular biology info, for example, genomes with genes and proteins annotated by other processes and other researchers. I need to combine and compare these data to make my own inferences. A number of intermediate steps are often required, and these can be expensive. In addition, the end results can become the input for additional transformations. All of these transformations can be done in multiple ways: restricting with different initial data (e.g. using different organisms), using different parameter values in the same inferences, using different inference models, etc. The analyses change frequently and build on others in unplanned ways. I need to know what data I have (what parameters or specifications fully define it), both so I can reuse it where appropriate and for general scientific integrity.

My efforts in general:
I design my Python classes with the problem of description in mind. All data attributes built by a class object are described by a single set of parameter values. I call these defining parameters or specifications the 'def_specs', and these def_specs with their values the 'shape' of the data atts. The entire global parameter state for the process might be quite large (e.g. a hundred parameters), but the data atts provided by any one class require only a small number of these, at least directly. The goal is to check whether previously built data atts are appropriate by testing if their shape is a subset of the global parameter state.
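
A minimal sketch of that subset test, assuming a shape is held as a plain dict of def_spec names to values; the function and variable names here are illustrative, not the author's API:

```python
def is_reusable(shape, global_state):
    """True if every def_spec in `shape` matches the current global
    parameter state, i.e. the shape is a subset of the state."""
    return all(
        key in global_state and global_state[key] == value
        for key, value in shape.items()
    )

# The global state may hold ~100 parameters; a data att's shape names
# only the handful that actually define it.
global_state = {"organism": "E. coli", "model": "hmm", "evalue": 1e-5, "threads": 8}
blast_shape = {"organism": "E. coli", "evalue": 1e-5}

assert is_reusable(blast_shape, global_state)                  # reusable as-is
assert not is_reusable({"organism": "yeast"}, global_state)    # must be rebuilt
```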

Within a class it is easy to find the needed def_specs that define the shape by examining the code. The rub arises when a module needs a data att from another module. These data atts will have their own shape, perhaps passed as args by the calling object, but more often filtered from the global parameter state. The calling class should be augmented with the shape of its dependencies in order to maintain a complete description of its data atts.
In theory this could be done manually by examining the dependency graph, but this graph can get deep, and there are many modules, which I am constantly changing and adding, and ... I'm too lazy and careless to do it by hand.

So, the program dynamically discovers the complete shape of the data atts by tracking calls to other classes' attributes and pushing their shape back up to the caller(s) through a managed stack of __get__ calls. As I rewrite, I find that I need to strictly control attribute access to my builder classes to prevent arbitrary information from influencing the data atts. Fortunately, Python makes this easy with descriptors.
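
As a rough illustration of that mechanism (not the author's code), the sketch below uses a module-level stack and a descriptor whose __get__ pushes a frame, builds the value, and merges the shapes of any dependencies accessed during the build into the caller's frame; all class and attribute names are hypothetical:

```python
_access_stack = []  # shape frames of the data atts currently being built

class DataAtt:
    """Descriptor for an expensive data attribute with a discoverable shape."""

    def __init__(self, builder, def_specs):
        self.builder = builder      # callable(instance) -> the built value
        self.def_specs = def_specs  # parameters this att depends on directly

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        # Direct shape: this att's own def_specs filtered from the global state.
        frame = {k: instance.global_state[k] for k in self.def_specs}
        _access_stack.append(frame)
        try:
            value = self.builder(instance)  # may access other DataAtts
        finally:
            _access_stack.pop()
        # Push the full (direct + inherited) shape up to whoever asked for us.
        if _access_stack:
            _access_stack[-1].update(frame)
        instance.shapes[self.name] = frame
        return value


class Alignments:
    def __init__(self, global_state):
        self.global_state = global_state
        self.shapes = {}

    hits = DataAtt(lambda self: "...expensive BLAST-like step...",
                   ["organism", "evalue"])
    tree = DataAtt(lambda self: f"tree built from {self.hits}",
                   ["model"])


a = Alignments({"organism": "E. coli", "evalue": 1e-5, "model": "nj", "threads": 8})
a.tree
print(a.shapes["tree"])  # -> {'model': 'nj', 'organism': 'E. coli', 'evalue': 1e-05}
```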

I store the shape of the data atts in a db so that I can query whether appropriate data (i.e. data whose shape is a subset of the current parameter state) already exists. In my rewrite I am moving from MySQL (via the great SQLAlchemy) to an object db (ZODB or CouchDB?), because the table for each class has to be altered whenever additional def_specs are discovered, which is a pain, and because some of the def_specs are Python lists or dicts, which are a pain to translate to SQL.
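
For illustration only, here is a sketch using the standard-library shelve module as a stand-in for an object db such as ZODB or CouchDB: shapes are stored as plain Python objects (lists and dicts included), so nothing like an ALTER TABLE is needed when new def_specs appear. The function names and key layout are assumptions:

```python
import shelve

def record(db_path, att_name, shape, location):
    """Register a built data att: its full shape plus where the big file lives."""
    with shelve.open(db_path) as db:
        db[f"{att_name}:{location}"] = {"att": att_name, "shape": shape, "path": location}

def find_reusable(db_path, att_name, global_state):
    """Return stored entries whose shape is a subset of the current state."""
    with shelve.open(db_path) as db:
        return [
            entry for entry in db.values()
            if entry["att"] == att_name
            and all(global_state.get(k) == v for k, v in entry["shape"].items())
        ]
```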

I don't think this data management can be separated from my data transformation code because of the need for strict attribute control, though I am trying to do so as much as possible. I can use existing classes by wrapping them with a class that provides their def_specs as class attributes, and db management via descriptors, but these classes are terminal in that no further discovery of additional dependency shape can take place.
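
A sketch of what such a terminal wrapper might look like, reusing the hypothetical DataAtt descriptor from the earlier sketch; third_party_align is a stand-in for the existing code being wrapped, not a real library call:

```python
def third_party_align(gap_open, gap_extend):
    """Stand-in for an existing library call the author would wrap."""
    return f"alignment(gap_open={gap_open}, gap_extend={gap_extend})"


class WrappedAligner:
    def __init__(self, global_state):
        self.global_state = global_state
        self.shapes = {}

    # def_specs are declared by hand; no further dependency shapes can be
    # discovered beyond what is listed here, so the node is terminal.
    alignment = DataAtt(
        lambda self: third_party_align(
            gap_open=self.global_state["gap_open"],
            gap_extend=self.global_state["gap_extend"],
        ),
        ["gap_open", "gap_extend"],
    )
```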

If the data management cannot easily be separated from the data construction, I guess it is unlikely that there is an out-of-the-box solution, only a thousand specific ones. Perhaps there is an applicable pattern? I'd appreciate any hints on how to go about looking, or how to better describe the problem. To me it seems a general issue, though managing deeply layered data is perhaps at odds with the prevailing winds of the web.

2 Answers

柳若烟 2024-09-13 11:00:16

I don't have specific python-related suggestions for you, but here are a few thoughts:

You're encountering a common challenge in bioinformatics. The data is large, heterogeneous, and comes in constantly changing formats as new technologies are introduced. My advice is to not overthink your pipelines, as they're likely to be changing tomorrow. Choose a few well defined file formats, and massage incoming data into those formats as often as possible. In my experience, it's also usually best to have loosely coupled tools that do one thing well, so that you can chain them together for different analyses quickly.

You might also consider taking a version of this question over to the bioinformatics stack exchange at http://biostar.stackexchange.com/

明天过后 2024-09-13 11:00:16

ZODB was not designed to handle massive data; it is meant for web-based applications, and in any case it is a flat-file-based database.

I recommend you try PyTables, a Python library for handling HDF5 files, a format used in astronomy and physics to store the results of big calculations and simulations. It can be used as a hierarchical database and also has an efficient way to pickle Python objects. By the way, the author of PyTables explained that ZODB was too slow for what he needed, and I can confirm that. If you are interested in HDF5, there is also another library, h5py.
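
To give a sense of how this could cover the original poster's bookkeeping need, here is a minimal h5py sketch (the file, dataset, and attribute names are just examples): the large result goes in a dataset, and the parameters that define it travel with it as HDF5 attributes, so the file stays self-describing.

```python
import h5py
import numpy as np

# Write a big result together with its defining parameters.
with h5py.File("results.h5", "w") as f:
    dset = f.create_dataset("blast/hits", data=np.random.rand(1000, 4),
                            compression="gzip")
    dset.attrs["organism"] = "E. coli"
    dset.attrs["evalue"] = 1e-5

# Later, recover the parameters that defined the stored data.
with h5py.File("results.h5", "r") as f:
    print(dict(f["blast/hits"].attrs))
```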

As a tool for managing the versioning of the different calculations you have, you can have a try at sumatra, which is something like an extension to git/trac but designed for simulations.

You should ask this question on biostar, you will find better answers there.
