Scalable, fast, text file backed database engine?

Posted on 2024-09-12 05:32:20

I am dealing with large amounts of scientific data that are stored in tab-separated .tsv files. The typical operations to be performed are reading several large files, keeping only certain columns/rows, joining with other sources of data, adding calculated values and writing the result as another .tsv.

The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option; it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

Since I am mostly doing selects and joins, I realized I basically need a database engine with .tsv based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in-place, without a major conversion step and data cloning.

As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.

Does anyone know of a system that would provide database-like capabilities, while using plain tab-separated files as backend? It seems to me like a very generic problem, that virtually all scientists get to deal with in one way or the other.

Comments (7)

巴黎盛开的樱花 2024-09-19 05:32:20

There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a .tsv file, your storage efficiency is about 33%: it takes 5 bytes to store most 16-bit integers plus a delimiter = 6 bytes, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.
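The 33% figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: the sample values are made up, and it assumes one delimiter byte (tab or newline) per value.

```python
import struct

# Hypothetical sample: 16-bit unsigned integers with five decimal digits each.
values = [12345, 40000, 65535, 10001]

# Text form: decimal digits plus one delimiter byte (tab or newline) per value.
text_bytes = sum(len(str(v)) + 1 for v in values)

# Native form: 2 bytes per value ("H" = unsigned 16-bit, "<" = no padding).
binary_bytes = len(struct.pack(f"<{len(values)}H", *values))

efficiency = binary_bytes / text_bytes
print(f"text: {text_bytes} B, binary: {binary_bytes} B, efficiency: {efficiency:.0%}")
```

Values with fewer digits take less text space, so the real ratio depends on the distribution of the data.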

I would consider taking the existing data and, instead of storing it raw, processing it in the following two ways:

  1. Store it compressed in a well-known compression format (e.g. gzip or bzip2) onto your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format.
  2. Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).

This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.

You should be able to reduce your storage space and should not need twice as much storage space, as you state.

Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.
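As a rough illustration of the two-step approach, the sketch below streams a gzip-compressed .tsv archive into a database and indexes one column. SQLite is used only because it ships with Python; the answer does not name a specific database. The function name `load_tsv_gz`, the table layout and the sample data are all hypothetical.

```python
import csv
import gzip
import os
import sqlite3
import tempfile


def load_tsv_gz(db, path, table, columns):
    """Stream a gzip-compressed TSV (with a header row) into an SQLite table."""
    col_defs = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns)
    placeholders = ", ".join("?" for _ in columns)
    db.execute(f'CREATE TABLE "{table}" ({col_defs})')
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header row
        # Rows are consumed lazily, so the archive is never fully in memory.
        db.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    db.commit()


def demo():
    """Build a tiny archive, load it, index it, and run one query."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "samples.tsv.gz")
        with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
            fh.write("sample_id\treading\nA\t17\nB\t65535\nC\t3\n")
        db = sqlite3.connect(":memory:")
        load_tsv_gz(db, path, "samples",
                    [("sample_id", "TEXT"), ("reading", "INTEGER")])
        # Index only the columns you expect to filter or join on.
        db.execute("CREATE INDEX idx_reading ON samples (reading)")
        rows = db.execute(
            "SELECT sample_id FROM samples WHERE reading > 10 ORDER BY reading"
        ).fetchall()
        db.close()
        return rows


print(demo())
```

The compressed file remains the archive of record, and the database can be rebuilt from it at any time.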

怀念你的温柔 2024-09-19 05:32:20

One of these NoSQL databases might work, but I highly doubt any can be configured to sit on top of flat, delimited files. You might look at one of the open source projects and write your own database layer.

零崎曲识 2024-09-19 05:32:20

Scalability begins at a point beyond tab-separated ASCII.

Just be practical - don't academicise it - convention frees your fingers as well as your mind.

狼亦尘 2024-09-19 05:32:20

I would upvote Jason's recommendation if I had the reputation. My only addition is that if you do not store it in a different format, like the database Jason was suggesting, you pay the parsing cost on every operation instead of just once when you initially process it.

桜花祭 2024-09-19 05:32:20

You can do this with LINQ to Objects if you are in a .NET environment. You get streaming/deferred execution, a functional programming model and all of the SQL operators. The joins will work in a streaming model, but one table gets pulled into memory, so this suits joining a large table to a smaller one.
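The streaming join pattern described here is not specific to LINQ. The sketch below shows the same idea in Python rather than C#: the smaller table is loaded into a dictionary, and the large table is streamed past it one row at a time. The helper names `read_tsv` and `stream_join` and the sample data are hypothetical.

```python
import csv
import io


def read_tsv(fh):
    """Lazily yield each row of a TSV stream (with a header row) as a dict."""
    return csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)


def stream_join(big_rows, small_rows, key):
    """Inner-join two row streams on `key`, holding only the small side in memory."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    for row in big_rows:  # the large side is never materialised
        for match in lookup.get(row[key], ()):
            yield {**match, **row}


# In-memory stand-ins for a large file and a small annotation file.
big = io.StringIO("gene\tvalue\ng1\t0.5\ng2\t1.5\ng3\t2.5\n")
small = io.StringIO("gene\tlabel\ng1\tkinase\ng3\tligase\n")

joined = stream_join(read_tsv(big), read_tsv(small), "gene")
# A filter composes with the join without building intermediate lists.
result = [row for row in joined if float(row["value"]) > 1.0]
print(result)
```

Because both the join and the filter are generators, memory use is bounded by the size of the smaller table.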

The ease of shaping the data and the ability to write your own expressions would really shine in a scientific application.

LINQ against a delimited text file is a common demonstration of LINQ. You need to provide the ability to feed LINQ a tabular model. Search for "LINQ for text files" for some examples (e.g., see http://www.codeproject.com/KB/linq/Linq2CSV.aspx, http://www.thereforesystems.com/tutorial-reading-a-text-file-using-linq/, etc.).

Expect a learning curve, but it's a good solution for your problem. One of the best treatments of the subject is Jon Skeet's C# in Depth. Pick up the "MEAP" version from Manning for early access to his latest edition.

I've done work like this before with large mailing lists that needed to be cleansed, deduped and appended. You are invariably IO bound. Try solid state drives, particularly Intel's "E" series, which has very fast write performance, and RAID them as parallel as possible. We also used grids, but had to adjust the algorithms to do multi-pass approaches that would reduce the data.

Note that I would agree with the other answers that stress loading into a database and indexing if the data is very regular. In that case, you're basically doing ETL, which is a well understood problem in the warehousing community. If, however, the data is ad hoc, you have scientists who just drop their results in a directory, you need "agile/just in time" transformations, and most transformations are single-pass select ... where ... join, then you're approaching it the right way.

花心好男孩 2024-09-19 05:32:20

You can do this with VelocityDB. It is very fast at reading tab-separated data into C# objects and databases. The entire Wikipedia text is a 33GB XML file. This file takes 18 minutes to read in and persist as objects (1 per Wikipedia topic) and store in compact databases. The download includes many samples showing how to read in tab-separated text files.

溺ぐ爱和你が 2024-09-19 05:32:20

The question's already been answered, and I agree with the bulk of the statements.

At our centre, we have a standard talk we give, "so you have 40TB of data", as scientists are newly finding themselves in this situation all the time now. The talk is nominally about visualization, but primarily about managing large amounts of data for those who are new to it. The basic points we try to get across:

  • Plan your I/O
    • Binary files
    • As much as possible, large files
    • File formats that can be read in parallel, subregions extracted
    • Avoid zillions of files
    • Especially avoid zillions of files in single directory
  • Data Management must scale:
    • Include metadata for provenance
      • Reduce need to re-do
    • Sensible data management
      • Hierarchy of data directories only if that will always work
    • Databases, formats that allow metadata
  • Use scalable, automatable tools:
    • For large data sets, parallel tools - ParaView, VisIt, etc
    • Scriptable tools - gnuplot, python, R, ParaView/VisIt...
    • Scripts provide reproducibility!
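
To make "read in parallel, subregions extracted" concrete for the line-oriented .tsv files in the question, here is a minimal sketch that splits one large file into byte ranges ending on line boundaries, so that independent workers could each read one range. The talk itself recommends binary formats; this only illustrates the idea. The function names `chunk_offsets` and `read_chunk` and the sample file are hypothetical.

```python
import os
import tempfile


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that cover the file and end on line boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as fh:
        for i in range(1, n_chunks):
            target = max(size * i // n_chunks, bounds[-1])
            fh.seek(target)
            fh.readline()  # advance past the end of the line containing `target`
            pos = min(fh.tell(), size)
            if pos > bounds[-1]:
                bounds.append(pos)
    if bounds[-1] < size:
        bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def read_chunk(path, start, end):
    """Read one byte range and return its complete lines."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start).splitlines()


# Demo on a small temporary file; each range could go to a separate worker.
lines = [f"{i}\t{i * i}".encode() for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.tsv")
    with open(path, "wb") as fh:
        fh.write(b"\n".join(lines) + b"\n")
    total_size = os.path.getsize(path)
    ranges = chunk_offsets(path, 3)
    got = [line for start, end in ranges for line in read_chunk(path, start, end)]

print(ranges)
```

No line is split across two ranges, so each worker can parse its range without coordinating with the others.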

We have a fair amount of stuff on large-scale I/O generally, as this is an increasingly common stumbling block for scientists.
