Efficiently storing 7.300.000.000 rows
How would you tackle the following storage and retrieval problem?
Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:
- id (unique row identifier)
- entity_id (takes on values between 1 and 2.000.000 inclusive)
- date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 1*365*10))
- value_1 (takes on values between 1 and 1.000.000 inclusive)
- value_2 (takes on values between 1 and 1.000.000 inclusive)
entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The database must be able to hold 10 years worth of daily data (7.300.000.000 rows (3.650*2.000.000)).
What is described above is the write pattern. The read pattern is simple: all queries will be made on a specific entity_id, i.e. retrieve all rows describing entity_id = 12345.
Transactional support is not needed, but the storage solution must be open source. Ideally I'd like to use MySQL, but I'm open to suggestions.
Now - how would you tackle the described problem?
Update: I was asked to elaborate regarding the read and write patterns. Writes to the table will be done in one batch per day where the new 2M entries will be added in one go. Reads will be done continuously with one read every second.
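For scale, a quick back-of-envelope check of the numbers above (a sketch; the 14-byte payload assumes 4-byte ints for entity_id and the two values plus 2 bytes for date_id, which the question does not specify):

```python
ROWS_PER_DAY = 2_000_000
DAYS = 3_650                      # ten years of daily batches

total_rows = ROWS_PER_DAY * DAYS
print(total_rows)                 # 7300000000, as stated

# Assumed payload: 4B entity_id + 2B date_id + 4B value_1 + 4B value_2
payload_bytes = total_rows * 14
print(payload_bytes / 1e12)       # ~0.1 TB before the row id, indexes, overhead
```

Even before indexes, the raw payload is on the order of 100 GB, which is why the answers below all focus on disk layout and seek counts.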
"Now - how would you tackle the described problem?"
With simple flat files.
Here's why
You have 2.000.000 entities. Partition based on entity number:
Each file of data is
level1/level2/level3/batch_of_data
You can then read all of the files in a given part of the directory to return samples for processing.
If someone wants a relational database, then load files for a given entity_id into a database for their use.
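The directory scheme above can be sketched as follows (a sketch; splitting the zero-padded decimal digits of the entity number into three levels is one possible choice, not prescribed by the answer, and `batch_of_data` is the answer's placeholder name):

```python
def entity_path(entity_id: int) -> str:
    """Map an entity number (1..2_000_000) to a three-level directory
    path by splitting its zero-padded decimal digits. This keeps each
    directory down to a small, file-system-friendly number of entries."""
    s = f"{entity_id:07d}"
    return f"{s[:3]}/{s[3:5]}/{s[5:]}/batch_of_data"

print(entity_path(12345))  # 001/23/45/batch_of_data
```

Retrieving all rows for entity_id = 12345 is then a single open of a known path, with no index lookup at all.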
Edit on day numbers.
The date_id/entity_id uniqueness rule is not something that has to be handled. It's (a) trivially imposed on the file names and (b) irrelevant for querying.
The date_id "rollover" doesn't mean anything -- there's no query, so there's no need to rename anything. The date_id should simply grow without bound from the epoch date. If you want to purge old data, then delete the old files.
Since no query relies on date_id, nothing ever needs to be done with it. It can be the file name for all that it matters.
To include the date_id in the result set, write it in the file with the other four attributes that are in each row of the file.
Edit on open/close
For writing, you have to leave the file(s) open. You do periodic flushes (or close/reopen) to assure that stuff really is going to disk.
You have two choices for the architecture of your writer.
Have a single "writer" process that consolidates the data from the various source(s). This is helpful if queries are relatively frequent. You pay for merging the data at write time.
Have several files open concurrently for writing. When querying, merge these files into a single result. This is helpful if queries are relatively rare. You pay for merging the data at query time.
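The second option (merging at query time) can be sketched with a k-way merge; the comma-separated line format and per-writer file paths are assumptions for illustration, and each writer's file is assumed to be internally sorted by (entity_id, date_id):

```python
import heapq

def parse(line):
    # assumed line format: "entity_id,date_id,value_1,value_2"
    return tuple(int(x) for x in line.split(","))

def query_entity(paths, entity_id):
    """Merge rows for one entity from several concurrently written
    files, each internally sorted by (entity_id, date_id)."""
    files = [open(p) for p in paths]
    try:
        merged = heapq.merge(*(map(parse, f) for f in files))
        return [row for row in merged if row[0] == entity_id]
    finally:
        for f in files:
            f.close()
```

A real implementation would stop scanning once rows past the target entity_id appear, but the merge-at-query-time cost the answer mentions is visible here either way.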
Use partitioning. With your read pattern you'd want to partition by entity_id hash.
You might want to look at these questions:
Large primary key: 1+ billion rows MySQL + InnoDB?
Large MySQL tables
Personally, I'd also think about calculating your row width to give you an idea of how big your table will be (as per the partitioning note in the first link).
HTH,
S
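MySQL's PARTITION BY HASH places a row in partition MOD(expr, N), so with every query filtering on a single entity_id, each query is pruned down to exactly one partition. A sketch of that arithmetic (the partition count of 32 is an assumption, not from the answer):

```python
N_PARTITIONS = 32  # assumed; tune to your hardware

def partition_for(entity_id: int) -> int:
    # MySQL's PARTITION BY HASH(entity_id) places the row in
    # partition MOD(entity_id, N_PARTITIONS).
    return entity_id % N_PARTITIONS

# A query on one entity_id touches a single partition holding
# roughly 7.3e9 / 32 ≈ 228M rows instead of all 7.3 billion.
print(partition_for(12345))  # 25
```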
Your application appears to have the same characteristics as mine. I wrote a MySQL custom storage engine to efficiently solve the problem. It is described here.
Imagine your data is laid out on disk as an array of 2M fixed length entries (one per entity) each containing 3650 rows (one per day) of 20 bytes (the row for one entity per day).
Your read pattern reads one entity. It is contiguous on disk, so it takes 1 seek (about 8 millisecs) and a read of 3650x20 = about 80K at maybe 100MB/sec ... so it is done in a fraction of a second, easily meeting your 1-query-per-second read pattern.
The update has to write 20 bytes in 2M different places on disk. In the simplest case this would take 2M seeks, each of which takes about 8 millisecs, so it would take 2M*8ms = 4.5 hours. If you spread the data across 4 "raid0" disks it could take 1.125 hours.
However the places are only 80K apart, which means there are 200 such places within a 16MB block (typical disk cache size), so it could operate at anything up to 200 times faster (1 minute). Reality is somewhere between the two.
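The seek arithmetic above can be reproduced directly (a sketch using the answer's own assumptions: 8 ms average seek, 100 MB/s transfer, 16 MB cache blocks):

```python
ROW = 20                 # bytes per row
DAYS = 3_650
ENTITIES = 2_000_000
SEEK = 0.008             # seconds per seek, assumed average
THROUGHPUT = 100e6       # bytes/sec sequential transfer, assumed

# Read: one entity is contiguous, so one seek plus one transfer.
read_bytes = DAYS * ROW                  # 73_000, i.e. "about 80K"
read_time = SEEK + read_bytes / THROUGHPUT
print(read_time)                         # well under a second

# Daily update, worst case: one seek per entity.
worst_hours = ENTITIES * SEEK / 3600
print(round(worst_hours, 2))             # 4.44, i.e. "about 4.5 hours"

# Best case: entries are DAYS*ROW bytes apart, so one 16MB cache
# block covers ~229 of them (the answer rounds the spacing to 80K,
# giving 200), cutting the seek count by roughly that factor.
per_block = (16 * 2**20) // (DAYS * ROW)
print(per_block)                         # 229
```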
My storage engine operates on that kind of philosophy, although it is a little more general purpose than a fixed length array.
You could code exactly what I have described. Putting the code into a MySQL pluggable storage engine means that you can use MySQL to query the data with various report generators etc.
By the way, you could eliminate the date and entity id from the stored row (because they are the array indexes), and maybe the unique id too if you don't really need it, since (entity id, date) is unique; then store the 2 values as 3-byte ints. Your stored row becomes 6 bytes, you get ~700 updates per 16MB block, and therefore faster inserts and a smaller file.
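Since entity id and date are the array indexes, any row's byte offset is pure arithmetic; a sketch of the 6-byte packed row described above (function names are illustrative):

```python
ROW = 6          # two 3-byte values, as suggested above
DAYS = 3_650

def offset(entity_id: int, date_id: int) -> int:
    """Byte offset of a row in the fixed-length array file.
    entity_id and date_id are both 1-based."""
    return ((entity_id - 1) * DAYS + (date_id - 1)) * ROW

def pack(value_1: int, value_2: int) -> bytes:
    # values are 1..1_000_000, which fits in 3 bytes (max 16_777_215)
    return value_1.to_bytes(3, "big") + value_2.to_bytes(3, "big")

print(offset(12345, 100))      # a fixed position: no index lookup needed
print(len(pack(999_999, 1)))   # 6
```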
Edit Compare to Flat Files
I notice that the comments generally favor flat files. Don't forget that directories are just indexes implemented by the file system, and they are generally optimized for relatively small numbers of relatively large items. Access to files is generally optimized so that it expects a relatively small number of files to be open, and has a relatively high overhead for open and close, and for each file that is open. All of those "relatively" are relative to the typical use of a database.
Using file system names as an index for an entity id, which I take to be a non-sparse integer from 1 to 2 million, is counter-intuitive. In programming you would use an array, not a hash table, for example, and you are inevitably going to incur a great deal of overhead for an expensive access path that could simply be an array indexing operation.
Therefore if you use flat files, why not use just one flat file and index it?
Edit on performance
The performance of this application is going to be dominated by disk seek times. The calculations I did above determine the best you can do (although you can make INSERT quicker by slowing down SELECT - you can't make them both better). It doesn't matter whether you use a database, flat files, or one flat file, except that you can add more seeks that you don't really need and slow it down further. For example, indexing (whether it's the file system index or a database index) causes extra I/Os compared to "an array look up", and these will slow you down.
Edit on benchmark measurements
I have a table that looks very much like yours (or almost exactly like one of your partitions). It was 64K entities not 2M (1/32 of yours), and 2788 'days'. The table was created in the same INSERT order that yours will be, and has the same index (entity_id,day). A SELECT on one entity takes 20.3 seconds to inspect the 2788 days, which is about 130 seeks per second as expected (on 8 millisec average seek time disks). The SELECT time is going to be proportional to the number of days, and not much dependent on the number of entities. (It will be faster on disks with faster seek times. I'm using a pair of SATA2s in RAID0 but that isn't making much difference).
If you re-order the table into entity order
ALTER TABLE x ORDER BY (ENTITY,DAY)
Then the same SELECT takes 198 millisecs (because it is reading one entity's rows in a single disk access).
However the ALTER TABLE operation took 13.98 DAYS to complete (for 182M rows).
There's a few other things the measurements tell you
1. Your index file is going to be as big as your data file. It is 3GB for this sample table. That means (on my system) all index access runs at disk speeds, not memory speeds.
2. Your INSERT rate will decline logarithmically. The INSERT into the data file is linear, but the insert of the key into the index is logarithmic. At 180M records I was getting 153 INSERTs per second, which is also very close to the seek rate. It shows that MySQL is updating a leaf index block for almost every INSERT (as you would expect, because it is indexed on entity but inserted in day order). So you are looking at 2M/153 secs = 3.6 hrs to do your daily insert of 2M rows. (Divide by whatever effect you can get by partitioning across systems or disks.)
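The closing arithmetic can be checked directly from the measured numbers:

```python
inserts_per_sec = 153      # measured at 180M rows, seek-bound
daily_rows = 2_000_000

hours = daily_rows / inserts_per_sec / 3600
print(round(hours, 1))     # 3.6
```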
I had a similar problem (although at a much bigger scale - about your yearly usage every day).
Using one big table got me screeching to a halt - you can pull a few months, but I guess you'll eventually partition it.
Don't forget to index the table, or else you'll be messing with a tiny trickle of data on every query; oh, and if you want to do mass queries, use flat files.
Your description of the read pattern is not sufficient. You'll need to describe what amounts of data will be retrieved, how often, and how much deviation there will be in the queries.
This will allow you to consider doing compression on some of the columns.
Also consider archiving and partitioning.
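As a quick illustration of why per-column compression is worth considering here (a sketch on synthetic data: a date_id column is smooth and repetitive, so stored column-wise it compresses far better than when interleaved with the other values):

```python
import struct
import zlib

# Synthetic date_id column: 100 entities per day over 100 days,
# so each day number repeats 100 times consecutively.
dates = [d for d in range(1, 101) for _ in range(100)]
column = b"".join(struct.pack("<H", d) for d in dates)  # 2 bytes each

print(len(column), len(zlib.compress(column)))  # compressed size is a small fraction
```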
If you want to handle huge data with millions of rows, you can treat it as a time-series problem: log the time and save the data to a time-series database. Some options for storing the data are InfluxDB and MongoDB.