Is naming tables september_2010 acceptable and efficient for large data sets dependent on time?
I need to store about 73,200 records per day consisting of 3 points of data: id, date, and integer.
Some members of my team suggest creating one table per month, with the month as the table name (september_2010), while others suggest having a single table with lots of data in it...
Any suggestions on how to deal with this amount of data? Thanks.
========== Thank you for all the feedback.
Comments (12)
I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:
My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).
It seems like it should be just fine holding everything in one table. Maintaining 1 table, as opposed to 12 tables per year, will make retrieval much easier in the future. At 73,200 records per day it will take you almost 4 years to hit 100,000,000 rows, which is still well within MySQL's capabilities.
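The growth estimate above is easy to check with a few lines of arithmetic. This is only a sketch: the 365-day year and the variable names are my own assumptions, and the 100,000,000-row mark is just the round number used in the answer, not a MySQL limit.

```python
# Figures taken from the question and the answer above.
RECORDS_PER_DAY = 73_200
TARGET_ROWS = 100_000_000

rows_per_year = RECORDS_PER_DAY * 365          # assumes a 365-day year
days_to_target = TARGET_ROWS / RECORDS_PER_DAY
years_to_target = days_to_target / 365

print(f"rows per year: {rows_per_year:,}")
print(f"days to reach {TARGET_ROWS:,} rows: {days_to_target:.0f}")
print(f"years to reach {TARGET_ROWS:,} rows: {years_to_target:.2f}")
```

That works out to roughly 26.7 million rows a year and a little under 4 years to reach 100 million rows.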
Absolutely not.
It will ruin the relationships between tables.
Table relations are built on field values, not table names.
Especially for this very table, which will grow by just 300 MB/year.
So in 100 days you have 7.3M rows, and about 25M a year or so. 25M rows isn't a lot anymore. MySQL can handle tables with millions of rows. It really depends on your hardware, your query types, and your query frequency.
But you should be able to partition that table (if MySQL supports partitioning). What you're describing is an old SQL Server partitioning method: after building those monthly tables you'd build a view that concatenates them together to look like one big table... which is essentially what partitioning does, but it's all under the covers and fully optimized.
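To make the "view over monthly tables" approach concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a runnable stand-in (not MySQL or SQL Server), and the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical per-month tables, as proposed in the question.
for name in ("readings_2010_09", "readings_2010_10"):
    conn.execute(
        f"CREATE TABLE {name} (id INTEGER, recorded_on TEXT, value INTEGER)"
    )

conn.execute("INSERT INTO readings_2010_09 VALUES (1, '2010-09-30', 17)")
conn.execute("INSERT INTO readings_2010_10 VALUES (2, '2010-10-01', 19)")

# The view stitches the monthly tables back into one logical table.
# It has to be recreated every time a new monthly table is added.
conn.execute(
    """
    CREATE VIEW readings_all AS
    SELECT id, recorded_on, value FROM readings_2010_09
    UNION ALL
    SELECT id, recorded_on, value FROM readings_2010_10
    """
)

# A query that crosses the month boundary only needs the view.
rows = conn.execute(
    "SELECT id, recorded_on, value FROM readings_all "
    "WHERE recorded_on BETWEEN '2010-09-29' AND '2010-10-02' "
    "ORDER BY recorded_on"
).fetchall()
print(rows)
```

The maintenance burden is visible even at this size: the view definition grows by one SELECT per month, which is the bookkeeping that native partitioning takes over.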
Usually this creates more trouble than it's worth: it's more maintenance, your queries need more logic, and it's painful to pull data from more than one period.
We store 200+ million time-based records in one (MyISAM) table, and queries are still blazingly fast.
You just need to ensure there's an index on your time/date column and that your queries make use of the index (e.g. a query that messes around with DATE_FORMAT or similar on a date column will likely not use an index). I wouldn't put them in separate tables just for the sake of retrieval performance.
One thing that gets very painful with such a large number of records is deleting old data, which can take a long time (10 minutes to 2 hours for e.g. wiping a month's worth of data in tables with hundreds of millions of rows). For that reason we've partitioned the tables, and we use a time_dimension relation table (see e.g. the time_dimension table a bit down here) for managing the periods instead of simple date/datetime columns or strings/varchars representing dates.
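The point about functions on the date column can be observed directly in a query planner. The sketch below uses SQLite (via Python's sqlite3) as a stand-in, with strftime playing the role of MySQL's DATE_FORMAT. The table, column, and index names are hypothetical, and the exact plan wording differs between engines and versions, so treat it as an illustration rather than a statement about MySQL's optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings "
    "(id INTEGER PRIMARY KEY, recorded_on TEXT, value INTEGER)"
)
conn.execute("CREATE INDEX idx_readings_date ON readings (recorded_on)")


def query_plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Range predicate on the bare column: the index can narrow the search.
plan_range = query_plan(
    "SELECT value FROM readings "
    "WHERE recorded_on >= '2010-09-01' AND recorded_on < '2010-10-01'"
)

# Column wrapped in a function: the plain index on the column does not help.
plan_wrapped = query_plan(
    "SELECT value FROM readings "
    "WHERE strftime('%Y-%m', recorded_on) = '2010-09'"
)

print(plan_range)
print(plan_wrapped)
```

The first plan mentions idx_readings_date; the second falls back to scanning the table. Writing the month filter as a half-open date range keeps the predicate index-friendly.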
Don't listen to them. You're already storing a date stamp, so what is it about different months that makes splitting the data that way a good idea? The engine will handle the larger data sets just fine, so splitting by month does nothing but artificially segregate the data.
My first reaction is: Aaaaaaaaahhhhhhhhh!!!!!!
Table names should not embed data values. You don't say what the data means, but suppose for the sake of argument that it is, I don't know, temperature readings. Just imagine trying to write a query to find all the months in which average temperature increased over the previous month. You'd have to loop through table names. Worse yet, imagine trying to find all 30-day periods -- i.e. periods that might cross month boundaries -- where temperature increased over the previous 30-day period.
Indeed, just retrieving an old record would go from a trivial operation -- "select * where id=whatever" -- to a complex operation requiring you to have the program generate table names from the date on the fly. If you didn't know the date, you would have to scan through all the tables, searching each one for the desired record. Yuck.
With all the data in one properly-normalized table, queries like the above are pretty trivial. With separate tables for each month, they're a nightmare.
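As a rough illustration of how simple the "months where the average rose" question is with a single table, here is a sketch using Python's sqlite3 module as a stand-in database. The readings table, its columns, and the sample temperatures are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings "
    "(id INTEGER PRIMARY KEY, recorded_on TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO readings (recorded_on, value) VALUES (?, ?)",
    [
        ("2010-08-10", 20), ("2010-08-20", 22),  # August average: 21
        ("2010-09-10", 24), ("2010-09-20", 26),  # September average: 25
        ("2010-10-10", 18), ("2010-10-20", 20),  # October average: 19
    ],
)

# One GROUP BY yields every monthly average, however many months exist.
monthly = conn.execute(
    "SELECT strftime('%Y-%m', recorded_on) AS month, AVG(value) "
    "FROM readings GROUP BY month ORDER BY month"
).fetchall()

# Months whose average is higher than the previous month's average.
rising = [
    month
    for (_, prev_avg), (month, avg) in zip(monthly, monthly[1:])
    if avg > prev_avg
]
print(rising)
```

With one table per month, the same report would need a loop over table names and a separate query (or a generated UNION) for each one.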
Just make the date part of the index and the performance penalty of having all the records in one table should be very small. If the size of the table really becomes a performance problem, I could dimly comprehend making one table for archive data with all the old stuff and one for current data with everything you retrieve regularly. But don't create hundreds of tables. Most database engines have ways to partition your data across multiple drives using "table spaces" or the like. Use the sophisticated features of the database if necessary, rather than hacking together a crude simulation.
Depends on what searches you'll need to do. If normally constrained by date, splitting is good.
If you do split, consider naming the tables like foo_2010_09 so the tables will sort alphanumerically.
What is your DB platform?
In SQL Server 2K5+ you can partition on date.
My bad, I didn't notice the tag. @thetaiko is right though: this is well within MySQL's capabilities.
I would say it depends on how the data is used. If most queries are done over the complete data, it would be an overhead to always join the tables back together again.
If most of the time you only need a part of the data (by date), it is a good idea to segment the tables into smaller pieces.
For the naming I would use tablename_yyyymm.
Edit: You should then also think about another layer between the DB and your app to pick the right segmented tables for a given date, which can get pretty complicated.
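A sketch of what that extra layer might start out as, assuming the tablename_yyyymm convention suggested above. The helper names table_for and tables_for_range are hypothetical, and a real layer would also have to build and merge the per-table queries.

```python
from datetime import date


def table_for(day, prefix="tablename"):
    """Name of the monthly table holding `day`, in tablename_yyyymm form."""
    return f"{prefix}_{day:%Y%m}"


def tables_for_range(start, end, prefix="tablename"):
    """Every monthly table a query over [start, end] would have to touch."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(table_for(date(year, month, 1), prefix))
        # Step to the next month, rolling over into the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


print(table_for(date(2010, 9, 15)))
print(tables_for_range(date(2010, 11, 20), date(2011, 1, 5)))
```

Even this small helper has to handle year rollover, and every query spanning more than one month still needs to be issued against each table it returns.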
I'd suggest dropping the year and just having one table per month, named after the month. Archive your data annually by renaming all the tables $MONTH_$YEAR and re-creating the month tables. Or, since you're storing a timestamp with your data, just keep appending to the same tables. I assume by virtue of the fact that you're asking the question in the first place, that segregating your data by month fits your reporting requirements. If not, then I'd recommend keeping it all in one table and periodically archiving off historical records when performance gets to be an issue.
I agree that this idea complicates your database needlessly. Use a single table. As others have pointed out, it's not nearly enough data to warrant extraneous handling. Unless you use SQLite, your database will handle it well.
However, it also depends on how you want to access it. If the old entries are really only there for archival purposes, then the archive pattern is an option. It's common for versioning systems to have the infrequently used data separated out. In your case you'd only want everything older than 1 year to move out of the main table. And this is strictly a database administration task, not an application behavior. The application would only join the current list and the _archive list, if at all. Again, this highly depends on the use case. Are the old entries generally needed? Is there too much data to process regularly?
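The archive pattern described above might look roughly like this as a periodic maintenance job. This is a sketch under assumptions: sqlite3 is used only so the example runs anywhere, and the readings / readings_archive tables, the archive_older_than helper, and the cutoff date are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
schema = "(id INTEGER PRIMARY KEY, recorded_on TEXT, value INTEGER)"
conn.execute(f"CREATE TABLE readings {schema}")
conn.execute(f"CREATE TABLE readings_archive {schema}")
conn.executemany(
    "INSERT INTO readings (recorded_on, value) VALUES (?, ?)",
    [("2009-06-01", 5), ("2010-03-15", 7), ("2010-09-01", 9)],
)


def archive_older_than(conn, cutoff):
    """Move rows dated before `cutoff` (ISO date string) into the archive."""
    with conn:  # commits both statements together, or rolls both back
        conn.execute(
            "INSERT INTO readings_archive "
            "SELECT * FROM readings WHERE recorded_on < ?",
            (cutoff,),
        )
        conn.execute("DELETE FROM readings WHERE recorded_on < ?", (cutoff,))


# One year before an assumed "today" of 2010-09-15.
archive_older_than(conn, "2009-09-15")

current = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM readings_archive").fetchone()[0]
print(current, archived)
```

Running the copy and the delete in one transaction means a failure part-way through leaves both tables as they were, and the application keeps seeing only two table names no matter how much history accumulates.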