Large database: 500 million records
A solution to one of our business problems would be to store about 500 million records in a database.
Each record would have about 40 to 50 columns.
I have a text file containing all of this data; it is about 150 GB (roughly a third of my hard disk).
If I were to load (somehow) all of this data into a database (Oracle?), how well would it perform?
A colleague of mine insists that it would be absolutely fine, and that I could even index all of the 40 to 50 columns, after which it is just a matter of writing some SQL to get the data out.
Is he right, or is 500 million records too much for a database?
P.S.
Just adding more information following some very good answers:
The 40 to 50 columns will hold small strings and/or numbers.
By small strings I mean something shorter than 64 characters.
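For concreteness, here is a minimal sketch of what such a table might look like in Oracle, using the 64-character limit mentioned above; the column names are entirely hypothetical and the real layout would of course follow the text file:

    -- Hypothetical Oracle sketch only; real column names and types come from the file layout.
    CREATE TABLE big_records (
        id       NUMBER NOT NULL,   -- surrogate key, assigned during the load
        code_1   VARCHAR2(64),      -- "small string" columns, 64 characters or less
        code_2   VARCHAR2(64),
        amount_1 NUMBER,            -- numeric columns
        amount_2 NUMBER
        -- ... and so on out to the full 40 to 50 columns
    );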
2 Answers
Without any information about the RDBMS you're using, how it is hosted, and what kind of data this is (large text, small numbers and such), a solid answer isn't easy.
The sheer number of records should not be an issue; almost every modern RDBMS will easily cope with 500 million records and more.
More interesting is how the data is stored by your RDBMS, e.g. what kind of filesystem it uses, how much disk space is available for the table, how the table is spread across the hard disk(s) and so on, all of which should be taken into consideration.
Generally speaking, I recommend indexing only the columns that are really necessary for the applications and queries the data is used for; otherwise the indexes will just slow down your inserts, use precious disk space, and not help you at all.
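As a purely illustrative sketch (table and column names are made up), that means creating indexes to match known query patterns rather than indexing everything:

    -- Sketch only: index what the WHERE clauses actually filter on,
    -- e.g. a composite index for one known lookup pattern.
    CREATE INDEX ix_big_records_lookup
        ON big_records (customer_code, record_date);

    -- An index on a column no query ever filters on only slows inserts
    -- and burns disk space, so it would simply be left out.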
Here are a few SO links which might help you further:
Your colleague is partly right - 500M records in a DB is fine; I've used databases with 2 billion rows, and that was 10 years ago. Indexing every column is a problem - the indexes will slow down every insert of a new record, and building them will take a very long time. You need to identify what kinds of queries you'll run and then index appropriately. With so many records you may get benefits from normalising the data - a flat structure is often faster, but if you have repetitive long text fields, replacing them with lookups may give storage and indexing benefits. Without seeing the data it's hard to give more precise advice.
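To illustrate the lookup idea with a hedged T-SQL sketch (all names are hypothetical): a repetitive long text field moves into a small reference table, and the big table keeps only an integer key.

    -- Hypothetical sketch: replace a repeated text column with a lookup.
    CREATE TABLE dbo.CategoryLookup (
        CategoryId   INT IDENTITY(1,1) PRIMARY KEY,
        CategoryName VARCHAR(64) NOT NULL UNIQUE
    );

    -- In the 500M-row table, the wide repeated text column then becomes:
    --   CategoryId INT NOT NULL REFERENCES dbo.CategoryLookup (CategoryId)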
BTW, if you do hit performance problems, you can also partition the data into physically separate tables, maybe by year?
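One concrete way to do that in SQL Server (which the next paragraph assumes) is native table partitioning, a close cousin of the separate-tables approach; the names and boundary years in this sketch are invented:

    -- Hypothetical sketch: range-partition by year.
    CREATE PARTITION FUNCTION pf_by_year (int)
        AS RANGE RIGHT FOR VALUES (2006, 2007, 2008, 2009);

    CREATE PARTITION SCHEME ps_by_year
        AS PARTITION pf_by_year ALL TO ([PRIMARY]);  -- or map each year to its own filegroup

    -- The big table (or its clustered index) is then created ON ps_by_year(record_year),
    -- so each year's rows land in their own partition.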
My next step (after you've chosen your DB platform and found a server) would be to get the data loaded and see how it performs. I'd take a look at bulk loading your data - I'm a SQL Server bloke, so Integration Services is the way to go for me. I'd make sure you have a single unique key, and if it's not in the data, add an identity column. Then you're ready to test some of this out. SQL Server Express is free and comes with SSIS, but it can only handle 10 GB databases - still, that's enough to get familiar with the issues.
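SSIS packages are built in a designer rather than in script, so as a rough stand-in here is a plain T-SQL BULK INSERT sketch of the same kind of load; the path, delimiters, and table name are all assumptions:

    -- Hypothetical sketch: bulk load the delimited text file in batches,
    -- then add a surrogate identity key if the data has no natural unique key.
    BULK INSERT dbo.BigRecords
    FROM 'C:\data\records.txt'
    WITH (
        FIELDTERMINATOR = '\t',   -- adjust to the file's real delimiter
        ROWTERMINATOR   = '\n',
        TABLOCK,                  -- allows a minimally logged load where possible
        BATCHSIZE       = 100000  -- commit in chunks, not one giant transaction
    );

    ALTER TABLE dbo.BigRecords
        ADD RecordId BIGINT IDENTITY(1,1) NOT NULL;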
I regularly bulk load a 4M-row table with 50+ columns and it takes around 2 minutes. I'm happy to take this offline if you want further one-on-one advice.