Storing large amounts of data in a database
I'm currently working on a home-automation project which gives users the ability to view their energy usage over a period of time. Currently we request data every 15 minutes, and we are expecting around 2000 users for our first big pilot.
My boss is requesting that we store at least half a year of data. A quick sum leads to an estimate of around 35 million records. Though these records are small (around 500 bytes each), I'm still wondering whether storing them in our database (Postgres) is the right decision.
Does anyone have some good reference material and/or advice about how to deal with this amount of information?
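For reference, that quick sum works out roughly like this (a small sketch; it assumes one reading per user every 15 minutes, about 183 days of retention and ~500 bytes per record):

    # Back-of-the-envelope check of the numbers above.
    users = 2000
    readings_per_day = 24 * 60 // 15        # 96 readings per user per day
    days = 183                              # roughly half a year

    records = users * readings_per_day * days
    raw_bytes = records * 500

    print(f"records:  {records:,}")               # ~35.1 million
    print(f"raw size: {raw_bytes / 1e9:.1f} GB")  # ~17.6 GB before indexes and row overhead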
6 Answers
We frequently hit tables that look like this. Obviously, structure your indexes based on usage (do you read or write a lot, etc.), and from the start think about table partitioning based on some high-level grouping of the data.
Also, you can implement an archiving scheme to keep the live table thin. Historical records are either never touched or only reported on, and neither of those needs to live in the hot table in my opinion.
It's worth noting that we have tables of around 100 million records and we don't perceive there to be a performance problem. A lot of these performance improvements can be made with little pain afterwards, so you could always start with a common-sense solution and tune only when performance is proven to be poor.
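As a rough idea of what that could look like, here is a sketch of declarative monthly range partitioning in PostgreSQL 10+ using psycopg2; the table and column names (readings, user_id, measured_at, payload) are invented for illustration, not taken from the question:

    import psycopg2

    # DDL sketch: partition the readings table by month so that old months can
    # later be detached (DETACH PARTITION) and archived without touching live data.
    ddl = """
    CREATE TABLE readings (
        user_id     integer     NOT NULL,
        measured_at timestamptz NOT NULL,
        payload     jsonb
    ) PARTITION BY RANGE (measured_at);

    CREATE TABLE readings_2024_01 PARTITION OF readings
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE readings_2024_02 PARTITION OF readings
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    """

    with psycopg2.connect("dbname=energy") as conn:
        with conn.cursor() as cur:
            cur.execute(ddl)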
For now, 35M records of 0.5K each means roughly 17.5 GB of raw data (more once you add indexes and row overhead). This fits in a database for your pilot, but you should also think about the next step after the pilot. Your boss will not be happy when the pilot is a big success and you have to tell him that you cannot add 100,000 users to the system in the coming months without redesigning everything. Moreover, what about a new feature that lets VIP users request data every minute?
This is a complex issue, and the choices you make will constrain the evolution of your software.
For the pilot, keep it as simple as possible to get the product out as cheaply as possible --> OK for a database. But tell your boss that you cannot open up the service like that, and that you will have to change things before you start getting 10,000 new users per week.
One thing for the next release: have several data repositories: one for your user data that is updated frequently, one for your query/statistics system, ...
You could look at RRD (round-robin databases, e.g. RRDtool) for your next release.
Also keep in mind the update frequency: 2,000 users updating data every 15 minutes means 2.2 updates per second --> OK; 100,000 users updating data every 5 minutes means 333.3 updates per second. I am not sure a simple database can keep up with that, and a single web-service server definitely cannot.
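The write-rate arithmetic behind those figures, as a tiny sketch (it assumes writes are spread evenly over time):

    def updates_per_second(users, interval_minutes):
        return users / (interval_minutes * 60)

    print(updates_per_second(2_000, 15))    # ~2.2 inserts/s for the pilot
    print(updates_per_second(100_000, 5))   # ~333 inserts/s after growth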
With appropriate indexes to avoid slow queries, I wouldn't expect any decent RDBMS to struggle with that kind of dataset. Lots of people are using PostgreSQL to handle far more data than that.
It's what databases are made for :)
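For the common "one user over a time range" query, a composite index is usually enough; here is an illustrative sketch with psycopg2 (table and column names are assumptions carried over from the earlier sketches):

    import psycopg2

    with psycopg2.connect("dbname=energy") as conn:
        with conn.cursor() as cur:
            # Supports "WHERE user_id = ? AND measured_at BETWEEN ..." range scans.
            cur.execute(
                "CREATE INDEX IF NOT EXISTS readings_user_time_idx "
                "ON readings (user_id, measured_at)"
            )
            cur.execute(
                "SELECT measured_at, payload FROM readings "
                "WHERE user_id = %s AND measured_at >= %s AND measured_at < %s "
                "ORDER BY measured_at",
                (42, "2024-01-01", "2024-02-01"),
            )
            rows = cur.fetchall()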
First of all, I would suggest that you run a performance test - write a program that generates test entries corresponding to the number of entries you'll see over half a year, insert them, and check the results to see whether query times are satisfactory. If not, try indexing as suggested in the other answers. It is, by the way, also worth testing write performance to make sure you can actually insert the amount of data you generate in 15 minutes in... 15 minutes or less.
Making a test will avoid the mother of all problems - assumptions :-)
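A sketch of such a generator, assuming the readings(user_id, measured_at, payload) table from the other examples and psycopg2; batching the inserts with execute_values keeps the load reasonably fast:

    import json, random, time
    from datetime import datetime, timedelta

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("dbname=energy")
    cur = conn.cursor()

    start = datetime(2024, 1, 1)
    t0 = time.time()
    for day in range(183):                      # roughly half a year
        rows = []
        for user_id in range(2000):
            for slot in range(96):              # one reading every 15 minutes
                ts = start + timedelta(days=day, minutes=15 * slot)
                rows.append((user_id, ts, json.dumps({"kwh": random.random()})))
        execute_values(
            cur,
            "INSERT INTO readings (user_id, measured_at, payload) VALUES %s",
            rows,
        )
        conn.commit()                           # one commit per simulated day
    print(f"loaded ~35M rows in {time.time() - t0:.0f}s")

Timing a few representative SELECTs against the loaded table then tells you whether you need the indexing or partitioning suggested in the other answers.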
Also think about production performance - your pilot will have 2000 users - will your production environment have 4000 users or 200000 users in a year or two?
If we're talking about a really big environment, you need to think about a solution that lets you scale out by adding more nodes, instead of relying on always being able to add more CPU, disk and memory to a single machine. You can either do this in your application by keeping track of which of several database machines hosts the details for a specific user, or you can use one of the PostgreSQL clustering methods, or you could go down a completely different path - the NoSQL approach, where you walk away from the RDBMS entirely and use systems built to scale horizontally.
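Application-level sharding can be as simple as a routing function that maps a user to one of several databases; a minimal sketch (connection strings are placeholders):

    SHARDS = [
        "dbname=energy_shard0 host=db0",
        "dbname=energy_shard1 host=db1",
        "dbname=energy_shard2 host=db2",
    ]

    def shard_for(user_id: int) -> str:
        # Simple modulo routing; a lookup table or consistent hashing makes it
        # easier to add shards later without moving every user.
        return SHARDS[user_id % len(SHARDS)]

    print(shard_for(42))   # -> connection string of the shard holding user 42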
There are a number of such systems; I only have personal experience with Cassandra. You have to think completely differently compared to what you're used to from the RDBMS world, which is something of a challenge - think more about how you want to access the data than about how to store it. For your example, I think it would make sense to store the data with the user-id as the key and then add one column per reading, with the column name being the timestamp and the column value being your data for that timestamp. You can then ask for slices of those columns, for example to graph results in a web UI - Cassandra has good enough response times for UI applications.
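In current CQL the same per-user, timestamp-keyed layout is usually expressed with a clustering column; a rough sketch using the DataStax Python driver (keyspace, table and column names are made up, and the 'energy' keyspace is assumed to exist):

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()

    # One partition per user, one row (conceptually one column) per timestamp.
    session.execute("""
        CREATE TABLE IF NOT EXISTS energy.readings (
            user_id     int,
            measured_at timestamp,
            value       double,
            PRIMARY KEY (user_id, measured_at)
        )
    """)

    # A "slice" of one user's readings between two timestamps, e.g. for a chart.
    rows = session.execute(
        "SELECT measured_at, value FROM energy.readings "
        "WHERE user_id = %s AND measured_at >= %s AND measured_at < %s",
        (42, datetime(2024, 1, 1), datetime(2024, 2, 1)),
    )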
The upside of investing time in learning and using a NoSQL system is that when you need more space, you just add a new node. The same goes for when you need more write performance or more read performance.
Are you not better off not keeping individual samples for the full period? You could implement some sort of consolidation mechanism that rolls weekly/monthly samples up into a single record, and run that consolidation on a schedule.
Your decision has to depend on the types of queries you need to be able to run against the database.
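A sketch of what such a scheduled consolidation job could look like against the hypothetical readings table from the earlier examples; readings_daily is an assumed roll-up table, not something from the question:

    import psycopg2

    ROLLUP = """
    INSERT INTO readings_daily (user_id, day, kwh_total)
    SELECT user_id,
           date_trunc('day', measured_at) AS day,
           sum((payload->>'kwh')::numeric)
    FROM readings
    WHERE measured_at < now() - interval '1 month'
    GROUP BY user_id, date_trunc('day', measured_at);
    """

    CLEANUP = "DELETE FROM readings WHERE measured_at < now() - interval '1 month';"

    # Run from cron (or a job scheduler): consolidate, then drop the detail rows.
    with psycopg2.connect("dbname=energy") as conn:
        with conn.cursor() as cur:
            cur.execute(ROLLUP)
            cur.execute(CLEANUP)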
There are lots of techniques for handling this problem. You will only get good performance if you touch the minimum number of records. In your case you can use the following techniques.
Example: suppose you receive a request to insert a record into some table every second. You can build a mechanism that processes these requests in batches of 5 records; that way you hit the database only every 5 seconds, which is much better. Yes, you can make users wait those 5 seconds for their record to be inserted, just as Gmail asks you to wait while it is processing when you send an email. For selects, you can periodically dump your result sets to the file system and serve them directly to users without touching the database, which is what most stock-market data companies do.
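A minimal sketch of that batching idea (names are illustrative; the flush would be driven by a timer or scheduler rather than by each request):

    import psycopg2
    from psycopg2.extras import execute_values

    pending = []  # (user_id, measured_at, payload) tuples waiting to be written

    def enqueue(row):
        pending.append(row)

    def flush(conn):
        """Write everything queued so far in a single multi-row INSERT."""
        global pending
        rows, pending = pending, []
        if rows:
            with conn.cursor() as cur:
                execute_values(
                    cur,
                    "INSERT INTO readings (user_id, measured_at, payload) VALUES %s",
                    rows,
                )
            conn.commit()

    # Call flush(conn) every ~5 seconds from a timer thread or scheduler.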
For any further queries you can mail me at [email protected]