Storing large amounts of data in a database
I have a question regarding the storage of a large amount of data. The situation is the following:
I want to store
- GPS coordinates (latitude and longitude), at one-minute or even shorter intervals, but I'm considering every minute
- Event, which can be repeated for several coordinates
- Datetime or timestamp of the entry (I don't know which is better to use in my case)
- (user id)
I want to be able to query:
- Events by zone (defining a range of latitude and longitude, for example from (1,1) to (2,2))
- User tracking from date X to date Y (one or more users)
So far I have been thinking about these solutions:
Solution 1
id_user (int)
id_experience (int)
id_event (int)
dt (datetime)
latitude (decimal)
longitude (decimal)
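As a rough sketch, Solution 1 could be declared in MySQL as below. The table name, the surrogate id column, the DECIMAL precision, and the index choices are my assumptions, not part of the original design; the two queries mirror the access patterns listed above.

-- Hypothetical DDL for Solution 1; names and indexes are illustrative.
CREATE TABLE tracking (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    id_user       INT NOT NULL,
    id_experience INT NOT NULL,
    id_event      INT NOT NULL,
    dt            DATETIME NOT NULL,
    latitude      DECIMAL(9,6) NOT NULL,
    longitude     DECIMAL(9,6) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_user_dt (id_user, dt),         -- user tracking from date X to Y
    KEY idx_lat_lon (latitude, longitude)  -- coarse zone filtering
);

-- User tracking from date X to date Y (ids and dates are placeholders):
SELECT * FROM tracking
WHERE id_user IN (1, 2)
  AND dt BETWEEN '2012-01-01' AND '2012-01-31';

-- Events by zone; note that a B-tree index can only use the
-- latitude part of idx_lat_lon as a range here:
SELECT * FROM tracking
WHERE latitude  BETWEEN 1 AND 2
  AND longitude BETWEEN 1 AND 2;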
I started to do some calculations and that would be something like:
- around 500 entries per user per day
- since I'm preparing the application for some load, there can be around 100-150 users, which would be 75,000 entries/day
- after one month there would be millions of entries (75,000 entries/day × 30 days ≈ 2.25 million)
Probably, Solution 1 is not a good solution, since the size of the database would grow very fast.
Solution 2
Have 2 tables, one of which will aggregate coordinates according to the event. For example, I have an event "dinner" that takes 30 minutes, so 30 entries will be grouped into one field of BLOB type. This table will look like:
id_user (int)
id_experience (int)
id_event (int)
dt (datetime)
coordinates (blob)
And another table, which holds calculated locations with some "width" and "length" and a pointer to the first table:
latitude (decimal)
longitude (decimal)
id_entry_in_first_table (int)
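A minimal sketch of these two tables, assuming the BLOB holds a serialized list of per-minute points (the table names and the serialization format are my assumptions):

-- Hypothetical DDL for Solution 2; table names are illustrative.
CREATE TABLE event_track (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    id_user       INT NOT NULL,
    id_experience INT NOT NULL,
    id_event      INT NOT NULL,
    dt            DATETIME NOT NULL,  -- start of the event
    coordinates   BLOB NOT NULL,      -- serialized per-minute points
    PRIMARY KEY (id)
);

-- One row per calculated location, pointing back to the event row,
-- so zone queries can still use a plain B-tree index:
CREATE TABLE event_location (
    latitude                DECIMAL(9,6) NOT NULL,
    longitude               DECIMAL(9,6) NOT NULL,
    id_entry_in_first_table BIGINT UNSIGNED NOT NULL,
    KEY idx_lat_lon (latitude, longitude),
    KEY idx_entry (id_entry_in_first_table)
);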
This solution only partially solves my problem; imagine that some events will not last more than a few minutes, and then there is still a need for the second database.
Solution 3
This is probably not a very correct solution, but it seems to make some sense. I have users associated with some kind of experience, which has a start date and an end date. When an experience ends, I will create a dump of the data for that experience and save it to a file, deleting the entries related to that experience. When the user wants to consult an "archived" experience, I will load the data into some temporary table and drop it within one day (for example). In this case I would store the data according to Solution 1.
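As a sketch of this archive/restore cycle, assuming Solution 1's tracking table and using standard MySQL statements (the file path and the experience id 42 are placeholders):

-- Archive a finished experience to a file, then delete the live rows.
SELECT id_user, id_experience, id_event, dt, latitude, longitude
INTO OUTFILE '/var/archive/experience_42.csv'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
FROM tracking
WHERE id_experience = 42;

DELETE FROM tracking WHERE id_experience = 42;

-- Later, restore the archive into a scratch table (a plain table
-- rather than a TEMPORARY one, so it survives until a scheduled
-- job drops it a day later):
CREATE TABLE tracking_archive_42 LIKE tracking;

LOAD DATA INFILE '/var/archive/experience_42.csv'
INTO TABLE tracking_archive_42
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(id_user, id_experience, id_event, dt, latitude, longitude);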
The main question is: are any of the presented solutions acceptable in terms of database performance? Is there a better solution for my problem?
3 Answers
"Millions of entries" sounds like a lot, but this is what databases are designed to handle. However you design it, if you optimise it according to how you want to extract results from it later (as thats what will take the time as opposed to the inserts) then you're good to go.
Saying that of course... if you have lots of users doing lots of things at the same time to your database then I think your server/bandwidth with go before your database does!
I would choose a master-detail approach.
Two advantages:
You don't have redundant entries (1 master row and x child rows with coordinates)
It is still easy to query (in contrast to the blob approach).
And this should be pretty fast even with many millions of records in the master table, if you set up a foreign key or index on master_table_id.
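A minimal sketch of such a master-detail layout, assuming one master row per event and one child row per coordinate sample (only master_table_id comes from the answer; every other name is illustrative):

-- Hypothetical master-detail schema; InnoDB is required for the
-- foreign key constraint.
CREATE TABLE event_master (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    id_user       INT NOT NULL,
    id_experience INT NOT NULL,
    id_event      INT NOT NULL,
    dt            DATETIME NOT NULL,
    PRIMARY KEY (id),
    KEY idx_user_dt (id_user, dt)
) ENGINE=InnoDB;

CREATE TABLE event_detail (
    master_table_id BIGINT UNSIGNED NOT NULL,
    dt              DATETIME NOT NULL,
    latitude        DECIMAL(9,6) NOT NULL,
    longitude       DECIMAL(9,6) NOT NULL,
    KEY idx_master (master_table_id),
    KEY idx_lat_lon (latitude, longitude),
    CONSTRAINT fk_master FOREIGN KEY (master_table_id)
        REFERENCES event_master (id)
) ENGINE=InnoDB;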
You probably want to read this: http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html.
Broadly speaking, as long as you can use indexes in your queries, huge tables aren't an issue - billions of records can be queried on consumer-grade laptops. You should have an archiving strategy if you intend to scale to huge numbers of historical records, but it's not a major priority.
Far more tricky is to support your desire to find events within a certain geographic boundary; it's easy for this to break your indexing strategy in all sorts of nasty ways. If you have to query based on mathematical operations, you may not be able to use an index - so finding users within a radius of a 1 mile circle might have to evaluate the circle formula for every record in your database table.
The spatial extensions offer a solution for this - but they're not "free", you have to optimize your design specifically for this.
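As an illustration of the spatial approach (a sketch only: in MySQL 5.x a SPATIAL index requires the MyISAM engine, and GeomFromText/MBRContains are the function spellings from that era's manual; all table and column names here are my own):

-- Hypothetical spatially indexed table; names are illustrative.
CREATE TABLE tracking_spatial (
    id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    id_user  INT NOT NULL,
    id_event INT NOT NULL,
    dt       DATETIME NOT NULL,
    coord    POINT NOT NULL,          -- stored as POINT(lon lat)
    PRIMARY KEY (id),
    SPATIAL INDEX idx_coord (coord)
) ENGINE=MyISAM;

-- Events by zone: an index-assisted bounding-box search over the
-- rectangle from (1,1) to (2,2).
SELECT id, id_event, dt
FROM tracking_spatial
WHERE MBRContains(
    GeomFromText('POLYGON((1 1, 2 1, 2 2, 1 2, 1 1))'),
    coord);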