SQL: how would you save a user's own data?
I'm on a project that involves time series analytics, and I need to let users upload a file containing their own time series (i.e. numbers with dates), for instance a .csv file. The data in their files would then be accessible at any time, to be used within our system.
How could I do that?
The ideas I've thought about:
- Create a table each time a user uploads a file (and save the name of that table somewhere). If I have lots of users uploading lots of data, I may end up with tons of tables.
- Create one big fat monster table with basically three or four columns: the date of the value; the value; the dataset name (and/or the dataset's owner). Everything is uploaded into that table, and when Bob needs his weather data I just select (date, value) where owner = Bob and datasetname = weatherdata.
- In-between solution: one table per user, with all of Bob's datasets in Bob's table.
- Completely different: just save the .csv file somewhere and read it when you need it.
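For concreteness, the second idea (one big table) could look roughly like the sketch below. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL, and the `observations` table, its column names, and the `read_series` helper are all made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observations (
        owner        TEXT NOT NULL,
        dataset_name TEXT NOT NULL,
        obs_date     TEXT NOT NULL,   -- ISO 8601 date, so text order is date order
        value        REAL NOT NULL,
        PRIMARY KEY (owner, dataset_name, obs_date)
    )
    """
)
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?, ?)",
    [
        ("Bob", "weatherdata", "2012-01-02", 3.5),
        ("Bob", "weatherdata", "2012-01-01", 1.0),
        ("Bob", "sales", "2012-01-01", 250.0),
        ("Alice", "weatherdata", "2012-01-01", -4.0),
    ],
)


def read_series(conn, owner, dataset_name):
    """Return (date, value) pairs for one owner's dataset, oldest first."""
    cur = conn.execute(
        "SELECT obs_date, value FROM observations "
        "WHERE owner = ? AND dataset_name = ? ORDER BY obs_date",
        (owner, dataset_name),
    )
    return cur.fetchall()


bob_weather = read_series(conn, "Bob", "weatherdata")
```

Everyone's data lives in the same table; the `WHERE` clause is what isolates Bob's weather series from his other datasets and from other users.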
I keep reading that it's bad practice to have a varying number of tables (and I believe it). However, my situation is slightly different from other questions I've seen on this site (most people seem to want to create one table per user, when they should create one row per user).
Some additional information:
- time series data may contain hundreds of thousands of observations, maybe millions
- a priori, saved data should not be modified afterwards. However, I guess it would be useful to let users append new data to their time series.
- a priori, I won't need to do complicated SQL select statements. I just want to read Bob's weather data, and I'll probably use it in chronological order, although you never know what tomorrow may bring.
- using PostgreSQL 9.1, if that's of any importance.
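To make the append-only, read-in-order requirement concrete, here is a minimal sketch, again with Python's `sqlite3` standing in for PostgreSQL. The `data_points` table, the `append_csv` and `read_chronological` helpers, and the two-column `date,value` CSV layout without a header are assumptions, not something stated above.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_points (
        dataset_id INTEGER NOT NULL,
        obs_date   TEXT    NOT NULL,   -- ISO 8601 date
        value      REAL    NOT NULL,
        PRIMARY KEY (dataset_id, obs_date)   -- one value per date per series
    )
    """
)


def append_csv(conn, dataset_id, csv_text):
    """Append 'date,value' rows from an uploaded CSV; return the number of rows added."""
    rows = [
        (dataset_id, date, float(value))
        for date, value in csv.reader(io.StringIO(csv_text))
    ]
    with conn:  # one transaction: a rejected row rolls back the whole upload
        conn.executemany("INSERT INTO data_points VALUES (?, ?, ?)", rows)
    return len(rows)


def read_chronological(conn, dataset_id):
    """Return the series as (date, value) pairs, oldest first."""
    return conn.execute(
        "SELECT obs_date, value FROM data_points "
        "WHERE dataset_id = ? ORDER BY obs_date",
        (dataset_id,),
    ).fetchall()


append_csv(conn, 1, "2012-01-01,1.0\n2012-01-02,3.5\n")
append_csv(conn, 1, "2012-01-03,2.25\n")  # a later append to the same series
series = read_chronological(conn, 1)
```

Because the primary key is (dataset_id, obs_date), an upload that repeats an existing date is rejected instead of silently overwriting saved data, which is one way to read the "should not be modified afterwards" constraint.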
EDIT
Reading some answers, I realize I may not have explained myself very well. I should have said that I'm clearly already working in a SQL environment; I already have a User table; when I write "table" I really mean "relation"; all four of my ideas involve foreign keys somewhere; and RDBMS normalization is the paradigm unless something else is better. (None of this means I'm against non-SQL solutions.)
Comments (4)
I'm going to have to go with the "big fat monster table". This is how relational databases are meant to work, although you should normalize it (create one table for users, another for data sets, and another for the data points). Having multiple tables with identical schemas is a bad idea from all angles - design, management, security, even querying; are you sure you'll never want to combine information from two data sets?
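As a rough illustration of that normalized layout, here is a sketch using Python's `sqlite3` in place of PostgreSQL. The `users`, `datasets` and `data_points` tables, their columns, and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this; PostgreSQL enforces FKs by default
conn.executescript(
    """
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE datasets (
        dataset_id INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        name       TEXT NOT NULL,
        UNIQUE (user_id, name)
    );
    CREATE TABLE data_points (
        dataset_id INTEGER NOT NULL REFERENCES datasets(dataset_id),
        obs_date   TEXT NOT NULL,
        value      REAL NOT NULL,
        PRIMARY KEY (dataset_id, obs_date)
    );
    INSERT INTO users VALUES (1, 'Bob');
    INSERT INTO datasets VALUES (10, 1, 'weatherdata'), (11, 1, 'sales');
    INSERT INTO data_points VALUES
        (10, '2012-01-01', 1.0), (10, '2012-01-02', 3.5),
        (11, '2012-01-01', 250.0), (11, '2012-01-02', 310.0);
    """
)

# Combining two datasets is a plain self-join on the date.
combined = conn.execute(
    """
    SELECT w.obs_date, w.value AS weather, s.value AS sales
    FROM data_points AS w
    JOIN data_points AS s ON s.obs_date = w.obs_date
    WHERE w.dataset_id = 10 AND s.dataset_id = 11
    ORDER BY w.obs_date
    """
).fetchall()
```

With one table per dataset, the same comparison would need the table names spliced into the query text at run time, which is the querying pain mentioned above.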
If you really are certain that each data set will be totally isolated, then you might also consider not using SQL at all. HDF (Hierarchical Data Format) was literally built for this exact purpose: efficient storage and retrieval of "scientific data sets", which are very often time-series data. "Tables" in HDF are literally called data sets; they can share definitions, they can be multidimensional (e.g. one dimension for the day, one for the time), and they are much cheaper than SQL tables.
I don't normally try to steer people away from SQL, but unusual situations sometimes call for unusual solutions. If you're going to end up with billions of rows in a SQL table (or more) and you have practically no other data to store, then SQL may not be the right solution for you.
Your ideas are all fairly good ways of accomplishing the task (hopefully I've read it correctly).
What about a relational database? For example, a table with username, time uploaded and a unique dataid, then link the dataid to another table containing the dataid foreign key and the raw file data. This would keep the user table to a minimum (and you could possibly merge it with another table, one containing the users' details for example). Having a separate table for users, then another for passwords, another for emails, and then 5 more for data is probably bad practice, but personally I don't see anything wrong with separating files from user data.
What language are you using to process the data? This could also be a deciding factor.
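A minimal sketch of that two-table idea, using Python's `sqlite3` as a stand-in for PostgreSQL. The `uploads` and `upload_files` table names, the `store_upload` and `load_series` helpers, and the headerless `date,value` CSV layout are assumptions made for the example.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE uploads (
        dataid      INTEGER PRIMARY KEY,
        username    TEXT NOT NULL,
        uploaded_at TEXT NOT NULL
    );
    CREATE TABLE upload_files (
        dataid   INTEGER PRIMARY KEY REFERENCES uploads(dataid),
        raw_file BLOB NOT NULL
    );
    """
)


def store_upload(conn, username, uploaded_at, raw_bytes):
    """Record the upload, store the file unparsed, and return its dataid."""
    with conn:  # both inserts succeed or neither does
        cur = conn.execute(
            "INSERT INTO uploads (username, uploaded_at) VALUES (?, ?)",
            (username, uploaded_at),
        )
        conn.execute(
            "INSERT INTO upload_files VALUES (?, ?)", (cur.lastrowid, raw_bytes)
        )
    return cur.lastrowid


def load_series(conn, dataid):
    """Fetch the raw file and parse it into (date, value) pairs on demand."""
    (raw,) = conn.execute(
        "SELECT raw_file FROM upload_files WHERE dataid = ?", (dataid,)
    ).fetchone()
    text = raw.decode("utf-8")
    return [(date, float(value)) for date, value in csv.reader(io.StringIO(text))]


dataid = store_upload(
    conn, "bob", "2012-03-01T10:00:00", b"2012-01-01,1.0\n2012-01-02,3.5\n"
)
series = load_series(conn, dataid)
```

The trade-off is that the database only sees an opaque file: it cannot filter or sort by date, so all of that work moves into the application code.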
Hope this helps :)
Tom
OK, I think option 2 is best: creating extra tables is a nightmare to maintain and leaves you open to all sorts of errors. Option 4 is somewhat appealing, but I still think a database should be able to cope with this kind of task.
I think I would structure my tables like so:
User Table - UserID, Name etc
Row - Each row in your uploaded data (rowid, etc)
RowInDataSet - Row ID, DataSetID
DataSet - DataSetID, Upload Date, UploadBy etc
This lets you break up your data a little and makes it easy to maintain. Storing large amounts of data should not be such an issue if you correctly index these tables.
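One possible reading of that four-table layout is sketched below with Python's `sqlite3` standing in for PostgreSQL. The column names beyond the ones listed above, the `data_row` name (used instead of `Row`), the index name and the sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_table (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE dataset (
        dataset_id  INTEGER PRIMARY KEY,
        upload_date TEXT NOT NULL,
        uploaded_by INTEGER NOT NULL REFERENCES user_table(user_id)
    );
    CREATE TABLE data_row (
        row_id   INTEGER PRIMARY KEY,
        obs_date TEXT NOT NULL,
        value    REAL NOT NULL
    );
    CREATE TABLE row_in_dataset (
        row_id     INTEGER NOT NULL REFERENCES data_row(row_id),
        dataset_id INTEGER NOT NULL REFERENCES dataset(dataset_id),
        PRIMARY KEY (row_id, dataset_id)
    );
    -- Lookups go from a dataset to its rows, so index the link table by dataset_id.
    CREATE INDEX idx_row_in_dataset_dataset ON row_in_dataset (dataset_id);

    INSERT INTO user_table VALUES (1, 'Bob');
    INSERT INTO dataset VALUES (10, '2012-03-01', 1), (11, '2012-03-02', 1);
    INSERT INTO data_row VALUES
        (100, '2012-01-02', 3.5), (101, '2012-01-01', 1.0), (102, '2012-01-01', 250.0);
    INSERT INTO row_in_dataset VALUES (100, 10), (101, 10), (102, 11);
    """
)


def rows_for_dataset(conn, dataset_id):
    """Return the (date, value) rows linked to one dataset, oldest first."""
    return conn.execute(
        """
        SELECT r.obs_date, r.value
        FROM row_in_dataset AS rd
        JOIN data_row AS r ON r.row_id = rd.row_id
        WHERE rd.dataset_id = ?
        ORDER BY r.obs_date
        """,
        (dataset_id,),
    ).fetchall()


weather_rows = rows_for_dataset(conn, 10)
```

The link table costs one extra join per read compared with putting a dataset ID directly on each row; the index on `dataset_id` is what keeps that join from scanning every link.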
Example T-SQL* for a possible design:
The design models two 'entities' implied in your question – the time series data being loaded and the sets of time series data.
*For SQL Server; I know you said PostgreSQL 9.1, but I'm pretty sure you can translate easily enough.