How to set up MySQL tables for fast SELECT
The question is about *.FIT files (from one to a very large and constantly growing number of them) from sports watches and speedometers, in which there is always a timestamp (1 to n seconds), as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).
To perform data analysis, I need the data in the database to calculate e.g. the heart rates in relation to the altitude over several FIT files / training units / time periods.
Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility of integrating more or new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything into one big table (which would then have a great many "empty" cells whenever a parameter is not present in a FIT file).
Basic tables:
1 x tbl_file
id | filename | date |
---|---|---|
1 | xyz.fit | 2022-01-01 |
2 | vwx.fit | 2022-01-02 |
.. | .. | .. |
n x tbl_parameter_xy / tbl_parameter_yz / ...
id | timestamp/counter | file_id | value |
---|---|---|---|
1 | 0 | 1 | value |
2 | 1 | 1 | value |
3 | 0 | 2 | value |
.. | .. | .. | .. |
These parameter tables would then be linked to each other, and to the FIT file, via file_id.
I then set up a MySQL database on a test server to try this out and was shocked:
SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999
It took almost 30 seconds to give me the results.
In my parameter tables there are 209918 rows.
file_id 999 consists of 1964 rows.
But my SELECT with JOIN returns 3857269 rows, so there must be an error somewhere, and that is why it takes 30 seconds.
In comparison, fetching from a "large complete" table took 0.5 seconds:
SELECT * FROM tbl_all_parameters
WHERE file_id = 999
After some research, I came across INDEX and thought I had the solution.
I created an index on file_id for each of the parameter tables, but the query was just as slow, or even slower.
Right now I'm thinking about building that big "all in one" table, which is easier to handle and faster to select from, but I would have to alter it frequently to add new columns for new parameters. And I'm afraid it will grow so big that it becomes unusable.
I have 2 questions:
- Which table setup is recommended, primarily with a focus on SELECT speed and secondarily on the size of the database?
- Do I have a basic bug in my SELECT that makes it so slow?
Comments (1)
You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows in your two parameter tables.
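If you join two tables a and b with no ON condition at all, you get COUNT(a) * COUNT(b) rows in your result set. Your query joins only ON x.file_id = y.file_id, which gives you a similarly bloated result set: every row for file 999 in one table is paired with every row for file 999 in the other. If each table holds about 1964 rows for that file, that is roughly 1964 * 1964, or about 3.86 million rows, which is in line with the row count you report.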
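The row arithmetic can be checked on a toy data set. The following is only a sketch: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for the MySQL test server, and the column name `ts` (for the timestamp/counter column) and the tiny row counts are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (assumption, for runnability).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical miniature versions of the two parameter tables from the question.
for table in ("tbl_parameter_xy", "tbl_parameter_yz"):
    cur.execute(
        f"CREATE TABLE {table} ("
        "id INTEGER PRIMARY KEY, ts INTEGER, file_id INTEGER, value REAL)"
    )
    # Each table gets 5 samples for file 999 and 3 samples for file 1.
    rows = [(ts, 999, float(ts)) for ts in range(5)]
    rows += [(ts, 1, float(ts)) for ts in range(3)]
    cur.executemany(
        f"INSERT INTO {table} (ts, file_id, value) VALUES (?, ?, ?)", rows
    )

# No join condition at all: every row of x pairs with every row of y.
cross_rows = cur.execute(
    "SELECT COUNT(*) FROM tbl_parameter_xy AS x "
    "CROSS JOIN tbl_parameter_yz AS y"
).fetchone()[0]

# Join on file_id only, as in the question: every file-999 row of x
# pairs with every file-999 row of y.
file_only_rows = cur.execute(
    "SELECT COUNT(*) FROM tbl_parameter_xy AS x "
    "LEFT JOIN tbl_parameter_yz AS y ON x.file_id = y.file_id "
    "WHERE x.file_id = 999"
).fetchone()[0]

print(cross_rows)      # 8 * 8 = 64
print(file_only_rows)  # 5 * 5 = 25
```

With 5 rows per table for the file, the file_id-only join already returns 25 rows; the result grows with the square of the number of samples per file.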
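You need another ON condition. If the timestamps in the two tables are somehow in sync, possibly try also matching on the timestamp column, so that each row is paired only with the row recorded at the same moment.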
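A join with the extra timestamp condition might look like the sketch below. As before, SQLite via Python's sqlite3 stands in for MySQL, and the column name `ts` and the sample values are hypothetical. One sample is deliberately left out of the second table to show what the LEFT JOIN does with a missing match.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (assumption, for runnability).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

for table in ("tbl_parameter_xy", "tbl_parameter_yz"):
    cur.execute(
        f"CREATE TABLE {table} ("
        "id INTEGER PRIMARY KEY, ts INTEGER, file_id INTEGER, value REAL)"
    )

# x has samples at ts 0..4 for file 999; y is missing the sample at ts 3.
cur.executemany(
    "INSERT INTO tbl_parameter_xy (ts, file_id, value) VALUES (?, 999, ?)",
    [(ts, 100.0 + ts) for ts in range(5)],
)
cur.executemany(
    "INSERT INTO tbl_parameter_yz (ts, file_id, value) VALUES (?, 999, ?)",
    [(ts, 200.0 + ts) for ts in range(5) if ts != 3],
)

# Second ON condition: pair rows only when file AND timestamp agree.
rows = cur.execute(
    "SELECT x.ts, x.value, y.value "
    "FROM tbl_parameter_xy AS x "
    "LEFT JOIN tbl_parameter_yz AS y "
    "  ON x.file_id = y.file_id AND x.ts = y.ts "
    "WHERE x.file_id = 999 "
    "ORDER BY x.ts"
).fetchall()

for row in rows:
    print(row)
```

The result has one row per sample in x (5 rows instead of 25), and the row for ts 3 carries NULL (None in Python) for the missing y value. In MySQL, an index covering both join columns on each table, rather than file_id alone, would likely be needed for this join to be fast.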
But, with respect, I don't think you have a very good database design yet.
This is a tricky kind of data for which to create an optimal database layout, because it's extensible.
If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type), you have almost certainly misdesigned your database.
An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.
Your observations will all go into one table, for example with one row per file, parameter, and timestamp.
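One way such a table could look is sketched below. The table names `tbl_parameter` and `tbl_observation`, the column names, and the choice of primary key (file_id, parameter_id, ts) are all assumptions made for this illustration, and SQLite via Python's sqlite3 again stands in for MySQL.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (assumption, for runnability).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical lookup table: a new parameter type is a new row, not a new table.
cur.execute(
    "CREATE TABLE tbl_parameter ("
    "parameter_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)

# Hypothetical attribute/value table: short rows, one per observation.
cur.execute(
    """
    CREATE TABLE tbl_observation (
        file_id      INTEGER NOT NULL,
        parameter_id INTEGER NOT NULL,
        ts           INTEGER NOT NULL,
        value        REAL,
        PRIMARY KEY (file_id, parameter_id, ts)
    )
    """
)

cur.executemany(
    "INSERT INTO tbl_parameter VALUES (?, ?)",
    [(1, "speed"), (2, "heart_rate"), (3, "altitude")],
)
cur.executemany(
    "INSERT INTO tbl_observation VALUES (?, ?, ?, ?)",
    [
        (999, 2, 0, 120.0), (999, 2, 1, 121.0),
        (999, 3, 0, 512.0), (999, 3, 1, 513.5),
    ],
)

# A parameter that appears later needs no schema change, only new rows.
cur.execute("INSERT INTO tbl_parameter VALUES (4, 'cadence')")
cur.execute("INSERT INTO tbl_observation VALUES (999, 4, 0, 85.0)")

total = cur.execute("SELECT COUNT(*) FROM tbl_observation").fetchone()[0]
parameters = cur.execute("SELECT COUNT(*) FROM tbl_parameter").fetchone()[0]
print(total, parameters)  # 5 4
```

A file that lacks a parameter simply has no rows for it, so there are no empty cells. In MySQL with InnoDB, rows are stored in primary key order, so with a key like this the observations of one parameter in one file would sit next to each other.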
Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would select the rows with that file's id and those two parameter ids.
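A minimal sketch of such a query follows, assuming a hypothetical `tbl_observation` table with the columns file_id, parameter_id, ts, value and a primary key on (file_id, parameter_id, ts). Turning the two parameters into side-by-side columns with conditional aggregation is an addition of this sketch, not necessarily part of the original suggestion. SQLite via Python's sqlite3 stands in for MySQL.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (assumption, for runnability).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical attribute/value table; names and key are assumptions.
cur.execute(
    """
    CREATE TABLE tbl_observation (
        file_id      INTEGER NOT NULL,
        parameter_id INTEGER NOT NULL,
        ts           INTEGER NOT NULL,
        value        REAL,
        PRIMARY KEY (file_id, parameter_id, ts)
    )
    """
)
cur.executemany(
    "INSERT INTO tbl_observation VALUES (?, ?, ?, ?)",
    [
        (999, 1, 0, 7.5),                        # parameter 1: not requested
        (999, 2, 0, 120.0), (999, 2, 1, 121.0),  # parameter 2
        (999, 3, 0, 512.0), (999, 3, 1, 513.5),  # parameter 3
        (1000, 2, 0, 99.0),                      # another file: filtered out
    ],
)

# Filter on the leading key columns, then turn the two parameters
# into one column each, one output row per timestamp.
rows = cur.execute(
    """
    SELECT ts,
           MAX(CASE WHEN parameter_id = 2 THEN value END) AS param_2,
           MAX(CASE WHEN parameter_id = 3 THEN value END) AS param_3
    FROM tbl_observation
    WHERE file_id = 999
      AND parameter_id IN (2, 3)
    GROUP BY ts
    ORDER BY ts
    """
).fetchall()

print(rows)  # [(0, 120.0, 512.0), (1, 121.0, 513.5)]
```

Because the WHERE clause filters on file_id and parameter_id, the first two columns of the assumed key, the database can locate the matching rows through the key instead of scanning the whole table.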
The multicolumn primary key I suggested will optimize this particular query.
Once you have this working, read about denormalization.