如何设置MySQL表以快速选择

发布于 2025-02-09 05:02:05 字数 2137 浏览 3 评论 0原文

问题是关于 *.fit文件（链接到定义）不断更多），从运动手表，速度计，其中始终有时间戳（1到n秒），以及1至n个参数（还具有时间戳或计数器为1到x）。

要执行数据分析，我需要数据库中的数据来计算几个拟合文件 /培训单元 /时间段的高度相关的心率。

由于拟合文件中的参数数量变化（取决于连接的设备，创建文件等的设备）以及将来集成更多/新参数的可能性，我的想法是为每个参数，而不是在一个大表中写所有内容（每当拟合文件中不存在参数时，它就会有很多“空”单元格）。

基本表：

1 x tbl_file < / strong>

ID	文件	名日期
1	xyz.fit	2022-01-01
2	vwx.fit	2022-01-02
..	.. ..	..

nx tbl_parameter_xy_xy / tbl_ parameter_yz /。 ...

id	Timestamp/counter	File_id	值
1	0	1	值
2	1	1	值
3	0	2	值
..	..	..	.. .....

然后将通过 file_id互相链接这些参数表以及适合文件。

然后，我使用了一台测试服务器，设置了MySQL-DB进行测试并感到震惊：

SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999

花了几乎 30秒给我结果。

在我的参数表中有 209918 行。 File_id 999由 1964 行组成。但是我选择加入返回 3857269 行，因此必须有一个/错误，这就是为什么它需要30秒的原因。

相比之下，从“大完整”表中获取 0.5秒：

SELECT * FROM tbl_all_parameters 
WHERE file_id = 999

经过一些研究，我遇到了索引，并认为我已经有了解决方案。我为每个参数表创建了一个索引（file_id），但结果甚至更慢/相同。

现在，我正在考虑构建那张大“一张”桌子，这使得更易于处理和更快地选择，但是我必须经常更新它，以将新的Cols插入新的Cols以获取新的新科尔斯参数。我害怕它会变得如此大，以至于自我杀死

我有2个问题：

建议使用哪个桌子设置，主要专注于选择速度，次要db的大小。
我有我选择中的一个基本错误使其如此慢？

解释选择

原文

The question is about *.FIT files (link to definition) (1 to extremely many and constantly more), from Sports watches, speedometers,
in which there is always a timestamp (1 to n seconds), as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).

To perform data analysis, I need the data in the database to calculate e.g. the heart rates in relation to the altitude over several FIT files / training units / time periods.

Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility to integrate more/new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything in one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).

Basic tables:

1 x tbl_file

id	filename	date
1	xyz.fit	2022-01-01
2	vwx.fit	2022-01-02
..	..	..

n x tbl_parameter_xy / tbl_ parameter_yz / ....

id	timestamp/counter	file_id	value
1	0	1	value
2	1	1	value
3	0	2	value
..	..	..	..

And these parameter tables would then be linked to each other via the file_id as well as to the FIT File.

I then used a test server, set up a MYSQL-DB to test this and was shocked:

SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999

Took almost 30 seconds to give me the results.

In my parameter tables there are 209918 rows.
file_id 999 consists of 1964 rows.
But my SELECT with JOIN returns 3857269 rows, so there must be an/the error and that's the reason why it takes 30sec.

In comparison, fetching from a "large complete" table was done in 0.5 seconds:

SELECT * FROM tbl_all_parameters 
WHERE file_id = 999

After some research, I came across INDEX and thought I had the solution.
I created an index (file_id) for each of the parameter tables, but the result was even slower/same.

Right now I´m thinking about building that big "one in all" table, which makes it easier to handle and faster to select from, but I would have to update it frequently to insert new cols for new parameters. And I´m afraid it will grow so big it kills itself

I have 2 questions:

Which table setup is recommended, primary with focus on SELECT speed, secondary with size of DB.
Do I have a basic bug in my SELECT that makes it so slow?

EXPLAIN SELECT

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

小梨窩很甜 2025-02-16 05:02:05

您正在加入中会有组合爆炸。您的结果集在两个参数表中的每对输入行中包含一个输出行。

如果您说的

  SELECT * FROM a LEFT JOIN b

根本没有条件，则会在结果集中获得计数（a） *计数（b）行。您说的是，这

  SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id

为您提供了类似的结果集。

您需要另一个条件...可能会尝试一下。

SELECT * 
  FROM tbl_parameter_xy as x
  LEFT JOIN tbl_parameter_yz as y
         ON x.file_id = y.file_id
        AND x.timestamp = y.timestamp

如果两个表中的时间戳在某种程度上是同步的。

但是，就尊重而言，我认为您还没有很好的数据库设计。
这是一种创建最佳数据库布局的棘手数据，因为它是可扩展的。

如果您发现自己经常在生产中创建新表（例如，在添加新设备类型时），您几乎可以肯定会误导您的数据库。

您可能采用的一种方法是创建一个属性 /值表。它会有很多行，但是它们会简短且易于索引。

您的观察结果将进入这样的桌子。

file_id       part of your primary key 
parameter_id  part of your primary key
timestamp     part of your primary key
value

然后，当您需要从特定文件中检索参数2和3时，我建议您执行

    SELECT timestamp, parameter_id, value
      FROM observation_table
     WHERE file_id = xxxx
       AND parameter_id IN (2,3)
     ORDER BY timestamp, parameter_id

多列主键，我建议将优化此特定查询。

完成此工作后，请阅读有关统计化的信息。

You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows in your two parameter tables.

If you say

  SELECT * FROM a LEFT JOIN b

with no ON condition at all you get COUNT(a) * COUNT(b) rows in your result set. And you said this

  SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id

which gives you a similarly bloated result set.

You need another ON condition... possibly try this.

SELECT * 
  FROM tbl_parameter_xy as x
  LEFT JOIN tbl_parameter_yz as y
         ON x.file_id = y.file_id
        AND x.timestamp = y.timestamp

if the timestamps in the two tables are somehow in sync.

But, with respect, I don't think you have a very good database design yet.
This is a tricky kind of data for which to create an optimal database layout, because it's extensible.

If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type) you almost certainly have misdesigned you database.

An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.

Your observations will go into a table like this.

file_id       part of your primary key 
parameter_id  part of your primary key
timestamp     part of your primary key
value

Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do

    SELECT timestamp, parameter_id, value
      FROM observation_table
     WHERE file_id = xxxx
       AND parameter_id IN (2,3)
     ORDER BY timestamp, parameter_id

The multicolumn primary key I suggested will optimize this particular query.

Once you have this working, read about denormalization.

回复收藏 0 原文

~没有更多了~