如何设置MySQL表以快速选择

发布于 2025-02-09 05:02:05 字数 2137 浏览 3 评论 0原文

问题是关于 *.fit文件(链接到定义)不断更多),从运动手表,速度计, 其中始终有时间戳(1到n秒),以及1至n个参数(还具有时间戳或计数器为1到x)。

要执行数据分析,我需要数据库中的数据来计算几个拟合文件 /培训单元 /时间段的高度相关的心率。

由于拟合文件中的参数数量变化(取决于连接的设备,创建文件等的设备)以及将来集成更多/新参数的可能性,我的想法是为每个参数,而不是在一个大表中写所有内容(每当拟合文件中不存在参数时,它就会有很多“空”单元格)。

基本表:

1 x tbl_file < / strong>

ID文件名日期
1xyz.fit2022-01-01
2vwx.fit2022-01-02
.... ....

nx tbl_parameter_xy_xy / tbl_ parameter_yz /。 ...

idTimestamp/counterFile_id
101
211
302
........ .....

然后将通过 file_id互相链接这些参数表以及适合文件。

然后,我使用了一台测试服务器,设置了MySQL-DB进行测试并感到震惊:

SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999

花了几乎 30秒给我结果。

在我的参数表中有 209918 行。 File_id 999由 1964 行组成。 但是我选择加入返回 3857269 行,因此必须有一个/错误,这就是为什么它需要30秒的原因。

相比之下,从“大完整”表中获取 0.5秒

SELECT * FROM tbl_all_parameters 
WHERE file_id = 999

经过一些研究,我遇到了索引,并认为我已经有了解决方案。 我为每个参数表创建了一个索引(file_id),但结果甚至更慢/相同。

现在,我正在考虑构建那张大“一张”桌子,这使得更易于处理更快地选择,但是我必须经常更新它,以将新的Cols插入新的Cols以获取新的新科尔斯参数。我害怕它会变得如此大,以至于自我杀死

我有2个问题:

  • 建议使用哪个桌子设置,主要专注于选择速度,次要db的大小。
  • 我有我选择中的一个基本错误使其如此慢?

解释选择

The question is about *.FIT files (link to definition) (1 to extremely many and constantly more), from Sports watches, speedometers,
in which there is always a timestamp (1 to n seconds), as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).

To perform data analysis, I need the data in the database to calculate e.g. the heart rates in relation to the altitude over several FIT files / training units / time periods.

Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility to integrate more/new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything in one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).

Basic tables:

1 x tbl_file

idfilenamedate
1xyz.fit2022-01-01
2vwx.fit2022-01-02
......

n x tbl_parameter_xy / tbl_ parameter_yz / ....

idtimestamp/counterfile_idvalue
101value
211value
302value
........

And these parameter tables would then be linked to each other via the file_id as well as to the FIT File.

I then used a test server, set up a MYSQL-DB to test this and was shocked:

SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999

Took almost 30 seconds to give me the results.

In my parameter tables there are 209918 rows.
file_id 999 consists of 1964 rows.
But my SELECT with JOIN returns 3857269 rows, so there must be an/the error and that's the reason why it takes 30sec.

In comparison, fetching from a "large complete" table was done in 0.5 seconds:

SELECT * FROM tbl_all_parameters 
WHERE file_id = 999

After some research, I came across INDEX and thought I had the solution.
I created an index (file_id) for each of the parameter tables, but the result was even slower/same.

Right now I´m thinking about building that big "one in all" table, which makes it easier to handle and faster to select from, but I would have to update it frequently to insert new cols for new parameters. And I´m afraid it will grow so big it kills itself

I have 2 questions:

  • Which table setup is recommended, primary with focus on SELECT speed, secondary with size of DB.
  • Do I have a basic bug in my SELECT that makes it so slow?

EXPLAIN SELECT

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

小梨窩很甜 2025-02-16 05:02:05

您正在加入中会有组合爆炸。您的结果集在两个参数表中的每对输入行中包含一个输出行。

如果您说的

  SELECT * FROM a LEFT JOIN b

根本没有条件,则会在结果集中获得计数(a) *计数(b)行。您说的是,这

  SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id

为您提供了类似的结果集。

您需要另一个条件...可能会尝试一下。

SELECT * 
  FROM tbl_parameter_xy as x
  LEFT JOIN tbl_parameter_yz as y
         ON x.file_id = y.file_id
        AND x.timestamp = y.timestamp

如果两个表中的时间戳在某种程度上是同步的。

但是,就尊重而言,我认为您还没有很好的数据库设计。
这是一种创建最佳数据库布局的棘手数据,因为它是可扩展的。

如果您发现自己经常在生产中创建新表(例如,在添加新设备类型时),您几乎可以肯定会误导您的数据库。

您可能采用的一种方法是创建一个属性 /值表。它会有很多行,但是它们会简短且易于索引。

您的观察结果将进入这样的桌子。

file_id       part of your primary key 
parameter_id  part of your primary key
timestamp     part of your primary key
value

然后,当您需要从特定文件中检索参数2和3时,我建议您执行

    SELECT timestamp, parameter_id, value
      FROM observation_table
     WHERE file_id = xxxx
       AND parameter_id IN (2,3)
     ORDER BY timestamp, parameter_id

多列主键,我建议将优化此特定查询。

完成此工作后,请阅读有关统计化的信息。

You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows in your two parameter tables.

If you say

  SELECT * FROM a LEFT JOIN b

with no ON condition at all you get COUNT(a) * COUNT(b) rows in your result set. And you said this

  SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id

which gives you a similarly bloated result set.

You need another ON condition... possibly try this.

SELECT * 
  FROM tbl_parameter_xy as x
  LEFT JOIN tbl_parameter_yz as y
         ON x.file_id = y.file_id
        AND x.timestamp = y.timestamp

if the timestamps in the two tables are somehow in sync.

But, with respect, I don't think you have a very good database design yet.
This is a tricky kind of data for which to create an optimal database layout, because it's extensible.

If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type) you almost certainly have misdesigned you database.

An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.

Your observations will go into a table like this.

file_id       part of your primary key 
parameter_id  part of your primary key
timestamp     part of your primary key
value

Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do

    SELECT timestamp, parameter_id, value
      FROM observation_table
     WHERE file_id = xxxx
       AND parameter_id IN (2,3)
     ORDER BY timestamp, parameter_id

The multicolumn primary key I suggested will optimize this particular query.

Once you have this working, read about denormalization.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文