Implementing a k-d tree for "nearest neighbor" search in MySQL?

Posted 2024-11-28 16:06:34


I am designing automated trading software for the foreign exchange market.
In a MySQL database I have years of market data at five-minute intervals. Alongside the price and time, I have 4 different metrics for this data.

[Time|Price|M1|M2|M3|M4] 
x ~400,000 rows

Time is the primary key, and M1 through M4 are different metrics (such as standard deviation or slope of a moving average).

Here is a real example (excerpt):

+------------+--------+-----------+--------+-----------+-----------+
|  Time      | Price  |     M1    |   M2   |    M3     |     M4    |
+------------+--------+-----------+--------+-----------+-----------+
| 1105410300 | 1.3101 |   12.9132 | 0.4647 |   29.6703 |        50 |
| 1105410600 | 1.3103 |    14.056 | 0.5305 | 29.230801 |        50 |
| 1105410900 | 1.3105 |   15.3613 | 0.5722 |   26.8132 |        25 |
| 1105411200 | 1.3106 | 16.627501 | 0.4433 | 24.395599 |  26.47059 |
| 1105411500 | 1.3112 |   18.7843 | 1.0019 | 24.505501 |    34.375 |
| 1105411800 | 1.3111 |   19.8375 | 0.5626 |        20 |   32.8125 |
| 1105412100 | 1.3105 |   20.0168 | 0.6718 |    9.7802 |   23.4375 |
| 1105412400 | 1.3105 |   20.4538 | 0.8943 |     7.033 |   23.4375 |
| 1105412700 | 1.3109 |   21.6078 | 0.4902 |   11.7582 |   29.6875 |
| 1105413000 | 1.3104 |   21.2045 |  1.565 |    8.6813 |    21.875 |
+------------+--------+-----------+--------+-----------+-----------+...400k more

Given an input of M1, M2, M3, and M4, I want to find the 5,000 closest matches quickly and accurately.

Sample input:

+------------+--------+-----------+--------+-----------+-----------+
|  Time      | Price  |     M1    |   M2   |    M3     |     M4    |
+------------+--------+-----------+--------+-----------+-----------+
| 1205413000 | 1.4212 |   20.1045 | 1.0012 |    9.1013 |    11.575 |
+------------+--------+-----------+--------+-----------+-----------+

I figured that each of these metrics could be considered a 'dimension,' and that I can do a nearest neighbor search to locate the closest datapoints in this multidimensional space.

It seems the simplest way to do this is to iterate through every single data point and measure the multidimensional distance to my input point; but speed is of the essence!
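For reference, the brute-force scan described above can be sketched in a few lines of Python. This is only an illustration: it assumes the rows have already been loaded into memory as (time, m1, m2, m3, m4) tuples, and the function name `nearest_brute_force` is made up for this example. It uses plain Euclidean distance on the raw values, so a metric with a large range (M4 here) dominates the result.

```python
import heapq
import math

def nearest_brute_force(rows, query, k):
    """Return the k rows closest to `query` by Euclidean distance over M1..M4.

    `rows` holds (time, m1, m2, m3, m4) tuples; `query` is (m1, m2, m3, m4).
    Every row is visited once, so the cost grows linearly with the table size.
    """
    def distance(row):
        return math.dist(row[1:], query)  # requires Python 3.8+
    return heapq.nsmallest(k, rows, key=distance)

# A few rows from the excerpt (Price column omitted) and the sample input.
rows = [
    (1105410300, 12.9132, 0.4647, 29.6703, 50.0),
    (1105412100, 20.0168, 0.6718, 9.7802, 23.4375),
    (1105412400, 20.4538, 0.8943, 7.033, 23.4375),
    (1105413000, 21.2045, 1.565, 8.6813, 21.875),
]
query = (20.1045, 1.0012, 9.1013, 11.575)
closest = nearest_brute_force(rows, query, k=2)
```

On a few hundred thousand rows a scan like this is a useful correctness baseline to compare any faster method against.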

I read about something called k-d trees, which are used for this purpose. Can anyone please explain, or point me to some material that explains, how to implement this in MySQL?
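To make the idea concrete, here is a minimal k-d tree sketch in Python. It is an assumption-laden illustration rather than a MySQL solution: it assumes the (M1..M4) points fit in application memory, and the names `build_kdtree` and `k_nearest` are invented for this example. The tree splits on one dimension per level, and the search skips a subtree whenever the splitting plane is farther away than the current k-th best match.

```python
import heapq

def build_kdtree(points, depth=0):
    """Build a k-d tree as nested (point, left, right) tuples; None is an empty subtree."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point becomes the node
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def k_nearest(tree, query, k):
    """Return the k points nearest to `query` (squared Euclidean distance), closest first."""
    best = []  # max-heap emulated with negated distances: (-dist2, point)

    def visit(node, depth):
        if node is None:
            return
        point, left, right = node
        dist2 = sum((a - b) ** 2 for a, b in zip(point, query))
        if len(best) < k:
            heapq.heappush(best, (-dist2, point))
        elif dist2 < -best[0][0]:
            heapq.heapreplace(best, (-dist2, point))
        axis = depth % len(query)
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near, depth + 1)
        # Cross the splitting plane only if it could hide a closer point.
        if len(best) < k or diff * diff < -best[0][0]:
            visit(far, depth + 1)

    visit(tree, 0)
    return [point for _, point in sorted(best, key=lambda e: -e[0])]
```

The tree would be built once during pre-processing and queried for each real-time input. With k as large as 5,000 the pruning saves less than it does for small k, so it is worth measuring against a plain scan.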

It may be relevant to mention that I can pre-process the table, but the input is received in real-time.

Currently I just make a rough cluster around the data on each dimension independently:

INSERT INTO Dim1 SELECT * FROM myTable AS myTable USE INDEX(M1) WHERE myTable.M1 < currentM1 ORDER BY M1 DESC LIMIT 2500;
INSERT INTO Dim1 SELECT * FROM myTable AS myTable USE INDEX(M1) WHERE myTable.M1 > currentM1 ORDER BY M1  ASC LIMIT 2500;

INSERT INTO Dim2 SELECT * FROM myTable AS myTable USE INDEX(M2) WHERE myTable.M2 < currentM2 ORDER BY M2 DESC LIMIT 2500;
INSERT INTO Dim2 SELECT * FROM myTable AS myTable USE INDEX(M2) WHERE myTable.M2 > currentM2 ORDER BY M2  ASC LIMIT 2500;

INSERT INTO Dim3 SELECT * FROM myTable AS myTable USE INDEX(M3) WHERE myTable.M3 < currentM3 ORDER BY M3 DESC LIMIT 2500;
INSERT INTO Dim3 SELECT * FROM myTable AS myTable USE INDEX(M3) WHERE myTable.M3 > currentM3 ORDER BY M3  ASC LIMIT 2500;

INSERT INTO Dim4 SELECT * FROM myTable AS myTable USE INDEX(M4) WHERE myTable.M4 < currentM4 ORDER BY M4 DESC LIMIT 2500;
INSERT INTO Dim4 SELECT * FROM myTable AS myTable USE INDEX(M4) WHERE myTable.M4 > currentM4 ORDER BY M4  ASC LIMIT 2500;

It is important to understand that I am interested in distance by rank, not by value.

Edit: I am a little closer to understanding how to do it (I think):
I need to pre-process each row of each metric and assign it a percentile which would represent its location (percent-wise) in its range.

For example, for any given value of M1:

percentile = (# rows with values less than input) / (# total rows)

If I calculate the input's percentile and use that for a nearest neighbor search instead of the actual value I will have effectively scaled the various metrics such that they could be used as dimensions.
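As a rough sketch of that percentile formula in Python (assuming each metric's values have been pre-sorted once during pre-processing; `percentile_of` is a name chosen for this example), a binary search over the sorted column counts the rows below the input:

```python
from bisect import bisect_left

def percentile_of(sorted_values, value):
    """Fraction of rows whose value is strictly less than `value`."""
    return bisect_left(sorted_values, value) / len(sorted_values)

# The ten M1 values from the excerpt, sorted once up front.
m1_sorted = sorted([12.9132, 14.056, 15.3613, 16.627501, 18.7843,
                    19.8375, 20.0168, 20.4538, 21.6078, 21.2045])

# Seven of the ten values are below the sample input's M1 of 20.1045.
input_percentile = percentile_of(m1_sorted, 20.1045)
```

Applying the same function to every stored row and every metric puts all four dimensions on a common 0 to 1 scale before any distance is computed.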

I am still lost on how to do the actual search though. Is this even possible to accomplish efficiently in MySQL?


羞稚 2024-12-05 16:06:35


You should be able to do a query like the following:

SELECT * FROM myTable
WHERE M1 BETWEEN searchM1 - radiusM1 AND searchM1 + radiusM1
  AND M2 BETWEEN searchM2 - radiusM2 AND searchM2 + radiusM2
  AND M3 BETWEEN searchM3 - radiusM3 AND searchM3 + radiusM3
  AND M4 BETWEEN searchM4 - radiusM4 AND searchM4 + radiusM4

In the case of a sphere, all the radius values will be the same, of course. You then adjust the radius until the query returns as close as possible to the number of records you want. I'd suggest a binary search.
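The radius adjustment could look roughly like the following Python sketch. It is an in-memory stand-in: `count_in_box` plays the role of running the BETWEEN query above with COUNT(*), and both function names, the starting bounds, and the step count are assumptions made for this example. Note that a BETWEEN filter with equal radii selects a box around the search point, so the returned rows would still need to be ranked by distance if a true sphere is wanted.

```python
def count_in_box(rows, center, radius):
    """Count rows whose every coordinate lies within `radius` of `center`.

    Mirrors the BETWEEN filter with the same radius on every dimension.
    """
    return sum(all(abs(v - c) <= radius for v, c in zip(row, center))
               for row in rows)

def find_radius(rows, center, target, lo=0.0, hi=1.0, steps=40):
    """Bisect for the smallest radius (to within tolerance) holding >= `target` rows.

    Assumes the box of radius `hi` already contains at least `target` rows.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2
        if count_in_box(rows, center, mid) >= target:
            hi = mid   # enough rows: try a smaller box
        else:
            lo = mid   # too few rows: grow the box
    return hi

# Toy data on a diagonal, already scaled to the 0..1 range.
rows = [(i / 100, i / 100) for i in range(101)]
radius = find_radius(rows, center=(0.5, 0.5), target=11)
```

Each bisection step costs one counting query, so a fixed number of steps (40 here) bounds the total work per input.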

I'm not sure if you want to mess with the distribution or not, but assuming you do, you would just need to give each search value a rank between the ranks of the two table values it falls between (e.g. if the value at rank 5 is 5.5, the value at rank 6 is 5.9, and the search value is 5.6, then the search rank could be 5.5).
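One way to assign such an in-between rank is linear interpolation, sketched below in Python. This is a variation on the suggestion above, not the same rule: for the example values it yields a rank of about 5.25 rather than the midpoint 5.5. The function name `interpolated_rank` and the 1-based ranking are assumptions for this example.

```python
from bisect import bisect_left

def interpolated_rank(sorted_values, value):
    """Fractional 1-based rank of `value` among `sorted_values`.

    Values outside the table's range are clamped to the first or last rank.
    """
    i = bisect_left(sorted_values, value)
    if i == 0:
        return 1.0
    if i == len(sorted_values):
        return float(len(sorted_values))
    below, above = sorted_values[i - 1], sorted_values[i]
    # `below` has rank i and `above` has rank i + 1; interpolate between them.
    return i + (value - below) / (above - below)

# Rank 5 holds 5.5 and rank 6 holds 5.9, as in the example above.
values = [1.0, 2.0, 3.0, 4.0, 5.5, 5.9]
search_rank = interpolated_rank(values, 5.6)
```

Dividing the resulting rank by the row count gives a percentile that is comparable with the stored rows' percentiles.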
