Improving the efficiency of DTW distance-matrix computation for time series on a large dataset in Python
I'm trying to improve the distance-matrix calculation over all possible pairs in a dataset of 109489 elements.
All the elements in the column are time series (stored as lists), and the metric I'm using for the calculations is DTW.
I've written the following code, which works reasonably well, but I want to improve it because it will take almost a year to finish the calculations:
import numpy as np
import joblib
import _ucrdtw

def compute_distance(cycle1, cycle2):
    # DTW via the UCR Suite: 0.05 is the allowed warping width as a
    # fraction of the series length, False disables verbose output
    return _ucrdtw.ucrdtw(cycle1, cycle2, 0.05, False)

cycles = joblib.load('new_cycles.pkl')
dim_matrix = np.empty((len(cycles), len(cycles)), dtype=object)

with joblib.Parallel(n_jobs=15, verbose=10) as parallel:
    # one Parallel call per row: row i receives the distances from
    # series i to every series j
    for i in range(len(cycles)):
        print(i)
        dim_matrix[i] = parallel(joblib.delayed(compute_distance)(
            cycles.iloc[i]['cycle_info_scaled'],
            cycles.iloc[j]['cycle_info_scaled']) for j in range(len(cycles)))
This is an example of what's stored in the cycle_info_scaled column of the data frame:
cycle_info_scaled
[0.9948399033268663, 0.9948399033268663, 0.995...
[1.0005639107922013, 1.00041937494038, 1.00020...
[0.9975316557982873, 0.9975316557982873, 0.997...
[1.0004695161595962, 1.000252692084482, 1.0002...
[0.9919345197891065, 0.9919345197891065, 0.991...
...
[0.9888568683957731,...
[1.0588396740044068,...
[0.9830848045072209,...
[0.9918426003000614, 0.9918426003000614, 0.991...
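Since standard DTW is symmetric (dist(a, b) == dist(b, a)) and the diagonal is zero, the code above computes roughly twice as many distances as needed. Below is a minimal sketch of the same computation restricted to the upper triangle, with the lists converted to float arrays once up front and all pairs submitted through a single Parallel call; it assumes _ucrdtw.ucrdtw returns a (location, distance) tuple, and the batch_size value is illustrative.

import itertools
import numpy as np
import joblib
import _ucrdtw

def compute_distance(cycle1, cycle2):
    # keep only the distance; ucrdtw also returns the match location
    return _ucrdtw.ucrdtw(cycle1, cycle2, 0.05, False)[1]

cycles = joblib.load('new_cycles.pkl')
# convert each list to a float array once, instead of on every call
series = [np.asarray(s, dtype=np.float64)
          for s in cycles['cycle_info_scaled']]
n = len(series)

# only the i < j pairs: DTW is symmetric and the diagonal is zero
with joblib.Parallel(n_jobs=15, verbose=10, batch_size=1024) as parallel:
    distances = parallel(
        joblib.delayed(compute_distance)(series[i], series[j])
        for i, j in itertools.combinations(range(n), 2))

# fill the full symmetric matrix (plain floats, not object dtype)
dist_matrix = np.zeros((n, n), dtype=np.float64)
for (i, j), d in zip(itertools.combinations(range(n), 2), distances):
    dist_matrix[i, j] = dist_matrix[j, i] = d

Even halved, n = 109489 still means about 6 billion DTW calls, and the dense float64 matrix alone is roughly 96 GB, so at full scale the pairs would have to be processed in chunks and the result written to disk (e.g. an np.memmap) rather than held in memory.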
The purpose is to use this code to do clustering of the time series (see the sketch at the end for how a precomputed matrix would feed a clusterer).
I'm a bit of an amateur in coding and machine learning...
How can I improve the efficiency of this code?
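For the clustering step, a precomputed distance matrix can be fed directly to algorithms that accept metric='precomputed'. A minimal sketch with scikit-learn's DBSCAN, assuming the matrix was saved to a hypothetical dtw_distances.npy file and using an eps value that would need tuning:

import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical file holding the dense, symmetric DTW distance matrix
dist_matrix = np.load('dtw_distances.npy')

# DBSCAN consumes the matrix directly; eps must be tuned to the data
clustering = DBSCAN(eps=0.05, min_samples=5, metric='precomputed')
labels = clustering.fit_predict(dist_matrix)  # label -1 marks noise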