Improving the efficiency of DTW distance-matrix computation for time series on a large dataset in Python
I'm trying to improve the distance-matrix calculation over all possible pairs in a dataset of 109489 elements.
All the elements in the column are time series (stored as lists), and the metric I'm using for the calculations is DTW.
I've written the following code, which works reasonably well, but I want to improve it because it will take almost a year to finish the calculations:
import numpy as np
import joblib
import _ucrdtw

def compute_distance(cycle1, cycle2):
    # DTW via the UCR Suite: 0.05 is the allowed warping width as a
    # fraction of the series length, False disables verbose output
    return _ucrdtw.ucrdtw(cycle1, cycle2, 0.05, False)

cycles = joblib.load('new_cycles.pkl')
dim_matrix = np.empty((len(cycles), len(cycles)), dtype=object)

with joblib.Parallel(n_jobs=15, verbose=10) as parallel:
    # one Parallel call per row: row i receives the distances from
    # series i to every series j
    for i in range(len(cycles)):
        print(i)
        dim_matrix[i] = parallel(joblib.delayed(compute_distance)(
            cycles.iloc[i]['cycle_info_scaled'],
            cycles.iloc[j]['cycle_info_scaled']) for j in range(len(cycles)))
This is an example of what's stored in the cycle_info_scaled column of the data frame:
cycle_info_scaled
[0.9948399033268663, 0.9948399033268663, 0.995...
[1.0005639107922013, 1.00041937494038, 1.00020...
[0.9975316557982873, 0.9975316557982873, 0.997...
[1.0004695161595962, 1.000252692084482, 1.0002...
[0.9919345197891065, 0.9919345197891065, 0.991...
...
[0.9888568683957731,...
[1.0588396740044068,...
[0.9830848045072209,...
[0.9918426003000614, 0.9918426003000614, 0.991...
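Since standard DTW is symmetric (dist(a, b) == dist(b, a)) and the diagonal is zero, the code above computes roughly twice as many distances as needed. Below is a minimal sketch of the same computation restricted to the upper triangle, with the lists converted to float arrays once up front and all pairs submitted through a single Parallel call; it assumes _ucrdtw.ucrdtw returns a (location, distance) tuple, and the batch_size value is illustrative.

import itertools
import numpy as np
import joblib
import _ucrdtw

def compute_distance(cycle1, cycle2):
    # keep only the distance; ucrdtw also returns the match location
    return _ucrdtw.ucrdtw(cycle1, cycle2, 0.05, False)[1]

cycles = joblib.load('new_cycles.pkl')
# convert each list to a float array once, instead of on every call
series = [np.asarray(s, dtype=np.float64)
          for s in cycles['cycle_info_scaled']]
n = len(series)

# only the i < j pairs: DTW is symmetric and the diagonal is zero
with joblib.Parallel(n_jobs=15, verbose=10, batch_size=1024) as parallel:
    distances = parallel(
        joblib.delayed(compute_distance)(series[i], series[j])
        for i, j in itertools.combinations(range(n), 2))

# fill the full symmetric matrix (plain floats, not object dtype)
dist_matrix = np.zeros((n, n), dtype=np.float64)
for (i, j), d in zip(itertools.combinations(range(n), 2), distances):
    dist_matrix[i, j] = dist_matrix[j, i] = d

Even halved, n = 109489 still means about 6 billion DTW calls, and the dense float64 matrix alone is roughly 96 GB, so at full scale the pairs would have to be processed in chunks and the result written to disk (e.g. an np.memmap) rather than held in memory.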
The purpose is to use this code to do clustering of the time series (see the sketch at the end for how a precomputed matrix would feed a clusterer).
I'm a bit of an amateur in coding and machine learning...
How can I improve the efficiency of this code?
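For the clustering step, a precomputed distance matrix can be fed directly to algorithms that accept metric='precomputed'. A minimal sketch with scikit-learn's DBSCAN, assuming the matrix was saved to a hypothetical dtw_distances.npy file and using an eps value that would need tuning:

import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical file holding the dense, symmetric DTW distance matrix
dist_matrix = np.load('dtw_distances.npy')

# DBSCAN consumes the matrix directly; eps must be tuned to the data
clustering = DBSCAN(eps=0.05, min_samples=5, metric='precomputed')
labels = clustering.fit_predict(dist_matrix)  # label -1 marks noise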