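How can I perform K-means clustering on time series data?
I understand how this works when the input data is a set of points, but I don't know how to cluster time series that are 1×M, where M is the data length. In particular, I'm not sure how to update the cluster mean for time series data.
I have a set of labelled time series, and I want to use the K-means algorithm to check whether I get similar labels back. My X matrix will be N×M, where N is the number of time series and M is the data length, as mentioned above.
Does anyone know how to do this? For example, how could I modify this k-means MATLAB code so that it works for time series data? Also, I would like to be able to use distance metrics other than Euclidean distance.
To better illustrate my doubts, here is the code I modified for time series data: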
n = size(X,1);                          % number of time series (rows of X)
% Check if second input is centroids
if ~isscalar(k)
    c = k;
    k = size(c,1);
else
    c = X(randperm(n,k),:);             % pick k distinct series as initial centroids
end
% Allocate variables
g0 = ones(n,1);
gIdx = zeros(n,1);
D = zeros(n,k);
% Main loop: converged when the previous partition equals the current one
while any(g0 ~= gIdx)
    g0 = gIdx;
    % Loop over each centroid
    for t = 1:k
        % Loop over each time series
        for s = 1:n
            D(s,t) = sqrt(sum((X(s,:) - c(t,:)).^2));   % Euclidean; swap in another metric here
        end
    end
    % Assign each series to its closest centroid
    [~, gIdx] = min(D, [], 2);
    % Update centroids using the means of the partitions
    for t = 1:k
        % Is this how we calculate the new mean of the time series?
        members = (gIdx == t);
        if any(members)                 % keep the old centroid if the cluster is empty
            c(t,:) = mean(X(members,:), 1);   % dim 1, so a one-member cluster stays 1xM
        end
    end
end
Comments (5)
Time series are usually high-dimensional, and you need a specialized distance function to compare them for similarity. Plus, there might be outliers.
k-means is designed for low-dimensional spaces with a (meaningful) Euclidean distance. It is not very robust to outliers, as it puts squared weight on them.
Using k-means on time series data doesn't sound like a good idea to me. Try looking into more modern, robust clustering algorithms. Many will allow you to use arbitrary distance functions, including time series distances such as DTW.
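To make the DTW suggestion concrete, here is a minimal sketch of dynamic time warping next to plain Euclidean distance. It assumes univariate series, an absolute-difference local cost, and no warping window; the function names are illustrative and do not come from any particular library.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] also matches b[j-2]
                                     cost[i][j - 1],      # b[j-1] also matches a[i-2]
                                     cost[i - 1][j - 1])  # advance both series
    return cost[n][m]

def euclidean_distance(a, b):
    """Point-by-point Euclidean distance; requires equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The same bump, shifted by one time step (made-up data).
base = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0]

print(dtw_distance(base, shifted))        # 0.0
print(euclidean_distance(base, shifted))  # 2.0
```

On this toy pair, Euclidean distance penalizes the one-step shift while DTW warps it away. That is the kind of difference that matters when series are similar in shape but misaligned in time.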
It's probably too late for an answer, but:
The methods above use R. You'll find more methods by looking, e.g., for "Iterative Incremental Clustering of Time Series".
I have recently come across the kml R package, which claims to implement k-means clustering for longitudinal data. I have not tried it out myself. Also, the paper "Time-series clustering - A decade review" by S. Aghabozorgi, A. S. Shirkhorshidi and T. Ying Wah might be useful to you in seeking out alternatives. Another nice paper, although somewhat dated, is "Clustering of time series data-a survey" by T. Warren Liao.
If you really did want to use clustering, then depending on your application you could generate a low-dimensional feature vector for each time series. For example, use the time series mean, the standard deviation, the dominant frequency from a Fourier transform, etc. This would be suitable for use with k-means, but whether it gives you useful results depends on your specific application and the content of your time series.
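As a rough sketch of the feature-vector idea, the snippet below reduces one series to three numbers: mean, standard deviation, and dominant frequency. The helper names, the naive DFT, and the choice to ignore the zero-frequency (DC) bin are assumptions made for illustration, not part of the original answer.

```python
import cmath
import math
import statistics

def dominant_frequency(series, sample_rate=1.0):
    """Frequency of the strongest non-DC bin of a naive O(n^2) DFT."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the DC component first
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):         # positive-frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate / n

def feature_vector(series, sample_rate=1.0):
    """Summarize a series as [mean, population std, dominant frequency]."""
    return [sum(series) / len(series),
            statistics.pstdev(series),
            dominant_frequency(series, sample_rate)]

# Made-up example: a sine with 4 cycles over 64 samples, offset by 3.
series = [3.0 + math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
features = feature_vector(series)
print(features)  # roughly [3.0, 0.707, 0.0625]
```

Stacking one such vector per series gives an N×3 matrix that ordinary k-means can handle. The features usually need rescaling first, since they are on different scales.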
I don't think k-means is the right way to do it either. As @Anony-Mousse suggested, you can use DTW. In fact, I had the same problem in one of my projects and wrote my own class for it in Python. The logic is to consider the n! / (k! (n-k)!) possible combinations of k series out of the n available; these would be something like potential centers. The Python implementation is here if you're interested.
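The linked implementation is not reproduced here, so what follows is only one possible reading of the combinations idea. It assumes each combination of k series is tried as a set of candidate centers, each series is assigned to its nearest center under DTW, and the combination with the smallest total distance wins. All names are hypothetical, and the exhaustive search is only feasible for small n.

```python
import itertools
import math

def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def best_centers(series, k, distance=dtw_distance):
    """Try every k-subset of the series as centers; keep the cheapest one."""
    n = len(series)
    # Pairwise distances are computed once and reused for every combination.
    dist = [[distance(series[i], series[j]) for j in range(n)] for i in range(n)]
    best = (None, None, math.inf)
    for centers in itertools.combinations(range(n), k):  # n! / (k! (n-k)!) candidates
        # Label each series with the index of its nearest candidate center.
        labels = [min(centers, key=lambda c: dist[i][c]) for i in range(n)]
        total = sum(dist[i][labels[i]] for i in range(n))
        if total < best[2]:
            best = (centers, labels, total)
    return best

# Made-up data: two low bumps and two high bumps, each pair shifted in time.
data = [[0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0],
        [5, 5, 6, 7, 6, 5], [5, 6, 7, 6, 5, 5]]
centers, labels, total = best_centers(data, k=2)
print(centers, labels, total)  # (0, 2) [0, 0, 2, 2] 0.0
```

Because the centers are always actual series, this is closer to k-medoids than to k-means. It avoids having to define a "mean" time series under DTW.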