How can I do K-means clustering of time series data?

Posted 2024-09-15 02:22:51

How can I do K-means clustering of time series data?
I understand how this works when the input data is a set of points, but I don't know how to cluster time series that are each 1 x M, where M is the data length. In particular, I'm not sure how to update the cluster mean for time series data.

I have a set of labelled time series, and I want to use the K-means algorithm to check whether I get back similar labels. My X matrix will be N x M, where N is the number of time series and M is the data length, as mentioned above.

Does anyone know how to do this? For example, how could I modify this k-means MATLAB code so that it would work for time series data? Also, I would like to be able to use distance metrics other than Euclidean distance.

To better illustrate my question, here is the code I modified for time series data:


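% X is N-by-M: one time series per row, M samples per series
n=size(X,1);

% Check if second input is centroids
if ~isscalar(k) 
    c=k;
    k=size(c,1);
else
    c=X(ceil(rand(k,1)*n),:); % assign centroid randomly at start
end

% allocating variables
g0=ones(n,1); 
gIdx=zeros(n,1);
D=zeros(n,k);

% Main loop: converged when the partition is the same as the previous one
while any(g0~=gIdx)
%     disp(sum(g0~=gIdx))
    g0=gIdx;
    % Loop for each centroid
    for t=1:k
        % Loop for each time series (row of X)
        for s=1:n
            % Euclidean distance; another metric could be swapped in here
            D(s,t) = sqrt(sum((X(s,:)-c(t,:)).^2)); 
        end
    end
    % Partition data to closest centroids
    [z,gIdx]=min(D,[],2);
    % Update centroids using means of partitions
    for t=1:k
        % Is this how we calculate new mean of the time series?
        members=(gIdx==t);
        if any(members) % an empty cluster keeps its previous centroid
            % Average along dim 1 so each time index is averaged separately,
            % even when the cluster contains a single series
            c(t,:)=mean(X(members,:),1);
        end
    end
end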
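
To make the centroid update concrete: with Euclidean distance, each series is just a point in M-dimensional space, so the cluster mean is the element-wise average of its member series at every time index. Below is a minimal Python sketch of that idea (standard library only, not a translation of the MATLAB code above). The names `kmeans_series`, `euclidean`, and the `dist` parameter are made up for illustration. Note that if a non-Euclidean `dist` is passed in, the element-wise mean is no longer guaranteed to minimize the within-cluster distance, so `max_iter` is there as a safeguard.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_series(X, k, dist=euclidean, max_iter=100, seed=0):
    """Cluster the rows of X (N series of length M); returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct series chosen at random (hypothetical init choice).
    centroids = [list(row) for row in rng.sample(X, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each series goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda t: dist(row, centroids[t])) for row in X
        ]
        if new_labels == labels:
            break  # partition unchanged, so stop
        labels = new_labels
        # Update step: the new centroid is the mean at every time index.
        for t in range(k):
            members = [row for row, g in zip(X, labels) if g == t]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[t] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Since the series in the question are already labelled, the returned `labels` could then be compared against the known labels, keeping in mind that cluster numbering is arbitrary.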

Comments (5)

独夜无伴 2024-09-22 02:22:51

Time series are usually high-dimensional, and you need a specialized distance function to compare them for similarity. Plus, there might be outliers.

k-means is designed for low-dimensional spaces with a (meaningful) Euclidean distance. It is not very robust to outliers, because it puts squared weight on them.

Using k-means on time series data doesn't sound like a good idea to me. Try looking into more modern, robust clustering algorithms. Many will let you use arbitrary distance functions, including time series distances such as DTW (dynamic time warping).
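
For readers unfamiliar with DTW, here is a minimal sketch of the textbook dynamic-programming formulation in plain Python, using the absolute difference as the per-sample cost. The function name `dtw_distance` is hypothetical, and real libraries usually add windowing constraints and other optimizations that are left out here.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost.

    Runs in O(len(a) * len(b)) time; the two series may differ in length.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]
```

For example, `dtw_distance([0, 1, 0, 0], [0, 0, 1, 0])` is `0.0`, because the warping absorbs the one-step shift, whereas the Euclidean distance between the same two series is nonzero. A distance like this pairs more naturally with algorithms that only need pairwise distances than with a mean-based centroid update.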

夏の忆 2024-09-22 02:22:51

It's probably too late for an answer, but:

The methods above use R. You can find more methods by searching for, e.g., "Iterative Incremental Clustering of Time Series".

归途 2024-09-22 02:22:51

I have recently come across the kml R package, which claims to implement k-means clustering for longitudinal data. I have not tried it out myself.

Also, the paper "Time-series clustering - A decade review" by S. Aghabozorgi, A. S. Shirkhorshidi and T. Ying Wah might be useful if you are looking for alternatives. Another nice paper, although somewhat dated, is "Clustering of time series data - a survey" by T. Warren Liao.

情绪少女 2024-09-22 02:22:51

If you really did want to use clustering, then depending on your application you could generate a low-dimensional feature vector for each time series. For example, use the time series mean, the standard deviation, the dominant frequency from a Fourier transform, etc. This would be suitable for use with k-means, but whether it would give you useful results depends on your specific application and the content of your time series.
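
As a rough illustration of such a feature vector, the sketch below computes the three features mentioned (mean, standard deviation, dominant frequency) with only the Python standard library. The name `series_features` and the `sample_rate` parameter are assumptions for this example, and the naive O(n^2) DFT is used only to keep it self-contained; an FFT routine would normally be used instead.

```python
import cmath
import math
import statistics


def series_features(x, sample_rate=1.0):
    """Return (mean, standard deviation, dominant frequency) for one series."""
    n = len(x)
    mean = statistics.fmean(x)
    std = statistics.pstdev(x)  # population standard deviation
    # Naive DFT of the mean-removed signal; bin 0 (the DC term) is skipped.
    centered = [v - mean for v in x]
    best_bin, best_mag = 0, 0.0
    for f in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * f * t / n)
            for t, v in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = f, abs(coeff)
    # Convert the strongest bin index to a frequency in cycles per time unit.
    return mean, std, best_bin * sample_rate / n
```

Each series then becomes a 3-element point, and an ordinary k-means can be run on those points. The features usually have different scales, so standardizing them first is probably worthwhile.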

别闹i 2024-09-22 02:22:51

I don't think k-means is the right approach for this either. As @Anony-Mousse suggested, you can use DTW. In fact, I had the same problem in one of my projects and wrote my own class for it in Python. The logic is:

  1. Create all of your cluster combinations, where k is the cluster count and n is the number of series. The number of combinations returned should be n! / k! / (n-k)!. These act as the potential sets of centers.
  2. For each series, calculate the distance to each center in each combination and assign the series to the closest one.
  3. For each combination, calculate the total distance within the individual clusters.
  4. Choose the combination with the minimum total distance.

The Python implementation is here if you're interested.
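
The four steps above can be sketched as follows. This is not the linked implementation, only a minimal illustration of the described logic; `best_centers` and `manhattan` are hypothetical names, and any distance function (DTW, for instance) could be passed as `dist`.

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute differences; stands in for any series distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_centers(series, k, dist):
    """Try every k-subset of the series as centers.

    Returns (total_distance, center_indices, labels), where labels[i] is the
    index of the center series that series[i] was assigned to.
    """
    best = None
    # Step 1: every combination of k series is a candidate set of centers.
    for centers in combinations(range(len(series)), k):
        labels, total = [], 0.0
        for s in series:
            # Step 2: assign the series to its closest candidate center.
            d, c = min((dist(s, series[c]), c) for c in centers)
            labels.append(c)
            total += d  # Step 3: accumulate the within-cluster distance.
        # Step 4: keep the combination with the smallest total distance.
        if best is None or total < best[0]:
            best = (total, centers, labels)
    return best
```

Because the number of combinations grows as n! / k! / (n-k)!, this exhaustive search is only practical for a small number of series.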
