Clustering and MATLAB

Published 2024-12-09 10:00:50

I'm trying to cluster some data from the KDD Cup 1999 dataset.

A record in the file looks like this:

0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.

There are about 48,000 records in that format. I have cleaned the data and removed the text fields, keeping only the numbers. The output now looks like this:

[screenshot: the cleaned, numeric-only data]

I created a comma-delimited file in Excel, saved it as a CSV file, and created a data source from the CSV in MATLAB. I've tried running it through the fcm toolbox in MATLAB (findcluster reports 38 data types, which is expected with 38 columns).

However, the clusters don't look like clusters, or the tool isn't accepting the data and working the way I need it to.

Could anyone help me find the clusters? I'm new to MATLAB, so I have no experience with it, and I'm also new to clustering.

The method:

  1. Choose the number of clusters (K)
  2. Initialize the centroids (K patterns chosen randomly from the data set)
  3. Assign each pattern to the cluster with the closest centroid
  4. Recompute the mean of each cluster as its new centroid
  5. Repeat from step 3 until a stopping criterion is met (no pattern moves to another cluster)
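The five steps above are standard (hard) K-means. For reference, here is a minimal sketch of them in Python/NumPy (the question uses MATLAB, but the algorithm is the same); `kmeans` and its parameters are illustrative names, not part of any toolbox:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means, following the five steps listed above.
    Illustrative helper; this is not the MATLAB fcm routine."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids as K patterns chosen at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each pattern to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no pattern moves to another cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each cluster's mean as its new centroid
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

Note that `fcm` is *fuzzy* c-means, which assigns soft membership degrees (the `U` matrix in the code below) rather than the hard assignments of step 3.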

This is what I'm trying to achieve:

[image: the desired clustering result]

This is what I'm getting:

[image: the actual clustering result]

load kddcup1.dat
plot(kddcup1(:,1),kddcup1(:,2),'o')  
[center,U,objFcn] = fcm(kddcup1,2);
Iteration count = 1, obj. fcn = 253224062681230720.000000
Iteration count = 2, obj. fcn = 241493132059137410.000000
Iteration count = 3, obj. fcn = 241484544542298110.000000
Iteration count = 4, obj. fcn = 241439204971005280.000000
Iteration count = 5, obj. fcn = 241090628742523840.000000
Iteration count = 6, obj. fcn = 239363408546874750.000000
Iteration count = 7, obj. fcn = 238580863900727680.000000
Iteration count = 8, obj. fcn = 238346826370420990.000000
Iteration count = 9, obj. fcn = 237617756429912510.000000
Iteration count = 10, obj. fcn = 226364785036628320.000000
Iteration count = 11, obj. fcn = 94590774984961184.000000
Iteration count = 12, obj. fcn = 2220521449216102.500000
Iteration count = 13, obj. fcn = 2220521273191876.200000
Iteration count = 14, obj. fcn = 2220521273191876.700000
Iteration count = 15, obj. fcn = 2220521273191876.700000

figure
plot(objFcn)
title('Objective Function Values')
xlabel('Iteration Count')
ylabel('Objective Function Value')

maxU = max(U);
index1 = find(U(1,:) == maxU);
index2 = find(U(2,:) == maxU);
figure
line(kddcup1(index1,1), kddcup1(index1,2), 'linestyle', ...
    'none', 'marker', 'o', 'color', 'g');
line(kddcup1(index2,1), kddcup1(index2,2), 'linestyle', ...
    'none', 'marker', 'x', 'color', 'r');
hold on
plot(center(1,1), center(1,2), 'ko', 'markersize', 15, 'LineWidth', 2)
plot(center(2,1), center(2,2), 'kx', 'markersize', 15, 'LineWidth', 2)



Answer by 夜雨飘雪, 2024-12-16 10:00:50

Since you are new to machine learning / data mining, you shouldn't tackle such an advanced problem right away. After all, the data you are working with was used in a competition (KDD Cup '99), so don't expect it to be easy!

Besides, the data was intended for a classification task (supervised learning), where the goal is to predict the correct class (bad/good connection). You seem to be interested in clustering (unsupervised learning), which is generally harder.

This sort of dataset requires a lot of preprocessing and clever feature extraction. People usually employ domain knowledge (network intrusion detection) to obtain better features from the raw data. Directly applying simple algorithms like K-means will generally yield poor results.

For starters, you need to normalize the attributes to the same scale: when computing the Euclidean distance in step 3 of your method, features with values such as 239 and 486 will dominate features with small values such as 0.05, thus distorting the result.
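The usual fix is z-score standardization: subtract each column's mean and divide by its standard deviation. In MATLAB the Statistics Toolbox's `zscore` function does this; as a language-neutral sketch (the function name here is just illustrative):

```python
import numpy as np

def zscore_normalize(X):
    """Rescale every column to zero mean and unit variance, so that
    large-valued features (e.g. 239, 486) no longer dominate the
    Euclidean distance over small-valued ones (e.g. 0.05)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns alone instead of dividing by zero
    return (X - mu) / sigma
```

After this step, every feature contributes on a comparable scale to the distances computed during clustering.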

Another point to remember is that too many attributes can be a bad thing (the curse of dimensionality). You should therefore look into feature selection or dimensionality reduction techniques.
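PCA is the most common dimensionality-reduction starting point (MATLAB's Statistics Toolbox provides a `pca` function). A minimal SVD-based sketch, with `pca_project` as an illustrative name:

```python
import numpy as np

def pca_project(X, n_components):
    """Project centered data onto its top principal components via SVD.
    Illustrative only; reduces 38 KDD columns to a handful of directions
    that capture most of the variance."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

Projecting the standardized data down to two or three components also makes a 2-D scatter plot of the clusters meaningful, unlike plotting two arbitrary raw columns as in the question's code.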

Finally, I suggest you familiarize yourself with a simpler dataset...

