Clustering text in MATLAB

Posted on 2024-09-17 10:46:05 | Word count: 3298 | Views: 8 | Comments: 0

I want to do hierarchical agglomerative clustering on texts in MATLAB. Say I have four sentences:

I have a pen.
I have a paper. 
I have a pencil.
I have a cat.

I want to cluster the four sentences above to see which are more similar. I know the Statistics Toolbox has commands like pdist to measure pairwise distances, linkage to calculate the cluster similarity, etc. A simple piece of code like:

X=[1 2; 2 3; 1 4];
Y=pdist(X, 'euclidean');
Z=linkage(Y, 'single');
H=dendrogram(Z)

works fine and returns a dendrogram.

I wonder whether I can use these commands on texts like the ones above. Any thoughts?

UPDATES:

Thanks to Amro. I read the answer, understood it, and computed the distances among the strings. The code follows:

clc
S1='I have a pen'; % first String

f_id=fopen('events.txt','r'); %saved strings to compare with
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events); % number of saved strings
% compare S1 against each saved string

for kk=1:ii

   S2=events(kk);
   S2=cell2mat(S2);
   Z=levenshtein_distance(S1,S2);
   X(kk)=Z;

end

I input one string and had 4 saved strings. I then calculated the distances using the levenshtein_distance function. It returns a vector X=[ 17 0 16 18 16].

** I guess this is my pairwise distance matrix, similar to what pdist produces. Is it?

** Now I'm trying to pass X to linkage, like this:

Z=linkage(X, 'single');

The output I'm getting is:

Error using ==> linkage at 93
Size of Y not compatible with the output of the PDIST function.
Error in ==> Untitled2 at 20
Z=linkage(X,'single')

Why is that? Can I use the linkage function at all? Help appreciated.

UPDATE 2

clc
S1='I have a pen';

f_id=fopen('events.txt','r');
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events)+1; % total number of strings in the comparison

D=zeros(ii, ii); % initialized distance matrix;
for kk=1:ii 

    S2=events(kk);

    %S2=cell2mat(S2);

    for jk=kk+1:ii

  D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

    end

end

D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)

Error: ??? Cell contents reference from a non-cell array object.
Error in ==> Untitled2 at 22
D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

Also, why am I reading the events from the file inside the first loop? That doesn't seem logical. I'm a bit confused about whether I can work this way, or whether the only solution is to type all the strings into the code. Help much appreciated.

UPDATE

Code to compare two sentences:

clc
str1 = 'Fire in NY';
str2 = 'Jeff is sick';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default');

Output: D=18.

With different strings:

clc
str1 = 'Fire in NY';
str2= 'NY catches fire';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default');

D=28.

Based on these distances, the completely unrelated sentence (D=18) looks more similar than the paraphrase (D=28). What I'm trying to do is this: if I have already stored 'Fire in NY', I won't store 'NY catches fire'. In the first case, however, I would store the sentence because the information is new.

Is LD (Levenshtein distance) sufficient to do this? Help appreciated.

山人契 2024-09-24 10:46:31

What you need is a distance function that can handle strings. Check out the Levenshtein distance (edit distance). There are plenty of implementations out there.
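Note that levenshtein_distance is not a built-in MATLAB function, so one of those implementations has to be on the path. As a rough sketch of what such a helper might look like (a hypothetical stand-in written for illustration, not the specific implementation used in this thread), the standard dynamic-programming version can be saved as levenshtein_distance.m:

```matlab
function d = levenshtein_distance(s, t)
%LEVENSHTEIN_DISTANCE  Edit distance between two character vectors.
%   Hypothetical stand-in for the helper used in this thread: counts the
%   minimum number of single-character insertions, deletions and
%   substitutions needed to turn s into t.
    m = numel(s);
    n = numel(t);
    %# M(i+1,j+1) holds the distance between s(1:i) and t(1:j)
    M = zeros(m+1, n+1);
    M(:,1) = (0:m)';    %# delete all i characters of s
    M(1,:) = 0:n;       %# insert all j characters of t
    for i = 1:m
        for j = 1:n
            cost = double(s(i) ~= t(j));           %# 0 if characters match
            M(i+1,j+1) = min([M(i,j+1) + 1, ...    %# deletion
                              M(i+1,j) + 1, ...    %# insertion
                              M(i,j)   + cost]);   %# substitution
        end
    end
    d = M(m+1, n+1);
end
```

For example, levenshtein_distance('kitten','sitting') returns 3. The comparison is character by character and case-sensitive, so sentences that share words in a different order can still end up far apart.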

Alternatively, you could extract some interesting features (e.g. number of vowels, length of the string, etc.) to build a vector space representation; then you can apply any of the usual distance measures (Euclidean, ...) to the new representation.
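A minimal sketch of that feature-based route, assuming three illustrative features (string length, vowel count, word count) picked only for demonstration; real features would have to be chosen for the task at hand:

```matlab
%# instances (same sentences as in the question)
str = {'I have a pen.'
    'I have a paper.'
    'I have a pencil.'
    'I have a cat.'};

%# one row per string; the three columns are illustrative features only
numStr = numel(str);
F = zeros(numStr, 3);
for i = 1:numStr
    s = lower(str{i});
    F(i,1) = numel(s);                                  %# string length
    F(i,2) = sum(ismember(s, 'aeiou'));                 %# number of vowels
    F(i,3) = numel(regexp(strtrim(s), '\s+', 'split')); %# number of words
end

Y = pdist(F, 'euclidean');   %# pairwise distances in feature space
T = linkage(Y, 'single');
dendrogram(T, 'Labels', str)
```

Because F is an ordinary numeric matrix, pdist can be used directly and no manual distance loop is needed. The downside of such coarse features shows up immediately: the 'pen' and 'cat' sentences get identical rows in F, so their distance is zero even though the strings differ.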

EDIT

The problem with your code is that LINKAGE expects the format of the input distances to match the output of PDIST, namely a row vector corresponding to pairs of observations in the order 1-vs-2, 1-vs-3, 2-vs-3, etc. This is basically the lower half of the complete distance matrix (since it's supposed to be symmetric, with dist(1,2) == dist(2,1)).

%# instances
str = {'I have a pen.'
    'I have a paper.'
    'I have a pencil.'
    'I have a cat.'};
numStr = numel(str);

%# create and fill upper half only of distance matrix
D = zeros(numStr,numStr);
for i=1:numStr
    for j=i+1:numStr
        D(i,j) = levenshtein_distance(str{i},str{j});
    end
end
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)
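To make the vector format concrete, here is a small sketch with a hand-written 3-by-3 matrix (the values are made up for illustration). It also shows why a vector of length 5, like X in the question, is rejected: a pdist-style vector for m observations must have exactly m*(m-1)/2 entries.

```matlab
%# small symmetric distance matrix with zeros on the diagonal
D = [0 1 2
     1 0 3
     2 3 0];

y = squareform(D, 'tovector')    %# pairs (1,2), (1,3), (2,3)

%# a pdist-style vector for m observations has m*(m-1)/2 entries,
%# so recover m from the length and check that it is a whole number
len = 5;                         %# length of X in the question
m = (1 + sqrt(1 + 8*len)) / 2;   %# solves m*(m-1)/2 == len
isValid = (m == round(m))        %# false: no m gives exactly 5 pairs
```

Here y comes out as [1 2 3]. Valid lengths are 1, 3, 6, 10, and so on; five strings, for instance, would need 10 pairwise distances rather than 5.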

Please refer to the documentation of the functions in question for more information...
