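Implementing a proximity matrix for clustering
I am fairly new to this field, so please pardon me if the question sounds trivial or basic.
I have a dataset (a bag of words, to be specific), and I need to generate a proximity matrix from the edit distances between the words.
However, I am quite confused about how to keep track of my data/strings in the matrix. I need the proximity matrix for clustering.
More generally, how do you usually approach this kind of problem in the field? I am using Perl and R to implement this.
Here is typical code I have written in Perl that reads from a text file containing my bag of words: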
use strict;
use warnings;
use Text::Levenshtein qw(distance);

main(@ARGV);

sub main
{
    my @TokenDistances;
    my $Tokenfile = 'TokenDistinct.txt';
    my @Token;
    my $AppendingCount = 0;
    my @Tokencompare;
    my %Levcount = ();

    # Read one token per line; whitespace-only lines are blanked out.
    open(my $in, '<', $Tokenfile) or die("Error opening file. $!");
    while (<$in>)
    {
        chomp $_;
        $_ =~ s/^\s+$//;
        push(@Token, $_);
    }
    close($in);

    @Tokencompare = @Token;
    foreach my $tokenWord (@Tokencompare)
    {
        my $lengthoffile = scalar @Tokencompare;
        chomp $tokenWord;
        for (my $i = 0; $i < $lengthoffile; $i++)
        {
            if (scalar @TokenDistances == scalar @Tokencompare)
            {
                print "Yipeeeeeeeeeeeeeeeeeeeee\n";
            }
            chomp $Tokencompare[$i];
            # Distance from the current word to every word, keyed by both strings.
            $Levcount{$tokenWord}{$Tokencompare[$i]} = levDistance($tokenWord, $Tokencompare[$i]);
        }
        StoreSortedValues(\%Levcount, \$tokenWord, \$AppendingCount);
        $AppendingCount++;
        %Levcount = ();
    }
}

sub levDistance
{
    my $string1 = shift;
    my $string2 = shift;
    return distance($string1, $string2);
}

sub StoreSortedValues
{
    my $Levcount = shift;
    my $tokenWordTopMost = ${(shift)};
    my $j = ${(shift)};
    my @ListToken;
    my $Tokenfile = 'LevResult.txt';

    # Truncate the result file on the first call, append on later calls.
    my $mode = ($j == 0) ? '>' : '>>';
    open(my $out, $mode, $Tokenfile) or die("Error opening file. $!");

    print $tokenWordTopMost;
    my %tokenWordMaster = %{ $Levcount->{$tokenWordTopMost} };
    # Distances are numbers, so sort with <=> rather than the string operator cmp.
    @ListToken = sort { $tokenWordMaster{$a} <=> $tokenWordMaster{$b} } keys %tokenWordMaster;
    print $out "-------------------------- " . $tokenWordTopMost . "-------------------------------------\n";
    foreach my $tokey (@ListToken)
    {
        print $out "$tokey=>\t" . $tokenWordMaster{$tokey} . "\n";
    }
    close($out) or die("Error Closing File. $!");
}
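The problem is how to represent the proximity matrix from this while still being able to keep track of which comparison corresponds to which entry in my matrix.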
Comments (2)
In the RecordLinkage package there is the levenshteinDist function, which is one way of calculating an edit distance between strings.
Start by setting up some data (a vector of fruit names, in this example). Then create a matrix of zeros to reserve memory for the distance table, and use nested for loops to calculate the individual distances.
The result is a matrix with a row and a column for each fruit, so the rows and columns can be renamed to match the original vector. That naming is what lets you trace each cell back to the pair of strings it compares.
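A minimal sketch of the steps described in this answer. The fruit vector here is hypothetical example data, and base R's adist() (from the utils package) is used in place of levenshteinDist so that no extra package is needed; treat it as an illustration of the approach rather than the answer's exact code.

```r
# Hypothetical example data; any character vector of tokens works.
fruit <- c("apple", "apricot", "banana", "cherry")

# Pre-allocate an N x N matrix of zeros for the distance table.
n <- length(fruit)
distance <- matrix(0, nrow = n, ncol = n)

# Fill in each pairwise edit distance.
# utils::adist() returns a 1 x 1 matrix for two single strings.
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    distance[i, j] <- utils::adist(fruit[i], fruit[j])[1, 1]
  }
}

# Label rows and columns with the original strings so that every cell
# can be traced back to the pair of words it compares.
dimnames(distance) <- list(fruit, fruit)

# Look up a distance by name instead of by position.
distance["apple", "apricot"]
```

Because the row and column names are the strings themselves, there is no need to keep a separate index mapping positions to words.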
The proximity or similarity (or dissimilarity) matrix is just a table that stores the similarity score for pairs of objects. So, if you have N objects, the R code can be simMat <- matrix(nrow = N, ncol = N), and each entry (i,j) of simMat then indicates the similarity between item i and item j.
In R, you can use several packages, including vwr, to calculate the Levenshtein edit distance.
You may also find this Wikibook of interest: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
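As a minimal sketch of how such a matrix can feed a clustering step: the words vector below is hypothetical, and base R's adist(), hclust() and cutree() are assumed here rather than vwr. Since an edit distance is a dissimilarity (larger means less alike), the matrix can be passed to as.dist() directly.

```r
# Hypothetical tokens standing in for the bag of words.
words <- c("kitten", "sitting", "mitten", "banana", "bandana")

# utils::adist() on a single vector returns the full N x N matrix of
# Levenshtein distances between all pairs of elements.
simMat <- utils::adist(words)

# Name rows and columns after the words so entries stay traceable.
dimnames(simMat) <- list(words, words)

# Hierarchical clustering on the dissimilarities; the labels are carried
# over from the row names of the matrix.
hc <- hclust(as.dist(simMat), method = "average")

# Cut the tree into two groups; the result is a vector named by word.
clusters <- cutree(hc, k = 2)
print(clusters)
```

The choice of average linkage and k = 2 is arbitrary for this illustration; both would need tuning on real data.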