Comparing a large number of texts (clustering) with a matrix

Published 2024-07-20 08:15:44


I have the following PHP function to calculate the relation between two texts:

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}

The variable $terms_in_articleX must be an array containing all single words which appear in the text.

Assuming I have a database of 20,000 texts, this function would take a very long time to run through all the pairs (roughly 20,000 × 19,999 / 2 ≈ 200 million comparisons).

How can I accelerate this process? Should I add all texts into a huge matrix instead of always comparing only two texts? It would be great if you had some approaches with code, preferably in PHP.

I hope you can help me. Thanks in advance!
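For reference, here is one hypothetical way the $terms_in_articleX arrays might be built before calling check(). The tokenization rules (lowercasing, splitting on word characters, no stop-word removal or stemming) are assumptions, not part of the question:

```php
<?php
// Hypothetical tokenizer for building $terms_in_articleX.
// Real splitting rules (stop words, stemming, Unicode) are up to you.
function tokenize($text) {
    preg_match_all('/\w+/', strtolower($text), $matches);
    return $matches[0]; // flat array of lowercase words, duplicates kept
}

$terms_in_article1 = tokenize('The cloud is white, the sky is blue');
print_r($terms_in_article1);
```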

5 Answers

你不是我要的菜∠ 2024-07-27 08:15:44


You can split the text when adding it. A simple example: preg_match_all('/\w+/', $text, $matches); Of course, real splitting is not that simple... but possible; just adjust the pattern :)

Create a table with id (int, primary, autoincrement) and value (varchar, unique), and a link table like this: word_id (int), text_id (int), word_count (int). Then fill the tables with the new values after splitting the text.

Finally, you can do anything you want with this data, operating quickly on indexed integers (IDs) in the DB.
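As a sketch, the "split, then fill the tables" step might look like this with PDO, using the `terms` / `terms_in_articles` schema given in the update below. The connection setup and error handling are omitted, and `index_article` is a name made up for this example:

```php
<?php
// Sketch: index one article's term counts into the two tables.
// Assumes $db is an already-open PDO connection.
function index_article(PDO $db, $article_id, array $terms) {
    $counts    = array_count_values($terms); // term => occurrences
    $find      = $db->prepare('SELECT id FROM terms WHERE value = ?');
    $ins_term  = $db->prepare('INSERT INTO terms (value) VALUES (?)');
    $ins_link  = $db->prepare(
        'INSERT INTO terms_in_articles (term, article, cnt) VALUES (?, ?, ?)');
    foreach ($counts as $term => $cnt) {
        $find->execute(array($term));
        $id = $find->fetchColumn();
        if ($id === false) {                 // unseen term: create its id
            $ins_term->execute(array($term));
            $id = $db->lastInsertId();
        }
        $ins_link->execute(array($id, $article_id, $cnt));
    }
}
```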

UPDATE:
Here are the tables and queries:

CREATE TABLE terms (
    id int(11) NOT NULL auto_increment, value char(255) NOT NULL,
    PRIMARY KEY  (`id`), UNIQUE KEY `value` (`value`)
);

CREATE TABLE `terms_in_articles` (
    term int(11) NOT NULL, 
    article int(11) NOT NULL, 
    cnt int(11) NOT NULL default '1',
    UNIQUE KEY `term` (`term`,`article`)
);


/* Returns all unique terms in both articles (your $all_terms) */
SELECT t.id, t.value 
FROM terms t, terms_in_articles a 
WHERE a.term = t.id AND a.article IN (1, 2);

/* Returns your $term_vector1, $term_vector2 */
SELECT article, term, cnt 
FROM terms_in_articles 
WHERE article IN (1, 2) ORDER BY article;

/* Returns article and total count of term entries in it ($length1, $length2) */
SELECT article, SUM(cnt) AS total 
FROM terms_in_articles 
WHERE article IN (1, 2) GROUP BY article;

/* Returns your $score, which you may divide by ($length1 * $length2) from the previous query */
SELECT SUM(tmp.term_score) * 500 AS total_score FROM 
(
    SELECT (a1.cnt * a2.cnt) AS term_score 
    FROM terms_in_articles a1, terms_in_articles a2 
    WHERE a1.article = 1 AND a2.article = 2 AND a1.term = a2.term
    GROUP BY a2.term, a1.term
) AS tmp;

Well, now, I hope this helps. The last 2 queries are enough to perform your task; the other queries are just in case. Of course, you can compute more stats, like "the most popular terms", etc...
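Calling the last two queries from PHP might look like the sketch below. The pair dot-product is folded into a single JOIN, which is equivalent to the subquery version above; `score_pair` is a name invented for this example and $db is assumed to be an open PDO connection:

```php
<?php
// Sketch: score one article pair straight from terms_in_articles.
function score_pair(PDO $db, $a1, $a2) {
    // Dot product of the two term-count vectors over shared terms.
    $dot = $db->prepare(
        'SELECT COALESCE(SUM(t1.cnt * t2.cnt), 0)
         FROM terms_in_articles t1
         JOIN terms_in_articles t2 ON t1.term = t2.term
         WHERE t1.article = ? AND t2.article = ?');
    $dot->execute(array($a1, $a2));
    $numerator = (float) $dot->fetchColumn();

    // Total term counts ($length1, $length2).
    $len = $db->prepare(
        'SELECT SUM(cnt) FROM terms_in_articles WHERE article = ?');
    $len->execute(array($a1));
    $length1 = (int) $len->fetchColumn();
    $len->execute(array($a2));
    $length2 = (int) $len->fetchColumn();

    return ($length1 && $length2)
        ? 500 * $numerator / ($length1 * $length2)
        : 0.0;
}
```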

把人绕傻吧 2024-07-27 08:15:44


Here's a slightly optimized version of your original function. It produces exactly the same results. (I ran it on two articles from Wikipedia with 10,000+ terms, about 20 runs each:

check():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 1.0707

check2():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 0.2624

Here's the code:

function check2($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words

    $score_table = array();
    foreach($terms_in_article1 as $term){
        if(!isset($score_table[$term])) $score_table[$term] = 0;
        $score_table[$term] += 1;
    }
    $score_table2 = array();
    foreach($terms_in_article2 as $term){
        if(isset($score_table[$term])){
            if(!isset($score_table2[$term])) $score_table2[$term] = 0;
            $score_table2[$term] += 1;
        }
    }
    $score = 0;
    foreach($score_table2 as $key => $entry){
        $score += $score_table[$key] * $entry;
    }
    $score = $score / ($length1*$length2);
    $score *= 500;
    return $score;
}

(Btw. The time needed to split all the words into arrays was not included.)
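The timing numbers above could be reproduced with a small harness along these lines. This is a sketch: microtime(true) and toy term arrays stand in for the Wikipedia articles, and `bench` is a helper name invented here:

```php
<?php
// Minimal timing harness for comparing the two implementations.
// $fn is any callable taking the two term arrays, e.g. 'check' or 'check2'.
function bench($fn, array $a, array $b, $runs = 20) {
    $t0 = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        call_user_func($fn, $a, $b);
    }
    return microtime(true) - $t0; // elapsed seconds
}

// Usage, with check() / check2() defined as in this thread:
// printf("check:  %.4f\n", bench('check',  $terms1, $terms2));
// printf("check2: %.4f\n", bench('check2', $terms1, $terms2));
```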

指尖凝香 2024-07-27 08:15:44


EDIT: Trying to be more explicit:

  1. First, encode every term into an
    integer. You can use a dictionary
    associative array, like this:

        $count = 0;
        $dict = array();
        $doc_as_int = array();
        foreach ($doc as $term) {
          if (!isset($dict[$term])) {
            $dict[$term] = $count++;   // assign the next free integer id
          }
          $id = $dict[$term];
          if (!isset($doc_as_int[$id])) {
            $doc_as_int[$id] = 0;
          }
          $doc_as_int[$id]++;          // count occurrences by integer id
        }
    

    This way, you replace string
    calculations with integer
    calculations. For example, you can
    represent the word "cloud" as the
    number 5, and then use the index 5
    of arrays to store counts of the
    word "cloud". Notice that we only
    use associative array search here,
    no need for CRC etc.

  2. Do store all texts as a matrix, preferably a sparse one.
  3. Use feature selection (PDF).
  4. Maybe use a native implementation in a faster language.
  5. I suggest you first use K-means with about 20 clusters; this way you get a rough draft of which documents are near each other, and then compare pairs only inside each cluster. Assuming uniformly-sized clusters, this reduces the number of comparisons to 20*200 + 20*10*9 - around 6000 comparisons instead of 19900.
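The arithmetic in point 5 can be sketched as a function. Note this version counts each within-cluster pair once, so it lands slightly below the ~6000 figure above; n = 200 documents, k = 20 clusters and a single k-means iteration are the numbers the estimate assumes:

```php
<?php
// Rough comparison count for cluster-then-compare: i iterations of
// k-means cost about i * n * k document-to-centroid comparisons; the
// within-cluster pass costs about k * (n/k) * ((n/k) - 1) / 2 pairs.
function estimate_comparisons($n, $k, $iterations = 1) {
    $per_cluster = $n / $k;
    $kmeans_cost = $iterations * $n * $k;
    $pair_cost   = $k * $per_cluster * ($per_cluster - 1) / 2;
    return $kmeans_cost + $pair_cost;
}

echo estimate_comparisons(200, 20), "\n"; // vs 200*199/2 = 19900 brute force
```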
走走停停 2024-07-27 08:15:44

如果您可以使用简单文本而不是数组进行比较,并且如果我正确理解您的目标,您可以使用 levenshtein php 函数(通常用于在 php 搜索引擎中提供类似于 google 的“您的意思是……?”函数)。

它的工作方式与您使用的相反:返回两个字符串之间的差异。

示例:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';

echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>

但我不确切知道这是否会提高执行速度..但也许是的,您删除了许多 foreach 循环和 array_merge 函数。

编辑:

简单的速度测试(是一个30秒编写的脚本,它不是100%准确的呃):

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}


$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);

for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';

打印:在0.36765秒内结束

第二次测试:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05023

所以,是的,看起来更快。
尝试使用许多数组项(以及许多用于levenshtein的单词)

2°编辑会很不错:

使用类似的文本,速度似乎等于levenshtein方法:

<?php
function check($a, $b) {
    return similar_text($a, $b);
}

$a = 'this is just a test ';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05988< /strong> 秒

但可能需要超过 255 个字符:

另请注意,此操作的复杂性
算法是 O(N**3),其中 N 是
最长字符串的长度。

而且,它甚至可以返回百分比相似值:

function check($a, $b) {
    similar_text($a, $b, $p);
    return $p;
}

又一个编辑

创建一个数据库函数,直接在 sql 查询中进行比较,而不是检索所有数据并循环它们怎么样?

如果您正在运行 Mysql,请查看 这个(手工制作的levenshtein函数,仍然有255个字符的限制)
另外,如果您使用 Postgresql,另一个(许多函数应该评估)

If you can use plain text instead of arrays for comparison, and if I understood your goal correctly, you can use the levenshtein PHP function (usually used to provide the Google-like "Did you mean ...?" feature in PHP search engines).

It works the opposite way to what you're doing: it returns the difference between two strings.

Example:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';

echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>

But I don't know exactly whether this will improve execution speed... maybe yes, since it removes many foreach loops and the array_merge call.

EDIT:

A simple speed test (a script written in 30 seconds, so it's not 100% accurate, eh):

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}


$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);

for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';

print: end in 0.36765 seconds

Second test:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05023 seconds

So, yes, it seems faster.
It would be nice to try it with many array items (and many words for levenshtein).

2nd EDIT:

With similar_text() the speed seems to be equal to the levenshtein method:

<?php
function check($a, $b) {
    return similar_text($a, $b);
}

$a = 'this is just a test ';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05988 seconds

But it can take strings longer than 255 characters:

Note also that the complexity of this
algorithm is O(N**3) where N is the
length of the longest string.

and it can even return the similarity value as a percentage:

function check($a, $b) {
    similar_text($a, $b, $p);
    return $p;
}

Yet another edit

What about creating a database function to do the comparison directly in the SQL query, instead of retrieving all the data and looping over it?

If you're running MySQL, have a look at this one (a hand-made levenshtein function, still with the 255-character limit).
Otherwise, if you're on PostgreSQL, this other one (several functions that should be evaluated).

天冷不及心凉 2024-07-27 08:15:44


Another approach to take would be Latent Semantic Analysis, which leverages a large corpus of data to find similarities between documents.

The way it works is by taking the co-occurrence matrix of the text and comparing it to the corpus, essentially providing you with an abstract location of your document in a 'semantic space'. This will speed up your text comparison, as you can compare documents using Euclidean distance in the LSA semantic space. It's pretty fun semantic indexing. Thus, adding new articles will not take much longer.
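Once LSA has placed each document at a low-dimensional vector (the SVD step itself would need a linear-algebra library and is not shown here), the comparison step is just a Euclidean distance. A sketch, with hypothetical 3-dimensional "semantic space" positions:

```php
<?php
// Euclidean distance between two documents' positions in the
// (hypothetical) LSA semantic space; $u and $v must have equal length.
function euclidean(array $u, array $v) {
    $sum = 0.0;
    foreach ($u as $i => $x) {
        $diff = $x - $v[$i];
        $sum += $diff * $diff;
    }
    return sqrt($sum);
}

echo euclidean(array(0.2, 1.4, -0.3), array(0.1, 1.0, -0.2)), "\n";
```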

I can't give a specific use case of this approach, having only learned it in school, but it appears that KnowledgeSearch is an open-source implementation of the algorithm.

(Sorry, its my first post, so can't post links, just look it up)
