Comparing a large number of texts (clustering) with a matrix

Published 2024-07-20 08:15:44


I have the following PHP function to calculate the relation between two texts:

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}

The variable $terms_in_articleX must be an array containing all single words which appear in the text.

Assuming I have a database of 20,000 texts, this function would take a very long time to run through all the pairs (roughly 20,000 × 19,999 / 2 ≈ 200 million comparisons).

How can I accelerate this process? Should I add all texts into a huge matrix instead of always comparing only two texts? It would be great if you had some approaches with code, preferably in PHP.

I hope you can help me. Thanks in advance!
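For reference, here is one hypothetical way the $terms_in_articleX arrays might be built before calling check(). The tokenization rules (lowercasing, splitting on word characters, no stop-word removal or stemming) are assumptions, not part of the question:

```php
<?php
// Hypothetical tokenizer for building $terms_in_articleX.
// Real splitting rules (stop words, stemming, Unicode) are up to you.
function tokenize($text) {
    preg_match_all('/\w+/', strtolower($text), $matches);
    return $matches[0]; // flat array of lowercase words, duplicates kept
}

$terms_in_article1 = tokenize('The cloud is white, the sky is blue');
print_r($terms_in_article1);
```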

5 Answers

你不是我要的菜∠ 2024-07-27 08:15:44


You can split the text when adding it. A simple example: preg_match_all('/\w+/', $text, $matches); Of course, real splitting is not that simple... but possible; just adjust the pattern :)

Create a table with id (int, primary, autoincrement) and value (varchar, unique), and a link table like this: word_id (int), text_id (int), word_count (int). Then fill the tables with the new values after splitting the text.

Finally, you can do anything you want with this data, operating quickly on indexed integers (IDs) in the DB.
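As a sketch, the "split, then fill the tables" step might look like this with PDO, using the `terms` / `terms_in_articles` schema given in the update below. The connection setup and error handling are omitted, and `index_article` is a name made up for this example:

```php
<?php
// Sketch: index one article's term counts into the two tables.
// Assumes $db is an already-open PDO connection.
function index_article(PDO $db, $article_id, array $terms) {
    $counts    = array_count_values($terms); // term => occurrences
    $find      = $db->prepare('SELECT id FROM terms WHERE value = ?');
    $ins_term  = $db->prepare('INSERT INTO terms (value) VALUES (?)');
    $ins_link  = $db->prepare(
        'INSERT INTO terms_in_articles (term, article, cnt) VALUES (?, ?, ?)');
    foreach ($counts as $term => $cnt) {
        $find->execute(array($term));
        $id = $find->fetchColumn();
        if ($id === false) {                 // unseen term: create its id
            $ins_term->execute(array($term));
            $id = $db->lastInsertId();
        }
        $ins_link->execute(array($id, $article_id, $cnt));
    }
}
```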

UPDATE:
Here are the tables and queries:

CREATE TABLE terms (
    id int(11) NOT NULL auto_increment, value char(255) NOT NULL,
    PRIMARY KEY  (`id`), UNIQUE KEY `value` (`value`)
);

CREATE TABLE `terms_in_articles` (
    term int(11) NOT NULL, 
    article int(11) NOT NULL, 
    cnt int(11) NOT NULL default '1',
    UNIQUE KEY `term` (`term`,`article`)
);


/* Returns all unique terms in both articles (your $all_terms) */
SELECT t.id, t.value 
FROM terms t, terms_in_articles a 
WHERE a.term = t.id AND a.article IN (1, 2);

/* Returns your $term_vector1, $term_vector2 */
SELECT article, term, cnt 
FROM terms_in_articles 
WHERE article IN (1, 2) ORDER BY article;

/* Returns article and total count of term entries in it ($length1, $length2) */
SELECT article, SUM(cnt) AS total 
FROM terms_in_articles 
WHERE article IN (1, 2) GROUP BY article;

/* Returns your $score, which you may divide by ($length1 * $length2) from the previous query */
SELECT SUM(tmp.term_score) * 500 AS total_score FROM 
(
    SELECT (a1.cnt * a2.cnt) AS term_score 
    FROM terms_in_articles a1, terms_in_articles a2 
    WHERE a1.article = 1 AND a2.article = 2 AND a1.term = a2.term
    GROUP BY a2.term, a1.term
) AS tmp;

Well, now, I hope this helps. The last 2 queries are enough to perform your task; the other queries are just in case. Of course, you can compute more stats, like "the most popular terms", etc...
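Calling the last two queries from PHP might look like the sketch below. The pair dot-product is folded into a single JOIN, which is equivalent to the subquery version above; `score_pair` is a name invented for this example and $db is assumed to be an open PDO connection:

```php
<?php
// Sketch: score one article pair straight from terms_in_articles.
function score_pair(PDO $db, $a1, $a2) {
    // Dot product of the two term-count vectors over shared terms.
    $dot = $db->prepare(
        'SELECT COALESCE(SUM(t1.cnt * t2.cnt), 0)
         FROM terms_in_articles t1
         JOIN terms_in_articles t2 ON t1.term = t2.term
         WHERE t1.article = ? AND t2.article = ?');
    $dot->execute(array($a1, $a2));
    $numerator = (float) $dot->fetchColumn();

    // Total term counts ($length1, $length2).
    $len = $db->prepare(
        'SELECT SUM(cnt) FROM terms_in_articles WHERE article = ?');
    $len->execute(array($a1));
    $length1 = (int) $len->fetchColumn();
    $len->execute(array($a2));
    $length2 = (int) $len->fetchColumn();

    return ($length1 && $length2)
        ? 500 * $numerator / ($length1 * $length2)
        : 0.0;
}
```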

把人绕傻吧 2024-07-27 08:15:44


Here's a slightly optimized version of your original function. It produces exactly the same results. (I ran it on two articles from Wikipedia with 10,000+ terms, about 20 runs each:

check():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 1.0707

check2():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 0.2624

Here's the code:

function check2($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words

    $score_table = array();
    foreach($terms_in_article1 as $term){
        if(!isset($score_table[$term])) $score_table[$term] = 0;
        $score_table[$term] += 1;
    }
    $score_table2 = array();
    foreach($terms_in_article2 as $term){
        if(isset($score_table[$term])){
            if(!isset($score_table2[$term])) $score_table2[$term] = 0;
            $score_table2[$term] += 1;
        }
    }
    $score = 0;
    foreach($score_table2 as $key => $entry){
        $score += $score_table[$key] * $entry;
    }
    $score = $score / ($length1*$length2);
    $score *= 500;
    return $score;
}

(Btw. The time needed to split all the words into arrays was not included.)
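The timing numbers above could be reproduced with a small harness along these lines. This is a sketch: microtime(true) and toy term arrays stand in for the Wikipedia articles, and `bench` is a helper name invented here:

```php
<?php
// Minimal timing harness for comparing the two implementations.
// $fn is any callable taking the two term arrays, e.g. 'check' or 'check2'.
function bench($fn, array $a, array $b, $runs = 20) {
    $t0 = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        call_user_func($fn, $a, $b);
    }
    return microtime(true) - $t0; // elapsed seconds
}

// Usage, with check() / check2() defined as in this thread:
// printf("check:  %.4f\n", bench('check',  $terms1, $terms2));
// printf("check2: %.4f\n", bench('check2', $terms1, $terms2));
```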

指尖凝香 2024-07-27 08:15:44


EDIT: Trying to be more explicit:

  1. First, encode every term into an
    integer. You can use a dictionary
    associative array, like this:

        $count = 0;
        $dict = array();
        $doc_as_int = array();
        foreach ($doc as $term) {
          if (!isset($dict[$term])) {
            $dict[$term] = $count++;   // assign the next free integer id
          }
          $id = $dict[$term];
          if (!isset($doc_as_int[$id])) {
            $doc_as_int[$id] = 0;
          }
          $doc_as_int[$id]++;          // count occurrences by integer id
        }
    

    This way, you replace string
    calculations with integer
    calculations. For example, you can
    represent the word "cloud" as the
    number 5, and then use the index 5
    of arrays to store counts of the
    word "cloud". Notice that we only
    use associative array search here,
    no need for CRC etc.

  2. Do store all texts as a matrix, preferably a sparse one.
  3. Use feature selection (PDF).
  4. Maybe use a native implementation in a faster language.
  5. I suggest you first use K-means with about 20 clusters; this way you get a rough draft of which documents are near each other, and then compare pairs only inside each cluster. Assuming uniformly-sized clusters, this reduces the number of comparisons to 20*200 + 20*10*9 - around 6000 comparisons instead of 19900.
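The arithmetic in point 5 can be sketched as a function. Note this version counts each within-cluster pair once, so it lands slightly below the ~6000 figure above; n = 200 documents, k = 20 clusters and a single k-means iteration are the numbers the estimate assumes:

```php
<?php
// Rough comparison count for cluster-then-compare: i iterations of
// k-means cost about i * n * k document-to-centroid comparisons; the
// within-cluster pass costs about k * (n/k) * ((n/k) - 1) / 2 pairs.
function estimate_comparisons($n, $k, $iterations = 1) {
    $per_cluster = $n / $k;
    $kmeans_cost = $iterations * $n * $k;
    $pair_cost   = $k * $per_cluster * ($per_cluster - 1) / 2;
    return $kmeans_cost + $pair_cost;
}

echo estimate_comparisons(200, 20), "\n"; // vs 200*199/2 = 19900 brute force
```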
走走停停 2024-07-27 08:15:44

如果您可以使用简单文本而不是数组进行比较,并且如果我正确理解您的目标,您可以使用 levenshtein php 函数(通常用于在 php 搜索引擎中提供类似于 google 的“您的意思是……?”函数)。

它的工作方式与您使用的相反:返回两个字符串之间的差异。

示例:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';

echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>

但我不确切知道这是否会提高执行速度..但也许是的,您删除了许多 foreach 循环和 array_merge 函数。

编辑:

简单的速度测试(是一个30秒编写的脚本,它不是100%准确的呃):

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}


$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);

for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';

打印:在0.36765秒内结束

第二次测试:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05023

所以,是的,看起来更快。
尝试使用许多数组项(以及许多用于levenshtein的单词)

2°编辑会很不错:

使用类似的文本,速度似乎等于levenshtein方法:

<?php
function check($a, $b) {
    return similar_text($a, $b);
}

$a = 'this is just a test ';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05988< /strong> 秒

但可能需要超过 255 个字符:

另请注意,此操作的复杂性
算法是 O(N**3),其中 N 是
最长字符串的长度。

而且,它甚至可以返回百分比相似值:

function check($a, $b) {
    similar_text($a, $b, $p);
    return $p;
}

又一个编辑

创建一个数据库函数,直接在 sql 查询中进行比较,而不是检索所有数据并循环它们怎么样?

如果您正在运行 Mysql,请查看 这个(手工制作的levenshtein函数,仍然有255个字符的限制)
另外,如果您使用 Postgresql,另一个(许多函数应该评估)

If you can use plain text instead of arrays for comparison, and if I understood your goal correctly, you can use the levenshtein PHP function (usually used to provide the Google-like "Did you mean ...?" feature in PHP search engines).

It works the opposite way to what you're doing: it returns the difference between two strings.

Example:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';

echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>

But I don't know exactly whether this will improve execution speed... maybe yes, since it removes many foreach loops and the array_merge call.

EDIT:

A simple speed test (a script written in 30 seconds, so it's not 100% accurate, eh):

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}


$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);

for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';

print: end in 0.36765 seconds

Second test:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05023 seconds

So, yes, it seems faster.
It would be nice to try it with many array items (and many words for levenshtein).

2nd EDIT:

With similar_text() the speed seems to be equal to the levenshtein method:

<?php
function check($a, $b) {
    return similar_text($a, $b);
}

$a = 'this is just a test ';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

print: end in 0.05988 seconds

But it can take strings longer than 255 characters:

Note also that the complexity of this
algorithm is O(N**3) where N is the
length of the longest string.

and it can even return the similarity value as a percentage:

function check($a, $b) {
    similar_text($a, $b, $p);
    return $p;
}

Yet another edit

What about creating a database function to do the comparison directly in the SQL query, instead of retrieving all the data and looping over it?

If you're running MySQL, have a look at this one (a hand-made levenshtein function, still with the 255-character limit).
Otherwise, if you're on PostgreSQL, this other one (several functions that should be evaluated).

天冷不及心凉 2024-07-27 08:15:44


Another approach to take would be Latent Semantic Analysis, which leverages a large corpus of data to find similarities between documents.

The way it works is by taking the co-occurrence matrix of the text and comparing it to the corpus, essentially providing you with an abstract location of your document in a 'semantic space'. This will speed up your text comparison, as you can compare documents using Euclidean distance in the LSA semantic space. It's pretty fun semantic indexing. Thus, adding new articles will not take much longer.
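Once LSA has placed each document at a low-dimensional vector (the SVD step itself would need a linear-algebra library and is not shown here), the comparison step is just a Euclidean distance. A sketch, with hypothetical 3-dimensional "semantic space" positions:

```php
<?php
// Euclidean distance between two documents' positions in the
// (hypothetical) LSA semantic space; $u and $v must have equal length.
function euclidean(array $u, array $v) {
    $sum = 0.0;
    foreach ($u as $i => $x) {
        $diff = $x - $v[$i];
        $sum += $diff * $diff;
    }
    return sqrt($sum);
}

echo euclidean(array(0.2, 1.4, -0.3), array(0.1, 1.0, -0.2)), "\n";
```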

I can't give a specific use case of this approach, having only learned it in school, but it appears that KnowledgeSearch is an open-source implementation of the algorithm.

(Sorry, its my first post, so can't post links, just look it up)
