Comparing a large number of texts (clustering) with a matrix
I have the following PHP function to calculate the relation between two texts:
function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words in text 1
    $length2 = count($terms_in_article2); // number of words in text 2
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    // Zero-initialize one term vector per text over the combined vocabulary.
    $term_vector1 = array();
    $term_vector2 = array();
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    // Count how often each term occurs in each text.
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    // Dot product of the two term vectors...
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa] * $term_vector2[$all_termsa];
    }
    // ...normalized by the product of the text lengths.
    $score = $score / ($length1 * $length2);
    $score *= 500; // for better readability
    return $score;
}
The variable $terms_in_articleX must be an array containing all single words which appear in the text.
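(For illustration, such an array could be built like this; the tokenization shown is an assumption, not part of the question:)

$terms_in_article1 = str_word_count(strtolower($text1), 1); // array of all words
$terms_in_article2 = str_word_count(strtolower($text2), 1);
echo check($terms_in_article1, $terms_in_article2);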
Assuming I have a database of 20,000 texts, this function would take a very long time to run through all the pairwise comparisons (20,000 texts means 20,000 × 19,999 / 2 ≈ 200 million pairs).
How can I accelerate this process? Should I add all texts into a huge matrix instead of always comparing only two texts? It would be great if you had some approaches with code, preferably in PHP.
I hope you can help me. Thanks in advance!
Comments (5)
You can split the text when adding it. Simple example:
preg_match_all('/\w+/', $text, $matches);
Sure, real splitting is not so simple... but possible, just correct the pattern. :)
Create a table with id (int, primary, autoincrement) and value (varchar, unique), and a link table like this: word_id (int), text_id (int), word_count (int). Then fill the tables with the new values after splitting each text.
Finally you can do anything you want with this data, operating quickly on indexed integers (IDs) in the DB.
UPDATE:
Here are the tables and queries:
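(The queries themselves are not reproduced here; the sketch below is one plausible reading of the schema described above, assuming MySQL via PDO. The table names, the DSN, and the use of INSERT IGNORE are assumptions:)

// Hypothetical schema following the description above: a words table
// plus a words-to-texts link table holding per-text counts.
$pdo = new PDO('mysql:host=localhost;dbname=articles', 'user', 'pass');

$pdo->exec('CREATE TABLE IF NOT EXISTS words (
    id INT AUTO_INCREMENT PRIMARY KEY,
    value VARCHAR(255) UNIQUE
)');
$pdo->exec('CREATE TABLE IF NOT EXISTS word_text_link (
    word_id INT NOT NULL,
    text_id INT NOT NULL,
    word_count INT NOT NULL,
    PRIMARY KEY (word_id, text_id)
)');

// Split a text and store its word counts using integer IDs only.
function index_text(PDO $pdo, int $textId, string $text): void {
    preg_match_all('/\w+/u', mb_strtolower($text), $matches);
    foreach (array_count_values($matches[0]) as $word => $count) {
        // Insert the word if unseen (MySQL-specific INSERT IGNORE), then fetch its ID.
        $pdo->prepare('INSERT IGNORE INTO words (value) VALUES (?)')->execute([$word]);
        $stmt = $pdo->prepare('SELECT id FROM words WHERE value = ?');
        $stmt->execute([$word]);
        $wordId = (int) $stmt->fetchColumn();
        $pdo->prepare('INSERT INTO word_text_link (word_id, text_id, word_count)
                       VALUES (?, ?, ?)')->execute([$wordId, $textId, $count]);
    }
}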
Well, now, I hope this helps. The last 2 queries are enough to perform your task; the other queries are just in case. Sure, you can count more stats, like "the most popular terms" etc...
Here's a slightly optimized version of your original function. It produces exactly the same results. (I ran it on two articles from Wikipedia with 10,000+ terms, about 20 runs each.)
Here's the code:
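(The answer's snippet is not reproduced here; the following is a sketch of one plausible optimization that keeps the exact score formula, assuming the speedup comes from array_count_values() and from iterating only over shared terms:)

function check_optimized($terms_in_article1, $terms_in_article2) {
    // Build both term-frequency vectors in a single pass each.
    $vec1 = array_count_values($terms_in_article1);
    $vec2 = array_count_values($terms_in_article2);
    // Only terms occurring in BOTH texts contribute to the dot product,
    // so iterate over the (usually much smaller) key intersection.
    $score = 0;
    foreach (array_intersect_key($vec1, $vec2) as $term => $count) {
        $score += $count * $vec2[$term];
    }
    $score /= count($terms_in_article1) * count($terms_in_article2);
    return $score * 500; // same scaling as the original
}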
(Btw. The time needed to split all the words into arrays was not included.)
EDIT: Trying to be more explicit:
First, encode every term into an integer. You can use a dictionary associative array, like this:

$count = 0;
foreach ($doc as $term) {
    // Give every previously unseen term the next free integer ID.
    if (!isset($dict[$term])) {
        $dict[$term] = $count++;
    }
    // Count occurrences under the integer ID instead of the string key.
    $val = $dict[$term];
    $doc_as_int[$val] = ($doc_as_int[$val] ?? 0) + 1;
}
This way, you replace string calculations with integer calculations. For example, you can represent the word "cloud" as the number 5, and then use index 5 of the arrays to store the counts of the word "cloud". Notice that we only use associative array search here, no need for CRC etc.
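(As a follow-up sketch, assuming both documents were encoded with the same $dict, the dot product then runs over integer keys only; this comparison step is my assumption, not shown in the answer:)

$score = 0;
foreach ($doc1_as_int as $id => $count1) {
    // Terms missing from the other document contribute nothing.
    if (isset($doc2_as_int[$id])) {
        $score += $count1 * $doc2_as_int[$id];
    }
}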
20*200 + 20*10*9 - around 6000 comparisons instead of 19900. (Presumably: with 200 documents split into 20 clusters of 10, each document is compared against the 20 cluster representatives, 20*200, plus against the other 9 members of its own cluster, 20*10*9, instead of all 200*199/2 = 19900 pairs.)
If you can use simple text instead of arrays for comparing, and if I understood your goal correctly, you can use the levenshtein PHP function (usually used to provide the Google-like "Did you mean ...?" feature in PHP search engines).
It works in the opposite way from yours: it returns the difference between two strings.
Example:
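(A minimal illustration, since the original example is not shown; note that levenshtein() is limited to strings of up to 255 characters each:)

// levenshtein() returns the minimal number of single-character edits
// (insertions, deletions, replacements) needed to turn one string into the other.
echo levenshtein('kitten', 'sitting'); // 3
echo levenshtein('Hello World', 'Hello World'); // 0 = identical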
But I don't know exactly whether this will improve execution speed... but maybe yes: you take out many foreach loops and the array_merge function.
EDIT:
A simple speed test (a script written in 30 seconds, so it's not 100% accurate, eh):
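(The test script itself is not included; a harness along these lines, timing repeated calls with microtime(), is an assumption of what was run:)

$text1 = 'first sample text';  // hypothetical inputs
$text2 = 'second sample text';
$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    // first test: the original check(); second test: levenshtein()
    levenshtein($text1, $text2);
}
printf("end in %.5f seconds\n", microtime(true) - $start);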
print: end in 0.36765 seconds
Second test:
print: end in 0.05023 seconds
So, yes, it seems faster.
It would be nice to try it with many array items (and many words for levenshtein).
2nd EDIT:
With similar_text() the speed seems to be equal to the levenshtein method:
print: end in 0.05988 seconds
But it can take more than 255 characters:
and, it can even return the similarity value as a percentage:
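(A sketch of both points, using made-up inputs:)

$a = str_repeat('some long text, well past the 255-character levenshtein limit. ', 10);
$b = str_repeat('some long text, well past the 255-character levenshtein limits. ', 10);
$chars = similar_text($a, $b, $percent); // the 3rd argument receives the percentage
printf("%d matching chars, %.2f%% similar\n", $chars, $percent);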
Yet another edit
What about creating a database function to do the comparison directly in the SQL query, instead of retrieving all the data and looping over it?
If you're running MySQL, have a look at this one (a hand-made levenshtein function, still with the 255-character limit).
Or, if you're on PostgreSQL, this other one (several functions worth evaluating).
Another approach to take would be Latent Semantic Analysis, which leverages a large corpus of data to find similarities between documents.
The way it works is by taking the co-occurrence matrix of the text and comparing it to the corpus, essentially providing you with an abstract location of your document in a "semantic space". This will speed up your text comparison, since you can compare documents using Euclidean distance in the LSA semantic space. It's pretty fun semantic indexing. And adding new articles won't take much longer.
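(The LSA projection itself, i.e. the SVD over the co-occurrence matrix, is beyond a short snippet, but once each document is reduced to a dense vector in the semantic space, the comparison described above is plain Euclidean distance; a minimal sketch with made-up coordinates:)

// Euclidean distance between two equal-length document vectors,
// e.g. coordinates in an LSA semantic space.
function euclidean_distance(array $vec1, array $vec2): float {
    $sum = 0.0;
    foreach ($vec1 as $i => $value) {
        $diff = $value - $vec2[$i];
        $sum += $diff * $diff;
    }
    return sqrt($sum);
}

echo euclidean_distance([0.12, 0.80, 0.35], [0.10, 0.75, 0.40]); // smaller = more similar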
I can't give a specific use case of this approach, having only learned it in school, but it appears that KnowledgeSearch is an open-source implementation of the algorithm.
(Sorry, it's my first post, so I can't post links; just look it up.)