PHP implementation of a Bayes classifier: assigning topics to texts
In my news page project, I have a database table news with the following structure:
- id: [integer] unique number identifying the news entry, e.g.: *1983*
- title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
- topic: [string] category that the classifier should choose, e.g.: *Sports*
Additionally, there's a table bayes with information about word frequencies:
- word: [string] a word which the frequencies are given for, e.g.: *real estate*
- topic: [string] same content as the "topic" field above, e.g.: *Economics*
- count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g.: *100*
Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to them.
Is this the correct implementation? Can you improve it?
```php
<?php
include 'mysqlLogin.php';
$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);
// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
    $pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END
// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
    if (!isset($pWords[$pWords3['topic']])) {
        $pWords[$pWords3['topic']] = array();
    }
    $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END
while ($get3 = mysql_fetch_assoc($get2)) {
    $pTextInTopics = array();
    $tokens = tokenizer($get3['title']);
    foreach ($pTopics as $topic=>$documentsInTopic) {
        if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
        foreach ($tokens as $token) {
            echo '....'.$token;
            if (isset($pWords[$topic][$token])) {
                $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
            }
        }
        $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
    }
    asort($pTextInTopics); // pick topic with lowest value
    if ($chosenTopic = each($pTextInTopics)) {
        echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
    }
}
?>
```
Training is done manually and isn't included in this code. If the text "You can make money if you sell real estates" is assigned to the category/topic "Economics", then every word (you, can, make, ...) is inserted into the table bayes with "Economics" as the topic and 1 as the initial count. If the word already exists in combination with the same topic, its count is incremented.
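The training step described above can be sketched roughly as follows. This is only an illustration: the function name `train`, the in-memory array standing in for the bayes table, and the simple regex tokenization are all assumptions, not the code actually used in the project.

```php
<?php
// Hypothetical in-memory stand-in for the bayes table:
// $bayes[topic][word] = count. Not the project's real storage.
function train(array $bayes, string $text, string $topic): array
{
    // Assumed tokenization: lowercase, split on anything that is not a letter or digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        // A new (word, topic) pair starts at 1; an existing pair is incremented.
        $bayes[$topic][$token] = ($bayes[$topic][$token] ?? 0) + 1;
    }
    return $bayes;
}

$bayes = train([], 'You can make money if you sell real estates', 'Economics');
print_r($bayes['Economics']);
```

In this sketch every occurrence counts, so "you" ends up with a count of 2 after the single example sentence. If the bayes table has a unique key on (word, topic), the same insert-or-increment could presumably be done in MySQL with `INSERT ... ON DUPLICATE KEY UPDATE`.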
Sample learning data:
| word | topic | count |
|---|---|---|
| kaczynski | Politics | 1 |
| sony | Technology | 1 |
| bank | Economics | 1 |
| phone | Technology | 1 |
| sony | Economics | 3 |
| ericsson | Technology | 2 |
Sample output/result:
Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry
Politics
....phone
....test
....sony
....ericsson
....aspen
....sensitive
....winberry
Technology
....phone FOUND
....test
....sony FOUND
....ericsson FOUND
....aspen
....sensitive
....winberry
Economics
....phone
....test
....sony FOUND
....ericsson
....aspen
....sensitive
....winberry
Result: The text belongs to topic Technology with a likelihood of 0.013888888888889
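The reported value can be checked by hand against the sample learning data. The sketch below repeats the same arithmetic as the script, using plain arrays instead of the database; the token list is an assumption about what `tokenizer()` returns for the sample title.

```php
<?php
// Sample learning data: $counts[topic][word] = count.
$counts = [
    'Politics'   => ['kaczynski' => 1],
    'Technology' => ['sony' => 1, 'phone' => 1, 'ericsson' => 2],
    'Economics'  => ['bank' => 1, 'sony' => 3],
];
// Assumed tokenizer output for the sample title.
$tokens = ['phone', 'test', 'sony', 'ericsson', 'aspen', 'sensitive', 'winberry'];

$total = array_sum(array_map('array_sum', $counts)); // 9 word occurrences overall
$scores = [];
foreach ($counts as $topic => $words) {
    $inTopic = array_sum($words);
    $score = $inTopic / $total; // p(topic), based on word counts per topic
    foreach ($tokens as $token) {
        // Words unknown to the topic are skipped, as in the script above.
        if (isset($words[$token])) {
            $score *= $words[$token] / $inTopic; // p(word|topic)
        }
    }
    $scores[$topic] = $score;
}
print_r($scores);
```

With these numbers Politics scores 1/9, Technology scores 4/9 · 1/4 · 1/4 · 2/4 = 1/72 ≈ 0.0139, and Economics scores 4/9 · 3/4 = 1/3. The reported 0.013888888888889 is therefore the smallest of the three, which is what `asort()` followed by taking the first element selects. Because unknown words are skipped rather than penalized, each matched word can only lower a topic's score.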
Thank you very much in advance!
Comments (1)
It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word, while you could easily calculate these values beforehand. (I'm assuming you want to classify multiple documents here; if you're only classifying a single document, I suppose this is okay, since you don't calculate it for words that aren't in the document.)
Similarly, the calculation of p(topic) could be moved outside of the loop.
Finally, you don't need to sort the entire array to find the maximum.
All small points! But that's what you asked for :)
I've written some untested PHP-code showing how I'd implement this below:
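The answerer's code is not reproduced here. As a stand-in, the following is a minimal sketch of the three suggestions above (precompute p(word|topic), compute p(topic) once, track the maximum without sorting). The function names `buildModel` and `classify`, the array layout, and the token list are assumptions made for illustration; data access is replaced by a plain array.

```php
<?php
// Precompute p(topic) and p(word|topic) once from $counts[topic][word] = count.
function buildModel(array $counts): array
{
    $total = array_sum(array_map('array_sum', $counts));
    $model = ['pTopic' => [], 'pWord' => []];
    foreach ($counts as $topic => $words) {
        $inTopic = array_sum($words);
        $model['pTopic'][$topic] = $inTopic / $total;
        foreach ($words as $word => $count) {
            $model['pWord'][$topic][$word] = $count / $inTopic;
        }
    }
    return $model;
}

// Score every topic and keep the best one in a single pass (no sorting).
function classify(array $model, array $tokens): ?string
{
    $best = null;
    $bestScore = -1.0;
    foreach ($model['pTopic'] as $topic => $pTopic) {
        $score = $pTopic;
        foreach ($tokens as $token) {
            // Words unseen in this topic are skipped, mirroring the question's code.
            if (isset($model['pWord'][$topic][$token])) {
                $score *= $model['pWord'][$topic][$token];
            }
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $topic;
        }
    }
    return $best;
}

$model = buildModel([
    'Politics'   => ['kaczynski' => 1],
    'Technology' => ['sony' => 1, 'phone' => 1, 'ericsson' => 2],
    'Economics'  => ['bank' => 1, 'sony' => 3],
]);
$picked = classify($model, ['phone', 'test', 'sony', 'ericsson', 'aspen', 'sensitive', 'winberry']);
echo $picked, "\n";
```

The model is built once and can be reused for every title in the batch. Note that this sketch takes the highest score, as the answer describes, whereas the question's script sorts ascending and takes the first entry, so the two need not agree on the same input.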
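As for the maths...
You're trying to maximize p(topic|words), so find

$$\arg\max_{\text{topic}} \; p(\text{topic} \mid \text{words})$$

(i.e. the argument topic for which p(topic|words) is the highest)
Bayes' theorem says

$$p(\text{topic} \mid \text{words}) = \frac{p(\text{words} \mid \text{topic}) \; p(\text{topic})}{p(\text{words})}$$

So you're looking for

$$\arg\max_{\text{topic}} \; \frac{p(\text{words} \mid \text{topic}) \; p(\text{topic})}{p(\text{words})}$$

Since p(words) of a document is the same for any topic, this is the same as finding

$$\arg\max_{\text{topic}} \; p(\text{words} \mid \text{topic}) \; p(\text{topic})$$

The naive Bayes assumption (which makes this a naive Bayes classifier) is that

$$p(\text{words} \mid \text{topic}) = \prod_{i} p(\text{word}_i \mid \text{topic})$$

So using this, you need to find

$$\arg\max_{\text{topic}} \; p(\text{topic}) \prod_{i} p(\text{word}_i \mid \text{topic})$$

Where

$$p(\text{topic}) = \frac{\text{number of words in topic}}{\text{number of words in total}}$$

And

$$p(\text{word} \mid \text{topic}) = \frac{\text{number of times word occurs in topic}}{\text{number of words in topic}}$$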