PHP implementation of a Bayes classifier: assigning topics to texts

Posted 2024-09-16 06:31:32

In my news page project, I have a database table news with the following structure:

 - id: [integer] unique number identifying the news entry, e.g.: *1983*
 - title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
 - topic: [string] category which should be chosen by the classifier, e.g.: *Sports*

Additionally, there's a table bayes with information about word frequencies:

 - word: [string] a word which the frequencies are given for, e.g.: *real estate*
 - topic: [string] same content as "topic" field above, e.g.: *Economics*
 - count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g.: *100*

Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to them.

Is this the correct implementation? Can you improve it?

<?php
include 'mysqlLogin.php';
$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);
// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
    $pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END
// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
    if (!isset($pWords[$pWords3['topic']])) {
        $pWords[$pWords3['topic']] = array();
    }
    $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END
while ($get3 = mysql_fetch_assoc($get2)) {
    $pTextInTopics = array();
    $tokens = tokenizer($get3['title']);
    foreach ($pTopics as $topic=>$documentsInTopic) {
        if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
        foreach ($tokens as $token) {
            echo '....'.$token;
            if (isset($pWords[$topic][$token])) {
                $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
            }
        }
        $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
    }
    asort($pTextInTopics); // pick topic with lowest value
    if ($chosenTopic = each($pTextInTopics)) {
        echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
    }
}
?>

Training is done manually and is not included in this code. If the text "You can make money if you sell real estates" is assigned to the category/topic "Economics", then all of its words (you, can, make, ...) are inserted into the table bayes with "Economics" as the topic and 1 as the initial count. If a word already exists in combination with the same topic, its count is incremented.
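
The training step described above could be sketched as follows, using an in-memory array instead of the bayes table. The names `sketchTokenizer()` and `trainText()` are hypothetical, and the tokenizer is only a guess, since the original `tokenizer()` is not shown in the post. The sketch also assumes that a word occurring twice in one text is counted twice.

```php
<?php
// Hypothetical tokenizer: lowercases the text and splits on anything that is
// not a letter or digit. The real tokenizer() from the post may differ.
function sketchTokenizer(string $text): array
{
    $parts = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Adds one manually labelled text to the counts, shaped [topic => [word => count]].
function trainText(array $counts, string $text, string $topic): array
{
    foreach (sketchTokenizer($text) as $word) {
        // A new (word, topic) pair starts at 1; an existing one is incremented.
        $counts[$topic][$word] = ($counts[$topic][$word] ?? 0) + 1;
    }
    return $counts;
}

$counts = trainText([], 'You can make money if you sell real estates', 'Economics');
print_r($counts['Economics']);
```

With a real database, the same update would typically be a single insert-or-increment statement per word, but the exact SQL depends on the table's keys, which the post does not specify.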

Sample learning data:

word       topic       count
kaczynski  Politics    1
sony       Technology  1
bank       Economics   1
phone      Technology  1
sony       Economics   3
ericsson   Technology  2

Sample output/result:

Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry

Politics

....phone
....test
....sony
....ericsson
....aspen
....sensitive
....winberry

Technology

....phone FOUND
....test
....sony FOUND
....ericsson FOUND
....aspen
....sensitive
....winberry

Economics

....phone
....test
....sony FOUND
....ericsson
....aspen
....sensitive
....winberry

Result: The text belongs to topic Technology with a likelihood of 0.013888888888889
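
The value in the result line can be reproduced by hand from the sample learning data. The sketch below hard-codes that data and the seven tokens of the title, then applies the same scoring as the script in the question: tokens that are unknown for a topic are skipped, and the prior is based on word counts. The variable names are only illustrative.

```php
<?php
// Sample learning data from the post, shaped [topic => [word => count]].
$counts = [
    'Politics'   => ['kaczynski' => 1],
    'Technology' => ['sony' => 1, 'phone' => 1, 'ericsson' => 2],
    'Economics'  => ['bank' => 1, 'sony' => 3],
];
$tokens = ['phone', 'test', 'sony', 'ericsson', 'aspen', 'sensitive', 'winberry'];

// Total number of counted words over all topics (9 for the sample data).
$total = 0;
foreach ($counts as $words) {
    $total += array_sum($words);
}

$scores = [];
foreach ($counts as $topic => $words) {
    $nTopic = array_sum($words);
    $score = $nTopic / $total;            // p(topic), based on word counts
    foreach ($tokens as $token) {
        if (isset($words[$token])) {      // unknown tokens are skipped, as in the question
            $score *= $words[$token] / $nTopic;
        }
    }
    $scores[$topic] = $score;
}

foreach ($scores as $topic => $score) {
    printf("%-10s %.6f\n", $topic, $score);
}
// Politics   0.111111   (1/9, no token found)
// Technology 0.013889   (4/9 * 1/4 * 1/4 * 2/4 = 1/72)
// Economics  0.333333   (4/9 * 3/4 = 1/3)
```

Technology has the smallest of the three scores, which is the entry that `asort()` followed by `each()` returns in the question's script. Economics has the largest score, because skipping unknown tokens means a topic that matches fewer words is multiplied by fewer factors below 1.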

Thank you very much in advance!

欢你一世 2024-09-23 06:31:32

It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word, although you could easily calculate these values beforehand. (I'm assuming you want to classify multiple documents here. If you're only classifying a single document, I suppose this is okay, since you don't calculate it for words that aren't in the document.)

Similarly, the calculation of p(topic) could be moved outside of the loop.

Finally, you don't need to sort the entire array to find the maximum.

All small points! But that's what you asked for :)

I've written some untested PHP code showing how I'd implement this below:

<?php

// Get word counts from database
// (mystery_sql() is a placeholder; it should return array(topic => array(word => count)))
$nWordPerTopic = mystery_sql();

// Calculate p(word|topic) = nWord / sum(nWord for every word)
$nTopics = array();
$pWordPerTopic = array();
foreach($nWordPerTopic as $topic => $wordCounts)
{
    // Get total word count in topic
    $nTopic = array_sum($wordCounts);

    // Calculate p(word|topic)
    $pWordPerTopic[$topic] = array();
    foreach($wordCounts as $word => $count)
        $pWordPerTopic[$topic][$word] = $count / $nTopic;

    // Save $nTopic for next step
    $nTopics[$topic] = $nTopic;
}

// Calculate p(topic)
$nTotal = array_sum($nTopics);
$pTopics = array();
foreach($nTopics as $topic => $nTopic)
    $pTopics[$topic] = $nTopic / $nTotal;

// Classify
// ($documents is assumed to hold the unclassified news rows, each with a 'title' key)
foreach($documents as $document)
{
    $title = $document['title'];
    $tokens = tokenizer($title);
    $pMax = -1;
    $selectedTopic = null;
    foreach($pTopics as $topic => $pTopic)
    {
        $p = $pTopic;
        foreach($tokens as $word)
        {
            if (!array_key_exists($word, $pWordPerTopic[$topic]))
                continue;
            $p *= $pWordPerTopic[$topic][$word];
        }

        if ($p > $pMax)
        {
            $selectedTopic = $topic;
            $pMax = $p;
        }
    }
    // $selectedTopic now holds the most probable topic for $title
}
?>
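
The `mystery_sql()` placeholder in the code above could be filled in along these lines. This is only a sketch: `groupCounts()` is a hypothetical helper, and the rows are hard-coded here so that the example runs without a database. In practice they would come from a query such as `SELECT word, topic, count FROM bayes`, fetched with whatever database API the project uses.

```php
<?php
// Groups flat (word, topic, count) rows into [topic => [word => count]].
// groupCounts() is a hypothetical helper, not part of the original answer.
function groupCounts(iterable $rows): array
{
    $grouped = [];
    foreach ($rows as $row) {
        if ($row['topic'] === '') {
            continue; // ignore rows without a topic, like the WHERE clause in the question
        }
        $grouped[$row['topic']][$row['word']] = (int) $row['count'];
    }
    return $grouped;
}

// Sample learning data from the question, as it would come back from the database.
$rows = [
    ['word' => 'kaczynski', 'topic' => 'Politics',   'count' => '1'],
    ['word' => 'sony',      'topic' => 'Technology', 'count' => '1'],
    ['word' => 'bank',      'topic' => 'Economics',  'count' => '1'],
    ['word' => 'phone',     'topic' => 'Technology', 'count' => '1'],
    ['word' => 'sony',      'topic' => 'Economics',  'count' => '3'],
    ['word' => 'ericsson',  'topic' => 'Technology', 'count' => '2'],
];

$nWordPerTopic = groupCounts($rows);
print_r($nWordPerTopic);
```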

As for the maths...

You're trying to maximize p(topic|words), so find

arg max p(topic|words)

(i.e. the topic for which p(topic|words) is highest)

Bayes' theorem says

                  p(topic)*p(words|topic)
p(topic|words) = -------------------------
                        p(words)

So you're looking for

         p(topic)*p(words|topic)
arg max -------------------------
               p(words)

Since p(words) of a document is the same for every topic, this is the same as finding

arg max p(topic)*p(words|topic)

The naive Bayes assumption (which makes this a naive Bayes classifier) is that

p(words|topic) = p(word1|topic) * p(word2|topic) * ...

So using this, you need to find

arg max p(topic) * p(word1|topic) * p(word2|topic) * ...

Where

p(topic) = number of words in topic / number of words in total

And

                   p(word, topic)                         1
p(word | topic) = ---------------- = p(word, topic) * ----------
                      p(topic)                         p(topic)

      number of times word occurs in topic     number of words in total
   = -------------------------------------- * --------------------------
            number of words in total           number of words in topic

      number of times word occurs in topic 
   = --------------------------------------
            number of words in topic
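
A common variant of the arg max above, not used in the code in this thread, sums logarithms instead of multiplying probabilities and applies add-one (Laplace) smoothing, so that a word that was never seen in a topic contributes a small probability rather than being skipped. The sketch below is an assumption about how that could look. `classifyLog()` is a hypothetical name, and the sketch assumes every topic has at least one counted word.

```php
<?php
// Returns the topic with the highest log score, or null if there are no topics.
// $counts is shaped [topic => [word => count]]. Add-one smoothing is a swapped-in
// technique here, not something the original answer uses.
function classifyLog(array $counts, array $tokens): ?string
{
    // Collect the vocabulary (distinct words over all topics) and the total word count.
    $vocab = [];
    $total = 0;
    foreach ($counts as $words) {
        $vocab += $words;                 // array union keeps each word key once
        $total += array_sum($words);
    }
    $v = count($vocab);

    $best = null;
    $bestScore = -INF;
    foreach ($counts as $topic => $words) {
        $nTopic = array_sum($words);
        $score = log($nTopic / $total);   // log p(topic)
        foreach ($tokens as $token) {
            $n = $words[$token] ?? 0;     // 0 for words never seen in this topic
            $score += log(($n + 1) / ($nTopic + $v));   // smoothed log p(word|topic)
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $topic;
        }
    }
    return $best;
}

$counts = [
    'Politics'   => ['kaczynski' => 1],
    'Technology' => ['sony' => 1, 'phone' => 1, 'ericsson' => 2],
    'Economics'  => ['bank' => 1, 'sony' => 3],
];
$tokens = ['phone', 'test', 'sony', 'ericsson', 'aspen', 'sensitive', 'winberry'];
echo classifyLog($counts, $tokens), "\n"; // Technology
```

Because the logarithm is monotonic, the topic with the highest sum of logs is also the topic with the highest product, so the arg max itself is unchanged. Only the handling of unseen words differs from the code above.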
