Naive Bayes calculation in SQL



I want to use naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really is that entity, on the basis of whether that article is similar to articles in which that entity has been correctly verified.

Say we find the text "General Motors" in an article. We have a set of data that contains articles and the correct entities mentioned within them. So, if we find "General Motors" mentioned in a new article, should it fall into the class of articles in the prior data that contained a known genuine mention of "General Motors", or the class of articles which did not mention that entity?

(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity name mentions per article that the method already detects.)

Given that the number of potential classes and articles is quite large and naive Bayes is relatively simple, I want to do the whole thing in SQL, but I'm having trouble with the scoring query...

Here's what I have so far:

CREATE TABLE `each_entity_word` (
  `word` varchar(20) NOT NULL,
  `entity_id` int(10) unsigned NOT NULL,
  `word_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`word`, `entity_id`)
);

CREATE TABLE `each_entity_sum` (
  `entity_id` int(10) unsigned NOT NULL DEFAULT '0',
  `word_count_sum` int(10) unsigned DEFAULT NULL,
  `doc_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`entity_id`)
);

CREATE TABLE `total_entity_word` (
  `word` varchar(20) NOT NULL,
  `word_count` int(10) unsigned NOT NULL,
  PRIMARY KEY (`word`)
);

CREATE TABLE `total_entity_sum` (
  `word_count_sum` bigint(20) unsigned NOT NULL,
  `doc_count` int(10) unsigned NOT NULL,
  `pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
  PRIMARY KEY (`pkey`)
);

Each article in the marked data is split into distinct words, and for each article, for each entity, every word is added to each_entity_word (or its word_count there is incremented), and doc_count is incremented in each_entity_sum, both with respect to that entity_id. This is repeated for each entity known to be mentioned in that article.

For each article, regardless of the entities it contains, word_count in total_entity_word is similarly incremented for each word, and doc_count in total_entity_sum is incremented once per article.
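Concretely, the per-article updates would look something like this (a sketch only; 'motors', the entity_id 1, and @distinct_words_in_article are placeholder values supplied by the application loop):

-- For each distinct word in the article, for each entity mentioned in it:
INSERT INTO each_entity_word (word, entity_id, word_count)
VALUES ('motors', 1, 1)
ON DUPLICATE KEY UPDATE word_count = word_count + 1;

-- Once per (article, entity) pair, assuming the entity row already exists:
UPDATE each_entity_sum
SET word_count_sum = word_count_sum + @distinct_words_in_article,
    doc_count = doc_count + 1
WHERE entity_id = 1;

-- And regardless of entity: once per distinct word, and once per article:
INSERT INTO total_entity_word (word, word_count)
VALUES ('motors', 1)
ON DUPLICATE KEY UPDATE word_count = word_count + 1;

UPDATE total_entity_sum
SET word_count_sum = word_count_sum + @distinct_words_in_article,
    doc_count = doc_count + 1;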

  • P(word | any document) should equal word_count in total_entity_word for that word over doc_count in total_entity_sum
  • P(word | document mentions entity x) should equal word_count in each_entity_word for that word for entity_id x over doc_count in each_entity_sum for entity_id x
  • P(word | document does not mention entity x) should equal (word_count in total_entity_word for that word minus its word_count in each_entity_word for that entity) over (doc_count in total_entity_sum minus doc_count for that entity in each_entity_sum)
  • P(document mentions entity x) should equal doc_count in each_entity_sum for that entity id over doc_count in total_entity_sum
  • P(document does not mention entity x) should equal 1 minus (doc_count in each_entity_sum for x's entity id over doc_count in total_entity_sum).
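Putting those estimates together under the naive independence assumption gives a single log-odds score for a new article's words w_1 ... w_n (this is standard naive Bayes algebra, spelled out here for reference):

$$
\log\frac{P(x \mid w_1,\dots,w_n)}{P(\lnot x \mid w_1,\dots,w_n)}
  = \log\frac{P(x)}{P(\lnot x)}
  + \sum_{i=1}^{n}\Bigl(\log P(w_i \mid x) - \log P(w_i \mid \lnot x)\Bigr)
$$

Exponentiating that score gives the odds, and odds / (1 + odds) turns it back into a probability.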

For a new article that comes in, split it into words and just select where word in ('I', 'want', 'to', 'use'...) against either each_entity_word or total_entity_word. In the db platform I'm working with (MySQL), IN clauses are relatively well optimized.

Also, there is no product() aggregate function in SQL, so you can use exp(sum(log(x))) to get the equivalent of product(x), or just stay in log space and work with sum(log(x)) directly.
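As a quick sanity check of that identity (throwaway values; LOG is only defined for positive x):

SELECT EXP(SUM(LOG(x))) AS product_x
FROM (SELECT 2 AS x UNION ALL SELECT 3 UNION ALL SELECT 5) t;
-- returns 30 (up to floating-point rounding), i.e. 2 * 3 * 5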

So: if I get a new article, split it into distinct words, and put those words into a big IN() clause along with a potential entity id to test, how can I get the naive Bayes probability that the article falls into that entity id's class, in SQL?

EDIT:

Try #1:

set @entity_id = 1;

select doc_count into @entity_doc_count from each_entity_sum where entity_id = @entity_id;

select doc_count into @total_doc_count from total_entity_sum;

select 
            exp(

                log(@entity_doc_count / @total_doc_count) + 

                (
                    /* in log space the likelihood ratio is the difference of
                       the two summed logs, not their quotient */
                    sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count)) - 
                    sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0)) / (@total_doc_count - @entity_doc_count)))
                )

            ) as likelihood
        from total_entity_word aew 
        left outer join each_entity_word ew on ew.word = aew.word and ew.entity_id = @entity_id

        where aew.word in ('I', 'want', 'to', 'use'...);

5 Answers

飘落散花 2024-07-25 01:49:36


Use an R to Postgres (or MySQL, etc.) interface

Alternatively, I'd recommend using an established stats package with a connector to the db. This will make your app a lot more flexible if you want to switch from Naive Bayes to something more sophisticated:

http://rpgsql.sourceforge.net/

bnd.pr> data(airquality)

bnd.pr> db.write.table(airquality, no.clobber = F)

bnd.pr> bind.proxy("airquality")

bnd.pr> summary(airquality)
Table name: airquality 
Database: test 
Host: localhost
Dimensions: 6 (columns) 153 (rows)


bnd.pr> print(airquality)
   Day Month Ozone Solar.R Temp
1    1     5    41     190   67
2    2     5    36     118   72
3    3     5    12     149   74
4    4     5    18     313   62
5    5     5    NA      NA   56
6    6     5    28      NA   66
7    7     5    23     299   65
8    8     5    19      99   59
9    9     5     8      19   61
10  10     5    NA     194   69
Continues for 143 more rows and 1 more cols...

bnd.pr> airquality[50:55, ]
   Ozone Solar.R Wind Temp Month Day
50    12     120 11.5   73     6  19
51    13     137 10.3   76     6  20
52    NA     150  6.3   77     6  21
53    NA      59  1.7   76     6  22
54    NA      91  4.6   76     6  23
55    NA     250  6.3   76     6  24

bnd.pr> airquality[["Ozone"]]
  [1]  41  36  12  18  NA  28  23  19   8  NA   7  16  11  14  18  14  34   6
 [19]  30  11   1  11   4  32  NA  NA  NA  23  45 115  37  NA  NA  NA  NA  NA
 [37]  NA  29  NA  71  39  NA  NA  23  NA  NA  21  37  20  12  13  NA  NA  NA
 [55]  NA  NA  NA  NA  NA  NA  NA 135  49  32  NA  64  40  77  97  97  85  NA
 [73]  10  27  NA   7  48  35  61  79  63  16  NA  NA  80 108  20  52  82  50
 [91]  64  59  39   9  16  78  35  66 122  89 110  NA  NA  44  28  65  NA  22
[109]  59  23  31  44  21   9  NA  45 168  73  NA  76 118  84  85  96  78  73
[127]  91  47  32  20  23  21  24  44  21  28   9  13  46  18  13  24  16  13
[145]  23  36   7  14  30  NA  14  18  20

You'll then want to install the e1071 package to do Naive Bayes. At the R prompt:

[ramanujan:~/base]$R

R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


 ~/.Rprofile loaded.
Welcome at  Sun Apr 19 00:45:30 2009
> install.packages("e1071")  
> install.packages("mlbench")
> library(e1071)
> ?naiveBayes
> example(naiveBayes)

More info:

http://cran.r-project.org/web/packages/e1071/index.html

失去的东西太少 2024-07-25 01:49:36


Here's a simple version for SQL Server. I run it on a free SQL Express implementation and it is pretty fast.

http://sqldatamine.blogspot.com/2013/07/classification-using-naive-bayes.html

ゃ懵逼小萝莉 2024-07-25 01:49:36


I don't have time to calculate all the expressions for the NB formula, but here's the main idea:

SET @entity = 123;

SELECT  EXP(SUM(LOG(probability))) / (EXP(SUM(LOG(probability))) + EXP(SUM(LOG(1 - probability))))
FROM    (
        SELECT  @entity AS _entity,
                /* Above is required for efficiency, subqueries using _entity will be DEPENDENT and use the indexes */
                (
                SELECT  SUM(word_count)
                FROM    total_entity_word
                WHERE   word = d.word
                )
                /
                (
                SELECT  doc_count
                FROM    each_entity_sum
                WHERE   entity_id = _entity
                ) AS pwordentity,
                /* I've just referenced a previously selected field */
                (
                SELECT  1 - pwordentity
                ) AS pwordnotentity,
                /* Again referenced a previously selected field */
                ... etc AS probability
        FROM    total_entity_word d /* the alias d is what the correlated subqueries above reference */
        ) q

Note that you can easily refer to previously selected fields in a SELECT by using them in correlated subqueries (as in the example).
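A minimal illustration of that alias-visibility trick (this behavior is MySQL-specific; a toy query unrelated to the schema above):

SELECT 0.3 AS p,
       (SELECT 1 - p) AS q;  -- the scalar subquery can see the alias p, so q = 0.7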

維他命╮ 2024-07-25 01:49:36


If using Oracle, it has data mining built in

I'm not sure what db you're running, but if you're using Oracle, data mining capabilities are baked into the db:

http://www.oracle.com/technology/products/bi/odm/index.html

...including Naive Bayes:

http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/algo_nb.htm

and a ton of others:

http://www.oracle.com/technology/products/bi/odm/odm_techniques_algorithms.html

That was surprising to me. Definitely one of the competitive advantages that Oracle has over the open source alternatives in this area.

玩心态 2024-07-25 01:49:36


Here is a blog post detailing what you are looking for: http://nuncupatively.blogspot.com/2011/07/naive-bayes-in-sql.html

I have coded up many versions of NB classifiers in SQL. The answers above advocating a switch of analysis package didn't scale to my large data and processing-time requirements. I had a table with a row for each word/class combination (nrows = words * classes) and a coefficient column. I had another table with columns for document_id and word. I just joined these tables together on word, grouped by document, summed the coefficients, and then adjusted the sums for the class probability. This left me with a table of document_id, class, score. I then just picked the min score (since I was doing a complement naive Bayes approach, which I found worked better in a multi-class situation); see the sketch below.
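A sketch of that scoring join (the table and column names here are invented for illustration; the post does not give its actual schema):

-- Hypothetical tables:
--   nb_coefficient(word, class, coef)     one row per word/class combination
--   document_word(document_id, word)      one row per word per document
--   class_prior(class, log_prior)         per-class adjustment term
SELECT dw.document_id,
       nc.class,
       cp.log_prior + SUM(nc.coef) AS score
FROM document_word dw
JOIN nb_coefficient nc ON nc.word = dw.word
JOIN class_prior cp ON cp.class = nc.class
GROUP BY dw.document_id, nc.class, cp.log_prior;
-- With complement naive Bayes, the winning class per document is the one
-- with the minimum score, since the complement class is what gets scored.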

As a side note, I found many transformations/algorithm modifications improved my holdout predictions a great deal. They are described in the work of Jason Rennie on "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" and summarized here: http://www.ist.temple.edu/~vucetic/cis526fall2007/liang.ppt
