Implementing the Naive Bayes algorithm in Java - need some guidance

Posted on 2024-09-02 11:44:11

As a school assignment I'm required to implement the Naive Bayes algorithm, which I intend to do in Java.

In trying to understand how it's done, I've read the book "Data Mining - Practical Machine Learning Tools and Techniques", which has a section on this topic, but I'm still unsure about some basic points that are blocking my progress.

Since I'm seeking guidance rather than a solution here, I'll describe what I think is the correct approach and ask for correction/guidance in return, which would be very much appreciated. Please note that I am an absolute beginner in Naive Bayes, data mining, and programming in general, so you might see stupid comments/calculations below:

The training data set I'm given has 4 attributes/features that are numeric and normalized (in the range [0, 1]) using Weka (no missing values) and one nominal class (yes/no).

1) The data coming from a csv file is numeric, hence:

    * Given that the attributes are numeric, I use the PDF (probability density function) formula.

      + To calculate the PDF in Java, I first separate the attributes based on whether they're in class yes or class no and hold them in different arrays (array class yes and array class no).
      + Then I calculate the mean (sum of the values in a row / number of values in that row) and standard deviation for each of the 4 attributes (columns) of each class.
      + Now, to find the PDF of a given value (n), I do (n-mean)^2/(2*SD^2).
      + Then, to find P(yes | E) and P(no | E), I multiply the PDF values of all 4 given attributes and compare which is larger, which indicates the class it belongs to.
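
For reference, the expression (n-mean)^2/(2*SD^2) is only the exponent term of the Gaussian density. The full density is f(n) = 1/(sqrt(2*pi)*SD) * exp(-(n-mean)^2/(2*SD^2)). The steps above could be sketched roughly as follows. The class name `GaussianStats` and its method names are made up for illustration, the sample values in `main` are invented, and the sketch assumes the sample standard deviation (dividing by n - 1) and at least two values per column.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not taken from the question or from Weka.
final class GaussianStats {

    /** Arithmetic mean of one attribute column for one class. */
    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Sample standard deviation (assumption: divide by n - 1, needs n >= 2). */
    static double stdDev(List<Double> values) {
        double m = mean(values);
        double squared = 0.0;
        for (double v : values) {
            squared += (v - m) * (v - m);
        }
        return Math.sqrt(squared / (values.size() - 1));
    }

    /** Gaussian density: exp(-(x-mean)^2 / (2*sd^2)) / (sd * sqrt(2*pi)). */
    static double pdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        // Invented values of one attribute for the rows labelled "yes".
        List<Double> yesColumn = Arrays.asList(0.2, 0.4, 0.6);
        double m = mean(yesColumn);
        double sd = stdDev(yesColumn);
        System.out.println("density at 0.5: " + pdf(0.5, m, sd));
    }
}
```

A density can be larger than 1 when SD is small, so these values are compared between classes rather than read as probabilities.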

In terms of Java, I'm using an ArrayList of ArrayList of Double to store the attribute values.

Lastly, I'm unsure how to get new data. Should I ask for an input file (like a csv), or use the command prompt and ask for 4 values?
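
Both input options can share the same parsing code. The sketch below is one possible arrangement, not a required design: the class name `NewDataInput` and the argument convention (one argument means a file path, four arguments mean one instance) are assumptions made for illustration, and the file is assumed to hold unlabeled rows of comma-separated numbers with no header line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing both ways of supplying new instances.
final class NewDataInput {

    /** Parses one unlabeled line such as "0.1,0.2,0.3,0.4" into attribute values. */
    static List<Double> parseLine(String line) {
        List<Double> values = new ArrayList<>();
        for (String field : line.split(",")) {
            values.add(Double.parseDouble(field.trim()));
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        List<List<Double>> instances = new ArrayList<>();
        if (args.length == 1) {
            // One argument: treat it as the path of a csv file of unlabeled rows.
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                if (!line.trim().isEmpty()) {
                    instances.add(parseLine(line));
                }
            }
        } else if (args.length == 4) {
            // Four arguments: treat them as the attribute values of one instance.
            instances.add(parseLine(String.join(",", args)));
        } else {
            System.err.println("usage: NewDataInput <file.csv> | <a1> <a2> <a3> <a4>");
            return;
        }
        System.out.println(instances);
    }
}
```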

I'll stop here for now (I do have more questions), but I'm worried this won't get any responses given how long it has become. I'd really appreciate it if anyone takes the time to read my problems and comment.

Comments (1)

绾颜 2024-09-09 11:44:11

What you are doing is almost correct.

         + Then, to find P(yes | E) and P(no | E), I multiply the PDF values of all 4 given attributes and compare which is larger, which indicates the class it belongs to.

Here, you forgot to multiply by the prior, P(yes) or P(no). Remember the decision formula:

P(Yes | E) ~= P(Attr_1 | Yes) * P(Attr_2 | Yes) * P(Attr_3 | Yes) * P(Attr_4 | Yes) * P(Yes)
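
As a rough illustration of that formula, the sketch below scores each class as prior times the product of the four Gaussian densities and picks the larger score. It works in log space (summing logs instead of multiplying), which is a substitution made here to avoid numeric underflow and does not change which class wins. The class name `NaiveBayesDecision`, the method names, and all numbers in `main` are invented for the example; the prior would normally be the fraction of training rows in each class.

```java
// Hypothetical sketch of the decision rule; not the asker's code.
final class NaiveBayesDecision {

    /** Log of the Gaussian density of x for the given mean and standard deviation. */
    static double logGaussian(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }

    /** log P(class) + sum over attributes of log p(x_i | class). */
    static double logScore(double[] x, double[] means, double[] sds, double prior) {
        double score = Math.log(prior);
        for (int i = 0; i < x.length; i++) {
            score += logGaussian(x[i], means[i], sds[i]);
        }
        return score;
    }

    /** Returns "yes" or "no", whichever class has the larger score. */
    static String classify(double[] x,
                           double[] meanYes, double[] sdYes, double priorYes,
                           double[] meanNo, double[] sdNo, double priorNo) {
        double yes = logScore(x, meanYes, sdYes, priorYes);
        double no = logScore(x, meanNo, sdNo, priorNo);
        return yes >= no ? "yes" : "no";
    }

    public static void main(String[] args) {
        // Invented per-class statistics for 4 attributes.
        double[] meanYes = {0.8, 0.7, 0.6, 0.9};
        double[] meanNo = {0.2, 0.3, 0.4, 0.1};
        double[] sd = {0.1, 0.1, 0.1, 0.1};
        double[] instance = {0.75, 0.65, 0.55, 0.80};
        System.out.println(classify(instance, meanYes, sd, 0.4, meanNo, sd, 0.6));
    }
}
```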

For Naive Bayes (and any other supervised learning/classification algorithm), you need training data and testing data. You use the training data to train the model and make predictions on the testing data. You could simply use the training data as testing data, although the measured accuracy will then tend to be optimistic. Or you can split the csv file into two pieces, one for training and one for testing. You could also do cross-validation on the csv file.
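
Splitting the csv rows into two pieces might look like the sketch below. The class name `TrainTestSplit`, the 70/30 ratio, and the fixed seed are illustrative assumptions. Shuffling before cutting is a common precaution in case the file is sorted by class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for a shuffled train/test split.
final class TrainTestSplit {

    /**
     * Returns two lists: index 0 holds the training rows, index 1 the testing rows.
     * The input list is copied, so the caller's list is left unchanged.
     */
    static <T> List<List<T>> split(List<T> rows, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        // Row indices stand in for parsed csv rows here.
        List<Integer> rows = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<List<Integer>> parts = split(rows, 0.7, 42L);
        System.out.println("train: " + parts.get(0));
        System.out.println("test:  " + parts.get(1));
    }
}
```

For cross-validation, the same shuffled list would be cut into k pieces instead, with each piece taking a turn as the testing data.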
