Machine learning challenge: a diagnosis program in Java/Groovy (data mining, machine learning)
I'm planning to develop a program in Java that provides a diagnosis. The data set is divided into two parts, one for training and the other for testing. My program should learn to classify from the training data and then make predictions on the testing data. The data set contains about 1000 records, one per line. Each record holds the answers to 30 questions, one per column, and the last column is the diagnosis (0 or 1). In the testing part of the data, the diagnosis column is empty.
I've never done anything similar, so I'd appreciate any advice or information about solutions to similar problems.
I was thinking about the Java Machine Learning Library or the Java Data Mining Package, but I'm not sure that's the right direction, and I'm still not sure how to tackle this challenge.
Please advise.
All the best!
Comments (5)
I strongly recommend you use Weka for your task.
It's a collection of machine learning algorithms with a user-friendly front end that supports many different feature and model selection strategies.
You can do a lot of really complicated things with it without really having to do any coding or math.
The makers have also published a pretty good textbook that explains the practical aspects of data mining.
Once you get the hang of it, you can use its API to integrate any of its classifiers into your own Java programs.
Hi. As Gann Bierner said, this is a classification problem. The best classification algorithm I know of for your needs is Ross Quinlan's decision tree algorithm (presumably ID3 or its successor C4.5). It's conceptually very easy to understand.
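To give a feel for why this family of algorithms is easy to follow, here is a minimal sketch of the core step of an ID3-style learner: picking the question whose answers best separate the two diagnoses, measured by information gain. It assumes every answer is encoded as 0 or 1, which the question does not actually state, and the class name, method names, and toy data are all made up for illustration. It is not the code of JadTi or Weka.

```java
/** Sketch: entropy and information gain for 0/1 answers and a 0/1 diagnosis. */
final class InfoGainSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Shannon entropy (in bits) of a 0/1 label set with `ones` positives out of `total`. */
    static double entropy(int ones, int total) {
        if (total == 0 || ones == 0 || ones == total) {
            return 0.0; // an empty or pure group carries no uncertainty
        }
        double p = (double) ones / total;
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    /** Information gain obtained by splitting the records on question `q`. */
    static double informationGain(int[][] answers, int[] diagnosis, int q) {
        int n = diagnosis.length;
        int[] count = new int[2];     // records that answered 0 / 1
        int[] positives = new int[2]; // of those, how many have diagnosis 1
        int totalPositives = 0;
        for (int i = 0; i < n; i++) {
            int a = answers[i][q];    // assumed to be 0 or 1
            count[a]++;
            positives[a] += diagnosis[i];
            totalPositives += diagnosis[i];
        }
        double remainder = 0.0;       // weighted entropy after the split
        for (int a = 0; a < 2; a++) {
            remainder += ((double) count[a] / n) * entropy(positives[a], count[a]);
        }
        return entropy(totalPositives, n) - remainder;
    }

    /** Index of the question with the highest information gain. */
    static int bestQuestion(int[][] answers, int[] diagnosis) {
        int best = 0;
        double bestGain = informationGain(answers, diagnosis, 0);
        for (int q = 1; q < answers[0].length; q++) {
            double gain = informationGain(answers, diagnosis, q);
            if (gain > bestGain) {
                best = q;
                bestGain = gain;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data (made up): question 0 predicts the diagnosis, question 1 does not.
        int[][] answers = { {0, 1}, {1, 1}, {0, 0}, {1, 0} };
        int[] diagnosis = { 0, 1, 0, 1 };
        System.out.println("Best question to split on: " + bestQuestion(answers, diagnosis));
    }
}
```

A full tree learner would apply this choice recursively to each subset of records until a subset is pure or no questions remain.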
For off-the-shelf implementations of classification algorithms, the best bet is Weka: http://www.cs.waikato.ac.nz/ml/weka/. I have studied Weka but not used it, as I discovered it a little too late.
I used a much simpler implementation called JadTi. It works pretty well for smaller data sets such as yours. I have used it quite a bit, so I can say so with confidence. JadTi can be found at:
http://www.run.montefiore.ulg.ac.be/~francois/software/jaDTi/
Having said all that, your challenge will be building a usable interface over the web. For that, working directly from the dataset is of limited use. That approach assumes you already have the training set, feed in the new test dataset in one step, and get the answers immediately.
But my application, and probably yours too, was a step-by-step user discovery, with features to go back and forth between the decision tree nodes.
To build such an application, I created a PMML document from my training set and built a Java engine that traverses the tree node by node, asking the user for an input (text/radio/list) at each node and using the values as inputs to the next possible node predicate.
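The traversal described above could look roughly like the following sketch. It is not the engine from this answer: the tree is built by hand instead of being read from a PMML document, answers are plain strings, and every name in it (`Node`, `Walker`, `demoTree`) is hypothetical. The point is only to show how keeping the visited nodes on a stack makes stepping back cheap.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class TreeWalkSketch {

    /** A tree node: either a question with one child per answer, or a leaf carrying a diagnosis. */
    static final class Node {
        final String question;   // null for a leaf
        final Integer diagnosis; // null for a question node
        final Map<String, Node> children = new HashMap<>();

        private Node(String question, Integer diagnosis) {
            this.question = question;
            this.diagnosis = diagnosis;
        }

        static Node question(String text) { return new Node(text, null); }

        static Node leaf(int diagnosis) { return new Node(null, diagnosis); }

        /** Attaches the child to follow when the user gives `answer`. */
        Node on(String answer, Node child) {
            children.put(answer, child);
            return this;
        }

        boolean isLeaf() { return diagnosis != null; }
    }

    /** Walks the tree one answer at a time and remembers the path so the user can step back. */
    static final class Walker {
        private final Deque<Node> path = new ArrayDeque<>();
        private Node current;

        Walker(Node root) { current = root; }

        Node current() { return current; }

        /** Follows the branch for `answer`; returns false and stays put if no branch matches. */
        boolean answer(String answer) {
            Node next = current.isLeaf() ? null : current.children.get(answer);
            if (next == null) {
                return false;
            }
            path.push(current);
            current = next;
            return true;
        }

        /** Returns to the previous question; returns false when already at the root. */
        boolean back() {
            if (path.isEmpty()) {
                return false;
            }
            current = path.pop();
            return true;
        }
    }

    /** A made-up two-question tree, for illustration only. */
    static Node demoTree() {
        return Node.question("Fever?")
                .on("no", Node.leaf(0))
                .on("yes", Node.question("Cough?")
                        .on("no", Node.leaf(0))
                        .on("yes", Node.leaf(1)));
    }

    public static void main(String[] args) {
        Walker walker = new Walker(demoTree());
        walker.answer("yes");
        walker.answer("no");
        System.out.println("Diagnosis: " + walker.current().diagnosis);
        walker.back();        // the user revises the last answer
        walker.answer("yes");
        System.out.println("Diagnosis after revising: " + walker.current().diagnosis);
    }
}
```

In a web application the `Walker` state would have to live in the user's session, and the node predicates would come from the PMML TreeModel rather than from string keys.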
The PMML standard can be found here: http://www.dmg.org/. You only need the TreeModel. The NetBeans XML plugin is a good schema-aware editor for PMML authoring. Altova XML can do a better job, but it costs $$.
It is also possible to use an RDBMS to store your dataset and create the PMML automagically! I have not tried that.
Good luck with your project, and please feel free to let me know if you need further input.
There are various algorithms that fall into the category of "machine learning", and which is right for your situation depends on the type of data you're dealing with.
If your data essentially consists of mappings from a set of questions to a set of diagnoses, each of which can be yes/no, then I think methods that could potentially work include neural networks and methods for automatically building a decision tree based on the training data.
I'd have a look at some of the standard texts, such as Russell & Norvig ("Artificial Intelligence: A Modern Approach") and other introductions to AI/machine learning, and see if you can easily adapt the algorithms they mention to your particular data. See also O'Reilly's "Programming Collective Intelligence" for sample Python code for one or two algorithms that might be adaptable to your case.
If you can read Spanish, the Mexican publishing house Alfaomega have also published various good AI-related introductions in recent years.
This is a classification problem, not really data mining. The general approach is to extract features from each data instance and let the classification algorithm learn a model from the features and the outcome (which for you is 0 or 1). Presumably each of your 30 questions would be its own feature.
There are many classification techniques you can use. Support vector machines are popular, as is maximum entropy. I haven't used the Java Machine Learning Library, but at a glance I don't see either of these in it. The OpenNLP project has a maximum entropy implementation. LibSVM has a support vector machine implementation. You'll almost certainly have to convert your data into a format that the library can understand.
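As an illustration of that conversion step, here is a small sketch that turns one record into a line of LibSVM's sparse text format (`label index:value ...`, with feature indices starting at 1 and zero-valued features omitted). It assumes the records are comma-separated with numeric answers and the diagnosis in the last cell, which the question does not specify. For test records with an empty diagnosis it writes a placeholder label of 0, since the format needs some label in that position. The class and method names are made up.

```java
/** Sketch: converts one comma-separated record into a LibSVM-format line. */
final class LibSvmFormatSketch {

    /**
     * Input: "answer1,answer2,...,answerN,diagnosis" (the diagnosis may be empty).
     * Output: "label 1:value 2:value ..." with zero-valued answers left out.
     */
    static String toLibSvmLine(String csvRow) {
        String[] cells = csvRow.split(",", -1); // -1 keeps a trailing empty diagnosis cell
        int last = cells.length - 1;
        String label = cells[last].trim();
        // Placeholder label for unlabeled test records (an assumption of this sketch).
        StringBuilder line = new StringBuilder(label.isEmpty() ? "0" : label);
        for (int i = 0; i < last; i++) {
            String cell = cells[i].trim();
            if (Double.parseDouble(cell) != 0.0) {
                line.append(' ').append(i + 1).append(':').append(cell);
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLibSvmLine("1,0,3,1")); // labeled training record
        System.out.println(toLibSvmLine("0,2,"));    // test record, diagnosis missing
    }
}
```

If the answers are categories rather than quantities, each one would normally be expanded into several 0/1 features before this step, because an SVM treats the values as numbers.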
Good luck!
Update: I agree with the other commenter that Russell and Norvig is a great AI book which discusses some of this. Bishop's "Pattern Recognition and Machine Learning" discusses classification issues in depth if you're interested in the down and dirty details.
Your task is a classic one for neural networks, which are intended first of all for exactly this kind of classification task. A neural network is fairly simple to implement in any language, and neural networks are the "mainstream" of "machine learning", closer to AI than anything else.
You just implement (or take an existing implementation of) a standard neural network, for example a multilayer network trained by error backpropagation, and feed it the training examples in a loop. After some time of such training, it will work on real examples.
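A minimal sketch of such a network is shown below: one hidden layer, sigmoid units, and plain online backpropagation on a squared-error loss. The layer size, learning rate, epoch count, seed, and toy data are arbitrary choices made for illustration, not recommendations, and a real project would more likely use a tested library and hold out part of the training data to check accuracy.

```java
import java.util.Random;

/** Sketch: a one-hidden-layer network with a single 0..1 output, trained by backpropagation. */
final class TinyMlpSketch {
    private final int inputs;
    private final int hidden;
    private final double[][] w1; // [hidden][inputs + 1], last column is the bias
    private final double[] w2;   // [hidden + 1], last entry is the bias
    private final double[] h;    // hidden activations from the latest forward pass

    TinyMlpSketch(int inputs, int hidden, long seed) {
        this.inputs = inputs;
        this.hidden = hidden;
        this.w1 = new double[hidden][inputs + 1];
        this.w2 = new double[hidden + 1];
        this.h = new double[hidden];
        Random rnd = new Random(seed);
        for (double[] row : w1) {
            for (int i = 0; i < row.length; i++) row[i] = rnd.nextDouble() - 0.5;
        }
        for (int j = 0; j < w2.length; j++) w2[j] = rnd.nextDouble() - 0.5;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Forward pass: returns a value in (0, 1), read as the score for diagnosis 1. */
    double predict(double[] x) {
        double out = w2[hidden];
        for (int j = 0; j < hidden; j++) {
            double z = w1[j][inputs];
            for (int i = 0; i < inputs; i++) z += w1[j][i] * x[i];
            h[j] = sigmoid(z);
            out += w2[j] * h[j];
        }
        return sigmoid(out);
    }

    /** One backpropagation step on a single example. */
    void trainOne(double[] x, double target, double rate) {
        double y = predict(x);
        double deltaOut = (y - target) * y * (1 - y);
        for (int j = 0; j < hidden; j++) {
            // Compute the hidden delta with the old w2[j], then update both layers.
            double deltaHidden = deltaOut * w2[j] * h[j] * (1 - h[j]);
            w2[j] -= rate * deltaOut * h[j];
            for (int i = 0; i < inputs; i++) w1[j][i] -= rate * deltaHidden * x[i];
            w1[j][inputs] -= rate * deltaHidden;
        }
        w2[hidden] -= rate * deltaOut;
    }

    /** Feeds the training examples to the network in a loop. */
    void train(double[][] xs, double[] targets, int epochs, double rate) {
        for (int e = 0; e < epochs; e++) {
            for (int n = 0; n < xs.length; n++) trainOne(xs[n], targets[n], rate);
        }
    }

    public static void main(String[] args) {
        // Toy data (made up): the diagnosis simply equals the first answer.
        double[][] xs = { {0, 0}, {0, 1}, {1, 0}, {1, 1} };
        double[] ys = { 0, 0, 1, 1 };
        TinyMlpSketch net = new TinyMlpSketch(2, 3, 42L);
        net.train(xs, ys, 5000, 0.5);
        for (double[] x : xs) {
            System.out.printf("%.0f %.0f -> %.3f%n", x[0], x[1], net.predict(x));
        }
    }
}
```

For the data in the question, the network would take 30 inputs, and a score above 0.5 would be read as diagnosis 1.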
You can read more about neural networks starting from here:
http://en.wikipedia.org/wiki/Neural_network
http://en.wikipedia.org/wiki/Artificial_neural_network
Also you can get links to many ready implementations here:
http://en.wikipedia.org/wiki/Neural_network_software