Choosing a machine learning platform
I have a data set of users and their loan repayment metrics (how long they took, how many installments, etc.). Now I want to analyse a user's past loan history and say, "If we loan them X, they will most likely repay over Y installments, over Z days."
Here is my take:
- The algorithm is a Clustering algorithm to group all users according to their repayment habits
- I want to use a SOM or K-Means
So my question is, what platforms are good for this? I have had a look at Mahout so far.
Comments (5)
Well worth taking a look at Weka - it's a reasonably mature open source toolkit with lots of machine learning algorithms, clustering included.
RapidMiner
- community edition available for free
- easy to use
- nice visualizations
http://rapid-i.com/content/view/181/190/
Another good library is scikits.learn (since renamed scikit-learn), a machine learning library for Python programmers.
There is an amazing book on this topic: "Programming Collective Intelligence" by Toby Segaran. It discusses different machine learning algorithms, clustering, etc. It also includes links to useful libraries and sample code.
Why clustering? This doesn't look like a clustering problem. You can run cluster analysis as a preprocessing phase to distinguish several groups of users (or you may omit this phase), but then you still need some kind of numeric prediction: both the number of installments and the number of days are numbers, so how are you going to get them from clustering?
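To make the point concrete, here is a minimal k-means sketch in plain Python. The `history` data and the `kmeans` helper are made up for illustration, and seeding the centroids with the first k points is just a simple deterministic choice. Note what comes out: a group label per user and a centroid per group, not a prediction for a new loan.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (an arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical history: (installments used, days to repay) per user.
history = [(3, 30), (12, 180), (4, 35), (10, 170), (3, 28), (11, 200)]
labels, centroids = kmeans(history, k=2)
```

Here `labels` separates the quick repayers from the slow ones. In practice the features would need scaling first, since the day counts dominate the distance.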
I suggest using regression for this task. Linear regression should fit your needs. If the dependent variables (number of installments and days) depend on the other attributes non-linearly, you can try polynomial regression or even algorithms like M5', which first build a decision tree and then add a regression model to each leaf of that tree.
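As a rough sketch of the linear regression idea, the snippet below fits a one-variable least-squares line in plain Python. The numbers and the names `fit_line`, `amounts` and `days` are invented for illustration; a real model would use more attributes and a proper toolkit.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical past loans: amount borrowed and days taken to repay.
amounts = [100, 200, 300, 400]
days = [22, 38, 61, 79]

slope, intercept = fit_line(amounts, days)
predicted_days = slope * 250 + intercept  # estimate for a loan of 250
```

The number of installments can be predicted the same way with a second fit, which gives the numeric answers the question asks for.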
If you have non-numeric attributes, you can also try classification. In this case you need to manually create the possible classes (e.g. number of installments: from 3 to 5, from 6 to 10, etc.) and then use any classification algorithm (C4.5, SVM, Naive Bayes, to mention a few).
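Creating the classes by hand could look something like the sketch below. The ranges 3-5 and 6-10 come from the example above; the "1-2" and "11+" bins and the name `installment_class` are assumptions added so that every count gets a label. The resulting labels would then be the target column handed to a classifier, which is not shown here.

```python
# Hypothetical bins: (lowest count, highest count, class label).
BINS = [(1, 2, "1-2"), (3, 5, "3-5"), (6, 10, "6-10")]

def installment_class(count):
    """Map a raw installment count to a class label."""
    if count < 1:
        raise ValueError("installment count must be at least 1")
    for low, high, label in BINS:
        if low <= count <= high:
            return label
    return "11+"  # everything above the last explicit bin

counts = [3, 7, 12, 5, 1]
labels = [installment_class(c) for c in counts]
```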
Actually, I don't think you have tons of data. I believe it is less than 50 MB overall, so there's no need to use monsters like Mahout, which are designed to process really, really large amounts of data. You can use Weka or RapidMiner for this purpose. Even if they cannot handle your data with the default configuration, just increase the memory available to the JVM and in 99% of cases they will be fine.