HBase & Mahout - Using HBase as a datastore/source for Mahout - classification
I'm working on a large text classification project and we have our text data (simple messages) stored in HBase.
We have two problems. First, we would like to use HBase as the source for Mahout classifiers, namely Bayes and Random Forests.
Second, we would like to store the generated model in HBase instead of using the in-memory approach (InMemoryBayesDatastore). As our sets grow we are running into problems with memory utilization, and we would like to test HBase as a viable alternative.
There seems to be little material available on using HBase with Mahout, or on whether it can be used as a data source at all. I'm using the Mahout 0.6 core API in Java, which has the InMemory datastore.
After a bit of digging, I believe there is (or was) an HBase Bayes datastore component: org.apache.mahout.classifier.bayes.datastore.HBaseBayesDatastore
See the older JavaDoc here: http://www.jarvana.com/jarvana/view/org/apache/mahout/mahout-core/0.3/mahout-core-0.3-javadoc.jar!/org/apache/mahout/classifier/bayes/datastore/HBaseBayesDatastore.html
However, looking at the latest documentation, this feature appears to have disappeared: https://builds.apache.org/job/Mahout-Quality/javadoc/
Is it still possible to use HBase as a data source for Bayes and Random Forests, and are there any previous use cases for this?
Thanks!
Comments (1)
It's not directly possible, no. You can revive the old implementation, dust it off, and probably make it work without much trouble. It was indeed removed to slim down and focus the project.
You can of course also look at exporting your data, in some form, into a representation or store that is directly supported.
Generally speaking, you can use HBase with Mahout by virtue of the fact that Mahout (mostly) uses Hadoop, and Hadoop can use HBase. That's not quite the situation here, though: this was a more direct integration point, and it has been deprecated.
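To make the export route a little more concrete, here is a minimal sketch in plain Java. It assumes the rows have already been read out of HBase as (label, message) pairs, and that the target is a text file with one document per line in the form label, tab, space-separated tokens. That line format is my understanding of what the older Mahout Bayes trainer accepted as text input, so check it against your Mahout version. The class and method names (LabeledLineExporter, toTrainingLine, toTrainingLines) are hypothetical and are not part of Mahout or HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Mahout or HBase.
final class LabeledLineExporter {
    private LabeledLineExporter() {}

    // Builds one training line: label, a tab, then lower-cased tokens
    // separated by single spaces. The line format is an assumption.
    static String toTrainingLine(String label, String message) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String token : message.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return label.trim() + "\t" + String.join(" ", tokens);
    }

    // Converts (label, message) rows into training lines, skipping rows
    // whose message is null or blank. In a real job the rows would come
    // from an HBase scan; here any Iterable of entries stands in for it.
    static List<String> toTrainingLines(Iterable<? extends Map.Entry<String, String>> rows) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> row : rows) {
            String message = row.getValue();
            if (message == null || message.trim().isEmpty()) {
                continue;
            }
            lines.add(toTrainingLine(row.getKey(), message));
        }
        return lines;
    }
}
```

The resulting lines can be written to a local file or to HDFS with whatever I/O you already use. Keeping the HBase read separate from the formatting step means the formatting can be checked without a running cluster.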