I've found a university site with some exercises and solutions for MapReduce that build only on Hadoop:
http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html
Additionally there are courses from Yahoo and Google:
http://developer.yahoo.com/hadoop/tutorial/
http://code.google.com/edu/parallel/index.html
To answer your question: all of these courses work on plain Hadoop.
Start with plain MapReduce at the beginner level; you can try Pig/Hive/HBase at the next level.
You will not be able to appreciate Pig/Hive/HBase unless you have struggled enough with plain MapReduce.
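If it helps to see what "plain MapReduce" looks like, below is the canonical WordCount job, written against the newer org.apache.hadoop.mapreduce API (the 0.20-style API mentioned in this thread). It is only a starting sketch; the input and output paths come in as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner is optional, but cheap here
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```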
I would also recommend the umd site. However, it looks like you are completely new to Hadoop, so I would recommend the book "Hadoop: The Definitive Guide" by Tom White. It's a bit dated (written for the 0.18 version rather than the latest 0.20+). Read it, do the examples, and you should be in a better place to judge how to structure your project.
I'm trying to practice some data mining algorithms using hadoop.
Use Apache Mahout, which runs on top of Hadoop: http://mahout.apache.org/
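For example, a minimal user-based recommender with Mahout's Taste API might look like the sketch below. The file name prefs.csv is a placeholder; the file is expected to contain one userID,itemID,rating triple per line, and the neighborhood size and user ID are arbitrary choices.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Preferences file: one "userID,itemID,rating" line per preference (placeholder name).
    DataModel model = new FileDataModel(new File("prefs.csv"));

    // Compare users by the Pearson correlation of their ratings.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Consider each user's 10 most similar neighbors.
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 1.
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
```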
Can I do this with HDFS alone, or do I need to use the sub-projects like hive/hbase/pig?
HDFS is the file system of Hadoop; it stands for Hadoop Distributed File System. Whatever tool in the Hadoop stack you use, it has to process the data sitting in this distributed environment, so you can't do much with HDFS alone. You need one of the computation techniques/tools such as MapReduce, Pig, or Hive.
Hope this helps!
You could also use Mahout: http://mahout.apache.org/
It is a machine-learning and data-mining library that can be used on top of Hadoop.
In general, Mahout currently supports (taken from the Mahout site) recommendation, clustering, and classification.
You can use R, Spark, and Hadoop together as a complete open-source solution.
R: a statistical language that provides many libraries out of the box.
Spark: a framework for data processing, faster than MR, with machine-learning algorithms.
Hadoop: scalable and robust data storage built on commodity hardware.
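As a small taste of that stack, here is a sketch of a Spark job in Java that reads a text file from HDFS and counts matching lines. The HDFS path, the keyword, and the app name are all placeholder choices.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkLineCount {
  public static void main(String[] args) {
    // local[*] runs on all local cores; point the master at a cluster for real data.
    SparkConf conf = new SparkConf().setAppName("line-count").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Path is a placeholder; any HDFS (or local) text file works.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/logs.txt");

    // Spark keeps the whole dataset as one RDD and pipelines the work,
    // instead of materializing intermediate results between steps as MR does.
    long errors = lines.filter(line -> line.contains("ERROR")).count();
    System.out.println("error lines: " + errors);

    sc.stop();
  }
}
```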
Hadoop is a tool for distributed/parallel data processing. Mahout is a data-mining/machine-learning framework that can work in standalone mode as well as in a Hadoop distributed environment. The decision to use it standalone or with Hadoop boils down to the size of the historical data that needs to be mined: if the data is on the order of terabytes or petabytes, you typically use Mahout with Hadoop.
Mahout supports three categories of machine-learning algorithms: recommendation, clustering, and classification. The book Mahout in Action, by Manning, does a very good job of explaining this. Weka is another, similar open-source project. All of these come under a category called machine-learning frameworks.
Refer to the blog that discusses a use case of how Mahout and the Hadoop Distributed File System work together. As a precursor to this, there is also a blog on the component architecture of how these tools fit together for a data-mining problem in the Hadoop/Mahout ecosystem.
It depends on your application. You need to understand the purpose of Hive, Pig, and HBase; then you can figure out where exactly they fit in your application. Each was created for specific reasons that you need to understand, and a simple Google search will turn up the details.
HDFS is a distributed storage system where you dump your data for further analytics.
Hive/Pig/MR/Spark/Scala etc. are tools for analyzing that data; you actually write your algorithms in one of these. You can't achieve 100% with Pig/Hive/HBase alone: you should know how to write MapReduce algorithms and how to plug them into Hive/Pig.
ETL tools:
Pig (scripting language)
Hive (SQL-like query language for structured data)
HBase (for unstructured data; enables near-real-time data analysis)
Sqoop: imports/exports data between an RDBMS and Hadoop
Flume: imports streaming data into Hadoop
Mahout: machine-learning algorithm library
While MapReduce operates in steps, Spark operates on the whole data set in one fell swoop.
The Hadoop Definitive Guide is a good starting point for beginners.
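To illustrate the "dump your data into HDFS" step, here is a short sketch using Hadoop's FileSystem API to copy a local file into the cluster; both file paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS (both paths are placeholders).
    fs.copyFromLocalFile(new Path("/tmp/ratings.csv"),
                         new Path("/user/hadoop/input/ratings.csv"));

    // List what landed in the target directory.
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop/input"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}
```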
You have to use different tools in the Hadoop ecosystem depending on their strengths.
Hive and HBase are good for handling structured data.
Sqoop is used to import structured data from a traditional RDBMS database such as Oracle or SQL Server.
Flume is used for processing unstructured data.
You can use a Content Management System to process unstructured and semi-structured data at terabyte or petabyte scale. If you are storing unstructured data, I prefer to keep the data itself in the CMS and the metadata in a NoSQL database like HBase (e.g. image ID, MD5SUM of the image); a sketch of this follows below.
To process streaming big data, you can use Pig.
Spark is a fast and general compute engine for Hadoop data. It provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Have a look at structured data and unstructured data handling in Hadoop.
Have a look at the complete Hadoop ecosystem, and at this SE question.