How to retrieve information from aggregated log data?
I would like to know how to retrieve data from aggregated logs. This is what I have:
- about 30GB daily of uncompressed log data loaded into HDFS (and this will grow soon to about 100GB)
This is my idea:
- each night this data is processed with Pig
- the logs are read and split, and a custom UDF extracts fields such as timestamp, url, and user_id (let's say that is all I need) from each log entry; the result is loaded into HBase (the log data will be stored indefinitely); a sketch of such a UDF follows this list
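For illustration, here is a minimal sketch of what such a Pig UDF could look like in Java. The class name, log format, and field positions are assumptions, not something from my actual setup:

```java
// Hypothetical Pig UDF that pulls (timestamp, url, user_id) out of a raw log line.
// The whitespace-delimited format "<timestamp> <url> <user_id> ..." is an assumption.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class ExtractLogFields extends EvalFunc<Tuple> {
    private static final TupleFactory FACTORY = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        String line = (String) input.get(0);
        String[] parts = line.split("\\s+");
        if (parts.length < 3) {
            return null; // skip malformed entries
        }
        Tuple out = FACTORY.newTuple(3);
        out.set(0, Long.parseLong(parts[0])); // timestamp (assumed epoch seconds)
        out.set(1, parts[1]);                 // url
        out.set(2, parts[2]);                 // user_id
        return out;
    }
}
```

In the nightly Pig script this would be called roughly as FOREACH raw GENERATE ExtractLogFields(line), with the result then stored into HBase.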
Then, if I want to know which users saw a particular page within a given time range, I can quickly query HBase without scanning the whole log data set on every query (and I want fast answers; minutes are acceptable). There will also be multiple queries running simultaneously.
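To sketch the access pattern I have in mind: if rows were keyed as the url plus a fixed-width timestamp, the "which users saw page X between t1 and t2" question becomes a bounded scan instead of a full pass over the table. The table name, column family, and key layout below are illustrative assumptions (and client API details vary across HBase versions):

```java
// Hypothetical HBase lookup: rows keyed as "<url>|<zero-padded timestamp>" so
// that all views of one page within a time range form one contiguous key slice.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PageViewQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "page_views"); // hypothetical table name

        String url = "/some/page";
        long from = 1280000000L; // start of the time range (epoch seconds)
        long to   = 1280086400L; // end of the time range

        // Zero-pad the timestamp so lexicographic key order matches numeric order.
        byte[] startRow = Bytes.toBytes(url + "|" + String.format("%010d", from));
        byte[] stopRow  = Bytes.toBytes(url + "|" + String.format("%010d", to));

        Scan scan = new Scan(startRow, stopRow); // start inclusive, stop exclusive
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                byte[] user = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("user_id"));
                System.out.println(Bytes.toString(user));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}
```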
What do you think about this workflow? Do you think that loading this information into HBase makes sense? What other options are there, and how do they compare to my solution?
I appreciate all comments/questions and answers. Thank you in advance.
Comments (1)
With Hadoop you are always doing one of two things (either processing or querying).
For what you are looking to do, I would suggest using Hive: http://hadoop.apache.org/hive/. You can take your data and then create an M/R job to process it and push it into Hive tables in whatever shape you like. You can even partition the data, which can be good for speed since, as you say, queries then do not have to look at data they do not need. From there you can query out your data results as you like. Here is a very good online tutorial: http://www.cloudera.com/videos/hive_tutorial
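To make the partitioning point concrete, here is a rough sketch of querying such a date-partitioned Hive table over JDBC. The table name, columns, partition column (dt), and connection details are made up for illustration, and the Hive JDBC driver class name has changed across releases:

```java
// Hypothetical JDBC query against a Hive table partitioned by date (dt), so the
// time-range predicate prunes partitions instead of scanning all of the logs.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveLogQuery {
    public static void main(String[] args) throws Exception {
        // Driver class name from Hive 0.x; newer releases use org.apache.hive.jdbc.HiveDriver.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Only the dt partitions inside the date range get read.
        ResultSet rs = stmt.executeQuery(
                "SELECT DISTINCT user_id FROM page_views " +
                "WHERE url = '/some/page' " +
                "AND dt BETWEEN '2010-07-01' AND '2010-07-07'");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```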
There are lots of ways to solve this, but it sounds like HBase is a bit of overkill unless you want to set up all the servers required to run it as an exercise in learning it. HBase would be a good fit if you had thousands of people simultaneously looking to get at the information.
You might also want to look into Flume, which is a new import server from Cloudera. It will get your files from some place straight into HDFS: http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/