Store parsed log data in Hadoop and export it to a relational DB
I need to parse both Apache access logs and Tomcat logs, one after the other, using MapReduce. A few fields are extracted from the Tomcat log and the rest from the Apache log. I need to merge/map the extracted fields based on the timestamp and export these mapped fields into a traditional relational database (e.g. MySQL).
I can parse and extract the information using regular expressions or Pig. The challenge I am facing is how to map the extracted information from both logs into a single aggregate format or file, and how to export this data to MySQL.
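As a rough illustration of the regex route, a mapper along these lines could pull fields out of each Apache access log line and key the record by its timestamp. This is a minimal sketch: the combined log format, the chosen fields, and the "APACHE" source tag are assumptions, and in practice both logs' timestamps would need to be normalized to a common format before being used as the join key.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ApacheLogMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Matches the start of a common/combined Apache access log line, e.g.:
    // 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
    private static final Pattern LOG_PATTERN = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+)");

    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = LOG_PATTERN.matcher(value.toString());
        if (!m.find()) {
            return; // skip malformed lines
        }
        // Group 2 is the timestamp; in a real job you would normalize it to the
        // same format/granularity as the Tomcat timestamps before keying on it.
        outKey.set(m.group(2));
        // Tag the record with its source so a later join can tell the logs apart.
        outValue.set("APACHE\t" + m.group(1) + "\t" + m.group(4) + "\t" + m.group(5));
        context.write(outKey, outValue);
    }
}
```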
A few approaches I am thinking of:
1) Write the MapReduce output from the parsed Apache access logs and Tomcat logs into separate files and merge those into a single file (again based on timestamp). Export this data to MySQL.
2) Use HBase or Hive to store the data in table format in Hadoop and export it to MySQL.
3) Write the output of MapReduce directly to MySQL using JDBC.
Which approach would be most viable? Please also suggest any other alternative solutions you know of.
Comments (1)
It's almost always preferable to have smaller, simpler MR jobs and chain them together than to have large, complex jobs. I think your best option is to go with something like #1. In other words:
You can probably perform the join and the transform (1 and 2) in the same step: use the map to transform and do a reduce-side join.
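For example, if both mappers emit records keyed by the normalized timestamp and tagged with their source (as in the sketch above, plus an analogous "TOMCAT"-tagged mapper for the Tomcat logs), the reduce side of the join could look roughly like this. The merged row layout is illustrative, and keeping only the last record per source for a given timestamp is a simplification.

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogJoinReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text timestamp, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String apacheFields = null;
        String tomcatFields = null;

        // Each value is tagged with its source by the mapper that produced it.
        for (Text value : values) {
            String[] parts = value.toString().split("\t", 2);
            if ("APACHE".equals(parts[0])) {
                apacheFields = parts[1];
            } else if ("TOMCAT".equals(parts[0])) {
                tomcatFields = parts[1];
            }
        }

        // Emit one unified row per timestamp, only when both sides are present.
        if (apacheFields != null && tomcatFields != null) {
            context.write(
                    new Text(timestamp + "\t" + apacheFields + "\t" + tomcatFields),
                    NullWritable.get());
        }
    }
}
```

A driver could wire the two mappers to their respective input directories with `MultipleInputs.addInputPath` and attach this reducer, so the parse, transform, and join happen in one chained job; the joined output file is then exported to MySQL as a separate step.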
It doesn't sound like you need or want the overhead of random access, so I wouldn't look at HBase. This isn't its strong point (although you could do it in the random-access sense by looking up each record in HBase by timestamp, merging the record in if it exists, or simply inserting if it doesn't, but this is comparatively very slow). Hive could be convenient for storing the "unified" result of the two formats, but you'd still have to transform the records into that format.
You absolutely do not want to have the reducers write to MySQL directly. This effectively creates a DDoS attack on the database. Consider a cluster of 10 nodes, each running 5 reducers: you'll have 50 concurrent writers to the same table. As you grow the cluster, you'll exceed the maximum number of connections very quickly and choke the RDBMS.
All of that said, ask yourself if it makes sense to put this much data into the database, if you're considering the full log records. This amount of data is precisely the type of case Hadoop itself is meant to store and process long term. If you're computing aggregates of this data, by all means, toss it into MySQL.
Hope this helps.