使用hadoop流写入不同的文件

发布于 2024-12-06 19:38:03 字数 1169 浏览 0 评论 0原文

我目前正在 10 个服务器的 hadoop 集群上处理大约 300 GB 的日志文件。我的数据保存在名为 YYMMDD 的文件夹中，因此每天都可以快速访问。

我的问题是，我今天刚刚发现日志文件中的时间戳采用 DST (GMT -0400)，而不是预期的 UTC。简而言之，这意味着logs/20110926/*.log.lzo包含从2011-09-26 04:00到2011-09-27 20:00的元素，并且它几乎破坏了对该数据所做的任何map/reduce（即生成统计数据）。

有没有办法执行映射/减少作业来正确地重新分割每个日志文件？据我所知，似乎没有办法使用流式传输输出文件 A 中的某些记录和输出文件 B 中的其余记录。

这是我当前使用的命令：

/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-D mapred.reduce.tasks=15 -D mapred.output.compress=true \
-D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
-mapper map-ppi.php -reducer reduce-ppi.php \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-file map-ppi.php -file reduce-ppi.php \
-input "logs/20110922/*.lzo" -output "logs-processed/20110922/"

我什么都不知道关于 java 和/或创建自定义类。我确实尝试了 http:// /blog.aggregateknowledge.com/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/ （几乎复制/粘贴了那里的内容）但我根本无法让它工作。无论我尝试什么，我都会收到“-outputformat：找不到类”错误。

非常感谢您的时间和帮助:)。

原文

I'm currently processing about 300 GB of log files on a 10 servers hadoop cluster. My data is being saved in folders named YYMMDD so each day can be accessed quickly.

My problem is that I just found out today that the timestamps I have in my log files are in DST (GMT -0400) instead of UTC as expected. In short, this means that logs/20110926/*.log.lzo contains elements from 2011-09-26 04:00 to 2011-09-27 20:00 and it's pretty much ruining any map/reduce done on that data (i.e. generating statistics).

Is there a way to do a map/reduce job to re-split every log files correctly? From what I can tell, there doesn't seem to be a way using streaming to send certain records in output file A and the rest of the records in output file B.

Here is the command I currently use:

/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-D mapred.reduce.tasks=15 -D mapred.output.compress=true \
-D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
-mapper map-ppi.php -reducer reduce-ppi.php \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-file map-ppi.php -file reduce-ppi.php \
-input "logs/20110922/*.lzo" -output "logs-processed/20110922/"

I don't know anything about java and/or creating custom classes. I did try the code posted at http://blog.aggregateknowledge.com/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/ (pretty much copy/pasted what was on there) but I couldn't get it to work at all. No matter what I tried, I would get a "-outputformat : class not found" error.

Thank you very much for your time and help :).

分享到QQ

分享到微博