I have an S3 bucket containing about 300 GB of log files in no particular order.
I want to partition this data for use in Hadoop/Hive using a date-time stamp, so that log lines for a particular day end up in the same S3 'folder'. For example, log entries for January 1st would be in files matching the following naming:
s3://bucket1/partitions/created_date=2010-01-01/file1
s3://bucket1/partitions/created_date=2010-01-01/file2
s3://bucket1/partitions/created_date=2010-01-01/file3
etc
What would be the best way for me to transform the data? Am I best off just running a single script that reads each file in turn and writes the data out to the right S3 location?
I'm sure there's a good way to do this using Hadoop; could someone tell me what that is?
What I've tried:
I tried using hadoop-streaming, passing in a mapper that collected all log entries for each date and then wrote them directly to S3, returning nothing to the reducer, but that seemed to create duplicates. (Using the above example, I ended up with 2.5 million entries for Jan 1st instead of 1.4 million.)
Does anyone have any ideas how best to approach this?
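For reference, here is a minimal sketch of the kind of streaming mapper described under "What I've tried". It is not the original script: it assumes each log line begins with an ISO-style date (YYYY-MM-DD), and instead of writing to S3 from inside the mapper it emits the date as the key and the raw line as the value, so the framework (or an identity reducer) can group lines for the same day.

#!/usr/bin/env python
# Sketch of a hadoop-streaming mapper for the job described above.
# This is not the original script: it assumes each log line begins with
# an ISO-style date ("2010-01-01 ..."), and it emits the date as the key
# rather than writing to S3 directly from inside the mapper.
import sys

def main():
    for raw in sys.stdin:
        line = raw.rstrip("\n")
        if not line:
            continue
        created_date = line[:10]  # "YYYY-MM-DD" -- adjust to the real log format
        # Emit "date <TAB> original line"; downstream steps can group by key.
        sys.stdout.write("%s\t%s\n" % (created_date, line))

if __name__ == "__main__":
    main()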
2 Answers
If Hadoop has free slots in the task tracker, it will run multiple copies of the same task. If your output format doesn't properly ignore the resulting duplicate output keys and values (which is possibly the case for S3; I've never used it), you should turn off speculative execution. If your job is map-only, set mapred.map.tasks.speculative.execution to false. If you have a reducer, set mapred.reduce.tasks.speculative.execution to false. Check out Hadoop: The Definitive Guide for more information.
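If you stick with the streaming job, those properties are passed at submit time with -D (the generic -D options have to come before the streaming-specific ones). A minimal sketch of a submit script follows, assuming a typical hadoop-streaming install; the jar path, input/output locations and mapper file are placeholders rather than values taken from the question.

import subprocess

# Hypothetical submit script for the streaming job; the jar path and the
# input/output URIs are placeholders -- substitute the real ones.
cmd = [
    "hadoop", "jar", "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
    # Generic -D options must appear before -input/-output/-mapper.
    "-D", "mapred.map.tasks.speculative.execution=false",
    "-D", "mapred.reduce.tasks.speculative.execution=false",
    "-input", "s3://bucket1/rawlogs/",
    "-output", "s3://bucket1/partitions/",
    "-mapper", "mapper.py",
    "-reducer", "NONE",        # map-only job, as in the question
    "-file", "mapper.py",
]
subprocess.check_call(cmd)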
Why not create an external table over this data, then use Hive to create the new table?
I haven't looked up the exact syntax, though, so you may need to check it against https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries.
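To make that concrete, here is a rough sketch of that approach, driven through the hive command-line client from Python. The table names, the raw-log location, and the assumption that the date is the first ten characters of each line are made up for illustration, and the exact HiveQL should be double-checked against the manual page linked above.

import subprocess

# Hypothetical table names and S3 locations -- adjust to the real layout.
# Depending on the Hadoop/S3 setup, the URI scheme may need to be s3n://.
hql = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (line STRING)
LOCATION 's3://bucket1/rawlogs/';

CREATE EXTERNAL TABLE IF NOT EXISTS partitioned_logs (line STRING)
PARTITIONED BY (created_date STRING)
LOCATION 's3://bucket1/partitions/';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Assumes the date is the first 10 characters of each log line.
INSERT OVERWRITE TABLE partitioned_logs PARTITION (created_date)
SELECT line, substr(line, 1, 10) AS created_date
FROM raw_logs;
"""

# Run the statements through the hive CLI (hive -e executes a quoted script).
subprocess.check_call(["hive", "-e", hql])

The dynamic-partition insert writes one created_date=YYYY-MM-DD directory per day under the partitioned table's location, which matches the layout described in the question.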