How should I partition the data in s3 for use with hadoop hive?

Posted on 2024-10-07 18:56:53

I have an s3 bucket containing about 300 GB of log files in no particular order.

I want to partition this data for use in hadoop-hive using a date-time stamp so that log-lines related to a particular day are clumped together in the same s3 'folder'. For example log entries for January 1st would be in files matching the following naming:

s3://bucket1/partitions/created_date=2010-01-01/file1
s3://bucket1/partitions/created_date=2010-01-01/file2
s3://bucket1/partitions/created_date=2010-01-01/file3

etc

What would be the best way for me to transform the data? Am I best just running a single script that reads in one file at a time and outputs the data to the right s3 location?

I'm sure there's a good way to do this using hadoop; could someone tell me what that is?

What I've tried:

I tried using hadoop-streaming by passing in a mapper that collected all log entries for each date and then wrote them directly to S3, returning nothing for the reducer, but that seemed to create duplicates. (Using the above example, I ended up with 2.5 million entries for Jan 1st instead of 1.4 million.)

Does anyone have any ideas how best to approach this?

Comments (2)

青春如此纠结 2024-10-14 18:56:54

If Hadoop has free slots in the task tracker, it will run multiple copies of the same task. If your output format doesn't properly ignore the resulting duplicate output keys and values (which is possibly the case for S3; I've never used it), you should turn off speculative execution. If your job is map-only, set mapred.map.tasks.speculative.execution to false. If you have a reducer, set mapred.reduce.tasks.speculative.execution to false. Check out Hadoop: The Definitive Guide for more information.
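
For a map-only streaming job like the one described in the question, the setting can be passed as a generic -D option on the command line. A rough sketch only: the jar name, S3 paths and mapper script below are placeholders, not taken from the original post:

# jar name, S3 paths and mapper script are placeholders
hadoop jar hadoop-streaming.jar \
    -D mapred.map.tasks.speculative.execution=false \
    -D mapred.reduce.tasks=0 \
    -input s3://bucket1/rawlogs/ \
    -output s3://bucket1/streaming-output/ \
    -mapper partition_mapper.py \
    -file partition_mapper.py

Note that the -D generic options have to come before the streaming-specific options (-input, -output, -mapper).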

冰葑 2024-10-14 18:56:54

Why not create an external table over this data, then use hive to create the new table?
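
The insert below selects from orig_external_table, which is assumed to already exist as an external table over the raw log files. As a minimal sketch (the tab-delimited two-column layout and the s3://bucket1/rawlogs/ location are assumptions, not something given in the question or the answer), it might be declared roughly like this:

-- hypothetical raw-log layout: two tab-separated fields
create external table orig_external_table (some_field string, `timestamp` string)
row format delimited fields terminated by '\t'
location 's3://bucket1/rawlogs/';

If the partition directories are meant to end up under s3://bucket1/partitions/ as in the question, the partitioned table created below would presumably also need a location clause pointing at that prefix.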

create table partitioned (some_field string, `timestamp` string) partitioned by (created_date date);
-- dynamic-partition insert: created_date is taken from the last column of the select
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table partitioned partition (created_date) select some_field, `timestamp`, to_date(`timestamp`) from orig_external_table;

In fact, I haven't looked up the syntax, so you may need to correct it with reference to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries.
