Importing multi-level log directories into Hadoop/Pig

Posted 2024-10-21 23:08:37


We store our logs in S3, and one of our (Pig) queries would grab three different log types. Each log type is in sets of subdirectories based upon type/date. For instance:

/logs/<type>/<year>/<month>/<day>/<hour>/lots_of_logs_for_this_hour_and_type.log*

My query would want to load all three types of logs for a given time. For instance:

type1 = load 's3:/logs/type1/2011/03/08' as ...
type2 = load 's3:/logs/type2/2011/03/08' as ...
type3 = load 's3:/logs/type3/2011/03/08' as ...
result = join type1 ..., type2, etc...

My queries would then run against all of these logs.
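Fleshed out (with a made-up schema, delimiter, and join key, since the real fields are elided above with "..."), the script would look something like:

-- Hypothetical schema (ts, user_id, payload) and join key; the real fields differ.
type1 = LOAD 's3:/logs/type1/2011/03/08' USING PigStorage('\t')
        AS (ts:chararray, user_id:long, payload1:chararray);
type2 = LOAD 's3:/logs/type2/2011/03/08' USING PigStorage('\t')
        AS (ts:chararray, user_id:long, payload2:chararray);
type3 = LOAD 's3:/logs/type3/2011/03/08' USING PigStorage('\t')
        AS (ts:chararray, user_id:long, payload3:chararray);
-- Join all three log types on the shared key and store the result.
result = JOIN type1 BY user_id, type2 BY user_id, type3 BY user_id;
STORE result INTO 's3:/output/2011/03/08';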

What is the most efficient way to handle this?

  1. Do we need to use bash script expansion? Not sure if this works with multiple directories, and I doubt it would be efficient (or even possible) if there were 10k logs to load.
  2. Do we create a service to aggregate all of the logs and push them to HDFS directly?
  3. Custom Java/Python importers?
  4. Other thoughts?

If you could also leave some example code, where appropriate, that would be helpful.

Thanks


Comments (3)

大海や 2024-10-28 23:08:37


Globbing is supported by default with PigStorage, so you could just try:

type1 = load 's3:/logs/type{1,2,3}/2011/03/08' as ..

or even

type1 = load 's3:/logs/*/2011/03/08' as ..
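As a usage sketch (the delimiter and schema below are made up), the glob drops straight into an ordinary load:

all_logs = LOAD 's3:/logs/type{1,2,3}/2011/03/08' USING PigStorage('\t')
           AS (ts:chararray, user_id:long, payload:chararray);

Note that a single globbed LOAD produces one relation; if the three types need to stay separate for the JOIN, you would still issue one load per type or carry a type field in the data.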

空城之時有危險 2024-10-28 23:08:37


I have a similar log system to yours; the only difference is that I analyze the logs by type rather than by date, so I would use:

type1 = load 's3:/logs/type1/2011/03/' as ...

to analyze that month's logs for type1 without mixing in type2. Since you are analyzing not by type but by date, I would recommend changing your structure to:

/logs/<year>/<month>/<day>/<hour>/<type>/lots_of_logs_for_this_hour_and_type.log*

That way you can load the daily (or monthly) data and then filter it by type, which would be more convenient.
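A minimal sketch of what loads against the restructured layout could look like (the delimiter, schema, and hour wildcard are assumptions):

-- All types for 2011-03-08 in one relation:
day_logs = LOAD 's3:/logs/2011/03/08' USING PigStorage('\t')
           AS (ts:chararray, user_id:long, payload:chararray);
-- Or narrow to a single type by globbing across the hour directories:
type1_day = LOAD 's3:/logs/2011/03/08/*/type1' USING PigStorage('\t')
            AS (ts:chararray, user_id:long, payload:chararray);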

陌路终见情 2024-10-28 23:08:37


If, like me, you are using Hive and your data is partitioned, you could use some of the loaders in PiggyBank (e.g. AllLoader) that support partitioning, as long as the directory structure you want to filter on looks like:

.../type=value1/...
.../type=value2/...
.../type=value3/...

You should then be able to LOAD the files and then FILTER BY type = 'value1'.

Example:

REGISTER piggybank.jar;
I = LOAD '/hive/warehouse/mytable' using AllLoader() AS ( a:int, b:int );
F = FILTER I BY type = 1 OR type = 2;