Pig and Cassandra integration: a simple distributed query takes a few minutes to complete. Is this normal?

I set up a test integration of Cassandra + Pig/Hadoop: 8 nodes are Cassandra + TaskTracker nodes, and 1 node is the JobTracker/NameNode.

I fired up the cassandra client and created the simple bit of data listed in the Readme.txt in the Cassandra distribution:

  [default@unknown] create keyspace Keyspace1;
  [default@unknown] use Keyspace1;
  [default@Keyspace1] create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
  [default@KS1] set Users[jsmith][first] = 'John';
  [default@KS1] set Users[jsmith][last] = 'Smith';
  [default@KS1] set Users[jsmith][age] = long(42)
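
Before moving on to Pig, the row can be double-checked from the same cassandra-cli session with a plain get, which should list the three columns just written:

  [default@Keyspace1] get Users[jsmith];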

Then I ran the sample Pig query listed in CASSANDRA_HOME (using pig_cassandra):

grunt> rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
grunt> cols = FOREACH rows GENERATE flatten(columns);
grunt> colnames = FOREACH cols GENERATE $0;
grunt> namegroups = GROUP colnames BY (chararray) $0;
grunt> namecounts = FOREACH namegroups GENERATE COUNT($1), group;
grunt> orderednames = ORDER namecounts BY $0;
grunt> topnames = LIMIT orderednames 50;
grunt> dump topnames;

It took about 3 minutes to complete.

    HadoopVersion   PigVersion   UserId   StartedAt             FinishedAt            Features
    1.0.0           0.9.1        root     2012-01-12 22:16:53   2012-01-12 22:20:22   GROUP_BY,ORDER_BY,LIMIT
Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MaxReduceTime   MinReduceTime   AvgReduceTime   Alias   Feature Outputs
job_201201121817_0010   8       1       12      6       9       21      21      21      colnames,cols,namecounts,namegroups,rows        GROUP_BY,COMBINER       
job_201201121817_0011   1       1       6       6       6       15      15      15      orderednames    SAMPLER 
job_201201121817_0012   1       1       9       9       9       15      15      15      orderednames    ORDER_BY,COMBINER       hdfs://xxxx/tmp/temp-744158198/tmp-1598279340,

Input(s):
Successfully read 1 records (3232 bytes) from: "cassandra://Keyspace1/Users"

Output(s):
Successfully stored 3 records (63 bytes) in: "hdfs://xxxx/tmp/temp-744158198/tmp-1598279340"

Counters:
Total records written : 3
Total bytes written : 63
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

There were no errors or warnings in the logs.

Is this normal, or is there something wrong?


Answer (枕头说它不想醒):


Yes, this is normal: a Map/Reduce job on Hadoop typically takes about a minute of startup overhead alone, and Pig generates multiple Map/Reduce jobs depending on the complexity of the script.
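
To see where those minutes go, Pig can be asked for the plan it will actually execute; running EXPLAIN on the final alias in the same grunt session might look like:

grunt> explain topnames;
-- the Map Reduce plan at the end of the output shows how the script is split into
-- chained jobs (matching the three jobs in the stats above: the GROUP BY with a
-- combiner, the ORDER BY sampler, and the ORDER BY itself), and each of those
-- jobs pays its own startup overhead.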
