Hive / Hadoop / Flatfile: what is an efficient way to combine and join rows?
id col1 col2 ... coln
---------------------
foo barA barB ...
foo barD barX
boo barA barC
foo barC barC
I'd like to combine this into 'collapsed' rows which look like this:
foo barA;barD;barC barB;barX;barC
boo barA barC
At the moment the source document is a Hive 'table' (which I suppose is essentially the same as a flat text file), and I am wondering what the most efficient way to accomplish this is.
EDIT: related earlier question (for SQL, alas, not Hive): Combine multiple rows into one space separated string
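One way to do this inside Hive itself, assuming the values are strings and you can live with the caveats noted in the comments, is the built-in collect_set UDAF combined with concat_ws. This is a sketch; src_table and the column names are placeholders for your actual table:

```sql
-- Sketch: collapse all rows sharing an id into a single row,
-- with each column's values joined by ';'.
-- Caveats: collect_set() removes duplicate values and does not
-- guarantee ordering; newer Hive versions offer collect_list(),
-- which keeps duplicates.
SELECT id,
       concat_ws(';', collect_set(col1)) AS col1,
       concat_ws(';', collect_set(col2)) AS col2
FROM src_table
GROUP BY id;
```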
1 Answer
If you are loading the data into Hive from a MapReduce job, you may be able to adjust that MR job to do the transformation for you and load the data into the table in the form you want (an array, ';'-delimited, etc.).
If you need to be able to update/adjust the data afterwards, Hive probably isn't the best option. You may want to look at HBase instead and do an 'aggregation' to generate the data in the shape you want before loading it into HBase. Whenever a row with the same Key/ColumnFamily/Column is written, HBase overwrites any existing cell, so the value is effectively 'updated'. I use this in production to generate data that is constantly updated throughout the day.
In either case, to restructure large quantities of data, you will probably want to use a MapReduce job and have it do the restructuring for you.
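The restructuring described above amounts to a reduce-side collapse: group rows by their id key and join each column's values. A minimal Python illustration of what such a reducer would do (the function name collapse_rows is my own, not from the original answer):

```python
from collections import OrderedDict

def collapse_rows(rows):
    """Group rows by their first field (the id) and join each
    column's values with ';' in arrival order, mimicking the
    reduce step of the MapReduce job described above."""
    grouped = OrderedDict()
    for row in rows:
        key, values = row[0], row[1:]
        if key not in grouped:
            # One accumulator list per column for this key.
            grouped[key] = [[] for _ in values]
        for acc, val in zip(grouped[key], values):
            acc.append(val)
    return [[key] + [";".join(acc) for acc in accs]
            for key, accs in grouped.items()]

rows = [
    ["foo", "barA", "barB"],
    ["foo", "barD", "barX"],
    ["boo", "barA", "barC"],
    ["foo", "barC", "barC"],
]
print(collapse_rows(rows))
# → [['foo', 'barA;barD;barC', 'barB;barX;barC'], ['boo', 'barA', 'barC']]
```

In a real Hadoop Streaming job the framework sorts by key before the reducer runs, so the reducer only needs to accumulate until the key changes; the dictionary here stands in for that grouping.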