Hive - How to write a create statement for an existing hdfs file of variable length?
So, I have an existing hdfs directory, containing a bunch of files. These files are all tab delimited.
I have a hive statement....
create external table
mytable(
key string,
name string,
address string,
ssn string)
row format delimited fields
terminated by '09' lines terminated by '10'
STORED AS TEXTFILE location '/MyHiveFiles/data';
This works pretty well, except for all of the extra fields. The file also contains between 0 and x extra data elements after the ssn field. They are still tab delimited, and '\n' record delimited. I could add a bunch of 'valuex string' (where x is the increment of extra elements)... but I don't know how many there might eventually be, and that seems messy anyway.
Is there a way to tell hive to just put all the remaining fields of that row into ONE field, like 'others string'? Even if it is tab delimited in the hive return value... I am ok with that.
Thanks, in advance.
Comments (1)
Creating a table in Hive essentially just creates the metadata telling Hive how to interpret the files. Hive doesn't 'know' about the rest of the data.
If you add another column as an array and specify
COLLECTION ITEMS TERMINATED BY '\0002'
(\0002 or some other character) then the tabs will not terminate the array collection and should all be returned as a single element, including tabs. Haven't tested this yet. :)
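For reference, the suggestion above might look roughly like the statement below. This is only a sketch and, like the suggestion itself, untested. The column name 'others' is made up for illustration, '\002' (Ctrl-B) is assumed to be the character meant by '\0002', and the field and line delimiters are written as '\t' and '\n'. One thing to verify: Hive splits each row on the field delimiter before it looks at the collection delimiter, so with tab as the field delimiter the array column may only receive the fifth field and the rest may be dropped. The LazySimpleSerDe property 'serialization.last.column.takes.rest' appears intended for this case; whether it takes effect when set on a table like this is an assumption to check on your Hive version.

```sql
-- Sketch of the answer's idea: add one trailing array column whose
-- collection delimiter never occurs in the data.
-- 'others' is a hypothetical column name, not from the original post.
CREATE EXTERNAL TABLE mytable (
  key     STRING,
  name    STRING,
  address STRING,
  ssn     STRING,
  others  ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY '\002'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/MyHiveFiles/data';

-- Assumption to verify: ask the SerDe to give the last column
-- everything up to the end of the line, tabs included.
ALTER TABLE mytable
  SET SERDEPROPERTIES ('serialization.last.column.takes.rest' = 'true');

-- If that works, others[0] should hold the remaining tab-delimited text.
SELECT key, ssn, others[0] FROM mytable LIMIT 10;
```

Because the table is external, dropping and recreating it while experimenting only changes the metadata; the files under /MyHiveFiles/data are left alone.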