Loading files with PIG
I am very new to PIG and I am having what feels like a very basic problem.
I have a line of code that reads:
A = load 'Sites/trial_clustering/shortdocs/*'
AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);
where each file is basically a line of 4 comma-separated words. However, PIG is not splitting this into the 4 words. When I do dump A, I get: (Money, coins, loans, debt,,,)
I have tried googling and I cannot seem to find what format my file needs to be in so that PIG will interpret it properly. Please help!
Comments (1)
Your problem is that Pig, by default, loads files delimited by tabs, not commas. What's happening is that "Money, coins, loans, debt" is getting stuck in your first column, word1. When you print it, you get the illusion that you have multiple columns, but really the first one is filled with your whole line and the others are null. To fix this, you should specify PigStorage with a comma as the delimiter in your load statement.
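A minimal sketch of such a load statement, reusing the path and schema from the question. Treat it as an illustration rather than the answerer's exact code; the only change from the question's line is the USING PigStorage(',') clause:

```pig
-- Split each input line on commas instead of the default tab delimiter
A = LOAD 'Sites/trial_clustering/shortdocs/*' USING PigStorage(',')
    AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);

-- Each tuple should now have four populated fields
DUMP A;
```

If the files really contain a space after each comma, as the dumped output suggests, the later fields will probably keep that leading space. Pig's built-in TRIM function can strip it in a following FOREACH ... GENERATE step.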