Hive regexp_extract weirdness
I am having some problems with regexp_extract:
I am querying a tab-delimited file; the column I'm checking has strings that look like this:
abc.def.ghi
Now, if I do:
select distinct regexp_extract(name, '[^.]+', 0) from dummy;
MR job runs, it works, and I get "abc" from index 0.
But now, if I want to get "def" from index 1:
select distinct regexp_extract(name, '[^.]+', 1) from dummy;
Hive fails with:
2011-12-13 23:17:08,132 Stage-1 map = 0%, reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Log file says:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
Am I doing something fundamentally wrong here?
Thanks,
Mario
Comments (2)
From the docs https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF it appears that regexp_extract() works on each record/line and extracts the data you want from it.
It seems to stop at the first match rather than matching globally. Therefore the index refers to a capture group, not to the n-th match:
0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...
Paraphrased from the manual:
So, in your case, to get the text after the dot, something like this might work:
regexp_extract(name, '\.([^.]+)', 1)
or this
regexp_extract(name, '[.]([^.]+)', 1)
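To see why index 1 fails against '[^.]+' while index 0 works, here is a minimal Java sketch. It assumes Hive's regexp_extract hands the pattern to java.util.regex and asks for the group at the given index; the class GroupIndexDemo and its extract helper are hypothetical names made up for illustration, not Hive code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Hypothetical stand-in for regexp_extract: first match only, then the requested group.
    // Returning null when nothing matches is this sketch's choice, not necessarily Hive's behavior.
    static String extract(String input, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        System.out.println(extract(name, "[^.]+", 0));      // whole match: abc
        System.out.println(extract(name, "[.]([^.]+)", 1)); // capture group 1: def
        try {
            extract(name, "[^.]+", 1); // the pattern defines no capture group 1
        } catch (IndexOutOfBoundsException e) {
            System.out.println("no such group: " + e.getMessage());
        }
    }
}
```

The sketch uses the [.] form because it sidesteps backslash escaping; in a Hive string literal, '\.' may need to be written '\\.' depending on how the literal is unescaped.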
Edit:
I got re-interested in this; just an FYI, there could be a shortcut/workaround for you.
It looks like you want a particular segment of a string separated by the dot (.) character, which is almost like a split. More than likely, the regex engine overwrites a capture group's value each time the group is repeated by a quantifier.
You can take advantage of that with something like this:
Returns the first segment (abc) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){1}', 1)
Returns the second segment (def) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){2}', 1)
Returns the third segment (ghi) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){3}', 1)
The index doesn't change (because the index still refers to capture group 1); only the regex repetition count changes.
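The repeated-group trick can be checked outside Hive with a small Java sketch. It assumes Hive's regex behavior matches java.util.regex, where a capture group inside a quantified group keeps the text from its last iteration. The class SegmentByRepetition and its segment helper are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentByRepetition {
    // Hypothetical helper: returns segment n (1-based) using the repeated-group pattern,
    // or null when the pattern does not match (a choice made for this sketch).
    static String segment(String input, int n) {
        Pattern p = Pattern.compile("^(?:([^.]+)\\.?){" + n + "}");
        Matcher m = p.matcher(input);
        // Group 1 holds whatever its last iteration captured.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + " -> " + segment("abc.def.ghi", n));
        }
    }
}
```

Only the repetition count changes between calls; the group index passed to group() stays at 1, mirroring the regexp_extract calls above.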
Some notes:
This regex
^(?:([^.]+)\.?){n}
has problems though. It requires that there be something between the dots in each segment, or the regex won't match, as with the string
...
It could be this instead:
^(?:([^.]*)\.?){n}
but this will match even if there are fewer than n-1 dots, including the empty string. This is probably not desirable.
There is a way to do it that doesn't require text between the dots, but still requires at least n-1 dots.
This uses a lookahead assertion and capture buffer 2 as a flag:
^(?:(?!\2)([^.]*)(?:\.|$())){2}
Everything else is the same. So, if it uses Java-style regex, then this should work:
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){2}', 1)
Change {2} to whichever segment is needed (this one returns segment 2). It still returns capture buffer 1 after the {N}th iteration.
Here it is broken down
If the engine doesn't support assertions, then this won't work!
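As a rough check of the flag-based pattern, the Java sketch below builds it piece by piece with a comment on each part. It relies on java.util.regex semantics, where a backreference to a group that has not captured anything fails to match, so (?!\2) passes until the empty group 2 is set at end of input. Whether Hive behaves identically is an assumption; SegmentWithFlag and segment are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentWithFlag {
    // Hypothetical helper: returns segment n (1-based), allowing empty segments,
    // or null when the input has fewer than n segments (a choice made for this sketch).
    static String segment(String input, int n) {
        String regex =
              "^(?:"          // from the start of input, repeat the following n times
            + "(?!\\2)"       // stop if group 2 (the end-of-input flag) was already set
            + "([^.]*)"       // group 1: the segment text, possibly empty
            + "(?:\\.|$())"   // a dot, or end of input, which sets the empty group 2
            + "){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(segment("abc.def.ghi", 2)); // def
        System.out.println(segment("abc.def.ghi", 4)); // null: only three segments
        System.out.println("[" + segment("a..c", 2) + "]"); // []: empty middle segment
    }
}
```

Once the end of input has been consumed, group 2 is set to the empty string, so the next iteration's lookahead rejects it; that is what enforces the "at least n-1 dots" requirement.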
I think you have to make 'groups', no?
(untested)
I think it behaves like the Java library, so this should work; let me know though.