Hive regexp_extract weirdness

Posted on 2024-12-21 10:31:58

I am having some problems with regexp_extract:

I am querying a tab-delimited file; the column I'm checking has strings that look like this:

abc.def.ghi

Now, if I do:

select distinct regexp_extract(name, '[^.]+', 0) from dummy;

The MR job runs, it works, and I get "abc" from index 0.

But now, if I want to get "def" from index 1:

select distinct regexp_extract(name, '[^.]+', 1) from dummy;

Hive fails with:

2011-12-13 23:17:08,132 Stage-1 map = 0%,  reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

Log file says:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row

Am I doing something fundamentally wrong here?

Thanks,
Mario

鹤仙姿 2024-12-28 10:31:58

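According to the docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), regexp_extract() is applied to each record/row and returns the part of the string that the pattern extracts.

It seems to stop at the first match rather than matching globally, so the index refers to a capture group within that match, not to the n-th match. Your pattern '[^.]+' has no capture groups, so index 1 points at a group that doesn't exist, which would explain the runtime error.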

0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...

Paraphrased from the manual:

regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                  ^    ^   
               groups             1    2

This returns 'bar'.
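
To make the indexing concrete, here is a minimal Java sketch of what regexp_extract appears to do, assuming Hive hands the pattern to java.util.regex, looks for the first match, and returns the requested group. The class and method names are hypothetical, and returning an empty string when nothing matches is an assumption about Hive's behavior rather than something stated above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for Hive's regexp_extract, assuming java.util.regex semantics.
class RegexpExtractSketch {
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        // Only the first match is considered; index picks a capture group inside it.
        // Assumption: an empty string is returned when there is no match.
        return m.find() ? m.group(index) : "";
    }

    public static void main(String[] args) {
        // Index 0 is the whole first match.
        System.out.println(regexpExtract("abc.def.ghi", "[^.]+", 0));       // abc
        // The manual's example: group 2 is "(bar)".
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 2)); // bar
        // '[^.]+' has no capture groups, so asking for group 1 throws.
        try {
            regexpExtract("abc.def.ghi", "[^.]+", 1);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("group 1 does not exist: " + e.getMessage());
        }
    }
}
```

If Hive behaves the same way, that exception inside the map task is a plausible source of the "Hive Runtime Error while processing row" message.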

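So, in your case, to get the text after the first dot, something like this might work (the backslash is doubled because, going by the manual's note that '\s' matches the letter s, Hive unescapes the string literal before the regex engine sees it):
regexp_extract(name, '\\.([^.]+)', 1)
or this, which avoids the escaping question altogether:
regexp_extract(name, '[.]([^.]+)', 1)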
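
As a quick check of these patterns, the following Java sketch runs them through java.util.regex, on the assumption that Hive's engine behaves the same way. The helper `extract` is hypothetical, and the strings shown are the regexes as the engine receives them (Java source needs its own doubled backslash). The last case shows what would happen if the backslash were lost before the pattern reached the engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentAfterFirstDot {
    // Hypothetical helper: requested group of the first match, or "" if nothing matches (assumed).
    static String extract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : "";
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        // Escaped dot followed by a captured run of non-dot characters.
        System.out.println(extract(name, "\\.([^.]+)", 1)); // def
        // Same idea with a character class instead of an escape.
        System.out.println(extract(name, "[.]([^.]+)", 1)); // def
        // Without the backslash, '.' matches any character, so group 1 is "bc".
        System.out.println(extract(name, ".([^.]+)", 1));   // bc
    }
}
```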

edit

I got interested in this again. Just FYI, there may be a shortcut/workaround for you.

It looks like you want one particular segment of a dot-separated string, which is almost like a split.
More than likely, the regex engine overwrites a capture group each time the group is repeated by a quantifier, so the group ends up holding whatever its last repetition matched.
You can take advantage of that with something like this:

In each call the backslash is doubled for the Hive string literal.

Returns the first segment of abc.def.ghi ('abc'):
regexp_extract(name, '^(?:([^.]+)\\.?){1}', 1)

Returns the second segment of abc.def.ghi ('def'):
regexp_extract(name, '^(?:([^.]+)\\.?){2}', 1)

Returns the third segment of abc.def.ghi ('ghi'):
regexp_extract(name, '^(?:([^.]+)\\.?){3}', 1)

The index doesn't change (it still refers to capture group 1); only the repetition count in the regex changes.
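
The overwrite-on-repeat behavior can be seen in a small Java sketch, assuming Hive uses java.util.regex underneath. The method name `nthSegment` is hypothetical; it builds the same pattern with the repetition count filled in and always reads group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupSketch {
    // Hypothetical helper: group 1 after n repetitions, or "" if the pattern does not match.
    static String nthSegment(String subject, int n) {
        Pattern p = Pattern.compile("^(?:([^.]+)\\.?){" + n + "}");
        Matcher m = p.matcher(subject);
        // Group 1 holds whatever its last repetition captured.
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            // Prints abc, def, ghi on separate lines.
            System.out.println(nthSegment("abc.def.ghi", n));
        }
    }
}
```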

Some notes:

This regex, ^(?:([^.]+)\.?){n}, has problems though.
It requires at least one character between the dots, so it fails on input such as a..c when n is 2 or more.
It could be changed to ^(?:([^.]*)\.?){n}, but that matches even when there are fewer than n-1 dots,
including the empty string. This is probably not desirable.
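
These caveats can be probed with a short Java sketch, again assuming java.util.regex is a fair model of Hive's engine. The helper `lastCapture` is hypothetical and returns null when there is no match, so that "no match" can be told apart from an empty capture. The 'abc' case for the + variant is an extra observation beyond the notes above: since the dot is optional, backtracking can split a single segment in two.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for probing the two patterns discussed above.
class SegmentPatternCaveats {
    // body goes inside (?: ... ){n}; returns group 1, or null if there is no match.
    static String lastCapture(String body, int n, String subject) {
        Matcher m = Pattern.compile("^(?:" + body + "){" + n + "}").matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String plus = "([^.]+)\\.?"; // needs at least one char per segment
        String star = "([^.]*)\\.?"; // allows empty segments

        // Empty second segment: the + variant does not match at all.
        System.out.println(lastCapture(plus, 2, "a..c"));             // null
        // No dots: backtracking splits "abc" into "ab" and "c".
        System.out.println(lastCapture(plus, 2, "abc"));              // c
        // The * variant matches the empty string and captures "".
        System.out.println("[" + lastCapture(star, 2, "") + "]");     // []
        // It also matches with fewer than n-1 dots, capturing "".
        System.out.println("[" + lastCapture(star, 3, "abc") + "]");  // []
    }
}
```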

There is a way to do it that doesn't require text between the dots but still requires at least n-1 dots.
It uses a negative lookahead assertion, with capture buffer 2 acting as a flag.

^(?:(?!\2)([^.]*)(?:\.|$())){2} , everything else is the same.

So, if Hive uses Java-style regex, this should work (with the backslashes doubled for the Hive string literal):
regexp_extract(name, '^(?:(?!\\2)([^.]*)(?:\\.|$())){2}', 1)
Change {2} to whichever segment is needed (this one returns segment 2).

It still returns capture buffer 1 as it stands after the N-th iteration.

Here it is broken down

^                # Beginning of string
 (?:             # Grouping
    (?!\2)            # Assertion: capture buffer 2 is UNDEFINED
    ( [^.]*)          # Capture buffer 1: zero or more non-dot chars
    (?:               # Grouping
        \.                # Dot character
      |                 # or,
        $ ()              # End of string; the empty group sets capture buffer 2 (DEFINED), which blocks any further iteration
    )                 # End grouping
 ){3}            # End grouping, repeat group exactly 3 (or N) times (overwrites capture buffer 1 each time)

If the engine doesn't support lookahead assertions, this won't work!
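
Here is a minimal Java sketch of the flag-group pattern, assuming Hive's engine matches java.util.regex (which accepts the forward reference \2 and treats a backreference to an unset group as a failed match). The method name `nthSegment` is hypothetical, and it returns null on no match so that an empty segment is distinguishable from a failure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NthSegmentWithFlagGroup {
    // Hypothetical helper: the n-th dot-separated segment, or null if there are fewer than n.
    static String nthSegment(String subject, int n) {
        // (?!\2) only succeeds while group 2 is unset; "$()" sets it at end of string.
        String regex = "^(?:(?!\\2)([^.]*)(?:\\.|$())){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(nthSegment("abc.def.ghi", 2));        // def
        System.out.println(nthSegment("abc.def.ghi", 3));        // ghi
        // Only three segments exist, so a fourth is not found.
        System.out.println(nthSegment("abc.def.ghi", 4));        // null
        // An empty segment between two dots is returned as "".
        System.out.println("[" + nthSegment("a..c", 2) + "]");   // []
        // No dot at all: there is no second segment.
        System.out.println(nthSegment("abc", 2));                // null
    }
}
```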

-黛色若梦 2024-12-28 10:31:58

I think you have to make 'groups', no?

select distinct regexp_extract(name, '([^.]+)', 1) from dummy;

(untested)

I think it behaves like the Java library, so this should work. Let me know, though.
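
For what it's worth, a quick check with java.util.regex (assuming Hive really does behave like the Java library) suggests that adding the group avoids the error but yields 'abc' rather than 'def', because group 1 of the first match is the first segment. The class and method names below are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstGroupCheck {
    // Hypothetical helper: group 1 of the first match, or "" if nothing matches (assumed).
    static String firstGroup(String subject, String regex) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // The first match starts at the beginning of the string, so group 1 is "abc".
        System.out.println(firstGroup("abc.def.ghi", "([^.]+)")); // abc
    }
}
```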
