
Hadoop/Pig regular expression matching

Posted 2024-11-01 19:18:09

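This is an odd situation, but I'm looking for a way to filter using something like MATCHES, except against a list of patterns that is not known in advance (and of unknown length).

That is, given two input files, one with numbers, A: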

xxxx

yyyy

zzzz

zzyy

...etc...

and another with patterns, B:

xx.*

yyy.*

...etc...

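how can I filter the first input by all of the patterns in the second?

If I knew all the patterns beforehand, I could write A = FILTER A BY (num MATCHES 'somepattern.*' OR num MATCHES 'someotherpattern'....);

The problem is that I don't know them beforehand, and since they are patterns rather than simple strings, I can't just use a join/group (at least as far as I can tell). Maybe some odd nested FOREACH... thing? Any ideas?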


淡淡の花香 2024-11-08 19:18:09

If you use |, which acts as OR (alternation) in a regex, you can build a single pattern out of the individual patterns.

(xx.*|yyy.*|zzzz.*)

This checks whether the value matches any of the patterns.
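
To illustrate what the combined pattern does, here is a minimal Python sketch run against the sample values from the question. It is not Pig code. It assumes that Pig's MATCHES, like Java's String.matches, has to match the whole value, which is why it uses fullmatch rather than search.

```python
import re

# The combined pattern from the answer; the sample values come from the question.
combined = re.compile(r"(xx.*|yyy.*|zzzz.*)")
values = ["xxxx", "yyyy", "zzzz", "zzyy"]

# fullmatch requires the entire string to match one of the alternatives
# (assumption: this mirrors the whole-string semantics of Pig's MATCHES).
kept = [v for v in values if combined.fullmatch(v)]

print(kept)  # ['xxxx', 'yyyy', 'zzzz']
```

"zzyy" is dropped because none of the three alternatives matches it from start to end.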

Edit:
To create the combined regex pattern:
* Create a string starting with (
* Read in each line (assuming each line is a pattern) and append it to the string, followed by a |
* When done reading lines, remove the last character (which will be an unneeded |)
* Append a )

This creates a single regex pattern that checks against all the patterns in the input file. (Note: this assumes the file contains valid patterns.)
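
The steps above could be sketched in Python roughly as follows. The function name build_combined_pattern is hypothetical, and the sketch adds one thing the answer does not mention: it strips line endings and skips blank lines, on the assumption that the pattern file has one pattern per line.

```python
import re


def build_combined_pattern(lines):
    """Join one-pattern-per-line input into a single alternation.

    Hypothetical helper: follows the answer's steps literally.
    """
    combined = "("                      # start the string with (
    for line in lines:
        pattern = line.strip()          # drop the trailing newline
        if not pattern:
            continue                    # assumption: ignore blank lines
        combined += pattern + "|"       # append the pattern followed by |
    if combined.endswith("|"):
        combined = combined[:-1]        # remove the unneeded final |
    return combined + ")"               # append )


# Stand-ins for the lines of file B and the values of file A.
pattern_lines = ["xx.*\n", "yyy.*\n"]
numbers = ["xxxx", "yyyy", "zzzz", "zzyy"]

combined = build_combined_pattern(pattern_lines)
matcher = re.compile(combined)
kept = [n for n in numbers if matcher.fullmatch(n)]

print(combined)  # (xx.*|yyy.*)
print(kept)      # ['xxxx', 'yyyy']
```

Note that an empty pattern list yields "()", which matches only the empty string, so it may be worth handling that case separately. How the resulting string reaches the Pig script is not covered by the answer; one possibility is Pig's parameter substitution, so that the script can refer to the pattern by a parameter name inside the MATCHES expression.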
