Java or Pig regex to strip values from a UserAgent string
I need to strip out the third and subsequent values in the 'bracketed' component of the user agent string.
In order to get
Mozilla/4.0 (compatible; MSIE 8.0)
from
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)
I can do this successfully with the following sed command:
sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'
I need to get the same result in Apache Pig, which uses Java regular expressions.
Could anybody help me rewrite the above sed regular expression for Java?
Something like:
new = FOREACH userAgent GENERATE FLATTEN(EXTRACT(userAgent, 'JAVA REGEX?')) as (term:chararray);
Comments (3)
I don't use Pig, but a look through the docs reveals a REPLACE function which wraps Java's replaceAll() method. Try it with a Java version of your pattern. That should match the whole parenthesized portion of the UserAgent string and replace its contents with just the first two semicolon-separated terms, just like your sed command does.
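The snippet originally posted with this answer is not preserved here, so what follows is only a sketch of the idea in plain Java, not the answerer's code. It assumes a direct translation of the sed pattern: in Java regex syntax a bare `(` opens a group and `\(` matches a literal parenthesis (the reverse of sed's basic syntax), `+` needs no backslash, and the back-reference in the replacement is `$1` rather than `\1`. The class and method names are made up for illustration.

```java
public class UserAgentReplace {
    // Java translation of: sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'
    // \\( and \\) match the literal parentheses; the inner (...) is capture group 1.
    private static final String BRACKETED = "\\(([^;]+; [^;]+)[^)]*\\)";

    // Hypothetical helper: keeps only the first two "; "-separated terms
    // inside the parenthesized part of the user agent string.
    static String trim(String userAgent) {
        return userAgent.replaceAll(BRACKETED, "($1)");
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; "
                + "Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727)";
        System.out.println(trim(ua)); // prints: Mozilla/4.0 (compatible; MSIE 8.0)
    }
}
```

In Pig the same pattern and the `($1)` replacement would presumably be passed to REPLACE as string arguments. How many backslashes the pattern needs inside a Pig string literal is something I have not verified, so treat that part as an assumption.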
In Java, if you use the Matcher class, you can extract the capturing group. That appears to do what you want, at least for the test case you provided.
Hmm... I seem to have answered the wrong question, since you were asking how to do this from Pig, not straight Java.
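The code that went with this answer is missing from the page, so here is a rough sketch of the Matcher approach rather than the original. It assumes the same translated pattern as the sed command; `UserAgentMatcher`, `firstTwoTerms` and `shorten` are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserAgentMatcher {
    // Group 1 captures the first two "; "-separated terms inside the parentheses.
    private static final Pattern BRACKETED =
            Pattern.compile("\\(([^;]+; [^;]+)[^)]*\\)");

    // Returns the captured terms, or null when the input has no matching bracketed part.
    static String firstTwoTerms(String userAgent) {
        Matcher m = BRACKETED.matcher(userAgent);
        return m.find() ? m.group(1) : null;
    }

    // Rebuilds the string around the match, keeping only the captured terms.
    static String shorten(String userAgent) {
        Matcher m = BRACKETED.matcher(userAgent);
        if (!m.find()) {
            return userAgent; // nothing to shorten
        }
        return userAgent.substring(0, m.start())
                + "(" + m.group(1) + ")"
                + userAgent.substring(m.end());
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)";
        System.out.println(firstTwoTerms(ua)); // prints: compatible; MSIE 8.0
        System.out.println(shorten(ua));       // prints: Mozilla/4.0 (compatible; MSIE 8.0)
    }
}
```

Extracting group 1 is closer to what an EXTRACT-style call would return, while `shorten` reproduces the sed output for the whole string.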
Since neither of the two suggested solutions seems to work in Pig, my workaround is to pipe the data through the sed command using Pig's streaming.
This works well, but I would still prefer a native Pig solution (using the EXTRACT or REPLACE function).