Java or Pig regex to strip values from a UserAgent string

Posted on 2024-12-17 19:04:11

I need to strip out the third and subsequent values in the 'bracketed' component of the user agent string.

In order to get

Mozilla/4.0 (compatible; MSIE 8.0)

from

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)

I can do this successfully with the following sed command:

 sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'

I need to get the same result in Apache Pig with a Java regex.
Could anybody help me to re-write the above sed regular expression into Java?

Something like:

new = FOREACH userAgent GENERATE FLATTEN(EXTRACT(userAgent, 'JAVA REGEX?')) AS (term:chararray);

Comments (3)

长安忆 2024-12-24 19:04:12

I don't use Pig, but a look through the docs reveals a REPLACE function which wraps Java's replaceAll() method. Try this:

REPLACE(userAgent, '\(([^;]+; [^;]+)[^)]*\)', '($1)')

That should match the whole parenthesized portion of the UserAgent string and replace its contents with just the first two semicolon-separated terms, just like your sed command does.
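For anyone who wants to check the pattern outside Pig first, here is a minimal plain-Java sketch of the same replacement. The class and method names (UserAgentTrim, trimBracketed) are made up for illustration, and the sample string is a shortened version of the one in the question. The only real change from the sed version is the escaping: in sed's basic syntax `\(...\)` is a group and a bare parenthesis is literal, while in Java it is the other way around, and inside a Java string literal each backslash has to be doubled.

```java
public class UserAgentTrim {

    // Hypothetical helper: keeps only the first two "; "-separated values
    // inside the parentheses. The pattern is the one from the REPLACE call
    // above, written as a Java string literal (each backslash doubled).
    public static String trimBracketed(String userAgent) {
        // \\( and \\) match literal parentheses; the unescaped pair is group 1.
        return userAgent.replaceAll("\\(([^;]+; [^;]+)[^)]*\\)", "($1)");
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6)";
        // prints: Mozilla/4.0 (compatible; MSIE 8.0)
        System.out.println(trimBracketed(ua));
    }
}
```

A string with no matching bracketed part is returned unchanged, because replaceAll only rewrites what the pattern matches.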

丢了幸福的猪 2024-12-24 19:04:12

In Java, if you use the Matcher class, you can extract the capturing group. The following appears to do what you want, at least for the test case you provided.

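import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        String str = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)";
        // Group 1 captures everything up to, but not including, the second
        // semicolon, so the result does not end in a stray ";".
        Pattern pat = Pattern.compile("(.*\\(.*?;.*?);.*\\)");
        Matcher m = pat.matcher(str);
        // group(1) is only valid after a successful match, so check lookingAt() first.
        if (m.lookingAt()) {
            System.out.println(m.group(1) + ")");
        } else {
            // No bracketed part with at least three values: print the input unchanged.
            System.out.println(str);
        }
    }
}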

Hmm... I seem to have answered the wrong question, since you asked how to do this from Pig, not straight Java.

德意的啸 2024-12-24 19:04:12

As neither of the two suggested solutions seems to work in Pig, I will post a workaround that runs sed through STREAM:

user_agent_mangled = STREAM logs THROUGH `sed 's/(\\([^;]\\+; [^;]\\+\\)[^)]*)/(\\1)/'`;

This works well, but I would still prefer a native Pig solution (using the EXTRACT or REPLACE function).
