Java pattern with tab characters

Posted on 2024-12-17 13:30:27

I have a file with lines like:

string1 (tab) string2 (tab) string3 (tab) string4

I want to get string3 from every line. All I know about the lines is that string3 sits between the second and the third tab character.
Is it possible to extract it with a pattern like this?

Pattern pat = Pattern.compile(".\t.\t.\t.");

Comments (3)

人心善变 2024-12-24 13:30:27
String string3 = tempValue.split("\\t")[2];
樱花落人离去 2024-12-24 13:30:27

It sounds like you just want:

for (String line : lines) {
    String[] bits = line.split("\t");
    if (bits.length != 4) {
        // Handle appropriately, probably throwing an exception
        // or at least logging and then ignoring the line (using a continue
        // statement)
    }
    String third = bits[2];
    // Use...
}

(You can write "\\t" instead, so that the regex engine rather than the Java compiler interprets backslash-t as a tab, but you don't have to. The above works fine.)
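To make the point about escaping concrete, here is a minimal sketch comparing the two spellings. The class and method names (`TabEscapeDemo`, `splitLiteralTab`, `splitRegexEscape`) are made up for illustration; only `String.split` comes from the answer.

```java
import java.util.Arrays;

class TabEscapeDemo {
    // The Java compiler turns "\t" into a real tab character,
    // so the regex engine sees a literal tab.
    static String[] splitLiteralTab(String line) {
        return line.split("\t");
    }

    // "\\t" reaches the regex engine as backslash-t,
    // which the engine itself interprets as a tab.
    static String[] splitRegexEscape(String line) {
        return line.split("\\t");
    }

    public static void main(String[] args) {
        String line = "string1\tstring2\tstring3\tstring4";
        // Both spellings split on the same character.
        System.out.println(Arrays.equals(splitLiteralTab(line), splitRegexEscape(line)));
        System.out.println(splitLiteralTab(line)[2]);
    }
}
```

Both variants should yield the same four fields, so the choice is a matter of taste.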

Another alternative to the built-in String.split method using a regex is the Guava Splitter class. Probably not necessary here, but worth being aware of.

EDIT: As noted in comments, if you're going to repeatedly use the same pattern, it's more efficient to compile a single Pattern and use Pattern.split:

private static final Pattern TAB_SPLITTER = Pattern.compile("\t");

...

String[] bits = TAB_SPLITTER.split(line);
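
One thing to keep in mind with the `bits.length != 4` check: by default both `String.split` and `Pattern.split` drop trailing empty strings, so a line whose fourth field is empty yields only three elements. The sketch below, assuming such lines can occur in your file, shows how a negative limit keeps them. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class TrailingFieldDemo {
    private static final Pattern TAB_SPLITTER = Pattern.compile("\t");

    // Default behaviour: trailing empty strings are removed.
    static int fieldCount(String line) {
        return TAB_SPLITTER.split(line).length;
    }

    // A negative limit keeps trailing empty strings in the result.
    static int fieldCountKeepingEmpty(String line) {
        return TAB_SPLITTER.split(line, -1).length;
    }

    public static void main(String[] args) {
        String line = "a\tb\tc\t"; // the fourth field is empty
        System.out.println(fieldCount(line));             // prints 3
        System.out.println(fieldCountKeepingEmpty(line)); // prints 4
    }
}
```

Whether an empty last field should count as valid depends on your data, so treat this as an option rather than a required change.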
暮光沉寂 2024-12-24 13:30:27

If you want a regex which captures the third field only and nothing else, you could use the following:

String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
Pattern pattern = Pattern.compile(regex);

Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
  System.err.println(matcher.group(1));
}

I don't know whether this would perform any better than split("\\t") for parsing a large file.
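
For comparison, the pattern proposed in the question uses `.`, which matches exactly one character, so with `matches()` it only accepts lines whose fields are each a single character long. The following sketch contrasts the two patterns; the class and method names are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuestionPatternDemo {
    // The pattern from the question: one character, tab, one character, ...
    private static final Pattern QUESTION = Pattern.compile(".\t.\t.\t.");

    // The pattern from this answer: any run of non-tab characters per field.
    private static final Pattern FIELDS =
            Pattern.compile("(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)");

    static boolean questionPatternMatches(String line) {
        return QUESTION.matcher(line).matches();
    }

    // Returns the third field, or null if the line does not have four fields.
    static String thirdField(String line) {
        Matcher m = FIELDS.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "string1\tstring2\tstring3\tstring4";
        System.out.println(questionPatternMatches(line));         // prints false
        System.out.println(questionPatternMatches("a\tb\tc\td")); // prints true
        System.out.println(thirdField(line));                     // prints string3
    }
}
```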

UPDATE

I was curious to see how the simple split versus the more explicit regex would perform, so I tested three different parser implementations.

/** Simple split parser */
static class SplitParser implements Parser {
    public String parse(String line) {
        String[] fields = line.split("\\t");
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Split parser, but with compiled pattern */
static class CompiledSplitParser implements Parser {
    private static final String regex = "\\t";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        String[] fields = pattern.split(line);
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Regex group parser */
static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        Matcher m = pattern.matcher(line);
        if (m.matches()) {
            return m.group(1);
        }
        return null;
    }
}

I ran each parser ten times against the same million-line file. Here are the average results:

  • split: 2768.8 ms
  • compiled split: 1041.5 ms
  • group regex: 1015.5 ms

The clear conclusion is that it is important to compile your pattern, rather than rely on String.split, if you are going to use it repeatedly.

The comparison between compiled split and group regex is inconclusive based on this testing, and the regex could probably be tweaked further for performance.
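
The `Parser` interface and the timing loop are not shown above. If you want to try a similar comparison yourself, a rough sketch could look like the following. Everything here is an assumption for illustration: the interface shape, the synthetic input, and the helper names are not the harness that produced the numbers above, and a naive loop like this is sensitive to JIT warm-up and JDK version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParserTimingSketch {
    /** Assumed shape of the Parser interface used by the classes above. */
    interface Parser {
        String parse(String line);
    }

    private static final Pattern TAB = Pattern.compile("\\t");

    /** Third field via String.split, or null if there are not four fields. */
    static String thirdFieldSplit(String line) {
        String[] fields = line.split("\\t");
        return fields.length == 4 ? fields[2] : null;
    }

    /** Third field via a precompiled Pattern, or null if not four fields. */
    static String thirdFieldCompiled(String line) {
        String[] fields = TAB.split(line);
        return fields.length == 4 ? fields[2] : null;
    }

    /** Counts the lines the parser accepts; using the result keeps the work live. */
    static int countParsed(Parser parser, List<String> lines) {
        int parsed = 0;
        for (String line : lines) {
            if (parser.parse(line) != null) {
                parsed++;
            }
        }
        return parsed;
    }

    /** Wall-clock time in milliseconds for one pass over the lines. */
    static long timeMillis(Parser parser, List<String> lines) {
        long start = System.nanoTime();
        countParsed(parser, lines);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Synthetic input standing in for the million-line file.
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            lines.add("string" + i + "\tfoo\tbar" + i + "\tbaz");
        }
        System.out.println("split: " + timeMillis(ParserTimingSketch::thirdFieldSplit, lines) + " ms");
        System.out.println("compiled split: " + timeMillis(ParserTimingSketch::thirdFieldCompiled, lines) + " ms");
    }
}
```

For anything beyond a quick sanity check, a dedicated benchmarking tool would give more trustworthy numbers than a hand-rolled loop.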

UPDATE

A further simple optimization is to re-use the Matcher rather than create one per loop iteration.

static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    // Matcher is not thread-safe...
    private Matcher matcher = pattern.matcher("");

    // ... so this method is no longer thread-safe
    public String parse(String line) {
        matcher = matcher.reset(line);
        if (matcher.matches()) {
            return matcher.group(1);
        }
        return null;
    }
}
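
If the parser has to be shared between threads, one possible way to keep the reuse is to hold one Matcher per thread. This is only a sketch of that idea, not something I benchmarked: the class name is hypothetical, and the regex is the same as above with the redundant non-capturing groups dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadLocalRegexParser {
    // Same fields-separated-by-tabs pattern, capturing only the third field.
    private static final Pattern PATTERN =
            Pattern.compile("[^\\t]*\\t[^\\t]*\\t([^\\t]*)\\t[^\\t]*");

    // Each thread lazily gets its own Matcher, so no Matcher is ever shared.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    // Returns the third field, or null if the line does not have four fields.
    static String parse(String line) {
        Matcher m = MATCHER.get().reset(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("string1\tstring2\tstring3\tstring4"));
    }
}
```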