StreamTokenizer splits 001_to_003 into two tokens; how can I prevent it from doing so?

Posted 2024-10-01 15:40:02

Java's StreamTokenizer seems to be too greedy in identifying numbers. It is relatively light on configuration options, and I haven't found a way to make it do what I want. The following test passes, IMO showing a bug in the implementation; what I'd really like is for the second token to be identified as a word "20001_to_30000". Any ideas?

public void testBrokenTokenizer()
        throws Exception
{
    final String query = "foo_bah 20001_to_30000";

    StreamTokenizer tok = new StreamTokenizer(new StringReader(query));
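    tok.wordChars('_', '_');  // treat '_' as a word character
    // JUnit's assertEquals takes (expected, actual)
    assertEquals(StreamTokenizer.TT_WORD, tok.nextToken());
    assertEquals("foo_bah", tok.sval);
    // digits are still parsed as a number first, so the token is split here
    assertEquals(StreamTokenizer.TT_NUMBER, tok.nextToken());
    assertEquals(20001.0, tok.nval, 0.0);
    assertEquals(StreamTokenizer.TT_WORD, tok.nextToken());
    assertEquals("_to_30000", tok.sval);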
}

FWIW I could use a StringTokenizer instead, but it would require a lot of refactoring.
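
For comparison, here is a minimal sketch of the StringTokenizer alternative mentioned above. It assumes the input is purely whitespace-delimited and that quoting and comment handling are not needed; the class name `StringTokenizerSketch` and the helper `tokens` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class StringTokenizerSketch {
    /** Splits on StringTokenizer's default delimiters (space, tab, newline, CR, form feed). */
    public static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // StringTokenizer does no number parsing, so the second token stays whole.
        System.out.println(tokens("foo_bah 20001_to_30000"));
    }
}
```

Because StringTokenizer never tries to interpret digits, "20001_to_30000" comes back as a single token, but any numeric conversion then has to be done by the caller.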

Comments (1)

荒芜了季节 2024-10-08 15:40:02

IMO, the best solution is using a Scanner, but if you want to force the venerable StreamTokenizer to work for you, try the following:

import java.io.*;          // StreamTokenizer, StringReader, IOException
import java.util.regex.*;  // Pattern, Matcher
...

final String query = "foo_bah 20001_to_30000\n2.001 this is line number 2 blargh";

StreamTokenizer tok = new StreamTokenizer(new StringReader(query));

// recreate standard syntax table
tok.resetSyntax();
tok.whitespaceChars('\u0000', '\u0020');
tok.wordChars('a', 'z');
tok.wordChars('A', 'Z');
tok.wordChars('\u00A0', '\u00FF');
tok.commentChar('/');
tok.quoteChar('\'');
tok.quoteChar('"');
tok.eolIsSignificant(false);
tok.slashSlashComments(false);
tok.slashStarComments(false);
//tok.parseNumbers();  // this WOULD be part of the standard syntax

// syntax additions
tok.wordChars('0', '9');
tok.wordChars('.', '.');
tok.wordChars('_', '_');

// create regex to verify numeric conversion in order to avoid having
// to catch NumberFormatException errors from Double.parseDouble()
Pattern double_regex = Pattern.compile("[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?");

try {
    int type = StreamTokenizer.TT_WORD;

    while (type != StreamTokenizer.TT_EOF) {
        type = tok.nextToken();

        if (type == StreamTokenizer.TT_WORD) {
            String str = tok.sval;
            Matcher regex_match = double_regex.matcher(str);

            if (regex_match.matches()) {  // NUMBER
                double val = Double.parseDouble(str);
                System.out.println("double = " + val);
            }
            else {  // WORD
                System.out.println("string = " + str);
            }
        }
    }
}
catch (IOException err) {
    err.printStackTrace();
}

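Essentially, you're taking the tokenizing of numeric values away from StreamTokenizer: with parseNumbers() left out and digits, '.' and '_' declared as word characters, every token comes back as TT_WORD, and your own code decides which ones are numbers. The regex match is there to avoid relying on NumberFormatException to tell you that Double.parseDouble() doesn't work on the given token.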
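
For the Scanner route mentioned at the top, a minimal sketch could look like the following. It assumes tokens are separated only by whitespace and that you don't need StreamTokenizer's quote or comment handling; the class name `ScannerTokens` and the helper `classify` are made up for illustration. It reuses the same regex rather than Scanner.hasNextDouble(), since the latter depends on the Scanner's locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerTokens {
    // Same numeric pattern as in the StreamTokenizer version above.
    private static final Pattern DOUBLE_REGEX =
            Pattern.compile("[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?");

    /** Splits the input on whitespace and labels each token as a double or a string. */
    public static List<String> classify(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNext()) {
                String token = scanner.next();
                if (DOUBLE_REGEX.matcher(token).matches()) {
                    out.add("double = " + Double.parseDouble(token));
                } else {
                    out.add("string = " + token);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "foo_bah 20001_to_30000\n2.001 this is line number 2 blargh";
        for (String line : classify(query)) {
            System.out.println(line);
        }
    }
}
```

Scanner's default delimiter is whitespace, so "20001_to_30000" is never split; whether that trade-off is acceptable depends on how much of StreamTokenizer's syntax table your input actually relies on.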
