Splitting tokens with Java StringTokenizer

Posted 2025-02-11 21:51:20

I have a data set that looks like this:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL
etc.

and the following code:

public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result);
        }
    }
}

It is essentially the word count example from the official Apache Hadoop documentation, slightly customized for my data set.

I get the following error:

Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"

I am just interested in counting the occurrences for each individual drawn lottery number. How can I do this by using the StringTokenizer from my code? I know that I have to split the whole row because the tokenizer is "fed" with the whole. How can I take the lotterynumbers, split them and then count?

Thank you in advance

感情旳空白 2025-02-18 21:51:20

I am just interested in counting the occurrences for each individual drawn lottery number. How can I do this by using the StringTokenizer from my code? I know that I have to split the whole row because the tokenizer is "fed" with the whole. How can I take the lotterynumbers, split them and then count?

The data sample you posted is tab-delimited:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL

Here's a simple example, and a few notes:

This uses the first data row of your sample as the variable line, including the tab characters separating the data fields, just as you posted it.
It uses a StringTokenizer with the token separator defined as a single tab character (\t).
The program calls hasMoreTokens() until all tokens have been seen, printing each one along the way.
The output wraps each token in square brackets to show its boundaries. For example, the "30" has a trailing space character that wouldn't be noticeable without the [] characters, and the same goes for the leading whitespace in front of "NULL".

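import java.util.StringTokenizer;

// Tabs are written as \t escapes here so the field boundaries stay visible.
String line = "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL";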
StringTokenizer tokenizer = new StringTokenizer(line, "\t");

while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    System.out.println("token: [" + token + "]");
}

Here's the output:

token: [2005-01-04 ]
token: [03 06 07 12 32 ]
token: [30 ]
token: [          NULL]

You could take this approach: process every line, tokenize on the tab character, and use the 2nd token as your "lotterynumbers" data to do whatever you like with it.
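
To make the last step concrete, here is a minimal sketch in plain Java (no Hadoop) of one way to split the 2nd token again and count each number. The class name LotteryNumberCounter, the method countNumbers, and the rule for skipping the header row are assumptions made for illustration, not part of the original code. It also assumes the fields really are separated by tab characters.

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class, used only to illustrate the two-level tokenizing.
final class LotteryNumberCounter {

    /** Counts how often each drawn number appears in the tab-delimited lines. */
    static Map<Integer, Integer> countNumbers(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // First level: split the row into its tab-separated fields.
            StringTokenizer fields = new StringTokenizer(line, "\t");
            if (fields.countTokens() < 2) {
                continue; // blank or malformed line
            }
            String drawDate = fields.nextToken().trim();
            if (drawDate.equals("drawdate")) {
                continue; // assumption: a row starting with "drawdate" is the header
            }
            // Second level: the 2nd field holds the numbers, e.g. "03 06 07 12 32 ".
            // The default delimiters include the space, so padding is ignored.
            StringTokenizer numbers = new StringTokenizer(fields.nextToken());
            while (numbers.hasMoreTokens()) {
                int number = Integer.parseInt(numbers.nextToken());
                counts.merge(number, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "drawdate\tlotterynumbers\tmeganumber\tmultiplier",
                "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL",
                "2005-01-07 \t02 08 14 15 51 \t38 \t          NULL");
        System.out.println(countNumbers(lines));
    }
}
```

In the mapper from the question, the same two-level tokenizing could presumably replace the "," tokenizer, with context.write(lotteryKey, one) called inside the inner loop. Note that lotteryKey is declared there but never assigned, so it would also need to be initialized (for example with new IntWritable()) before set() is called on it.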
