Splitting tokens with Java StringTokenizer

Posted on 2025-02-11 21:51:20

I have a data set that looks like this:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL
etc.

and the following code:

public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result);
        }
    }
}

It is essentially the word count example from the official Apache Hadoop documentation, slightly customized for my data set.

I get the following error:

Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"

I am just interested in counting the occurrences for each individual drawn lottery number. How can I do this by using the StringTokenizer from my code? I know that I have to split the whole row because the tokenizer is "fed" with the whole. How can I take the lotterynumbers, split them and then count?

Thank you in advance


Comments (2)

感情旳空白 2025-02-18 21:51:20

I am just interested in counting the occurrences for each individual drawn lottery number. How can I do this by using the StringTokenizer from my code? I know that I have to split the whole row because the tokenizer is "fed" with the whole. How can I take the lotterynumbers, split them and then count?

The data sample you posted is tab-delimited:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL

Here's a simple example, and a few notes:

  • This uses the first line of your sample data as line, including the tab characters separating the data fields, just like you posted.
  • It uses a StringTokenizer with the token separator defined as a single tab character (\t).
  • The program calls hasMoreTokens() until all tokens are seen, printing each one along the way.
  • The output includes left and right brackets to show the boundary of each token. For example, the "30" has a trailing space character that wouldn't be noticeable without the [] characters; the same goes for the leading whitespace in front of "NULL".
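// Tabs are written as \t escapes so they survive copy and paste;
// the spaces around them are part of the sample data.
String line = "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL";
StringTokenizer tokenizer = new StringTokenizer(line, "\t");

// Print every tab-separated token, bracketed to make whitespace visible.
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    System.out.println("token: [" + token + "]");
}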

Here's the output:

token: [2005-01-04 ]
token: [03 06 07 12 32 ]
token: [30 ]
token: [          NULL]

You could take this approach, processing all lines, tokenizing on tab character, and use the 2nd token as your "lotterynumbers" data to do what you like.
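
To make that last step concrete, here is a minimal sketch in plain Java (no Hadoop classes). It assumes the fields are tab-separated, the numbers inside the second field are space-separated, and the header row has already been removed. The class name LotteryNumberCount and the method countNumbers are made up for illustration.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LotteryNumberCount {

    // Hypothetical helper: counts how often each number appears in the
    // second tab-separated field ("lotterynumbers") of every line.
    static Map<Integer, Integer> countNumbers(String[] lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer fields = new StringTokenizer(line, "\t");
            if (fields.countTokens() < 2) {
                continue;                          // not enough fields, skip the line
            }
            fields.nextToken();                    // discard the drawdate field
            String numbers = fields.nextToken();   // the lotterynumbers field

            // Second tokenizer with the default delimiters (whitespace).
            StringTokenizer nums = new StringTokenizer(numbers);
            while (nums.hasMoreTokens()) {
                int n = Integer.parseInt(nums.nextToken());
                counts.merge(n, 1, Integer::sum);  // add 1 to the running count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "2005-01-04\t03 06 07 12 32\t30\tNULL",
            "2005-01-07\t02 08 14 15 51\t38\tNULL"
        };
        System.out.println(countNumbers(lines));
    }
}
```

The meganumber and multiplier fields are never read, so neither "30" nor "NULL" reaches Integer.parseInt. In the mapper, the inner loop would call context.write instead of updating a map.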

同尘 2025-02-18 21:51:20

First problem - you'll need to remove the header of your file before passing to MapReduce.

Second - there are no commas in the dataset you showed, so "," should not be given to StringTokenizer. Try "\t" instead.

Next - Not all your tokens are Integers, so blindly calling Integer.valueOf(itr.nextToken()) will not work. The first column is a date. You can call itr.nextToken() before the loop to discard the date, but then you need to handle the NULL at the end.

Ultimately, the mapper doesn't need to parse anything. You can also count strings in the reducer.
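
A rough plain-Java sketch of these points, with no Hadoop classes involved: the "map" step skips the header, splits on tabs, and emits the number strings as they are, and the "reduce" step sums per key. The names StringKeyCount, mapLine and reduce are hypothetical, and detecting the header by its "drawdate" prefix is an assumption based on the sample data. In a real job the key type would presumably become Text rather than IntWritable, which is also an assumption and not something shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class StringKeyCount {

    // "Map" step: emit the raw number strings of one line, without Integer parsing.
    static List<String> mapLine(String line) {
        List<String> emitted = new ArrayList<>();
        if (line.startsWith("drawdate")) {
            return emitted;                       // header row: emit nothing
        }
        StringTokenizer fields = new StringTokenizer(line, "\t");
        if (fields.countTokens() < 2) {
            return emitted;                       // malformed or empty line
        }
        fields.nextToken();                       // discard the date column
        StringTokenizer numbers = new StringTokenizer(fields.nextToken(), " ");
        while (numbers.hasMoreTokens()) {
            emitted.add(numbers.nextToken());     // e.g. "03", kept as a string
        }
        return emitted;
    }

    // "Reduce" step: sum the occurrences per string key.
    static Map<String, Integer> reduce(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        emitted.addAll(mapLine("drawdate\tlotterynumbers\tmeganumber\tmultiplier"));
        emitted.addAll(mapLine("2005-01-04\t03 06 07 12 32\t30\tNULL"));
        emitted.addAll(mapLine("2005-01-07\t02 08 14 15 51\t38\tNULL"));
        System.out.println(reduce(emitted));
    }
}
```

Because only the second field is tokenized, the trailing NULL never needs special handling. Separately, the question's mapper declares lotteryKey without instantiating it, so lotteryKey.set(...) would throw a NullPointerException once parsing succeeds; initializing the field with new IntWritable() avoids that.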
