Splitting tokens with Java StringTokenizer
I have a data set that looks like this:
drawdate lotterynumbers meganumber multiplier
2005-01-04 03 06 07 12 32 30 NULL
2005-01-07 02 08 14 15 51 38 NULL
etc.
and the following code:
public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
It is actually the word count example from the official Apache Hadoop documentation, just slightly customized for my data set.
I get the following error:
Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"
I am just interested in counting the occurrences for each individual drawn lottery number. How can I do this by using the StringTokenizer from my code? I know that I have to split the whole row because the tokenizer is "fed" with the whole. How can I take the lotterynumbers, split them and then count?
Thank you in advance
Comments (2)
The data sample you posted is tab-delimited.
Here's a simple approach, with a few notes:
- Start with a single `line`, including the tab characters separating the data fields, just like you posted.
- Create a `StringTokenizer` with the token separator defined as a single tab character (`\t`).
- Loop on `hasMoreTokens()` until all tokens are seen, printing each one along the way.
- Wrap each printed token in `[]` characters so its boundaries are visible; whitespace inside a token is preserved, and the same goes for any leading whitespace in front of "NULL".
You could take this approach, processing all lines, tokenizing on the tab character, and use the 2nd token as your "lotterynumbers" data to do what you like.
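The steps above can be sketched roughly as follows. This is only an illustration, assuming the fields really are separated by single tab characters; the class name `TabTokenDemo` and the helpers `tabFields` and `lotteryNumbers` are hypothetical names chosen for this example, and the sample line is typed in by hand with explicit `\t` escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; not part of the original post.
class TabTokenDemo {

    // Splits one line on tab characters and returns the fields in order.
    static List<String> tabFields(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line, "\t");
        while (itr.hasMoreTokens()) {
            fields.add(itr.nextToken());
        }
        return fields;
    }

    // Takes the 2nd tab field (assumed to be "lotterynumbers")
    // and splits it again, this time on spaces.
    static List<String> lotteryNumbers(String line) {
        List<String> numbers = new ArrayList<>();
        List<String> fields = tabFields(line);
        if (fields.size() < 2) {
            return numbers; // no lotterynumbers column on this line
        }
        StringTokenizer itr = new StringTokenizer(fields.get(1), " ");
        while (itr.hasMoreTokens()) {
            numbers.add(itr.nextToken());
        }
        return numbers;
    }

    public static void main(String[] args) {
        String line = "2005-01-04\t03 06 07 12 32\t30\tNULL";
        // Print each tab field in brackets to make its boundaries visible.
        for (String field : tabFields(line)) {
            System.out.println("[" + field + "]");
        }
        // Print the individual numbers from the 2nd field.
        System.out.println(lotteryNumbers(line));
    }
}
```

Note that `StringTokenizer` never returns empty tokens, so two adjacent tabs would be treated as one separator; if empty fields can occur in the real file, `String.split("\t", -1)` may be a safer choice.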
First problem - you'll need to remove the header of your file before passing it to MapReduce.
Second - you have no commas in your shown dataset, so `","` should not be given to `StringTokenizer`. Try `"\t"` instead.
Next - not all your tokens are integers, so blindly calling `Integer.valueOf(itr.nextToken())` will not work. The first column is a date. You can call `itr.nextToken()` before the loop to discard the date, but then you need to handle the `NULL` at the end.
Ultimately, the mapper doesn't need to parse anything. You can also count strings in the reducer.
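To make these points concrete without a Hadoop cluster, here is a plain-Java sketch of the same logic. It is not the Hadoop code itself: a `TreeMap` stands in for the shuffle and reducer, and the class name `LotteryCountSketch` and method `countNumbers` are hypothetical. It assumes the file is tab-delimited, that the header line begins with `drawdate`, and that only the `lotterynumbers` column should be counted, so the meganumber and the trailing `NULL` are never parsed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the mapper + reducer pair; not Hadoop code.
class LotteryCountSketch {

    // Counts how often each lottery number appears across all lines.
    static Map<Integer, Integer> countNumbers(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Skip the header row (assumed to start with "drawdate").
            if (line.startsWith("drawdate")) {
                continue;
            }
            StringTokenizer fields = new StringTokenizer(line, "\t");
            // Need at least the date and the lotterynumbers fields.
            if (fields.countTokens() < 2) {
                continue;
            }
            fields.nextToken(); // discard the draw date
            // The 2nd field holds the space-separated lottery numbers.
            StringTokenizer numbers = new StringTokenizer(fields.nextToken(), " ");
            while (numbers.hasMoreTokens()) {
                int number = Integer.parseInt(numbers.nextToken());
                counts.merge(number, 1, Integer::sum); // the "reduce" step
            }
            // Remaining fields (meganumber, multiplier/NULL) are ignored.
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "drawdate\tlotterynumbers\tmeganumber\tmultiplier",
            "2005-01-04\t03 06 07 12 32\t30\tNULL",
            "2005-01-07\t02 08 14 15 51\t38\tNULL");
        countNumbers(lines).forEach((n, c) -> System.out.println(n + "\t" + c));
    }
}
```

In an actual mapper, the body of the inner `while` loop would call `context.write(...)` instead of `counts.merge(...)`, and `lotteryKey` would need to be initialized with `new IntWritable()` before `set` is called on it.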