Hadoop: garbled Chinese text when parsing

Posted 2017-02-07 10:34:23, 38 characters, 1284 views, 2 comments

My Hadoop job parses links that contain Chinese characters and writes the addresses out in GBK, but the result is garbled. Why?


灵芸 2017-05-16 23:35:05

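Wherever Hadoop deals with character encoding, UTF-8 is hardcoded. If the input file uses another encoding (such as GBK), the text comes out garbled, because the bytes inside a Text are always interpreted as UTF-8. The fix is to convert each Text as you read it in the mapper or reducer by calling transformTextToUTF8(text, "GBK"), so that everything downstream really is UTF-8.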

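import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

// Re-decode the raw bytes of a Text using the file's real encoding
// and return a new Text, which stores the result as UTF-8.
public static Text transformTextToUTF8(Text text, String encoding) {
    try {
        // getBytes() returns the backing array; only the first getLength() bytes are valid.
        String value = new String(text.getBytes(), 0, text.getLength(), encoding);
        return new Text(value);
    } catch (UnsupportedEncodingException e) {
        // Fail loudly rather than passing null to new Text(...), which would throw a NullPointerException.
        throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
    }
}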
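
To see why the re-decoding step matters, here is a small sketch that uses only the JDK, with no Hadoop on the classpath. The class name GbkDecodeDemo and the sample string are made up for illustration, and it assumes your JVM ships the GBK charset (standard JDK builds do). It encodes a Chinese string as GBK, then decodes the same bytes once as UTF-8 and once as GBK, the same way the method above does with text.getBytes() and text.getLength().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GbkDecodeDemo {
    // Same decoding step as transformTextToUTF8, minus the Hadoop Text wrapper:
    // interpret the first `length` bytes of `raw` in the given encoding.
    static String decode(byte[] raw, int length, String encoding) {
        return new String(raw, 0, length, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        // A four-character Chinese sample, written as Unicode escapes so the
        // encoding of this source file does not matter.
        String original = "\u4e2d\u6587\u94fe\u63a5";
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));

        // Pad the array to imitate a backing buffer that is longer than the valid data.
        byte[] backing = Arrays.copyOf(gbkBytes, gbkBytes.length + 4);

        // Decoding GBK bytes as UTF-8 is what produces the garbled text.
        String wrong = new String(gbkBytes, StandardCharsets.UTF_8);
        // Decoding only the valid prefix as GBK recovers the original string.
        String right = decode(backing, gbkBytes.length, "GBK");

        System.out.println("UTF-8 decode matches original: " + wrong.equals(original));
        System.out.println("GBK decode matches original: " + right.equals(original));
    }
}
```

The first line prints false and the second prints true. Passing the length explicitly is the important detail: the array behind a Text can be longer than the valid data, so decoding the whole array would append junk characters.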

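If the output file needs to be GBK, write your own replacement for TextOutputFormat: declare public class GBKFileOutputFormat<K, V> extends FileOutputFormat<K, V>, copy the TextOutputFormat source into it, and change the hardcoded utf-8 encoding to GBK. Finally, in the driver's run method, register it with job.setOutputFormatClass(GBKFileOutputFormat.class);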
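
For reference, the part of the copied class that actually changes is the record writer. Below is a rough stand-in written against plain java.io so it runs without Hadoop. GbkLineWriter and encodeLine are hypothetical names, not Hadoop classes, and the real writer additionally treats NullWritable as null. One thing to watch for, as far as I remember the TextOutputFormat source: its writer copies the raw bytes of a Text straight to the stream, and those bytes are UTF-8, so changing the charset constant alone is probably not enough. The sketch therefore converts every key and value through toString() before encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical stand-in for the line writer inside a GBK output format.
class GbkLineWriter {
    private static final Charset GBK = Charset.forName("GBK");
    private static final byte[] NEWLINE = "\n".getBytes(GBK);

    private final DataOutputStream out;
    private final byte[] separator;

    GbkLineWriter(DataOutputStream out, String separator) {
        this.out = out;
        this.separator = separator.getBytes(GBK);
    }

    // Always go through toString() so the text is re-encoded as GBK,
    // instead of copying bytes that are already UTF-8.
    private void writeObject(Object o) throws IOException {
        out.write(o.toString().getBytes(GBK));
    }

    // Writes "key<separator>value\n"; a null key or value is skipped,
    // and nothing is written when both are null.
    void write(Object key, Object value) throws IOException {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            writeObject(key);
        }
        if (!nullKey && !nullValue) {
            out.write(separator);
        }
        if (!nullValue) {
            writeObject(value);
        }
        out.write(NEWLINE);
    }

    // Convenience helper for trying the writer out in memory.
    static byte[] encodeLine(Object key, Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            new GbkLineWriter(new DataOutputStream(buffer), "\t").write(key, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // One Chinese character as the key: GBK uses 2 bytes for it, UTF-8 would use 3.
        byte[] line = encodeLine("\u4e2d", "1");
        System.out.println("bytes written: " + line.length);
    }
}
```

In the real GBKFileOutputFormat the same logic would sit inside the RecordWriter returned by getRecordWriter, wrapped around the stream Hadoop opens for the task's output file.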
