Character corruption from BufferedReader to BufferedWriter in Java
In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.
I run into a known problem when the text contains a left-facing quotation mark. Text such as
mutations to particular “hotspot” regions
becomes
mutations to particular “hotspot�? regions
I have isolated the problem by writing a simple text copy method:
public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while((line = input.readLine())!=null)
        {
            sb = new StringBuffer();
            //Parsing would happen
            sb.append(line);
            output.write(sb.toString()+NullSpace);
        }
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}
Can anybody offer some advice on how to correct this problem?
★My solution
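InputStream in = new FileInputStream(myFile);
// Decode the input explicitly as UTF-8 instead of the platform default
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
// Read through the buffered reader so the buffering is actually used
while ((r = buffer.read()) != -1)
{
    if (r < 126)
    {
        output.write(r);
    }
    else
    {
        // Write everything else as a numeric character reference
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();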
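The loop above escapes one UTF-16 `char` at a time, so a character outside the Basic Multilingual Plane (some mathematical Greek letters, for instance) would come out as two surrogate references rather than one. A minimal sketch of the same idea working on whole code points follows. The class and method names are hypothetical, and the cut-off is assumed to be the ASCII range (below 128) rather than the 126 used above.

```java
final class HtmlEscaper {

    /**
     * Returns the text with every non-ASCII code point replaced by a
     * decimal numeric character reference such as &#8220;.
     * (Hypothetical helper, not part of the original solution.)
     */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() yields one int per character, joining surrogate pairs
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp);                    // plain ASCII, copy as-is
            } else {
                out.append("&#").append(cp).append(';');  // e.g. U+201C -> &#8220;
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u201C and \u201D are the curly quotes from the question
        System.out.println(escapeNonAscii("\u201Chotspot\u201D"));
    }
}
```

Because the result is pure ASCII, it can be written with any ASCII-compatible encoding, which is why the default-encoded `FileWriter` no longer corrupts the text.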
Comments (3)
The file read is not in the same encoding (probably UTF-8) as the file written (probably ISO-8859-1).
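To see how such a mismatch can damage only some characters, here is a small sketch that decodes UTF-8 bytes with the wrong charset and encodes them again, which is roughly what `FileReader` followed by `FileWriter` does. It assumes the platform default was windows-1252, which the question does not confirm. The class name is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingMismatchDemo {

    /** Decodes the UTF-8 bytes of text with the wrong charset, then re-encodes them with it. */
    static byte[] roundTrip(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, wrong);  // what a default-encoded reader would see
        return misread.getBytes(wrong);            // what a default-encoded writer would emit
    }

    public static void main(String[] args) {
        // Assumption: the platform default encoding is windows-1252
        Charset cp1252 = Charset.forName("windows-1252");
        for (String quote : new String[] {"\u201C", "\u201D"}) {
            byte[] before = quote.getBytes(StandardCharsets.UTF_8);
            byte[] after = roundTrip(quote, cp1252);
            System.out.println(Arrays.toString(before) + " -> " + Arrays.toString(after));
        }
    }
}
```

Under that assumption the opening quote survives, because all three of its UTF-8 bytes map to windows-1252 characters, while the closing quote does not: its last byte, 0x9D, has no mapping in windows-1252. That would be consistent with the output in the question, where only the closing quote is damaged. A strict ISO-8859-1 round trip, by contrast, maps every byte to a character and back, so it leaves the bytes unchanged.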
Try the following to generate a file with UTF-8 encoding:
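A minimal sketch of writing with an explicit encoding, assuming UTF-8 is what you want in the output file. The class and method names are hypothetical; the point is wrapping a `FileOutputStream` in an `OutputStreamWriter` instead of using `FileWriter`.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8Output {

    /** Writes text to target as UTF-8, regardless of the platform default encoding. */
    static void writeUtf8(File target, String text) throws IOException {
        // try-with-resources flushes and closes the writer even if write() fails
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }
}
```

This only helps if the text was decoded correctly in the first place, so the reader side needs an explicit encoding as well.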
Unfortunately, determining the encoding of a file is very difficult. See "Java: How to determine the correct charset encoding of a stream".
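In addition to what Thierry-Dimitri Roy wrote: if you know the encoding, creating the reader takes a bit of extra work, because the classic FileReader constructors always use the platform default encoding. The docs say to construct an InputStreamReader on a FileInputStream when you need to specify the encoding yourself.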
The Javadoc for FileReader says that its constructors assume the default character encoding is appropriate. In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:
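One way this might look, as a sketch rather than the code from the original answer: a small helper that opens a file with whatever charset the caller names. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class ExplicitCharsetReader {

    /** Opens source for line-by-line reading, decoding its bytes with the given charset. */
    static BufferedReader open(File source, Charset charset) throws IOException {
        // InputStreamReader does the byte-to-char decoding; FileReader would
        // have picked the platform default instead
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(source), charset));
    }
}
```

A call such as `ExplicitCharsetReader.open(myFile, StandardCharsets.UTF_8)` would then replace `new BufferedReader(new FileReader(myFile))` in the question's copy method, assuming the input really is UTF-8.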