Character corruption going from BufferedReader to BufferedWriter in Java

Posted 2024-09-16 00:01:14

In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.

I encounter a known problem when text contains a left facing quotation mark. Text such as

mutations to particular “hotspot” regions

becomes

 mutations to particular “hotspot�? regions

I have isolated the problem by writing a simple text copy method:

public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null)
        {
            sb = new StringBuffer();
            //Parsing would happen
            sb.append(line);
            output.write(sb.toString() + NullSpace);
        }
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}

Can anybody offer some advice as to how to correct this problem?

★My solution

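InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
// Read through the buffered wrapper (the original read from the unbuffered reader).
while ((r = buffer.read()) != -1)
{
    if (r < 126)
    {
        // Plain ASCII is written through unchanged.
        output.write(r);
    }
    else
    {
        // Everything else becomes a numeric HTML entity, so the output stays ASCII.
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();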
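One limitation of escaping one `char` at a time is that characters outside the Basic Multilingual Plane arrive as two surrogate halves, and each half would be escaped separately. The following is a minimal sketch of the same idea working on whole code points instead. The class name `EntityEscaper` and the method `escapeNonAscii` are hypothetical names for illustration, and the cutoff of 128 (all of ASCII) is an assumption that differs from the 126 used above.

```java
public class EntityEscaper {

    // Replaces every non-ASCII code point with a decimal numeric HTML entity.
    // ASCII characters, including HTML-special ones such as '<', pass through unchanged.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.append((char) cp);
            } else {
                // cp is a full code point, so surrogate pairs yield one entity
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("mutations to particular \u201Chotspot\u201D regions"));
    }
}
```

Because the result contains only ASCII, it survives being written with any ASCII-compatible charset, which is why this workaround sidesteps the encoding of the output file.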

Comments (3)

白昼 2024-09-23 00:01:14

The file being read is not in the same encoding (probably UTF-8) as the file being written (probably ISO-8859-1).

Try the following to generate a file with UTF-8 encoding:

BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));

Unfortunately, determining the encoding of a file is very difficult. See Java : How to determine the correct charset encoding of a stream
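As a rough sketch of that advice applied to both ends of the copy, the example below names the charset explicitly for the reader and the writer. It assumes the input really is UTF-8, which the question does not confirm. `CharsetCopy` and `copy` are hypothetical names, and the temporary files in `main` exist only to make the example self-contained.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetCopy {

    // Copies text from source to target, decoding and encoding with the
    // charsets passed in instead of the platform default.
    static void copy(Path source, Charset sourceCharset,
                     Path target, Charset targetCharset) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter out = Files.newBufferedWriter(target, targetCharset)) {
            char[] chunk = new char[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                // Parsing would happen here
                out.write(chunk, 0, n);
            }
        } // try-with-resources flushes and closes both files
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("copy-in", ".html");
        Path target = Files.createTempFile("copy-out", ".html");
        String sample = "mutations to particular \u201Chotspot\u201D regions";
        Files.write(source, sample.getBytes(StandardCharsets.UTF_8));

        copy(source, StandardCharsets.UTF_8, target, StandardCharsets.UTF_8);

        String copied = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(copied.equals(sample));
    }
}
```

A reader from `Files.newBufferedReader` reports malformed input with an `IOException` rather than silently substituting a replacement character, so a wrong guess about the input charset tends to fail loudly.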

别再吹冷风 2024-09-23 00:01:14

In addition to what Thierry-Dimitri Roy wrote: if you know the encoding, creating your reader takes a bit of extra work, because FileReader does not let you specify it. From the docs:

Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

杯别 2024-09-23 00:01:14

The Javadoc for FileReader says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:

FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);
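
To see why the wrong default can damage only the closing quote, the sketch below pushes the UTF-8 bytes of each quote through a decode and re-encode in another charset, which is roughly what a FileReader/FileWriter pair does. It assumes the platform default was windows-1252; the question does not say so, but it is one plausible explanation for the symptom. `MojibakeDemo`, `roundTrip`, and `hex` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes text as UTF-8, then decodes and re-encodes those bytes with
    // another charset, mimicking a reader and writer that share a wrong default.
    static byte[] roundTrip(String text, Charset assumedDefault) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, assumedDefault);
        return misread.getBytes(assumedDefault);
    }

    // Formats bytes as space-separated hex pairs for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumption: the asker's platform default charset was windows-1252.
        Charset cp1252 = Charset.forName("windows-1252");
        for (String quote : new String[] {"\u201C", "\u201D"}) {
            byte[] before = quote.getBytes(StandardCharsets.UTF_8);
            byte[] after = roundTrip(quote, cp1252);
            System.out.println(hex(before) + " -> " + hex(after));
        }
    }
}
```

The closing quote is encoded in UTF-8 as E2 80 9D, and byte 0x9D has no assignment in windows-1252, so it cannot survive the round trip. The opening quote ends in 0x9C, which does map to a character and back. A lossless single-byte charset such as ISO-8859-1 would pass every byte through unchanged on both sides, so it would not produce this particular symptom.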