Java 读取文件有一个领先的 BOM [  ]

发布于 2024-11-14 15:18:27 字数 1096 浏览 1 评论 0原文

我正在逐行读取包含关键字的文件,发现一个奇怪的问题。 我希望如果内容相同的相互跟随的行,应该只处理一次。就像

sony
sony

只有第一个正在处理一样。 但问题是,java并没有平等地对待它们。

INFO: [, s, o, n, y]
INFO: [s, o, n, y]

我的代码如下所示,问题出在哪里?

    FileReader fileReader = new FileReader("some_file.txt");
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String prevLine = "";
    String strLine
    while ((strLine = bufferedReader.readLine()) != null) {
        logger.info(Arrays.toString(strLine.toCharArray()));
        if(strLine.contentEquals(prevLine)){
            logger.info("Skipping the duplicate lines " + strLine);
            continue;
        }
        prevLine = strLine;
    }

更新:

第一行似乎有一个空格,但实际上没有,并且 trim 方法对我不起作用。它们不一样:

INFO: [, s, o, n, y]
INFO: [ , s, o, n, y]

我不知道java添加的第一个Char是什么。

已解决:问题已通过 BalusC 的解决方案解决< /a>,感谢您指出这是BOM问题,帮助我快速找到解决方案。

I am reading a file containing keywords line by line and found a strange problem.
I hope lines that following each other if their contents are the same, they should be handled only once. Like

sony
sony

only the first one is getting processed.
but the problems is, java doesn't treat them as equals.

INFO: [, s, o, n, y]
INFO: [s, o, n, y]

My code looks like the following, where's the problem?

    FileReader fileReader = new FileReader("some_file.txt");
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String prevLine = "";
    String strLine
    while ((strLine = bufferedReader.readLine()) != null) {
        logger.info(Arrays.toString(strLine.toCharArray()));
        if(strLine.contentEquals(prevLine)){
            logger.info("Skipping the duplicate lines " + strLine);
            continue;
        }
        prevLine = strLine;
    }

Update:

It seems like there's leading a space in the first line, but actually not, and the trim approach doesn't work for me. They're not the same:

INFO: [, s, o, n, y]
INFO: [ , s, o, n, y]

I don't know what's the first Char added by java.

Solved: the problem was solved with BalusC's solution, thanks for pointing out it's BOM problem which helped me to find out the solution quickly.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

流年里的时光 2024-11-21 15:18:27

文件的编码是什么?

文件开头的看不见的字符可能是

使用 ANSI 或 UTF 保存的 字节顺序标记 8 无 BOM 可以帮助您突出这一点。

What is the encoding of the file?

The unseen char at the start of the file could be the Byte Order Mark

Saving with ANSI or UTF-8 without BOM can help highlight this for you.

一刻暧昧 2024-11-21 15:18:27

字节顺序标记 (BOM) 是一个 Unicode 字符。您将得到类似  出现在文本流的开头,因为 BOM 的使用是可选的,并且如果使用,则应出现在文本流的开头

  • Microsoft 编译器和解释器以及 Microsoft Windows 上的许多软件(例如记事本)将 BOM 视为必需的幻数,而不是使用启发式方法。这些工具在将文本保存为 UTF-8 时添加 BOM,并且除非存在 BOM 或文件仅包含 ASCII,否则无法解释 UTF-8。当将文档转换为纯文本文件以供下载时,Google 文档还会添加 BOM。
File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
// [{"Key2":"21","Key1":"11","Key3":"31"} ]
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

我们可以通过向InputStreamReader显式指定字符集为UTF-8来解决。然后在 UTF-8 中,字节序列  解码为一个字符,即 U+FEFF (?)。

使用 Google Guava jar CharMatcher,您可以删除任何不可打印的字符,然后保留所有 ASCII 字符(删除任何重音符号),例如 这个

String printable = CharMatcher.INVISIBLE.removeFrom( input );
String clean = CharMatcher.ASCII.retainFrom( printable );

CSV文件读取数据到JSON对象的完整示例:

public class CSV_FileOperations {
    static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String,String>>();
    protected static List<JSONObject> jsonArray = new ArrayList<JSONObject >();

    public static void main(String[] args) {
        String csvFilename = "D:/Yashwanth/json2Bson.csv";

        csvToJSONString(csvFilename);
        String jsonData = jsonArray.toString();
        System.out.println("File JSON Data : \n"+ jsonData);
    }

    @SuppressWarnings("deprecation")
    public static String csvToJSONString( String csvFilename ) {
        try {
            File file = new File( csvFilename );
            FileInputStream inputStream = new FileInputStream(file);

            String fileExtensionName = csvFilename.substring(csvFilename.indexOf(".")); // fileName.split(".")[1];
            System.out.println("File Extension : "+ fileExtensionName);

            // [{"Key2":"21","Key1":"11","Key3":"31"} ]
            InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

            BufferedReader buffer = new BufferedReader( inputStreamReader );
            Stream<String> readLines = buffer.lines();
            boolean headerStream = true;

            List<String> headers = new ArrayList<String>();
            for (String line : (Iterable<String>) () -> readLines.iterator()) {
                String[] columns = line.split(",");
                if (headerStream) {
                    System.out.println(" ===== Headers =====");

                    for (String keys : columns) {
                        //  - UTF-8 - ? « https://stackoverflow.com/a/11021401/5081877
                        String printable = CharMatcher.INVISIBLE.removeFrom( keys );
                        String clean = CharMatcher.ASCII.retainFrom(printable);
                        String key = clean.replace("\\P{Print}", "");
                        headers.add( key );
                    }
                    headerStream = false;
                    System.out.println(" ===== ----- Data ----- =====");
                } else {
                    addCSVData(headers, columns );
                }
            }
            inputStreamReader.close();
            buffer.close();


        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
    @SuppressWarnings("unchecked")
    public static void addCSVData( List<String> headers, String[] row ) {
        if( headers.size() == row.length ) {
            HashMap<String,String> mapObj = new HashMap<String,String>();
            JSONObject jsonObj = new JSONObject();
            for (int i = 0; i < row.length; i++) {
                mapObj.put(headers.get(i), row[i]);
                jsonObj.put(headers.get(i), row[i]);
            }
            jsonArray.add(jsonObj);
            listObjects.add(mapObj);
        } else {
            System.out.println("Avoiding the Row Data...");
        }
    }
}

json2Bson.csv 文件数据。

Key1    Key2    Key3
11  21  31
12  22  32
13  23  33

The Byte Order Mark (BOM) is a Unicode character. You will get characters like  at the start of a text stream, because BOM use is optional, and, if used, should appear at the start of the text stream.

  • Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.
File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
// [{"Key2":"21","Key1":"11","Key3":"31"} ]
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

We can resolve by explicitly specifying charset as UTF-8 to InputStreamReader. Then in UTF-8, the byte sequence  decodes to one character, which is U+FEFF (?).

Using Google Guava's jar CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:

String printable = CharMatcher.INVISIBLE.removeFrom( input );
String clean = CharMatcher.ASCII.retainFrom( printable );

Full Example to read data from the CSV file to JSON Object:

public class CSV_FileOperations {
    static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String,String>>();
    protected static List<JSONObject> jsonArray = new ArrayList<JSONObject >();

    public static void main(String[] args) {
        String csvFilename = "D:/Yashwanth/json2Bson.csv";

        csvToJSONString(csvFilename);
        String jsonData = jsonArray.toString();
        System.out.println("File JSON Data : \n"+ jsonData);
    }

    @SuppressWarnings("deprecation")
    public static String csvToJSONString( String csvFilename ) {
        try {
            File file = new File( csvFilename );
            FileInputStream inputStream = new FileInputStream(file);

            String fileExtensionName = csvFilename.substring(csvFilename.indexOf(".")); // fileName.split(".")[1];
            System.out.println("File Extension : "+ fileExtensionName);

            // [{"Key2":"21","Key1":"11","Key3":"31"} ]
            InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

            BufferedReader buffer = new BufferedReader( inputStreamReader );
            Stream<String> readLines = buffer.lines();
            boolean headerStream = true;

            List<String> headers = new ArrayList<String>();
            for (String line : (Iterable<String>) () -> readLines.iterator()) {
                String[] columns = line.split(",");
                if (headerStream) {
                    System.out.println(" ===== Headers =====");

                    for (String keys : columns) {
                        //  - UTF-8 - ? « https://stackoverflow.com/a/11021401/5081877
                        String printable = CharMatcher.INVISIBLE.removeFrom( keys );
                        String clean = CharMatcher.ASCII.retainFrom(printable);
                        String key = clean.replace("\\P{Print}", "");
                        headers.add( key );
                    }
                    headerStream = false;
                    System.out.println(" ===== ----- Data ----- =====");
                } else {
                    addCSVData(headers, columns );
                }
            }
            inputStreamReader.close();
            buffer.close();


        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
    @SuppressWarnings("unchecked")
    public static void addCSVData( List<String> headers, String[] row ) {
        if( headers.size() == row.length ) {
            HashMap<String,String> mapObj = new HashMap<String,String>();
            JSONObject jsonObj = new JSONObject();
            for (int i = 0; i < row.length; i++) {
                mapObj.put(headers.get(i), row[i]);
                jsonObj.put(headers.get(i), row[i]);
            }
            jsonArray.add(jsonObj);
            listObjects.add(mapObj);
        } else {
            System.out.println("Avoiding the Row Data...");
        }
    }
}

json2Bson.csv File data.

Key1    Key2    Key3
11  21  31
12  22  32
13  23  33
暮光沉寂 2024-11-21 15:18:27

尝试修剪读取行的开头和结尾处的空格。只需将 while 替换为:

while ((strLine = bufferedReader.readLine()) != null) {
        strLine = strLine.trim();
        logger.info(Arrays.toString(strLine.toCharArray()));
    if(strLine.contentEquals(prevLine)){
        logger.info("Skipping the duplicate lines " + strLine);
        continue;
    }
    prevLine = strLine;
}

Try trimming whitespace at the beginning and end of lines read. Just replace your while with:

while ((strLine = bufferedReader.readLine()) != null) {
        strLine = strLine.trim();
        logger.info(Arrays.toString(strLine.toCharArray()));
    if(strLine.contentEquals(prevLine)){
        logger.info("Skipping the duplicate lines " + strLine);
        continue;
    }
    prevLine = strLine;
}
这样的小城市 2024-11-21 15:18:27

我之前的项目中也遇到过类似的情况。罪魁祸首是 字节顺序标记,我必须删除它。最终我根据这个示例实现了一个黑客。查一下,也许你也有同样的问题。

I had a similar case in my previous project. The culprit was the Byte order mark, which I had to get rid of. Eventually I implemented a hack based on this example. Check it out, might be that you have the same problem.

流绪微梦 2024-11-21 15:18:27

开头必须有一个空格或一些不可打印的字符。因此,要么解决这个问题,要么在比较期间/之前修剪字符串

[已编辑]

如果String.trim()没有用。尝试使用正确的正则表达式String.replaceAll()。试试这个,str.replaceAll("\\p{Cntrl}", "")

There must be a space or some non-printable character in the start. So, either fix that or trim the Strings during/before comparison.

[Edited]

In case String.trim() is of no avail. Try String.replaceAll() using proper regex. Try this, str.replaceAll("\\p{Cntrl}", "").

忘你却要生生世世 2024-11-21 15:18:27

如果空格在处理中并不重要,那么每次调用 strLine.trim() 可能都是值得的。这就是我在处理这样的输入时通常所做的 - 如果必须手动编辑空格,则空格很容易渗入文件中,并且如果它们不重要,则可以并且应该忽略它们。

编辑:文件编码为 UTF-8 吗?打开文件时您可能需要指定编码。如果它发生在第一行,它可能是字节顺序标记或类似的东西。

尝试:

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"))

If spaces are not important in the processing it would probably be worth doing a strLine.trim() call each time anyway. This is what I generally do when handling input like this - spaces can easily creep into a file if it has to be edited manually and if they're not important they can and should be ignored.

Edit: is the file encoded as UTF-8? You may need to specify the encoding when you open the file. It could be the byte order mark or something like that, if it's happening on the first line.

Try:

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"))
十二 2024-11-21 15:18:27

在文本编辑器中打开文件,导航至文件>另存为...并选择UTF-8编码,而不是UTF-8 with BOM

Open the file in a text editor, navigate to File > Save As... and choose UTF-8 encoding, instead of UTF-8 with BOM.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文