Java reads a file and gets a leading BOM [ ï»¿ ]

Posted on 2024-11-14 15:18:27

I am reading a file containing keywords line by line and found a strange problem.
I want consecutive lines with identical content to be handled only once. For example, with

sony
sony

only the first one should be processed.
But the problem is that Java doesn't treat them as equal.

INFO: [, s, o, n, y]
INFO: [s, o, n, y]

My code looks like the following. Where's the problem?

    FileReader fileReader = new FileReader("some_file.txt");
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String prevLine = "";
    String strLine;
    while ((strLine = bufferedReader.readLine()) != null) {
        logger.info(Arrays.toString(strLine.toCharArray()));
        if(strLine.contentEquals(prevLine)){
            logger.info("Skipping the duplicate lines " + strLine);
            continue;
        }
        prevLine = strLine;
    }

Update:

It looks like there's a leading space in the first line, but there actually isn't one, and trim() doesn't work for me. They're not the same:

INFO: [, s, o, n, y]
INFO: [ , s, o, n, y]

I don't know what the first char added by Java is.

Solved: the problem was solved with BalusC's solution. Thanks for pointing out that it's a BOM problem, which helped me find the solution quickly.

流年里的时光 2024-11-21 15:18:27

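What is the encoding of the file?

The invisible character at the start of the file could be the byte order mark.

Saving the file as ANSI or as UTF-8 without BOM can help you confirm this.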
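One way to check the encoding question directly is to look at the raw bytes before any decoding happens. The sketch below is only an illustration: the class and method names (`BomCheck`, `startsWithUtf8Bom`, `fileStartsWithUtf8Bom`) are made up for this example, and it only looks for the UTF-8 BOM (bytes EF BB BF), not the UTF-16 or UTF-32 variants.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class BomCheck {
    /** Returns true if the given bytes begin with the UTF-8 BOM (EF BB BF). */
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads at most the first three bytes of a file and checks them for the UTF-8 BOM. */
    static boolean fileStartsWithUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[3];
            int filled = 0;
            // read() may deliver fewer bytes than requested, so loop until 3 bytes or end of file.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            return startsWithUtf8Bom(Arrays.copyOf(head, filled));
        }
    }

    public static void main(String[] args) {
        // In-memory stand-ins for the start of a file saved with and without a BOM.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'o', 'n', 'y'};
        byte[] withoutBom = {'s', 'o', 'n', 'y'};
        System.out.println(startsWithUtf8Bom(withBom));    // true
        System.out.println(startsWithUtf8Bom(withoutBom)); // false
    }
}
```

For a real file, `fileStartsWithUtf8Bom(Paths.get("some_file.txt"))` would perform the same check on disk.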


一刻暧昧 2024-11-21 15:18:27

The Byte Order Mark (BOM) is a Unicode character, U+FEFF. In UTF-8 it is encoded as the three bytes EF BB BF, which show up as ï»¿ at the start of a text stream when those bytes are decoded with a single-byte charset such as ISO-8859-1 or Windows-1252. BOM use is optional and, if used, it should appear at the start of the text stream.

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad, treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.

File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
// [{"Key2":"21","ï»¿Key1":"11","Key3":"31"} ]
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

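We can resolve this by explicitly passing UTF-8 as the charset to InputStreamReader. In UTF-8, the byte sequence EF BB BF (displayed as ï»¿ above) decodes to a single character, U+FEFF. Java's UTF-8 decoder does not drop that character, so it still arrives as the first char of the stream and has to be removed.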
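The difference between the two decodings can be seen without touching a file. The following is a minimal sketch, with a made-up class name (`BomDecodeDemo`) and an in-memory byte array standing in for a file that was saved as UTF-8 with a BOM:

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // The three bytes of a UTF-8 BOM, followed by the text "sony".
    static final byte[] FILE_BYTES = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'o', 'n', 'y'};

    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String utf8 = decodeAsUtf8(FILE_BYTES);
        String latin1 = decodeAsLatin1(FILE_BYTES);

        // UTF-8 turns the three BOM bytes into one char, U+FEFF, which stays in the string.
        System.out.println(utf8.length());                    // 5
        System.out.println(utf8.charAt(0) == '\uFEFF');       // true

        // A single-byte charset maps each BOM byte to its own char: U+00EF, U+00BB, U+00BF.
        System.out.println(latin1.length());                  // 7
        System.out.println(latin1.startsWith("\u00EF\u00BB\u00BF")); // true
    }
}
```

Either way the first line is not equal to a plain "sony", which is why the comparison in the question fails until the extra character is removed.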

Using Google Guava's CharMatcher, you can remove any non-printable characters and then retain only ASCII characters (dropping any accented characters), like this:

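// Newer Guava releases expose these matchers as factory methods instead of the INVISIBLE / ASCII constants.
String printable = CharMatcher.invisible().removeFrom( input );
String clean = CharMatcher.ascii().retainFrom( printable );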

Full example that reads data from a CSV file into JSON objects:

import com.google.common.base.CharMatcher;
import org.json.simple.JSONObject; // assumption: json-simple; the original does not name the JSON library

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class CSV_FileOperations {
    static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String, String>>();
    protected static List<JSONObject> jsonArray = new ArrayList<JSONObject>();

    public static void main(String[] args) {
        String csvFilename = "D:/Yashwanth/json2Bson.csv";

        csvToJSONString(csvFilename);
        String jsonData = jsonArray.toString();
        System.out.println("File JSON Data : \n" + jsonData);
    }

    @SuppressWarnings("deprecation") // kept from the original; some CharMatcher members are deprecated in Guava
    public static String csvToJSONString(String csvFilename) {
        // lastIndexOf so that a dot earlier in the path is not mistaken for the extension separator.
        String fileExtensionName = csvFilename.substring(csvFilename.lastIndexOf("."));
        System.out.println("File Extension : " + fileExtensionName);

        // If the BOM bytes are decoded with a single-byte charset, the first header comes out as "ï»¿Key1":
        // [{"Key2":"21","ï»¿Key1":"11","Key3":"31"} ]
        // try-with-resources closes the reader and the underlying stream, even on failure.
        try (BufferedReader buffer = new BufferedReader(
                new InputStreamReader(new FileInputStream(csvFilename), StandardCharsets.UTF_8))) {
            boolean headerLine = true;
            List<String> headers = new ArrayList<String>();
            String line;
            while ((line = buffer.readLine()) != null) {
                String[] columns = line.split(",");
                if (headerLine) {
                    System.out.println(" ===== Headers =====");
                    for (String keys : columns) {
                        // Strip the BOM (U+FEFF) and other invisible characters from the header names.
                        // See https://stackoverflow.com/a/11021401/5081877
                        String printable = CharMatcher.invisible().removeFrom(keys);
                        String clean = CharMatcher.ascii().retainFrom(printable);
                        // replaceAll takes a regex; replace() would search for the literal text \P{Print}.
                        String key = clean.replaceAll("\\P{Print}", "");
                        headers.add(key);
                    }
                    headerLine = false;
                    System.out.println(" ===== ----- Data ----- =====");
                } else {
                    addCSVData(headers, columns);
                }
            }
        } catch (IOException e) { // also covers FileNotFoundException
            e.printStackTrace();
        }
        // Return the collected rows instead of null so the method name matches its result.
        return jsonArray.toString();
    }

    @SuppressWarnings("unchecked") // json-simple's JSONObject is a raw Map
    public static void addCSVData(List<String> headers, String[] row) {
        if (headers.size() == row.length) {
            HashMap<String, String> mapObj = new HashMap<String, String>();
            JSONObject jsonObj = new JSONObject();
            for (int i = 0; i < row.length; i++) {
                mapObj.put(headers.get(i), row[i]);
                jsonObj.put(headers.get(i), row[i]);
            }
            jsonArray.add(jsonObj);
            listObjects.add(mapObj);
        } else {
            // Rows whose column count differs from the header are skipped.
            System.out.println("Avoiding the Row Data...");
        }
    }
}

json2Bson.csv file data:

Key1,Key2,Key3
11,21,31
12,22,32
13,23,33


暮光沉寂 2024-11-21 15:18:27

Try trimming whitespace at the beginning and end of each line read. Just replace your while loop with:

while ((strLine = bufferedReader.readLine()) != null) {
    strLine = strLine.trim();
    logger.info(Arrays.toString(strLine.toCharArray()));
    if (strLine.contentEquals(prevLine)) {
        logger.info("Skipping the duplicate lines " + strLine);
        continue;
    }
    prevLine = strLine;
}


这样的小城市 2024-11-21 15:18:27

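I ran into a similar case in a previous project. The culprit was the byte order mark, which I had to get rid of. In the end I implemented a hack based on this example. Check it out; you may have the same problem.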


流绪微梦 2024-11-21 15:18:27

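There must be a space or some non-printable character at the very beginning. So either fix that, or trim the string during or before the comparison.

[Edited]

If String.trim() is of no use, try String.replaceAll() with a suitable regex. Try this: str.replaceAll("\\p{Cntrl}", "").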
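For the BOM case in the question specifically, it may help to see which of these approaches actually touches U+FEFF. String.trim() only removes characters up to U+0020, and by default \p{Cntrl} only matches the ASCII control characters, so neither affects the BOM. The sketch below uses a made-up helper, `stripLeadingBom`, as one possible alternative:

```java
class StripBomDemo {
    /** Removes a single leading U+FEFF if present; otherwise returns the line unchanged. */
    static String stripLeadingBom(String line) {
        return line.startsWith("\uFEFF") ? line.substring(1) : line;
    }

    public static void main(String[] args) {
        // What the first line looks like after a BOM-prefixed file is decoded as UTF-8.
        String first = "\uFEFFsony";

        System.out.println(first.trim().length());                       // 5, BOM still there
        System.out.println(first.replaceAll("\\p{Cntrl}", "").length()); // 5, BOM still there
        System.out.println(stripLeadingBom(first).equals("sony"));       // true
    }
}
```

Because the BOM can only appear at the very start of the stream, checking the first line alone is enough.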


忘你却要生生世世 2024-11-21 15:18:27

If spaces are not important in the processing, it would probably be worth calling strLine.trim() each time anyway. This is what I generally do when handling input like this: spaces can easily creep into a file if it has to be edited manually, and if they're not important they can and should be ignored.

Edit: is the file encoded as UTF-8? You may need to specify the encoding when you open the file. If it's happening on the first line, it could be the byte order mark or something like that.

Try:

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
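Putting the explicit charset together with the duplicate check from the question could look roughly like the sketch below. It is not the asker's final code: the class and method names (`DedupLines`, `readDistinctRuns`) are invented here, and an in-memory stream stands in for `new FileInputStream("some_file.txt")`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class DedupLines {
    /** Reads UTF-8 text, drops a leading BOM, and keeps one line per run of identical consecutive lines. */
    static List<String> readDistinctRuns(InputStream in) throws IOException {
        List<String> kept = new ArrayList<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String prevLine = null;
            String line;
            boolean firstLine = true;
            while ((line = reader.readLine()) != null) {
                if (firstLine) {
                    firstLine = false;
                    // The UTF-8 decoder keeps the BOM as U+FEFF, so remove it by hand.
                    if (line.startsWith("\uFEFF")) {
                        line = line.substring(1);
                    }
                }
                if (line.equals(prevLine)) {
                    continue; // same as the previous line: skip it
                }
                kept.add(line);
                prevLine = line;
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Encoding "\uFEFF" as UTF-8 produces the bytes EF BB BF, like a file saved with a BOM.
        byte[] data = "\uFEFFsony\nsony\nsamsung\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readDistinctRuns(new ByteArrayInputStream(data))); // [sony, samsung]
    }
}
```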


十二 2024-11-21 15:18:27

Open the file in a text editor, go to File > Save As... and choose the UTF-8 encoding instead of UTF-8 with BOM.
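If re-saving by hand is not practical, the same result can be approximated in code by dropping the first three bytes when they are the UTF-8 BOM. This is only a sketch under that assumption: the names (`RemoveBomFromFile`, `withoutUtf8Bom`, `rewriteWithoutBom`) are hypothetical, and the whole file is read into memory, so it suits small keyword files rather than large ones.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class RemoveBomFromFile {
    /** Returns the content without a leading UTF-8 BOM (EF BB BF); other content is returned as-is. */
    static byte[] withoutUtf8Bom(byte[] content) {
        boolean hasBom = content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(content, 3, content.length) : content;
    }

    /** Rewrites the file in place, but only if it actually starts with a BOM. */
    static void rewriteWithoutBom(Path file) throws IOException {
        byte[] content = Files.readAllBytes(file);
        byte[] stripped = withoutUtf8Bom(content);
        if (stripped != content) {
            Files.write(file, stripped);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstrated on a temporary file so that no real data is touched.
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        try {
            Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'o', 'n', 'y'});
            rewriteWithoutBom(tmp);
            System.out.println(Files.readAllBytes(tmp).length); // 4
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```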
