Java: reading a file yields a leading BOM
I am reading a file of keywords line by line and ran into a strange problem.
When consecutive lines have the same content, I want them to be handled only once. For example, given
sony
sony
only the first line should be processed.
The problem is that Java does not treat the two lines as equal:
INFO: [, s, o, n, y]
INFO: [s, o, n, y]
My code looks like the following. Where is the problem?
// FileReader decodes the file with the platform default charset
FileReader fileReader = new FileReader("some_file.txt");
BufferedReader bufferedReader = new BufferedReader(fileReader);
String prevLine = "";
String strLine;
while ((strLine = bufferedReader.readLine()) != null) {
    // Log every character of the line, including invisible ones
    logger.info(Arrays.toString(strLine.toCharArray()));
    // Skip a line whose content equals the previous line
    if (strLine.contentEquals(prevLine)) {
        logger.info("Skipping the duplicate lines " + strLine);
        continue;
    }
    prevLine = strLine;
}
Update:
It looks as if there is a leading space in the first line, but there is not, and the trim approach doesn't work for me. The first line and a line with a real leading space are not the same:
INFO: [, s, o, n, y]
INFO: [ , s, o, n, y]
I don't know what the first char added by Java is.
Solved: the problem was solved with BalusC's solution. Thanks for pointing out that it is a BOM problem, which helped me find the solution quickly.
Comments (7)
What is the encoding of the file?
The unseen character at the start of the file could be the Byte Order Mark.
Saving the file as ANSI or as UTF-8 without BOM can help confirm this.
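Another way to confirm the suspicion, without re-saving the file, is to look at its first three bytes: a UTF-8 BOM is the byte sequence EF BB BF. The following is only a minimal sketch; the class and method names (BomCheck, startsWithUtf8Bom, fileStartsWithUtf8Bom) are made up for illustration, and it assumes Java 9 or later for readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original question.
final class BomCheck {
    // True when the array starts with EF BB BF, the UTF-8 encoding of U+FEFF.
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    // Reads at most three bytes from the file and checks them for the BOM.
    static boolean fileStartsWithUtf8Bom(Path path) {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = new byte[3];
            int n = in.readNBytes(head, 0, 3); // Java 9+
            return n == 3 && startsWithUtf8Bom(head);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the file in the question this could be called as BomCheck.fileStartsWithUtf8Bom(Paths.get("some_file.txt")).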
The Byte Order Mark (BOM) is a Unicode character. You will get characters like ï»¿ (the bytes EF BB BF read as ISO-8859-1) at the start of a text stream, because BOM use is optional and, if used, it should appear at the start of the text stream.
We can resolve this by explicitly specifying the charset as UTF-8 to InputStreamReader. In UTF-8, the byte sequence EF BB BF then decodes to one character, U+FEFF, which can be removed.
Using CharMatcher from Google Guava, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents).
Full example to read data from the CSV file to a JSON object:
json2Bson.csv
File data.
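As a rough illustration of the idea in this answer (decode as UTF-8, then drop a leading U+FEFF), here is a sketch that uses only the JDK instead of Guava. The class and method names are hypothetical, and the input is an in-memory byte array so the example is self-contained; with a real file the InputStream would come from a FileInputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the answerer's original code.
final class BomAwareReader {
    // Removes U+FEFF when it is the first character of the string.
    static String stripLeadingBom(String line) {
        return (!line.isEmpty() && line.charAt(0) == '\uFEFF') ? line.substring(1) : line;
    }

    // Reads lines as UTF-8 and keeps only the first of each run of equal lines.
    static List<String> readWithoutConsecutiveDuplicates(InputStream in) {
        List<String> kept = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String prevLine = null;
            String line;
            boolean firstLine = true;
            while ((line = reader.readLine()) != null) {
                if (firstLine) {
                    line = stripLeadingBom(line); // a BOM can only sit on the first line
                    firstLine = false;
                }
                if (line.equals(prevLine)) {
                    continue; // duplicate of the previous line
                }
                kept.add(line);
                prevLine = line;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        byte[] data = "\uFEFFsony\nsony\nnokia\n".getBytes(StandardCharsets.UTF_8);
        // prints [sony, nokia]
        System.out.println(readWithoutConsecutiveDuplicates(new ByteArrayInputStream(data)));
    }
}
```

Stripping only the first character of the first line leaves the rest of the text untouched, which seems safer than removing every non-printable character when the BOM is the only problem.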
Try trimming whitespace at the beginning and end of lines read. Just replace your while with:
I had a similar case in a previous project. The culprit was the byte order mark, which I had to get rid of. Eventually I implemented a hack based on this example. Check it out; you might have the same problem.
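The example this answer refers to is not reproduced here. One common way to get rid of the mark at the byte level, which may or may not match that example, is to wrap the stream in a java.io.PushbackInputStream, peek at the first three bytes, and push them back unless they are EF BB BF. A minimal sketch, with hypothetical names (BomSkipper, skipUtf8Bom, firstLine):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a sketch of one possible approach.
final class BomSkipper {
    // Returns a stream positioned after a UTF-8 BOM if one is present.
    static InputStream skipUtf8Bom(InputStream in) {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        try {
            int n = 0;
            while (n < 3) { // a single read may return fewer than three bytes
                int r = pushback.read(head, n, 3 - n);
                if (r < 0) {
                    break; // end of stream
                }
                n += r;
            }
            boolean isBom = n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!isBom && n > 0) {
                pushback.unread(head, 0, n); // not a BOM: give the bytes back
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pushback;
    }

    // Usage example: first line of UTF-8 text, with any BOM skipped.
    static String firstLine(byte[] raw) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(raw)), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Compared with stripping U+FEFF after decoding, this keeps the reading loop unchanged: the reader simply never sees the mark.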
There must be a space or some non-printable character at the start. So either fix that, or trim the Strings during or before the comparison.
[Edited]
In case String.trim() is of no avail, try String.replaceAll() with a proper regex. Try this: str.replaceAll("\\p{Cntrl}", "").
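It is worth checking both suggestions against an actual BOM before relying on them. As far as the JDK documentation goes, String.trim() only removes characters up to U+0020, and \p{Cntrl} in java.util.regex is the POSIX ASCII control class, so neither should touch U+FEFF, which belongs to the Unicode "format" category (Cf). The sketch below compares them with two patterns that do match it; the class and method names are made up for illustration.

```java
// Hypothetical demo class comparing ways of cleaning a BOM-prefixed line.
final class InvisiblePrefixDemo {
    static String viaTrim(String s) {
        return s.trim(); // strips only leading/trailing chars <= U+0020
    }

    static String viaCntrl(String s) {
        return s.replaceAll("\\p{Cntrl}", ""); // ASCII control characters only
    }

    static String viaLeadingBom(String s) {
        return s.replaceFirst("^\uFEFF", ""); // exactly one BOM at the start
    }

    static String viaFormatCategory(String s) {
        return s.replaceAll("\\p{Cf}", ""); // every Unicode format character
    }

    public static void main(String[] args) {
        String line = "\uFEFFsony";
        System.out.println(viaTrim(line).length());           // 5
        System.out.println(viaCntrl(line).length());          // 5
        System.out.println(viaLeadingBom(line).length());     // 4
        System.out.println(viaFormatCategory(line).length()); // 4
    }
}
```

Note that \p{Cf} also removes other format characters, such as zero-width joiners, so the anchored ^\uFEFF pattern is the narrower choice when only the BOM is unwanted.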
If spaces are not important in the processing, it would probably be worth calling strLine.trim() each time anyway. This is what I generally do when handling input like this: spaces can easily creep into a file if it has to be edited manually, and if they're not important they can and should be ignored.
Edit: is the file encoded as UTF-8? You may need to specify the encoding when you open the file. If it's happening on the first line, it could be the byte order mark or something like that.
Try:
Open the file in a text editor, navigate to File > Save As... and choose UTF-8 encoding, instead of UTF-8 with BOM.