Unicode-aware CSV parser in Java
I'm looking for a Java implementation of a CSV (comma-separated values) parser that handles Unicode data properly, e.g. UTF-8 CSV files containing Chinese text. I suppose such a parser should use code-point-related methods internally when iterating, comparing, etc.
An Apache 2 license or similar would work best.
Comments (3)
I don't believe in reinventing the wheel, so I don't want to write my own parser and go through the same headaches someone else already did.
Personally, I like the CSV parser from Ostermiller. They also have a Maven repository, if you're interested.
You can also check out OpenCSV. There is already a Stack Overflow question about parsing Unicode.
Have you tried Commons CSV?
It's pretty easy to write yourself. Open the file with a FileInputStream wrapped in an InputStreamReader that uses UTF-8. Wrap that in a BufferedReader, which you can iterate over with readLine() to get each line as a String. Then use regular expressions to split each line into fields.
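A minimal sketch of the reading step might look like the following. The class name Utf8Lines and its methods are hypothetical helpers made up for illustration; only the java.io and java.nio.charset classes are real. It assumes the file really is UTF-8 and that no quoted field contains a line break, since readLine() would cut such a field in two.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
final class Utf8Lines {

    /** Decodes the stream as UTF-8 and returns its lines without line terminators. */
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers working with in-memory streams stay simple.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Convenience overload that opens the file with a FileInputStream. */
    static List<String> readLines(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            return readLines(in);
        }
    }
}
```

Naming the charset explicitly matters here: an InputStreamReader created without one falls back to the platform default, which may not be UTF-8.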
The only tricky part is constructing the regexes so that they don't treat commas enclosed in quotes as field delimiters.
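One common way to build such a regex, sketched here as an assumption rather than what the answer's author actually used, is a lookahead that accepts a comma only when an even number of double quotes follows it on the line. RegexCsvSplitter and its methods are hypothetical names. The sketch assumes double-quote quoting with "" as the escape and well-formed lines with balanced quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class RegexCsvSplitter {

    // A comma is a delimiter only if the rest of the line holds an even number
    // of double quotes, which means the comma sits outside any quoted field.
    private static final Pattern DELIMITER =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    /** Splits one CSV line into fields, removing surrounding quotes. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        // Limit -1 keeps trailing empty fields such as the last one in "a,b,".
        for (String raw : DELIMITER.split(line, -1)) {
            fields.add(unquote(raw));
        }
        return fields;
    }

    /** Strips one pair of enclosing quotes and collapses "" into ". */
    private static String unquote(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1).replace("\"\"", "\"");
        }
        return field;
    }
}
```

The lookahead rescans the remainder of the line for every comma, so the cost grows roughly quadratically with line length. That is usually fine for ordinary rows but worth knowing about for very wide ones.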
The approach above is a bit inefficient, but fast enough for most apps. If you have real performance requirements, you'll need something that iterates through the characters. I wrote one a few years ago using a state machine, and it worked OK.
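The state-machine parser mentioned above isn't shown, so the following is only a rough sketch of what a character-by-character approach could look like, not a reconstruction of that code. The class, enum and method names are all hypothetical. It handles a single line, assumes comma as the delimiter and "" as the quote escape, and is lenient about malformed input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
final class StateMachineCsvParser {

    private enum State { FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED }

    /** Parses one CSV line in a single pass over its chars. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.FIELD_START;

        // ',' and '"' are ASCII, while UTF-16 surrogates lie in U+D800..U+DFFF,
        // so comparing single chars never mistakes half of a supplementary
        // character for a delimiter; both halves are appended unchanged.
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case FIELD_START:
                    if (c == '"') {
                        state = State.QUOTED;
                    } else if (c == ',') {
                        fields.add("");                 // empty field
                    } else {
                        current.append(c);
                        state = State.UNQUOTED;
                    }
                    break;
                case UNQUOTED:
                    if (c == ',') {
                        fields.add(current.toString());
                        current.setLength(0);
                        state = State.FIELD_START;
                    } else {
                        current.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '"') {
                        state = State.QUOTE_IN_QUOTED;  // closing quote or first half of ""
                    } else {
                        current.append(c);              // commas are plain data here
                    }
                    break;
                case QUOTE_IN_QUOTED:
                    if (c == '"') {
                        current.append('"');            // "" is an escaped quote
                        state = State.QUOTED;
                    } else if (c == ',') {
                        fields.add(current.toString());
                        current.setLength(0);
                        state = State.FIELD_START;
                    } else {
                        current.append(c);              // lenient: text after a closing quote
                        state = State.UNQUOTED;
                    }
                    break;
            }
        }
        fields.add(current.toString());                 // the last field has no trailing comma
        return fields;
    }
}
```

Because each char is visited once, the work is linear in the line length. Supporting quoted fields that span several lines would mean feeding the machine from a Reader instead of from the result of readLine().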