Java file parsing toolkit design: quick file-encoding sanity check
(Disclaimer: I looked at a number of posts on here before asking, I found this one particularly helpful, I was just looking for a bit of a sanity check from you folks if possible)
Hi All,
I have an internal Java product that I have built for processing data files for loading into a database (AKA an ETL tool). I have pre-rolled stages for XSLT transformation and for doing things like pattern replacement within the original file. The input files can be of any format: they may be flat data files or XML data files, and you configure the stages you require for the particular data feed being loaded.
I have up until now ignored the issue of file encoding (a mistake, I know), because all was working fine (in the main). However, I am now coming up against file encoding issues. To cut a long story short, because of the way stages can be configured together, I need to detect the file encoding of the input file and create a Java Reader object with the appropriate arguments. I just wanted to do a quick sanity check with you folks before I dive into something I can't claim to fully comprehend:
- Adopt a standard file encoding of UTF-16 (I'm not ruling out loading double-byte characters in the future) for all files that are output from every stage within my toolkit
- Use JUniversalChardet or jchardet to sniff the input file encoding (see the sketch after this list)
- Use the Apache Commons IO library to create a standard reader and writer for all stages (am I right in thinking this doesn't have a similar encoding-sniffing API?)
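Concretely, the sniff-then-read step I have in mind would look something like the following rough sketch (the UniversalDetector loop follows juniversalchardet's usual usage pattern; the buffer size and the windows-1252 fallback are just placeholders for illustration):

```java
import org.mozilla.universalchardet.UniversalDetector;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SniffingReaderFactory {

    /** Sniff the charset of a file with juniversalchardet, falling back to windows-1252. */
    static Charset sniffCharset(Path file) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[4096];
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            int read;
            while ((read = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, read);
            }
        }
        detector.dataEnd();
        String detected = detector.getDetectedCharset();
        // Fallback mirrors what the platform default has effectively been so far.
        return detected != null ? Charset.forName(detected) : Charset.forName("windows-1252");
    }

    /** Open a Reader using the sniffed charset instead of the platform default. */
    static BufferedReader openReader(Path file) throws IOException {
        Charset cs = sniffCharset(file);
        return new BufferedReader(new InputStreamReader(new FileInputStream(file.toFile()), cs));
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = openReader(Paths.get(args[0]))) {
            reader.lines().forEach(System.out::println);
        }
    }
}
```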
Do you see any pitfalls/have any extra wisdom to offer in my outlined approach?
Is there any way I can be confident of backwards compatibility with any data loaded using my existing approach of letting the Java runtime decide the encoding (windows-1252)?
Thanks in advance,
-James
Comments (2)
With flat character data files, any encoding detection will need to rely on statistics and heuristics (like the presence of a BOM, or character/pattern frequency) because there are byte sequences that will be legal in more than one encoding, but map to different characters.
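As a rough illustration, a BOM check is about the simplest heuristic you can hand-roll; it only covers the common UTF BOMs, and anything without one (which includes most windows-1252 output) has to fall through to statistics or metadata:

```java
import java.io.FileInputStream;
import java.io.IOException;

public class BomSniffer {

    /**
     * Returns the charset implied by a leading byte order mark, or null if
     * the file starts with no recognisable BOM.
     */
    static String charsetFromBom(String path) throws IOException {
        byte[] head = new byte[4];
        int n;
        try (FileInputStream in = new FileInputStream(path)) {
            n = in.read(head);
        }
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        // Note: a UTF-32LE BOM also starts FF FE, so this simple check can't tell them apart.
        if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: fall back to statistics, metadata, or a configured default
    }
}
```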
XML encoding detection should be more straightforward, but it is certainly possible to create ambiguously encoded XML (e.g. by leaving out the encoding in the header).
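For the XML case, a StAX parser can at least report what the prolog claims; a small sketch (it returns null when the declaration omits the encoding, which is exactly the ambiguous case):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;
import java.io.IOException;

public class XmlEncodingPeek {

    /** Returns the encoding named in the XML declaration, or null if it was omitted. */
    static String declaredEncoding(String path) throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(path)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                // getCharacterEncodingScheme() reports what the prolog claims;
                // getEncoding() would report what the parser actually auto-detected.
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        }
    }
}
```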
It may make more sense to use encoding detection APIs to indicate the probability of error to the user rather than rely on them as decision makers.
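As far as I know, juniversalchardet only hands back a charset name, so if you want a probability to show the user, a library such as ICU4J (not one of the two mentioned above) exposes a confidence score per candidate; a sketch of what surfacing that might look like:

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ConfidenceReport {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        // detectAll() returns candidate charsets ordered by confidence (0-100),
        // which can be shown to the user instead of silently picking the top match.
        for (CharsetMatch match : detector.detectAll()) {
            System.out.printf("%-12s confidence %d%%%n", match.getName(), match.getConfidence());
        }
    }
}
```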
When you transform data from `byte`s to `char`s in Java, you are transcoding from encoding X to UTF-16(BE). What gets sent to your database depends on your database, its JDBC driver and how you've configured the column. That probably involves transcoding from UTF-16 to something else. Assuming you're not altering the database, existing character data should be safe; you might run into issues if you intend parsing BLOBs.

If you've already parsed files written in disparate encodings, but treated them as another encoding, the corruption has already taken place - there are no silver bullets to fix that. If you need to alter the character set of a database from "ANSI" to Unicode, that might get painful. Adoption of Unicode wherever possible is a good idea. It may not be possible, but prefer file formats where you can make encoding unambiguous - things like XML (which makes it easy) or JSON (which mandates UTF-8).
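To make the transcoding point concrete with a toy example (the byte value is only for illustration): the same byte decoded under two charsets gives two different chars, and once the wrong one has been written out the original byte is unrecoverable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    public static void main(String[] args) {
        // "é" encoded as windows-1252 is the single byte 0xE9.
        byte[] bytes = {(byte) 0xE9};

        // Decoding with the right charset gives the intended char (UTF-16 internally).
        String right = new String(bytes, Charset.forName("windows-1252"));   // "é"

        // Decoding the same bytes as UTF-8 silently substitutes U+FFFD;
        // write that anywhere and the original byte is gone.
        String wrong = new String(bytes, StandardCharsets.UTF_8);            // "\uFFFD"

        System.out.println(right + " vs " + wrong);
    }
}
```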
Option 1 strikes me as breaking backwards compatibility (certainly in the long run), although it is the "right way" to go (the right-way option generally does break backwards compatibility). Perhaps also give some thought to whether UTF-8 would be a good choice.
Sniffing the encoding strikes me as reasonable if you have a limited, known set of encodings, and you have tested that your sniffer correctly distinguishes and identifies them.
Another option here is to use some form of metadata (a file naming convention, if nothing more robust is an option) that lets your code know that the data was provided as UTF-16 and behave accordingly; otherwise, convert it to UTF-16 before moving forward.
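A rough sketch of that normalisation step, assuming the source charset has already been worked out (by sniffing or by the naming convention above); the file names and buffer size are placeholders:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NormaliseToUtf16 {

    /** Rewrite a file into the toolkit's standard UTF-16 encoding. */
    static void normalise(Path source, Charset sourceCharset, Path target) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_16)) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Source charset here comes from sniffing or a naming convention, e.g. "feed.win1252.csv".
        normalise(Paths.get("feed.csv"), Charset.forName("windows-1252"), Paths.get("feed.utf16.csv"));
    }
}
```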