Converting a text-file-processing application to Unicode
My Win32 Delphi app analyzes text files produced by other applications that do not support Unicode. Thus, my app needs to read and write ANSI strings, but I would like to provide a better-localized user experience through the use of Unicode in the GUI. The app does some pretty heavy character-by-character analysis of strings in objects descended from TList.
In making the transition to a Unicode GUI in going from Delphi 2006 to Delphi 2009, should I plan to:
- go fully Unicode within my app, with the exception of AnsiString file I/O?
- encapsulate the code that handles the AnsiStrings (i.e. continue to handle them as AnsiStrings internally) away from an otherwise Unicode application?
I realize that a truly detailed response would require a substantial amount of my code - I'm just asking for impressions from those who've made this transition and who still have to work with plain text files. Where should the barrier between AnsiStrings and Unicode be placed?
EDIT: if #1, any suggestions for mapping Unicode strings to AnsiString output? I would guess that the conversion of input strings will be automatic using TStringList.LoadFromFile (for example).
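For example, here is a minimal sketch of what I'm imagining, assuming the Delphi 2009 TEncoding overloads behave as documented (file names are placeholders; Classes and SysUtils are assumed to be in the uses clause):

```delphi
procedure ProcessFile;
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // Without an encoding argument, Delphi 2009 assumes the system ANSI
    // code page (TEncoding.Default) unless the file starts with a BOM,
    // so reading the legacy files should be automatic.
    Lines.LoadFromFile('input.txt');

    // ... character-by-character analysis on Unicode strings ...

    // Writing back as ANSI has to be requested explicitly; characters
    // with no mapping in the ANSI code page are replaced (typically '?').
    Lines.SaveToFile('output.txt', TEncoding.Default);
  finally
    Lines.Free;
  end;
end;
```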
Comments (4)
There is no such thing as AnsiString output - every text file has a character encoding. The moment your files contain characters outside the ASCII range you have to think about encoding, as even loading those files in different countries will produce different results - unless you happen to be using a Unicode encoding.
If you load a text file you need to know which encoding it has. For formats like XML or HTML that information is part of the text; for Unicode there is the BOM, although it isn't strictly necessary for UTF-8 encoded files.
Converting an application to Delphi 2009 is a chance to think about the encoding of text files and correct past mistakes. An application's data files often outlive the application itself, so it pays to think about how to make them future-proof and universal. I would suggest UTF-8 as the text file encoding for all new applications; that way, porting an application to a different platform is easy. UTF-8 is the best encoding for data exchange, and for characters in the ASCII or ISO 8859-1 range it also creates much smaller files than UTF-16 or UTF-32.
If your data files contain only ASCII characters you are all set, as they are valid UTF-8 encoded files as well. If your data files are in ISO 8859-1 encoding (or any other fixed encoding), use the matching conversion while loading them into string lists and saving them back. If you don't know in advance what encoding they will have, ask the user upon loading, or provide an application setting for the default encoding.
Use Unicode strings internally. Depending on the amount of data you need to handle, you might use UTF-8 encoded strings.
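As a sketch of the "matching conversion" idea (the code page number and file names are illustrative, and this assumes the Delphi 2009 TEncoding overloads of TStringList):

```delphi
procedure ConvertLatin1ToUtf8(const InFile, OutFile: string);
var
  Lines: TStringList;
  Latin1: TEncoding;
begin
  Lines := TStringList.Create;
  Latin1 := TEncoding.GetEncoding(28591); // code page for ISO 8859-1
  try
    // Decode with the known source encoding, process as Unicode,
    // then re-encode as UTF-8 (SaveToFile writes a BOM by default).
    Lines.LoadFromFile(InFile, Latin1);
    Lines.SaveToFile(OutFile, TEncoding.UTF8);
  finally
    // Instances returned by GetEncoding are owned by the caller;
    // the shared standard encodings (TEncoding.UTF8 etc.) must not be freed.
    Latin1.Free;
    Lines.Free;
  end;
end;
```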
I suggest going fully Unicode if it's worth the effort and a requirement, and keeping the ANSI file I/O separated from the rest. But this depends strongly on your application.
You say:
> The app does some pretty heavy character-by-character analysis of strings in objects descended from TList.
Since Windows runs Unicode natively, you may find your character analysis runs faster if you load the text file internally as Unicode.
On the other hand, if it is a large file, you will also find it takes twice as much memory.
For more about this, see Jan Goyvaert's article: "Speed Benefits of Using the Native Win32 String Type"
So it is a tradeoff you have to decide on.
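The doubling is easy to see from the character sizes alone; a rough console-program sketch (the literal text is arbitrary, and the explicit cast avoids the implicit-conversion warning):

```delphi
program StringSizes;
{$APPTYPE CONSOLE}
uses SysUtils;
var
  U: string;     // UnicodeString in Delphi 2009: SizeOf(Char) = 2
  A: AnsiString; // one byte per character
begin
  U := 'character analysis';
  A := AnsiString(U); // down-converts through the system ANSI code page
  Writeln(Length(U) * SizeOf(Char));     // 36 bytes of character data
  Writeln(Length(A) * SizeOf(AnsiChar)); // 18 bytes for the same text
end.
```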
If you are going to take Unicode input from the GUI, what's the strategy going to be for converting it to ASCII output? (This is an assumption, since you mention writing Ansi text back out, presumably for these non-Unicode applications that you are not going to rewrite and presumably don't have the source code for.) I'd suggest staying with AnsiString throughout the app until these other apps are Unicode-enabled. If the main job of your application is analyzing non-Unicode ASCII-type files, then why switch to Unicode internally? If the main job of your application involves having a better Unicode-enabled GUI, then go Unicode. I don't believe there's enough info presented to decide the proper choice.
If there is no chance of characters that don't translate easily being written back out for these non-Unicode applications, then the suggestion of UTF-8 is likely the way to go. However, if there is a chance, then how are the non-Unicode applications going to handle multi-byte characters? How are you going to convert to (presumably) the basic ASCII character set?
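To make the loss concrete, a small sketch (the sample text is arbitrary; in Delphi 2009 the cast converts through the system ANSI code page):

```delphi
var
  U: string;
  A: AnsiString;
begin
  U := 'résumé for Алексей';  // mixed Latin-1 and Cyrillic text
  A := AnsiString(U);         // lossy conversion via the ANSI code page
  // On a Western (1252) system the accented Latin characters survive,
  // but the Cyrillic ones are silently replaced, typically with '?'.
  // If that loss is unacceptable, the text cannot round-trip through
  // the legacy non-Unicode applications at all.
end;
```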