Best way to store text with an undetermined code page in a MySQL database
I am currently writing an application (App1) which retrieves portions of text remotely from another application (let's call it App2).
There are several instances of App2 around the world, and they all interpret their strings according to their local system code page. App2 is not Unicode-aware.
App1 retrieves the text from App2 without any hint as to the text's code page, but it is expected that at a later point, a manual process will be undertaken to select the code page that correctly interprets the text.
Previous attempts to automatically determine the code page of the text have failed.
In the meantime, pending the manual determination, this data must be stored in a MySQL database.
What is the best way to store this data? Specifically, what CHARSET and COLLATION would be best employed here?
I believe that MySQL will not tolerate inserting characters into a field if they are not valid for the field's charset.
It would be ideal if I could detect the code page and convert the data to Unicode before inserting it into the database, but I am at a loss as to how this can be done consistently and reliably.
Comments (2)
If you really do not know the character set, then you can only store it as binary data. That will preserve all the contents (nothing gets mangled). When it comes to using it as text, you will have to guess the encoding.
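As a rough illustration of the binary approach, the sketch below stores the raw bytes untouched and decodes them only once a code page has been chosen. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server; in MySQL the equivalent column would presumably be a VARBINARY or BLOB. The table name `app2_text`, the `code_page` column, and the helper `store_and_fetch` are hypothetical.

```python
import sqlite3


def store_and_fetch(raw: bytes) -> bytes:
    """Store raw bytes in a binary column and read them back unchanged."""
    # sqlite3 is only a stand-in here; the table layout is an assumption.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE app2_text ("
        "id INTEGER PRIMARY KEY, "
        "raw BLOB NOT NULL, "   # bytes exactly as received from App2
        "code_page TEXT)"       # NULL until someone picks it manually
    )
    conn.execute("INSERT INTO app2_text (raw) VALUES (?)", (raw,))
    (stored,) = conn.execute("SELECT raw FROM app2_text").fetchone()
    conn.close()
    return stored


received = b"\xe9t\xe9"             # bytes of unknown code page
stored = store_and_fetch(received)  # nothing is interpreted or converted
text = stored.decode("cp1252")      # done later, once the code page is known
```

Keeping the chosen code page in its own nullable column means the original bytes never need to be rewritten when the manual determination is made.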
The only sane way is for App2 to send along information about which encoding the data is in.
Using that information, you could convert it to Unicode before inserting it into the database. That would be optimal.
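If App2 did supply the encoding, the conversion itself would be short. A minimal sketch, assuming the encoding arrives as a Python codec name such as "cp1252" (the function name `to_utf8` is made up for illustration):

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known code page and re-encode them as UTF-8."""
    # decode() raises UnicodeDecodeError if the bytes are not valid
    # in the stated encoding, which is a useful early warning.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


utf8_bytes = to_utf8(b"\xe9", "cp1252")  # 0xE9 is "é" in Windows-1252
```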
Most multi-byte libraries have functions that guess the encoding by looking at specific tell-tale byte values, but they are terribly unreliable, especially when the incoming data could be in any encoding.
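One reason guessing is so fragile can be shown with a small sketch: the same bytes are often perfectly valid in several single-byte code pages, each yielding different text. The candidate list and the helper name `plausible_decodings` below are arbitrary choices for illustration.

```python
def plausible_decodings(raw: bytes, candidates: list[str]) -> dict[str, str]:
    """Return every candidate encoding that decodes raw without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # invalid in this encoding, so it can be ruled out
    return results


sample = b"\xe9t\xe9"
options = plausible_decodings(sample, ["utf-8", "cp1252", "cp1251", "cp1253"])
# UTF-8 is ruled out, but the three Windows code pages all succeed
# and give three different strings for the same bytes.
```

Byte-level validity can exclude an encoding, but it cannot choose between the ones that remain; that still takes knowledge of the expected language or a human decision.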