在 Jet Database / Delphi 中转换数据库字段编码

发布于 2024-10-12 04:57:34 字数 297 浏览 8 评论 0原文

我有一个用 Delphi 编写的遗留应用程序，它使用 Jet 数据库作为存储数据的后端，我需要将数据导出为新格式。

使用 MS Access (Windows) 或 MDBViewer (Linux) 打开数据库，在“MEMO”类型（mysql 的 TEXT 等效项）字段中，我只能看到类似于亚洲字符的垃圾。运行旧应用程序时，字段的内容会正确显示。

有没有一种方法可以尝试所有可能的字符编码并将其转换以恢复数据（我对 PHP 和 C# 很满意）？我读到了一些关于 BOM（字节顺序标记）的内容，这可能是相关的，有什么想法吗？

谢谢！

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

风情万种。 2024-10-19 04:57:35

当前的 MS Access 版本使用 UTF-8 来存储字符串值。较旧的只是遵循输入文本的机器的代码页。

大多数编码确实使用一些标记字节来指示后续内容的编码。您是否能从中受益，实际上取决于旧版应用程序。如果这只是遵循单一编码，或者依赖于机器的代码页，那么您将不得不做一些聪明的识别自己的事情。

快速检查

UTF-8

如果有标记，则为 $EFBBBF。
如果没有，当字符串中可以看到 ASCII (0-127) 字符序列时，您可以推断出它是 UTF-8。

UTF-16

有两种风格：Little Endian (LE) 和 Big Endian (BE)。对于基本多语言平面内的字符，每个字符都使用两个字节。两者的区别在于，对于 ASCII 字符，一个以零字节开始，另一个以零字节结束。

如果有标记，则 UTF-16LE 由 $FFFE 指定，UTF-16BE 由 $FEFF 指定。
如果这些标记都不存在，则备注字段中具有交替的零和非零字节是合理的指示。你的第一个选择应该是 UTF-16LE，因为这是 Windows 标准，而 UTF-16BE 的使用要少得多。（抱歉，永远记不清两者中哪一个以 ASCII 字符的零字节开头，哪一个以非零字节开头）。

其他

如果您可以排除 UTF-8 和 UTF-16，您可以尝试确定是否使用了其他 UTF 编码之一。不过我不会花时间，很可能该程序只是依赖于机器的代码页。由于您正在处理大量“亚洲外观”字符，因此最好的选择是检查 MBCS 代码页（多字节字符 S??? 代码页）。请参阅 MSDN 了解更多详细信息。由于我自己从未与他们打过交道，所以恐怕我无法在这里提供更多帮助。

尝试编码

如果您必须开始尝试所有编码，您可能需要查看 DIConvertors 库。它非常擅长在编码之间进行转换。 IIRC 它还可以识别编码，但除此之外它应该可以帮助您开始自己的检测。可以在 http://www.yunqa.de/delphi/ 找到doku.php/products/converters/index

Current MS Access versions use UTF-8 to store string values. Older ones simply followed the code page of the machine on which the text was entered.

Most encodings do indeed use some marker bytes to indicate the encoding of what follows. Whether or not you have the benefit of that, really depends on the legacy app. If that simply followed a single encoding, or relied on the machine's code page, then you'll have to do some clever recognizing yourself.

Quick checks

UTF-8

If there is a marker, it would be $EFBBBF.
If there isn't, you can make an educted guess that it is UTF-8 when sequences of ASCII (0-127) characters can be seen in the string.

UTF-16

Comes in two flavours: Little Endian (LE) and Big Endian (BE). For characters within the Basic Multilingual Plane, both use two bytes per character. The difference between the two is that for ASCII characters, one starts with a zero byte, the other ends with it.

If there is a marker UTF-16LE is designated by $FFFE and UTF-16BE by $FEFF.
If neither of those markers is present having alternating zero and non-zero bytes in the memo field is a fair indication. And your first bet should be UTF-16LE as that is the windows standard and UTF-16BE is used a lot less. (Sorry, can never remember which of the two starts with a zero-byte for ASCII characters and which one starts with a non-zero byte).

Other

If you can exclude UTF-8 and UTF-16, you could try to figure out whether one of the other UTF encodings was used. I wouldn't spend the time though, chances are that the program simply relied on the machine's code page. Seeing as your are dealing with a lot of "asian looking" characters, your best bet would be to check for the MBCS code pages (Multi Byte Character S??? code pages). See MSDN for more details. As I have never dealt with them myself, I'm afraid I can't be of more help here though.

Trying encodings

If you do have to start trying out every encoding there is, you may want to have a look at the DIConvertors library. It's pretty good at converting between encodings. IIRC it can also recognize encodings, but otherwise it should help getting you started with your own detection. It can be found at http://www.yunqa.de/delphi/doku.php/products/converters/index

回复收藏 0 原文

~没有更多了~