How is this MySQL string encoded, and how can I replicate it?
Here are the hex values of two strings stored in a MySQL database using two different methods.
20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5
and
E0A495E0A4BEE0A49AE0A48220E0A4B6E0A495E0A58DE0A4A8E0A58BE0A4AEE0A58DE0A4AFE0A4A4E0A58DE0A4A4E0A581E0A4AEE0A58D20E0A5A420E0A4A8E0A58BE0A4AAE0A4B9E0A4BFE0A4A8E0A4B8E0A58DE0A4A4E0A4BF20E0A4AEE0A4BEE0A4AEE0A58D20E0A5A5
They represent the string काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥. The former appears to be encoded badly, but works in the application, the latter appears encoded correctly but does not. I need to be able to create the first hex string from the input.
Here comes the long version: I've got a legacy application built in PHP/MySQL. The database connection charset is latin1. The charset of the table is utf8 (don't ask). The input is coerced into being correct utf8 via the ForceUTF8 Composer library. Looking directly in the database, the stored value of this string is काचं शकà¥à¤¨à¥‹à¤®à¥à¤¯à¤¤à¥à¤¤à¥à¤®à¥ । नोपहिनसà¥à¤¤à¤¿ मामॠ॥
I am aware that this looks horrendous and appears to me to be badly encoded, however it is out of scope to fix the legacy application. The rest of the application is able to cope with this data as it is and everything else works and displays perfectly well with it.
I have created an external node application to replace the current insert routine running on Azure. I've set the connection charset to latin1; it's connecting to the same database and running the same insert statement. The only part of the puzzle I've not been able to replicate is the ForceUTF8 library, as I could find no equivalent in the npm ecosystem. When the same string is inserted, it renders perfectly when looking at the raw field in PhpStorm, i.e. it looks exactly like the original text above, and the hex value of the string is the latter of the two presented at the top of the question. However, when viewed in the application, the values are corrupted with question marks and black diamonds.
If, within the PHP application, I run SET NAMES utf8 ahead of the rendering data query, then the node-inserted values render correctly, and the legacy ones now display as corrupted. Adding SET NAMES utf8 to the application for this query is not an acceptable solution, since it breaks the appearance of the legacy data, and fixing the legacy data is also not an acceptable solution.
I have tried all sorts of connection charsets and various Iconv functions to make the data exactly match how the legacy app makes it but have not been able to "break it" in exactly the same way.
How can I make "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥" into a string, the hex value of which is "20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5" using some variation of database connection charset and string conversion?
I'm not familiar with PHP, but I was able to generate the "horrendous" encoding with Python (and it is horrendous...not sure how someone intentionally generated this crap). Hopefully this guides you to a solution:
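The original Python snippet isn't preserved in this copy, but the transformation can be sketched as follows. Two assumptions are baked in: the legacy input carried a leading space plus a UTF-8 BOM (the target hex opens with 20 C3AF C2BB C2BF, which is a space followed by a double-encoded BOM), and "latin1" here means MySQL's latin1, i.e. cp1252 with the five bytes cp1252 leaves undefined passed straight through:

```python
# Bytes that cp1252 leaves undefined; MySQL's latin1 maps them to
# the matching U+00xx code points instead of rejecting them.
CP1252_HOLES = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def mysql_latin1_decode(raw: bytes) -> str:
    """Decode bytes the way MySQL's latin1 charset does."""
    return "".join(
        chr(b) if b in CP1252_HOLES else bytes([b]).decode("cp1252")
        for b in raw
    )

text = "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥"
# Assumption: the legacy input had a leading space and a UTF-8 BOM.
src = " \ufeff" + text
# Encode to UTF-8 bytes, misread them as MySQL latin1, then
# re-encode the result as UTF-8: classic double-encoding.
double_encoded = mysql_latin1_decode(src.encode("utf-8")).encode("utf-8")
print(double_encoded.hex().upper())
```

The output matches the first hex string in the question, which suggests the legacy pipeline is doing exactly this byte-level round trip.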
HEX('काचं') = 'E0A495E0A4BEE0A49AE0A482'
-- the utf8mb4 bytes reinterpreted as latin1, then converted to utf8mb4 (double-encoding):
HEX(CONVERT(CONVERT(BINARY('काचं') USING latin1) USING utf8mb4)) = 'C3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A'
which is the utf8mb4 string in double-encoded form. See "double-encoding" in Trouble with UTF-8 characters; what I see is not what I stored.
"Double-encoding", as I understand it, is where utf8 bytes (up to 4 bytes per "character") are treated as latin1 (or cpnnnn) and converted to utf8, and then that happens a second time. In this case, each 3-byte Devanagari is converted twice, leading to between 6 and 9 bytes.
You explained the cause here:
BOM is, in my opinion, a red herring. It was intended to be a useful clue that a "text" file was encoded in UTF-8, but unfortunately, very few products generate it. Hence, BOM is more of a distraction than a help. (I don't think MySQL has any way to take care of BOM -- after all, most database activity is at the row level, not the file level.)
The solution (for the data flow) in MySQL context is to rip out all "conversion" functions and, instead, configure things so that MySQL will convert at the appropriate places. Your mention of "latin1" was the main "mis-configuration".
The long expression (HEX...) gives a clue of how to fix the data, but it must be coordinated with changes to configuration and changes to code.