'𠂉' is not a valid Unicode character, but is in the Unicode character set?
Short story: I can't get an entity like '𠂉' to store in a MySQL database, either by using a text field in a Ruby on Rails app (with default UTF-8 encoding) or by inputting it directly with a MySQL GUI app.
As far as I can tell, all Chinese characters and radicals can be entered into the database without problem, but not these rarely typed 'character components'. The character mentioned above is Unicode U+20089, HTML entity &#x20089;.
I can get it to display on the page by entering <html>&#x20089;</html> and removing HTML escaping, but I would like to store it simply as the Unicode character and keep the HTML escaping in place. There are many other Chinese 'components' (parts of full characters, generally consisting of 2 or 3 strokes) that cause the same problem.
According to this page, the character mentioned is in the UTF-8 charset: http://www.fileformat.info/info/unicode/char/20089/charset_support.htm
But on the neighboring '...20089/index.htm' page, there's an alert saying it's not a valid unicode character.
For reference, that entity can be found in Mac OS X by searching through the character palette (international menu, "Show Character Palette"), searching by radical, and looking under the '丿' radical.
Apologies if this is too open-ended... can a character like this be stored in a UTF-8-based database? How is this character both supported and unsupported, both present in the character set and not valid?
Comments (4)
Which version of MySQL are you using? If it's before 5.5, you can't store that character, because it takes four bytes and MySQL only supports UTF-8 sequences of up to three bytes (i.e., characters in the BMP). MySQL 5.5 added support for four-byte UTF-8, but you have to specify utf8mb4 as the character set.
Ref: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html
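To see which characters in a piece of text fall outside the BMP (and so would need utf8mb4 in MySQL), a rough check along these lines may help. This is only a sketch: the function name `chars_outside_bmp` and the sample string are made up for illustration.

```python
def chars_outside_bmp(text):
    """Return the characters in text whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Hypothetical sample: a BMP radical (U+4E3F) plus the problem character.
sample = "\u4e3f and \U00020089"

for ch in chars_outside_bmp(sample):
    size = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} needs {size} bytes in UTF-8")
```

Only U+20089 is reported here; the radical U+4E3F sits inside the BMP and fits in three bytes.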
U+20089 is a defined character in the Unicode set (CJK Unified Ideographs Extension B) and becomes the byte sequence F0 A0 82 89 when encoded as UTF-8. The problem is probably not with the character, but with character handling by the software somewhere in your stack.
In the unlikely event that there is an inherent technical reason for this being a problem character, it is likely to be covered in the Unicode standard or in the FAQs.
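The byte sequence quoted above can be reproduced by hand from the four-byte UTF-8 bit layout. The helper below is an illustrative sketch (the name `encode_utf8_4byte` is not from any library) and it is checked against Python's built-in encoder.

```python
def encode_utf8_4byte(code_point):
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point outside the BMP")
    return bytes([
        0xF0 | (code_point >> 18),           # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),  # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


encoded = encode_utf8_4byte(0x20089)
print(" ".join(f"{b:02X}" for b in encoded))  # F0 A0 82 89

# The hand-rolled result matches the built-in UTF-8 codec.
assert encoded == "\U00020089".encode("utf-8")
```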
What if you double-encode it and store that?
Encode it once more before storing it, then decode it once on retrieval and render it as HTML.
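One possible reading of "double encode" is to turn non-ASCII characters into numeric character references before storing and to reverse that on retrieval. The sketch below assumes that reading; the function name is hypothetical, and `html.unescape` will also decode any other entities already present in the text, so this is a workaround rather than a clean fix.

```python
import html


def to_ascii_with_char_refs(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


stored = to_ascii_with_char_refs("\U00020089")
print(stored)  # &#131209;

# On retrieval, decode the reference back into the original character.
restored = html.unescape(stored)
assert restored == "\U00020089"
```

The stored value is plain ASCII, so it fits in a three-byte-UTF-8 column, at the cost of the database no longer holding the actual character.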
I can't answer the question of it being listed as both supported and unsupported; that's probably a question for the people running the fileformat.info site.
UTF-8 can be used to represent any Unicode character (code point). This is true of all of the UTFs. The number of bytes required to do so varies (in UTF-8, you need four for the code point you identified, for instance, whereas you only need one for the Roman letter 'A'), but all Unicode characters can be represented by all UTFs. That's what they're for. (More here.)
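A quick way to see the point about varying byte counts is to encode both characters in several UTFs and confirm each one round-trips. This is a small illustrative sketch; `byte_lengths` is a made-up helper name.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def byte_lengths(ch):
    """Map each UTF to the number of bytes it uses for the character ch."""
    lengths = {}
    for encoding in ENCODINGS:
        data = ch.encode(encoding)
        # Every UTF can represent the character and decode it back intact.
        assert data.decode(encoding) == ch
        lengths[encoding] = len(data)
    return lengths


for ch in ("A", "\U00020089"):
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

For 'A' the sizes are 1, 2 and 4 bytes; for U+20089 all three encodings use 4 bytes (UTF-16 needs a surrogate pair).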
It sounds as though you're running into an encoding problem at one (or more) of the layers in your app. The first place to look would be the page served by your app: does it say what charset it's using? It may be worth checking the headers being returned for your pages to see if they include a Content-Type header that names a charset. If they don't, look for the equivalent meta tag in the HTML itself, though I seem to recall reading that meta isn't a good way to do this. Absent the headers being specific, the default applied will probably be ISO-8859-1 (though some browsers may use Windows-1252 instead), which won't work if your source text is encoded with UTF-8.
The next place to look is your database. I don't think MySQL stores text in UTF-8 by default; you'll need to ensure that it's doing that in your MySQL configuration.
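As a rough illustration of the header check and of what the ISO-8859-1 fallback does to this character, the sketch below parses a Content-Type value with the standard library and then deliberately misreads the UTF-8 bytes. The header values are made-up examples and `declared_charset` is a hypothetical helper.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(declared_charset("text/html; charset=utf-8"))  # utf-8
print(declared_charset("text/html"))                 # None

# Without a declared charset, a client may fall back to ISO-8859-1 and
# read the four UTF-8 bytes of U+20089 as four separate characters.
raw = "\U00020089".encode("utf-8")
misread = raw.decode("iso-8859-1")
print(len(misread))  # 4
```

The misread text is still recoverable (re-encode as ISO-8859-1, decode as UTF-8), which is often a useful test for spotting where in the stack the wrong charset was applied.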
From your question, I don't think you need it, but I'll finish with the obligatory plug for the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (if only to save someone from plugging it in a comment). :-)