Recovering the original UTF-8/Kanji/Chinese text from random/garbage ASCII
I know this may not be possible but wanna give it a shot anyway.
So I have some data that came from HTML form submissions. Users originally typed Kanji into some of the fields, but all I got were random ASCII letters like this:
æŽå°çŽ²
I already fixed the encoding issue (so that new form submissions handle UTF-8 fine), but I would like to see if I can recover the old data (the correct kanji characters) from before the fix.
Thanks for the help.
UPDATE:
I guess a little clarification is needed. As I said, I have already fixed the encoding problem for the HTML form. The actual question is whether one can recover the original kanji from the "garbage" data that I already received.
For example, I'm trying to "reverse-engineer" the following:
ôüÒýR
å¼µå¥éºŸ
冉榆平
·¨¶vÚ¬
Every line is supposed to be someone's name in Kanji or Chinese. I tried all the sensible encodings, such as GBK, GB18030, and Big5-HKSCS. No luck so far.
Last UPDATE:
Having some luck with BIG5 encoding now. It didn't work for all the garbage data, but it worked for about 2/3 of them.
Comments (3)
Use the Character Set Converter online tool.
The input encoding should be UTF-8.
For the output encoding, try out all the sensible encodings for East Asian characters.
Remember to check the 2nd checkbox.
Most, if not all, of the garbage letters should be recovered.
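The same try-every-encoding approach can also be scripted rather than done by hand. The following is a minimal Python sketch, assuming the garbage was produced by the original bytes being misread as Windows-1252 (or Latin-1). The names `to_raw_bytes` and `recover_candidates` and the candidate list are hypothetical and are not part of any tool mentioned in this thread.

```python
# Candidate encodings to try; this list is an assumption, extend it as needed.
CANDIDATES = ["utf-8", "big5", "big5hkscs", "gbk", "gb18030", "shift_jis", "euc_jp"]


def to_raw_bytes(garbled: str) -> bytes:
    """Undo a Windows-1252/Latin-1 misread and return the underlying bytes."""
    out = bytearray()
    for ch in garbled:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Bytes that cp1252 leaves undefined usually survive as C1 control
            # characters, which Latin-1 maps back one-to-one.
            # This raises UnicodeEncodeError if ch is above U+00FF.
            out += ch.encode("latin-1")
    return bytes(out)


def recover_candidates(garbled: str) -> dict[str, str]:
    """Return every candidate encoding under which the raw bytes decode cleanly."""
    raw = to_raw_bytes(garbled)
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    return results
```

Calling `recover_candidates("çŽ²")`, for instance, yields `玲` under the `"utf-8"` key. Several encodings may decode the same bytes without error, so someone still has to read the candidates and pick the one that looks like a real name. Bytes that were dropped entirely when the data was stored cannot be recovered this way.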
Those letters aren't ASCII. No ASCII letters have accents of any kind.
It's unclear how you're reading this data - is it from a file, a database, something else? Anyway, it's possible that it's already in UTF-8 - so you should just try to read it using that encoding. You haven't told us what platform you're using, but you should ensure that whatever you are using, you get to find out what Unicode characters you've read by number - that's a lot more reliable than printing out the values as characters.
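As a rough illustration of inspecting characters by number, here is a small Python sketch (the platform is an assumption, since the question does not name one, and `describe` is a hypothetical helper). It lists each character's code point and Unicode name, which makes invisible control characters and look-alikes visible.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each character as 'U+XXXX NAME'; unnamed ones (e.g. controls) get a placeholder."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]
```

For the sample `çŽ²`, the first entry is `U+00E7 LATIN SMALL LETTER C WITH CEDILLA`, which confirms the text is not ASCII.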
FYI, the Java String class is backed by 2-byte (UTF-16) chars and was designed back when Unicode fit in 16 bits. Common Japanese and Chinese characters, which take 3 bytes in UTF-8, still fit in a single char; only supplementary characters beyond U+FFFF need a surrogate pair and extra care. See http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
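One way to check how much room a given character actually needs is to count its UTF-8 bytes and UTF-16 code units. The sketch below uses Python rather than Java, and `encoded_sizes` is a hypothetical helper; a UTF-16 code unit corresponds to one Java `char`.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 code unit count) for a single character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids the byte-order mark.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units
```

`encoded_sizes("李")` gives `(3, 1)`, while a supplementary character such as U+20000 gives `(4, 2)`.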