How does this website fix the encoding?

Posted 09-01 07:36 · 544 words · 15 views · 0 comments

I am trying to turn this text:

××•×•×™×¨. ×”×¢×ª×™×“ ×©×œ ×¨×©×ª×•×ª ×—×‘×¨×ª×™×•×ª ×•×”×ª×§×©×•×¨×ª ×©×œ× ×•

Into this text:

אוויר. העתיד של רשתות חברתיות והתקשורת שלנו

Somehow, this website:

http://www.pixiesoft.com/flip/

Can do it, and I would like to know how I might be able to do it myself (with whatever programming language or software)

Just saving the file as UTF8 won't do it.

My motivation for this question is that I have a friend's exported XML file containing the garbled text, which I want to turn into a corrected Hebrew text file.

The XML export was originally garbled by MySQL imports and exports, but I don't have the information needed to fix it or trace the problem back.

Thanks.

有深☉意 2024-09-08 07:36:50

Since the problem is a MySQL fault that produced double-encoded UTF-8 strings, MySQL is the right tool to fix it.

Running the following commands will solve it:

mysqldump $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET --add-drop-table --default-character-set=latin1 > export.sql - latin1 is used here to force MySQL not to split the characters; it should not be used otherwise.
cp export{,.utf8}.sql - make a backup copy.
sed -i -e 's/latin1/utf8/g' export.utf8.sql - replace latin1 with utf8 in the file, so that it is imported as UTF-8 instead of ISO-8859-1.
mysql $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET < export.utf8.sql - import everything back into the database.

This will solve the problem in about ten minutes.
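
To make "double-encoded UTF-8" concrete, here is a minimal Java sketch of how such a fault can arise. It assumes the mislabelled step behaved like windows-1252 (the answer only names latin1, so that is an assumption), and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DoubleEncodingDemo {
    // Assumption: the mislabelled step decoded bytes as windows-1252.
    static final Charset WRONG = Charset.forName("windows-1252");

    // Simulates the fault: valid UTF-8 bytes are decoded with the wrong charset.
    static String garble(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), WRONG);
    }

    // Size in bytes once the garbled text is stored as UTF-8 again.
    static int doubleEncodedSize(String text) {
        return garble(text).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String hebrew = "\u05E9\u05DC"; // the word from the question: shin, lamed
        System.out.println(hebrew.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(garble(hebrew).length());                        // 4
        System.out.println(doubleEncodedSize(hebrew));                      // 8
    }
}
```

Each Hebrew letter is two bytes in UTF-8, each of those bytes is read as one Latin character, and each of those characters takes two bytes when the text is written out as UTF-8 again.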

微凉徒眸意 2024-09-08 07:36:50

If you look closely at the gibberish, you can tell that each Hebrew character is encoded as 2 characters - it appears that של is encoded as ×©×œ.

This suggests that multi-byte text (UTF8 or UTF16) is being read as a single-byte encoding. Converting to UTF8 will not help, because the wrongly decoded characters are simply written out again as they are.

You can read each pair of bytes and reconstruct the original UTF8 from them.

Here is some C# I came up with - this is very simplistic (doesn't fully work - too many assumptions), but I could see some of the characters converted properly:

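// Needs: using System.Text;
private string ToProperHebrew(string gibberish)
{
   // Encoding.Unicode is UTF-16LE: each char becomes two bytes, low byte first.
   byte[] orig = Encoding.Unicode.GetBytes(gibberish);
   byte[] heb = new byte[orig.Length / 2];

   // Keep only the low byte of each char. This matches the original byte
   // only for characters below U+0100, so other characters come out wrong.
   for (int i = 0; i < orig.Length / 2; i++)
   {
     heb[i] = orig[i * 2];
   }

   return Encoding.UTF8.GetString(heb);
}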

It appears that each byte was re-encoded as two bytes - I'm not sure what encoding was used for this, but discarding one byte seemed to be the right thing for most doubled-up characters.
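
A likely reason the low-byte trick only works for some characters: assuming the gibberish came from decoding UTF-8 bytes as windows-1252, the characters that windows-1252 places in the 0x80-0x9F range (such as œ, U+0153) have code points above U+00FF, so their low UTF-16 byte is not the original byte. A small Java sketch comparing the two approaches on the ×©×œ example (helper names are hypothetical):

```java
import java.nio.charset.Charset;

final class LowByteVsCp1252 {
    // What the C# loop keeps: the low 8 bits of each UTF-16 code unit.
    static byte[] lowBytes(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) (s.charAt(i) & 0xFF);
        }
        return out;
    }

    // Assumption: the original bytes were mis-decoded as windows-1252,
    // so encoding with that charset gives them back.
    static byte[] cp1252Bytes(String s) {
        return s.getBytes(Charset.forName("windows-1252"));
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String garbled = "\u00D7\u00A9\u00D7\u0153"; // the four gibberish characters for shin, lamed
        System.out.println(hex(lowBytes(garbled)));    // D7 A9 D7 53
        System.out.println(hex(cp1252Bytes(garbled))); // D7 A9 D7 9C
    }
}
```

The first three bytes agree; only the last one differs, which fits the observation that some characters converted properly and others did not.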

放血 2024-09-08 07:36:50

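You might want to have a look here - the accepted answer to that question shows a way to guess the encoding of a byte[]. Then all you have to do is get the right bytes out of the gibberish.
Of course, guessing can always fail...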
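
The linked answer is not reproduced here, so the following is only a rough Java sketch of one way to guess: try each candidate charset, and accept it only if the gibberish encodes without loss and the resulting bytes decode as strict UTF-8. The class, the method names and the candidate list are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class EncodingGuess {
    // Tries to undo "UTF-8 decoded as `wrong`"; empty if either step fails.
    static Optional<String> tryRepair(String garbled, Charset wrong) {
        try {
            ByteBuffer raw = wrong.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(garbled));
            CharBuffer text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(raw);
            return Optional.of(text.toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    // Returns the name of the first candidate charset that round-trips cleanly.
    static Optional<String> guess(String garbled, List<String> candidates) {
        for (String name : candidates) {
            if (tryRepair(garbled, Charset.forName(name)).isPresent()) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String garbled = "\u00D7\u00A9\u00D7\u0153"; // gibberish for shin, lamed
        System.out.println(guess(garbled, Arrays.asList("ISO-8859-1", "windows-1252")));
        // prints Optional[windows-1252]
    }
}
```

As the answer warns, this can still fail: plain ASCII text passes for every candidate, and a short sample may happen to round-trip under the wrong charset.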

情独悲 2024-09-08 07:36:50

Based on Oded's and Teddy's answers, I came up with this method, which worked for me:

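// Needs: import java.nio.charset.Charset; import java.nio.charset.StandardCharsets;
public String getProperHebrew(String gibberish) {
    // Re-encode as windows-1252 to recover the bytes that were decoded with the wrong charset.
    byte[] orig = gibberish.getBytes(Charset.forName("windows-1252"));

    // Those bytes are the original UTF-8; the Charset overload throws no checked exception.
    return new String(orig, StandardCharsets.UTF_8);
}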
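
For the exported XML file from the question, the same conversion could be applied to the whole file. This is a minimal sketch, assuming the file is stored as UTF-8 and was garbled through windows-1252; the class name and the command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class FixExportFile {
    // Same idea as getProperHebrew: windows-1252 bytes reinterpreted as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(Charset.forName("windows-1252"));
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Reads the garbled file as UTF-8, repairs the text, writes it back as UTF-8.
    static void repairFile(Path in, Path out) throws IOException {
        String garbled = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, repair(garbled).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FixExportFile garbled.xml fixed.xml
        repairFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

One caveat: windows-1252 assigns no character to the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D, and alef is D7 90 in UTF-8. If the garbling step dropped or replaced those bytes, the affected letters cannot be rebuilt from the text alone, and getBytes silently substitutes '?' for anything it cannot encode, so it is worth checking the output by eye.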
