How does this website fix the encoding?

Posted 09-01 07:36 · 544 words · 15 views · 0 comments

I am trying to turn this text:

×וויר. העתיד של רשתות חברתיות והתקשורת ×©×œ× ×•

Into this text:

אוויר. העתיד של רשתות חברתיות והתקשורת שלנו

Somehow, this website:

http://www.pixiesoft.com/flip/

can do it, and I would like to know how I might do it myself (with any programming language or software).

Just saving the file as UTF8 won't do it.

My motivation for this question is that I have a friend's exported XML file with garbled text, which I want to turn into a corrected Hebrew text file.

The XML export was originally garbled by MySQL imports and exports, but I don't have the information needed to fix it or trace the problem back.

Thanks.

Comments (6)

有深☉意2024-09-08 07:36:50

Since the issue was a MySQL fault with double-encoded UTF8 strings, MySQL is the right way to solve it.

Running the following commands will solve it -

  • mysqldump $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET --add-drop-table --default-character-set=latin1 > export.sql - latin1 is used here to force MySQL not to split the characters, and should not be used otherwise.
  • cp export{,.utf8}.sql - making a backup copy.
  • sed -i -e 's/latin1/utf8/g' export.utf8.sql - Replacing the latin1 with utf8 in the file, in order to import it as UTF-8 instead of 8859-1.
  • mysql $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET < export.utf8.sql - import everything back to the database.

This will solve the issue in about ten minutes.
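
For readers unfamiliar with the term, a double-encoded UTF8 string can be reproduced outside MySQL. The Python sketch below is only an illustration and not part of the fix above: it assumes Python's `latin-1` codec as a stand-in for the charset the bytes were wrongly read as, and it uses the word של from the question as sample data.

```python
# Illustration of "double-encoded UTF-8" (assumption: Python's latin-1 codec
# stands in for the charset the bytes were wrongly read as).
original = "\u05e9\u05dc"  # the Hebrew word של from the question

utf8_bytes = original.encode("utf-8")      # 4 bytes: D7 A9 D7 9C
misread = utf8_bytes.decode("latin-1")     # each byte becomes one character
double_encoded = misread.encode("utf-8")   # each of those is encoded again

# Undoing the damage reverses the two extra steps.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(len(utf8_bytes), len(double_encoded))  # 4 bytes grow to 8
print(recovered == original)
```

Every byte of the original is above 0x7F here, so each one turns into a two-byte sequence when encoded a second time.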

微凉徒眸意2024-09-08 07:36:50

If you look closely at the gibberish, you can tell that each Hebrew character is encoded as 2 characters - it appears that של is encoded as ×©×œ.

This suggests that you are looking at UTF8 bytes that were read as a single-byte, ASCII-like encoding. Converting to UTF8 will not help, because the text now consists of those misread characters and they will simply be kept as they are.

You can read each pair of bytes and reconstruct the original UTF8 from them.

Here is some C# I came up with - this is very simplistic (doesn't fully work - too many assumptions), but I could see some of the characters converted properly:

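// Requires: using System.Text;
private string ToProperHebrew(string gibberish)
{
   // Encoding.Unicode is UTF-16LE: two bytes per character, low byte first.
   byte[] orig = Encoding.Unicode.GetBytes(gibberish);
   byte[] heb = new byte[orig.Length / 2];

   // Keep only the low byte of each character. If the text was misread as
   // Windows-1252, this is right for characters below U+0100 (× or ©) but
   // wrong for ones such as œ (U+0153), whose original byte was 0x9C.
   for (int i = 0; i < heb.Length; i++)
   {
     heb[i] = orig[i * 2];
   }

   return Encoding.UTF8.GetString(heb);
}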

It appears that each byte was re-encoded as two bytes - not sure what encoding was used for this, but discarding one byte seemed to be the right thing for most doubled-up characters.
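
One possible reason the byte-discarding approach only works for some characters: keeping the low byte is correct for characters below U+0100, but Windows-1252 places several of its bytes in the 0x80 to 0x9F range at higher code points, for example 0x9C at œ (U+0153). The Python sketch below mirrors that logic under the assumption that the gibberish came from reading UTF8 bytes as Windows-1252; both helper names are hypothetical.

```python
def low_bytes(gibberish: str) -> bytes:
    """Mimic the C# loop: keep only the low byte of each character."""
    return bytes(ord(ch) & 0xFF for ch in gibberish)


def cp1252_bytes(gibberish: str) -> bytes:
    """Map each character back to its Windows-1252 byte instead."""
    return gibberish.encode("cp1252")


garbled = "\u00d7\u00a9\u00d7\u0153"  # ×©×œ, the garbled form of של

print(low_bytes(garbled).hex())     # last byte comes out as 53, not 9c
print(cp1252_bytes(garbled).hex())  # d7a9d79c, the original UTF-8 bytes
print(cp1252_bytes(garbled).decode("utf-8"))
```

The first two characters survive either way, which matches the observation that only some of the characters were converted properly.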

放血2024-09-08 07:36:50

You might want to look here - the accepted answer to this question shows a way to guess the encoding of a byte[]. All you then have to ensure is that you get the proper bytes from the gibberish.
Guessing can always fail, of course...

情独悲2024-09-08 07:36:50

Based on Oded's and Teddy's answers, I came up with this method, which worked for me:

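import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public String getProperHebrew(String gibberish) {
    // Recover the raw bytes by undoing the mistaken windows-1252 decoding.
    byte[] orig = gibberish.getBytes(Charset.forName("windows-1252"));

    // Decode those bytes as the UTF-8 they originally were.
    // Passing a Charset (not a name) avoids the checked exception.
    return new String(orig, StandardCharsets.UTF_8);
}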

养猫人2024-09-08 07:36:50

You can use the meta tag to set the proper encoding for your page. Here is an example of how you can do that:

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255" />

I suppose that this encoding would do the work.

若言繁花未落2024-09-08 07:36:50

gibberish.encode('windows-1252').decode('utf-8', 'replace')
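
The one-liner assumes the whole string is garbled. The sample in the question mixes intact Hebrew with garbled runs, and Hebrew letters cannot be encoded as Windows-1252, so `encode` would raise on such a string. Below is a rough sketch of a more tolerant variant. The function name `fix_hebrew_mojibake` is hypothetical, it only handles the two-character pairs that Hebrew letters (U+05D0 to U+05EA) turn into, and bytes that were lost when the text was garbled cannot be recovered.

```python
import re

# '×' (U+00D7) is how the UTF-8 lead byte 0xD7 of a Hebrew letter appears
# when read as Windows-1252; it is followed by one more misread character.
_PAIR = re.compile("\u00d7.")


def fix_hebrew_mojibake(text: str) -> str:
    """Repair garbled two-character pairs and leave everything else untouched."""
    def repair(match):
        pair = match.group(0)
        try:
            return pair.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return pair  # not a recoverable pair, keep it as-is

    return _PAIR.sub(repair, text)


# Intact Hebrew followed by a garbled word, as in the question's sample.
mixed = "\u05d5\u05d4\u05ea\u05e7\u05e9\u05d5\u05e8\u05ea \u00d7\u00a9\u00d7\u0153"
print(fix_hebrew_mojibake(mixed))
```

Working pair by pair means a single damaged character leaves only that pair unrepaired instead of failing the whole string.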
