How do I correct the character encoding of a file?

Posted on 2024-07-05 07:41:31

I have an ANSI encoded text file that should not have been encoded as ANSI as there were accented
characters that ANSI does not support. I would rather work with UTF-8.

Can the data be decoded correctly or is it lost in transcoding?

What tools could I use?

Here is a sample of what I have:

Ã§ Ã©

I can tell from context (cafÃ© should be café) that these should be these two characters:

ç é

Comments (11)

深海蓝天 2024-07-12 07:41:31

In Sublime Text, use File -> Reopen with Encoding and choose the correct encoding.

Generally, the encoding is auto-detected, but if not, you can use the above method.

愛放△進行李 2024-07-12 07:41:31

If you see question marks in the file, or if the accents are already lost, going back to UTF-8 will not help your cause. For example, if café became cafe, changing the encoding alone will not help (and you'll need the original data).

Can you paste some text here? That will help us answer for sure.

水溶 2024-07-12 07:41:31

I found a simple way to auto-detect file encodings: change the file to a text file (on a Mac, rename the file extension to .txt) and drag it to a Mozilla Firefox window (or use File -> Open). Firefox will detect the encoding; you can see what it came up with under View -> Character Encoding.

Once I knew the correct encoding, I changed my file's encoding using TextMate: File -> Reopen using encoding, and choose your encoding. Then File -> Save As, change the encoding to UTF-8 and the line endings to LF (or whatever you want).

你是年少的欢喜 2024-07-12 07:41:31

EDIT: A simple possibility to eliminate before getting into more complicated solutions: have you tried setting the character set to utf8 in the text editor in which you're reading the file? This could just be a case of somebody sending you a utf8 file that you're reading in an editor set to say cp1252.

Just taking the two examples, this is a case of utf8 being read through the lens of a single-byte encoding, likely one of iso-8859-1, iso-8859-15, or cp1252. If you can post examples of other problem characters, it should be possible to narrow that down more.

As visual inspection of the characters can be misleading, you'll also need to look at the underlying bytes: the § you see on screen might be either 0xa7 or 0xc2a7, and that will determine the kind of character set conversion you have to do.
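One way to look at the underlying bytes is a few lines of Python. This is only a sketch: the helper name `hex_bytes` is made up for illustration, and cp1252 is assumed as the single-byte encoding. It shows how the same on-screen § can be stored as either 0xa7 or 0xc2a7.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as a hex string."""
    return text.encode(encoding).hex()

# The same visible character, stored two different ways:
print(hex_bytes("§", "cp1252"))  # a7
print(hex_bytes("§", "utf-8"))   # c2a7

# ç is c3a7 in UTF-8; read as cp1252, those two bytes display as Ã followed by §.
print(hex_bytes("ç", "utf-8"))   # c3a7
print(bytes.fromhex("c3a7").decode("cp1252") == "Ã§")  # True
```

For a real file, reading it in binary mode (`open(path, "rb")`) and printing the hex of the first few dozen bytes gives the same kind of view.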

Can you assume that all of your data has been distorted in exactly the same way, that it has come from the same source and gone through the same sequence of transformations, so that, for example, there isn't a single é in your text and it's always Ã©? If so, the problem can be solved with a sequence of character set conversions. If you can be more specific about the environment you're in and the database you're using, somebody here can probably tell you how to perform the appropriate conversion.

Otherwise, if the problem characters only occur in some places in your data, you'll have to take it instance by instance, based on assumptions along the lines of "no author intended to put Ã§ in their text, so whenever you see it, replace it with ç". This option is riskier, firstly because those assumptions about the authors' intentions might be wrong, and secondly because you'll have to spot every problem character yourself, which might be impossible if there's too much text to inspect visually or if it's written in a language or writing system that's foreign to you.
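If the whole text really was distorted the same way (UTF-8 bytes decoded as a single-byte encoding), the "sequence of character set conversions" can be sketched in Python as below. The function name `fix_mojibake` is hypothetical, and cp1252 is an assumption; iso-8859-1 or iso-8859-15 may be the right choice for other data.

```python
def fix_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo 'UTF-8 bytes decoded as a single-byte encoding'.

    Re-encoding with the wrong encoding recovers the original bytes,
    which are then decoded as UTF-8. Raises UnicodeError if the text
    was not distorted in exactly this way.
    """
    return text.encode(wrong_encoding).decode("utf-8")

garbled = "cafÃ© franÃ§ais"
repaired = fix_mojibake(garbled)
print(repaired == "café français")  # True

# Text that is already correct does not survive the round trip,
# which is a useful signal that the data is not uniformly garbled.
try:
    fix_mojibake("café")
except UnicodeDecodeError:
    print("not mojibake")
```

Because the conversion is strict, mixed data (some parts garbled, some not) fails loudly instead of being damaged further, which is where the instance-by-instance approach comes in.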

濫情▎り 2024-07-12 07:41:31

With vim from command line:

vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename

为你拒绝所有暧昧 2024-07-12 07:41:31

When you see character sequences like Ã§ and Ã©, it's usually an indication that a UTF-8 file has been opened by a program that reads it in as ANSI (or similar). Unicode characters such as these:

U+00C2 Latin capital letter A with circumflex
U+00C3 Latin capital letter A with tilde
U+0082 Break permitted here
U+0083 No break here

tend to show up in ANSI text because of the variable-byte strategy that UTF-8 uses. This strategy is explained very well here.

The advantage for you is that the appearance of these odd characters makes it relatively easy to find, and thus replace, instances of incorrect conversion.

I believe that, since ANSI always uses 1 byte per character, you can handle this situation with a simple search-and-replace operation. Or more conveniently, with a program that includes a table mapping between the offending sequences and the desired characters, like these:

â€œ -> “ # should be an opening double curly quote
â€? -> ” # should be a closing double curly quote

Any given text, assuming it's in English, will have a relatively small number of different types of substitutions.

Hope that helps.
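A table like that does not have to be typed by hand. As a rough sketch (the function names `build_table` and `apply_table` are invented here, and cp1252 is assumed to be the encoding the file was wrongly read as), it can be generated from the characters you expect to find:

```python
def build_table(chars: str, wrong_encoding: str = "cp1252") -> dict:
    """Map each garbled sequence to the character it should have been."""
    table = {}
    for ch in chars:
        try:
            garbled = ch.encode("utf-8").decode(wrong_encoding)
        except UnicodeDecodeError:
            # Some bytes are undefined in the single-byte encoding, so
            # the garbled form cannot be predicted for this character.
            continue
        table[garbled] = ch
    return table

def apply_table(text: str, table: dict) -> str:
    # Replace longer sequences first so a short key never
    # clobbers part of a longer one.
    for garbled in sorted(table, key=len, reverse=True):
        text = text.replace(garbled, table[garbled])
    return text

table = build_table("çé“")
fixed = apply_table("â€œcafÃ©", table)
print(fixed == "“café")  # True
```

The closing curly quote is skipped by `build_table` because its last UTF-8 byte (0x9D) has no cp1252 mapping in Python's codec, which matches the `â€?` entry above where the third character could not be displayed.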

靑春怀旧 2024-07-12 07:41:31

Follow these steps with Notepad++

1- Copy the original text

2- In Notepad++, open a new file, then go to the Encoding menu and pick the encoding you think the original text follows. Also try the "ANSI" encoding, since certain programs sometimes read Unicode files as ANSI

3- Paste

4- Then convert to Unicode through the same menu: Encoding -> "Encode in UTF-8" (not "Convert to UTF-8"), and hopefully the text will become readable

The above steps apply for most languages. You just need to guess the original encoding before pasting in notepad++, then convert through the same menu to an alternate Unicode-based encoding to see if things become readable.

Most languages exist in two forms of encoding:

1- The legacy "ANSI" form, 8 bits per character, which most computers used initially. 8 bits allow only 256 values: the first 128 are the ASCII Latin and control characters, and the remaining 128 are interpreted differently depending on the PC's language settings (code page)

2- The Unicode standard, which gives a unique code to each character in all currently known languages, with room for plenty more. A Unicode file should be readable on any PC that has a font for the language installed. Note that UTF-8 can use up to 32 bits (4 bytes) per character and covers the same range as UTF-16 and UTF-32; it just keeps ASCII characters at 8 bits to save disk space
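To make the difference concrete, here is a small Python illustration (the sample characters and the helper name `utf8_length` are just picked for this example). It prints how many bytes UTF-8 uses per character and shows one legacy byte being read differently under two code pages.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to store a single character."""
    return len(ch.encode("utf-8"))

# ASCII stays at 1 byte; other characters take 2 to 4 bytes.
for ch in ["a", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex()}")

# In a legacy 8-bit encoding, the meaning of a byte above 0x7F
# depends on the code page used to read it.
byte = bytes([0xE9])
print(byte.decode("cp1252") == "é")  # True (Western European)
print(byte.decode("cp1251") == "й")  # True (Cyrillic)
```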

停滞 2024-07-12 07:41:31

I found this question when searching for a solution to a code page issue I had with Chinese characters, but in the end my problem was just Windows not displaying them correctly in the UI.

In case anyone else has that same issue, you can fix it simply by changing the locale in Windows to China and then back again.

I found the solution here:

http://answers.microsoft.com/en-us/windows/forum/windows_7-desktop/how-can-i-get-chinesejapanese-characters-to/fdb1f1da-b868-40d1-a4a4-7acadff4aafa?page=2&auth=1

I also upvoted Gabriel's answer, as looking at the data in Notepad++ was what tipped me off about Windows.

橘寄 2024-07-12 07:41:31

And then there is the somewhat older recode program.

合约呢 2024-07-12 07:41:31

There are programs that try to detect the encoding of a file, such as chardet. You could then convert it to a different encoding using iconv. But that requires that the original text is still intact and that no information has been lost (for example, by removing accents or whole accented letters).
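Once the source encoding is known (or guessed), the conversion step itself is short. The following is a standard-library Python sketch in the same spirit as an `iconv -f ... -t UTF-8` call; the function name `transcode_to_utf8` and the file names are placeholders, and detection is left out on purpose.

```python
import os
import tempfile
from pathlib import Path

def transcode_to_utf8(src: str, dst: str, source_encoding: str) -> None:
    """Rewrite `src` as UTF-8 in `dst`, assuming `src` is in `source_encoding`."""
    raw = Path(src).read_bytes()
    # Strict decoding: fail loudly rather than silently replacing bad bytes.
    text = raw.decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))

# Small demonstration with a temporary cp1252 file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.txt")
    dst = os.path.join(tmp, "utf8.txt")
    Path(src).write_bytes("café".encode("cp1252"))
    transcode_to_utf8(src, dst, "cp1252")
    print(Path(dst).read_bytes().hex())  # 636166c3a9
```

Writing to a separate output file keeps the original intact, which matters when the source encoding was only a guess.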
