Windows-1252 转 UTF-8 编码

离旧人 2024-08-23 23:42:39

iconv -f WINDOWS-1252 -t UTF-8 文件名.txt

回复收藏 0 原文

最偏执的依靠 2024-08-23 23:42:39

您如何期望重新编码才能知道文件是 Windows-1252？理论上，我相信任何文件都是有效的Windows-1252文件，因为它将每个可能的字节映射到一个字符。

现在肯定有一些特征强烈表明它是 UTF-8 - 例如，如果它以 UTF-8 BOM 开头 - 但它们并不是确定的。

一种选择是首先检测它是否实际上是一个完全有效的 UTF-8 文件，我想……再说一次，这只是建议性的。

我对重新编码工具本身并不熟悉，但您可能想看看它是否能够将文件重新编码为相同编码 - 如果您使用无效文件（即包含无效的 UTF-8 字节序列）它很可能将无效序列转换为问号或类似的东西。此时，您可以通过将文件重新编码为 UTF-8 并查看输入和输出是否相同来检测文件是否为有效的 UTF-8。

或者，以编程方式执行此操作，而不是使用重新编码实用程序 - 例如，在 C# 中这将非常简单。

但重申一下：所有这些都是启发式的。如果您确实不知道文件的编码，那么没有任何东西可以 100% 准确地告诉您。

回复收藏 0 原文

三人与歌 2024-08-23 23:42:39

这是我对类似问题给出的另一个答案的转录：

如果将 utf8_encode() 应用于已经是 UTF8 的字符串，它将返回乱码的 UTF8 输出。

我做了一个函数来解决所有这些问题。它称为 Encoding::toUTF8()。

您不需要知道字符串的编码是什么。它可以是 Latin1 (iso 8859-1)、Windows-1252 或 UTF8，或者字符串可以是它们的混合。 Encoding::toUTF8() 会将所有内容转换为 UTF8。

我这样做是因为一项服务向我提供的数据完全混乱，将 UTF8 和 Latin1 混合在同一个字符串中。

用法：

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

下载：

https://github.com/neitanod/forceutf8

更新：

我已经包含了另一个函数 Encoding::fixUFT8()，它将修复每个看起来乱码的 UTF8 字符串。

用法：

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

示例：

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

将输出：

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

更新：我已将函数 (forceUTF8) 转换为名为 Encoding 的类上的一系列静态函数。新函数是 Encoding::toUTF8()。

Here's a transcription of another answer I gave to a similar question:

If you apply utf8_encode() to an already UTF8 string it will return a garbled UTF8 output.

I made a function that addresses all this issues. It´s called Encoding::toUTF8().

You dont need to know what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

Update:

I've included another function, Encoding::fixUFT8(), wich will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

回复收藏 0 原文

姐不稀罕 2024-08-23 23:42:39

没有通用方法可以判断文件是否使用特定编码进行编码。请记住，编码只不过是文件中的位应如何映射到字符的“协议”。

如果您不知道哪些文件实际上已经以 UTF-8 编码，哪些文件以 windows-1252 编码，则必须检查所有文件并自行查找。在最坏的情况下，这可能意味着您必须使用两种编码中的任何一种打开它们中的每一个，并查看它们是否“看起来”正确 - 即所有字符都正确显示。当然，您可以使用工具支持来做到这一点，例如，如果您确定某些字符包含在 windows-1252 与 UTF-8 中具有不同映射的文件中，您可以 grep 查找它们正如 Seva Akekseyev 提到的，通过“iconv”运行文件后。

另一个幸运的情况是，如果您知道这些文件实际上只包含以 UTF-8 和 windows-1252 编码相同的字符。当然，在这种情况下，你已经完成了。

回复收藏 0 原文

水水月牙 2024-08-23 23:42:39

如果您想在单个命令中重命名多个文件 - 假设您想转换所有 *.txt 文件 - 以下是命令：

find . -name "*.txt" -exec iconv -f WINDOWS-1252 -t UTF-8 {} -o {}.ren \; -a -exec mv {}.ren {} \;

If you want to rename multiple files in a single command ‒ let's say you want to convert all *.txt files ‒ here is the command:

find . -name "*.txt" -exec iconv -f WINDOWS-1252 -t UTF-8 {} -o {}.ren \; -a -exec mv {}.ren {} \;

回复收藏 0 原文

我也只是我 2024-08-23 23:42:39

使用 iconv 命令。

要确保文件位于 Windows-1252 中，请在记事本（Windows 下）中打开它，然后单击“另存为”。记事本建议将当前编码设置为默认编码；如果它是 Windows-1252（或任何 1 字节代码页），则会显示“ANSI”。

回复收藏 0 原文

两人的回忆 2024-08-23 23:42:39

您可以使用记事本++等编辑器更改文件的编码。只需转到编码并选择您想要的即可。

我一直更喜欢Windows 1252

回复收藏 0 原文

默嘫て 2024-08-23 23:42:39

1. 已经是UTF-8的文件不应该被改变 ¹

当我最近遇到这个问题时，我通过首先查找所有内容来解决它
需要转换的文件。
我通过排除不应转换的文件来做到这一点。
这包括二进制文件、纯ASCII文件（其中
根据定义已经具有有效的 UTF-8 编码），并且文件
至少包含一些有效的非 ASCII UTF-8 字符。

简而言之，我递归地搜索了可能应该是的文件
Converted ：

$ find . -type f -name '*' -exec sh -c 'for n; do file -i "$n" | grep -Ev "binary|us-ascii|utf-8"; done' sh {} +

我有一个包含 300 – 400 个文件的子目录树。
其中大约有六个被证明编码错误，并且
通常会返回如下响应：

./<some-path>/plain-text-file.txt: text/plain; charset=iso-8859-1
./<some-other-path>/text-file.txt: text/plain; charset=unknown-8bit

注意编码是 iso-8859-1 或 unknown-8bit。
这是有道理的 – 任何非 ASCII Windows-1252 字符都可以
是有效的 ISO 8859-1 字符 - 或 - 它可以是 27 个字符之一
128 – 159 (x80 – x9F) 范围内不可打印的字符
ISO 8859-1 字符已定义。

1.a.关于 find 的警告。 -exec 解决方案 ²

find 的问题。 -exec 解决方案是它可能会非常慢
– 随着子目录树的大小而增长的问题
审查。

根据我的经验，它可能会更快 - 可能快得多 -
运行数量个命令而不是建议的单个命令
如上所述，如下：

$ file -i * | grep -Ev "binary|us-ascii|utf-8"
$ file -i */* | grep -Ev "binary|us-ascii|utf-8"
$ file -i */*/* | grep -Ev "binary|us-ascii|utf-8"
$ file -i */*/*/* | grep -Ev "binary|us-ascii|utf-8"
$ …

继续增加这些命令的深度，直到响应
像这样的东西：

*/*/*/*/*/*/*: cannot open `*/*/*/*/*/*/*' (No such file or directory)

一旦你看到无法打开/（没有这样的文件或目录），它就是
清楚整个子目录树已经被搜索过。

2. 转换罪魁祸首文件

既然已经找到了所有可疑文件，我更喜欢使用文本
编辑器来帮助转换，而不是使用命令行
像重新编码这样的工具。

2.a.在 Windows 上，考虑使用 Notepad++

在 Windows 上，我喜欢使用 Notepad++ 来转换文件。
如果您需要帮助，请查看这篇优秀的文章。

2.b.在 Linux 或 macOS 上，考虑使用 Visual Studio Code

在 Linux 和 macOS 上，尝试使用 VS Code 来转换文件。
我在这篇文章中给出了一些提示。

引用

^{¹
第 1 部分依赖于使用 file 命令，不幸的是
并不完全可靠。
只要您的所有文件都小于 64 kB，就不应该有
任何问题。
对于（远）大于 64 kB 的文件，存在非 ASCII
文件将错误被识别为纯 ASCII 文件。
此类文件中的非 ASCII 字符越少，风险就越大
他们会被错误地识别。
有关此内容的更多信息，请参阅这篇文章及其评论。
²
小节1。 a. 的灵感来自于这个答案。}

1. The files which are already in UTF-8 should not be changed ¹

When I recently had this issue, I solved it by first finding all
files in need of conversion.
I did this by excluding the files that should not be converted.
This includes binary files, pure ASCII files (which
by definition already have a valid UTF-8 encoding), and files that
contain at least some valid non-ASCII UTF-8 characters.

In short, I recursively searched the files that probably should be
converted :

$ find . -type f -name '*' -exec sh -c 'for n; do file -i "$n" | grep -Ev "binary|us-ascii|utf-8"; done' sh {} +

I had a subdirectory tree containing some 300 – 400 files.
About half a dozen of them turned out to be wrongly encoded, and
typically returned responses like :

./<some-path>/plain-text-file.txt: text/plain; charset=iso-8859-1
./<some-other-path>/text-file.txt: text/plain; charset=unknown-8bit

Note how the encoding was either iso-8859-1, or unknown-8bit.
This makes sense – any non-ASCII Windows-1252 character can either
be a valid ISO 8859-1 character – or – it can be one of the 27
characters in the 128 – 159 (x80 – x9F) range for which no printable
ISO 8859-1 characters are defined.

1. a. A caveat with the `find . -exec` solution ²

A problem with the find . -exec solution is that it can be very slow
– a problem that grows with the size of the subdirectory tree under
scrutiny.

In my experience, it might be faster – potentially much faster –
to run a number of commands instead of the single command suggested
above, as follows :

$ file -i * | grep -Ev "binary|us-ascii|utf-8"
$ file -i */* | grep -Ev "binary|us-ascii|utf-8"
$ file -i */*/* | grep -Ev "binary|us-ascii|utf-8"
$ file -i */*/*/* | grep -Ev "binary|us-ascii|utf-8"
$ …

Continue increasing the depth in these commands until the response is
something like this:

*/*/*/*/*/*/*: cannot open `*/*/*/*/*/*/*' (No such file or directory)

Once you see cannot open / (No such file or directory), it is
clear that the entire subdirectory tree has been searched.

2. Convert the culprit files

Now that all suspicious files have been found, I prefer to use a text
editor to help with the conversion, instead of using a command line
tool like recode.

2. a. On Windows, consider using Notepad++

On Windows, I like to use Notepad++ for converting files.
Have a look at this excellent post if you need help on that.

2. b. On Linux or macOS, consider using Visual Studio Code

On Linux and macOS, try VS Code for converting files.
I've given a few hints in this post.

References

^{¹
Section 1 relies on using the file command, which unfortunately
isn't completely reliable.
As long as all your files are smaller than 64 kB, there shouldn't be
any problem.
For files (much) larger than 64 kB, there is a risk that non-ASCII
files will falsely be identified as pure ASCII files.
The fewer non-ASCII characters in such files, the bigger the risk
that they will be wrongly identified.
For more on this, see this post and its comments.
²
Subsection 1. a. is inspired by this answer.}

回复收藏 0 原文

仙气飘飘 2024-08-23 23:42:39

如果您确定您的文件是 UTF-8 或 Windows 1252（或 Latin1），则可以利用以下事实：如果您尝试转换无效文件，重新编码将退出并出现错误。

虽然 utf8 是有效的 Win-1252，但反之则不然：win-1252 不是有效的 UTF-8。所以：

recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt

会吐出所有 cp1252 文件的错误，然后继续将它们转换为 UTF8。

我会将其包装到一个更干净的 bash 脚本中，保留每个转换后的文件的备份。

在进行字符集转换之前，您可能希望首先确保所有文件中的行结尾一致。否则，重新编码会因此而抱怨，并且可能会转换已经是 UTF8 但行尾错误的文件。

If you are sure your files are either UTF-8 or Windows 1252 (or Latin1), you can take advantage of the fact that recode will exit with an error if you try to convert an invalid file.

While utf8 is valid Win-1252, the reverse is not true: win-1252 is NOT valid UTF-8. So:

recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt

Will spit out errors for all cp1252 files, and then proceed to convert them to UTF8.

I would wrap this into a cleaner bash script, keeping a backup of every converted file.

Before doing the charset conversion, you may wish to first ensure you have consistent line-endings in all files. Otherwise, recode will complain because of that, and may convert files which were already UTF8, but just had the wrong line-endings.

回复收藏 0 原文

阳光①夏 2024-08-23 23:42:39

这个脚本在 Win10/PS5.1 CP1250 到 UTF-8 上对我有用

Get-ChildItem -Include *.php -Recurse | ForEach-Object {
    $file = $_.FullName

    $mustReWrite = $false
    # Try to read as UTF-8 first and throw an exception if
    # invalid-as-UTF-8 bytes are encountered.
    try
    {
        [IO.File]::ReadAllText($file,[Text.Utf8Encoding]::new($false, $true))
    }
    catch [System.Text.DecoderFallbackException]
    {
        # Fall back to Windows-1250
        $content = [IO.File]::ReadAllText($file,[Text.Encoding]::GetEncoding(1250))
        $mustReWrite = $true
    }

    # Rewrite as UTF-8 without BOM (the .NET frameworks' default)
    if ($mustReWrite)
    {
        Write "Converting from 1250 to UTF-8"
        [IO.File]::WriteAllText($file, $content)
    }
    else
    {
        Write "Already UTF-8-encoded"
    }
}

this script worked for me on Win10/PS5.1 CP1250 to UTF-8

Get-ChildItem -Include *.php -Recurse | ForEach-Object {
    $file = $_.FullName

    $mustReWrite = $false
    # Try to read as UTF-8 first and throw an exception if
    # invalid-as-UTF-8 bytes are encountered.
    try
    {
        [IO.File]::ReadAllText($file,[Text.Utf8Encoding]::new($false, $true))
    }
    catch [System.Text.DecoderFallbackException]
    {
        # Fall back to Windows-1250
        $content = [IO.File]::ReadAllText($file,[Text.Encoding]::GetEncoding(1250))
        $mustReWrite = $true
    }

    # Rewrite as UTF-8 without BOM (the .NET frameworks' default)
    if ($mustReWrite)
    {
        Write "Converting from 1250 to UTF-8"
        [IO.File]::WriteAllText($file, $content)
    }
    else
    {
        Write "Already UTF-8-encoded"
    }
}

回复收藏 0 原文

杀手六號 2024-08-23 23:42:39

如前所述，您无法可靠地确定文件是否为 Windows-1252，因为 Windows-1252 将几乎所有字节映射到有效的代码点。但是，如果文件仅采用 Windows-1252 和 UTF-8 且没有其他编码，那么您可以尝试解析 UTF-8 格式的文件，如果它包含无效字节，则它是 Windows-1252 文件

if iconv -f UTF-8 -t UTF-16 $FILE 1>/dev/null 2>&1; then
    # Conversion succeeded
    echo "$FILE is in UTF-8"
else
    # iconv returns error if there are invalid characters in the byte stream
    echo "$FILE is in Windows-1252. Converting to UTF-8"
    iconv -f WINDOWS-1252 -t UTF-8 -o ${FILE}_utf8.txt $FILE
fi

这与许多其他尝试将文件视为 UTF-8 并检查是否存在错误的答案类似。它在 99% 的情况下都可以工作，因为大多数 Windows-1252 文本在 UTF-8 中都是无效的，但在极少数情况下它仍然无法工作。毕竟这是启发式的！

还有各种库和工具可以检测字符集，例如 chardet

$ chardet utf8.txt windows1252.txt iso-8859-1.txt
utf8.txt: utf-8 with confidence 0.99
windows1252.txt: Windows-1252 with confidence 0.73
iso-8859-1.txt: ISO-8859-1 with confidence 0.73

由于启发式的性质，它不可能完全可靠，因此它会输出一个置信度值供人们判断。文件中的人类文本越多，它就越有信心。如果您有非常具体的文本，则需要对图书馆进行更多培训。有关详细信息，请参阅浏览器如何确定所使用的编码？

As said, you can't reliably determine whether a file is Windows-1252 because Windows-1252 maps almost all bytes to a valid code point. However if the files are only in Windows-1252 and UTF-8 and no other encodings then you can try to parse a file in UTF-8 and if it contains invalid bytes then it's a Windows-1252 file

if iconv -f UTF-8 -t UTF-16 $FILE 1>/dev/null 2>&1; then
    # Conversion succeeded
    echo "$FILE is in UTF-8"
else
    # iconv returns error if there are invalid characters in the byte stream
    echo "$FILE is in Windows-1252. Converting to UTF-8"
    iconv -f WINDOWS-1252 -t UTF-8 -o ${FILE}_utf8.txt $FILE
fi

This is similar to many other answers that try to treat the file as UTF-8 and check if there are errors. It works 99% of the time because most Windows-1252 texts will be invalid in UTF-8, but there will still be rare cases when it won't work. It's heuristic after all!

There are also various libraries and tools to detect the character set, such as chardet

$ chardet utf8.txt windows1252.txt iso-8859-1.txt
utf8.txt: utf-8 with confidence 0.99
windows1252.txt: Windows-1252 with confidence 0.73
iso-8859-1.txt: ISO-8859-1 with confidence 0.73

It can't be completely reliable due to the heuristic nature, so it outputs a confidence value for people to judge. The more human text in the file, the more confident it'll be. If you have very specific texts then more trainings for the library will be needed. For more information read How do browsers determine the encoding used?

回复收藏 0 原文

献世佛 2024-08-23 23:42:39

找到这个TYPE 命令的文档：

将 ASCII (Windows1252) 文件转换为 Unicode (UCS- 2 le) 文本文件：

For /f "tokens=2 delims=:" %%G in ('CHCP') do Set _codepage=%%G    
CHCP 1252 >NUL    
CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > unicode.txt 2>NUL    
CMD.EXE /D /U /C TYPE ascii_file.txt >> unicode.txt    
CHCP %_codepage%

上述技术（基于 Carlos M. 的脚本）首先创建一个带有字节顺序标记 (BOM) 的文件，然后附加原始文件的内容。 CHCP 用于确保会话使用 Windows1252 代码页运行，以便正确解释字符 0xFF 和 0xFE (ÿþ)。

Found this documentation for the TYPE command:

Convert an ASCII (Windows1252) file into a Unicode (UCS-2 le) text file:

For /f "tokens=2 delims=:" %%G in ('CHCP') do Set _codepage=%%G    
CHCP 1252 >NUL    
CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > unicode.txt 2>NUL    
CMD.EXE /D /U /C TYPE ascii_file.txt >> unicode.txt    
CHCP %_codepage%

The technique above (based on a script by Carlos M.) first creates a file with a Byte Order Mark (BOM) and then appends the content of the original file. CHCP is used to ensure the session is running with the Windows1252 code page so that the characters 0xFF and 0xFE (ÿþ) are interpreted correctly.

回复收藏 0 原文

万人眼中万个我 2024-08-23 23:42:39

UTF-8 没有 BOM，因为它既多余又无效。 BOM 有用的地方是 UTF-16，它可以像 Microsoft 那样进行字节交换。 UTF-16 如果用于内存缓冲区中的内部表示。使用 UTF-8 进行交换。默认情况下，UTF-8、从 US-ASCII 派生的任何其他内容和 UTF-16 都是自然/网络字节顺序。 Microsoft UTF-16 需要 BOM，因为它是字节交换的。

为了将 Windows-1252 转换为 ISO8859-15，我首先将 ISO8859-1 转换为 US-ASCII，以获取具有相似字形的代码。然后，我将 Windows-1252 转换为 ISO8859-15，将其他非 ISO8859-15 字形转换为多个 US-ASCII 字符。

回复收藏 0 原文

Windows-1252 转 UTF-8 编码

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（13）

1. 已经是UTF-8的文件不应该被改变 ¹

1.a.关于 find 的警告。 -exec 解决方案 ²

2. 转换罪魁祸首文件

2.a.在 Windows 上，考虑使用 Notepad++

2.b.在 Linux 或 macOS 上，考虑使用 Visual Studio Code

引用

1. The files which are already in UTF-8 should not be changed ¹

1. a. A caveat with the `find . -exec` solution ²

2. Convert the culprit files

2. a. On Windows, consider using Notepad++

2. b. On Linux or macOS, consider using Visual Studio Code

References

关于作者

相关话题

热门标签

推荐作者

梦中的蝴蝶

时光病人

眼角的笑意。

zhxjcooler

莫言歌

暖树树初阳…

友情链接

Windows-1252 转 UTF-8 编码

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（13）

1. 已经是UTF-8的文件不应该被改变 1

1.a.关于 find 的警告。 -exec 解决方案 2

2. 转换罪魁祸首文件

2.a.在 Windows 上，考虑使用 Notepad++

2.b.在 Linux 或 macOS 上，考虑使用 Visual Studio Code

引用

1. The files which are already in UTF-8 should not be changed 1

1. a. A caveat with the find . -exec solution 2

2. Convert the culprit files

2. a. On Windows, consider using Notepad++

2. b. On Linux or macOS, consider using Visual Studio Code

References

关于作者

相关话题

热门标签

推荐作者

梦中的蝴蝶

时光病人

眼角的笑意。

zhxjcooler

莫言歌

暖树树初阳…

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

1. 已经是UTF-8的文件不应该被改变 ¹

1.a.关于 find 的警告。 -exec 解决方案 ²

1. The files which are already in UTF-8 should not be changed ¹

1. a. A caveat with the `find . -exec` solution ²