Get the encoding of a file in Windows

Posted on 2024-09-19 03:37:43

This isn't really a programming question, but is there a command-line or Windows tool (Windows 7) to get the current encoding of a text file? Sure, I can write a little C# app, but I wanted to know if there is something already built in.

Comments (15)

貪欢 2024-09-26 03:37:44

Here's my take on how to detect the Unicode family of text encodings via the BOM. The accuracy of this method is low: it only works on text files (specifically Unicode files), and it defaults to ascii when no BOM is present (as most text editors do; UTF8 would be the better default if you want to match the HTTP/web ecosystem).

Update 2018: I no longer recommend this method. I recommend using file.exe from Git or the *nix tools, as recommended by @Sybren, and I show how to do that via PowerShell in a later answer.

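# from https://gist.github.com/zommarin/1480974
function Get-FileEncoding($Path) {
    # Read the first four bytes (-Encoding byte is the Windows PowerShell syntax)
    $bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)

    # Empty file: there is nothing to detect
    if(!$bytes) { return 'utf8' }

    # Compare the leading bytes, written as lowercase hex, against the known BOMs
    switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
        '^efbbbf'   { return 'utf8' }
        '^2b2f76'   { return 'utf7' }
        '^fffe0000' { return 'utf32' }            # UTF-32 LE; must be tested before '^fffe'
        '^fffe'     { return 'unicode' }
        '^feff'     { return 'bigendianunicode' }
        '^0000feff' { return 'bigendianutf32' }   # 00 00 FE FF is the big-endian UTF-32 BOM
        default     { return 'ascii' }
    }
}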

dir ~\Documents\WindowsPowershell -File | 
    select Name,@{Name='Encoding';Expression={Get-FileEncoding $_.FullName}} | 
    ft -AutoSize

Recommendation: This can work reasonably well if the dir, ls, or Get-ChildItem call only checks known text files, and you're only looking for "bad encodings" from a known list of tools. (For example, SQL Management Studio defaults to UTF16, which broke Git's auto-cr-lf for Windows, which was the default for many years.)
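
As a cross-check of the BOM approach above, here is a minimal Python sketch (not part of the original answer) that compares the first bytes of a file against the byte order marks defined in Python's codecs module. The names detect_bom and detect_file_bom are made up for illustration. Longer signatures are tested first, because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE). A UTF-16 LE file whose first character is NUL would still be reported as UTF-32 LE, so treat the result as a hint.

```python
import codecs

# Longest signatures first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(head: bytes, default: str = "ascii") -> str:
    """Return an encoding name based on a leading BOM, or `default` if none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default


def detect_file_bom(path: str, default: str = "ascii") -> str:
    """Read the first four bytes of `path` and classify them by BOM."""
    with open(path, "rb") as handle:
        return detect_bom(handle.read(4), default)
```

Calling detect_file_bom on each path from os.listdir gives roughly the same table as the PowerShell pipeline above.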

薄凉少年不暖心 2024-09-26 03:37:44

A simple solution might be opening the file in Firefox.

  1. Drag and drop the file into Firefox
  2. Press Ctrl+I to open the page info

The text encoding will appear in the "Page Info" window.

Note: If the file is not in txt format, just rename it to txt and try again.

P.S. For more info see this article.

你不是我要的菜∠ 2024-09-26 03:37:44

I wrote the #4 answer (at time of writing). But lately I have git installed on all my computers, so now I use @Sybren's solution. Here is a new answer that makes that solution handy from powershell (without putting all of git/usr/bin in the PATH, which is too much clutter for me).

Add this to your profile.ps1:

$global:gitbin = 'C:\Program Files\Git\usr\bin'
Set-Alias file.exe $gitbin\file.exe

Then use it like this: file.exe --mime-encoding *. You must include .exe in the command for the PS alias to work.

But if you don't customize your PowerShell profile.ps1 I suggest you start with mine: https://gist.github.com/yzorg/8215221/8e38fd722a3dfc526bbe4668d1f3b08eb7c08be0
and save it to ~\Documents\WindowsPowerShell. It's safe to use on a computer without git, but will write warnings when git is not found.

The .exe in the command is also how I use C:\WINDOWS\system32\where.exe from powershell; and many other OS CLI commands that are "hidden by default" by powershell, *shrug*.

梦亿 2024-09-26 03:37:44

Here is some C code for reliable ASCII, BOM, and UTF-8 detection: https://unicodebook.readthedocs.io/guess_encoding.html

Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM,
UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document.
For all other encodings, you have to trust heuristics based on statistics.
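
To illustrate why ASCII and UTF-8 count as reliably detectable, here is a small Python sketch (an illustration, not the C code from the linked page). It relies on the fact that a strict decoder rejects any byte sequence that is not valid in the encoding. The function name classify_text is hypothetical, and the "unknown" result stands for everything that needs statistical heuristics.

```python
def classify_text(data: bytes) -> str:
    """Classify bytes as 'ascii', 'utf-8', or 'unknown' using strict decoding."""
    try:
        data.decode("ascii")  # fails on any byte above 0x7F
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # fails on malformed multi-byte sequences
        return "utf-8"
    except UnicodeDecodeError:
        # Legacy 8-bit code pages (Latin-1, Windows-1252, OEM 437, ...) end up
        # here; telling them apart requires heuristics.
        return "unknown"
```

Text in a legacy 8-bit encoding rarely happens to form valid UTF-8 sequences, which is what makes the second check dependable in practice.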

EDIT:

A PowerShell version of a C# answer from "Effective way to find any file's Encoding". It only works with signatures (BOMs).

# get-encoding.ps1
param([Parameter(ValueFromPipeline=$True)] $filename)
begin {
  # set the .NET current directory so that relative paths resolve
  [Environment]::CurrentDirectory = (pwd).path
}
process {
  # $true enables BOM detection; the default encoding is only the fallback
  $reader = [System.IO.StreamReader]::new($filename,
    [System.Text.Encoding]::default,$true)
  # Peek() forces a read so that CurrentEncoding gets updated
  $peek = $reader.Peek()
  $encoding = $reader.currentencoding
  $reader.close()
  [pscustomobject]@{Name=split-path $filename -leaf
                BodyName=$encoding.BodyName
                EncodingName=$encoding.EncodingName}
}


.\get-encoding chinese8.txt

Name         BodyName EncodingName
----         -------- ------------
chinese8.txt utf-8    Unicode (UTF-8)


get-childitem -file | .\get-encoding
简单气质女生网名 2024-09-26 03:37:44

You can check this by opening Git Bash in the file's folder and running the command file -i file_name

Example:

user filesData
$ file -i data.csv
data.csv: text/csv; charset=utf-8
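
If the result is needed in a script rather than on screen, the output line can be parsed. The following Python sketch is an illustration under the assumption that the output has the "name: mime/type; charset=xxx" shape shown in the example above. The names parse_file_i and run_file_i are hypothetical, and run_file_i further assumes that a file executable (for instance the one shipped with Git) is on the PATH.

```python
import re
import subprocess

# Matches lines such as "data.csv: text/csv; charset=utf-8".
_LINE = re.compile(r"^(?P<name>.*):\s+(?P<mime>[^;]+); charset=(?P<charset>\S+)$")


def parse_file_i(line: str):
    """Split one `file -i` output line into (name, mime type, charset)."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return match.group("name"), match.group("mime").strip(), match.group("charset")


def run_file_i(path: str):
    """Run `file -i` on one path; assumes `file` is available on the PATH."""
    result = subprocess.run(
        ["file", "-i", path], capture_output=True, text=True, check=True
    )
    return parse_file_i(result.stdout)
```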
ヤ经典坏疍 2024-09-26 03:37:44

Looking for a Node.js/npm solution? Try encoding-checker:

npm install -g encoding-checker

Usage

Usage: encoding-checker [-p pattern] [-i encoding] [-v]
 
Options:
  --help                 Show help                                     [boolean]
  --version              Show version number                           [boolean]
  --pattern, -p, -d                                               [default: "*"]
  --ignore-encoding, -i                                            [default: ""]
  --verbose, -v                                                 [default: false]

Examples

Get encoding of all files in current directory:

encoding-checker

Return encoding of all md files in current directory:

encoding-checker -p "*.md"

Get encoding of all files in current directory and its subfolders (will take quite some time for huge folders; seemingly unresponsive):

encoding-checker -p "**"

For more examples, refer to the npm documentation or the official repository.

浮世清欢 2024-09-26 03:37:44

EncodingChecker

File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.

File Encoding Checker requires .NET 4 or above to run.

倾`听者〃 2024-09-26 03:37:44

Similar to the Notepad solution listed above, you can also open the file in Visual Studio, if you're using that. In Visual Studio, select "File > Advanced Save Options...".

The "Encoding:" combo box will tell you specifically which encoding is currently being used for the file. It lists far more text encodings than Notepad does, so it's useful when dealing with files from around the world.

Just like in Notepad, you can also change the encoding from the list of options there and then save the file after hitting "OK". You can also select the encoding you want through the "Save with Encoding..." option in the Save As dialog (by clicking the arrow next to the Save button).

情深缘浅 2024-09-26 03:37:44

The only ways I have found to do this are Vim and Notepad++.

極樂鬼 2024-09-26 03:37:44

Using PowerShell

After many years of trying to get the file encoding with native CMD/PowerShell methods, and always having to resort to using (and installing) 3rd party software like Cygwin, git-bash and other external binaries, there is finally a native method.

Before people start complaining about all the ways this can fail, please understand that this tool is primarily meant for identifying text, log, CSV and TAB types of files, not binary files. In addition, detecting a file's encoding is mostly a guessing game, so the provided script makes some rudimentary guesses that will certainly fail on large files. Feel free to test it and leave feedback for improvements in the gist.

To test this, I dumped a bunch of weird garbage text into a string and then exported it using the available Windows encodings.

ASCII, BigEndianUnicode, BigEndianUTF32, OEM, Unicode, UTF7, UTF8, UTF8BOM, UTF8NoBOM, UTF32

# The Garbage
$d=''; (33..126 && 161..252) | ForEach-Object { $c = $([char]$_); $d += ${c} }; $d = "1234 5678 ABCD EFGH`nCRLF: `r`nESC[ :`e[`nESC[m :`e[m`n`r`nASCII [22-126,161-252]:`n$d";

$elist=@('ASCII','BigEndianUnicode','BigEndianUTF32','OEM','Unicode','UTF7','UTF8','UTF8BOM','UTF8NoBOM','UTF32')
$elist | ForEach-Object { $ec=[string]($_); $fp = "zx_$ec.txt"; Write-Host -Fo DarkGray ("Encoding to file: {0}" -f $fp); $d | Out-File -Encoding $ec -FilePath $fp; }

# ls | encguess

 ascii           zx_ASCII.txt
 utf-16 BE       zx_BigEndianUnicode.txt
 utf-32 BE       zx_BigEndianUTF32.txt
 OEM (finds) : (3)
 OEM 437 ?       zx_OEM.txt
 utf-16 LE       zx_Unicode.txt
 utf-32 LE       zx_UTF32.txt
 utf-7           zx_UTF7.txt
 utf-8           zx_UTF8.txt
 utf-8 BOM       zx_UTF8BOM.txt
 utf-8           zx_UTF8NoBOM.txt
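
To see which bytes such a test produces, the same idea can be sketched in Python (an illustration only, not the author's script). The labels mirror the output above. The codec choices are assumptions: cp437 stands in for the OEM code page, and the helper name encode_samples is made up.

```python
import codecs

SAMPLE = "1234 ABCD \u00e9\u00fc"

# Label -> encoder. BOMs are added explicitly so the byte layout is visible.
ENCODERS = {
    "ascii": lambda s: s.encode("ascii", errors="replace"),
    "utf-8": lambda s: s.encode("utf-8"),
    "utf-8 BOM": lambda s: s.encode("utf-8-sig"),
    "utf-16 LE": lambda s: codecs.BOM_UTF16_LE + s.encode("utf-16-le"),
    "utf-16 BE": lambda s: codecs.BOM_UTF16_BE + s.encode("utf-16-be"),
    "utf-32 LE": lambda s: codecs.BOM_UTF32_LE + s.encode("utf-32-le"),
    "utf-32 BE": lambda s: codecs.BOM_UTF32_BE + s.encode("utf-32-be"),
    "OEM 437": lambda s: s.encode("cp437", errors="replace"),
}


def encode_samples(text: str = SAMPLE) -> dict:
    """Encode `text` once per label and return {label: bytes}."""
    return {label: encode(text) for label, encode in ENCODERS.items()}


if __name__ == "__main__":
    for label, raw in encode_samples().items():
        print(f"{label:<10} {raw[:8].hex(' ')}")
```

For non-ASCII input, the ASCII export degrades characters to "?" and the UTF-8 export produces lead bytes such as C3. Those are the byte patterns that the "?" and "c2|c3" comments in the script refer to.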

Here is the code; it can also be found at the gist URL.

#!/usr/bin/env pwsh
# GuessFileEncoding.ps1 - Guess File Encoding for Windows-11 using Powershell
# -*- coding: utf-8 -*-
#------------------------------------------------------------------------------
#   Author      : not2qubit
#   Date        : 2023-11-27
#   Version:    : 1.0.0
#   License:    : CC-BY-SA-4.0
#   URL:        : https://gist.github.com/eabase/d4f16c8c6535f3868d5dfb1efbde0e5a
#--------------------------------------------------------
#   Usage       : ls | encguess
#               : encguess .\somefile.txt
#--------------------------------------------------------
#   References: 
# 
#   [1] https://www.fileformat.info/info/charset/UTF-7/list.htm
#   [2] https://learn.microsoft.com/en-gb/windows/win32/intl/code-page-identifiers
#   [3] https://learn.microsoft.com/en-us/windows/console/console-virtual-terminal-sequences
#   [4] https://gist.github.com/fnky/458719343aabd01cfb17a3a4f7296797
#   [5] https://github.com/dankogai/p5-encode/blob/main/lib/Encode/Guess.pm
#
#--------------------------------------------------------
# https://stackoverflow.com/a/62511302/1147688

Function Find-Bytes([byte[]]$Bytes, [byte[]]$Search, [int]$Start, [Switch]$All) {
    For ($Index = $Start; $Index -le $Bytes.Length - $Search.Length ; $Index++) {
        For ($i = 0; $i -lt $Search.Length -and $Bytes[$Index + $i] -eq $Search[$i]; $i++) {}
        If ($i -ge $Search.Length) { 
            $Index
            If (!$All) { Return }
        } 
    }
}

function get_file_encoding {
    param([Parameter(ValueFromPipeline=$True)] $filename)    
    
    begin {
        # Use .NET to set current directory 
        [Environment]::CurrentDirectory = (pwd).path
    }
    
    process {
        
        function guess_encoding ($bytes) {
            # ---------------------------------------------------------------------------------------------------
            # Plan:     Do the easy checks first!
            # 1. scan whole file & check if there are no codes above [1-127] and excess of "?" (0x3f)  --> ASCII
            # 2. scan whole file & check if there are codes above [1-127]           --> ? ANSI/OEM/UTF-8
            # 3. scan whole file & check if there are many codes "2b41" &  char<127 --> UTF-7  --> "2b2f76" UTF-7 BOM
            # 4. scan whole file & check if there are many codes "c2 | c3"          --> UTF-8
            # ---------------------------------------------------------------------------------------------------
            switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
                # 1. Check UTF-8 BOM
                '^efbbbf'   { return 'utf-8 BOM' }          # UTF-8 BOM     (?)
                '^2b2f76'   { return 'utf-7 BOM' }          # UTF-7 BOM     (65000)
                
                # 2.  Check UTF-32 (BE|LE)
                '^fffe0000' { return 'utf-32 LE' }          # UTF-32 LE     (12000)
                '^0000feff' { return 'utf-32 BE' }          # UTF-32 BE     (12001)     'bigendianutf32'
                
                # 3.  Check UTF-16 (BE|LE)
                '^fffe'     { return 'utf-16 LE' }          # UTF-16 LE     (1200)      'unicode'
                '^feff'     { return 'utf-16 BE' }          # UTF-16 BE     (1201)      'bigendianunicode'
                
                default     { return 'unsure' }             # 
            }
        }
        
        function guess_again ($blob) {
            
            #-------------------------------
            # 1. Check if ASCII [0-127] (7-bit)
            #-------------------------------
            # (a) Check if using ASCII above 127
            $guess_ascii = 1
            foreach ($i in  $blob) { if ($i -gt 127) { $guess_ascii=0; break; } }
            
            # (b) Check if there are many consecutive "?"s. 
            #     That would indicate having erroneously saved a 
            #     ISO-8859-1 character containing file, as ASCII.
            
            #$b = [byte[]]("????".ToCharArray())
            #$n = (Find-Bytes -all $blob $b).Count
            #if ($n -gt 4) {}
            
            #-------------------------------
            # 2. Check for UTF-7 strings "2b41" (43 65)
            #-------------------------------
            
            $b = [byte[]]("+A".ToCharArray())
            $finds=(Find-Bytes -all $blob $b).Count
            $quart = [math]::Round(($blob.length)*0.05)
            #Write-Host -Fo DarkGray "  UTF-7 (quart,finds) : (${quart},${finds})"
            
            if ( ($finds -gt 10) -And ($guess_ascii -eq 1) ) { 
                return 'utf-7' 
            } elseif ($guess_ascii -eq 1) {
                return 'ascii'
            } 
            
            #-------------------------------
            # 3. Check for UTF-8 strings "c2|c3" (194,195)
            #-------------------------------
            # If > 25% are c2|c3, probably utf-8
            $b = [byte[]](0xc2)
            $c = [byte[]](0xc3)
            $f1=(Find-Bytes -all $blob $b).Count
            $f2=(Find-Bytes -all $blob $c).Count
            $quart = [math]::Round(($blob.length)*0.25)
            $finds = ($f1 + $f2)
            
            if ($finds -gt $quart) { return "utf-8" } 
            
            #-------------------------------
            # 4. Check for OEM Strings:
            #-------------------------------
            # Check for "4x" sequences of 'AAAA'(41), 'IIII'(49), 'OOOO'(4f)
            $n = 0
            #$oemlist = @(65,73,79)
            $oemlist = @('A','I','O')
            #$b = [byte[]](("$i"*4).ToCharArray())
            foreach ($i in $oemlist) {$b = [byte[]](("$i"*4).ToCharArray()); $n += (Find-Bytes -all $blob $b).Count  } 
            #$blob | Group-Object | Select Name, Count | Sort -Top 15 -Descending Count
            Write-Host -Fo DarkGray "  OEM (finds) : ($n)"
            
            if ($n -ge 3) { return "OEM 437 ?" }
            
            return "unknown"
        }
        
        $bytes = [byte[]](Get-Content $filename -AsByteStream -ReadCount 4 -TotalCount 4)
        
        if (!$bytes) { 
            $guess = 'failed' 
        } else {
            $guess = guess_encoding($bytes)
        }
        
        if ($guess -eq 'unsure') {
            # 28591  iso-8859-1  Western European (ISO) // Windows-1252
            $blob = [byte[]](Get-Content $filename -AsByteStream -ReadCount 0)
            $guess = guess_again($blob)
        }
        
        $name = Split-Path $filename -Leaf   # works for path strings as well as FileInfo objects
        Write-Host -Fo White ("  {0,-16}" -f $guess) -Non; Write-Host -Fo DarkYellow "$name"
    }
}

Set-Alias encguess get_file_encoding
世界等同你 2024-09-26 03:37:43

Open up your file using the regular old vanilla Notepad that comes with Windows 7.
It will show you the encoding of the file when you click "Save As...".

Whatever encoding is selected by default is the current encoding of the file.

If it is UTF-8, you can change it to ANSI and click Save to change the encoding (or vice versa).

There are many different types of encodings, but this was all I needed when our export files were in UTF-8 and the 3rd party required ANSI. It was a one-time export, so Notepad fit the bill for me.
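
For a repeated export, the same UTF-8 to ANSI conversion can be scripted. Below is a minimal Python sketch, assuming that "ANSI" means Windows-1252 (the actual ANSI code page depends on the system locale). The function names utf8_to_ansi and convert_file are made up for illustration.

```python
def utf8_to_ansi(data: bytes, ansi_codec: str = "cp1252") -> bytes:
    """Re-encode UTF-8 bytes (with or without BOM) in an ANSI code page.

    Raises UnicodeEncodeError if a character has no equivalent in the
    target code page, instead of silently replacing it.
    """
    text = data.decode("utf-8-sig")  # "utf-8-sig" drops a leading BOM if present
    return text.encode(ansi_codec)


def convert_file(source: str, target: str, ansi_codec: str = "cp1252") -> None:
    """Read `source` as UTF-8 and write it to `target` in the ANSI code page."""
    with open(source, "rb") as handle:
        converted = utf8_to_ansi(handle.read(), ansi_codec)
    with open(target, "wb") as handle:
        handle.write(converted)
```

Failing loudly on unrepresentable characters is a deliberate choice here: Notepad warns in that situation too, and a silent "?" in an export is hard to spot later.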

FYI: From my understanding I think "Unicode" (as listed in Notepad) is a misnomer for UTF-16.
More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicode

Update (06/14/2023):

Updated for the newer Notepad and Notepad++.

Notepad (Windows 10 & 11):
   The encoding is shown in the bottom-right corner and in the "Save As..." dialog box.

Notepad++:
   The encoding is shown in the bottom-right corner and under the "Encoding" menu item.
   Far more encoding options are available in Notepad++, should you need them.

Other (Mac/Linux/Win) Options:

I hear that Windows 11 improved performance so that large (100+ MB) files open much faster.
On the web I've read that Notepad++ is still the all-around large-file editor champion.
However, for those on Mac or Linux, here are some other contenders I found:
1. Sublime Text
2. Visual Studio Code

冬天旳寂寞 2024-09-26 03:37:43

If you have "git" or "Cygwin" on your Windows Machine, then go to the folder where your file is present and execute the command:

file *

This will give you the encoding details of all the files in that folder.

十级心震 2024-09-26 03:37:43

The (Linux) command-line tool 'file' is available on Windows via GnuWin32:

http://gnuwin32.sourceforge.net/packages/file.htm

If you have git installed, it's located in C:\Program Files\git\usr\bin.

Example:

    C:\Users\SH\Downloads\SquareRoot>file *
    _UpgradeReport_Files;         directory
    Debug;                        directory
    duration.h;                   ASCII C++ program text, with CRLF line terminators
    ipch;                         directory
    main.cpp;                     ASCII C program text, with CRLF line terminators
    Precision.txt;                ASCII text, with CRLF line terminators
    Release;                      directory
    Speed.txt;                    ASCII text, with CRLF line terminators
    SquareRoot.sdf;               data
    SquareRoot.sln;               UTF-8 Unicode (with BOM) text, with CRLF line terminators
    SquareRoot.sln.docstates.suo; PCX ver. 2.5 image data
    SquareRoot.suo;               CDF V2 Document, corrupt: Cannot read summary info
    SquareRoot.vcproj;            XML  document text
    SquareRoot.vcxproj;           XML document text
    SquareRoot.vcxproj.filters;   XML document text
    SquareRoot.vcxproj.user;      XML document text
    squarerootmethods.h;          ASCII C program text, with CRLF line terminators
    UpgradeLog.XML;               XML  document text

    C:\Users\SH\Downloads\SquareRoot>file --mime-encoding *
    _UpgradeReport_Files;         binary
    Debug;                        binary
    duration.h;                   us-ascii
    ipch;                         binary
    main.cpp;                     us-ascii
    Precision.txt;                us-ascii
    Release;                      binary
    Speed.txt;                    us-ascii
    SquareRoot.sdf;               binary
    SquareRoot.sln;               utf-8
    SquareRoot.sln.docstates.suo; binary
    SquareRoot.suo;               CDF V2 Document, corrupt: Cannot read summary infobinary
    SquareRoot.vcproj;            us-ascii
    SquareRoot.vcxproj;           utf-8
    SquareRoot.vcxproj.filters;   utf-8
    SquareRoot.vcxproj.user;      utf-8
    squarerootmethods.h;          us-ascii
    UpgradeLog.XML;               us-ascii
白云不回头 2024-09-26 03:37:43

Install Git (on Windows you have to use the Git Bash console). Type:

file --mime-encoding *

for all files in the current directory, or

file --mime-encoding */*

for the files in the immediate subdirectories.

攒眉千度 2024-09-26 03:37:43

Another tool that I found useful can be found here: https://codeplexarchive.org/project/EncodingCheckerEXE
