Can I make git recognize a UTF-16 file as text?

Posted on 2024-07-17 08:13:20

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.

Can git be taught to recognize that this file is text and handle it appropriately?

I'm using git under Cygwin, with core.autocrlf set to false. I could use mSysGit or git under UNIX, if necessary.

Comments (10)

々眼睛长脚气 2024-07-24 08:13:20

I've been struggling with this problem for a while, and just discovered (for me) a perfect solution:

$ git config --global diff.tool vimdiff      # or merge.tool to get merging too!
$ git difftool commit1 commit2

git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of git's built-in diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.

Find "difftool" too long to type? No problem:

$ git config --global alias.dt difftool
$ git dt commit1 commit2

Git rocks.
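To try this without touching the global configuration, the same settings can be scoped to a single repository. The sketch below assumes a throwaway repository created with mktemp (purely illustrative); difftool.prompt is a standard git setting that suppresses the per-file confirmation prompt.

```shell
#!/bin/sh
# Sketch: configure difftool per repository instead of globally.
# The temporary repository exists only for illustration.
repo=$(mktemp -d)
git init -q "$repo"

# Same setting as above, but stored in this repository's .git/config.
git -C "$repo" config diff.tool vimdiff
# Do not ask for confirmation before launching the tool for each file.
git -C "$repo" config difftool.prompt false
# Optional short alias, as in the answer above.
git -C "$repo" config alias.dt difftool

tool=$(git -C "$repo" config --get diff.tool)
prompt=$(git -C "$repo" config --get difftool.prompt)
echo "diff.tool=$tool difftool.prompt=$prompt"
```

Dropping --global keeps the choice of tool local to repositories that actually contain UTF-16 files.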

五里雾 2024-07-24 08:13:20

There is a very simple solution that works out of the box on Unices.

For example, with Apple's .strings files just:

  1. Create a .gitattributes file in the root of your repository with:

     *.strings diff=localizablestrings
    
  2. Add the following to your ~/.gitconfig file:

     [diff "localizablestrings"]
     textconv = "iconv -f utf-16 -t utf-8"
    

Source: Diff .strings files in Git (and older post from 2010).
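The two steps above can be tried end to end in a scratch repository. This is a minimal sketch, assuming git and iconv are on the PATH; the file name Sample.strings, the commit identity, and the choice to store the driver in the repository's own config (rather than ~/.gitconfig) are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: textconv-based diff of a UTF-16 file in a throwaway repository.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Attribute and driver, as in the two steps above (driver kept repo-local here).
echo '*.strings diff=localizablestrings' > .gitattributes
git config diff.localizablestrings.textconv 'iconv -f utf-16 -t utf-8'

# Create a UTF-16 file, commit it, then change it.
printf '"greeting" = "hello";\n' | iconv -f utf-8 -t utf-16 > Sample.strings
git add .gitattributes Sample.strings
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'add strings'
printf '"greeting" = "goodbye";\n' | iconv -f utf-8 -t utf-16 > Sample.strings

# With the driver in place, git diff shows readable -/+ lines.
diff_output=$(git diff -- Sample.strings)
printf '%s\n' "$diff_output"
```

Note that textconv only affects how the diff is displayed; the file stays UTF-16 in the repository and in the working tree.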

小帐篷 2024-07-24 08:13:20

Have you tried setting your .gitattributes to treat it as a text file?

e.g.:

*.vmc diff

More details at http://www.git-scm.com/docs/gitattributes.html.
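Whether the pattern actually matches can be checked with git check-attr, which reports the attributes git applies to a path. A small sketch, assuming a scratch repository and a hypothetical file name machine.vmc:

```shell
#!/bin/sh
# Sketch: confirm that the diff attribute is applied to *.vmc paths.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

echo '*.vmc diff' > .gitattributes

# check-attr works on path names; the file itself does not need to exist.
attr_line=$(git check-attr diff -- machine.vmc)
echo "$attr_line"
```

A result of "set" only means git will diff the file as text; the output for a UTF-16 file may still look garbled in a terminal.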

岁月流歌 2024-07-24 08:13:20

By default, it looks like git won't work well with UTF-16; for such a file you have to make sure that no CRLF processing is done on it, but you want diff and merge to work as a normal text file (this is ignoring whether or not your terminal/editor can handle UTF-16).

But looking at the .gitattributes manpage, here is the custom attribute that is binary:

[attr]binary -diff -crlf

So it seems to me that you could define a custom attribute in your top level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):

[attr]utf16 diff merge -crlf

From there you would be able to specify in any .gitattributes file something like:

*.vmc utf16

Also note that you should still be able to diff a file, even if git thinks it's binary with:

git diff --text

Edit

This answer basically says that GNU diff with UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.

But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:

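#!/bin/bash
# git calls an external diff with: path old-file old-hex old-mode new-file new-hex new-mode
diff <(iconv -f utf-16 -t utf-8 "$2") <(iconv -f utf-16 -t utf-8 "$5")
exit 0  # diff returns 1 when the files differ; git expects 0 from an external diff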

Note that converting to UTF-8 might work for merging as well, you just have to make sure it's done in both directions.
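The convert-then-diff idea can also be tried outside git. The sketch below is a variant of the script above that avoids bash process substitution by writing temporary UTF-8 copies, so it should also run under plain sh; the file names and contents are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare two UTF-16 files by converting both to UTF-8 first.
dir=$(mktemp -d)
printf 'name=alpha\nsize=1\n' | iconv -f utf-8 -t utf-16 > "$dir/old.vmc"
printf 'name=alpha\nsize=2\n' | iconv -f utf-8 -t utf-16 > "$dir/new.vmc"

# Temporary UTF-8 copies stand in for the <(...) process substitutions.
iconv -f utf-16 -t utf-8 "$dir/old.vmc" > "$dir/old.utf8"
iconv -f utf-16 -t utf-8 "$dir/new.vmc" > "$dir/new.utf8"

# diff exits with status 1 when the inputs differ, so record that explicitly.
if diff_output=$(diff "$dir/old.utf8" "$dir/new.utf8"); then
    diff_status=0
else
    diff_status=$?
fi
printf '%s\n' "$diff_output"
```

Because the comparison happens on the UTF-8 copies, the output is plain text that any terminal can display.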

As for the output to the terminal when looking at a diff of a UTF-16 file:

Trying to diff like that results in
binary garbage spewed to the screen.
If git is using GNU diff, it would
seem that GNU diff is not
unicode-aware.

GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).

那支青花 2024-07-24 08:13:20

git recently has begun to understand encodings such as utf16.
See gitattributes docs, search for working-tree-encoding

[Make sure your man page matches since this is quite new!]

If (say) the file is UTF-16 without BOM on Windows machine then add to your .gitattributes file

*.vmc text working-tree-encoding=UTF-16LE eol=crlf

If UTF-16 (with bom) on *nix make it:

*.vmc text working-tree-encoding=UTF-16-BOM eol=lf

(Replace *.vmc with *.whatever for whatever type files you need to handle)

See: Support working-tree-encoding "UTF-16LE-BOM".
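What the attribute does can be observed in a scratch repository. This is a minimal sketch under a few assumptions: a git new enough to know working-tree-encoding (it appeared around version 2.18), a build with iconv support, and a hypothetical file sample.vmc that is UTF-16 with a BOM, for which the gitattributes documentation uses the plain encoding name UTF-16.

```shell
#!/bin/sh
# Sketch: store a UTF-16 working-tree file as UTF-8 inside the repository.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

echo '*.vmc text working-tree-encoding=UTF-16 eol=lf' > .gitattributes

# "hi" plus newline as UTF-16LE with a leading BOM (octal escapes for portability).
printf '\377\376h\000i\000\n\000' > sample.vmc
git add .gitattributes sample.vmc

# The staged blob is UTF-8, so it diffs like any other text file.
staged=$(git cat-file -p :sample.vmc)
echo "staged content: $staged"
```

The file in the working tree keeps its UTF-16 encoding; only the copy stored by git is re-encoded.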


Added later

Following @Hackslash, one may find that this is insufficient

 *.vmc text working-tree... 

To get nice text-diffs you need

 *.vmc diff working-tree...

Putting both works as well

 *.vmc text diff working-tree... 

But it's arguably

  • Redundant — eol=... implies text
  • Verbose — a large project could easily have dozens of different text file types

The Problem

Git has a macro-attribute binary which means -text -diff. The opposite +text +diff is not available built-in but git gives the tools (I think!) for synthesizing it

The solution

Git allows one to define new macro attributes.

I'd propose that at the top of the .gitattributes file you have

 [attr]textfile text diff

Then for all paths that need to be text and diff do

 path textfile working-tree-encoding= eol=...

Note that in most cases we would want the default encoding (utf-8) and the default eol (native), so those two settings can be dropped.

Most lines should look like

*.c textfile
*.py textfile
Etc
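Since the author could not test the macro, here is a hedged way to inspect how it expands, using git check-attr in a scratch repository. The file names main.c and disk.vmc and the particular encoding and eol values are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: define the textfile macro and inspect how it expands.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Macro definitions are only honored in top-level attribute files.
cat > .gitattributes <<'EOF'
[attr]textfile text diff
*.c textfile
*.vmc textfile working-tree-encoding=UTF-16LE eol=crlf
EOF

# Ask git which attributes end up applied to each path.
c_attrs=$(git check-attr text diff -- main.c)
vmc_attrs=$(git check-attr text diff eol working-tree-encoding -- disk.vmc)
printf '%s\n' "$c_attrs" "$vmc_attrs"
```

If both text and diff are reported as set for a path, the macro is doing what the answer intends.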

Why not just use diff?

Practical: In most cases we want native eol. Which means no eol=... . So text won't get implied and needs to be put explicitly.

Conceptual: Text Vs binary is the fundamental distinction. eol, encoding, diff etc are just some aspects of it.

Disclaimer

Due to the bizarre times we are living in I don't have a machine with a current working git. So I'm unable at the moment to check the latest addition. If someone finds something wrong, I'll emend/remove.

空袭的梦i 2024-07-24 08:13:20

Solution is to filter through cmd.exe /c "type %1". cmd's type builtin will do the conversion, and so you can use that with the textconv ability of git diff to enable text diffing of UTF-16 files (should work with UTF-8 as well, although untested).

Quoting from gitattributes man page:


Performing text diffs of binary files

Sometimes it is desirable to see the diff of a text-converted version of some binary files. For example, a word processor document can be converted to an ASCII text representation, and the diff of the text shown. Even though this conversion loses some information, the resulting diff is useful for human viewing (but cannot be applied directly).

The textconv config option is used to define a program for performing such a conversion. The program should take a single argument, the name of a file to convert, and produce the resulting text on stdout.

For example, to show the diff of the exif information of a file instead of the binary information (assuming you have the exif tool installed), add the following section to your $GIT_DIR/config file (or $HOME/.gitconfig file):

[diff "jpg"]
        textconv = exif

This is a solution for mingw32; cygwin fans may have to alter the approach. The issue is with passing the filename to convert to cmd.exe: it will be using forward slashes, and cmd assumes backslash directory separators.

Step 1:

Create the single argument script that will do the conversion to stdout. c:\path\to\some\script.sh:

#!/bin/bash
SED='s/\//\\\\\\\\/g'
FILE=$(echo "$1" | sed -e "$SED")
cmd.exe /c "type $FILE"

Step 2:

Set up git to be able to use the script file. Inside your git config (~/.gitconfig or .git/config or see man git-config), put this:

[diff "cmdtype"]
textconv = c:/path/to/some/script.sh

Step 3:

Point out the files to apply this workaround to by using .gitattributes files (see man gitattributes(5)):

*vmc diff=cmdtype

then use git diff on your files.

旧人 2024-07-24 08:13:20

I have written a small git-diff driver, to-utf8, which should make it easy to diff any non-ASCII/UTF-8 encoded files. You can install it using the instructions here: https://github.com/chaitanyagupta/gitutils#to-utf8 (the to-utf8 script is available in the same repo).

Note that this script requires both file and iconv commands to be available on the system.

地狱即天堂 2024-07-24 08:13:20

Had this problem on Windows recently, and the dos2unix and unix2dos bins that ship with git for windows did the trick. By default they're located in C:\Program Files\Git\usr\bin\. Note that this will only work if your file doesn't need to be UTF-16. For example, someone accidentally encoded a python file as UTF-16 when it didn't need to be (in my case).

PS C:\Users\xxx> dos2unix my_file.py
dos2unix: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 Unix format...

and

PS C:\Users\xxx> unix2dos my_file.py
unix2dos: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 DOS format...

荒人说梦 2024-07-24 08:13:20

As described in other answers, git diff doesn't handle UTF-16 files as text, and this makes them unviewable in Atlassian SourceTree, for example. If the file name or suffix is known, the fix below will make those files viewable and comparable normally under SourceTree.

If the file suffix of the UTF-16 files is known (*.uni for example) then all files with that suffix can be associated with UTF-16 to UTF-8 converter with the following two changes:

  1. Create or modify the .gitattributes file in the root directory of the repository with the following line:

     *.uni diff=utf16
    
  2. Then modify the .gitconfig file in the user's home directory (C:\Users\yourusername\.gitconfig) with the following section:

    [diff "utf16"]
        textconv = "iconv -f utf-16 -t utf-8"
    

These two changes should take effect immediately without reloading the repository into SourceTree. It applies the text conversion to all *.uni files which makes them viewable and comparable like other text files. If other files need this conversion you can add additional lines to the .gitattributes file. (If the designated file(s) are NOT UTF-16 you will get unreadable results for that file.)

Note that this answer is a simplified rewrite of Tony Kuneck's answer.

乖乖兔^ω^ 2024-07-24 08:13:20

The git documentation on gitattributes gives a brief and nice explanation on the encoding topic -

Git recognizes files encoded in ASCII or one of its supersets (e.g.
UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
encodings (e.g. UTF-16) are interpreted as binary and consequently
built-in Git text processing tools (e.g. git diff) as well as most Git
web front ends do not visualize the contents of these files by
default.

However, the working-tree-encoding attribute allows you to tell Git which files should be re-encoded (to UTF-8) before being stored in the repository. They are later "returned" to their original encoding when "copied" to the working directory.

Disclaimer - (Perhaps) everything here has been said in the other answers, and some even gave a lot more detail on how to fix your issue. However, the quote I included made me realize how simple the answer to "Can Git handle encodings other than UTF-8?" is, after browsing for it for hours...
