Can I make git recognize a UTF-16 file as text?
I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.
Can git be taught to recognize that this file is text and handle it appropriately?
I'm using git under Cygwin, with core.autocrlf set to false. I could use mSysGit or git under UNIX, if necessary.
I've been struggling with this problem for a while, and just discovered (for me) a perfect solution:
git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of git's built-in diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.
Find "difftool" too long to type? No problem:
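The commands that followed here did not survive. A minimal sketch of one way to get such a shortcut, assuming a git alias is what was meant: the alias name dt and the choice of vimdiff are assumptions for illustration, and the scratch repository only keeps the example from touching real configuration.

```shell
# Work in a throwaway repository so nothing outside it is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# "dt" is an arbitrary alias name chosen for this sketch.
# Add --global to make the alias available in every repository.
git config alias.dt difftool

# Tell difftool which program to launch; vimdiff is one of git's known tool names.
git config diff.tool vimdiff

# From now on "git dt <args>" behaves like "git difftool <args>".
git config --get alias.dt
```

With this in place, something like git dt HEAD~1 HEAD would open the chosen tool for each changed file.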
Git rocks.
There is a very simple solution that works out of the box on Unices.
For example, with Apple's .strings files:
Create a .gitattributes file in the root of your repository with:
Add the following to your ~/.gitconfig file:
Source: Diff .strings files in Git (and older post from 2010).
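The two snippets this answer refers to are missing from the page. The sketch below shows what such a setup could look like, assuming a textconv diff driver built on iconv. The driver name utf16strings and the file Demo.strings are made up for illustration, and the configuration is written to a scratch repository rather than to ~/.gitconfig.

```shell
# Throwaway repository so the sketch does not touch real configuration.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Route *.strings files through a diff driver; "utf16strings" is a made-up name.
printf '*.strings diff=utf16strings\n' > .gitattributes

# The driver converts each side to UTF-8 before git compares them.
# (Use "git config --global" to put this in ~/.gitconfig instead.)
git config diff.utf16strings.textconv 'iconv -f utf-16 -t utf-8'

# Commit a UTF-16 file, then change it.
printf 'greeting = hello\n' | iconv -f utf-8 -t utf-16 > Demo.strings
git add .gitattributes Demo.strings
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'add strings'
printf 'greeting = goodbye\n' | iconv -f utf-8 -t utf-16 > Demo.strings

# The change is now shown as a text diff instead of "Binary files differ".
git diff
```

A textconv diff is meant for reading only: the conversion is one-way, so the output cannot be applied as a patch.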
Have you tried setting your .gitattributes to treat it as a text file?
e.g.:
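The example line is missing from the page. One plausible form, assuming the plain diff attribute is what was meant, is sketched here in a scratch repository; machine.vmc is a made-up file name.

```shell
# Throwaway repository for the demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Setting the "diff" attribute makes git diff matching paths as text,
# even when their bytes (such as the NULs in UTF-16) look binary.
printf '*.vmc diff\n' > .gitattributes

# Ask git which value it resolves for a hypothetical file name.
git check-attr diff -- machine.vmc
```

This only changes how git classifies the file. The diff output still contains the raw UTF-16 bytes, so whether it is readable depends on the terminal.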
More details at http://www.git-scm.com/docs/gitattributes.html.
By default, it looks like git won't work well with UTF-16; for such a file you have to make sure that no CRLF processing is done on it, but you want diff and merge to work as for a normal text file (this is ignoring whether or not your terminal/editor can handle UTF-16).
But looking at the .gitattributes manpage, here is the custom attribute that is binary:
So it seems to me that you could define a custom attribute in your top-level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):
From there you would be able to specify in any .gitattributes file something like:
Also note that you should still be able to diff a file, even if git thinks it's binary, with:
Edit
This answer basically says that GNU diff with UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.
But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:
Note that converting to UTF-8 might work for merging as well; you just have to make sure it's done in both directions.
As for the output to the terminal when looking at a diff of a UTF-16 file:
GNU diff doesn't really care about Unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).
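The shell script mentioned above is not preserved on this page. As a rough sketch of the idea (convert both sides to UTF-8, then run an ordinary diff), something like the following could serve. The script name utf16-diff.sh and the sample files are made up, and the helper is not the answerer's original.

```shell
work=$(mktemp -d)

# Write the helper; the name utf16-diff.sh is made up for this sketch.
cat > "$work/utf16-diff.sh" <<'EOF'
#!/bin/sh
# git calls an external diff program with seven arguments:
#   path old-file old-hex old-mode new-file new-hex new-mode
# Only the two file names ($2 and $5) are needed here.
old=$(mktemp) && new=$(mktemp) || exit 1
iconv -f utf-16 -t utf-8 "$2" > "$old"
iconv -f utf-16 -t utf-8 "$5" > "$new"
diff -u "$old" "$new"
rm -f "$old" "$new"
# diff exits with 1 when the files differ, which git would report as a
# failed external diff, so always report success.
exit 0
EOF
chmod +x "$work/utf16-diff.sh"

# Try the helper by hand on two small UTF-16 files.
printf 'memory = 256\n' | iconv -f utf-8 -t utf-16 > "$work/old.vmc"
printf 'memory = 512\n' | iconv -f utf-8 -t utf-16 > "$work/new.vmc"
"$work/utf16-diff.sh" demo.vmc "$work/old.vmc" 0 100644 "$work/new.vmc" 0 100644
```

To have git use it, the script's path can be supplied through the GIT_EXTERNAL_DIFF environment variable, for example GIT_EXTERNAL_DIFF=/path/to/utf16-diff.sh git diff. Note that this version converts every file it is handed, so it only suits a repository (or a one-off invocation) where the changed files really are UTF-16.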
Git has recently begun to understand encodings such as UTF-16.
See the gitattributes docs and search for working-tree-encoding. [Make sure your man page matches, since this is quite new!]
If (say) the file is UTF-16 without BOM on a Windows machine, then add to your .gitattributes file:
If UTF-16 (with BOM) on *nix, make it:
(Replace *.vmc with *.whatever for whatever type of files you need to handle.)
See: Support working-tree-encoding "UTF-16LE-BOM".
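The attribute lines themselves are missing from the page. A sketch of what they could look like follows, under the assumption that the Windows case wants UTF-16LE with CRLF endings and the *nix case wants UTF-16 with LF. The exact values are an assumption, machine.vmc is a made-up file name, and the attribute needs a git recent enough to document working-tree-encoding.

```shell
# Throwaway repository for the demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Windows-style file: UTF-16 little-endian without BOM, CRLF line endings.
printf '*.vmc text working-tree-encoding=UTF-16LE eol=crlf\n' > .gitattributes

# For UTF-16 with a BOM on *nix, the line would instead be:
#   *.vmc text working-tree-encoding=UTF-16 eol=lf

# Show how git resolves the three attributes for a hypothetical file.
git check-attr text working-tree-encoding eol -- machine.vmc
```

With working-tree-encoding set, git stores the content as UTF-8 internally and re-encodes it on checkout, which is why ordinary text diffs start working.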
Added later
Following @Hackslash, one may find that this is insufficient.
To get nice text diffs you need:
Putting both works as well.
But arguably eol=... already implies text.
The Problem
Git has a macro attribute binary, which means -text -diff. The opposite, +text +diff, is not available built in, but git gives the tools (I think!) for synthesizing it.
The solution
Git allows one to define new macro attributes.
I'd propose that at the top of the .gitattributes file you have:
Then for all paths that need to be text and diff, do:
Note that in most cases we would want the default encoding (UTF-8) and default eol (native), so those may be dropped.
Most lines should look like:
Why not just use diff?
Practical: In most cases we want native eol, which means no eol=.... So text won't get implied and needs to be put explicitly.
Conceptual: Text vs. binary is the fundamental distinction. eol, encoding, diff, etc. are just some aspects of it.
Disclaimer
Due to the bizarre times we are living in, I don't have a machine with a current working git, so I'm unable at the moment to check the latest addition. If someone finds something wrong, I'll emend/remove.
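The macro definition and the lines that use it are missing from the page. A minimal sketch of the idea, assuming a macro that simply bundles text and diff: the macro name plaintext and the file names are made up, and this is not necessarily the exact definition the answer proposed.

```shell
# Throwaway repository for the demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Macros must be defined in the top-level .gitattributes
# (or in $GIT_DIR/info/attributes); "plaintext" is a made-up name.
cat > .gitattributes <<'EOF'
[attr]plaintext text diff
*.vmc plaintext working-tree-encoding=UTF-16LE eol=crlf
*.txt plaintext
EOF

# The macro expands to both attributes for any matching path.
git check-attr text diff -- notes.txt
git check-attr text diff working-tree-encoding -- machine.vmc
```

Most paths would use the short form on the *.txt line; the encoding and eol values are added only for files that deviate from the defaults.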
The solution is to filter through cmd.exe /c "type %1". cmd's type builtin will do the conversion, so you can use it with the textconv ability of git diff to enable text diffing of UTF-16 files (it should work with UTF-8 as well, although untested).
Quoting from the gitattributes man page:
Performing text diffs of binary files
Sometimes it is desirable to see the diff of a text-converted version of some binary files. For example, a word processor document can be converted to an ASCII text representation, and the diff of the text shown. Even though this conversion loses some information, the resulting diff is useful for human viewing (but cannot be applied directly).
The textconv config option is used to define a program for performing such a conversion. The program should take a single argument, the name of a file to convert, and produce the resulting text on stdout.
For example, to show the diff of the exif information of a file instead of the binary information (assuming you have the exif tool installed), add the following section to your $GIT_DIR/config file (or $HOME/.gitconfig file):
A solution for mingw32; cygwin fans may have to alter the approach. The issue is with passing the filename to convert to cmd.exe: it will be using forward slashes, and cmd assumes backslash directory separators.
Step 1:
Create the single-argument script that will do the conversion to stdout. c:\path\to\some\script.sh:
Step 2:
Set up git to be able to use the script file. Inside your git config (~/.gitconfig or .git/config, or see man git-config), put this:
Step 3:
Point out files to apply this workaround to by utilizing .gitattributes files (see man gitattributes(5)):
then use git diff on your files.
I have written a small git-diff driver, to-utf8, which should make it easy to diff any non-ASCII/UTF-8 encoded files. You can install it using the instructions here: https://github.com/chaitanyagupta/gitutils#to-utf8 (the to-utf8 script is available in the same repo).
Note that this script requires both the file and iconv commands to be available on the system.
Had this problem on Windows recently, and the dos2unix and unix2dos binaries that ship with Git for Windows did the trick. By default they're located in C:\Program Files\Git\usr\bin\. Note that this will only work if your file doesn't need to be UTF-16. For example, someone accidentally encoded a Python file as UTF-16 when it didn't need to be (in my case).
As described in other answers, git diff doesn't handle UTF-16 files as text, which makes them unviewable in Atlassian SourceTree, for example. If the file name or suffix is known, the fix below will make those files viewable and comparable normally under SourceTree.
If the file suffix of the UTF-16 files is known (*.uni, for example), then all files with that suffix can be associated with a UTF-16 to UTF-8 converter with the following two changes:
Create or modify the .gitattributes file in the root directory of the repository with the following line:
    *.uni diff=utf16
Then modify the .gitconfig file in the user's home directory (C:\Users\yourusername\.gitconfig) with the following section:
    [diff "utf16"]
        textconv = "iconv -f utf-16 -t utf-8"
These two changes should take effect immediately without reloading the repository into SourceTree. They apply the text conversion to all *.uni files, which makes them viewable and comparable like other text files. If other files need this conversion, you can add additional lines to the .gitattributes file. (If the designated files are NOT UTF-16, you will get unreadable results for them.)
Note that this answer is a simplified rewrite of Tony Kuneck's answer.
The git documentation on gitattributes gives a brief and nice explanation of the encoding topic:
However, the working-tree-encoding attribute allows you to tell Git which files should be re-encoded (to UTF-8) before being stored in the repository. They are later "returned" to their original encoding when "copied" to the working directory.
Disclaimer: (Perhaps) everything here has been said in the other answers, and some even gave a lot more detail on how to fix your issue. However, the quote I included made me realize how simple the answer to "Can Git handle encodings other than UTF-8?" is, after browsing for it for hours...