How can I compare UTF-16 files with GNU diff?

Posted on 2024-07-18 20:41:00

GNU diff doesn't seem to be smart enough to detect and handle UTF-16 files, which surprises me. Am I missing an obvious command-line option? Is there a good alternative?

Comments (7)

梦途 2024-07-25 20:41:00

vimdiff works quite nicely for this purpose.

I found it while reading this StackOverflow answer.

清君侧 2024-07-25 20:41:00

From the GNU diff documentation:

Handling Multibyte and Varying-Width Characters

diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.

Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.

These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

The IBM GNU/Linux Technology Center Internationalization Team has proposed some patches to support internationalized diff http://oss.software.ibm.com/developer/opensource/linux/patches/i18n/diffutils-2.7.2-i18n-0.1.patch.gz. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.

I never realized that myself.
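
A small sketch of why byte-oriented handling hits UTF-16 especially hard: in UTF-16 every ASCII character, including the newline, carries an extra NUL byte, and GNU diff typically classifies input that contains NUL bytes as binary rather than text. The sample string is made up for illustration.

```python
# Sample text is hypothetical; any mostly-ASCII content behaves the same way.
text = "hello\n"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

print(utf8_bytes)   # one byte per ASCII character
print(utf16_bytes)  # each ASCII character is followed by a NUL byte

# A byte-oriented tool sees NUL bytes and a two-byte newline (0x0A 0x00).
nul_count_utf8 = utf8_bytes.count(b"\x00")
nul_count_utf16 = utf16_bytes.count(b"\x00")
print(nul_count_utf8, nul_count_utf16)
```

This is why converting to UTF-8 before calling diff, or using a tool that decodes the text first, sidesteps the problem.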

It looks like Guiffy could do the job if a non-free, non-command-line tool is acceptable; I'm still looking for a free command-line tool:

http://www.guiffy.com/Diff-Tool.html

伴梦长久 2024-07-25 20:41:00

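Install ripgrep, which supports UTF-16, then run the following in bash (or another shell with process substitution). -N suppresses line numbers, and the pattern . matches every non-empty line, so blank lines are left out of the comparison: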

diff <(rg -N . file1.txt) <(rg -N . file2.txt)

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)

雪若未夕 2024-07-25 20:41:00

GNU diff produces malformed patches when the files contain accented or other special characters:

 $ diff --version
 diff (GNU diffutils) 3.6
 $ diff -Naur old_foo new_foo > foo.patch

git diff --no-index handles accented and special characters correctly, whether or not the compared files or directories are inside a git repository:

 $ git --version
 git version 2.17.1
 $ git diff --no-index old_foo new_foo > foo.patch

紅太極 2024-07-25 20:41:00

You could build something in Python with the excellent chardet library to detect each file's encoding, convert the files to UTF-8, and pass the result to GNU diff.

http://chardet.feedparser.org/
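
A minimal sketch of that convert-then-diff idea, using only the standard library. It does not use chardet: as a simplification it sniffs the byte order mark and otherwise assumes a fallback codec, so BOM-less UTF-16 would need the encoding passed explicitly. The helper names `to_utf8` and `diff_as_utf8` are hypothetical, and a `diff` executable is assumed to be on the PATH.

```python
import codecs
import subprocess
import tempfile
from pathlib import Path

# Checked in this order because the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def to_utf8(data: bytes, fallback: str = "utf-8") -> bytes:
    """Decode bytes using their BOM (or the fallback codec) and re-encode as UTF-8."""
    encoding = next((enc for bom, enc in _BOMS if data.startswith(bom)), fallback)
    return data.decode(encoding).encode("utf-8")


def diff_as_utf8(path_a: str, path_b: str) -> str:
    """Run diff -u on UTF-8 copies of the two files and return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        copies = []
        for name, path in (("a", path_a), ("b", path_b)):
            copy = Path(tmp) / name
            copy.write_bytes(to_utf8(Path(path).read_bytes()))
            copies.append(str(copy))
        # --label shows the original paths instead of the temporary ones.
        cmd = ["diff", "-u", "--label", path_a, "--label", path_b, *copies]
        # diff exits with 1 when the files differ, so check=True is not used.
        result = subprocess.run(cmd, capture_output=True)
    if result.returncode > 1:  # 2 signals trouble, e.g. an unreadable file
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

Calling `print(diff_as_utf8("file1.txt", "file2.txt"))` would then print an ordinary unified diff in UTF-8.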

彼岸花似海 2024-07-25 20:41:00

In Python, you can use difflib.HtmlDiff to create an HTML table that shows the differences between two sequences of lines, and it seems to work fine with Unicode strings (provided, of course, you read and write them with the appropriate codecs).

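>>> import difflib
>>> from pathlib import Path
>>> a = Path('file1').read_text(encoding='utf-16').splitlines(keepends=True)
>>> b = Path('file2').read_text(encoding='utf-16').splitlines(keepends=True)
>>> htmldiff = difflib.HtmlDiff().make_file(a, b)
>>> _ = Path('diff.html').write_text(htmldiff, encoding='utf-8')  # matches make_file's default charset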
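
For plain-text output closer to what diff -u prints, the same decode-first approach can be sketched with difflib.unified_diff. The function name `unified_diff_utf16` is hypothetical, and both files are assumed to use the same encoding.

```python
import difflib


def unified_diff_utf16(path_a: str, path_b: str, encoding: str = "utf-16") -> str:
    """Return a unified diff of two text files decoded with the given codec."""
    # newline="" keeps the original line endings so CRLF/LF changes show up.
    with open(path_a, encoding=encoding, newline="") as fa:
        lines_a = fa.readlines()
    with open(path_b, encoding=encoding, newline="") as fb:
        lines_b = fb.readlines()
    return "".join(
        difflib.unified_diff(lines_a, lines_b, fromfile=path_a, tofile=path_b)
    )
```

The result is an ordinary Python string, so it can be printed to a UTF-8 terminal or written to a file with whatever encoding is convenient. Identical inputs give an empty string.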

嗼ふ静 2024-07-25 20:41:00

Meld is a free, open-source diff tool that supports UTF-16 and is supported by the GNOME project.
