Getting a readable diff display on Unicode files in Mercurial (MS Windows)

Posted on 2024-09-04 17:42:17

I'm trying to store some Windows PowerShell scripts in a Mercurial repository. It seems the PowerShell editor likes to save files as UTF-16 Unicode. This means that there are lots of \0 bytes, which is what Mercurial uses to distinguish between "text" and "binary" files. I understand that this makes no difference to how Mercurial stores the data, but it does mean that it displays binary diffs, which are kind of hard to read. Is there a way to tell Mercurial that these really are text files? Presumably I would need to convince Mercurial to use an external Unicode-aware diff program for particular file types.
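To illustrate the problem, here is a minimal Python sketch of why a UTF-16 script trips a NUL-byte check. The `looks_binary` helper and the sample script line are made up for the illustration; the helper is a simplification of the heuristic described in the question, not Mercurial's actual code.

```python
import codecs


def looks_binary(data: bytes) -> bool:
    """Simplified NUL-byte heuristic (hypothetical helper, not Mercurial's code)."""
    return b"\0" in data


# Hypothetical one-line PowerShell script.
script = "Write-Host 'hello'\r\n"

as_utf8 = script.encode("utf-8")
# UTF-16LE with a BOM: every ASCII character becomes two bytes, the second one NUL.
as_utf16 = codecs.BOM_UTF16_LE + script.encode("utf-16-le")

print(looks_binary(as_utf8))   # False
print(looks_binary(as_utf16))  # True
```

The same text is "text" as UTF-8 and "binary" as UTF-16, purely because of the zero high bytes.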

Comments (3)

冷了相思 2024-09-11 17:42:17

This may not be relevant to you; read the last paragraph if it doesn't sound like it is.

I'm not sure whether this is what you need, but I've needed real diffs of UTF-16LE content rather than just "binary files are different". When I searched for this some months ago I found a thread and a bug discussing it; here's part of it. I can't find the original source of this mini-extension now (though it does just what that patch does), but what I ended up with was an extension, BOM.py:

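#!/usr/bin/env python
# BOM.py: treat files that start with a Unicode BOM as text.

import codecs

from mercurial import util

# Byte order marks that identify a file as Unicode text.
boms = [
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE,
]

def binary(s):
    """Return True if s looks binary: it contains a NUL byte and has no leading BOM."""
    if s:
        for bom in boms:
            if s.startswith(bom):
                return False
        # b'\0' equals '\0' on Python 2 and also works on bytes under Python 3.
        return b'\0' in s
    return False


def reposetup(ui, repo):
    # Swap Mercurial's binary-detection helper for the BOM-aware version.
    util.binary = binary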

This gets loaded in the .hgrc (or your users\username\mercurial.ini) like this:

[extensions]
bom = ~/.hgexts/BOM.py

Note the path will vary between Windows and Linux; on my Windows copy I put the path as \...\whatever (it's on a USB disk where the drive letter can change). Unfortunately relative paths are taken relative to the current working directory rather than the repository root or any such thing, but if you are saving it on your C: drive, you can just put the full path.

In Linux (my main development environment), this works well; in Command Prompt (which I still use regularly), it generally works well. I've never tried it in PowerShell, but I would expect it to be better than Command Prompt in its support for arbitrary null bytes in the command line.
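The BOM check from BOM.py can be tried on its own, without Mercurial installed. The following is a standalone sketch of the same logic; the sample byte strings are invented for the illustration.

```python
import codecs

BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE,
)


def binary(s: bytes) -> bool:
    """Same logic as BOM.py: a leading BOM means text, otherwise look for a NUL byte."""
    if not s:
        return False
    if s.startswith(BOMS):  # bytes.startswith accepts a tuple of prefixes
        return False
    return b"\0" in s


# Hypothetical sample inputs.
utf16_script = codecs.BOM_UTF16_LE + "Get-ChildItem\r\n".encode("utf-16-le")

print(binary(utf16_script))        # False: BOM present, treated as text
print(binary(b"\x89PNG\x00\x00"))  # True: NUL bytes and no BOM
print(binary(b"plain ascii\n"))    # False: no NUL bytes
print(binary(utf16_script[2:]))    # True: UTF-16 without a BOM still looks binary
```

The last case shows the limit of this approach: it only helps for files that actually start with a BOM.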

I'm not sure if this is what you want at all. From the way you said "binary diffs", I suspect you may already either have this or be running hg diff -a, which achieves the same thing. In that case, all I can think of is writing another extension which takes the UTF-16LE content and attempts to decode it to UTF-8. I'm not sure of the syntax for such an extension, but I might try that out.

Edit: having now trawled the mercurial source through commands.py, cmdutil.py, patch.py and mdiff.py, I see that binary diffs are done with a base85 encoding (patch.b85diff) rather than the normal diff. I wasn't aware of that, I thought it just forced it to diff it. In that case, perhaps this text is relevant after all. I await a response to see if it is!
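To give a feel for why base85 output is hard to read, here is a small sketch using `base64.b85encode` from Python's standard library, which the Python documentation describes as the variant used in git-style binary diffs. It does not reproduce what Mercurial's patch code does around the encoding (any compression or line framing is left out), and the sample script line is made up.

```python
import base64
import codecs

# Hypothetical UTF-16LE script content, as it would sit on disk.
old = codecs.BOM_UTF16_LE + "Write-Host 'one'\r\n".encode("utf-16-le")

# Base85 turns every 4 input bytes into 5 printable characters.
encoded = base64.b85encode(old)
print(encoded)

# The encoding is lossless, but the original text is no longer recognizable.
decoded = base64.b85decode(encoded)
print(decoded == old)  # True
```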

千寻… 2024-09-11 17:42:17

I have worked around this by creating a new file with NotePad++ and saving it as a PowerShell file (.ps1 extension). NotePad++ will create the file as a plain text ANSI file. Once created I can open the file in the PowerShell editor and make any changes as necessary without the editor modifying the file encoding.

Disclaimer: I encountered this just moments ago and so I am not sure if there are any repercussions but so far my scripts appear to work as normal and my diffs are showing up nicely.
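For scripts that already exist as UTF-16, the same effect could be had by re-encoding them once. Below is a rough sketch under some assumptions: `reencode_utf16_file` is a hypothetical helper, the file must start with a UTF-16 BOM, and the default target of UTF-8 gives the same bytes as ANSI only for ASCII-only scripts. Scripts with non-ASCII characters need more care in choosing the target encoding.

```python
import codecs
import os
import tempfile


def reencode_utf16_file(path, target_encoding="utf-8"):
    """Rewrite a BOM-marked UTF-16 file in target_encoding (hypothetical helper).

    Returns True if the file was rewritten, False if it was left untouched.
    """
    with open(path, "rb") as f:
        raw = f.read()
    # The UTF-32LE BOM begins with the UTF-16LE BOM, so rule it out first.
    if raw.startswith(codecs.BOM_UTF32_LE):
        return False
    if not raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False
    text = raw.decode("utf-16")  # picks byte order from the BOM and drops it
    # Binary mode keeps the existing line endings (e.g. CRLF) as they are.
    with open(path, "wb") as f:
        f.write(text.encode(target_encoding))
    return True


# Demo on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.ps1")
with open(demo_path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "Write-Host 'hi'\r\n".encode("utf-16-le"))

converted = reencode_utf16_file(demo_path)
with open(demo_path, "rb") as f:
    result = f.read()
print(converted, result)
```

Running it a second time on the same file leaves it alone, since the BOM is gone after the first pass.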

鲜血染红嫁衣 2024-09-11 17:42:17

If my other answer does not do what you want, I think this one may; although I haven't tested it on Windows at all yet, it's working well in Linux. It does something potentially nasty: it wraps mercurial.mdiff.unidiff with a new function which converts UTF-16LE to UTF-8. This will not affect hg st, but will affect hg diff. One potential pitfall is that the BOM will also be changed from the UTF-16LE BOM to the UTF-8 BOM.

Anyway, I think it may be useful to you, so here it is.

Extension file utf16decodediff.py:

import codecs
from mercurial import mdiff

unidiff = mdiff.unidiff

def new_unidiff(a, ad, b, bd, fn1, fn2, r=None, opts=mdiff.defaultopts):
    """
    A simple wrapper around mercurial.mdiff.unidiff which first decodes
    UTF-16LE text.
    """

    if a.startswith(codecs.BOM_UTF16_LE):
        try:
            # Gets reencoded as utf-8 to be a str rather than a unicode; some
            # extensions may expect a str and may break if it's wrong.
            a = a.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass

    if b.startswith(codecs.BOM_UTF16_LE):
        try:
            b = b.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass

    return unidiff(a, ad, b, bd, fn1, fn2, r, opts)

mdiff.unidiff = new_unidiff

In .hgrc:

[extensions]
utf16decodediff = ~/.hgexts/utf16decodediff.py

(Or equivalent paths.)
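The decoding step can be tried outside Mercurial. The sketch below uses the standard library's difflib in place of mdiff.unidiff, so it only shows the idea and not Mercurial's own diff output. The helper names and sample scripts are invented for the illustration.

```python
import codecs
import difflib


def decode_if_utf16le(data: bytes) -> bytes:
    """Mirror the wrapper's step: BOM-marked UTF-16LE becomes UTF-8, else unchanged."""
    if data.startswith(codecs.BOM_UTF16_LE):
        try:
            return data.decode("utf-16le").encode("utf-8")
        except UnicodeDecodeError:
            pass
    return data


def utf16le(text: str) -> bytes:
    """Build BOM-prefixed UTF-16LE bytes for the demo (hypothetical helper)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


old = decode_if_utf16le(utf16le("Write-Host 'one'\nWrite-Host 'two'\n"))
new = decode_if_utf16le(utf16le("Write-Host 'one'\nWrite-Host 'three'\n"))

# The pitfall mentioned above: the BOM survives, now in its UTF-8 form.
print(old.startswith(codecs.BOM_UTF8))  # True

diff = list(difflib.unified_diff(
    old.decode("utf-8").splitlines(keepends=True),
    new.decode("utf-8").splitlines(keepends=True),
    fromfile="a/script.ps1", tofile="b/script.ps1",
))
print("".join(diff))
```

Once both sides are decoded, the diff is an ordinary line-based one, with the changed line shown as a removal and an addition.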
