How do I make sure all my source files stay UTF-8 with Unix line endings?

Posted 2024-12-28 05:59:15


I'm looking for some command-line tools for Linux that can help me detect and convert files from character sets like iso-8859-1 and windows-1252 to utf-8 and from Windows line endings to Unix line endings.

The reason I need this is that I'm working on projects on Linux servers via SFTP, using editors on Windows (like Sublime Text) that constantly screw these things up. Right now I'm guessing about half my files are utf-8 and the rest are iso-8859-1 or windows-1252, as it seems Sublime Text just picks the character set based on which symbols the file contains when I save it. The line endings are ALWAYS Windows line endings, even though I've specified in the options that the default line ending is LF, so about half of my files have LF and half have CRLF.

So at minimum I need a tool that recursively scans my project folder and alerts me to files that deviate from UTF-8 with LF line endings, so I can fix them manually before I commit my changes to Git.
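A scan like that can be scripted around file and grep; here is a minimal sketch, not a polished tool. It assumes a file binary that supports -b --mime-encoding (GNU file does) and treats any CR byte as a symptom of CRLF endings:

```shell
#!/bin/sh
# check_file: warn when a file is not ASCII/UTF-8 or contains CR bytes.
check_file() {
    enc=$(file -b --mime-encoding "$1")
    case "$enc" in
        us-ascii|utf-8) ;;  # us-ascii is a strict subset of utf-8, so both are fine
        *) printf '%s: encoding %s\n' "$1" "$enc" ;;
    esac
    # A literal CR byte anywhere suggests CRLF (or old Mac CR) line endings.
    if grep -q "$(printf '\r')" "$1" 2>/dev/null; then
        printf '%s: CRLF line endings\n' "$1"
    fi
}

# Scan the project tree, skipping node_modules:
find . -path ./node_modules -prune -o -type f -print |
while IFS= read -r f; do
    check_file "$f"
done
```

Silent output means every file passed both checks; anything printed is a candidate for manual fixing before the commit.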

Any comments and personal experiences on the topic would also be welcome.

Thanks
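For the conversion half of the problem, iconv plus tr (or dos2unix, where installed) covers both issues; a small self-contained sketch, with the sample file names being placeholders:

```shell
# Create a sample file: "café" encoded as iso-8859-1 (0xe9 = é), with a CRLF ending.
printf 'caf\351\r\n' > sample.txt

# Re-encode to UTF-8, then delete every CR byte to convert CRLF -> LF:
iconv -f iso-8859-1 -t utf-8 sample.txt | tr -d '\r' > sample.utf8.txt
```

dos2unix would do the line-ending step in place; note that iconv will not guess the source encoding, so you still have to know (or detect) it per file.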


Edit: I have a temporary solution in place where I use tree and file to output information about every file in my project, but it's kinda wonky. If I don't include the -i option for file, a lot of my files get different output like ASCII C++ program text, HTML document text, English text, etc.:

$ tree -f -i -a -I node_modules --noreport -n | xargs file | grep -v directory
./config.json:              ASCII C++ program text
./debugserver.sh:           ASCII text
./.gitignore:               ASCII text, with no line terminators
./lib/config.js:            ASCII text
./lib/database.js:          ASCII text
./lib/get_input.js:         ASCII text
./lib/models/stream.js:     ASCII English text
./lib/serverconfig.js:      ASCII text
./lib/server.js:            ASCII text
./package.json:             ASCII text
./public/index.html:        HTML document text
./src/config.coffee:        ASCII English text
./src/database.coffee:      ASCII English text
./src/get_input.coffee:     ASCII English text, with CRLF line terminators
./src/jtv.coffee:           ASCII English text
./src/models/stream.coffee: ASCII English text
./src/server.coffee:        ASCII text
./src/serverconfig.coffee:  ASCII text
./testserver.sh:            ASCII text
./vendor/minify.json.js:    ASCII C++ program text, with CRLF line terminators

But if I do include -i it doesn't show me line terminators:

$ tree -f -i -a -I node_modules --noreport -n | xargs file -i | grep -v directory
./config.json:              text/x-c++; charset=us-ascii
./debugserver.sh:           text/plain; charset=us-ascii
./.gitignore:               text/plain; charset=us-ascii
./lib/config.js:            text/plain; charset=us-ascii
./lib/database.js:          text/plain; charset=us-ascii
./lib/get_input.js:         text/plain; charset=us-ascii
./lib/models/stream.js:     text/plain; charset=us-ascii
./lib/serverconfig.js:      text/plain; charset=us-ascii
./lib/server.js:            text/plain; charset=us-ascii
./package.json:             text/plain; charset=us-ascii
./public/index.html:        text/html; charset=us-ascii
./src/config.coffee:        text/plain; charset=us-ascii
./src/database.coffee:      text/plain; charset=us-ascii
./src/get_input.coffee:     text/plain; charset=us-ascii
./src/jtv.coffee:           text/plain; charset=us-ascii
./src/models/stream.coffee: text/plain; charset=us-ascii
./src/server.coffee:        text/plain; charset=us-ascii
./src/serverconfig.coffee:  text/plain; charset=us-ascii
./testserver.sh:            text/plain; charset=us-ascii
./vendor/minify.json.js:    text/x-c++; charset=us-ascii

Also why does it display charset=us-ascii and not utf-8? And what's text/x-c++? Is there a way I could output only charset=utf-8 and line-terminators=LF for each file?
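Since the end goal is keeping the Git repository clean, it may also be worth letting Git itself enforce line endings at commit time via a .gitattributes file; a sketch, where the path patterns are illustrative rather than taken from this project:

```
# .gitattributes
* text=auto
*.js      text eol=lf
*.coffee  text eol=lf
*.sh      text eol=lf
```

This only governs line endings, not encodings, so an encoding check is still needed separately.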


反目相谮 2025-01-04 05:59:15


The solution I ended up with is two Sublime Text 2 plugins, "EncodingHelper" and "LineEndings". I now get both the file encoding and the line endings in the status bar:

Sublime Text 2 status bar

If the encoding is wrong, I can File->Save with Encoding. If the line endings are wrong, the latter plugin comes with commands for changing the line endings:

Sublime Text 2 commands

橘香 2025-01-04 05:59:15


If a file has no BOM, and no 'interesting characters' within the amount of text that file looks at, file concludes that it is ASCII ISO-646 -- a strict subset of UTF-8. You might find that putting BOMs on all your files encourages all these Windows tools to behave; the convention of a BOM on a UTF-8 file originated on Windows. Or it might make things worse. As for text/x-c++, well, that's just file tryin' to be helpful, and failing. Your JavaScript has something in it that looks like C++.
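That subset relationship is easy to see directly with file's --mime-encoding output (a quick demonstration; exact output strings could vary slightly between file versions):

```shell
# Pure 7-bit content is reported as us-ascii, not utf-8:
printf 'plain text\n' > ascii.txt
file -b --mime-encoding ascii.txt    # us-ascii

# The report flips to utf-8 as soon as a multibyte character appears:
printf 'caf\303\251\n' > utf8.txt
file -b --mime-encoding utf8.txt     # utf-8
```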

Apache Tika has an encoding detector; you could even use the command-line driver that comes with it as an alternative to file. It will stick to MIME types and not wander off to C++.

著墨染雨君画夕 2025-01-04 05:59:15


Instead of file, try a custom program to check just the things you want. Here is a quick hack, mainly based on some Google hits, which were incidentally written by @ikegami.

#!/usr/bin/perl

use strict;
use warnings;

use Encode qw( decode );

use vars (qw(@ARGV));

@ARGV > 0 or die "Usage: $0 files ...\n";

for my $filename (@ARGV)
{
    my $terminator = 'CRLF';
    my $charset = 'UTF-8';
    local $/;
    undef $/;
    my $file;
    if (open (F, "<", $filename))
    {
        $file = <F>;
        close F;    
        # Don't print bogus data e.g. for directories
        unless (defined $file)
        {
            warn "$0: Skipping $filename: $!\n";
            next;
        }
    }
    else
    {
        warn "$0: Could not open $filename: $!\n";
        next;
    }

    # Normalize each match to 0/1 so the sum below is warning-free arithmetic.
    my $have_crlf = ($file =~ /\r\n/) ? 1 : 0;
    my $have_cr = ($file =~ /\r(?!\n)/) ? 1 : 0;
    my $have_lf = ($file =~ /(?<!\r)\n/) ? 1 : 0;  # LF not preceded by CR (catches a leading LF too)
    my $sum = $have_crlf + $have_cr + $have_lf;
    if ($sum == 0)
    {
        $terminator = "no";
    }
    elsif ($sum > 1)
    {
        # More than one style present in the same file.
        $terminator = "mixed";
    }
    elsif ($have_cr)    
    {
        $terminator = "CR";
    }
    elsif ($have_lf)
    {
        $terminator = "LF";
    }

    $charset = 'ASCII' unless ($file =~ /[^\000-\177]/);

    $charset = 'unknown'
        unless eval { decode('UTF-8', $file, Encode::FB_CROAK); 1 };

    print "$filename: charset $charset, $terminator line endings\n";
}

Note that this has no concept of legacy 8-bit encodings - it will simply report unknown if the content is neither pure 7-bit ASCII nor proper UTF-8.
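The script handles both checks at once; if only the line-ending half is needed, GNU grep by itself can list the offending files (a sketch; -I skips binary files, and the pattern is a single literal CR byte):

```shell
# Two demo files, one with CRLF and one with LF endings:
printf 'a\r\nb\r\n' > demo_crlf.txt
printf 'a\nb\n' > demo_lf.txt

# -r: recurse, -l: print only file names, -I: ignore binary files
grep -rlI "$(printf '\r')" .
```

The listing will include demo_crlf.txt but not demo_lf.txt, along with any other file under the current directory that contains a CR byte.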
