使用 C# 检测文件名字符是否被视为国际字符

发布于 2024-08-25 11:40:44 字数 3840 浏览 4 评论 0原文

我编写了一个小型控制台应用程序(下面的源代码)来定位并选择性地重命名包含国际字符的文件,因为它们是大多数源代码控制系统持续痛苦的根源(下面有一些背景知识)。我正在使用的代码有一个简单的字典,其中包含要查找和替换的字符(并删除使用超过一个存储字节的所有其他字符),但感觉非常黑客。 (a) 查明某个角色是否是国际角色的正确方法是什么? (b) 最好的 ASCII 替换字符是什么?

让我提供一些背景信息来说明为什么需要这样做。碰巧丹麦语 Å 字符在 UTF-8 中有两种不同的编码,都代表相同的符号。这些被称为 NFC 和 NFD 编码。 Windows 和 Linux 默认情况下会创建 NFC 编码,但会尊重给定的任何编码。 Mac 会将所有名称(保存到 HFS+ 分区时)转换为 NFD,因此为在 Windows 上创建的文件的名称返回不同的字节流。这有效地破坏了 Subversion、Git 和许多其他不关心正确处理这种情况的实用程序。

我目前正在评估 Mercurial,事实证明它在处理国际字符方面更糟糕。由于对这些问题相当厌倦,必须放弃源代码控制或国际字符,所以我们就到了。

我当前的实现:

public class Checker
{
    private Dictionary<char, string> internationals = new Dictionary<char, string>();
    private List<char> keep = new List<char>();
    private List<char> seen = new List<char>();

    public Checker()
    {
        internationals.Add( 'æ', "ae" );
        internationals.Add( 'ø', "oe" );
        internationals.Add( 'å', "aa" );
        internationals.Add( 'Æ', "Ae" );
        internationals.Add( 'Ø', "Oe" );
        internationals.Add( 'Å', "Aa" );

        internationals.Add( 'ö', "o" );
        internationals.Add( 'ü', "u" );
        internationals.Add( 'ä', "a" );
        internationals.Add( 'é', "e" );
        internationals.Add( 'è', "e" );
        internationals.Add( 'ê', "e" );

        internationals.Add( '¦', "" );
        internationals.Add( 'Ã', "" );
        internationals.Add( '©', "" );
        internationals.Add( ' ', "" );
        internationals.Add( '§', "" );
        internationals.Add( '¡', "" );
        internationals.Add( '³', "" );
        internationals.Add( '­', "" );
        internationals.Add( 'º', "" );

        internationals.Add( '«', "-" );
        internationals.Add( '»', "-" );
        internationals.Add( '´', "'" );
        internationals.Add( '`', "'" );
        internationals.Add( '"', "'" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 147 } )[ 0 ], "-" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 148 } )[ 0 ], "-" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 153 } )[ 0 ], "'" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 166 } )[ 0 ], "." );

        keep.Add( '-' );
        keep.Add( '=' );
        keep.Add( '\'' );
        keep.Add( '.' );
    }

    public bool IsInternationalCharacter( char c )
    {
        var s = c.ToString();
        byte[] bytes = Encoding.UTF8.GetBytes( s );
        if( bytes.Length > 1 && ! internationals.ContainsKey( c ) && ! seen.Contains( c ) )
        {
            Console.WriteLine( "X '{0}' ({1})", c, string.Join( ",", bytes ) );
            seen.Add( c );
            if( ! keep.Contains( c ) )
            {
                internationals[ c ] = "";
            }
        }
        return internationals.ContainsKey( c );
    }

    public bool HasInternationalCharactersInName( string name, out string safeName )
    {
        StringBuilder sb = new StringBuilder();
        Array.ForEach( name.ToCharArray(), c => sb.Append( IsInternationalCharacter( c ) ? internationals[ c ] : c.ToString() ) );
        int length = sb.Length;
        sb.Replace( "  ", " " );
        while( sb.Length != length )
        {
            sb.Replace( "  ", " " );
        }
        safeName = sb.ToString().Trim();
        string namePart = Path.GetFileNameWithoutExtension( safeName );
        if( namePart.EndsWith( "." ) )
            safeName = namePart.Substring( 0, namePart.Length - 1 ) + Path.GetExtension( safeName );
        return name != safeName;
    }
}

这将像这样调用:

FileInfo file = new File( "Århus.txt" );
string safeName;    
if( checker.HasInternationalCharactersInName( file.Name, out safeName ) )
{
    // rename file 
}

I've written a small console application (source below) to locate and optionally rename files containing international characters, as they are a source of constant pain with most source control systems (some background on this below). The code I'm using has a simple dictionary with characters to look for and replace (and nukes every other character that uses more than one byte of storage), but it feels very hackish. What's the right way to (a) find out whether a character is international? and (b) what the best ASCII substitution character would be?

Let me provide some background information on why this is needed. It so happens that the danish Å character has two different encodings in UTF-8, both representing the same symbol. These are known as NFC and NFD encodings. Windows and Linux will create NFC encoding by default but respect whatever encoding it is given. Mac will convert all names (when saving to a HFS+ partition) to NFD and therefore returns a different byte stream for the name of a file created on Windows. This effectively breaks Subversion, Git and lots of other utilities that don't care to properly handle this scenario.

I'm currently evaluating Mercurial, which turns out to be even worse at handling international characters.. being fairly tired of these problems, either source control or the international character would have to go, and so here we are.

My current implementation:

public class Checker
{
    private Dictionary<char, string> internationals = new Dictionary<char, string>();
    private List<char> keep = new List<char>();
    private List<char> seen = new List<char>();

    public Checker()
    {
        internationals.Add( 'æ', "ae" );
        internationals.Add( 'ø', "oe" );
        internationals.Add( 'å', "aa" );
        internationals.Add( 'Æ', "Ae" );
        internationals.Add( 'Ø', "Oe" );
        internationals.Add( 'Å', "Aa" );

        internationals.Add( 'ö', "o" );
        internationals.Add( 'ü', "u" );
        internationals.Add( 'ä', "a" );
        internationals.Add( 'é', "e" );
        internationals.Add( 'è', "e" );
        internationals.Add( 'ê', "e" );

        internationals.Add( '¦', "" );
        internationals.Add( 'Ã', "" );
        internationals.Add( '©', "" );
        internationals.Add( ' ', "" );
        internationals.Add( '§', "" );
        internationals.Add( '¡', "" );
        internationals.Add( '³', "" );
        internationals.Add( '­', "" );
        internationals.Add( 'º', "" );

        internationals.Add( '«', "-" );
        internationals.Add( '»', "-" );
        internationals.Add( '´', "'" );
        internationals.Add( '`', "'" );
        internationals.Add( '"', "'" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 147 } )[ 0 ], "-" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 148 } )[ 0 ], "-" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 153 } )[ 0 ], "'" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 166 } )[ 0 ], "." );

        keep.Add( '-' );
        keep.Add( '=' );
        keep.Add( '\'' );
        keep.Add( '.' );
    }

    public bool IsInternationalCharacter( char c )
    {
        var s = c.ToString();
        byte[] bytes = Encoding.UTF8.GetBytes( s );
        if( bytes.Length > 1 && ! internationals.ContainsKey( c ) && ! seen.Contains( c ) )
        {
            Console.WriteLine( "X '{0}' ({1})", c, string.Join( ",", bytes ) );
            seen.Add( c );
            if( ! keep.Contains( c ) )
            {
                internationals[ c ] = "";
            }
        }
        return internationals.ContainsKey( c );
    }

    public bool HasInternationalCharactersInName( string name, out string safeName )
    {
        StringBuilder sb = new StringBuilder();
        Array.ForEach( name.ToCharArray(), c => sb.Append( IsInternationalCharacter( c ) ? internationals[ c ] : c.ToString() ) );
        int length = sb.Length;
        sb.Replace( "  ", " " );
        while( sb.Length != length )
        {
            sb.Replace( "  ", " " );
        }
        safeName = sb.ToString().Trim();
        string namePart = Path.GetFileNameWithoutExtension( safeName );
        if( namePart.EndsWith( "." ) )
            safeName = namePart.Substring( 0, namePart.Length - 1 ) + Path.GetExtension( safeName );
        return name != safeName;
    }
}

And this would be invoked like this:

FileInfo file = new File( "Århus.txt" );
string safeName;    
if( checker.HasInternationalCharactersInName( file.Name, out safeName ) )
{
    // rename file 
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

故事与诗 2024-09-01 11:40:44

(一)简单。检查是否有任何大于 127 的代码点。

(b) 尝试 NKFD 标准化和/或 uni2ascii

(a) Simple. Check for any code points that are greater than 127.

(b) Try NKFD normalization and/or uni2ascii.

扭转时空 2024-09-01 11:40:44

在当今这个时代,存在着令人悲伤的问题。显然,MAC 使用的 NFD 形式引起了您的头痛。您可以考虑的一件事是从字形中删除变音符号,这会导致 NFD 与 NFC 不同。

我不能 100% 确定这是完全准确的(特别是对于亚洲脚本),但它应该接近:

public static string RemoveDiacriticals(string txt) {
  string nfd = txt.Normalize(NormalizationForm.FormD);
  StringBuilder retval = new StringBuilder(nfd.Length);
  foreach (char ch in nfd) {
    if (ch >= '\u0300' && ch <= '\u036f') continue;
    if (ch >= '\u1dc0' && ch <= '\u1de6') continue;
    if (ch >= '\ufe20' && ch <= '\ufe26') continue;
    if (ch >= '\u20d0' && ch <= '\u20f0') continue;
    retval.Append(ch);
  }
  return retval.ToString();
}

Sad problem to have in this day and age. Clearly the NFD form that the MAC uses is causing you this headache. One thing you could consider is removing the diacritics from the glyphs that causes NFD to be different from NFC.

I'm not 100% sure this is completely accurate (especially for Asian scripts), but it ought to be close:

public static string RemoveDiacriticals(string txt) {
  string nfd = txt.Normalize(NormalizationForm.FormD);
  StringBuilder retval = new StringBuilder(nfd.Length);
  foreach (char ch in nfd) {
    if (ch >= '\u0300' && ch <= '\u036f') continue;
    if (ch >= '\u1dc0' && ch <= '\u1de6') continue;
    if (ch >= '\ufe20' && ch <= '\ufe26') continue;
    if (ch >= '\u20d0' && ch <= '\u20f0') continue;
    retval.Append(ch);
  }
  return retval.ToString();
}
一瞬间的火花 2024-09-01 11:40:44

如果你不介意暴力破解,你可以尝试这样的操作:

string name = "Århus.txt";
string kd = name.Normalize(NormalizationForm.FormKD);
byte[] kd_bytes = Encoding.Unicode.GetBytes(kd);
byte[] ascii_bytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, kd_bytes);
string flattened = Encoding.ASCII.GetString(ascii_bytes);

This will conversion Århus.txt to A?rhus.txt,因为 KD 形式将 Å 分开,并且转换为 7 位 ASCII 会丢失变音标记。如何处理剩下的小“?”取决于您。

你的里程可能会因其他角色而异,但我猜 KD 标准化应该可以解决问题。我已经很多年没有从事代码页转换工作了,但我发现这个问题很有趣。

编辑:

我刚刚尝试了 æÆØ ,它们都转换为 ? ,所以这对你来说可能太有损失了。尽管如此,它仍可能为您提供一些线索,从而找到答案。

If you don't mind brute force, you can try something like this:

string name = "Århus.txt";
string kd = name.Normalize(NormalizationForm.FormKD);
byte[] kd_bytes = Encoding.Unicode.GetBytes(kd);
byte[] ascii_bytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, kd_bytes);
string flattened = Encoding.ASCII.GetString(ascii_bytes);

This will convert Århus.txt to A?rhus.txt, because the KD form breaks the Å apart, and the conversion to 7-bit ASCII loses the diacritical mark. What to do with the little ?'s left over is up to you.

Your mileage may vary on the other characters, but I would guess the KD normalization should do the trick. I have not worked on code page conversions for years now, but I found the question intriguing.

EDIT:

I just tried æÆØ and they all converted to ?, so this may be too lossy for you. Still, it may give you some clues that lead to an answer.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文