Convert two ASCII characters to their "corresponding" single-character extended ASCII representation

Posted on 2024-08-24 23:24:21 · 403 words · 5 views · 0 comments

The problem:
I have two fixed-width strings from an external system. The first contains the base characters (like a-z); the second may contain diacritics that have to be applied to the characters at the same positions in the first string to create the actual characters.

string asciibase = "Dutch has funny chars: a,e,u";
string diacrits  = "                       ' \" \"";

//no clue what to do

string result = "Dutch has funny chars: á,ë,ü";

I could write a massive search and replace for all characters + different diacritics but was hoping for something a bit more elegant.

Does anybody have a clue how to solve this? I tried calculating the decimal values and using string.Normalize (C#), but got no results. Google didn't really turn up anything either.

Comments (4)

酒废 2024-08-31 23:24:21

Convert the diacritics to the matching code points from the Unicode Combining Diacritical Marks range:

http://www.unicode.org/charts/PDF/U0300.pdf

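Then put the character and its combining mark together. For e-acute, U+0065 is "e" and U+0301 is the combining acute accent:

  string s = "\u0065\u0301";

Then:

  string normalisedString = s.Normalize();

This combines the two into a single precomposed character (U+00E9, "é") in a new string, because the parameterless Normalize() uses Unicode normalization form C.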
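
To make the composition step concrete, here is a minimal sketch. The class and method names (`ComposeDemo`, `Compose`) are made up for illustration; only `string.Normalize` and `NormalizationForm.FormC` come from the framework. It spells out form C explicitly rather than relying on the default.

```csharp
using System;
using System.Text;

public static class ComposeDemo
{
    // Hypothetical helper: builds "base + combining mark" and asks the
    // framework to compose it (Unicode normalization form C).
    public static string Compose(char baseChar, char combiningMark)
    {
        string decomposed = new string(new[] { baseChar, combiningMark });
        return decomposed.Normalize(NormalizationForm.FormC);
    }
}
```

Calling `ComposeDemo.Compose('e', '\u0301')` yields a string of length 1 containing U+00E9. If no precomposed character exists for a given pair, the result keeps both code units, so the length stays 2.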

随遇而安 2024-08-31 23:24:21

I cannot find an easy solution except using lookup tables:

public void TestMethod1()
{
    string asciibase = "Dutch has funny chars: a,e,u";
    string diacrits = "                       ' \" \"";
    var merged = DiacritMerger.Merge(asciibase, diacrits);
}

[EDIT: Simplified code after suggestions in the answers from @JonB and @Oliver]

using System.Collections.Generic;
using System.Linq;

public class DiacritMerger
{
    static readonly Dictionary<char, char> _lookup = new Dictionary<char, char>
                         {
                             {'\'', '\u0301'},
                             {'"', '\u0308'}
                         };

    public static string Merge(string asciiBase, string diacrits)
    {
        var combined = asciiBase.Zip(diacrits, (ascii, diacrit) => DiacritVersion(diacrit, ascii));
        return new string(combined.ToArray());
    }

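    private static char DiacritVersion(char diacrit, char character)
    {
        char combine;
        if (!_lookup.TryGetValue(diacrit, out combine))
            return character;

        // Normalize() composes base + combining mark into one char when a
        // precomposed form exists; otherwise [0] falls back to the base char.
        return new string(new[] { character, combine }).Normalize()[0];
    }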
}
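
A possible variant of the lookup-table idea, sketched under a few assumptions: the class name `DiacritComposer` and the extra marker characters (backtick, caret, tilde) are guesses about what the external system might send, not something stated in the question. Instead of composing one pair at a time and taking `[0]`, it appends the combining marks to a `StringBuilder` and normalizes the whole string once, so a mark without a precomposed form is kept rather than dropped, and a diacritics string shorter than the base string does not truncate the result (which `Zip` would do).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class DiacritComposer
{
    // Assumed mapping from marker characters to Unicode combining marks;
    // only ' and " appear in the question, the rest are illustrative.
    private static readonly Dictionary<char, char> Marks = new Dictionary<char, char>
    {
        { '\'', '\u0301' }, // combining acute accent
        { '"',  '\u0308' }, // combining diaeresis
        { '`',  '\u0300' }, // combining grave accent
        { '^',  '\u0302' }, // combining circumflex accent
        { '~',  '\u0303' }  // combining tilde
    };

    public static string Merge(string asciiBase, string diacrits)
    {
        var sb = new StringBuilder(asciiBase.Length * 2);
        for (int i = 0; i < asciiBase.Length; i++)
        {
            sb.Append(asciiBase[i]);

            // Append the combining mark, if the marker at this position is known.
            char mark;
            if (i < diacrits.Length && Marks.TryGetValue(diacrits[i], out mark))
                sb.Append(mark);
        }

        // One pass of form C composes every pair that has a precomposed form.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With the strings from the question, `DiacritComposer.Merge(asciibase, diacrits)` returns "Dutch has funny chars: á,ë,ü".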

烟酉 2024-08-31 23:24:21

The problem is that the diacritic markers have to be parsed explicitly: a diaeresis does not exist as a standalone ASCII character, so the double quote is used to stand in for it. To solve your problem, you have no choice but to implement each case you need.

Here is a starting point:

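// The class name is a placeholder; the original snippet omitted the declaration.
public class DiacritCombiner
{
    public void SomeFunction()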
    {
        string asciiChars = "Dutch has funny chars: a,e,u";
        string diacrits = "                       ' \" \"";

        var combinedChars = asciiChars.Zip(diacrits, (ascii, diacrit) =>
        {
            return CombineChars(ascii, diacrit);
        });

        var Result = new String(combinedChars.ToArray());
    }

    private char CombineChars(char ascii, char diacrit)
    {
        switch (diacrit)
        {
            case '"':
                return AddDoublePoints(ascii);
            case '\'':
                return AddAccent(ascii);
            default:
                return ascii;
        }
    }

    private char AddDoublePoints(char ascii)
    {
        switch (ascii)
        {
            case 'a':
                return 'ä';
            case 'e':
                return 'ë'; // needed for the example in the question
            case 'o':
                return 'ö';
            case 'u':
                return 'ü';
            default:
                return ascii;
        }
    }

    private char AddAccent(char ascii)
    {
        switch (ascii)
        {
            case 'a':
                return 'á';
            case 'o':
                return 'ó';
            default:
                return ascii;
        }
    }
}

Enumerable.Zip is already implemented in .NET 4, but to get it in 3.5 you'll need this code (taken from Eric Lippert):

using System;
using System.Collections.Generic;

public static class IEnumerableExtension
{
    public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>
        (this IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> resultSelector)
    {
        if (first == null) throw new ArgumentNullException("first");
        if (second == null) throw new ArgumentNullException("second");
        if (resultSelector == null) throw new ArgumentNullException("resultSelector");
        return ZipIterator(first, second, resultSelector);
    }

    private static IEnumerable<TResult> ZipIterator<TFirst, TSecond, TResult>
        (IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> resultSelector)
    {
        using (IEnumerator<TFirst> e1 = first.GetEnumerator())
        using (IEnumerator<TSecond> e2 = second.GetEnumerator())
            while (e1.MoveNext() && e2.MoveNext())
                yield return resultSelector(e1.Current, e2.Current);
    }
}

生寂 2024-08-31 23:24:21

I don't know C# or its standard libraries, but one alternative approach might be to use an existing HTML/SGML/XML character entity parser/renderer, or, if you are actually going to present the text in a browser, nothing at all!

Pseudo code:

for(i=0; i < strlen(either_string); i++) {
  if isspace(diacrits[i]) {
     output(asciibase[i]);
  }else{
     output("&");
     output(asciibase[i]);
     switch (diacrits[i]) {
       case '"' : output "uml"; break;
       case '^' : output "circ"; break;
       case '~' : output "tilde"; break;
       case 'o' : output "ring"; break;
       ... and so on for each "code" in the diacrits modifier
       ... (for acute, grave, cedil, lig, ...)
     }
     output(";");
  }
}

Thus, A + o -> Å, u + " -> ü and so on.

If you can then parse HTML entities, you should be home free, and the result is even portable between charsets!
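
For what it's worth, the entity idea can be sketched in C# with `System.Net.WebUtility.HtmlDecode`, which is part of the framework from .NET 4 on. Everything else here is an assumption for illustration: the class name `EntityMerger`, the marker-to-suffix table, and the choice to decode each entity on its own (so that a literal `&` in the base text is never misread as the start of an entity). If a base/marker pair does not form a known entity, the sketch falls back to the plain base character.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class EntityMerger
{
    // Assumed mapping from marker characters to HTML entity name suffixes.
    private static readonly Dictionary<char, string> Suffixes = new Dictionary<char, string>
    {
        { '\'', "acute" },
        { '"',  "uml"   },
        { '`',  "grave" },
        { '^',  "circ"  },
        { '~',  "tilde" },
        { 'o',  "ring"  }
    };

    public static string Merge(string asciiBase, string diacrits)
    {
        var sb = new StringBuilder(asciiBase.Length);
        for (int i = 0; i < asciiBase.Length; i++)
        {
            char c = asciiBase[i];
            string suffix;
            if (i < diacrits.Length && Suffixes.TryGetValue(diacrits[i], out suffix))
            {
                // Build e.g. "&euml;" and let the framework resolve it.
                string decoded = WebUtility.HtmlDecode("&" + c + suffix + ";");

                // A successful decode is a single character; anything else
                // means the entity was not recognised, so keep the base char.
                sb.Append(decoded.Length == 1 ? decoded[0] : c);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

This only covers combinations that have a named HTML entity, so it is narrower than the combining-mark approach; it is shown mainly to illustrate the suggestion above.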
