Slugify and character transliteration in C#

Posted on 2024-08-19 10:54:10

I'm trying to translate the following slugify method from PHP to C#:
http://snipplr.com/view/22741/slugify-a-string-in-php/

Edit: For the sake of convenience, here is the code from the link above:

/**
 * Modifies a string to remove all non-ASCII characters and spaces.
 */
static public function slugify($text)
{
    // replace non letter or digits by -
    $text = preg_replace('~[^\\pL\d]+~u', '-', $text);

    // trim
    $text = trim($text, '-');

    // transliterate
    if (function_exists('iconv'))
    {
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    }

    // lowercase
    $text = strtolower($text);

    // remove unwanted characters
    $text = preg_replace('~[^-\w]+~', '', $text);

    if (empty($text))
    {
        return 'n-a';
    }

    return $text;
}

I have no problem coding the rest, except that I cannot find the C# equivalent of the following line of PHP code:

$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);

Edit: The purpose of this is to transliterate non-ASCII characters, for example turning Reformáció Genfi Emlékműve Előtt into reformacio-genfi-emlekmuve-elott

Comments (3)

我不在是我 2024-08-26 10:54:10

I would also like to add that //TRANSLIT removes the apostrophes, and that @jxac's solution doesn't address that. I'm not sure why, but if you first encode the string as Cyrillic and then decode the bytes as ASCII, you get behavior similar to //TRANSLIT.

var str = "éåäöíØ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str)); 

=> "eaaoiO"

烟花肆意 2024-08-26 10:54:10

There is a .NET library for transliteration on CodePlex: unidecode. It generally does the trick, using Unidecode tables ported from Python.

隐诗 2024-08-26 10:54:10

Conversion to string:

byte[] unicodeBytes = Encoding.Unicode.GetBytes(str);
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
string asciiString = Encoding.ASCII.GetString(asciiBytes);

Conversion to bytes:

byte[] ascii = Encoding.ASCII.GetBytes(str);
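
One thing worth keeping in mind about the conversions above: they are not a transliteration on their own. With its default fallback, Encoding.ASCII replaces every character it cannot represent with "?". A minimal sketch of that behavior (the class and method names are made up for illustration):

```csharp
using System.Text;

public static class AsciiConversionSketch
{
    // Round-trips a string through the ASCII encoding.
    // Characters outside ASCII are not mapped to a base letter;
    // the default replacement fallback turns each of them into '?'.
    public static string ToAscii(string input)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(input);
        return Encoding.ASCII.GetString(asciiBytes);
    }
}
```

For example, AsciiConversionSketch.ToAscii("Reform\u00E1ci\u00F3") (that is, "Reformáció" written with escapes) gives "Reform?ci?", which is why an extra step is needed to strip the accents first.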

@Thomas Levesque is right, it will get encoded by the output stream...

To remove the diacritics (accent marks), you can use the String.Normalize function, as detailed here:

http://www.siao2.com/2007/05/14/2629747.aspx

That should take care of most cases (where the glyph is really a character plus an accent mark). For even more aggressive character matching (to handle cases like the Scandinavian slashed o [Ø], digraphs, and other exotic glyphs), there's the table approach:

http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx

This includes around 1,000 symbol mappings in addition to the normalization.
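
A rough sketch of how the two ideas combine, assuming the usual pattern of decomposing with String.Normalize(NormalizationForm.FormD) and dropping the combining marks, followed by a small lookup table for characters that do not decompose. The class name, method name, and the handful of table entries are illustrative assumptions; they are not taken from either linked article, and a real table would be much larger:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TransliterationSketch
{
    // Hypothetical fallback table for characters with no canonical
    // decomposition. Extend it as needed.
    private static readonly Dictionary<char, string> Fallback =
        new Dictionary<char, string>
        {
            { '\u00D8', "O" },  // Ø
            { '\u00F8', "o" },  // ø
            { '\u00C6', "AE" }, // Æ
            { '\u00E6', "ae" }, // æ
            { '\u00DF', "ss" }, // ß
            { '\u0141', "L" },  // Ł
            { '\u0142', "l" },  // ł
        };

    public static string ToAsciiApprox(string text)
    {
        // FormD splits "é" into "e" followed by U+0301 (combining acute accent).
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop the combining marks produced by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            string replacement;
            if (Fallback.TryGetValue(c, out replacement))
                result.Append(replacement);
            else
                result.Append(c); // may still be non-ASCII if not in the table
        }

        return result.ToString();
    }
}
```

With this sketch, "éåäöíØ" should come out as "eaaoiO": the first five characters lose their accents through normalization, and Ø is only handled because it is in the table. Anything that neither decomposes nor appears in the table passes through unchanged, so a final cleanup step is still needed for a slug.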

(Note: all punctuation is removed by the regex replace in your example.)
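
Putting it together, here is one possible port of the PHP slugify function from the question. It is only a sketch: the iconv //TRANSLIT step is swapped for the Normalize-based accent stripping described above, which is an approximation and not an exact equivalent, and the names SlugSketch, Slugify, and StripCombiningMarks are made up for illustration.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class SlugSketch
{
    public static string Slugify(string text)
    {
        // Replace every run of characters that are neither letters nor digits with "-".
        text = Regex.Replace(text, @"[^\p{L}\d]+", "-");

        // Trim leading and trailing dashes.
        text = text.Trim('-');

        // Stand-in for iconv('utf-8', 'us-ascii//TRANSLIT', ...): strip accents only.
        text = StripCombiningMarks(text);

        // Lowercase.
        text = text.ToLowerInvariant();

        // Remove anything that is not an ASCII letter, digit, underscore, or dash.
        // (.NET's \w is Unicode-aware, so the class is spelled out explicitly.)
        text = Regex.Replace(text, @"[^a-z0-9_-]+", "");

        // Note: PHP's empty() also treats "0" as empty; this sketch only
        // checks for the empty string.
        return text.Length == 0 ? "n-a" : text;
    }

    private static string StripCombiningMarks(string text)
    {
        var result = new StringBuilder(text.Length);
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                result.Append(c);
        }
        return result.ToString();
    }
}
```

For the example in the question, SlugSketch.Slugify("Reformáció Genfi Emlékműve Előtt") should give "reformacio-genfi-emlekmuve-elott". Letters that have no decomposition (such as Ø) are dropped by the final regex in this version; adding a lookup table before that step would keep them.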
