Slugify and Character Transliteration in C#
I'm trying to translate the following slugify method from PHP to C#:
http://snipplr.com/view/22741/slugify-a-string-in-php/
Edit: For the sake of convenience, here is the code from the link above:
/**
 * Modifies a string to remove all non-ASCII characters and spaces.
 */
static public function slugify($text)
{
    // replace anything that is not a letter or digit with -
    $text = preg_replace('~[^\\pL\d]+~u', '-', $text);
    // trim
    $text = trim($text, '-');
    // transliterate
    if (function_exists('iconv'))
    {
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    }
    // lowercase
    $text = strtolower($text);
    // remove unwanted characters
    $text = preg_replace('~[^-\w]+~', '', $text);
    if (empty($text))
    {
        return 'n-a';
    }
    return $text;
}
I had no problem coding the rest, except that I cannot find the C# equivalent of the following line of PHP code:
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
Edit: The purpose of this is to transliterate non-ASCII characters, for example to turn Reformáció Genfi Emlékműve Előtt into reformacio-genfi-emlekmuve-elott.
Comments (3)
I would also like to add that //TRANSLIT removes the apostrophes and that @jxac's solution doesn't address that. I'm not sure why, but by first encoding the text to Cyrillic and then to ASCII you get behavior similar to //TRANSLIT.
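The Cyrillic-then-ASCII round trip described in the first comment could look roughly like the sketch below. It assumes .NET Core 3.0 or later (or .NET 5+), where legacy code pages must be registered through CodePagesEncodingProvider before "Cyrillic" can be looked up by name. The class and method names are hypothetical, and the exact characters produced depend on the code page's best-fit table, so treat the result as approximate rather than a guaranteed match for //TRANSLIT.

```csharp
using System;
using System.Text;

public static class CyrillicRoundTrip
{
    // Hypothetical helper name; it does not come from the original comment.
    public static string ToAsciiViaCyrillic(string text)
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where legacy
        // code pages are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Encode with the "Cyrillic" code page. Accented Latin letters are
        // not part of that code page, so the encoder falls back to whatever
        // substitute its mapping provides (often the unaccented base letter).
        byte[] bytes = Encoding.GetEncoding("Cyrillic").GetBytes(text);

        // Decode the bytes as ASCII. Any byte above 0x7F becomes '?'.
        return Encoding.ASCII.GetString(bytes);
    }
}
```

Calling CyrillicRoundTrip.ToAsciiViaCyrillic on the slug candidate would replace the iconv line in the PHP version; the lowercase and cleanup steps still need to run afterwards.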
There is a .NET library for transliteration on CodePlex: unidecode. It generally does the trick, using Unidecode tables ported from Python.
Conversion to string:
Conversion to bytes:
@Thomas Levesque is right, it will get encoded by the output stream...
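As an assumption about what a plain conversion to ASCII bytes and back to a string can look like in C#, here is a minimal sketch using Encoding.Convert. The class and method names are hypothetical. It mainly shows why a bare encoding conversion is not a transliteration: characters with no ASCII equivalent are replaced with '?' instead of being mapped to a base letter.

```csharp
using System;
using System.Text;

public static class AsciiConversionDemo
{
    // Conversion to bytes: re-encode the UTF-8 bytes of the text as ASCII.
    public static byte[] ToAsciiBytes(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        // Characters outside ASCII hit the default replacement fallback ('?').
        return Encoding.Convert(Encoding.UTF8, Encoding.ASCII, utf8);
    }

    // Conversion to string: decode the ASCII bytes back into a .NET string.
    public static string ToAsciiString(string text)
    {
        return Encoding.ASCII.GetString(ToAsciiBytes(text));
    }
}
```

For example, AsciiConversionDemo.ToAsciiString("Reform\u00E1ci\u00F3") returns "Reform?ci?", which is why the diacritics have to be removed before (or instead of) converting to ASCII.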
To remove the diacritics (accent marks), you can use the String.Normalize function, as detailed here:
http://www.siao2.com/2007/05/14/2629747.aspx
That should take care of most cases (where the glyph is really a character plus an accent mark). For even more aggressive character matching (to handle cases like the Scandinavian slashed o [Ø], digraphs, and other exotic glyphs), there's the table approach:
http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx
This includes around 1,000 symbol mappings in addition to the normalization.
(Note: all punctuation is removed by the regex replace in your example.)
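To make the two ideas above concrete, here is a minimal sketch of a C# slugify that follows the steps of the PHP function, replacing the iconv call with String.Normalize plus a very small lookup table. The SlugHelper class, its method names, and the handful of table entries are illustrative assumptions; the linked article's table is far larger and is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class SlugHelper
{
    // Tiny illustrative table for letters that FormD does not decompose.
    // A real table would contain many more entries.
    private static readonly Dictionary<char, string> Extra = new Dictionary<char, string>
    {
        { 'Ø', "O" }, { 'ø', "o" }, { 'Æ', "AE" }, { 'æ', "ae" },
        { 'ß', "ss" }, { 'Đ', "D" }, { 'đ', "d" }, { 'Ł', "L" }, { 'ł', "l" }
    };

    public static string RemoveDiacritics(string text)
    {
        // FormD splits "á" into "a" followed by a combining acute accent.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks, keep everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;
            if (Extra.TryGetValue(c, out string mapped))
                sb.Append(mapped);
            else
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static string Slugify(string text)
    {
        // Replace runs of anything that is not a letter or digit with '-', then trim.
        string slug = Regex.Replace(text, @"[^\p{L}0-9]+", "-").Trim('-');
        // Transliterate and lowercase.
        slug = RemoveDiacritics(slug).ToLowerInvariant();
        // Remove whatever is still not an ASCII letter, digit, '_' or '-'.
        slug = Regex.Replace(slug, @"[^a-z0-9_-]+", "");
        return slug.Length == 0 ? "n-a" : slug;
    }
}
```

With this sketch, SlugHelper.Slugify("Reformáció Genfi Emlékműve Előtt") returns "reformacio-genfi-emlekmuve-elott". Characters that neither decompose nor appear in the table are simply dropped by the final regex, so the table should be extended for the scripts you expect.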