Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs.
A slug string typically consists only of the characters a-z, 0-9 and -, and can hence be written without URL-escaping (think "foo%20bar").
I'm looking for a Java slug function that, given any valid Unicode string, will return a slug representation (a-z, 0-9 and -).
A trivial slug function would be something along the lines of:
return input.toLowerCase().replaceAll("[^a-z0-9-]", "");
However, this implementation would not handle internationalization and accents (ë > e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more well thought out and general.
My question:
- What is the most general/practical way to generate Django/Rails type slugs in Java?
Normalize your string using canonical decomposition.
This is still a fairly naive process, though. It isn't going to do anything for s-sharp (ß, used in German), or any non-Latin-based alphabet (Greek, Cyrillic, CJK, etc.).
Be careful when changing the case of a string. Upper and lower case forms are dependent on alphabets. In Turkish, the capitalization of U+0069 (i) is U+0130 (İ), not U+0049 (I), so you risk introducing a non-Latin-1 character back into your string if you use String.toLowerCase() under a Turkish locale.
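A minimal sketch of the canonical-decomposition idea described above, not necessarily the exact code the answer had in mind. The class name NaiveSlug and the method toSlug are made up for illustration. It assumes NFD normalization via java.text.Normalizer, lower-casing with Locale.ROOT to sidestep the Turkish issue, and dropping everything outside a-z, 0-9 and -:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class NaiveSlug {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern NOT_SLUG_CHAR = Pattern.compile("[^a-z0-9-]");

    static String toSlug(String input) {
        // NFD splits an accented letter such as U+00EB into "e" plus a
        // combining diaeresis (U+0308); the combining mark is dropped below.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Locale.ROOT keeps the case mapping independent of the default locale.
        String lower = decomposed.toLowerCase(Locale.ROOT);
        // Each run of whitespace becomes one hyphen.
        String hyphenated = WHITESPACE.matcher(lower).replaceAll("-");
        // Everything that is not a-z, 0-9 or '-' is removed, including
        // combining marks, U+00DF (s-sharp) and non-Latin letters.
        return NOT_SLUG_CHAR.matcher(hyphenated).replaceAll("");
    }
}
```

As the answer warns, letters with no canonical decomposition are simply lost: with this sketch "Straße" comes out as "strae", and a purely Greek or Cyrillic input becomes an empty string.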
http://search.maven.org/#search|ga|1|slugify
And here's the GitHub repository to take a look at the code and its usage:
https://github.com/slugify/slugify
McDowell's proposal almost works, but for an input like "Hello World !!" it returns hello-world-- (note the -- at the end of the string) instead of hello-world. A fixed version could be:
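One way such a fix might look, as a rough sketch rather than the answerer's exact code. The names TrimmedSlug, toSlug and EDGE_HYPHENS are assumptions chosen for illustration. The only change from the naive approach is a final pass that strips hyphens at the start and end of the result:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class TrimmedSlug {
    private static final Pattern WHITESPACE = Pattern.compile("\\s");
    private static final Pattern NOT_SLUG_CHAR = Pattern.compile("[^a-z0-9-]");
    private static final Pattern EDGE_HYPHENS = Pattern.compile("^-+|-+$");

    static String toSlug(String input) {
        // Every whitespace character becomes a hyphen.
        String hyphenated = WHITESPACE.matcher(input).replaceAll("-");
        // Decompose accented letters, then lower-case locale-independently.
        String decomposed = Normalizer.normalize(hyphenated, Normalizer.Form.NFD);
        String lower = decomposed.toLowerCase(Locale.ROOT);
        // Drop everything outside a-z, 0-9 and '-'.
        String slug = NOT_SLUG_CHAR.matcher(lower).replaceAll("");
        // Strip the hyphens left dangling at either end.
        return EDGE_HYPHENS.matcher(slug).replaceAll("");
    }
}
```

Note that this only cleans up the edges; repeated hyphens in the middle of the string (for example from two consecutive spaces) are left alone.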
I've extended the answer by @McDowell to include escaping punctuation as hyphens and to remove duplicate and leading/trailing hyphens.
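A sketch of what such an extension could look like; this is a guess at the shape, not the poster's actual code, and PunctuationSlug, toSlug and the pattern names are hypothetical. It assumes "punctuation" means ASCII punctuation, which is what \p{Punct} matches in java.util.regex by default:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class PunctuationSlug {
    private static final Pattern SEPARATOR = Pattern.compile("[\\s\\p{Punct}]");
    private static final Pattern NOT_SLUG_CHAR = Pattern.compile("[^a-z0-9-]");
    private static final Pattern REPEATED_HYPHENS = Pattern.compile("-{2,}");
    private static final Pattern EDGE_HYPHENS = Pattern.compile("^-+|-+$");

    static String toSlug(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String lower = decomposed.toLowerCase(Locale.ROOT);
        // Whitespace and ASCII punctuation both act as word separators.
        String hyphenated = SEPARATOR.matcher(lower).replaceAll("-");
        // Remove combining marks and any other character outside a-z, 0-9, '-'.
        String stripped = NOT_SLUG_CHAR.matcher(hyphenated).replaceAll("");
        // Collapse runs of hyphens, then trim them from both ends.
        String collapsed = REPEATED_HYPHENS.matcher(stripped).replaceAll("-");
        return EDGE_HYPHENS.matcher(collapsed).replaceAll("");
    }
}
```

Collapsing is done after stripping on purpose: removing an unsupported character that sits between two separators would otherwise leave a double hyphen behind.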