Java code/library for generating slugs (for use in pretty URLs)

Posted on 2024-08-10 02:20:08

Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs.

A slug string typically consists only of the characters a-z, 0-9 and -, and can hence be written without URL-escaping (think "foo%20bar").

I'm looking for a Java slug function that, given any valid Unicode string, will return a slug representation (a-z, 0-9 and -).

A trivial slug function would be something along the lines of:

return input.toLowerCase().replaceAll("[^a-z0-9-]", "");

However, this implementation would not handle internationalization and accents (ë > e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something better thought out and more general.
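To make the problem concrete, here is a small sketch of what the one-liner does to accented input. The class name TrivialSlugDemo and the sample strings are made up for illustration, and Locale.ROOT is passed explicitly (an assumption, not part of the one-liner) so the result does not depend on the machine's default locale.

```java
import java.util.Locale;

final class TrivialSlugDemo {
    // The one-liner from the question, with an explicit locale added.
    static String trivialSlug(String input) {
        return input.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9-]", "");
    }

    public static void main(String[] args) {
        // "Crème Brûlée": the accented letters and the space are dropped, not converted.
        System.out.println(trivialSlug("Cr\u00E8me Br\u00FBl\u00E9e")); // crmebrle
        System.out.println(trivialSlug("Hello World"));                 // helloworld
    }
}
```

Accented letters vanish entirely instead of becoming their base letters, and word boundaries are lost because spaces are deleted rather than turned into hyphens.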

My question:

What is the most general/practical way to generate Django/Rails type slugs in Java?

请别遗忘我 2024-08-17 02:20:08

Normalize your string using canonical decomposition:

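  import java.text.Normalizer;
  import java.text.Normalizer.Form;
  import java.util.Locale;
  import java.util.regex.Pattern;

  // Anything that is not an ASCII word character (a-z, A-Z, 0-9, _) or a hyphen.
  private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
  private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

  public static String toSlug(String input) {
    String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
    // NFD splits accented letters into a base letter plus combining marks.
    String normalized = Normalizer.normalize(nowhitespace, Form.NFD);
    // The combining marks are not word characters, so they are stripped here.
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    return slug.toLowerCase(Locale.ENGLISH);
  }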

This is still a fairly naive process, though. It does nothing useful for the sharp s (ß, used in German) or for any non-Latin alphabet (Greek, Cyrillic, CJK, etc.): those characters have no canonical decomposition to ASCII letters, so they are simply removed.
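A rough sketch of what the decomposition step does and does not cover. The class DecompositionDemo and its helper stripAccents are hypothetical names, and the ß to "ss" replacement is an assumed special case bolted on for illustration; it is not something Normalizer does by itself.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DecompositionDemo {
    // Same character class as NONLATIN above: \w is ASCII-only by default in Java.
    private static final Pattern NON_WORD = Pattern.compile("[^\\w-]");

    // Hypothetical helper: expand the sharp s by hand, then decompose and strip.
    static String stripAccents(String input) {
        String expanded = input.replace("\u00DF", "ss");
        String decomposed = Normalizer.normalize(expanded, Normalizer.Form.NFD);
        return NON_WORD.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00EB decomposes into 'e' followed by U+0308 (combining diaeresis).
        System.out.println(Normalizer.normalize("\u00EB", Normalizer.Form.NFD).length()); // 2
        System.out.println(stripAccents("Stra\u00DFe")); // Strasse
        // Cyrillic "Privet" has no Latin decomposition, so nothing survives.
        System.out.println(stripAccents("\u041F\u0440\u0438\u0432\u0435\u0442").isEmpty()); // true
    }
}
```

Handling non-Latin scripts properly would need real transliteration, which this sketch does not attempt.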

Be careful when changing the case of a string. Upper and lower case forms depend on the alphabet. In Turkish, the upper case of U+0069 (i) is U+0130 (İ), not U+0049 (I), and the lower case of U+0049 (I) is U+0131 (ı), so you risk introducing a non-Latin-1 character back into your string if you use String.toLowerCase() under a Turkish locale.
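The locale issue is easy to reproduce. A minimal sketch follows; the class name TurkishCaseDemo and the sample word are made up for illustration.

```java
import java.util.Locale;

final class TurkishCaseDemo {
    static String lower(String input, Locale locale) {
        return input.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Under Turkish rules 'I' lowercases to U+0131 (dotless i), which is not in a-z.
        System.out.println(lower("TITLE", turkish));        // t\u0131tle
        // With a fixed English locale the result stays ASCII.
        System.out.println(lower("TITLE", Locale.ENGLISH)); // title
        // The upper case of 'i' under Turkish rules is U+0130.
        System.out.println("i".toUpperCase(turkish).equals("\u0130")); // true
    }
}
```

This is why the code above passes Locale.ENGLISH explicitly rather than calling the no-argument toLowerCase(), which uses the JVM's default locale.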

幽蝶幻影 2024-08-17 02:20:08

http://search.maven.org/#search|ga|1|slugify

Here is the GitHub repository, where you can look at the code and its usage:

https://github.com/slugify/slugify

叫嚣ゝ 2024-08-17 02:20:08

McDowell's proposal almost works, but for input such as Hello World !! it returns hello-world- (note the - at the end of the string) instead of hello-world.

A fixed version could be:

import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
private static final Pattern WHITESPACE = Pattern.compile("[\\s]");
// Matches one hyphen at the very start or very end of the string.
private static final Pattern EDGESDASHES = Pattern.compile("(^-|-$)");

public static String toSlug(String input) {
    String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
    String normalized = Normalizer.normalize(nowhitespace, Normalizer.Form.NFD);
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    // Strip a single leading and a single trailing hyphen.
    slug = EDGESDASHES.matcher(slug).replaceAll("");
    return slug.toLowerCase(Locale.ENGLISH);
}

情场扛把子 2024-08-17 02:20:08

I've extended @McDowell's answer to turn punctuation into hyphens and to remove duplicate and leading/trailing hyphens.

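  import java.text.Normalizer;
  import java.text.Normalizer.Form;
  import java.util.Locale;
  import java.util.regex.Pattern;

  private static final Pattern NONLATIN = Pattern.compile("[^\\w_-]");
  // Whitespace or ASCII punctuation, except the hyphen itself.
  private static final Pattern SEPARATORS = Pattern.compile("[\\s\\p{Punct}&&[^-]]");

  public static String makeSlug(String input) {
    String noseparators = SEPARATORS.matcher(input).replaceAll("-");
    String normalized = Normalizer.normalize(noseparators, Form.NFD);
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    // Lower-case, collapse runs of hyphens, then trim one hyphen from each end.
    return slug.toLowerCase(Locale.ENGLISH).replaceAll("-{2,}","-").replaceAll("^-|-$","");
  }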
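For a quick sanity check, the method can be wrapped in a class and run against a few inputs. The wrapper class SlugCheck and the sample strings below are only illustrative assumptions; the method body is the one from this answer.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

final class SlugCheck {
    private static final Pattern NONLATIN = Pattern.compile("[^\\w_-]");
    private static final Pattern SEPARATORS = Pattern.compile("[\\s\\p{Punct}&&[^-]]");

    static String makeSlug(String input) {
        String noseparators = SEPARATORS.matcher(input).replaceAll("-");
        String normalized = Normalizer.normalize(noseparators, Normalizer.Form.NFD);
        String slug = NONLATIN.matcher(normalized).replaceAll("");
        return slug.toLowerCase(Locale.ENGLISH).replaceAll("-{2,}", "-").replaceAll("^-|-$", "");
    }

    public static void main(String[] args) {
        System.out.println(makeSlug("Hello World !!"));                  // hello-world
        System.out.println(makeSlug("Cr\u00E8me Br\u00FBl\u00E9e!"));    // creme-brulee
        // The underscore counts as punctuation here, so it becomes a hyphen too.
        System.out.println(makeSlug("foo_bar"));                         // foo-bar
    }
}
```

Because every separator becomes a hyphen before the collapse step, any mix of spaces and punctuation between two words ends up as a single hyphen.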
