Java and SEO friendly URLs: ©reate ╨ a valid http url from a string composed by special characters

Posted on 2024-09-06 22:19:06

I'm trying to generate SEO-friendly URLs from strings that can contain special characters, letters with accents, Chinese-like characters, etc.
SO is doing this, and it translates this post's title into

java-and-seo-friendly-urls-reate--a-valid-http-url-from-a-string-composed-by-s

I'm trying to do this in Java.
I'm using the solution from this post together with URLEncoder.encode to translate Chinese characters and other symbols into valid URL characters.

Have you ever implemented something like this? Is there a better way?

Comments (3)

清音悠歌 2024-09-13 22:19:06

This might be an oversimplified approach to the problem, but you could just use regular expressions to remove all non-standard characters. After converting your string to lowercase, replace every character that is neither a lowercase ASCII letter nor whitespace with the empty string, then replace each whitespace character with '-'. Note that this also drops digits and accented letters.

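private static String encodeForUrl(String input) {
  // Lowercase, drop everything except a-z and whitespace, then turn each whitespace character into '-'.
  return input.toLowerCase().replaceAll("[^a-z\\s]", "").replaceAll("\\s", "-");
}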
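
To make the behaviour of this approach concrete, here is a small sketch that runs the method on a few inputs. `NaiveSlugDemo` and `keepDigitsAndCollapse` are hypothetical names introduced only for this illustration; the second method is an assumed variant, not part of the answer, that keeps digits and collapses runs of whitespace. Non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.Locale;

// Hypothetical demo class; encodeForUrl is copied unchanged from the answer.
final class NaiveSlugDemo {

    static String encodeForUrl(String input) {
        return input.toLowerCase().replaceAll("[^a-z\\s]", "").replaceAll("\\s", "-");
    }

    // Assumed variant (not from the answer): keeps digits, trims the ends,
    // and turns each run of whitespace into a single '-'.
    static String keepDigitsAndCollapse(String input) {
        return input.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9\\s]", "")
                .trim()
                .replaceAll("\\s+", "-");
    }

    public static void main(String[] args) {
        // Plain ASCII titles work as expected.
        System.out.println(encodeForUrl("Java and SEO friendly URLs")); // java-and-seo-friendly-urls

        // "Crème brûlée 2024": accented letters and digits are simply deleted.
        System.out.println(encodeForUrl("Cr\u00E8me br\u00FBl\u00E9e 2024")); // crme-brle-

        // Two spaces become two hyphens.
        System.out.println(encodeForUrl("a  b")); // a--b

        System.out.println(keepDigitsAndCollapse("  Hello,   World 42 ")); // hello-world-42
    }
}
```

The second call shows the main limitation for the question as asked: anything outside a-z disappears, so titles in non-Latin scripts would end up as an empty or nearly empty string.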
万劫不复 2024-09-13 22:19:06

I don't know of any standard way to do this. I've been using a solution similar to the one you are referring to. Not sure which one is better, so here you have it:

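import static java.lang.String.format;

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;
import java.util.regex.Pattern;

// Assumed dependencies: the answer does not name them. Transliterator matches ICU4J,
// and Assert.isTrue(boolean, String) matches Spring's utility class.
import com.ibm.icu.text.Transliterator;
import org.springframework.util.Assert;

public class TextUtils {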

private static final Pattern DIACRITICS_AND_FRIENDS =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

private static final Transliterator TO_LATIN_TRANSLITERATOR = Transliterator.getInstance("Any-Latin");

private static final Pattern EEQUIVALENTS = Pattern.compile("[ǝƏ]+");
private static final Pattern IEQUIVALENTS = Pattern.compile("[ı]+");
private static final Pattern DEQUIVALENTS = Pattern.compile("[Ððđ]+");
private static final Pattern OEQUIVALENTS = Pattern.compile("[Øø]+");
private static final Pattern LEQUIVALENTS = Pattern.compile("[Ł]+");

// all whitespace, non-ASCII and punctuation characters (plus '+'), except '_'
private static final Pattern CRAP = Pattern.compile("[\\p{IsSpace}\\P{IsASCII}\\p{IsP}\\+&&[^_]]");
private static final Pattern SEPARATORS = Pattern.compile("[\\p{IsSpace}/`-]");

private static final Pattern URLFRIENDLY = Pattern.compile("([a-zA-Z0-9_])*");
// US-ASCII, not ISO-8859-1: a Latin-1 encoder would also accept characters such as 'é'.
private static final CharsetEncoder ASCII_ENCODER = Charset.forName("US-ASCII").newEncoder();

/**
 * Returns true when the input text contains only characters from the ASCII set, false otherwise.
 */
public static boolean isPureAscii(String text) {
    return ASCII_ENCODER.canEncode(text);
}

/**
 * Replaces all characters that normalize into two characters with their base symbol (e.g. ü -> u)
 */
public static String replaceCombiningDiacriticalMarks(String text) {
    return DIACRITICS_AND_FRIENDS.matcher(Normalizer.normalize(text, Normalizer.Form.NFKD)).replaceAll("");
}

/**
 * Turns the input string into a url friendly variant (containing only alphanumeric characters and '_';
 * '-' is treated as a separator and becomes '_').
 * If the input string cannot be converted an IllegalArgumentException is thrown.
 */
public static String urlFriendlyStrict(String unfriendlyString) throws IllegalArgumentException {
    String friendlyString =
            urlFriendly(unfriendlyString);

    //Assert can be removed to improve performance
    Assert.isTrue(URLFRIENDLY.matcher(friendlyString).matches(),
            format("Friendly string [%s] based on [%s] is not friendly enough", friendlyString, unfriendlyString));
    return friendlyString;
}

/**
 * Turns the input string into a url friendly variant (intended to contain only alphanumeric characters and '_').
 * Use {@link #urlFriendlyStrict(String)} to avoid potential bugs in this code.
 */
private static String urlFriendly(String unfriendlyString) {
    return removeCrappyCharacters(
            replaceEquivalentsOfSymbols(
                    replaceCombiningDiacriticalMarks(
                            transLiterateSymbols(
                                    replaceSeparatorsWithUnderscores(
                                            unfriendlyString.trim()))))).toLowerCase();
}

private static String transLiterateSymbols(String incomprehensibleString) {
    String latin = TO_LATIN_TRANSLITERATOR.transform(incomprehensibleString);
    return latin;
}

private static String replaceEquivalentsOfSymbols(String unfriendlyString) {
    return
            LEQUIVALENTS.matcher(
                    OEQUIVALENTS.matcher(
                            DEQUIVALENTS.matcher(
                                    IEQUIVALENTS.matcher(
                                            EEQUIVALENTS.matcher(unfriendlyString).replaceAll("e"))
                                            .replaceAll("i"))
                                    .replaceAll("d"))
                            .replaceAll("o"))
                    .replaceAll("l");
}

private static String removeCrappyCharacters(String unfriendlyString) {
    return CRAP.matcher(unfriendlyString).replaceAll("");
}

private static String replaceSeparatorsWithUnderscores(String unfriendlyString) {
    return SEPARATORS.matcher(unfriendlyString).replaceAll("_");
}

}
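
The diacritics step in the class above can be tried on its own with nothing but the JDK. The following is a minimal sketch under that assumption: `DiacriticsDemo` and `stripDiacritics` are hypothetical names, and the pattern is a reduced version of `DIACRITICS_AND_FRIENDS` that only covers the combining diacritical marks block. Inputs are written as Unicode escapes to avoid depending on the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating NFKD decomposition followed by mark removal.
final class DiacriticsDemo {

    // Matches combining marks in U+0300..U+036F, e.g. the acute accent left over from 'é'.
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    static String stripDiacritics(String text) {
        // NFKD splits 'é' into 'e' + U+0301 and expands compatibility forms such as the 'fi' ligature.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // "Crème brûlée" becomes "Creme brulee".
        System.out.println(stripDiacritics("Cr\u00E8me br\u00FBl\u00E9e"));

        // The ligature U+FB01 followed by "n" becomes "fin".
        System.out.println(stripDiacritics("\uFB01n"));

        // 'Ø' (U+00D8) has no decomposition, so "Øresund" comes back unchanged.
        System.out.println(stripDiacritics("\u00D8resund").equals("\u00D8resund")); // true
    }
}
```

The last case is why the class carries the explicit EEQUIVALENTS, DEQUIVALENTS, OEQUIVALENTS and similar patterns: letters such as Ø, đ, Ł and ı do not decompose into a base letter plus a mark, so normalization alone leaves them in place.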

请爱~陌生人 2024-09-13 22:19:06

I would say URLEncoder.encode is the way to go. All non-URL chars are mapped, and you surely don't want to reinvent the wheel (again and again and again).
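
For reference, this is roughly what URLEncoder.encode produces for the kinds of input mentioned in the question. It is a small sketch only: `UrlEncoderDemo` and its `encode` wrapper are hypothetical names, and UTF-8 is an assumed choice of charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class wrapping the JDK's URLEncoder with UTF-8.
final class UrlEncoderDemo {

    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Spaces become '+'.
        System.out.println(encode("seo friendly")); // seo+friendly

        // "café": the accented letter is percent-encoded as its UTF-8 bytes.
        System.out.println(encode("caf\u00E9")); // caf%C3%A9

        // The two Chinese characters U+4E2D U+6587 become six percent-encoded bytes.
        System.out.println(encode("\u4E2D\u6587")); // %E4%B8%AD%E6%96%87
    }
}
```

The output is always safe to put in a URL, but it is not very readable, which is the trade-off against the transliteration-based answers in this thread.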
