Does Java's toLowerCase() preserve the original string length?

Posted on 2024-08-23 12:56:32

Assume two Java String objects:

String str = "<my string>";
String strLower = str.toLowerCase();

Is it then true that for every value of <my string> the expression

str.length() == strLower.length()

evaluates to true?

So, does String.toLowerCase() preserve original string length for any value of String?

Comments (3)

夜光 2024-08-30 12:56:32

Surprisingly it does not!!

From the Java docs for toLowerCase:

Converts all of the characters in this String to lower case using the rules of the given Locale. Case mapping is based on the Unicode Standard version specified by the Character class. Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String.

Example:

package com.stackoverflow.q2357315;

import java.util.Locale;

public class Test {
    public static void main(String[] args) throws Exception {
        Locale.setDefault(new Locale("lt"));
        String s = "\u00cc";
        System.out.println(s + " (" + s.length() + ")"); // Ì (1)
        s = s.toLowerCase();
        System.out.println(s + " (" + s.length() + ")"); // i̇̀ (3)
    }
}
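
The example above changes the JVM-wide default locale. As a minimal sketch of the same effect without touching the default, the locale can be passed to toLowerCase(Locale) directly. The class name ExplicitLocaleLowerCase and the helper lowerLength are made up for this illustration and are not part of the original answer.

```java
import java.util.Locale;

class ExplicitLocaleLowerCase {
    // Hypothetical helper: lowercases s under the given locale and
    // returns the length of the result in UTF-16 chars.
    static int lowerLength(String s, Locale locale) {
        return s.toLowerCase(locale).length();
    }

    public static void main(String[] args) {
        String s = "\u00cc"; // LATIN CAPITAL LETTER I WITH GRAVE
        Locale lithuanian = Locale.forLanguageTag("lt");

        System.out.println("original length:    " + s.length());
        // Lithuanian casing rules keep the dot, so the result grows.
        System.out.println("lower (lt) length:   " + lowerLength(s, lithuanian));
        // Under the locale-neutral ROOT rules the mapping is one-to-one.
        System.out.println("lower (ROOT) length: " + lowerLength(s, Locale.ROOT));
    }
}
```

The same input therefore gives different lengths depending on which locale is in effect, which is why calling the no-argument toLowerCase() makes the result depend on the machine's settings.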
时光礼记 2024-08-30 12:56:32

First of all, I'd like to point out that I absolutely agree with the (currently highest-rated) answer of @codaddict.

But I wanted to do an experiment, so here it is:

It's not a formal proof, but this code ran for me without ever reaching the inside of the if (using JDK 1.6.0 Update 16 on Ubuntu):

Edit: Here's some updated code that handles Locales as well:

import java.util.Locale;

public class ToLowerTester {
    public final Locale locale;

    public ToLowerTester(final Locale locale) {
        this.locale = locale;
    }

    public String findFirstStrangeTwoLetterCombination() {
        char[] b = new char[2];
        for (char c1 = 0; c1 < Character.MAX_VALUE; c1++) {
            b[0] = c1;
            for (char c2 = 0; c2 < Character.MAX_VALUE; c2++) {
                b[1] = c2;
                final String string = new String(b);
                String lower = string.toLowerCase(locale);
                if (string.length() != lower.length()) {
                    return string;
                }
            }
        }
        return null;
    }

    public static void main(final String[] args) {
        Locale[] locales;
        if (args.length != 0) {
            locales = new Locale[args.length];
            for (int i = 0; i < args.length; i++) {
                locales[i] = new Locale(args[i]);
            }
        } else {
            locales = Locale.getAvailableLocales();
        }
        for (Locale locale : locales) {
            System.out.println("Testing " + locale + "...");
            String result = new ToLowerTester(locale).findFirstStrangeTwoLetterCombination();
            if (result != null) {
                String lower = result.toLowerCase(locale);
                System.out.println("Found strange two letter combination for locale "
                    + locale + ": <" + result + "> (" + result.length() + ") -> <"
                    + lower + "> (" + lower.length() + ")");
            }
        }
    }
}

Running that code with the locale names mentioned in the accepted answer will print some examples. Running it without an argument will try all available locales (and take quite a while!).

It's not exhaustive, because theoretically there could be longer Strings that behave differently, but it's a good first approximation.

Also note that many of the two-character combinations produced this way are probably invalid UTF-16 (for example unpaired surrogates), so the fact that nothing explodes in this code can only be credited to a very robust String API in Java.

And last but not least: even if the assumption is true for the current implementation of Java, that can easily change once future versions of Java implement future versions of the Unicode standard, in which the rules for new characters may introduce situations where this no longer holds true.

So depending on this is still a pretty bad idea.
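
The two-char brute force above takes a long time per locale. A cheaper variation, sketched here under my own assumptions rather than taken from the answer, is to test every single Unicode code point on its own. The class SingleCodePointScan and its method name are hypothetical. It cannot catch context-dependent rules that only fire inside longer strings, so it is a narrower check, not a replacement.

```java
import java.util.Locale;

class SingleCodePointScan {
    // Returns the first code point whose lowercase form has a different
    // length (in UTF-16 chars) than the original, or -1 if none exists.
    static int firstLengthChangingCodePoint(Locale locale) {
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            // Lone surrogates are not valid text, so skip them.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            String s = new String(Character.toChars(cp));
            if (s.length() != s.toLowerCase(locale).length()) {
                return cp;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int cp = firstLengthChangingCodePoint(Locale.ROOT);
        if (cp < 0) {
            System.out.println("No single code point changes length.");
        } else {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X: %d -> %d chars%n",
                    cp, s.length(), s.toLowerCase(Locale.ROOT).length());
        }
    }
}
```

One known case it should report on current JDKs is U+0130 (capital I with dot above), which lowercases to "i" plus a combining dot outside the Turkish and Azeri locales.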

盗梦空间 2024-08-30 12:56:32

Also remember that toUpperCase() does not preserve the length either. Example: “straße” becomes “STRASSE” for the German locale. So you're more or less screwed if you're working with case-sensitive strings and you need to store the index for something.

Edit:

Here is an example for @julaine. It is ugly old-style code, but I am no longer actively coding in Java.

import java.util.Locale;

class Uppercase {
  public static void main(String[] args) {
    Locale.setDefault(new Locale.Builder().setLanguage("de").setRegion("DE").build());
    String str = "straße";
    System.out.println(str.toUpperCase());
  }
}

If you run it, you will see that I am right:

$ javac Uppercase.java
$ java -cp . Uppercase
STRASSE

The reason is simply that until 2017 (or so), the only official capitalization of ß was SS (or the rarer SZ). No dictionary is needed in the locale to handle that :-)
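
To make the index problem concrete, here is a small sketch (not from the original answer) showing how an index found in the upper-cased copy no longer points at the same place in the original. It also shows one possible workaround: searching the original string with String.regionMatches(ignoreCase = true, ...), which compares char by char and so never changes lengths. The class CaseInsensitiveIndex and the helper indexOfIgnoreCase are names invented for this illustration.

```java
import java.util.Locale;

class CaseInsensitiveIndex {
    // Hypothetical helper: finds needle in haystack ignoring case and
    // returns an index that is valid in the ORIGINAL haystack, or -1.
    static int indexOfIgnoreCase(String haystack, String needle) {
        int max = haystack.length() - needle.length();
        for (int i = 0; i <= max; i++) {
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String str = "stra\u00dfe 12"; // "straße 12"
        String upper = str.toUpperCase(Locale.ROOT); // "STRASSE 12"

        // The same substring sits at different offsets in the two strings.
        System.out.println("index in original:   " + str.indexOf("12"));
        System.out.println("index in upper-case: " + upper.indexOf("12"));

        // Searching the original directly keeps indices meaningful.
        System.out.println("ignore-case search:  " + indexOfIgnoreCase(str, "E 12"));
    }
}
```

The trade-off is that a char-by-char comparison will not treat "STRASSE" and "straße" as equal, so this only helps when one-to-one case folding is acceptable.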
