Android: problem comparing file names that contain Japanese characters
I'm trying to match a search string against file names during a recursive directory search on Android. The problem is that the characters are Japanese, and in some cases the match fails. For example, the search string I'm trying to match against the start of the file name is "呼ぶ". When I print the file names from file.getName(), they look right, e.g. the file name printed to the console starts with "呼ぶ". But when I test against the search string, e.g. fileName.startsWith("呼ぶ"), it doesn't match.
It turns out that when I print the substring of the file name being searched, the second character is different – the word is “呼ふ” instead of “呼ぶ”. If I extract the bytes and print the hex characters, the last byte is off by 1 – presumably the difference between “ぶ” and “ふ”.
Here is the code used to show the difference:
String name = soundFile.getName();
String string1 = question.kanji;
Log.d(TAG, "searching for : s1:" + question.kanji + " + " + question.hiragana + " + " + question.english);
Log.d(TAG, "name is: " + name);
Log.d(TAG, "question.kanaji.length(): " + question.kanji.length());
Log.d(TAG, "question.hiragana.length(): " + question.hiragana.length());
String compareStart = name.substring(0, string1.length() );
Log.d(TAG, "string1.length(): " + string1.length());
Log.d(TAG, "compareStart.length(): " + compareStart.length());
// getBytes() with no argument uses the platform default charset (UTF-8 on Android).
byte[] nameUTF8 = name.getBytes();
byte[] s1UTF8 = string1.getBytes();
byte[] csUTF8 = compareStart.getBytes();
// Note: this line logs s1UTF8.length under the nameUTF8 label, which is why the output reports 6 for it.
Log.d(TAG, "nameUTF8.length: " + s1UTF8.length);
Log.d(TAG, "s1UTF8.length: " + s1UTF8.length);
Log.d(TAG, "csUTF8.length: " + csUTF8.length);
// Dump each byte array as hex, one byte per log line.
for (int i = 0; i < s1UTF8.length; i++) {
    Log.d(TAG, "s1UTF8[i]: " + Integer.toString(s1UTF8[i] & 0xff, 16).toUpperCase());
}
for (int i = 0; i < csUTF8.length; i++) {
    Log.d(TAG, "csUTF8[i]: " + Integer.toString(csUTF8[i] & 0xff, 16).toUpperCase());
}
for (int i = 0; i < nameUTF8.length; i++) {
    Log.d(TAG, "nameUTF8[i]: " + Integer.toString(nameUTF8[i] & 0xff, 16).toUpperCase());
}
The partial output is as follows:
D/AnswerView(12078): searching for : s1:呼ぶ + よぶ + to call out,to invite
D/AnswerView(12078): name is: 呼ぶ よぶ to call out,to invite.mp3
D/AnswerView(12078): question.kanaji.length(): 2
D/AnswerView(12078): question.hiragana.length(): 2
D/AnswerView(12078): string1: 呼ぶ
D/AnswerView(12078): compareStart: 呼ふ
D/AnswerView(12078): string1.length(): 2
D/AnswerView(12078): compareStart.length(): 2
D/AnswerView(12078): string1.length(): 2
D/AnswerView(12078): compareStart.length(): 2
D/AnswerView(12078): nameUTF8.length: 6
D/AnswerView(12078): s1UTF8.length: 6
D/AnswerView(12078): csUTF8.length: 6
D/AnswerView(12078): s1UTF8[i]: E5
D/AnswerView(12078): s1UTF8[i]: 91
D/AnswerView(12078): s1UTF8[i]: BC
D/AnswerView(12078): s1UTF8[i]: E3
D/AnswerView(12078): s1UTF8[i]: 81
D/AnswerView(12078): s1UTF8[i]: B6
D/AnswerView(12078): csUTF8[i]: E5
D/AnswerView(12078): csUTF8[i]: 91
D/AnswerView(12078): csUTF8[i]: BC
D/AnswerView(12078): csUTF8[i]: E3
D/AnswerView(12078): csUTF8[i]: 81
D/AnswerView(12078): csUTF8[i]: B5
D/AnswerView(12078): nameUTF8[i]: E5
D/AnswerView(12078): nameUTF8[i]: 91
D/AnswerView(12078): nameUTF8[i]: BC
D/AnswerView(12078): nameUTF8[i]: E3
D/AnswerView(12078): nameUTF8[i]: 81
D/AnswerView(12078): nameUTF8[i]: B5
D/AnswerView(12078): nameUTF8[i]: E3
D/AnswerView(12078): nameUTF8[i]: 82
D/AnswerView(12078): nameUTF8[i]: 99
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
This shows that the sixth byte of the substring extracted from the file name, and of the file name itself, is "B5" rather than the "B6" found in the search string. Yet the printed file name is displayed correctly. I'm stumped. Why is the file name displayed correctly on the console when the underlying characters are different? And why does the start of the file name contain three additional non-blank bytes (E3 82 99), which the search string somehow doesn't need in order to represent the "ぶ" character?
Comments (2)
The problem looks to be one of normalization forms. I know that on a Mac, for example, the filesystem is always in NFD. But the string you posted is in NFC. Watch:
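A minimal Java sketch of the difference between the two forms, using the standard java.text.Normalizer class. The class name NormalizationForms and the helper utf8Hex are hypothetical names chosen for illustration. The string literals use \u escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizationForms {
    // Hex dump of a string's UTF-8 bytes, separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String search = "\u547C\u3076"; // 呼ぶ with a precomposed ぶ (U+3076)
        String nfc = Normalizer.normalize(search, Normalizer.Form.NFC);
        String nfd = Normalizer.normalize(search, Normalizer.Form.NFD);

        // NFC keeps ぶ as one code point; NFD splits it into
        // ふ (U+3075) followed by the combining voiced sound mark (U+3099).
        System.out.println(nfc.length() + " chars: " + utf8Hex(nfc)); // 2 chars: E5 91 BC E3 81 B6
        System.out.println(nfd.length() + " chars: " + utf8Hex(nfd)); // 3 chars: E5 91 BC E3 81 B5 E3 82 99
        System.out.println(nfc.equals(nfd));                          // false
    }
}
```

The NFD bytes are the same nine bytes that open the nameUTF8 dump in the question. A text renderer draws ふ followed by U+3099 as ぶ, which would explain why the name looks right on the console.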
So I think you are going to have to think about converting to NFD.
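One way to apply this, sketched under the assumption that java.text.Normalizer is available on the target platform (it is part of standard Java, and to my knowledge Android provides it too). The class NormalizedMatch and the method startsWithNormalized are hypothetical names, not part of any library.

```java
import java.text.Normalizer;

class NormalizedMatch {
    // Bring both strings to the same normalization form before comparing.
    static boolean startsWithNormalized(String fileName, String prefix) {
        String n = Normalizer.normalize(fileName, Normalizer.Form.NFD);
        String p = Normalizer.normalize(prefix, Normalizer.Form.NFD);
        return n.startsWith(p);
    }

    public static void main(String[] args) {
        // File name as the file system returned it: ふ + U+3099 (decomposed).
        String fileName = "\u547C\u3075\u3099 \u3088\u3076 to call out,to invite.mp3";
        // Search string as typed: precomposed ぶ (U+3076).
        String search = "\u547C\u3076";

        System.out.println(fileName.startsWith(search));            // false
        System.out.println(startsWithNormalized(fileName, search)); // true
    }
}
```

What matters is that both sides end up in the same form. Normalizing both to NFC should work equally well, and it has the side benefit that a prefix ending in a bare ふ cannot match a name in which that ふ is followed by a combining mark.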
BTW, that U+547C CJK code point happens to be this from the Unihan database:
Here you are using a length taken from string1 to slice name. As Tom has pointed out, the strings are in different normalization forms, so their lengths need not coincide.
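The effect of slicing one string by the other's length can be sketched as follows. The class SliceLength and the method leadingSlice are hypothetical names for illustration, and the shortened file name is made up for the example.

```java
import java.text.Normalizer;

class SliceLength {
    // Take as many leading chars from name as the search string has.
    static String leadingSlice(String name, String search) {
        return name.substring(0, Math.min(search.length(), name.length()));
    }

    public static void main(String[] args) {
        String name = "\u547C\u3075\u3099 yobu.mp3"; // decomposed: 呼 + ふ + U+3099
        String search = "\u547C\u3076";              // precomposed: 呼 + ぶ

        // The search string has 2 chars, but 呼ぶ takes 3 chars in the name,
        // so the slice stops before the combining mark and yields 呼ふ.
        String slice = leadingSlice(name, search);
        System.out.println(slice.equals("\u547C\u3075")); // true
        System.out.println(slice.equals(search));         // false

        // After normalizing the name to the search string's form (NFC here),
        // the lengths line up and the slice equals the search string.
        String nfcName = Normalizer.normalize(name, Normalizer.Form.NFC);
        System.out.println(leadingSlice(nfcName, search).equals(search)); // true
    }
}
```

This matches the compareStart value in the question's log: the two-char slice ends at ふ (byte B5), and the dropped combining mark is the E3 82 99 that follows it in the full name.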