How to remove duplicate characters from a multibyte string in PHP?
Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abcgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
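The code that originally accompanied this answer did not survive on this page, so what follows is only a minimal sketch of the pattern as it is described in words. The function name collapse_runs is hypothetical, and the s modifier (so that the dot also matches newlines) is an added assumption. Note that a pattern of this shape only collapses adjacent repeats; a character that reappears later in the string is kept.

```php
<?php
// Hypothetical helper: collapse each run of a repeated character into one.
// (.)  captures any single character; \1+ matches one or more repeats of it.
// The u modifier makes PCRE treat the pattern and subject as UTF-8.
function collapse_runs(string $s): string
{
    // preg_replace returns null on failure (e.g. invalid UTF-8);
    // fall back to the untouched input in that case.
    return preg_replace('/(.)\1+/us', '$1', $s) ?? $s;
}

echo collapse_runs('漢漢漢字私私'), "\n"; // 漢字私
echo collapse_runs('abab'), "\n";         // abab (no adjacent repeats to collapse)
```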
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of PHP's ordered arrays: keys stay in the order in which they were first defined. Once we've gone through the string and identified all of the characters, we grab the keys and join 'em back together in the same order that they appeared in the string. You also get a per-character character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
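The original code for this approach is missing from this page. The steps described above could be sketched roughly like this; the function name unique_chars_with_counts is made up for illustration, and the string is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: walk the string one character at a time with
// mb_substr and use each character as an array key. PHP arrays keep
// keys in insertion order, so the keys end up in order of first appearance.
function unique_chars_with_counts(string $s): array
{
    $counts = [];
    $length = mb_strlen($s, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $char = mb_substr($s, $i, 1, 'UTF-8');
        // The value is the running count for this character.
        $counts[$char] = ($counts[$char] ?? 0) + 1;
    }
    return $counts;
}

$counts = unique_chars_with_counts('漢漢漢字漢字私私字私字漢字私漢字漢字私');
echo implode('', array_keys($counts)), "\n"; // 漢字私
```

The keys give the unique characters, and the values give the per-character counts mentioned above.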
Here, try this on for size:
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
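The function this paragraph refers to is not preserved here, so the following is a guess at its general shape and not the answerer's actual code. Both function names are hypothetical; splitting with preg_split('//u', ...) is one common way to get an array of UTF-8 characters.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function mb_chars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: characters that occur in $left but not in $right.
// array_diff is one-directional, hence the need to call this twice
// with the arguments swapped to see both sides.
function chars_missing_from(string $left, string $right): array
{
    $diff = array_diff(mb_chars($left), mb_chars($right));
    return array_values(array_unique($diff));
}

print_r(chars_missing_from('漢字私', '漢字')); // contains only 私
print_r(chars_missing_from('漢字', '漢字私')); // empty array
```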
Take a look at the iconv_strlen function in the PHP standard library. I can't speak for East Asian encodings, but it works fine for European and Eastern European languages. In any case, it gives you some freedom!
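As a small illustration of what iconv_strlen offers, here is a sketch comparing it with the byte-based strlen. It assumes the iconv extension is available and that the source file itself is saved as UTF-8.

```php
<?php
$s = '漢字私';

// strlen counts bytes; each of these kanji takes 3 bytes in UTF-8.
echo strlen($s), "\n";                // 9

// iconv_strlen counts characters in the given encoding.
echo iconv_strlen($s, 'UTF-8'), "\n"; // 3
```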
Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way.
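One caveat with this idea: str_split works on bytes, so it would cut multibyte UTF-8 characters into pieces. A sketch of the same split-then-deduplicate approach that stays multibyte-safe could use mb_str_split, which is available from PHP 7.4 onward. The function name unique_chars is hypothetical.

```php
<?php
// Hypothetical helper: split into UTF-8 characters, drop duplicates,
// and join the survivors back together. Requires PHP 7.4+ for mb_str_split.
function unique_chars(string $s): string
{
    // array_unique keeps the first occurrence of each value.
    return implode('', array_unique(mb_str_split($s, 1, 'UTF-8')));
}

echo unique_chars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
```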