How to remove duplicate characters from a multibyte string in PHP?

Posted on 2024-10-25 20:30:08

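Ugh. Does anyone know how to write a multibyte-aware equivalent of PHP's count_chars($string, 3)?

That is, it should return a list containing just one instance of each unique character. If this were English and we had

"aaabggxxyxzxxgggghq xcccxxxzxxyx",

it would return "abcgh qxyz" (note that the space is counted).

(The order isn't important in this case; it can be anything.)

If it's Japanese kanji (not sure whether all browsers will render this):

汉汉汉字私私字私字汉字私汉汉字私

it should return just the 3 kanji used:

汉字私

It needs to work on any UTF-8 encoded string.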

趴在窗边数星星i 2024-11-01 20:30:08

Hey Dave, you're never going to see this one coming.

php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc

What, you thought I was going to use mb_substr again?

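In regex-speak, it's looking for any one character, followed by one or more instances of that same character. The matched region is then replaced with the one character that matched. Note that this only collapses runs of adjacent duplicates, as the output shows; a character that reappears later in the string is kept.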

The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
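The same u modifier can also be used to split a string into characters, which gets closer to what the question asks for (one instance of every character, not just collapsed runs). Here's a minimal sketch, assuming valid UTF-8 input and PCRE built with Unicode support; the function name unique_chars_preg is made up for illustration and isn't part of the answer above.

```php
<?php
// Hypothetical helper: unique characters of a UTF-8 string, in order of first appearance.
function unique_chars_preg(string $input): string {
    // An empty pattern with the u modifier splits between every code point.
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // With the u modifier, preg_split fails on malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    // array_unique keeps the first occurrence of each value.
    return implode('', array_unique($chars));
}

echo unique_chars_preg('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
echo unique_chars_preg('aaabcccbbc'), "\n"; // abc
```

This avoids mbstring entirely, at the cost of building the full character array in memory.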

Hey, guess what!

$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
    $char = mb_substr($not_kanji, $i, 1);
    if(!array_key_exists($char, $unique))
        $unique[$char] = 0;
    $unique[$char]++;
}
echo join('', array_keys($unique));

This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract one character at a time. We then use that character as a key in an array. We're taking advantage of the fact that PHP arrays are ordered: keys stay in the order in which they were first defined. Once we've gone through the string and identified all of the characters, we grab the keys and join them back together in the order in which they first appeared in the string. You also get a per-character count from this technique.

This would have been much easier if there was such a thing as mb_str_split to go along with str_split.

(No Kanji example here, I'm experiencing a copy/paste bug.)
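For what it's worth, PHP 7.4 and later do ship mb_str_split in the mbstring extension, so the loop can be written more compactly there. This is a rough sketch assuming PHP 7.4+ with mbstring and valid UTF-8 input; unique_chars_with_counts is a made-up name, and array_count_values stands in for the manual counting.

```php
<?php
// Hypothetical helper: per-character counts, keyed in order of first appearance.
// Assumes PHP 7.4+ (mb_str_split) and a valid UTF-8 input string.
function unique_chars_with_counts(string $input): array {
    $chars = mb_str_split($input, 1, 'UTF-8');
    // array_count_values adds each key the first time its value is seen.
    return array_count_values($chars);
}

$counts = unique_chars_with_counts('漢漢漢字漢字私私字私字漢字私漢字漢字私');
// Digit characters end up as integer keys; implode casts them back to strings.
echo implode('', array_keys($counts)), "\n"; // 漢字私
print_r($counts);
```

As with the loop, the keys give you the unique characters and the values give you the counts.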

Here, try this on for size:

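// Count each character of a multibyte string, keyed by character
// in order of first appearance.
function mb_count_chars_kinda($input) {
    $l = mb_strlen($input);
    $unique = array();
    for($i = 0; $i < $l; $i++) {
        $char = mb_substr($input, $i, 1);
        if(!array_key_exists($char, $unique))
            $unique[$char] = 0;
        $unique[$char]++;
    }
    return $unique;
}

// Return the characters used in $one that do not appear in $two.
// array_diff keeps the keys of the left-hand array.
function mb_string_chars_diff($one, $two) {
    $left = array_keys(mb_count_chars_kinda($one));
    $right = array_keys(mb_count_chars_kinda($two));
    return array_diff($left, $right);
}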

print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* => 
Array
(
    [5] => f
    [6] => g
)
*/

You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
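The "call it twice" step can be wrapped up in one function. The following is only a sketch under a few assumptions: PHP 7.4+ with mbstring, valid UTF-8 input, and a made-up function name and result layout. It uses mb_str_split and array_unique rather than the helpers above so that it stands on its own.

```php
<?php
// Hypothetical helper: characters that appear in only one of the two strings.
// Assumes PHP 7.4+ (mb_str_split) and valid UTF-8 input.
function mb_string_chars_symmetric_diff(string $one, string $two): array {
    $left = array_unique(mb_str_split($one, 1, 'UTF-8'));
    $right = array_unique(mb_str_split($two, 1, 'UTF-8'));
    return [
        // array_values reindexes, since array_diff preserves the original keys.
        'only_in_first' => array_values(array_diff($left, $right)),
        'only_in_second' => array_values(array_diff($right, $left)),
    ];
}

print_r(mb_string_chars_symmetric_diff('aabbccddeeffgg', 'abcdehh'));
```

Here 'only_in_first' holds f and g, and 'only_in_second' holds h.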


孤千羽 2024-11-01 20:30:08

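Try looking at the iconv_strlen function from the PHP standard library. I can't speak for East Asian encodings, but it works for European and Eastern European languages. Either way, it gives you some freedom!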
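One way the iconv functions could be applied to the question is sketched below. It assumes the iconv extension is available and that the encoding is passed explicitly; unique_chars_iconv is a made-up name, and pairing iconv_strlen with iconv_substr is my own guess at what was intended, not something the answer spells out.

```php
<?php
// Hypothetical helper using iconv instead of mbstring.
function unique_chars_iconv(string $input, string $encoding = 'UTF-8'): string {
    $length = iconv_strlen($input, $encoding);
    if ($length === false) {
        // iconv_strlen returns false when the input is not valid in $encoding.
        throw new InvalidArgumentException("Input is not valid $encoding");
    }
    $seen = [];
    for ($i = 0; $i < $length; $i++) {
        // Extract one character (not one byte) at offset $i.
        $char = iconv_substr($input, $i, 1, $encoding);
        $seen[$char] = true;
    }
    // Keys are in order of first appearance.
    return implode('', array_keys($seen));
}

echo unique_chars_iconv('aaabcccbbc'), "\n"; // abc
```

With 'UTF-8' given as the encoding, the offsets count characters rather than bytes, so this should also cope with kanji input.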


妄想挽回 2024-11-01 20:30:08

$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);

Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way.
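One caveat for the multibyte case in the question: str_split works on bytes, so it cuts UTF-8 characters into pieces. Here's a small sketch of the difference, assuming PHP 7.4+ with mbstring for mb_str_split and a source file saved as UTF-8; the sample string is made up.

```php
<?php
$name = '漢字漢';

// str_split yields one element per byte, not per character.
echo count(str_split($name)), " elements from str_split\n";

// mb_str_split (PHP 7.4+, mbstring) yields one element per character.
$chars = mb_str_split($name, 1, 'UTF-8');
echo count($chars), " elements from mb_str_split\n";

// The same array_unique step then works as intended.
$unique = array_unique($chars);
echo implode('', $unique), "\n"; // 漢字
```

For plain ASCII input such as "My string" the two functions give the same result.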
