Multibyte-safe way to find the unique characters in a string

Posted on 2024-10-26 15:36:37

I have a problem that I thought would be simple but it's turning out to be quite complex.

I have a long UTF-8 string that is a mix of Roman, Western-European, Japanese, and Korean characters and punctuation. Many are multibyte chars, but some (I think) are not.

I need to do 2 things:

  1. Make sure there are no duplicate chars (and output that new string, stripped of dupes).
  2. Randomly shuffle that new string.

(Sorry, I can't seem to get the code quoting to format right...)

function uniquechars($string) {
    $l = mb_strlen($string);
    $unique = array();
    for($i = 0; $i < $l; $i++) {
        $char = mb_substr($string, $i, 1);
        if(!array_key_exists($char, $unique))
            $unique[$char] = 0;
        $unique[$char]++;
    }
    $uniquekeys = join('', array_keys($unique));
    return $uniquekeys;
}  

and:

function unicode_shuffle($string)
{
    $len = mb_strlen($string);
    $sploded = array(); 
    while($len-- > 0) { 
        $sploded[] = mb_substr($string, $len, 1);
    }
    shuffle($sploded);
    $shuffled = join('', $sploded);
    return $shuffled;
}

Using those two functions, which someone very helpfully provided, I THOUGHT I was all set...except that curiously, it seems like the Unique string (no duplicates) and the Shuffled string do not contain the same number of characters. (I am highlighting these chars from my browser and then cutting-and-pasting into another application...one string is always a different length than the one above, but often it varies...it's not even the same number of chars getting truncated each time!).

I'm sorry I don't know enough about PHP nor about coding to sleuth this myself but what on earth is going wrong here? It seems like it should be easy to just shuffle a big long string, but apparently it's much harder than I thought. Is there maybe another, easier way to do this? Should I convert the string first into respective hex numbers and shuffle those, then convert back to UTF-8? Should I output to a file rather than the screen?

Anyone out there have suggestions? I'm sorry, I'm very new to this, so possibly I'm just doing something really dumb.

Comments (2)

水溶 2024-11-02 15:36:37

You can probably do things a lot simpler.

Here's a function to get only the unique characters in a string:

// returns an array of unique characters from a given string
function getUnique( $string ) {

    $chars = preg_split( '//', $string, -1, PREG_SPLIT_NO_EMPTY );
    $unique = array_unique( $chars );

    return $unique;

}

Then, if you want to reshuffle the order, just pass the array of unique chars to shuffle:

shuffle( $unique ); // shuffles in place and returns a bool, not the array
$shuffled = implode( '', $unique );

Edit: For multi-byte characters, this function should do the trick (thanks to http://php.net/manual/en/function.mb-split.php for helping with the regex):

function getUnique( $string ) {

    $chars = preg_split( '/(?<!^)(?!$)/u', $string ); 
    $unique = array_unique( $chars );

    return $unique;

}
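
The steps in this answer can be combined into one helper that returns a string. This is only a minimal sketch: the function name `uniqueShuffledString` and the sample input are made up for illustration, and it assumes the input is valid UTF-8. It also counts characters with `mb_strlen()` and an explicit encoding, which is more reliable than copying text out of a browser.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the unique
// characters of a UTF-8 string in random order, joined back into a string.
function uniqueShuffledString(string $string): string
{
    // Split on character boundaries; the /u modifier makes PCRE treat the
    // subject as UTF-8 instead of raw bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // With /u, preg_split() fails when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // array_unique() leaves gaps in the keys; array_values() reindexes.
    $unique = array_values(array_unique($chars));

    shuffle($unique); // shuffles in place; the return value is a bool

    return implode('', $unique);
}

$input    = 'aあ한ßaあ한ß§';
$shuffled = uniqueShuffledString($input);

// Count characters in PHP, with the encoding stated explicitly.
echo mb_strlen($shuffled, 'UTF-8'), "\n"; // prints 5
```

If the character counts still differ between the unique and the shuffled string, one thing worth checking is whether the `mb_*` calls in the question run with UTF-8 as their encoding, since they fall back to the configured internal encoding when none is passed.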
差↓一点笑了 2024-11-02 15:36:37

If you didn't need to shuffle the characters, you could remove all duplicated characters in a single pass using a slightly more laborious pattern with a lookahead for a duplicate.

To shuffle the characters, split the string into an array of individual characters, call array_unique() on that array, then pass the result to shuffle(). The shuffling part may not be useful to other developers, but note that shuffle() works in place and returns a boolean (not the shuffled array), so don't bother assigning its return value to a variable.

Removing dupe chars from a string:

$str = 'ăāæåߧśšşçæåߧś';

var_export(
    preg_replace(
        '/(.)(?=.*\1)/u', // drop a character if the same one appears later
        '',
        $str
    )
);
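
One difference between the two approaches is which occurrence survives. The short sketch below (the sample string is made up for illustration) suggests that the lookahead pattern keeps the last occurrence of each character, while `array_unique()` keeps the first. Also note that `.` does not match newlines unless the `s` modifier is added.

```php
<?php
$str = 'あいうあ';

// The lookahead deletes a character whenever the same character appears
// again later, so the LAST occurrence of each character survives.
$keepLast = preg_replace('/(.)(?=.*\1)/u', '', $str);

// array_unique() keeps the FIRST occurrence of each value.
$keepFirst = implode('', array_unique(
    preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
));

var_export([$keepLast, $keepFirst]); // 'いうあ' and 'あいう'
```

For a string that is about to be shuffled the order does not matter, so either variant should do.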

Split, remove dupes, shuffle:

$str = 'ăāæåߧśšşçæåߧś';

$unique = array_unique(
    preg_split(
        '//u',
        $str,
        0,
        PREG_SPLIT_NO_EMPTY
    )
);

shuffle($unique); 

var_export($unique);

I assume that mb_str_split() would also be safe to split whole characters, but I don't know if there are any fringe concerns with encodings.
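
For reference, a minimal sketch of the `mb_str_split()` variant. It assumes PHP 7.4 or later (where the function was added) and valid UTF-8 input; how it behaves on malformed input is not checked here. The sample string is made up for illustration.

```php
<?php
$str = 'あい한ßあ한';

// Passing the encoding explicitly avoids depending on whatever
// mb_internal_encoding() happens to be set to.
$chars = mb_str_split($str, 1, 'UTF-8');

// Drop duplicates and reindex the array.
$unique = array_values(array_unique($chars));

shuffle($unique); // in place; returns a bool

echo count($chars), ' -> ', count($unique), "\n"; // prints "6 -> 4"
echo implode('', $unique), "\n";                  // the 4 unique characters in random order
```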
