如何反转 Unicode 字符串

发布于 2024-07-11 19:34:19 字数 1768 浏览 5 评论 0原文

对该问题的答案的评论中暗示了这一点PHP 无法反转 Unicode 字符串。

至于 Unicode,它可以在 PHP 中使用 因为大多数应用程序将其处理为 二进制。 是的,PHP 是 8 位干净的。 尝试 PHP 中的等价物:perl -Mutf8 -e '打印标量反向("ほげほげ")' 你会得到垃圾, 不是“げほげほ”。 –jrockway

不幸的是,PHP 的 unicode 支持 atm 充其量是“缺乏”,这是正确的。 这将 希望彻底改变PHP6

PHP MultiByte 函数 确实提供了处理 unicode 所需的基本功能,但是不一致,确实缺少很多功能。 其中之一是反转字符串的函数。

我当然想反转这段文字,没有其他原因,然后弄清楚它是否可能。 我创建了一个函数来完成反转 Unicode 文本这一巨大复杂的任务,因此您可以在 PHP6 之前放松一点。

测试代码:

$enc = 'UTF-8';
$text = "ほげほげ";
$defaultEnc = mb_internal_encoding();

echo "Showing results with encoding $defaultEnc.\n\n";

$revNormal = strrev($text);
$revInt = mb_strrev($text);
$revEnc = mb_strrev($text, $enc);

echo "Original text is: $text .\n";
echo "Normal strrev output: " . $revNormal . ".\n";
echo "mb_strrev without encoding output: $revInt.\n";
echo "mb_strrev with encoding $enc output: $revEnc.\n";

if (mb_internal_encoding($enc)) {
    echo "\nSetting internal encoding to $enc from $defaultEnc.\n\n";
    
    $revNormal = strrev($text);
    $revInt = mb_strrev($text);
    $revEnc = mb_strrev($text, $enc);
    
    echo "Original text is: $text .\n";
    echo "Normal strrev output: " . $revNormal . ".\n";
    echo "mb_strrev without encoding output: $revInt.\n";
    echo "mb_strrev with encoding $enc output: $revEnc.\n";
    
} else {
    echo "\nCould not set internal encoding to $enc!\n";
}

It was hinted in a comment to an answer to this question that PHP can not reverse Unicode strings.

As for Unicode, it works in PHP
because most apps process it as
binary. Yes, PHP is 8-bit clean. Try
the equivalent of this in PHP: perl
-Mutf8 -e 'print scalar reverse("ほげほげ")' You will get garbage,
not "げほげほ". – jrockway

And unfortunately it is correct that PHPs unicode support atm is at best "lacking". This will hopefully change drastically with PHP6.

PHPs MultiByte functions does provide the basic functionality you need to deal with unicode, but it is inconsistent and does lack a lot of functions. One of these is a function to reverse a string.

I of course wanted to reverse this text for no other reason then to figure out if it was possible. And I made a function to accomplish this enormous complex task of reversing this Unicode text, so you can relax a bit longer until PHP6.

Test code:

$enc = 'UTF-8';
$text = "ほげほげ";
$defaultEnc = mb_internal_encoding();

echo "Showing results with encoding $defaultEnc.\n\n";

$revNormal = strrev($text);
$revInt = mb_strrev($text);
$revEnc = mb_strrev($text, $enc);

echo "Original text is: $text .\n";
echo "Normal strrev output: " . $revNormal . ".\n";
echo "mb_strrev without encoding output: $revInt.\n";
echo "mb_strrev with encoding $enc output: $revEnc.\n";

if (mb_internal_encoding($enc)) {
    echo "\nSetting internal encoding to $enc from $defaultEnc.\n\n";
    
    $revNormal = strrev($text);
    $revInt = mb_strrev($text);
    $revEnc = mb_strrev($text, $enc);
    
    echo "Original text is: $text .\n";
    echo "Normal strrev output: " . $revNormal . ".\n";
    echo "mb_strrev without encoding output: $revInt.\n";
    echo "mb_strrev with encoding $enc output: $revEnc.\n";
    
} else {
    echo "\nCould not set internal encoding to $enc!\n";
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

爱情眠于流年 2024-07-18 19:34:19

这是使用正则表达式的另一种方法:

function utf8_strrev($str){
 preg_match_all('/./us', $str, $ar);
 return implode(array_reverse($ar[0]));
}

here's another approach using regex:

function utf8_strrev($str){
 preg_match_all('/./us', $str, $ar);
 return implode(array_reverse($ar[0]));
}
鱼窥荷 2024-07-18 19:34:19

Grapheme 函数比 mbstring 和 PCRE 函数更正确地处理 UTF-8 字符串/Mbstring 和 PCRE 可能会破坏字符。 您可以通过执行以下代码来查看它们之间的差异。

function str_to_array($string)
{
    $length = grapheme_strlen($string);
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {

        $ret[] = grapheme_substr($string, $i, 1);
    }

    return $ret;
}

function str_to_array2($string)
{
    $length = mb_strlen($string, "UTF-8");
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {

    $ret[] = mb_substr($string, $i, 1, "UTF-8");
}

    return $ret;
}

function str_to_array3($string)
{
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

function utf8_strrev($string)
{
    return implode(array_reverse(str_to_array($string)));
}

function utf8_strrev2($string)
{
    return implode(array_reverse(str_to_array2($string)));
}

function utf8_strrev3($string)
{
    return implode(array_reverse(str_to_array3($string)));
}

// http://www.php.net/manual/en/function.grapheme-strlen.php
$string = "a\xCC\x8A"  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5)
         ."o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS'  (U+00F6)

var_dump(array_map(function($elem) { return strtoupper(bin2hex($elem)); },
[
  'should be' => "o\xCC\x88"."a\xCC\x8A",
  'grapheme' => utf8_strrev($string),
  'mbstring' => utf8_strrev2($string),
  'pcre' => utf8_strrev3($string)
]));

结果就在这里。

array(4) {
  ["should be"]=>
  string(12) "6FCC8861CC8A"
  ["grapheme"]=>
  string(12) "6FCC8861CC8A"
  ["mbstring"]=>
  string(12) "CC886FCC8A61"
  ["pcre"]=>
  string(12) "CC886FCC8A61"
}

从 PHP 5.5 (intl 3.0) 开始可以使用 IntlBreakIterator;

function utf8_strrev($str)
{
    $it = IntlBreakIterator::createCodePointInstance();
    $it->setText($str);

    $ret = '';
    $pos = 0;
    $prev = 0;

    foreach ($it as $pos) {
        $ret = substr($str, $prev, $pos - $prev) . $ret;
        $prev = $pos;
    }

    return $ret;  
}

Grapheme functions handle UTF-8 string more correctly than mbstring and PCRE functions/ Mbstring and PCRE may break characters. You can see the defference between them by executing the following code.

function str_to_array($string)
{
    $length = grapheme_strlen($string);
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {

        $ret[] = grapheme_substr($string, $i, 1);
    }

    return $ret;
}

function str_to_array2($string)
{
    $length = mb_strlen($string, "UTF-8");
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {

    $ret[] = mb_substr($string, $i, 1, "UTF-8");
}

    return $ret;
}

function str_to_array3($string)
{
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

function utf8_strrev($string)
{
    return implode(array_reverse(str_to_array($string)));
}

function utf8_strrev2($string)
{
    return implode(array_reverse(str_to_array2($string)));
}

function utf8_strrev3($string)
{
    return implode(array_reverse(str_to_array3($string)));
}

// http://www.php.net/manual/en/function.grapheme-strlen.php
$string = "a\xCC\x8A"  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5)
         ."o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS'  (U+00F6)

var_dump(array_map(function($elem) { return strtoupper(bin2hex($elem)); },
[
  'should be' => "o\xCC\x88"."a\xCC\x8A",
  'grapheme' => utf8_strrev($string),
  'mbstring' => utf8_strrev2($string),
  'pcre' => utf8_strrev3($string)
]));

The result is here.

array(4) {
  ["should be"]=>
  string(12) "6FCC8861CC8A"
  ["grapheme"]=>
  string(12) "6FCC8861CC8A"
  ["mbstring"]=>
  string(12) "CC886FCC8A61"
  ["pcre"]=>
  string(12) "CC886FCC8A61"
}

IntlBreakIterator can be used since PHP 5.5 (intl 3.0);

function utf8_strrev($str)
{
    $it = IntlBreakIterator::createCodePointInstance();
    $it->setText($str);

    $ret = '';
    $pos = 0;
    $prev = 0;

    foreach ($it as $pos) {
        $ret = substr($str, $prev, $pos - $prev) . $ret;
        $prev = $pos;
    }

    return $ret;  
}
拥抱影子 2024-07-18 19:34:19

这是另一种方法。 这似乎无需指定输出编码即可工作(使用几个不同的 mb_internal_encoding 进行测试):

function mb_strrev($text)
{
    return join('', array_reverse(
        preg_split('~~u', $text, -1, PREG_SPLIT_NO_EMPTY)
    ));
}

Here's another way. This seems to work without having to specify an output encoding (tested with a couple of different mb_internal_encodings):

function mb_strrev($text)
{
    return join('', array_reverse(
        preg_split('~~u', $text, -1, PREG_SPLIT_NO_EMPTY)
    ));
}
心的位置 2024-07-18 19:34:19

答案

function mb_strrev($text, $encoding = null)
{
    $funcParams = array($text);
    if ($encoding !== null)
        $funcParams[] = $encoding;
    $length = call_user_func_array('mb_strlen', $funcParams);

    $output = '';
    $funcParams = array($text, $length, 1);
    if ($encoding !== null)
        $funcParams[] = $encoding;
    while ($funcParams[1]--) {
         $output .= call_user_func_array('mb_substr', $funcParams);
    }
    return $output;
}

The answer

function mb_strrev($text, $encoding = null)
{
    $funcParams = array($text);
    if ($encoding !== null)
        $funcParams[] = $encoding;
    $length = call_user_func_array('mb_strlen', $funcParams);

    $output = '';
    $funcParams = array($text, $length, 1);
    if ($encoding !== null)
        $funcParams[] = $encoding;
    while ($funcParams[1]--) {
         $output .= call_user_func_array('mb_substr', $funcParams);
    }
    return $output;
}
诺曦 2024-07-18 19:34:19

另一种方法:

function mb_strrev($str, $enc = null) {
    if(is_null($enc)) $enc = mb_internal_encoding();
    $str = mb_convert_encoding($str, 'UTF-16BE', $enc);
    return mb_convert_encoding(strrev($str), $enc, 'UTF-16LE');
}

Another method:

function mb_strrev($str, $enc = null) {
    if(is_null($enc)) $enc = mb_internal_encoding();
    $str = mb_convert_encoding($str, 'UTF-16BE', $enc);
    return mb_convert_encoding(strrev($str), $enc, 'UTF-16LE');
}
等待我真够勒 2024-07-18 19:34:19

这很简单 utf8_strrev( $str )。 请参阅下面我复制的库的相关代码:

function utf8_strrev( $str )
{
    return implode( array_reverse( utf8_split( $str ) ) );
}

function utf8_split( $str , $split_length = 1 )
{
    $str    = ( string ) $str;

    $ret    = array( );

    if( pcre_utf8_support( ) )
    {
        $str    = utf8_clean( $str );

        $ret    = preg_split('/(?<!^)(?!$)/u', $str );

        // \X is buggy in many recent versions of PHP
        //preg_match_all( '/\X/u' , $str , $ret );
        //$ret  = $ret[0];
    }
    else
    {
        //Fallback

        $len    = strlen( $str );

        for( $i = 0 ; $i < $len ; $i++ )
        {
            if( ( $str[$i] & "\x80" ) === "\x00" )
            {
                $ret[]  = $str[$i];
            }
            else if( ( ( $str[$i] & "\xE0" ) === "\xC0" ) && ( isset( $str[$i+1] ) ) )
            {
                if( ( $str[$i+1] & "\xC0" ) === "\x80" )
                {
                    $ret[]  = $str[$i] . $str[$i+1];

                    $i++;
                }
            }
            else if( ( ( $str[$i] & "\xF0" ) === "\xE0" ) && ( isset( $str[$i+2] ) ) )
            {
                if( ( ( $str[$i+1] & "\xC0" ) === "\x80" ) && ( ( $str[$i+2] & "\xC0" ) === "\x80" ) )
                {
                    $ret[]  = $str[$i] . $str[$i+1] . $str[$i+2];

                    $i  = $i + 2;
                }
            }
            else if( ( ( $str[$i] & "\xF8" ) === "\xF0" ) && ( isset( $str[$i+3] ) ) )
            {
                if( ( ( $str[$i+1] & "\xC0" ) === "\x80" ) && ( ( $str[$i+2] & "\xC0" ) === "\x80" ) && ( ( $str[$i+3] & "\xC0" ) === "\x80" ) )
                {
                    $ret[]  = $str[$i] . $str[$i+1] . $str[$i+2] . $str[$i+3];

                    $i  = $i + 3;
                }
            }
        }
    }


    if( $split_length > 1 )
    {
        $ret = array_chunk( $ret , $split_length );

        $ret    = array_map( 'implode' , $ret );
    }

    if( $ret[0] === '' )
    {
        return array( );
    }

    return $ret;
}


function utf8_clean( $str , $remove_bom = false )
{
    $regx = '/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./s';

    $str    = preg_replace( $regx , '$1' , $str );

    if( $remove_bom )
    {
        $str    = utf8_str_replace( utf8_bom( ) , '' , $str );
    }

    return $str;
}


function utf8_str_replace( $search , $replace , $subject , &$count = 0 )
{
    return str_replace( $search , $replace , $subject , $count );
}


function utf8_bom( )
{
    return "\xef\xbb\xbf";

}


function pcre_utf8_support( )
{
    static $support;

    if( !isset( $support ) )
    {
        $support = @preg_match( '//u', '' );
        //Cached the response
    }

    return $support;
}

It is easy utf8_strrev( $str ). See the relevant source code of my Library that I copied below:

function utf8_strrev( $str )
{
    return implode( array_reverse( utf8_split( $str ) ) );
}

function utf8_split( $str , $split_length = 1 )
{
    $str    = ( string ) $str;

    $ret    = array( );

    if( pcre_utf8_support( ) )
    {
        $str    = utf8_clean( $str );

        $ret    = preg_split('/(?<!^)(?!$)/u', $str );

        // \X is buggy in many recent versions of PHP
        //preg_match_all( '/\X/u' , $str , $ret );
        //$ret  = $ret[0];
    }
    else
    {
        //Fallback

        $len    = strlen( $str );

        for( $i = 0 ; $i < $len ; $i++ )
        {
            if( ( $str[$i] & "\x80" ) === "\x00" )
            {
                $ret[]  = $str[$i];
            }
            else if( ( ( $str[$i] & "\xE0" ) === "\xC0" ) && ( isset( $str[$i+1] ) ) )
            {
                if( ( $str[$i+1] & "\xC0" ) === "\x80" )
                {
                    $ret[]  = $str[$i] . $str[$i+1];

                    $i++;
                }
            }
            else if( ( ( $str[$i] & "\xF0" ) === "\xE0" ) && ( isset( $str[$i+2] ) ) )
            {
                if( ( ( $str[$i+1] & "\xC0" ) === "\x80" ) && ( ( $str[$i+2] & "\xC0" ) === "\x80" ) )
                {
                    $ret[]  = $str[$i] . $str[$i+1] . $str[$i+2];

                    $i  = $i + 2;
                }
            }
            else if( ( ( $str[$i] & "\xF8" ) === "\xF0" ) && ( isset( $str[$i+3] ) ) )
            {
                if( ( ( $str[$i+1] & "\xC0" ) === "\x80" ) && ( ( $str[$i+2] & "\xC0" ) === "\x80" ) && ( ( $str[$i+3] & "\xC0" ) === "\x80" ) )
                {
                    $ret[]  = $str[$i] . $str[$i+1] . $str[$i+2] . $str[$i+3];

                    $i  = $i + 3;
                }
            }
        }
    }


    if( $split_length > 1 )
    {
        $ret = array_chunk( $ret , $split_length );

        $ret    = array_map( 'implode' , $ret );
    }

    if( $ret[0] === '' )
    {
        return array( );
    }

    return $ret;
}


function utf8_clean( $str , $remove_bom = false )
{
    $regx = '/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./s';

    $str    = preg_replace( $regx , '$1' , $str );

    if( $remove_bom )
    {
        $str    = utf8_str_replace( utf8_bom( ) , '' , $str );
    }

    return $str;
}


function utf8_str_replace( $search , $replace , $subject , &$count = 0 )
{
    return str_replace( $search , $replace , $subject , $count );
}


function utf8_bom( )
{
    return "\xef\xbb\xbf";

}


function pcre_utf8_support( )
{
    static $support;

    if( !isset( $support ) )
    {
        $support = @preg_match( '//u', '' );
        //Cached the response
    }

    return $support;
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文