Getting the multibyte character offset of a preg_match() match (PREG_OFFSET_CAPTURE reports byte offsets, which does not help)

Posted 2024-08-10 20:43:02

I'm trying to search a UTF8-encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems like it's not treating the subject as a UTF8-encoded string, even though I'm passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF8 functions are working:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

無處可尋 2024-08-17 20:43:02

Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
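
A minimal sketch of the same idea that prints both numbers side by side. It assumes `substr` is the plain byte-based function; with `mbstring.func_overload = 7` as in the question (a setting that was deprecated in PHP 7.2 and removed in PHP 8.0), `substr` is silently replaced by `mb_substr` and the cut would be made in characters instead. The encoding is passed explicitly so the result does not depend on `mbstring.internal_encoding`.

```php
<?php
// Illustrative only: compare the byte offset reported by preg_match()
// with the character offset derived from it.
$str = "\xC2\xA1Hola!"; // "¡Hola!", where "¡" occupies two bytes in UTF-8

preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

// Cut everything before the match by bytes, then count its characters.
$charOffset = mb_strlen(substr($str, 0, $byteOffset), 'UTF-8');

echo "bytes: $byteOffset, characters: $charOffset\n"; // bytes: 2, characters: 1
```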

高速公鹿 2024-08-17 20:43:02

Try adding (*UTF8) at the start of the pattern, inside the delimiters:

preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment in
https://www.php.net/manual/function.preg-match.php#95828

摇划花蜜的午后 2024-08-17 20:43:02

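Looks like this is a "feature", see
http://bugs.php.net/bug.php?id=37391

The 'u' switch only makes sense for PCRE; PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I didn't say "correct").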
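
To make the "byte sequence" point concrete, here is a small sketch using the string from the question. It only relies on the standard `strlen`, `mb_strlen`, and `bin2hex` functions.

```php
<?php
// Illustrative only: the same string measured in bytes and in characters.
$str = "\xC2\xA1Hola!"; // "¡Hola!"

echo strlen($str), "\n";              // 7: bytes ("¡" takes two)
echo mb_strlen($str, 'UTF-8'), "\n";  // 6: characters
echo bin2hex($str[0]), "\n";          // c2: index 0 is one byte, not the whole "¡"
```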

北陌 2024-08-17 20:43:02

Excuse the necroposting, but someone may find this useful: the code below works as a replacement for both preg_match and preg_match_all and returns the matches with correct character offsets for UTF-8-encoded strings.

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText = $match[0];
                    $byteOffset  = $match[1]; // preg_* reports offsets in bytes
                    $positions[] = array(
                        $matchedText,
                        // cut by bytes, then count characters to get the character offset
                        mb_strlen(mb_strcut($subject, 0, $byteOffset))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

冬天旳寂寞 2024-08-17 20:43:02

You can calculate the real UTF-8 character offset by cutting the string at the offset returned by preg_match with the byte-counting substr, and then measuring that prefix with the character-counting mb_strlen.

$utf8Offset = mb_strlen(substr($text, 0, $offsetFromPregMatch), 'UTF-8');
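
If there are many matches, re-measuring the whole prefix for each one repeats work. The sketch below applies the same substr + mb_strlen idea incrementally, measuring only the slice since the previous match. The function name `charOffsetsOfMatches` is made up for this illustration, and it only handles the full match (group 0).

```php
<?php
/**
 * Hypothetical helper: returns [text, characterOffset] pairs for every
 * full match of $pattern in $subject (capture groups are ignored).
 */
function charOffsetsOfMatches(string $pattern, string $subject): array
{
    $result = [];
    // preg_match_all() returns 0 for no match and false on error.
    if (!preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE)) {
        return $result;
    }
    $lastByte = 0; // byte offset of the previous match
    $lastChar = 0; // character offset of the previous match
    foreach ($m[0] as [$text, $byteOffset]) {
        // Count only the characters between the previous match and this one.
        $slice = substr($subject, $lastByte, $byteOffset - $lastByte);
        $lastChar += mb_strlen($slice, 'UTF-8');
        $lastByte = $byteOffset;
        $result[] = [$text, $lastChar];
    }
    return $result;
}

// Both occurrences of "о" in "Попробуем": character offsets 1 and 4.
var_dump(charOffsetsOfMatches('/о/u', 'Попробуем'));
```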

口干舌燥 2024-08-17 20:43:02

If all you want to do is find the multibyte-safe position of H, try mb_strpos():

mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";

Output:

¡Hola!
1
H

勿忘心安 2024-08-17 20:43:02

I wrote a small class that converts the byte offsets returned by preg_match into proper UTF-8 character offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like this:

$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

https://3v4l.org/8Y32J

寄风 2024-08-17 20:43:02

You might want to look at the T-Regx library.

pattern('/Hola/u')->match("\xC2\xA1Hola!")->first(function (Match $match) 
{
    echo $match->offset();     // characters
    echo $match->byteOffset(); // bytes
});

Here, $match->offset() is the UTF-8-safe (character) offset.

心清如水 2024-08-17 20:43:02

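I solved this simply by using plain substr instead of the expected mb_substr (PHP 7.4).

The problem appeared when the text contained the euro sign (€).

iconv and utf8_encode did not help either, and I could not use htmlentities.

Simply reverting to plain substr helped, and it handles € and other characters correctly.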
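
A likely reason this works: the offset from PREG_OFFSET_CAPTURE is in bytes, so it pairs with the byte-based substr, while mb_substr expects a character offset. The sketch below uses a made-up sample string to show the mismatch, and it assumes the source file is saved as UTF-8.

```php
<?php
// Illustrative only: the sample text is invented for this example.
$text = 'Price: 5 € per item'; // "€" is one character but three bytes

preg_match('/per/u', $text, $m, PREG_OFFSET_CAPTURE);
[$word, $byteOffset] = $m[0];

// Byte offset with a byte-based function: consistent.
echo substr($text, $byteOffset, strlen($word)), "\n";  // per

// Byte offset with a character-based function: shifted by the two
// extra bytes of "€", so the wrong slice comes back.
echo mb_substr($text, $byteOffset, 3, 'UTF-8'), "\n";  // r i
```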

笑饮青盏花 2024-08-17 20:43:02

I think working with PREG_OFFSET_CAPTURE in this case only creates more work.

Demo of below scripts.

If the pattern only contains literal characters, then preg_ is overkill, just use mb_strpos() and bear in mind that the returned value will be false if the needle is not found in the haystack.

var_export(mb_strpos($str, 'H')); // 1

If you know that the needle will exist in the haystack, you can use preg_match_all() with the marvellous \G (continue) metacharacter and \X (multibyte any character) metacharacter.

echo preg_match_all('/\G(?![A-Z])\X/u', $str); // 1
// if needle not found, will return the mb length of haystack

If you don't know if the needle will exist in the haystack, just check if the returned count is equal to the multibyte length of the input string.

$mbLength = preg_match_all('/\G(?![A-Z])\X/u', $str, $m);
var_export(mb_strlen($str) !== $mbLength ? $mbLength : 'not found');

But if you are going to call an extra mb_ function anyhow, then make just one match, check if a match was made, and measure its multibyte length if so.

var_export(
    preg_match('/\X*?(?=[A-Z])/u', $str, $m) ? mb_strlen($m[0]) : 'not found' 
);

All this said, I've never seen the need to count the multibyte position of something unless the greater task was to isolate or replace a substring. If this is the case, avoid this step entirely and just use preg_match() or preg_replace() to more directly serve your needs.
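
As a rough illustration of that last point, here is a sketch that replaces or isolates the first uppercase letter without computing any offset. The `\p{Lu}` pattern and the bracket markup are choices made for this example, not something taken from the question.

```php
<?php
// Illustrative only: work on the substring directly instead of its position.
$str = "\xC2\xA1Hola!"; // "¡Hola!"

// Wrap the first uppercase letter in brackets (limit = 1 replacement).
$marked = preg_replace('/\p{Lu}/u', '[$0]', $str, 1);
echo $marked, "\n"; // ¡[H]ola!

// Split into the text before the first uppercase letter and the rest.
[$before, $rest] = preg_split('/(?=\p{Lu})/u', $str, 2);
echo $before, ' | ', $rest, "\n"; // ¡ | Hola!
```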
