Replace "abc123def" with "abc 123 def" in a multibyte string
Normally I would just do this.
$str = preg_replace('#(\d+)#', ' $1 ', $str);
If I knew it was going to be UTF-8, I would add a lowercase "u" modifier to the pattern and I think I would be fine. But because of reports of UTF-8 taking 2x, and in some cases 3x, the storage space that the native character set would need, I'm trying not to restrict the application to UTF-8.
Thus, I'm trying to stay away from my favorite preg_ functions.
Most things have been fairly simple so far, but I'm a little stuck on replacements where I'd normally use character classes in preg_ such as "\d".
Comments (2)
Implement a storage wrapper with mb_convert_encoding so internally you only have to manipulate UTF-8. (I still think you should require UTF-8 and save everyone a lot of trouble.)
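A minimal sketch of what such a wrapper might look like, assuming the mbstring extension is loaded. The function names (from_storage, to_storage, space_out_digits) and the choice of SJIS as the storage encoding are made up for illustration and are not from the question.

```php
<?php
// Hypothetical helpers: convert at the storage boundary, work in UTF-8 inside.

// Decode a string read from storage into UTF-8.
function from_storage(string $raw, string $storageEncoding): string
{
    return mb_convert_encoding($raw, 'UTF-8', $storageEncoding);
}

// Encode a UTF-8 string back into the storage encoding before writing it.
function to_storage(string $utf8, string $storageEncoding): string
{
    return mb_convert_encoding($utf8, $storageEncoding, 'UTF-8');
}

// The replacement from the question, with the "u" modifier, since the
// input is known to be UTF-8 at this point.
function space_out_digits(string $utf8): string
{
    // preg_replace returns null on error; fall back to the input unchanged.
    return preg_replace('#(\d+)#u', ' $1 ', $utf8) ?? $utf8;
}

// Example: the stored bytes are SJIS (an assumption for this demo).
$stored = to_storage("\u{65E5}\u{672C}abc123def\u{8A9E}", 'SJIS');
$text   = space_out_digits(from_storage($stored, 'SJIS'));
$stored = to_storage($text, 'SJIS');
```

With this layout only the two boundary functions need to know the storage encoding; everything in between can use preg_ functions with the "u" modifier.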
I think UTF-8 is designed so that any byte in the encoded output with a value of 127 or less is always the ASCII character with that value and never part of a multibyte sequence. So in this situation you can keep treating the text as ASCII without causing problems (spaces and digits are all ASCII).
See the description at http://en.wikipedia.org/wiki/UTF-8, which shows that every byte in a multibyte sequence has the most significant bit set (i.e. is > 127).
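A small sketch of that point, assuming PHP 7 or later (for the "\u{...}" escapes) and the mbstring extension for the validity check. The helper name space_out_digits_bytewise is made up for illustration. It applies only to UTF-8 input; other multibyte encodings would need to be checked separately.

```php
<?php
// Hypothetical helper: the original pattern without the "u" modifier,
// so PCRE works byte by byte and \d matches only the ASCII digits 0-9.
function space_out_digits_bytewise(string $bytes): string
{
    return preg_replace('#(\d+)#', ' $1 ', $bytes) ?? $bytes;
}

// UTF-8 text with multibyte characters on both sides of the digits.
$input  = "\u{65E5}\u{672C}abc123def\u{8A9E}";
$result = space_out_digits_bytewise($input);

// Every byte of a multibyte UTF-8 character has the high bit set, so none
// of them can be mistaken for an ASCII digit or space.
$allHigh = true;
foreach (str_split("\u{8A9E}") as $byte) {
    if (ord($byte) < 0x80) {
        $allHigh = false;
    }
}

// The result is still well-formed UTF-8.
$stillValid = mb_check_encoding($result, 'UTF-8');
```

Here $allHigh and $stillValid both end up true: the multibyte characters pass through untouched and only the ASCII digit run gets the surrounding spaces.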