Python: How to replace full-width characters with half-width characters?

Posted on 2024-08-24 12:21:24

If this was PHP, I would probably do something like this:

function no_more_half_widths($string){
  $foo = array('１','２','３','４','５','６','７','８','９','０');
  $bar = array('1','2','3','4','5','6','7','8','9','0');
  return str_replace($foo, $bar, $string);
}

I have tried the .translate function in Python, and it reports that the arrays are not the same size. I assume this is because the individual characters are encoded in UTF-8. Any suggestions?
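
The size mismatch is consistent with a byte-versus-character problem. The following is a minimal sketch, not a diagnosis of the asker's exact code: it assumes the failing call built a byte-oriented translation table (the question does not say which call was used), and the variable names are illustrative. Each full-width digit occupies three bytes in UTF-8, so the two encoded sequences differ in length:

```python
# Full-width digits U+FF10..U+FF19 and their ASCII counterparts.
full = ''.join(chr(c) for c in range(0xFF10, 0xFF1A))
half = '0123456789'

# As text, both sequences contain ten characters.
char_lengths = (len(full), len(half))

# Encoded as UTF-8, each full-width digit takes three bytes.
byte_lengths = (len(full.encode('utf-8')), len(half.encode('utf-8')))

# A byte-level table (assumed here to mirror the failing call) rejects the pair.
try:
    bytes.maketrans(full.encode('utf-8'), half.encode('utf-8'))
    byte_table_error = None
except ValueError as exc:
    byte_table_error = exc

print(char_lengths, byte_lengths, byte_table_error)
```

Working on text strings rather than encoded bytes avoids the mismatch.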

Comments (6)

离笑几人歌 2024-08-31 12:21:24

The built-in unicodedata module can do it:

>>> import unicodedata
>>> foo = u'１２３４５６７８９０'
>>> unicodedata.normalize('NFKC', foo)
u'1234567890'

“NFKC” stands for “Normalization Form KC [Compatibility Decomposition, followed by Canonical Composition]”. It replaces full-width characters with their half-width compatibility equivalents.

Note that it also normalizes all sorts of other things at the same time, like separate accent marks and Roman numeral symbols.
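
To make the side effects concrete, here is a small sketch for Python 3 (the helper name nfkc and the sample inputs are illustrative, not part of the answer). It shows NFKC folding full-width digits and the ideographic space, and also rewriting a Roman numeral symbol, a combining accent, and a half-width katakana letter:

```python
import unicodedata

def nfkc(text):
    """Return text in Unicode Normalization Form KC."""
    return unicodedata.normalize('NFKC', text)

fullwidth_digits = nfkc('\uff11\uff12\uff13')  # full-width 1, 2, 3
ideographic_space = nfkc('\u3000')             # IDEOGRAPHIC SPACE
roman_numeral = nfkc('\u2163')                 # ROMAN NUMERAL FOUR
combined_accent = nfkc('e\u0301')              # e + COMBINING ACUTE ACCENT
halfwidth_kana = nfkc('\uff76')                # HALFWIDTH KATAKANA LETTER KA

print(fullwidth_digits)        # 123
print(repr(ideographic_space)) # ' '
print(roman_numeral)           # IV
print(combined_accent)         # single precomposed e-acute, U+00E9
print(halfwidth_kana)          # regular katakana KA, U+30AB
```

If only the ASCII range should change, a narrower mapping may be a better fit than full normalization.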

始于初秋 2024-08-31 12:21:24

In Python 3, you can use the following snippet. It builds a mapping between all printable ASCII characters and their full-width counterparts. Best of all, you don't need to hard-code the ASCII sequence, which is error-prone.

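# Map full-width forms U+FF01..U+FF5E to ASCII 0x21..0x7E (offset 0xFEE0).
FULL2HALF = {i + 0xFEE0: i for i in range(0x21, 0x7F)}
# The ideographic space U+3000 plays the role of a full-width ASCII space.
FULL2HALF[0x3000] = 0x20

def halfen(s):
    '''
    Convert full-width characters to their ASCII counterparts.
    '''
    return str(s).translate(FULL2HALF)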

Also, with the same logic, you can convert half-width characters to full-width with the following code:

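# Map ASCII 0x21..0x7E to full-width forms U+FF01..U+FF5E (offset 0xFEE0).
HALF2FULL = {i: i + 0xFEE0 for i in range(0x21, 0x7F)}
# The ASCII space becomes the ideographic space U+3000.
HALF2FULL[0x20] = 0x3000

def fullen(s):
    '''
    Convert all ASCII characters to their full-width counterparts.
    '''
    return str(s).translate(HALF2FULL)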

Note: These two snippets only consider ASCII characters and do not convert any Japanese/Korean full-width characters.
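
As a quick check of that note, the sketch below repeats the two tables in condensed form (deriving HALF2FULL by inverting FULL2HALF is a shortcut chosen for this example, and the sample strings are made up) and runs a round trip. A half-width katakana letter passes through unchanged because it lies outside the mapped range:

```python
# Condensed copies of the two tables so the example runs on its own.
FULL2HALF = {i + 0xFEE0: i for i in range(0x21, 0x7F)}
FULL2HALF[0x3000] = 0x20
HALF2FULL = {half: full for full, half in FULL2HALF.items()}

def halfen(s):
    """Convert full-width ASCII variants to plain ASCII."""
    return str(s).translate(FULL2HALF)

def fullen(s):
    """Convert ASCII characters to their full-width variants."""
    return str(s).translate(HALF2FULL)

wide = '\uff21\uff22\uff23\u3000\uff11'  # full-width "ABC", ideographic space, "1"
narrow = halfen(wide)
round_trip = fullen(narrow)
kana = halfen('\uff76')                  # half-width katakana KA, not in the table

print(repr(narrow))        # 'ABC 1'
print(round_trip == wide)  # True
print(kana == '\uff76')    # True: left untouched
```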

For completeness, from Wikipedia:

Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as
fullwidth forms, that is, a fixed width form used in CJK
computing. This is useful for typesetting Latin characters in a CJK
environment. U+FF00 does not correspond to a fullwidth ASCII 20
(space character), since that role is already fulfilled by U+3000
"ideographic space."

Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul
characters.

Range U+FFE0–FFEE includes fullwidth and halfwidth symbols.
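
One way to explore these ranges is unicodedata.east_asian_width from the standard library, which the answer itself does not use. This is only an exploratory sketch with hand-picked sample characters; the function reports 'F' for full-width forms, 'H' for half-width forms, and 'Na' for narrow characters such as plain ASCII:

```python
import unicodedata

def width_class(ch):
    """Return the East Asian Width property of a single character."""
    return unicodedata.east_asian_width(ch)

ascii_a = width_class('A')           # plain ASCII letter
fullwidth_a = width_class('\uff21')  # FULLWIDTH LATIN CAPITAL LETTER A
ideo_space = width_class('\u3000')   # IDEOGRAPHIC SPACE
halfwidth_ka = width_class('\uff76') # HALFWIDTH KATAKANA LETTER KA

print(ascii_a, fullwidth_a, ideo_space, halfwidth_ka)  # Na F F H
```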

A Python 2 solution can be found at gist/jcayzac.

神魇的王 2024-08-31 12:21:24

I don't think there's a built-in function to do multiple replacements in one pass, so you'll have to do it yourself.

One way to do it:

>>> src = (u'１',u'２',u'３',u'４',u'５',u'６',u'７',u'８',u'９',u'０')
>>> dst = ('1','2','3','4','5','6','7','8','9','0')
>>> string = u'a１２３'
>>> for i, j in zip(src, dst):
...     string = string.replace(i, j)
...
>>> string
u'a123'

Or using a dictionary:

>>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
>>> string = u'a１２３'
>>> for i, j in trans.items():
...     string = string.replace(i, j)
...
>>> string
u'a123'

Or finally, using regex (and this might actually be the fastest):

>>> import re
>>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
>>> lookup = re.compile(u'|'.join(trans.keys()), re.UNICODE)
>>> string = u'a１２３'
>>> lookup.sub(lambda x: trans[x.group()], string)
u'a123'

我要还你自由 2024-08-31 12:21:24

Using the unicode.translate method:

>>> table = dict(zip(map(ord,u'０１２３４５６７８９'),map(ord,u'0123456789')))
>>> print(u'１２３'.translate(table))
123

It requires a mapping keyed by code points as numbers, not characters. Also, using u'unicode literals' keeps the values as Unicode text rather than encoded bytes.
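
The point about numeric keys can be sketched in Python 3, where str.translate behaves the same way. The table contents and sample strings below are made up for illustration. Keys must be integer code points, while values may be code points, strings, or None (which deletes the character); a table keyed by characters is silently ignored:

```python
full_digits = ''.join(chr(c) for c in range(0xFF10, 0xFF1A))

# Keys are code points (ints); values here are code points as well.
table = {ord(f): ord(h) for f, h in zip(full_digits, '0123456789')}
table[0x3000] = ' '    # a string value: ideographic space -> ASCII space
table[0xFF0C] = None   # None deletes the full-width comma

# Full-width "1", full-width comma, full-width "2", ideographic space, full-width "3".
converted = '\uff11\uff0c\uff12\u3000\uff13'.translate(table)

# A table keyed by characters instead of code points matches nothing.
wrong_table = {'\uff11': '1'}
unchanged = '\uff11'.translate(wrong_table)

print(repr(converted))        # '12 3'
print(unchanged == '\uff11')  # True
```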

意中人 2024-08-31 12:21:24

In Python 3, the cleanest approach is to use str.translate and str.maketrans:

FULLWIDTH_TO_HALFWIDTH = str.maketrans('１２３４５６７８９０',
                                       '1234567890')
def fullwidth_to_halfwidth(s):
    return s.translate(FULLWIDTH_TO_HALFWIDTH)

In Python 2, str.maketrans is instead string.maketrans and doesn’t work with Unicode characters, so you need to make a dictionary, as Josh Lee notes above.
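
For reference, a short Python 3 sketch of what str.maketrans produces (variable names are illustrative): the result is an ordinary dictionary keyed by code points, which is the same shape as the hand-built dictionary, and an optional third argument lists characters to delete:

```python
full = ''.join(chr(c) for c in range(0xFF10, 0xFF1A))  # full-width 0-9

# Two equal-length strings: map each character to its counterpart.
table = str.maketrans(full, '0123456789')

# A third argument lists characters to delete (here the ideographic space).
table_del = str.maketrans(full, '0123456789', '\u3000')

answer = '\uff14\uff12'.translate(table)
answer_no_space = '\uff14\u3000\uff12'.translate(table_del)

print(type(table).__name__)          # dict
print(table[0xFF11] == ord('1'))     # True
print(answer, answer_no_space)       # 42 42
```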

等风来 2024-08-31 12:21:24

Regex approach

>>> import re
>>> re.sub(u"[\uff10-\uff19]",lambda x:chr(ord(x.group(0))-0xfee0),u"４５６")
u'456'
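
The same idea can be widened from digits to the whole full-width ASCII block. The following is a sketch for Python 3, with the pattern and function names chosen for this example; it covers U+FF01 to U+FF5E only and leaves the ideographic space U+3000 alone:

```python
import re

# Character class covering the full-width forms of ASCII 0x21..0x7E.
FULLWIDTH_ASCII = re.compile('[\uff01-\uff5e]')

def to_halfwidth(text):
    """Shift each full-width ASCII variant down by 0xFEE0 to its ASCII form."""
    return FULLWIDTH_ASCII.sub(lambda m: chr(ord(m.group(0)) - 0xFEE0), text)

# Full-width "Abc456!" written with escapes to keep the source unambiguous.
sample = '\uff21\uff42\uff43\uff14\uff15\uff16\uff01'
converted = to_halfwidth(sample)
space_kept = to_halfwidth('\u3000')

print(converted)                # Abc456!
print(space_kept == '\u3000')   # True: outside the pattern's range
```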