Python: How to replace full-width characters with half-width characters?

Posted on 2024-08-24 12:21:24

If this was PHP, I would probably do something like this:

function no_more_half_widths($string){
  $foo = array('１','２','３','４','５','６','７','８','９','１０');
  $bar = array('1','2','3','4','5','6','7','8','9','10');
  return str_replace($foo, $bar, $string);
}

I have tried the .translate function in Python, and it complains that the arrays are not the same size. I assume this is because the individual characters are encoded in UTF-8. Any suggestions?

离笑几人歌 2024-08-31 12:21:24

The built-in unicodedata module can do it:

>>> import unicodedata
>>> foo = u'１２３４５６７８９０'
>>> unicodedata.normalize('NFKC', foo)
u'1234567890'

“NFKC” stands for “Normalization Form KC [Compatibility Decomposition, followed by Canonical Composition]”. It replaces full-width characters with their half-width counterparts, which Unicode treats as compatibility equivalents.

Note that it also normalizes all sorts of other things at the same time, like separate accent marks and Roman numeral symbols.
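To make that side effect concrete, here is a small Python 3 sketch. The sample strings are my own picks rather than anything from the answer; they only illustrate how much more than digits NFKC rewrites.

```python
import unicodedata

# Hypothetical sample inputs chosen for illustration.
samples = {
    "１２３ＡＢＣ": "full-width digits and letters",
    "Ⅳ": "Roman numeral four (a single code point)",
    "e\u0301": "letter e followed by a combining acute accent",
    "ｶ": "half-width katakana KA",
}

for text, description in samples.items():
    normalized = unicodedata.normalize("NFKC", text)
    print(f"{description}: {text!r} -> {normalized!r}")
```

Note the last sample: half-width katakana is widened to ordinary katakana, so NFKC is not strictly a "full-width to half-width" conversion. If only the ASCII range should change, a translation table is the safer tool.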

始于初秋 2024-08-31 12:21:24

In Python 3, you can use the following snippet. It builds a map from each full-width character to the corresponding printable ASCII character. Best of all, you don't need to hard-code the ASCII sequence, which is error-prone.

FULL2HALF = dict((i + 0xFEE0, i) for i in range(0x21, 0x7F))
FULL2HALF[0x3000] = 0x20

def halfen(s):
    '''
    Convert full-width characters to their ASCII counterparts.
    '''
    return str(s).translate(FULL2HALF)

With the same logic, you can convert half-width characters to full-width ones:

HALF2FULL = dict((i, i + 0xFEE0) for i in range(0x21, 0x7F))
HALF2FULL[0x20] = 0x3000

def fullen(s):
    '''
    Convert all ASCII characters to their full-width counterparts.
    '''
    return str(s).translate(HALF2FULL)

Note: These two snippets only consider ASCII characters and do not convert any Japanese/Korean full-width characters.
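As a quick check of both directions, here is a self-contained round-trip sketch. It rebuilds the same offset-based tables (as dict comprehensions, and deriving one table from the other, which is my own shortcut) and uses sample strings of my choosing.

```python
# Same mapping as above: U+FF01..U+FF5E are ASCII 0x21..0x7E shifted by 0xFEE0.
FULL2HALF = {code + 0xFEE0: code for code in range(0x21, 0x7F)}
FULL2HALF[0x3000] = 0x20  # ideographic space -> ASCII space

# The reverse table is just the inverse mapping.
HALF2FULL = {half: full for full, half in FULL2HALF.items()}

original = "Hello, World! 123"
widened = original.translate(HALF2FULL)
narrowed = widened.translate(FULL2HALF)

print(widened)
print(narrowed == original)

# Characters outside the tables pass through unchanged,
# including half-width katakana and CJK ideographs.
mixed = "ｶﾀｶﾅ 漢字 ＡＢＣ"
print(mixed.translate(FULL2HALF))
```

Only the full-width Latin letters in the last string change; the katakana and kanji are left alone, which matches the note above.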

For completeness, from Wikipedia:

Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as
fullwidth forms, that is, a fixed width form used in CJK
computing. This is useful for typesetting Latin characters in a CJK
environment. U+FF00 does not correspond to a fullwidth ASCII 20
(space character), since that role is already fulfilled by U+3000
"ideographic space."
Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul
characters.
Range U+FFE0–FFEE includes fullwidth and halfwidth symbols.

A Python 2 solution can be found at gist/jcayzac.

神魇的王 2024-08-31 12:21:24

I don't think there's a built-in function to do multiple replacements in one pass, so you'll have to do it yourself.

One way to do it:

>>> src = (u'１',u'２',u'３',u'４',u'５',u'６',u'７',u'８',u'９',u'０')
>>> dst = ('1','2','3','4','5','6','7','8','9','0')
>>> string = u'a１２３'
>>> for i, j in zip(src, dst):
...     string = string.replace(i, j)
... 
>>> string
u'a123'

Or using a dictionary:

>>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
>>> string = u'a１２３'
>>> for i, j in trans.iteritems():
...     string = string.replace(i, j)
...     
>>> string
u'a123'

Or finally, using regex (and this might actually be the fastest):

>>> import re
>>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
>>> lookup = re.compile(u'|'.join(trans.keys()), re.UNICODE)
>>> string = u'a１２３'
>>> lookup.sub(lambda x: trans[x.group()], string)
u'a123'
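The sessions above are Python 2 (u'' literals, iteritems). For reference, a Python 3 sketch of the single-pass regex variant could look like this. The helper name replace_all is hypothetical, and sorting the keys by length is my own addition so that longer keys would win if multi-character keys were ever added.

```python
import re

# Map each full-width digit to its ASCII digit.
trans = dict(zip("０１２３４５６７８９", "0123456789"))

# re.escape guards against keys that happen to be regex metacharacters;
# longest keys first so that multi-character keys take precedence.
keys = sorted(trans, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(key) for key in keys))


def replace_all(text):
    """Replace every key of `trans` found in `text` in a single pass."""
    return pattern.sub(lambda match: trans[match.group()], text)


print(replace_all("a１２３"))  # a123
```

Because the substitution happens in one pass, a replacement's output is never re-scanned, unlike the chained string.replace loops.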

我要还你自由 2024-08-31 12:21:24

Using the unicode.translate method:

>>> table = dict(zip(map(ord,u'０１２３４５６７８９'),map(ord,u'0123456789')))
>>> print u'１２３'.translate(table)
123

It requires a mapping keyed by code points (integers), not characters. Also, using u'unicode literals' keeps the values as decoded Unicode text rather than encoded bytes.
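The encoded-versus-decoded distinction is probably what produced the "not the same size" error in the question, although that is my reading rather than something the asker confirmed. A short Python 3 sketch of the difference:

```python
fullwidth = "０１２３４５６７８９"
halfwidth = "0123456789"

# As text, both strings contain ten code points.
print(len(fullwidth), len(halfwidth))

# As UTF-8 bytes, each full-width digit takes three bytes,
# so byte-oriented translation tables no longer line up.
print(len(fullwidth.encode("utf-8")), len(halfwidth.encode("utf-8")))

# Working on code points avoids the mismatch entirely.
table = {ord(full): ord(half) for full, half in zip(fullwidth, halfwidth)}
print("１２３".translate(table))  # 123
```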

意中人 2024-08-31 12:21:24

In Python 3, the cleanest way is to use str.translate and str.maketrans:

FULLWIDTH_TO_HALFWIDTH = str.maketrans('１２３４５６７８９０',
                                       '1234567890')
def fullwidth_to_halfwidth(s):
    return s.translate(FULLWIDTH_TO_HALFWIDTH)

In Python 2, the equivalent is string.maketrans, which doesn’t work with Unicode characters, so you need to build a dictionary instead, as Josh Lee notes above.
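For what it's worth, str.maketrans also accepts a single dict, which helps when a replacement is longer than one character or when a character should be dropped. The sketch below is only an illustration; the extra entries are examples I picked, not part of the answer.

```python
# Two-string form: characters are paired position by position.
digits = str.maketrans("０１２３４５６７８９", "0123456789")
print(digits[ord("１")])  # 49, the code point of "1"

# Dict form: values may be strings of any length, or None to delete.
extras = str.maketrans({
    "\u3000": " ",    # ideographic space -> ASCII space
    "㈱": "(株)",      # one character expands to three
    "\u200b": None,   # zero-width space is removed
})

# Both tables are plain dicts keyed by code point, so they can be merged.
combined = {**digits, **extras}
print("１２㈱\u3000ＡＢ".translate(combined))
```

Full-width letters are untouched here because neither table covers them.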

等风来 2024-08-31 12:21:24

Regex approach

>>> import re
>>> re.sub(u"[\uff10-\uff19]",lambda x:chr(ord(x.group(0))-0xfee0),u"４５６")
u'456'
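The same idea extends from digits to the whole full-width ASCII block. A Python 3 sketch, assuming the U+FF01 to U+FF5E range and the 0xFEE0 offset quoted elsewhere in this thread; the names FULLWIDTH_ASCII and narrow are hypothetical:

```python
import re

# Matches any character in the full-width ASCII block.
FULLWIDTH_ASCII = re.compile("[\uff01-\uff5e]")


def narrow(text):
    """Shift each full-width ASCII character down by 0xFEE0."""
    return FULLWIDTH_ASCII.sub(
        lambda match: chr(ord(match.group(0)) - 0xFEE0), text
    )


print(narrow("Ｐｙｔｈｏｎ３！"))  # Python3!
```

The ideographic space (U+3000) sits outside this range, so it would need separate handling.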
