How do I use the unicodedata module to handle multi-character Unicode emoji in Python 3?

Posted 2025-01-18 09:47:00

While working with emoji and trying to get their code points and names with the unicodedata module, I kept running into problems with multi-character emoji. The module refuses strings and wants single characters instead. I tried normalizing, I tried encoding to utf-8 and unicode-escape, and I researched it again and again, but I could not figure out what was going on!

emojis = ["💖", "💘", "💝", "💞", "❣️", "✨"]
for emoji in emojis:
    codepoint: str = hex(ord(emoji))
    filename = 'emoji_u{0}.png'.format(codepoint[2:])
    print('{emoji} ({codepoint}) => {filename}'.format(emoji=emoji,
                                                       codepoint=codepoint,
                                                       filename=filename))

Yes, the code above does not use the unicodedata module, but it shows the problem I was having regardless...

💖 (0x1f496) => emoji_u1f496.png
💘 (0x1f498) => emoji_u1f498.png
💝 (0x1f49d) => emoji_u1f49d.png
💞 (0x1f49e) => emoji_u1f49e.png
Traceback (most recent call last):
  File "F:/Programming/Languages/Vue.js/lovely/collect.py", line 8, in <module>
    codepoint: str = hex(ord(emoji))
TypeError: ord() expected a character, but string of length 2 found

After a break, somehow, I managed to convert the emoji unintentionally, from this: ❣️ to this: ❣. Python was able to process this new emoji character perfectly fine. The unicodedata module likes it too!

So what's the difference? Why does one have color and not the other in both my browser and IDE? And most importantly, how do I convert multi-character emojis to single-character emojis in Python?

从来不烧饼 2025-01-25 09:47:00

Some human-perceived single-character emoji (called graphemes) are made up of multiple code points. Here's a way to handle them. I added a complicated example:

import unicodedata as ud

emojis = ["💖", "💘", "💝", "💞", "❣️", "✨", "👨‍👩‍👧‍👦"]
for emoji in emojis:
    print('Emoji:',emoji)
    for cp in emoji:
        print(f'    {cp} U+{ord(cp):04X} {ud.name(cp)}')

Output:

Emoji: 💖
    💖 U+1F496 SPARKLING HEART
Emoji: 💘
    💘 U+1F498 HEART WITH ARROW
Emoji: 💝
    💝 U+1F49D HEART WITH RIBBON
Emoji: 💞
    💞 U+1F49E REVOLVING HEARTS
Emoji: ❣️
    ❣ U+2763 HEAVY HEART EXCLAMATION MARK ORNAMENT
    ️ U+FE0F VARIATION SELECTOR-16
Emoji: ✨
    ✨ U+2728 SPARKLES
Emoji: 👨‍👩‍👧‍👦
    👨 U+1F468 MAN
    ‍ U+200D ZERO WIDTH JOINER
    👩 U+1F469 WOMAN
    ‍ U+200D ZERO WIDTH JOINER
    👧 U+1F467 GIRL
    ‍ U+200D ZERO WIDTH JOINER
    👦 U+1F466 BOY
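
Since the question was ultimately about building a filename from an emoji, the per-code-point loop above can be folded into a small helper. This is only a sketch: the names codepoints and emoji_filename are made up for illustration, and joining the hex values with underscores is an assumed extension of the question's 'emoji_u{hex}.png' pattern, not a scheme confirmed by the original post. The emoji are written as escapes so the snippet does not depend on file encoding.

```python
def codepoints(emoji: str) -> list[str]:
    """Return the lowercase hex code point of every character in `emoji`."""
    return [f"{ord(ch):x}" for ch in emoji]


def emoji_filename(emoji: str) -> str:
    # Assumed naming scheme: join all code points with underscores so that
    # multi-code-point emoji still map to exactly one filename.
    return "emoji_u" + "_".join(codepoints(emoji)) + ".png"


samples = [
    "\U0001F496",                 # SPARKLING HEART (one code point)
    "\u2763\uFE0F",               # heart exclamation + VARIATION SELECTOR-16
    "\U0001F468\u200D\U0001F469", # MAN + ZERO WIDTH JOINER + WOMAN
]
for emoji in samples:
    print(emoji_filename(emoji))
```

Because the helper iterates over the string instead of passing it to ord() directly, it never raises the TypeError from the question, whatever the length of the emoji.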

If several emoji are in a single string, the rules for splitting out each grapheme are complicated, but they are implemented by the third-party regex module, where \X matches a single grapheme:

import unicodedata as ud
import regex

for m in regex.finditer(r'\X', '💖💘💝💞❣️✨👨‍👩‍👧‍👦'):
    emoji = m.group(0)
    print(f'{emoji}   {ascii(emoji)}')

Output:

💖   '\U0001f496'
💘   '\U0001f498'
💝   '\U0001f49d'
💞   '\U0001f49e'
❣️   '\u2763\ufe0f'
✨   '\u2728'
👨‍👩‍👧‍👦   '\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f466'
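
As for the ❣️ versus ❣ difference in the question: the two-character form ends in U+FE0F VARIATION SELECTOR-16, which requests emoji-style presentation of the preceding character. How the bare character is drawn without it depends on the font and platform. A minimal sketch of dropping that selector follows; strip_vs16 is a hypothetical helper name, and whether discarding the selector is acceptable depends on how the result is displayed. Emoji joined with ZERO WIDTH JOINER cannot be reduced to one code point this way.

```python
import unicodedata as ud

VS16 = "\uFE0F"  # VARIATION SELECTOR-16


def strip_vs16(text: str) -> str:
    """Remove every VARIATION SELECTOR-16 from `text` (hypothetical helper)."""
    return text.replace(VS16, "")


heart = "\u2763\uFE0F"        # the two-code-point form from the question
plain = strip_vs16(heart)     # the one-code-point form
print(len(heart), len(plain))  # 2 1
print(ud.name(plain))          # HEAVY HEART EXCLAMATION MARK ORNAMENT

# A ZWJ sequence contains no VS16 here, so it keeps all seven code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(strip_vs16(family)))  # 7
```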
