Python: How to check if a unicode string contains a cased character?
I'm writing a filter that checks whether a unicode (UTF-8 encoded) string contains no uppercase characters (in any language). It's fine with me if the string doesn't contain any cased character at all.
For example: 'Hello!' will not pass the filter, but "!" should pass the filter, since "!" is not a cased character.
I planned to use the islower() method, but in the example above, "!".islower() will return False.
According to the Python Docs, "The python unicode method islower() returns True if the unicode string's cased characters are all lowercase and the string contained at least one cased character, otherwise, it returns False."
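To make that behaviour concrete, here is a minimal check. It assumes Python 3, where str is already Unicode (the question itself uses the Python 2 unicode type); the sample strings are only illustrations.

```python
# Python 3 sketch: str is already Unicode, so no unicode(...) call is needed.
samples = ["hello!", "Hello!", "!"]

# islower() is True only if there is at least one cased character
# and every cased character is lowercase.
results = {s: s.islower() for s in samples}

for s, r in results.items():
    print(repr(s), r)
```

The last sample is the problematic one: "!" has no cased characters, so islower() reports False even though nothing in it is uppercase.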
Since the method also returns False when the string doesn't contain any cased character (e.g. "!"), I want to check first whether the string contains any cased character at all.
Something like this....
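    def passes_filter(string):
        # Check first whether the string contains any cased characters.
        if not contains_cased(string):
            return True
        # Otherwise, pass only if all cased characters are lowercase.
        return string.islower()

    string = unicode("!@#$%^", 'utf-8')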
Any suggestions for a contains_cased() function?
Or probably a different implementation approach?
Thanks!
Comments (3)
Here is the full scoop on Unicode character categories.

Letter categories include:
Ll: lowercase letter
Lu: uppercase letter
Lt: titlecase letter
Lm: modifier letter
Lo: other letter

Note that Ll <-> islower(); similarly Lu <-> isupper(); (Lu or Lt) <-> istitle().

You may wish to read the complicated discussion on casing, which includes some discussion of Lm letters.

Blindly treating all "letters" as cased is demonstrably wrong. The Lo category includes 45301 codepoints in the BMP (counted using Python 2.6). A large chunk of these would be Hangul Syllables, CJK Ideographs, and other East Asian characters; it is very hard to see how they might be considered "cased".

You might like to consider an alternative definition, based on the (unspecified) behaviour of "cased characters" that you expect. Here's a simple first attempt:

Interestingly, there are 1216 Ll and 937 Lu codepoints, a total of 2153, which leaves scope for further investigation of what Ll and Lu really mean.
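One possible reading of such a behaviour-based definition, offered as a sketch rather than as this answer's original code: treat a character as cased if upper() or lower() changes it. The helper names is_cased and contains_cased are hypothetical, and the code assumes Python 3. The counts it prints depend on the Unicode version bundled with the interpreter, so they will not necessarily match the Python 2.6 figures quoted above.

```python
import unicodedata
from collections import Counter


def is_cased(ch):
    """Hypothetical helper: a character is 'cased' if a case mapping changes it."""
    return ch.upper() != ch or ch.lower() != ch


def contains_cased(text):
    """Return True if any character in text is cased by the definition above."""
    return any(is_cased(ch) for ch in text)


# Every code point in the Basic Multilingual Plane (U+0000 to U+FFFF).
bmp = [chr(i) for i in range(0x10000)]

# How many BMP characters are cased by behaviour ...
cased_count = sum(is_cased(ch) for ch in bmp)

# ... versus how many fall into the Ll and Lu general categories.
category_counts = Counter(unicodedata.category(ch) for ch in bmp)

print("cased by behaviour:", cased_count)
print("Ll:", category_counts["Ll"], "Lu:", category_counts["Lu"])
```

With this definition, contains_cased(u"!@#$%^") is False and contains_cased(u"Hello!") is True. Any gap between the behaviour-based count and the Ll + Lu total comes from characters whose category says "letter with case" but which have no case mapping, or the reverse.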
Use the unicodedata module: unicodedata.category() returns "Ll" for lowercase letters and "Lu" for uppercase ones. Here you can find a list of Unicode character categories.
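A minimal sketch of how the category lookup could drive the filter from the question, assuming Python 3. The names UPPER_CATEGORIES, contains_uppercase and passes_filter are hypothetical, and counting titlecase letters (Lt) as uppercase is an assumption the question does not settle.

```python
import unicodedata

# Assumption: titlecase letters (Lt) are rejected along with uppercase (Lu).
UPPER_CATEGORIES = {"Lu", "Lt"}


def contains_uppercase(text):
    """Return True if any character has an uppercase or titlecase category."""
    return any(unicodedata.category(ch) in UPPER_CATEGORIES for ch in text)


def passes_filter(text):
    """Pass strings with no uppercase letters, including strings with no letters."""
    return not contains_uppercase(text)


for sample in ["!", "hello!", "Hello!", "привет", "Привет", "你好"]:
    print(repr(sample), passes_filter(sample))
```

Checking directly for uppercase categories avoids the need for a separate contains_cased() step: a string such as "!" passes simply because none of its characters is in Lu or Lt. The check relies solely on the general category, so it is only as precise as that classification.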