How do I do a Python split() on languages (such as Chinese) that don't use whitespace as a word separator?
I want to split a sentence into a list of words.
For English and European languages this is easy, just use split()
>>> "This is a sentence.".split()
['This', 'is', 'a', 'sentence.']
But I also need to deal with sentences in languages such as Chinese that don't use whitespace as a word separator.
>>> u"这是一个句子".split()
[u'\u8fd9\u662f\u4e00\u4e2a\u53e5\u5b50']
Obviously that doesn't work.
How do I split such a sentence into a list of words?
UPDATE:
So far the answers seem to suggest that this requires natural language processing techniques and that the word boundaries in Chinese are ambiguous. I'm not sure I understand why. The word boundaries in Chinese seem very definite to me. Each Chinese word/character has a corresponding Unicode code point and is displayed on screen as a separate word/character.
So where does the ambiguity come from? As you can see in my Python console output, Python has no problem telling that my example sentence is made up of 6 characters:
这 - u8fd9
是 - u662f
一 - u4e00
个 - u4e2a
句 - u53e5
子 - u5b50
So obviously Python has no problem telling the word/character boundaries. I just need those words/characters in a list.
You can do this, but not with standard library functions, and regular expressions won't help you either.
The task you are describing is part of the field called Natural Language Processing (NLP). Quite a lot of work has already been done on splitting Chinese text at word boundaries. I'd suggest that you use one of these existing solutions rather than trying to roll your own.
What you have listed there are Chinese characters. These are roughly analogous to letters or syllables in English (but not quite the same, as NullUserException points out in a comment). There is no ambiguity about where the character boundaries are; that is very well defined. But you asked not for character boundaries but for word boundaries, and Chinese words can consist of more than one character.
If all you want is to find the characters, then this is very simple and does not require an NLP library. Simply decode the message into a unicode string (if that is not already done) and then convert the unicode string to a list by calling the built-in function list(). This will give you a list of the characters in the string. For your specific example:
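A minimal sketch of that in a Python 2 session, assuming the input arrives as a UTF-8 byte string and has to be decoded first:

>>> s = "这是一个句子"        # UTF-8 encoded byte string
>>> u = s.decode("utf-8")     # decode to a unicode string
>>> list(u)                   # split into individual characters
[u'\u8fd9', u'\u662f', u'\u4e00', u'\u4e2a', u'\u53e5', u'\u5b50']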
Just a word of caution: using list('...') (in Py3; that's u'...' for Py2) will not, in the general sense, give you the characters of a unicode string; rather, it will most likely result in a series of 16-bit code points. This is true for all 'narrow' CPython builds, which account for the vast majority of Python installations today.
When Unicode was first proposed in the 1990s, it was suggested that 16 bits would be more than enough to cover all the needs of a universal text encoding, as it enabled a move from 128 code points (7 bits) and 256 code points (8 bits) to a whopping 65,536 code points. It soon became apparent, however, that this had been wishful thinking; today, around 100,000 code points are defined in Unicode version 5.2, and thousands more are pending inclusion. For that to become possible, Unicode had to move from 16 to (conceptually) 32 bits (although it doesn't make full use of the 32-bit address space).
To maintain compatibility with software built on the assumption that Unicode was still 16 bits, so-called surrogate pairs were devised, where two 16-bit code points from specifically designated blocks are used to express code points beyond 65,536, that is, beyond what Unicode calls the 'basic multilingual plane' (BMP). These are jokingly referred to as the 'astral' planes of the encoding, for their relative elusiveness and the constant headache they cause people working in the field of text processing and encoding.
Now, while narrow CPython deals with surrogate pairs quite transparently in some cases, it will still fail to do the right thing in others, and string splitting is one of the more troublesome cases. In a narrow Python build, list(u'abc大\U00027C3Cdef') (or list(u'abc\u5927\U00027C3Cdef') when written entirely with escapes; the second escape is a CJK ideograph outside the BMP) will result in ['a', 'b', 'c', '大', '\ud85f', '\udc3c', 'd', 'e', 'f'], with '\ud85f', '\udc3c' being a surrogate pair. Incidentally, '\ud85f\udc3c' is what the JSON standard expects you to write in order to represent U-27C3C. Either of these code points is useless on its own; a well-formed unicode string can only ever contain pairs of surrogates. So what you want in order to split a string into characters is really:
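A minimal sketch of such a splitter, assuming a narrow Python 2 build (the helper name split_unicode_chrs is just for illustration):

import re

# Match either a whole surrogate pair or any single code unit, so that
# characters outside the BMP are kept together as one element.
_char_splitter = re.compile(u"(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)").split

def split_unicode_chrs(text):
    # re.split() with a capturing group interleaves empty strings; drop them.
    return [chunk for chunk in _char_splitter(text) if chunk]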
which correctly returns ['a', 'b', 'c', '大', u'\U00027c3c', 'd', 'e', 'f'], with the non-BMP character kept together as a single element (note: you can probably rewrite the regular expression so that filtering out empty strings becomes unnecessary).
If all you want to do is split a text into Chinese characters, you'd be pretty much done at this point. I'm not sure what the OP's concept of a 'word' is, but to me 这是一个句子 may equally be split into 这 | 是 | 一 | 个 | 句子 or into 这是 | 一个 | 句子, depending on your point of view. However, anything that goes beyond the concept of (possibly composed) characters and character classes (symbols vs whitespace vs letters and such) goes well beyond what is built into Unicode and Python; you'll need some natural language processing for that. Let me remark that while your example 'yes the United Nations can!'.split() does successfully demonstrate that the split method does something useful to a lot of data, it does not parse the English text into words correctly: it fails to recognize United Nations as one word, while it falsely assumes can! is a word, which it is clearly not. This method gives both false positives and false negatives. Depending on your data and what you intend to accomplish, this may or may not be what you want.
OK, I figured it out.
What I need can be accomplished by simply using list():
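For example, in a Python 2 session along these lines:

>>> list(u"这是一个句子")
[u'\u8fd9', u'\u662f', u'\u4e00', u'\u4e2a', u'\u53e5', u'\u5b50']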
Thanks for all your inputs.
The best tokenizer tool for Chinese is pynlpir.
Be aware that pynlpir has a notorious but easily fixable licensing problem, for which you can find plenty of solutions on the internet.
You simply need to replace the NLPIR.user file in your NLPIR folder with a valid licence downloaded from this repository and restart your environment.
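Roughly, usage looks like this (a sketch; the exact segmentation you get depends on the NLPIR model, so check the pynlpir documentation):

import pynlpir

pynlpir.open()                                    # load the NLPIR engine and licence
words = pynlpir.segment(u"这是一个句子", pos_tagging=False)
print(words)                                      # a list of word strings
pynlpir.close()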
Languages like Chinese have a very fluid definition of a word. E.g. one meaning of ma is "horse", and one meaning of shang is "above" or "on top of". A compound is "mashang", which literally means "on horseback" but is used figuratively to mean "immediately". You need a very good dictionary that contains compounds, and looking words up in it needs a longest-match approach. Compounding is rife in German (a famous example is something like "Danube steam navigation company director's wife" being expressed as one word), Turkic languages, Finnish, and Magyar -- these languages have very long words, many of which won't be found in a dictionary and need breaking down to be understood. Your problem is one of linguistics, nothing to do with Python.
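A toy sketch of the longest-match idea, using a tiny hypothetical dictionary (real segmenters use far larger dictionaries plus statistics):

# -*- coding: utf-8 -*-
# Greedy forward longest-match segmentation over a toy dictionary.
DICTIONARY = {u"这", u"是", u"一", u"个", u"一个", u"句子", u"马", u"上", u"马上"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def longest_match_segment(text):
    words = []
    i = 0
    while i < len(text):
        # try the longest candidate first, then fall back to shorter ones;
        # a single character is always accepted as a last resort
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(longest_match_segment(u"这是一个句子"))   # splits into 这 / 是 / 一个 / 句子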
It's partially possible with Japanese, where you usually have different character classes at the beginning and end of the word, but there are whole scientific papers on the subject for Chinese. I have a regular expression for splitting words in Japanese if you are interested: http://hg.hatta-wiki.org/hatta-dev/file/cd21122e2c63/hatta/search.py#l19
Try this: http://code.google.com/p/pymmseg-cpp/
list() is the answer for a Chinese-only sentence. For hybrid English/Chinese text, which is the more common case, this is answered at hybrid-split; see Winter's answer there.
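One way to handle the hybrid case (a sketch, not Winter's exact code): a regular expression that pulls out CJK characters one by one and keeps Latin words and digit runs as whole tokens. The character range below is only a rough approximation of "Chinese":

# -*- coding: utf-8 -*-
import re

# One CJK character at a time, or a run of digits, or a Latin word
# (optionally with an apostrophe, e.g. "isn't").
_TOKEN_RE = re.compile(u"[\u4e00-\u9fff]|[0-9]+|[a-zA-Z]+'?[a-z]*", re.UNICODE)

def split_hybrid(text):
    return _TOKEN_RE.findall(text)

print(split_hybrid(u"Python真的很好用, isn't it?"))
# tokens: Python / 真 / 的 / 很 / 好 / 用 / isn't / it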