How do word segmentation and pattern matching work in Chinese?
This question involves both computing and knowledge of Chinese.
I have Chinese queries and a separate list of Chinese phrases. I need to find which of the queries contain any of these phrases.
In English, this is a very simple task. I don't understand Chinese at all (its semantics, grammar rules, etc.), so I'd appreciate it if somebody on this forum who understands Chinese could give me a basic understanding of the language and of how pattern matching is done for it.
My basic perception is that in Chinese one unit (with no spaces in it) can actually mean more than one word. Is this correct? If so, are there any rules for how several words combine to form a single unit? It is confusing because there are spaces in Chinese writing, yet even a unit without spaces contains more than one word.
Any links that explain Chinese from a computational point of view, pattern matching, etc. would be very useful.
Comments (2)
In Chinese, spaces are rarely used. What appear to be spaces in Chinese text are actually just Chinese punctuation characters, which have more padding than usual.
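Since Chinese text carries no spaces between words, one simple way to approach the task in the question is a plain substring test on the raw characters. The sketch below assumes the queries and phrases are already decoded Unicode strings; the function name and the sample data are made up for illustration. Note that a substring test ignores word boundaries, so a phrase can also match across two adjacent words.

```python
def find_matching_queries(queries, phrases):
    """Map each query to the list of phrases it contains as substrings."""
    matches = {}
    for query in queries:
        # `in` compares Unicode code points, so no tokenization is needed
        hits = [phrase for phrase in phrases if phrase in query]
        if hits:
            matches[query] = hits
    return matches


# Hypothetical sample data, for illustration only
queries = ["我想买一部新手机", "北京今天天气怎么样", "how to cook rice"]
phrases = ["手机", "天气", "电脑"]
result = find_matching_queries(queries, phrases)
```

For a long phrase list, a multi-pattern algorithm such as Aho-Corasick would likely scale better than this nested loop.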
Think of it this way: one Chinese character is very, very roughly similar to one English word. Often two or more characters need to be combined to form one word, and each separate character may mean something completely different depending on context.
To meaningfully tokenize Chinese text, you'd have to segment it into words with that in mind.
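As a rough illustration of what segmenting means, here is a minimal sketch of forward maximum matching, a classic dictionary-based approach: at each position, take the longest dictionary word that starts there. The `vocabulary` set, the `max_word_len` limit, and the sample sentence are assumptions for the example, not part of any real tool. Being greedy, this method can pick the wrong split for ambiguous text, and real segmenters generally rely on statistical models.

```python
def segment(text, vocabulary, max_word_len=4):
    """Greedy forward maximum matching against a word list."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionary, for illustration only
vocabulary = {"北京", "今天", "天气", "怎么样"}
tokens = segment("北京今天天气怎么样", vocabulary)
```

With the segmented output, a phrase can then be matched against whole words rather than arbitrary character runs.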
See Chinese Natural Language Processing and Speech Processing, from the Stanford NLP group.
Ken Lunde's book CJKV Information Processing is probably worth a look.
The basic word order is subject-verb-object, but see also "Topic prominence" in http://en.wikipedia.org/wiki/Chinese_grammar