Split a Chinese sentence into separate words
I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).
At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:

- try to find the first two characters of the sentence in the database (主楼);
- if 主楼 is actually a word and it's in the database, the script will try to find the first three characters (主楼怎);
- 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word;
- try to do the same with the rest of the characters.
I don't really like this approach, because to analyze even a small text it would query the database too many times.
Are there any other solutions to this?
8 Answers
You might want to consider using a trie data structure. You first construct the trie from the dictionary; searching for valid words will then be much faster. The advantage is that determining whether you are at the end of a word or need to continue looking for longer words is very fast.
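A minimal sketch of that idea in Python, assuming a tiny hypothetical lexicon (a real one would be loaded from the dictionary):

```python
# Minimal trie sketch: build once from the word list, then walk the text
# character by character. The three-word lexicon is a hypothetical stand-in
# for a real dictionary.

class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if the path from the root spells a word

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def longest_match(trie, text, start=0):
    """Return the longest dictionary word beginning at text[start], or None."""
    node, best = trie, None
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break                        # no longer word can begin here
        if node.is_word:
            best = text[start:i + 1]
    return best

trie = build_trie(["主楼", "怎么", "走"])
print(longest_match(trie, "主楼怎么走"))     # → 主楼
print(longest_match(trie, "主楼怎么走", 2))  # → 怎么
```

Each call walks at most one path through the trie, so deciding whether to stop or keep extending the current word costs one in-memory step per character, with no database queries at all.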
You have the input text: a sentence, a paragraph, whatever. So yes, your processing of it will need to query against your DB for each check.
With decent indexing on the word column though, you shouldn't have too many problems.
Having said that, how big is this dictionary? After all, you would only need the words, not their definitions to check whether it's a valid word. So if at all possible (depending on the size), having a huge memory map/hashtable/dictionary with just keys (the actual words) may be an option and would be quick as lightning.
At 15 million words, with an average of 7 characters at 2 bytes each, that works out to around the 200-megabyte mark. Not too crazy.
Edit: At 'only' 1 million words, you're looking at around just over 13 Megabytes, say 15 with some overhead. That's a no-brainer I would say.
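For illustration, a sketch of the in-memory variant in Python, with a hypothetical four-word list standing in for the real dictionary dump:

```python
import sys

# Sketch of the in-memory idea: a plain set of words gives O(1) average
# membership tests with no database round-trips. The word list here is a
# hypothetical stand-in for the real dictionary (keys only, no definitions).
words = {"主楼", "怎么", "走", "怎么走"}

print("主楼" in words)    # True — one hash lookup, no query
print("主楼怎" in words)  # False

# Rough memory check (actual figures vary by Python version):
print(sys.getsizeof(words))  # size of the set structure itself, in bytes
```

Loading the whole word list once at startup trades a few megabytes of RAM for the elimination of every per-check database round-trip.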
I do realize that the Chinese word segmentation problem is a very complex one, but in some cases this trivial algorithm may be sufficient: search for the longest word w starting at the i-th character, then start again at the (i+length(w))-th character.
Here's a Python implementation:
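(The listing below is a reconstruction of that greedy rule, not the original code; `dictionary` is assumed to be a set of valid words, e.g. headwords extracted from the CCEDICT dump.)

```python
# Greedy longest-match segmentation: at each position, take the longest
# dictionary word; unknown single characters pass through unchanged.

def segment(text, dictionary, max_word_len=8):
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrinking one character at a time
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print(segment("主楼怎么走", {"主楼", "怎么", "走"}))  # → ['主楼', '怎么', '走']
```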
The last part uses a copy of the CCEDICT dictionary to segment (simplified) Chinese text in two flavours (with and without non-word characters, respectively):
Well, if you have a database with all the words and there is no other way to get those words, I think you are forced to re-query the database.
(using ABCDE to represent Chinese characters for simplicity)
Let's say you've got the 'sentence' ABCDE as input, and your dictionary contains these words that start with A: AB, ABC, AC, AE, and ABB. And presume that the word CDE exists, but neither DE nor E does.
When parsing the input sentence, going left to right, the script pulls the first character A. Instead of querying the database to see if A is a word, query the database to pull all words that start with A.
Loop through those results, grabbing the next few characters from the input string to get a proper comparison:
At this point the program forks down the two 'true' branches it found. On the first, it presumes AB is the first word, and tries to find C-starting words. CDE is found, so that branch is possible. Down the other branch, ABC is the first word, but DE is not possible, so that branch is invalid, meaning the first must be the true interpretation.
I think this method minimizes the number of calls to the database (though it might return larger sets from the database, since you're fetching sets of words all starting with the same character). If your database were indexed for this sort of searching, I think this would work better than going letter by letter. Looking at this whole process now, and at the other answers, I think this is actually a trie structure (with the character searched for as the root of a tree), as another poster suggested. Well, here's an implementation of that idea!
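For illustration, a sketch of that branching search in Python, with the "database" replaced by an in-memory index from first character to the words starting with it (using the ABCDE lexicon from the example above):

```python
# Branching segmentation sketch: one "query" per position fetches all words
# sharing the first character, then each matching word opens a branch.
# Branches that cannot consume the rest of the input die off.

WORDS = ["AB", "ABC", "AC", "AE", "ABB", "CDE"]
BY_FIRST = {}
for w in WORDS:
    BY_FIRST.setdefault(w[0], []).append(w)   # stand-in for "WHERE word LIKE 'A%'"

def segmentations(text):
    """Return every way to split `text` entirely into dictionary words."""
    if not text:
        return [[]]
    results = []
    for word in BY_FIRST.get(text[0], []):
        if text.startswith(word):
            for rest in segmentations(text[len(word):]):
                results.append([word] + rest)
    return results

print(segmentations("ABCDE"))  # → [['AB', 'CDE']]
```

On ABCDE the AB branch succeeds (AB + CDE), while the ABC branch dies because no word covers DE, matching the walkthrough above.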
A good and fast way to segment Chinese text is based on Maximum Matching Segmentation, which basically tests words of different lengths to see which combination of segmentations is most likely. It takes in a list of all possible words to do so.
Read more about it here: http://technology.chtsai.org/mmseg/
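The core rule can be sketched roughly like this in Python; the tiny lexicon is hypothetical, and the tie-break on first-word length is a simplified stand-in for MMSEG's additional rules:

```python
# Maximum-matching sketch: among all chunks of up to three consecutive
# words starting at a position, prefer the chunk with the greatest total
# length, then commit only the first word of the winning chunk.

LEXICON = {"主", "主楼", "怎", "怎么", "么", "走"}

def words_at(text, i):
    # all dictionary words starting at position i, or the bare character
    return [text[i:j] for j in range(i + 1, len(text) + 1)
            if text[i:j] in LEXICON] or [text[i:i + 1]]

def chunks(text, i, depth=3):
    # every sequence of up to `depth` consecutive words starting at i
    if depth == 0 or i >= len(text):
        return [[]]
    out = []
    for w in words_at(text, i):
        for rest in chunks(text, i + len(w), depth - 1):
            out.append([w] + rest)
    return out

def segment(text):
    result, i = [], 0
    while i < len(text):
        best = max(chunks(text, i),
                   key=lambda c: (sum(map(len, c)), len(c[0])))
        result.append(best[0])   # commit only the first word of the best chunk
        i += len(best[0])
    return result

print(segment("主楼怎么走"))  # → ['主楼', '怎么', '走']
```

Here the ambiguous prefix 主/主楼 is resolved because the three-word chunk 主楼+怎么+走 covers more characters than any chunk beginning with 主 alone.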
That's the method I use in my 读者 (DuZhe) Text Analyzer (http://duzhe.aaginskiy.com). I don't use a database; I actually pre-load the list of words into an array, which takes up about 2 MB of RAM but executes very quickly.

If you are looking into using lexical segmentation over statistical methods (though statistical methods can be as accurate as ~97% according to some research), a very good segmentation tool is ADSOtrans, which can be found here: http://www.adsotrans.com
It uses a database but has a lot of redundant tables to speed up the segmentation. You can also provide grammatical definitions to assist the segmentation.
This is a fairly standard task in computational linguistics. It goes by the name "tokenization" or "word segmentation." Try searching for "chinese word segmentation" or "chinese tokenization" and you'll find several tools that have been made to do this task, as well as papers about research systems to do it.
To do this well, you typically will need to use a statistical model built by running a machine learning system on a fairly large training corpus. Several of the systems you can find on the web come with pre-trained models.
You can build a very, very long regular expression.

Edit:
I meant building it automatically with a script from the DB, not writing it by hand.
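A rough sketch of how such an expression could be generated from the word list (the sample words are hypothetical; longest alternatives go first, because Python's `re` alternation takes the first branch that matches):

```python
import re

# Build one big alternation from the word list. Sorting by length,
# longest first, makes the eager alternation prefer whole words over
# their prefixes (怎么 before 怎).
words = ["主楼", "怎么", "走", "怎"]
pattern = re.compile("|".join(
    re.escape(w) for w in sorted(words, key=len, reverse=True)))

print(pattern.findall("主楼怎么走"))  # → ['主楼', '怎么', '走']
```

Note that findall silently skips characters not covered by any word, and a pattern generated from a million-word dictionary would be enormous, so this is more a curiosity than a practical segmenter.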