A viable solution for Khmer word segmentation?

Posted 2024-10-15 06:14:55

I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and those projects have fallen by the wayside.

Here is a sample line of Khmer that needs to be split (they can be longer than this):

ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះមកដល់រូបអ្នកដោយព្រោះអង្គព្រះយេស៊ូវ ហើយដែលអ្នកមិនអាចរកការទាំងអស់នោះដោយសារការប្រព្រឹត្តរបស់អ្នកឡើយ។

The goal of creating a viable solution that splits Khmer words is twofold: it will encourage those who used Khmer legacy (non-Unicode) fonts to convert over to Unicode (which has many benefits), and it will enable legacy Khmer fonts to be imported into Unicode to be used with a spelling checker quickly (rather than manually going through and splitting words which, with a large document, can take a very long time).

I don't need 100% accuracy, but speed is important (especially since the lines that need to be split into Khmer words can be quite long).
I am open to suggestions, but currently I have a large corpus of Khmer words that are correctly split (with a non-breaking space), and I have created a word-probability dictionary file (frequency.csv) to use as the dictionary for the word splitter.

I found this Python code here that uses the Viterbi algorithm, and it supposedly runs fast:

import re
from itertools import groupby

def viterbi_segment(text):
    # probs[i] is the probability of the best segmentation of text[:i];
    # lasts[i] is the index where the last word of that segmentation starts.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk back through the recorded breakpoints to recover the words.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary.get(word, 0) / total
def words(text): return re.findall('[a-z]+', text.lower())

# Build the word-frequency dictionary (here from an English corpus, big.txt).
dictionary = dict((w, len(list(ws)))
                  for w, ws in groupby(sorted(words(open('big.txt').read()))))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
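
For the Khmer case, the English-corpus dictionary built from big.txt above would presumably be replaced by the frequency list. Here is a minimal sketch of that swap (Python 3), assuming frequency.csv holds one UTF-8 "word,count" row per line; the actual column layout of the file is an assumption:

import csv

# Assumed layout: one "word,count" row per line, UTF-8 encoded.
dictionary = {}
with open('frequency.csv', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        if len(row) >= 2:
            dictionary[row[0]] = float(row[1])

max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

# viterbi_segment() and word_prob() stay exactly as above.
segmented, prob = viterbi_segment('ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះ')
print(segmented)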

I also tried using the source Java code from the author of this page: Text segmentation: dictionary-based word splitting, but it ran too slow to be of any use (because my word-probability dictionary has over 100k terms...).

And here is another option in Python from Detect most likely words from text without spaces / combined words:

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    # Try every prefix as a word; memoize the best split of each suffix in `cache`.
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in xrange(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print split_text('filesaveas', WORD_FREQUENCIES, {})

--> (1.3653e-08, ['file', 'save', 'as'])

I am a newbie when it comes to Python, and I am really new to all real programming (outside of websites), so please bear with me. Does anyone have any options that they feel would work well?

Comments (3)

余生再见 2024-10-22 06:14:55

The ICU library (that has Python and Java bindings) has a DictionaryBasedBreakIterator class that can be used for this.
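
For reference, here is a minimal sketch of that approach with the PyICU bindings (pip install PyICU). ICU's word break iterator is dictionary-based for Khmer; the exact iteration idiom can vary slightly between PyICU versions, so treat this as an outline rather than tested code:

from icu import BreakIterator, Locale

def khmer_words(text):
    # ICU's word break iterator uses its built-in Khmer dictionary.
    bi = BreakIterator.createWordInstance(Locale('km'))
    bi.setText(text)
    words, start = [], bi.first()
    for end in bi:                 # iterate over break positions
        segment = text[start:end]
        if segment.strip():        # drop whitespace/punctuation segments
            words.append(segment)
        start = end
    return words

print(khmer_words('ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះ'))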

内心激荡 2024-10-22 06:14:55

The Python example with filesaveas appears to recurse through the entire input string (for i in xrange(1, len(text) + 1)), stuffing the best results into the cache along the way; at each potential word, it then starts looking at the next word (which will in turn look at the word after that, and so on), and if that second word doesn't look very good, it won't save that particular one. It feels like O(N!) runtime, where N is the length of the input string.

Super clever, but probably horrible for anything but simple tasks. What's the longest Khmer word you've got? I'm hoping < 20 characters.

Maybe if you feed input into that example 20 characters at a time you can keep the runtime down to something approaching reasonable. Feed in the first 20 chars, suck off the first word, and then feed in the remaining input. If you re-use the cache it might do something silly like store partial words along the way.
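
A rough sketch of that windowing idea follows, assuming a segment_chunk(text) helper that returns a list of words for a short chunk (for example, a thin wrapper around one of the segmenters above); the helper name and the 20-character window are assumptions:

WINDOW = 20  # guessed upper bound on Khmer word length

def split_by_window(text, segment_chunk, window=WINDOW):
    words, pos = [], 0
    while pos < len(text):
        chunk_words = segment_chunk(text[pos:pos + window])
        if not chunk_words:
            # nothing recognized: emit one character and keep moving
            words.append(text[pos])
            pos += 1
            continue
        first = chunk_words[0]   # keep only the first word of the chunk
        words.append(first)
        pos += len(first)
    return words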

On a completely different tack, how many Khmer words are formed by concatenating two or more legal Khmer words? (similar to 'penknife' or 'basketball') If not too many, it might make sense to create a set of dictionaries, segregated by length of word, mapping from word to probability of use.

Say, the longest Khmer word is 14 chars long; feed in 14 characters of input into the len14 dictionary, store the probability. Feed in 13 characters into len13, store the probability. Feed in 12 characters ... all the way down to 1 into len1. Then pick the interpretation with the highest probability, save the word, strip off that many characters, and try again.
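
A sketch of that length-bucketed lookup, assuming freq_by_length maps word length to a {word: probability} dict and 14 is the longest word length (both assumptions, following the answer's example):

def greedy_segment(text, freq_by_length, max_len=14):
    words, pos = [], 0
    while pos < len(text):
        best_word, best_prob = None, 0.0
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            prob = freq_by_length.get(length, {}).get(candidate, 0.0)
            if prob > best_prob:
                best_word, best_prob = candidate, prob
        if best_word is None:
            best_word = text[pos]   # unknown character: pass it through
        words.append(best_word)
        pos += len(best_word)
    return words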

So that it won't fail badly for inputs like "I" vs. "Image", maybe longer matches should have automatically inflated probabilities?

Thanks for the fun question ;) I didn't know of any languages like this, pretty cool.

牵你的手,一向走下去 2024-10-22 06:14:55

I think this is a good idea, as it is.

I suggest that, once you have some experience with it, you add some rules that can be very specific: for example, depending on the word before, the word after, the surrounding words, or the sequence of words before the current word, just to enumerate the most frequent cases. You can find a set of such rules in the gposttl.sf.net project, which is a POS-tagging project, in the file data/contextualrulefile.

Rules should be applied AFTER the statistical evaluation is finished; they provide some fine-tuning and can improve accuracy remarkably.
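
A hypothetical example of one such rule, applied after the statistical pass: merge two adjacent segments whenever the joined form is itself a dictionary word above some probability threshold. The threshold value and the dictionary mapping are assumptions, not part of the gposttl rule set:

def apply_merge_rule(words, dictionary, threshold=1e-6):
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words):
            joined = words[i] + words[i + 1]
            if dictionary.get(joined, 0.0) > threshold:
                out.append(joined)   # prefer the compound reading
                i += 2
                continue
        out.append(words[i])
        i += 1
    return out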
