A viable solution for word splitting in Khmer?


I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and those projects have fallen by the wayside.

Here is a sample line of Khmer that needs to be split (they can be longer than this):

ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះមកដល់រូបអ្នកដោយព្រោះអង្គព្រះយេស៊ូវ ហើយដែលអ្នកមិនអាចរកការទាំងអស់នោះដោយសារការប្រព្រឹត្តរបស់អ្នកឡើយ។

The goal of creating a viable solution that splits Khmer words is twofold: it will encourage those who used Khmer legacy (non-Unicode) fonts to convert over to Unicode (which has many benefits), and it will allow text in legacy Khmer fonts to be converted to Unicode and run through a spelling checker quickly (rather than manually going through and splitting words, which can take a very long time for a large document).

I don't need 100% accuracy, but speed is important (especially since the line that needs to be split into Khmer words can be quite long).
I am open to suggestions, but currently I have a large corpus of Khmer words that are correctly split (with a non-breaking space), and I have created a word probability dictionary file (frequency.csv) to use as a dictionary for the word splitter.

I found this Python code here that uses the Viterbi algorithm, and it supposedly runs fast.

import re
from itertools import groupby

def viterbi_segment(text):
    # probs[i] holds the probability of the best segmentation of text[:i];
    # lasts[i] holds the start index of the last word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk the back-pointers to recover the words.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary.get(word, 0) / total
def words(text): return re.findall('[a-z]+', text.lower())  # English-only; would need changing for Khmer
dictionary = dict((w, len(list(ws)))
                  for w, ws in groupby(sorted(words(open('big.txt').read()))))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
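
For my frequency.csv, I'm guessing the English-specific setup at the end of that block (the [a-z]+ regex and big.txt) would be replaced with something roughly like this; the one-word,count-pair-per-row layout is just my assumption about the file:

import csv

# Sketch only: build the same dictionary / max_word_length / total globals
# from frequency.csv instead of big.txt (assumed layout: word,count per row).
dictionary = {}
with open('frequency.csv', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        if len(row) >= 2:
            word, count = row[0], float(row[1])
            dictionary[word] = dictionary.get(word, 0.0) + count

max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))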

I also tried the source Java code from the author of this page, Text segmentation: dictionary-based word splitting, but it ran too slowly to be of any use (because my word probability dictionary has over 100k terms...).

And here is another option in Python, from Detect most likely words from text without spaces / combined words:

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    # Returns (probability, list_of_words) for the best split of `text`,
    # memoizing intermediate results in `cache`.
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in range(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            # Recursively split whatever follows this candidate word.
            remainder_freq, remainder_words = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_words
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))

--> (1.3653e-08, ['file', 'save', 'as'])

I am a newbie when it comes to Python, and I am really new to all real programming (outside of websites), so please bear with me. Does anyone have any options that they feel would work well?


3 Answers

余生再见 2024-10-22 06:14:55

The ICU library (which has Python and Java bindings) has a DictionaryBasedBreakIterator class that can be used for this.
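
A rough sketch with the PyICU bindings might look like the following; how well the bundled break data handles Khmer depends on the ICU version, so treat this as the shape of the call rather than a tested solution:

from icu import BreakIterator, Locale

def icu_segment(text):
    # Word break iterator for the Khmer locale; ICU uses dictionary-based
    # segmentation for the spaceless scripts it has data for.
    bi = BreakIterator.createWordInstance(Locale('km'))
    bi.setText(text)
    start = bi.first()
    words = []
    for end in bi:  # iterating yields successive boundary offsets
        piece = text[start:end]
        start = end
        if not piece.isspace():
            words.append(piece)
    return words

print(icu_segment('ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះ'))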

内心激荡 2024-10-22 06:14:55

The Python code with the filesaveas example appears to recurse through the entire input string (for i in range(1, len(text) + 1)), stuffing the best results into the cache along the way; at each potential word, it then starts looking at the next word (which will in turn look at the word after that, and so on), and if that second word doesn't look very good, it won't save that particular one. It feels like O(N!) runtime, where N is the length of the input string.

Super clever, but probably horrible for anything but simple tasks. What's the longest Khmer word you've got? I'm hoping < 20 characters.

Maybe if you feed input into that example 20 characters at a time you can keep the runtime down to something approaching reasonable. Feed in the first 20 characters, strip off the first word, and then feed in the remaining input. If you re-use the cache it might do something silly, like store partial words along the way.
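
Something like this rough, untested sketch is what I mean, reusing split_text and WORD_FREQUENCIES from the question:

WINDOW = 20  # assumed upper bound on word length, per the guess above

def segment_in_windows(text):
    # Split a short window at a time and keep only its first word, so the
    # recursive search never sees more than WINDOW characters at once.
    words = []
    pos = 0
    while pos < len(text):
        window = text[pos:pos + WINDOW]
        # Fresh cache each call, so partial words from a previous window
        # can't leak in.
        _, split = split_text(window, WORD_FREQUENCIES, {})
        # If the window cuts the last word in half the whole split can come
        # back empty; fall back to emitting one character and moving on.
        first = split[0] if split else window[0]
        words.append(first)
        pos += len(first)
    return words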

On a completely different tack, how many Khmer words are formed by concatenating two or more legal Khmer words? (similar to 'penknife' or 'basketball') If not too many, it might make sense to create a set of dictionaries, segregated by length of word, mapping from word to probability of use.

Say, the longest Khmer word is 14 chars long; feed in 14 characters of input into the len14 dictionary, store the probability. Feed in 13 characters into len13, store the probability. Feed in 12 characters ... all the way down to 1 into len1. Then pick the interpretation with the highest probability, save the word, strip off that many characters, and try again.

So that it won't fail badly for inputs like "I" vs. "Image", maybe longer matches should have automatically inflated probabilities?
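
A rough sketch of that highest-probability-match loop, assuming the word counts from the question have been split into per-length dictionaries:

MAX_LEN = 14  # assumed longest Khmer word, per the guess above

def build_length_dicts(dictionary, total):
    # One word -> probability dict per word length, as described above.
    by_len = {n: {} for n in range(1, MAX_LEN + 1)}
    for word, count in dictionary.items():
        if 1 <= len(word) <= MAX_LEN:
            by_len[len(word)][word] = count / total
    return by_len

def greedy_segment(text, by_len):
    words = []
    pos = 0
    while pos < len(text):
        # Fall back to a single character if nothing matches at this position.
        best_word, best_prob = text[pos], 0.0
        for n in range(min(MAX_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + n]
            prob = by_len[n].get(candidate, 0.0)
            if prob > best_prob:
                best_word, best_prob = candidate, prob
        words.append(best_word)
        pos += len(best_word)
    return words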

Thanks for the fun question ;) I didn't know of any languages like this, pretty cool.

牵你的手,一向走下去 2024-10-22 06:14:55

I think this is a good idea, as it is.

I suggest that, once you have some experience with it, you add some rules. These can be very specific: for example, depending on the word before, the word after, the surrounding words, or the sequence of words before the current word, to name just the most frequent cases. You can find a set of such rules in the gposttl.sf.net project, a POS-tagging project, in the file data/contextualrulefile.

Rules should be applied AFTER the statistical evaluation is finished; they do some fine-tuning and can improve accuracy remarkably.
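
As a hypothetical illustration (not taken from gposttl), one such rule could re-merge two adjacent tokens whenever the concatenation is itself a dictionary word that is more probable than the pair:

def merge_rule(tokens, dictionary, total):
    # Post-processing pass: merge neighbouring tokens when the merged form is
    # a more probable dictionary word than the two pieces separately.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            merged = tokens[i] + tokens[i + 1]
            p_merged = dictionary.get(merged, 0) / total
            p_pair = (dictionary.get(tokens[i], 0) / total) * \
                     (dictionary.get(tokens[i + 1], 0) / total)
            if p_merged > p_pair:
                out.append(merged)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out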
