Finding dictionary words

Posted 2024-08-02 07:53:29

I have a lot of compound strings that are a combination of two or three English words.

    e.g. "Spicejet" is a combination of the words "spice" and "jet"

I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.

What would be the most efficient way to separate individual English words from such compound strings?

Comments (10)

予囚 2024-08-09 07:53:29

I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.

You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.

I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.

You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
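
For illustration, here is a minimal sketch of such a weighted Trie in Python; the helper names and the toy frequency weights are invented for the example:

class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.weight = 0     # > 0 marks the end of a word; higher = more common

def insert(root, word, weight=1):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.weight = weight

def prefix_words(root, s, start=0):
    # Collect every (word, weight) in the Trie that begins at s[start:].
    node, found = root, []
    for i in range(start, len(s)):
        node = node.children.get(s[i])
        if node is None:
            break
        if node.weight:
            found.append((s[start:i + 1], node.weight))
    return found

root = TrieNode()
for w, freq in [("spice", 50), ("spic", 2), ("jet", 80)]:  # toy weights
    insert(root, w, freq)
print(prefix_words(root, "spicejet"))  # [('spic', 2), ('spice', 50)]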

Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet', finding that both results score low, and falling back to 'Spice' and 'Jet', where the two lookups together would yield a good combined result.

Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.

Sounds like a fun problem, good luck!

贵在坚持 2024-08-09 07:53:29

If the aim is to find "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with the fewest edges between the first and the last vertex.

The pseudo code is something like this:

parseWords(compoundWord)
    # Make the graph
    graph = makeGraph()
    N = compoundWord.length
    for index = 0 to N
        graph.addVertex(index)

    # Add the edges for each word
    for index = 0 to N - 1
        for length = 1 to min(N - index, MAX_WORD_LENGTH)
            potentialWord = compoundWord.substr(index, length)
            if dictionary.isElement(potentialWord)
                graph.addEdge(index, index + length, 1)

    # Now find a list of edges which define the shortest path
    edges = graph.shortestPath(0, N)

    # Change these edges back into words.
    result = makeList()
    for e in edges
        result.add(compoundWord.substr(e.start, e.stop - e.start))
    return result

I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.

You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
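
Since every edge has weight 1 the graph is effectively unweighted, so a plain breadth-first search already finds the fewest-word split. Here is a rough Python sketch of the pseudocode above, assuming the dictionary is simply a set of words:

from collections import deque

def parse_words(compound, dictionary, max_word_len=30):
    # Split `compound` into the fewest dictionary words, or return None.
    n = len(compound)
    prev = {0: None}  # vertex -> predecessor on a shortest path
    queue = deque([0])
    while queue:
        start = queue.popleft()
        for end in range(start + 1, min(n, start + max_word_len) + 1):
            if end not in prev and compound[start:end] in dictionary:
                prev[end] = start
                queue.append(end)
    if n not in prev:
        return None
    words, v = [], n  # walk the predecessors back to recover the words
    while v:
        words.append(compound[prev[v]:v])
        v = prev[v]
    return words[::-1]

print(parse_words("spicejet", {"spice", "spic", "jet"}))  # ['spice', 'jet']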

那伤。 2024-08-09 07:53:29

And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.

Assuming you didn't have the capitals to go on, what would you do with these (ones that come to mind at present; I know there are more)?

PenIsland
KidsExchange
TherapistFinder

The last one is particularly problematic because the troublesome part is two words run together that do not form a compound word; the meaning completely changes when you break it up.

下壹個目標 2024-08-09 07:53:29

So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.

Edit: it looks as if I am going to have to provide some examples. Words I was thinking of include:

accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes

Here is some Python code to make the point. Get yourself a dictionary on disk and have a go:

def opendict(dictionary=r"g:\words\words(3).txt"):
    # Load the word list from disk into a set for O(1) membership tests.
    with open(dictionary, "r") as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    s = opendict()
    for word in sorted(s):
        if len(word) >= 10:
            # Split points leaving at least 4 letters on the left
            # and 5 on the right.
            for i in range(4, len(word) - 4):
                left, right = word[:i], word[i:]
                if (left in s) and (right in s):
                    if right not in ('nesses',):
                        print(word, left, right)

谁的新欢旧爱 2024-08-09 07:53:29

It sounds to me like you want to store your dictionary in a Trie or a DAWG data structure.

A Trie already stores words as compound words. So "spicejet" would be stored as "spice*jet*", where each * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
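
As a rough sketch of that idea in Python, using a plain dict-of-dicts trie with a "*" terminator key (names invented for the example):

END = "*"  # end-of-word terminator

def make_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def split_compound(s, root, start=0):
    # Walk the compound through the trie; at every terminator we hit,
    # try to split the remainder of the string recursively.
    node = root
    for i in range(start, len(s)):
        node = node.get(s[i])
        if node is None:
            return None
        if END in node:  # hit an end-of-word terminator
            if i == len(s) - 1:
                return [s[start:]]
            rest = split_compound(s, root, i + 1)
            if rest is not None:
                return [s[start:i + 1]] + rest
    return None

print(split_compound("spicejet", make_trie(["spice", "spic", "jet"])))
# -> ['spice', 'jet']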

北斗星光 2024-08-09 07:53:29

It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:

'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje', 'picejet'

... 27 substrings.

So, find a function to generate all of those (slide a window of width 2, 3, 4, ... (len(yourstring) - 1) across your string) and then simply check each of those against a set or hash table.
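
A quick sketch of that in Python, with a toy set standing in for the real ~100000-word dictionary:

def substrings(s, min_len=2):
    # Every substring of s with length from min_len up to len(s) - 1.
    return {s[i:i + k]
            for k in range(min_len, len(s))
            for i in range(len(s) - k + 1)}

words = {"spice", "jet", "ice"}  # toy dictionary
hits = sorted(w for w in substrings("spicejet") if w in words)
print(hits)  # ['ice', 'jet', 'spice']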

巷雨优美回忆 2024-08-09 07:53:29

A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).

抠脚大汉 2024-08-09 07:53:29

Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:

# Python sketch; is_word() tests membership in the dictionary set
def splitword(word):
    for i in range(1, len(word)):
        head = word[:i]  # first i characters
        tail = word[i:]  # everything else

        if is_word(head):
            if is_word(tail):
                return [head, tail]  # both halves are words; done
            rest = splitword(tail)
            if rest:  # check whether we successfully split the tail into words
                return [head] + rest

    return []  # No successful split found, and 'word' is not a word.

Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.

Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
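
That modification might look something like the following sketch, where is_word is the same membership test as above and the scoring idea is left as a comment:

def all_splits(word):
    # Yield every way to break `word` into dictionary words.
    if is_word(word):
        yield [word]
    for i in range(1, len(word)):
        head = word[:i]
        if is_word(head):
            for rest in all_splits(word[i:]):
                yield [head] + rest

# e.g. pick the split whose words are most common, given some frequency score:
# best = max(all_splits(compound), key=lambda ws: sum(score(w) for w in ws))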

↙温凉少女 2024-08-09 07:53:29

This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).

We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:

ethylmethylketone

where the morphemes are:

ethyl methyl and ketone

We tackle this through automata and maximum entropy, and the code is available on Sourceforge:

http://www.sf.net/projects/oscar3-chem

but be warned that it will take some work.

We sometimes encounter ambiguity and are still finding a good way of reporting it.

To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.

As another example, the string

weeknight

can be parsed as

wee knight

or

week night

Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).

孤独患者 2024-08-09 07:53:29

I would use the following algorithm.

  1. Start with the sorted list of words to split, and a sorted list of declined words (the dictionary).

  2. Create a result list of objects which should store: the remaining word and the list of matched words.

  3. Fill the result list with the words to split as remaining words.

  4. Walk through the result array and the dictionary concurrently -- always advancing the lesser of the two, in a manner similar to the merge algorithm. In this way you can compare all the possible matching pairs in one pass. (A sketch follows this list.)

  5. Any time you find a match, i.e. a word to split that starts with a dictionary word, replace it in the result list with the matched dictionary word and the remaining part. You have to take into account possible multiples.

  6. Any time the remaining part is empty, you have found a final result.

  7. Any time you don't find a match on the "left side", in other words every time you increment the result pointer because of no match, delete the corresponding result item. This word has no matches and can't be split.

  8. Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop until this is empty -- go to point 4.
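
A loose Python sketch of steps 4-8 follows; for brevity it substitutes a binary search over the sorted dictionary for a strict two-pointer merge, and all names are invented:

import bisect

def prefixes_in(word, dictionary):
    # All dictionary words that are prefixes of `word` (dictionary is sorted).
    found = []
    for length in range(1, len(word) + 1):
        candidate = word[:length]
        i = bisect.bisect_left(dictionary, candidate)
        if i < len(dictionary) and dictionary[i] == candidate:
            found.append(candidate)
    return found

def split_all(words_to_split, dictionary):
    dictionary = sorted(dictionary)
    # Result items: (remaining part, words matched so far) -- steps 2 and 3.
    pending = [(w, []) for w in sorted(words_to_split)]
    finished = []
    while pending:  # step 8: repeat until no partial results remain
        next_round = []
        for remaining, matched in pending:
            for p in prefixes_in(remaining, dictionary):  # steps 4 and 5
                rest = remaining[len(p):]
                if rest:
                    next_round.append((rest, matched + [p]))
                else:
                    finished.append(matched + [p])  # step 6: final result
        pending = sorted(next_round)  # unmatched items simply drop out (step 7)
    return finished

print(split_all(["spicejet"], ["jet", "spic", "spice"]))  # [['spice', 'jet']]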
