查找一个字符串中的所有单词和短语

发布于 2024-11-09 07:30:52 字数 1414 浏览 0 评论 0原文

由于主题领域（在墙上写字），添加了有趣的条件 - 字母不能改变它们的顺序，所以这不是一个关于字谜的问题。

我看到一个很长的字，是用油漆写在墙上的，现在突然我想要通过画出任意字母组合来从这个单词中获得所有可能的单词和短语。可以使用空格随机分隔的单词。
为了扩大可能的结果，让我们假设，不需要用空格来分隔单词。
编辑：显然应该保持字母顺序（感谢 idz 指出这一点）。此外，短语可能毫无意义。以下是一些示例：

Source word: disestablishment 
paint out:   ^ ^^^    ^^^^ ^^
left:         i   tabl    e    -> i table

or paint out:^^^^^^^^^   ^ ^^
left:                 ish e    -> i she  (spacelessness is ok)

视觉示例困难模式/奖励任务：考虑对字母可能进行的轻微更改（D <-> B、C <-> O 等）困难模式（D 改为 B）

请提出解决此问题的变体。

这是我一般的简单方法

很明显，我们需要一本英语词典来查找单词。
我们的目标是获取要在字典中搜索的单词。
我们需要找到所有可能的字母变体，以将它们与字典进行匹配：每个字母可以是其本身（1）或被涂掉（0）。
考虑到“不需要空格来分隔单词”条件，为了区分单词，我们必须假设任意两个字母之间可能有空格（1 - 有空格，0 - 没有空格）。

d i s e s t a b l i s h m e n t
 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^  - possible whitespace

N = 源单词中的字母数量
N-1 =“可能的空格”数量
任何 N + N - 1 元素都可以处于两种状态，因此我们将它们视为布尔值。可能的变化数量为2^(N + N - 1)。是的，它计算了无用的变体，例如在空格之间粘贴空格，但我没有想出更优雅的公式。
现在我们需要一个算法来获取 N+N-1 布尔值序列的所有可能变体（我还没有想好，但是递归这个词在我脑海中闪过）。然后将所有 1 替换为相应的字母（如果布尔索引为奇数）或空格（偶数）
0 带有空格（奇数）或什么都没有（偶数）。然后修剪前导和尾随空格，分隔单词并在字典中搜索它们。

我不喜欢这种可怕的方法，希望你能帮助我找到好的替代方法。

原文

Due to subject area (writing on a wall) interesting condition is added - letters cannot change their order, so this is not a question about anagrams.

I saw a long word, written by paint on a wall, and now suddenly
I want all possible words and phrases I can get from this word by painting out any combination of letters. Wo r ds, randomly separated by whitespace are OK.
To broaden possible results let's make an assumption, that space is not necessary to separate words.
Edit: Obviously letter order should be maintained (thanks idz for pointing that out). Also, phrases may be meaningless. Here are some examples:

Source word: disestablishment 
paint out:   ^ ^^^    ^^^^ ^^
left:         i   tabl    e    -> i table

or paint out:^^^^^^^^^   ^ ^^
left:                 ish e    -> i she  (spacelessness is ok)

Visual example
Hard mode/bonus task: consider possible slight alterations to letters (D <-> B, C <-> O and so on) hard mode (D changed to B)

Please suggest your variants of solving this problem.

Here's my general straightforward approach

It's clear that we'll need an English dictionary to find words.
Our goal is to get words to search for in dictionary.
We need to find all possible letters variations to match them against dictionary: each letter can be itself (1) or painted out (0).
Taking the 'space is not needed to separate words' condition in consideration, to distinguish words we must assume that there might be a space between any two letters (1 - there's a space, 0 - there isn't).

d i s e s t a b l i s h m e n t
 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^  - possible whitespace

N = number of letters in source word
N-1 = number of 'might-be spaces'
Any of the N + N - 1 elements can be in two states, so let's treat them as booleans. The number of possible variations is 2^(N + N - 1). Yes, it counts useless variants like pasting a space between to spaces, but I didn't come up with more elegant formula.
Now we need an algorithm to get all possible variations of N+N-1 sequence of booleans (I haven't thought it out yet, but word recursion flows through my mind). Then substitute all 1s with corresponding letters (if index of boolean is odd) or whitespace (even)
and 0s with whitespace (odd) or nothing (even). Then trim leading and trailing whitespace, separate words and search them in dictionary.

I don't like this monstrous approach and hope you will help me find good alternatives.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

傲影 2024-11-16 07:30:52

1) 将字典放入 trie 或前缀树

2) 对于字符串中的每个位置查找合法的通过 trie 查找单词；存储这些

3) 打印所有不重叠单词的组合

这假设像问题中的示例一样您想要保持字母顺序（即您对字谜不感兴趣）。

回复收藏 0 原文

时光磨忆 2024-11-16 07:30:52

#!/usr/bin/python3

from itertools import *
from pprint import pprint as pp

读字典，删除我们在英语中从未使用过的所有 1 和 2 个字母的单词：

with open('/usr/share/dict/words') as f:
    english = f.read().splitlines()

english = map(str.lower, english)
english = [w for w in english if (len(w)>2 or w in ['i','a','as','at','in','on','im','it','if','is','am','an'])]

def isWord(word):
    return word in english

您的问题：

def splitwords(word):
    """
        splitwords('starts') -> (('st', 'ar', 'ts'), ('st', 'arts'), ('star', 'ts'), ('starts'))
    """
    if word=='':
        yield ()
    for i in range(1,len(word)+1):
        try:
            left,right = word[:i],word[i:]
            if left in english:
                for reading in list(splitwords(right)):
                    yield (left,) + tuple(reading)
            else:
                raise IndexError()
        except IndexError:
            pass

def splitwordsWithDeletions(word):
    masks = product(*[(0,1) for char in word])
    for mask in masks:
        candidate = ''.join(compress(word,mask))
        for reading in splitwords(candidate):
            yield reading

for reading in splitwordsWithDeletions('interesting'):
    print(reading)

结果（需要大约 30 秒）：

()                                                                                                                                                                                                                    
('i',)
('in',)
('tin',)
('ting',)
('sin',)
('sing',)
('sting',)
('eng',)
('rig',)
('ring',)
('rein',)
('resin',)
('rest',)
('rest', 'i')
('rest', 'in')
...
('inters', 'tin')
('inter', 'sting')
('inters', 'ting')
('inter', 'eng')
('interest',)
('interest', 'i')
('interest', 'in')
('interesting',)

可能通过预先计算每个字母上可以读取哪些单词到一个容器中来加快速度每个字母，并迭代那些预先计算的以加快速度。我认为其他人概述了一个解决方案。

#!/usr/bin/python3

from itertools import *
from pprint import pprint as pp

Read in dictionary, remove all 1- and 2-letter words which we never use in the English language:

with open('/usr/share/dict/words') as f:
    english = f.read().splitlines()

english = map(str.lower, english)
english = [w for w in english if (len(w)>2 or w in ['i','a','as','at','in','on','im','it','if','is','am','an'])]

def isWord(word):
    return word in english

Your problem:

def splitwords(word):
    """
        splitwords('starts') -> (('st', 'ar', 'ts'), ('st', 'arts'), ('star', 'ts'), ('starts'))
    """
    if word=='':
        yield ()
    for i in range(1,len(word)+1):
        try:
            left,right = word[:i],word[i:]
            if left in english:
                for reading in list(splitwords(right)):
                    yield (left,) + tuple(reading)
            else:
                raise IndexError()
        except IndexError:
            pass

def splitwordsWithDeletions(word):
    masks = product(*[(0,1) for char in word])
    for mask in masks:
        candidate = ''.join(compress(word,mask))
        for reading in splitwords(candidate):
            yield reading

for reading in splitwordsWithDeletions('interesting'):
    print(reading)

Result (takes about 30 seconds):

()                                                                                                                                                                                                                    
('i',)
('in',)
('tin',)
('ting',)
('sin',)
('sing',)
('sting',)
('eng',)
('rig',)
('ring',)
('rein',)
('resin',)
('rest',)
('rest', 'i')
('rest', 'in')
...
('inters', 'tin')
('inter', 'sting')
('inters', 'ting')
('inter', 'eng')
('interest',)
('interest', 'i')
('interest', 'in')
('interesting',)

Speedup possible perhaps by precalculating which words can be read on each letter, into one bin per letter, and iterating with those pre-calculated to speed things up. I think someone else outlines a solution to that effect.

回复收藏 0 原文

无语# 2024-11-16 07:30:52

您还可以在其他地方找到字谜算法。

subwords(word):
  if word is empty return
  if word is real word:
    print word
  anagrams(word)
  for each letter in word:
    subwords(word minus letter)

编辑：射击，您需要在 for 循环中传递一个起点。否则，您将多余地创建大量调用。 Frank 减去 r 减去 n 与 Frank 减去 n 减去 r 相同。设置一个起点可以确保您获得每个子集一次...除了由于双字母引起的重复。也许只是在打印之前将结果存储到哈希表中？啊啊……

There are other places you can find anagram algorithms.

subwords(word):
  if word is empty return
  if word is real word:
    print word
  anagrams(word)
  for each letter in word:
    subwords(word minus letter)

Edit: shoot, you'll want to pass a starting point in for the for loop. Otherwise, you'll be redundantly creating a LOT of calls. Frank minus r minus n is the same as Frank minus n minus r. Putting a starting point can ensure that you get each subset once... Except for repeats due to double letters. Maybe just memoize the results to a hash table before printing? Argh...

回复收藏 0 原文

~没有更多了~