无法将 Ruby 字母中的英语单词组合起来

发布于 2024-07-19 02:17:48 字数 308 浏览 8 评论 0原文

我需要找到所有可以由字符串中的字母组成的英语单词

 sentence="Ziegler's Giant Bar"

我可以通过

 sentence.split(//)  

如何从 Ruby 中的句子中组成超过 4500 个英语单词?

[编辑]

最好将问题分成几个部分:

  1. 仅制作一个包含 10 个或更少字母的单词数组,
  2. 较长的单词可以单独查找

I need to find all English words which can be formed from the letters in a string

 sentence="Ziegler's Giant Bar"

I can make an array of letters by

 sentence.split(//)  

How can I make more than 4500 English words from the sentence in Ruby?

[edit]

It may be best to split the problem into parts:

  1. to make only an array of words with 10 letters or less
  2. the longer words can be looked up separately

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

扬花落满肩 2024-07-26 02:17:48

[假设您可以重复使用一个单词中的源字母]:对于字典列表中的每个单词,构造两个字母数组 - 一个用于候选单词,一个用于输入字符串。 从单词 array-of-letters 中减去输入的 array-of-letters,如果没有留下任何字母,则表示匹配。 执行此操作的代码如下所示:

def findWordsWithReplacement(sentence)
    out=[]
    splitArray=sentence.downcase.split(//)
    `cat /usr/share/dict/words`.each{|word|
        if (word.strip!.downcase.split(//) - splitArray).empty?
            out.push word
        end
     }
     return out
end

您可以像这样从 irb 调试器调用该函数:

output=findWordsWithReplacement("some input string"); puts output.join(" ")

...或者这里有一个包装器,您可以使用它从脚本交互地调用该函数:

puts "enter the text."
ARGF.each {|line|
    puts "working..."
    out=findWordsWithReplacement(line)
    puts out.join(" ")
    puts "there were #{out.size} words."
}

在 Mac 上运行此函数时,输出如下所示这:

$ ./findwords.rb
输入文字。
齐格勒的巨型酒吧
工作...
啊啊啊
aal aalii Aani Ab aba abaiser
abalienate Abantes Abaris abas abase
abaser Abasgi abasia 阿巴辛 abatable
abate abater abatis abaze abb 阿巴
阿巴斯·阿巴西·阿巴西·阿巴蒂尔·阿贝斯
艾比·阿贝 abear 阿贝尔·阿贝勒·阿贝利亚
阿贝尔 Abelite abelite abeltree
Aberia 异常 异常 abet abettal
冷杉 冷杉 abietate abietene abietin
松柏亚科 Abiezer Abigail
abigeat abilla abintestate
[....]
Z Z
za Zabaean zabeta 扎比安 zabra zabti
扎布蒂·扎格·扎因·赞·扎内拉·赞特·桑特
桑萨利亚赞泽 桑给巴尔扎拉铁石
zareba zat zati zattare 玉米热心
热心 热心 斑马 斑马
Zebrina zebrine zee zein zeist zel
Zelanian Zeltinger Zen Zenaga zenana
泽尔热情泽塔齐亚拉齐亚拉特齐贝林
zibet ziega zieger 之字形
zigzagger Zilla zing zingel Zingiber
姜烯 Zinnia zinsang Zinzar zira
zirai 齐尔巴尼特 齐瑞安 齐瑞安尼安
茭白 Zizia zizz
共有 6725 个单词。

这远远超过 4500 个单词,但这是因为 Mac 单词词典非常大。 如果您想准确地重现 Knuth 的结果,请从此处下载并解压 Knuth 的字典: http ://www.packetstormsecurity.org/Crackers/wordlists/dictionaries/knuth_words.gz 并将“/usr/share/dict/words”替换为解压替代目录的路径。 如果你做对了,你会得到 4514 个单词,最后是这个集合:

zanier zanies zaniness 桑给巴尔 zazen
热情斑马 斑马 蔡司 时代精神 禅宗
禅宗热情 zestier zeta Ziegler Zig
之字形 之字形 之字形 之字形
zingier zings 百日草

我相信这回答了原来的问题。

或者,提问者/读者可能希望列出可以从字符串构造的所有单词,而无需重复使用任何输入字母。 我建议的完成此操作的代码如下:复制候选单词,然后对于输入字符串中的每个字母,从副本中破坏性地删除该字母的第一个实例(使用“切片!”)。 如果此过程吸收了所有字母,请接受该单词。

def findWordsNoReplacement(sentence)
    out=[]
    splitInput=sentence.downcase.split(//)
    `cat /usr/share/dict/words`.each{|word|
        copy=word.strip!.downcase
        splitInput.each {|o| copy.slice!(o) }
        out.push word if copy==""
     }
     return out
end

[Assuming you can reuse the source letters within one word]: For each word in your dictionary list, construct two arrays of letters - one for the candidate word and one for the input string. Subtract the input array-of-letters from the word array-of-letters and if there weren't any letters left over, you've got a match. Code to do that looks like this:

def findWordsWithReplacement(sentence)
    out=[]
    splitArray=sentence.downcase.split(//)
    `cat /usr/share/dict/words`.each{|word|
        if (word.strip!.downcase.split(//) - splitArray).empty?
            out.push word
        end
     }
     return out
end

You can call that function from the irb debugger like so:

output=findWordsWithReplacement("some input string"); puts output.join(" ")

...or here's a wrapper you could use to call the function interactively from a script:

puts "enter the text."
ARGF.each {|line|
    puts "working..."
    out=findWordsWithReplacement(line)
    puts out.join(" ")
    puts "there were #{out.size} words."
}

When running this on a Mac, the output looks like this:

$ ./findwords.rb
enter the text.
Ziegler's Giant Bar
working...
A a aa
aal aalii Aani Ab aba abaiser
abalienate Abantes Abaris abas abase
abaser Abasgi abasia Abassin abatable
abate abater abatis abaze abb Abba
abbas abbasi abbassi abbatial abbess
Abbie Abe abear Abel abele Abelia
Abelian Abelite abelite abeltree
Aberia aberrant aberrate abet abettal
Abie Abies abietate abietene abietin
Abietineae Abiezer Abigail abigail
abigeat abilla abintestate
[....]
Z z
za Zabaean zabeta Zabian zabra zabti
zabtie zag zain Zan zanella zant zante
Zanzalian zanze Zanzibari zar zaratite
zareba zat zati zattare Zea zeal
zealless zeallessness zebra zebrass
Zebrina zebrine zee zein zeist zel
Zelanian Zeltinger Zen Zenaga zenana
zer zest zeta ziara ziarat zibeline
zibet ziega zieger zig zigzag
zigzagger Zilla zing zingel Zingiber
zingiberene Zinnia zinsang Zinzar zira
zirai Zirbanit Zirian Zirianian
Zizania Zizia zizz
there were 6725 words.

That is well over 4500 words, but that's because the Mac word dictionary is pretty large. If you want to reproduce Knuth's results exactly, download and unzip Knuth's dictionary from here: http://www.packetstormsecurity.org/Crackers/wordlists/dictionaries/knuth_words.gz and replace "/usr/share/dict/words" with the path to wherever you've unpacked the substitute directory. If you did it right you'll get 4514 words, ending in this collection:

zanier zanies zaniness Zanzibar zazen
zeal zebra zebras Zeiss zeitgeist Zen
Zennist zest zestier zeta Ziegler zig
zigging zigzag zigzagging zigzags zing
zingier zings zinnia

I believe that answers the original question.

Alternatively, the questioner/reader might have wanted to list all the words one can construct from a string without reusing any of the input letters. My suggested code to accomplish that works as follows: Copy the candidate word, then for each letter in the input string, destructively remove the first instance of that letter from the copy (using "slice!"). If this process absorbs all the letters, accept that word.

def findWordsNoReplacement(sentence)
    out=[]
    splitInput=sentence.downcase.split(//)
    `cat /usr/share/dict/words`.each{|word|
        copy=word.strip!.downcase
        splitInput.each {|o| copy.slice!(o) }
        out.push word if copy==""
     }
     return out
end
雪化雨蝶 2024-07-26 02:17:48

如果你想查找字母和频率受给定短语限制的单词,
您可以构造一个正则表达式来为您执行此操作:

sentence = "Ziegler's Giant Bar"

# count how many times each letter occurs in the 
# sentence (ignoring case, and removing non-letters)
counts = Hash.new(0)
sentence.downcase.gsub(/[^a-z]/,'').split(//).each do |letter|
  counts[letter] += 1
end
letters = counts.keys.join
length = counts.values.inject { |a,b| a + b }

# construct a regex that matches upto that many occurences
# of only those letters, ignoring non-letters
# (in a positive look ahead)
length_regex = /(?=^(?:[^a-z]*[#{letters}]){1,#{length}}[^a-z]*$)/i
# construct regexes that matches each letter up to its
# proper frequency (in a positive look ahead)
count_regexes = counts.map do |letter, count|
  /(?=^(?:[^#{letter}]*#{letter}){0,#{count}}[^#{letter}]*$)/i
end

# combine the regexes, to form a regex that will only
# match words that are made of a subset of the letters in the string
regex = /#{length_regex}#{count_regexes.join('')}/

# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end

words.length #=> 3182
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "Abantes",
          "Abaris", "abas", "abase", "abaser", "Abasgi", "abate", "abater", "abatis",
          ...
          "ba", "baa", "Baal", "baal", "Baalist", "Baalite", "Baalize", "baar", "bae",
          "Baeria", "baetzner", "bag", "baga", "bagani", "bagatine", "bagel", "bagganet",
          ...
          "eager", "eagle", "eaglet", "eagre", "ean", "ear", "earing", "earl", "earlet",
          "earn", "earner", "earnest", "earring", "eartab", "ease", "easel", "easer",
          ...
          "gab", "Gabe", "gabi", "gable", "gablet", "Gabriel", "Gael", "gaen", "gaet",
          "gag", "gagate", "gage", "gageable", "gagee", "gageite", "gager", "Gaia",
          ...
          "Iberian", "Iberis", "iberite", "ibis", "Ibsenite", "ie", "Ierne", "Igara",
          "Igbira", "ignatia", "ignite", "igniter", "Ila", "ilesite", "ilia", "Ilian",
          ...
          "laang", "lab", "Laban", "labia", "labiate", "labis", "labra", "labret", "laet",
          "laeti", "lag", "lagan", "lagen", "lagena", "lager", "laggar", "laggen",
          ...
          "Nabal", "Nabalite", "nabla", "nable", "nabs", "nae", "naegate", "naegates",
          "nael", "nag", "Naga", "naga", "Nagari", "nagger", "naggle", "nagster", "Naias",
          ...
          "Rab", "rab", "rabat", "rabatine", "Rabi", "rabies", "rabinet", "rag", "raga",
          "rage", "rager", "raggee", "ragger", "raggil", "raggle", "raging", "raglan",
          ...
          "sa", "saa", "Saan", "sab", "Saba", "Sabal", "Saban", "sabe", "saber",
          "saberleg", "Sabia", "Sabian", "Sabina", "sabina", "Sabine", "sabine", "Sabir",
          ...
          "tabes", "Tabira", "tabla", "table", "tabler", "tables", "tabling", "Tabriz",
          "tae", "tael", "taen", "taenia", "taenial", "tag", "Tagabilis", "Tagal",
          ...
          "zest", "zeta", "ziara", "ziarat", "zibeline", "zibet", "ziega", "zieger",
          "zig", "zing", "zingel", "Zingiber", "zira", "zirai", "Zirbanit", "Zirian"]

正向先行让您可以创建一个正则表达式来匹配字符串中某些指定模式匹配的位置,而无需消耗字符串中匹配的部分。
我们在这里使用它们来将同一字符串与单个正则表达式中的多个模式进行匹配。
仅当所有模式都匹配时,位置才匹配。

如果我们允许无限重用原始短语中的字母(就像 Knuth 根据 glenra 的评论所做的那样),那么它甚至更容易构建正则表达式:

sentence = "Ziegler's Giant Bar"

# find all the letters in the sentence
letters = sentence.downcase.gsub(/[^a-z]/,'').split(//).uniq

# construct a regex that matches any line in which
# the only letters used are the ones in the sentence
regex = /^([^a-z]|[#{letters.join}])*$/i

# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end

words.length #=> 6725
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "abalienate",
           ...
           "azine", "B", "b", "ba", "baa", "Baal", "baal", "Baalist", "Baalite",
           "Baalize", "baar", "Bab", "baba", "babai", "Babbie", "Babbitt", "babbitt",
           ...
           "Britannian", "britten", "brittle", "brittleness", "brittling", "Briza",
           "brizz", "E", "e", "ea", "eager", "eagerness", "eagle", "eagless", "eaglet",
           "eagre", "ean", "ear", "earing", "earl", "earless", "earlet", "earliness",
           ...
           "eternalize", "eternalness", "eternize", "etesian", "etna", "Etnean", "Etta",
           "Ettarre", "ettle", "ezba", "Ezra", "G", "g", "Ga", "ga", "gab", "gabber",
           "gabble", "gabbler", "Gabe", "gabelle", "gabeller", "gabgab", "gabi", "gable",
           ...
           "grittiness", "grittle", "Grizel", "Grizzel", "grizzle", "grizzler", "grr",
           "I", "i", "iba", "Iban", "Ibanag", "Iberes", "Iberi", "Iberia", "Iberian",
           ...
           "itinerarian", "itinerate", "its", "Itza", "Izar", "izar", "izle", "iztle",
           "L", "l", "la", "laager", "laang", "lab", "Laban", "labara", "labba", "labber",
           ...
           "litter", "litterer", "little", "littleness", "littling", "littress", "litz",
           "Liz", "Lizzie", "Llanberisslate", "N", "n", "na", "naa", "Naassenes", "nab",
           "Nabal", "Nabalite", "Nabataean", "Nabatean", "nabber", "nabla", "nable",
           ...
           "niter", "nitraniline", "nitrate", "nitratine", "Nitrian", "nitrile",
           "nitrite", "nitter", "R", "r", "ra", "Rab", "rab", "rabanna", "rabat",
           "rabatine", "rabatte", "rabbanist", "rabbanite", "rabbet", "rabbeting",
           ...
           "riteless", "ritelessness", "ritling", "rittingerite", "rizzar", "rizzle", "S",
           "s", "sa", "saa", "Saan", "sab", "Saba", "Sabaean", "sabaigrass", "Sabaist",
           ...
           "strigine", "string", "stringene", "stringent", "stringentness", "stringer",
           "stringiness", "stringing", "stringless", "strit", "T", "t", "ta", "taa",
           "Taal", "taar", "Tab", "tab", "tabaret", "tabbarea", "tabber", "tabbinet",
           ...
           "tsessebe", "tsetse", "tsia", "tsine", "tst", "tzaritza", "Tzental", "Z", "z",
           "za", "Zabaean", "zabeta", "Zabian", "zabra", "zabti", "zabtie", "zag", "zain",
           ...
           "Zirian", "Zirianian", "Zizania", "Zizia", "zizz"]

If you want to find words whose letters and frequency thereof are restricted by the given phrase,
you can construct a regex to do this for you:

sentence = "Ziegler's Giant Bar"

# count how many times each letter occurs in the 
# sentence (ignoring case, and removing non-letters)
counts = Hash.new(0)
sentence.downcase.gsub(/[^a-z]/,'').split(//).each do |letter|
  counts[letter] += 1
end
letters = counts.keys.join
length = counts.values.inject { |a,b| a + b }

# construct a regex that matches upto that many occurences
# of only those letters, ignoring non-letters
# (in a positive look ahead)
length_regex = /(?=^(?:[^a-z]*[#{letters}]){1,#{length}}[^a-z]*$)/i
# construct regexes that matches each letter up to its
# proper frequency (in a positive look ahead)
count_regexes = counts.map do |letter, count|
  /(?=^(?:[^#{letter}]*#{letter}){0,#{count}}[^#{letter}]*$)/i
end

# combine the regexes, to form a regex that will only
# match words that are made of a subset of the letters in the string
regex = /#{length_regex}#{count_regexes.join('')}/

# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end

words.length #=> 3182
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "Abantes",
          "Abaris", "abas", "abase", "abaser", "Abasgi", "abate", "abater", "abatis",
          ...
          "ba", "baa", "Baal", "baal", "Baalist", "Baalite", "Baalize", "baar", "bae",
          "Baeria", "baetzner", "bag", "baga", "bagani", "bagatine", "bagel", "bagganet",
          ...
          "eager", "eagle", "eaglet", "eagre", "ean", "ear", "earing", "earl", "earlet",
          "earn", "earner", "earnest", "earring", "eartab", "ease", "easel", "easer",
          ...
          "gab", "Gabe", "gabi", "gable", "gablet", "Gabriel", "Gael", "gaen", "gaet",
          "gag", "gagate", "gage", "gageable", "gagee", "gageite", "gager", "Gaia",
          ...
          "Iberian", "Iberis", "iberite", "ibis", "Ibsenite", "ie", "Ierne", "Igara",
          "Igbira", "ignatia", "ignite", "igniter", "Ila", "ilesite", "ilia", "Ilian",
          ...
          "laang", "lab", "Laban", "labia", "labiate", "labis", "labra", "labret", "laet",
          "laeti", "lag", "lagan", "lagen", "lagena", "lager", "laggar", "laggen",
          ...
          "Nabal", "Nabalite", "nabla", "nable", "nabs", "nae", "naegate", "naegates",
          "nael", "nag", "Naga", "naga", "Nagari", "nagger", "naggle", "nagster", "Naias",
          ...
          "Rab", "rab", "rabat", "rabatine", "Rabi", "rabies", "rabinet", "rag", "raga",
          "rage", "rager", "raggee", "ragger", "raggil", "raggle", "raging", "raglan",
          ...
          "sa", "saa", "Saan", "sab", "Saba", "Sabal", "Saban", "sabe", "saber",
          "saberleg", "Sabia", "Sabian", "Sabina", "sabina", "Sabine", "sabine", "Sabir",
          ...
          "tabes", "Tabira", "tabla", "table", "tabler", "tables", "tabling", "Tabriz",
          "tae", "tael", "taen", "taenia", "taenial", "tag", "Tagabilis", "Tagal",
          ...
          "zest", "zeta", "ziara", "ziarat", "zibeline", "zibet", "ziega", "zieger",
          "zig", "zing", "zingel", "Zingiber", "zira", "zirai", "Zirbanit", "Zirian"]

Positive lookaheads let you make a regex that matches a position in the string where some specified pattern matches without consuming the part of the string that matches.
We use them here to match the same string against multiple patterns in a single regex.
The position only matches if all our patterns match.

If we allow infinite reuse of letters from the original phrase (like Knuth did according to glenra's comment), then it's even easier to construct a regex:

sentence = "Ziegler's Giant Bar"

# find all the letters in the sentence
letters = sentence.downcase.gsub(/[^a-z]/,'').split(//).uniq

# construct a regex that matches any line in which
# the only letters used are the ones in the sentence
regex = /^([^a-z]|[#{letters.join}])*$/i

# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end

words.length #=> 6725
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "abalienate",
           ...
           "azine", "B", "b", "ba", "baa", "Baal", "baal", "Baalist", "Baalite",
           "Baalize", "baar", "Bab", "baba", "babai", "Babbie", "Babbitt", "babbitt",
           ...
           "Britannian", "britten", "brittle", "brittleness", "brittling", "Briza",
           "brizz", "E", "e", "ea", "eager", "eagerness", "eagle", "eagless", "eaglet",
           "eagre", "ean", "ear", "earing", "earl", "earless", "earlet", "earliness",
           ...
           "eternalize", "eternalness", "eternize", "etesian", "etna", "Etnean", "Etta",
           "Ettarre", "ettle", "ezba", "Ezra", "G", "g", "Ga", "ga", "gab", "gabber",
           "gabble", "gabbler", "Gabe", "gabelle", "gabeller", "gabgab", "gabi", "gable",
           ...
           "grittiness", "grittle", "Grizel", "Grizzel", "grizzle", "grizzler", "grr",
           "I", "i", "iba", "Iban", "Ibanag", "Iberes", "Iberi", "Iberia", "Iberian",
           ...
           "itinerarian", "itinerate", "its", "Itza", "Izar", "izar", "izle", "iztle",
           "L", "l", "la", "laager", "laang", "lab", "Laban", "labara", "labba", "labber",
           ...
           "litter", "litterer", "little", "littleness", "littling", "littress", "litz",
           "Liz", "Lizzie", "Llanberisslate", "N", "n", "na", "naa", "Naassenes", "nab",
           "Nabal", "Nabalite", "Nabataean", "Nabatean", "nabber", "nabla", "nable",
           ...
           "niter", "nitraniline", "nitrate", "nitratine", "Nitrian", "nitrile",
           "nitrite", "nitter", "R", "r", "ra", "Rab", "rab", "rabanna", "rabat",
           "rabatine", "rabatte", "rabbanist", "rabbanite", "rabbet", "rabbeting",
           ...
           "riteless", "ritelessness", "ritling", "rittingerite", "rizzar", "rizzle", "S",
           "s", "sa", "saa", "Saan", "sab", "Saba", "Sabaean", "sabaigrass", "Sabaist",
           ...
           "strigine", "string", "stringene", "stringent", "stringentness", "stringer",
           "stringiness", "stringing", "stringless", "strit", "T", "t", "ta", "taa",
           "Taal", "taar", "Tab", "tab", "tabaret", "tabbarea", "tabber", "tabbinet",
           ...
           "tsessebe", "tsetse", "tsia", "tsine", "tst", "tzaritza", "Tzental", "Z", "z",
           "za", "Zabaean", "zabeta", "Zabian", "zabra", "zabti", "zabtie", "zag", "zain",
           ...
           "Zirian", "Zirianian", "Zizania", "Zizia", "zizz"]
孤者何惧 2024-07-26 02:17:48

我认为 Ruby 没有英语词典。 但是您可以尝试将原始字符串的所有排列存储在一个数组中,然后通过 Google 检查这些字符串? 说一个词实际上就是一个词,如果点击量超过10万次什么的?

I don't think that Ruby has an English dictionary. But you could try to store all permutations of the original string in an array, and check those strings against Google? Say that a word is actually a word, if has more than 100.000 hits or something?

三生路 2024-07-26 02:17:48

你可以得到一个字母数组,如下所示:

sentence = "Ziegler's Giant Bar"
letters = sentence.split(//)

You can get an array of letters like so:

sentence = "Ziegler's Giant Bar"
letters = sentence.split(//)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文