How can I make this random text generator in Python more efficient?

Posted on 2024-09-16 07:32:09

I'm working on a random text generator (without using Markov chains), and it currently works without too many problems. First, here is the flow of my code:

1. Enter a sentence as input. This is called the trigger string and is assigned to a variable.
2. Get the longest word in the trigger string.
3. Search the entire Project Gutenberg database for sentences that contain this word, regardless of case.
4. Return the longest sentence that contains the word from step 3.
5. Append the sentences from step 1 and step 4 together.
6. Assign the sentence from step 4 as the new trigger sentence and repeat the process. Note that I then have to get the longest word in the second sentence, and so on.

And here is my code:

import nltk
from nltk.corpus import gutenberg
from random import choice

triggerSentence = input("Please enter the trigger sentence: ")  # get the input string
longestLength = 0
longestString = ""
listOfSents = gutenberg.sents()  # all Gutenberg sentences, as a list of token lists
listOfWords = gutenberg.words()  # all words in the Gutenberg books, as a flat list

while triggerSentence:
    # this runs on every pass through the loop
    split_str = triggerSentence.split()  # split the sentence into words

    # find the longest word in the trigger sentence
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)

    # collect the sentences that contain the longest word and are longer
    # than 40 characters, then pick one of them at random
    sets = []
    for sentence in listOfSents:
        if sentence.count(longestString):
            sents = " ".join(sentence)
            if len(sents) > 40:
                sets.append(sents)

    triggerSentence = choice(sets)
    print(triggerSentence)

My concern is that the loop usually reaches a point where the same sentence is printed over and over again, since it is the longest sentence that contains the longest word. To avoid getting the same sentence repeatedly, I thought of the following:

* If the longest word in the current sentence is the same as it was in the last sentence, simply delete this longest word from the current sentence and look for the next longest word.

I tried some implementations of this but failed to apply the solution above, since it involves lists and lists of lists (because of the words and sentences from the gutenberg module). Any suggestions on how to find the second-longest word? I seem to be unable to do this by parsing a simple string input, since the .sents() and .words() functions of NLTK's Gutenberg module yield a list of lists and a list, respectively. Thanks in advance.
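
As a rough sketch of the fallback idea above: the trigger sentence is already a plain string (each Gutenberg sentence is joined with spaces before it is reused), so its words can be ranked by length once and the first one that differs from the previous round's word can be taken. The helper names `words_by_length` and `pick_word` are hypothetical, introduced only for illustration, and this has not been tried against the full corpus.

```python
def words_by_length(sentence):
    """Return the distinct words of a sentence, longest first."""
    unique = []
    for word in sentence.split():
        if word not in unique:
            unique.append(word)
    # sorted() is stable, so words of equal length keep their original order
    return sorted(unique, key=len, reverse=True)


def pick_word(sentence, previous_word=None):
    """Pick the longest word, skipping the one used in the previous round."""
    for word in words_by_length(sentence):
        if word != previous_word:
            return word
    return None  # every word was ruled out


ranked = words_by_length("the quick brown fox jumps")
print(ranked)
print(pick_word("the quick brown fox jumps", previous_word="quick"))
```

Ranking the words once also gives the third-longest, fourth-longest, and so on, which may help if more than one word has to be skipped.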

回梦 2024-09-23 07:32:09

Some suggested improvements:

The while loop will run forever, so you should probably remove it.
Use max with key=len to find the longest word in one call instead of tracking it in a manual loop.
Build the list of sentences that contain longestWord and are longer than 40 characters with a list comprehension. This should also be moved out of the while loop, as it only needs to happen once.
sents = [" ".join(sent) for sent in listOfSents if longestWord in sent and len(" ".join(sent)) > 40]
If you want to print every sentence that was found in random order, you could shuffle the list you just created. Note that random.shuffle works in place and returns None, so shuffle first and then loop over the list:
random.shuffle(sents)
for sent in sents: print(sent)
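
One detail worth checking for the last suggestion: `random.shuffle` reorders the list in place and returns `None`, so its return value cannot be looped over directly. A small sketch with a made-up list (the strings are placeholders, not Gutenberg text):

```python
import random

sents = ["first sentence", "second sentence", "third sentence"]

result = random.shuffle(sents)  # shuffles sents in place
print(result)                   # None: there is nothing to iterate over

for sent in sents:              # iterate over the shuffled list itself
    print(sent)

# random.sample returns a new shuffled list and leaves the original untouched
shuffled_copy = random.sample(sents, len(sents))
print(shuffled_copy)
```

`random.sample` may be the better fit when the original order is still needed afterwards.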

This is how the code could look with these changes:

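import nltk
from nltk.corpus import gutenberg
from random import shuffle

listOfSents = gutenberg.sents()
triggerSentence = input("Please enter the trigger sentence: ")

# longest whitespace-separated word in the trigger sentence
longestWord = max(triggerSentence.split(), key=len)

# join each matching token list and keep the results longer than 40 characters
longSents = [" ".join(sent) for sent in listOfSents
             if longestWord in sent
             and len(" ".join(sent)) > 40]

# shuffle() works in place and returns None, so shuffle first, then iterate
shuffle(longSents)
for sent in longSents:
    print(sent)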
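
If the search has to be repeated for many trigger words, scanning every sentence each time is likely to be the main cost. One possible alternative, sketched here under assumptions, is to build a lookup table from lowercased word to sentence positions once and reuse it, which also gives case-insensitive matching. The tiny `corpus` below is a made-up stand-in for `gutenberg.sents()`, and `build_index` and `sentences_with` are hypothetical helper names.

```python
from collections import defaultdict

# Stand-in for gutenberg.sents(): each sentence is a list of tokens
corpus = [
    ["The", "whale", "surfaced", "near", "the", "ship", "."],
    ["A", "Whale", "is", "a", "very", "large", "animal", "."],
    ["The", "ship", "sailed", "at", "dawn", "."],
]


def build_index(sentences):
    """Map each lowercased token to the positions of sentences containing it."""
    index = defaultdict(set)
    for position, tokens in enumerate(sentences):
        for token in tokens:
            index[token.lower()].add(position)
    return index


def sentences_with(word, sentences, index, min_chars=0):
    """Return the joined sentences that contain word, ignoring case."""
    positions = sorted(index.get(word.lower(), ()))
    joined = (" ".join(sentences[i]) for i in positions)
    return [text for text in joined if len(text) > min_chars]


index = build_index(corpus)
print(sentences_with("whale", corpus, index))
print(sentences_with("ship", corpus, index, min_chars=30))
```

Building the index takes one full pass over the corpus and extra memory, so it probably only pays off when several lookups follow.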
