Project Gutenberg Python problem?

Posted on 2024-09-15 11:28:22


I am trying to process various texts with regular expressions and Python's NLTK, which is described at http://www.nltk.org/book. I am trying to create a random text generator, and I am having a hard time with one problem. First, here is my algorithm (a rough code sketch follows the list):

  1. Enter a sentence as input; this is called the trigger string.

  2. Get the longest word in the trigger string.

  3. Search the entire Project Gutenberg database for sentences that contain this word, regardless of case.

  4. Return the longest sentence that contains the word found in step 2.

  5. Append the sentences from step 1 and step 4 together.

  6. Repeat the process, taking the longest word of the newly returned sentence each time, and so on.
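
In code form, the loop I have in mind looks roughly like this. It is only a sketch: the names trigger_sentence, sentence_db, rounds and the helper find_longest_matching_sentence are placeholders I made up, and the helper stands for steps 3-4, which is the part I cannot get to work case-insensitively.

def find_longest_matching_sentence(word, sentence_db):
    # Placeholder for steps 3-4: return the longest sentence (a token list)
    # from sentence_db that contains `word`, ignoring case.
    raise NotImplementedError

def generate_text(trigger_sentence, sentence_db, rounds=5):
    # Sketch of the six steps above.
    current = trigger_sentence.split()      # step 1: the trigger string as tokens
    text = list(current)
    for _ in range(rounds):                 # step 6: repeat the process
        word = max(current, key=len)        # step 2: longest word
        current = find_longest_matching_sentence(word, sentence_db)  # steps 3-4
        text.extend(current)                # step 5: append the sentences
    return ' '.join(text)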

So far I have been able to do this for the first two sentences, but I cannot perform a case-insensitive search. The entire sentence database of Project Gutenberg is available through the gutenberg.sents() function, but a case-insensitive regex search seems practically impossible, because gutenberg.sents() outputs the sentences of a book as a list of lists:

EXAMPLE: all the sentences of Shakespeare's Macbeth are retrieved by typing

import nltk

from nltk.corpus import gutenberg 

gutenberg.sents('shakespeare-macbeth.txt') 

into the Python shell command line, and the output is:

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], 
['Actus', 'Primus', '.'], .......] 

with [The Tragedie of Macbeth by William Shakespeare 1603] and Actus Primus. being the first two sentences.

How can I find the word I'm looking for regardless of whether it is uppercase or lowercase? I'm desperately in need of help, since I have been tinkering with this for the past two days and it's starting to wear on my nerves. Thanks a lot.

Comments (2)

阳光下的泡沫是彩色的 2024-09-22 11:28:22


Given a list L of words, and a target word t,

any(t.lower()==w.lower() for w in L)

tells you whether L has word t in a case-insensitive way. It's faster, of course, to do

lt = t.lower()
any(lt==w.lower() for w in L)

since Python does not "hoist" the constant computation out of the loop and, unless you hoist it yourself, it will be performed repeatedly.

Given a list of lists lol, the longest sub-list including t can be found by

longest = max((L for L in lol if any(lt==w.lower() for w in L)), key=len)

If multiple sub-lists include t and are of the same maximal length, this will give you the first one, as it happens.
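
Putting those pieces together with the corpus from the question, a minimal sketch might look like this (the trigger sentence, the helper name, and the lack of a fallback when nothing matches are illustrative assumptions, not part of the answer above):

from nltk.corpus import gutenberg   # the corpus used in the question

def longest_sentence_with(t, sentences):
    # Longest sentence (token list) containing word t, compared case-insensitively.
    lt = t.lower()                   # hoist the lowercasing out of the loops
    matching = (L for L in sentences if any(lt == w.lower() for w in L))
    return max(matching, key=len)    # raises ValueError if no sentence matches

trigger = "Blood will have blood".split()
word = max(trigger, key=len)                          # longest word of the trigger
sents = gutenberg.sents('shakespeare-macbeth.txt')
print(' '.join(longest_sentence_with(word, sents)))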

缱倦旧时光 2024-09-22 11:28:22


How about using the built-in string method str.lower()?
It returns a copy of the string converted to lowercase.

Then just compare the strings.
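
For instance, a minimal comparison along those lines (the names here are just illustrative):

target = "macbeth"
word = "MACBETH"

# lowercase both sides, then compare
if word.lower() == target.lower():
    print("same word, ignoring case")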
