Project Gutenberg Python problem?
I am trying to process various texts with regular expressions and Python's NLTK (documented at http://www.nltk.org/book). I am trying to create a random text generator and am stuck on one problem. First, here is my algorithm:
1. Enter a sentence as input (this is called the trigger string).
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of uppercase or lowercase.
4. Return the longest sentence that contains the word from step 3.
5. Append the sentences from step 1 and step 4 together.
6. Repeat the process. Note that I have to take the longest word in the second sentence and continue from there, and so on.
So far I have been able to do this for the first two sentences, but I cannot perform a case-insensitive search. The entire sentence database of Project Gutenberg is available via the gutenberg.sents() function, but a case-insensitive regex search is practically impossible, since gutenberg.sents() outputs the sentences of each book as follows, in a list-of-lists format:
EXAMPLE: all the sentences of Shakespeare's Macbeth are retrieved by typing
import nltk
from nltk.corpus import gutenberg
gutenberg.sents('shakespeare-macbeth.txt')
into the Python shell, and the output is:
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
['Actus', 'Primus', '.'], .......]
with [The Tragedie of Macbeth by William Shakespeare 1603] and Actus Primus. being the first two sentences.
How can I find the word I'm looking for regardless of whether it is uppercase or lowercase? I desperately need help: I have been tinkering with this for the past two days and it's starting to wear on my nerves. Thanks a lot.
Comments (2)
Given a list L of words and a target word t, comparing t.lower() with the lower-cased form of each word in L tells you whether L has word t in a case-insensitive way. It's faster, of course, to compute the lower-cased t once before the loop, since Python does not "hoist" the constant computation out of the loop and, unless you hoist it yourself, it will be performed repeatedly.

Given a list of lists lol, the longest sub-list including t can be found by keeping only the sub-lists that include t and taking the longest of them. If multiple sub-lists include t and are of the same maximal length, this will give you the first one, as it happens.
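The two lookups described in the first comment might be sketched as follows. The helper names `contains_word` and `longest_sentence_with` are assumptions made for illustration and are not the commenter's own code; the sample data is copied from the output shown in the question.

```python
def contains_word(words, target):
    """Return True if `target` occurs in `words`, ignoring case."""
    target_lower = target.lower()  # hoisted: computed once, not once per word
    return any(target_lower == w.lower() for w in words)


def longest_sentence_with(sentences, target):
    """Return the first longest token list containing `target`, or None."""
    matches = [s for s in sentences if contains_word(s, target)]
    # max() keeps the first maximal element when several tie on length
    return max(matches, key=len) if matches else None


# Token lists in the same shape that gutenberg.sents() returns
sample = [
    ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
    ['Actus', 'Primus', '.'],
]
print(contains_word(sample[0], 'macbeth'))        # True
print(longest_sentence_with(sample, 'PRIMUS'))    # ['Actus', 'Primus', '.']
```

Because each sentence is already a list of tokens, a plain equality test on lower-cased tokens is enough here and no regular expression is needed.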
How about using the built-in function: str.lower()
Return a copy of the string converted to lowercase.
Then just compare the strings.
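Putting the str.lower() comparison together with the steps in the question, a rough end-to-end sketch could look like the one below. Several things here are assumptions rather than anything stated in the thread: the function names `longest_word` and `generate`, splitting the trigger string on whitespace, ignoring non-alphabetic tokens, and skipping sentences that were already emitted so the chain does not repeat itself. A tiny in-memory corpus stands in for gutenberg.sents().

```python
def longest_word(tokens):
    """Longest purely alphabetic token; the first one wins on ties."""
    words = [t for t in tokens if t.isalpha()]
    return max(words, key=len) if words else None


def generate(trigger, corpus_sentences, rounds=3):
    """Chain sentences: each round appends the longest unused sentence that
    contains the longest word of the previous one (case-insensitive)."""
    output = [trigger]
    current = trigger.split()  # assumption: whitespace tokenisation of the trigger
    used = set()               # assumption: never emit the same sentence twice
    for _ in range(rounds):
        word = longest_word(current)
        if word is None:
            break
        word = word.lower()    # lower-case once, outside the inner loop
        candidates = [
            (i, s) for i, s in enumerate(corpus_sentences)
            if i not in used and any(word == tok.lower() for tok in s)
        ]
        if not candidates:
            break
        index, current = max(candidates, key=lambda pair: len(pair[1]))
        used.add(index)
        output.append(' '.join(current))
    return ' '.join(output)


corpus = [
    ['Actus', 'Primus', '.'],
    ['The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare'],
    ['MACBETH', 'shall', 'sleep', 'no', 'more', '.'],
]
print(generate('I saw macbeth', corpus, rounds=2))
# I saw macbeth The Tragedie of Macbeth by William Shakespeare
```

With NLTK and its Gutenberg corpus installed, passing gutenberg.sents() as `corpus_sentences` should work the same way, although scanning the whole corpus on every round will be slow.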