I am trying to set ranges in a text file so that search results can be associated with a specific chapter
I know that there are more practical approaches to this problem (a database such as MySQL or Oracle), and I have a MySQL database file of the KJV Bible that I can search via PHP code. However, I want to open the Bible.txt file in Python, search for certain strings, and return the line and line number. Additionally (and this is the challenge for me), I want to return the book in which the line was found, working from the flat file.

I've been reading and trying to become more familiar with Python. Unfortunately, I still lack the knowledge and skills to solve problems effectively and efficiently. Here is what I came up with: I thought that if I used `range` to set the beginning and end of each book/chapter (as line numbers), I could hard-code a name for each one (e.g. `range(38, 4805)`: all lines in this range are Genesis). This seems to work; I only tried a few books. But the code is very verbose (all those `elif` statements). Does anyone know a more efficient approach? Below is a sample of the code I wrote to try a few books; the KJV.txt file can be obtained from Project Gutenberg.

    # Python 2 (raw_input, print statement)
    import os
    import sys
    import re

    word_search = raw_input(r'Enter a word to search: ')
    book = open("KJV.txt", "r")
    regex = re.compile(word_search)
    bibook = ''

    for i, line in enumerate(book.readlines()):
        result = regex.search(line)
        ln = i
        if result:
            # Map the line number of the match to a hard-coded book range
            if ln in range(36, 4809):
                bibook = 'Genesis'
            elif ln in range(4812, 8859):
                bibook = 'Exodus'
            elif ln in range(8867, 11741):
                bibook = 'Leviticus'
            elif ln in range(11749, 15713):
                bibook = 'Numbers'

            template = "\nLine: {0}\nString: {1}\nBook: {2}\n"
            output = template.format(ln, result.group(), bibook)
            print output

This is a very solid start. I've got some suggestions, though.

First, your use of `readlines` is a bit inefficient. `readlines` creates a new list of lines from the file, so it stores the whole file in memory. You don't have to do that; if all you want to do is iterate over the lines in a file, you can just say `for line in file`, or in your case `for i, line in enumerate(book)`.

Alternatively, if you really do want to store the file in memory, perhaps for repeated searching, save the result of `readlines` to a variable (for example `lines = book.readlines()`). You can also store the text as a single string with `read`, though that's not so helpful in this case, since you'd still have to split it into lines.

Second, rather than using `i` as the index variable and then saving it separately to `ln`, just use a meaningful variable name up front. `ln` is fine, `line_number` is clearer but verbose, and `lineno` is a nice compromise. Let's stick with `ln` here since we all know what it means.

Third, as utdemir pointed out in the comments, you don't really need regex for this. It might make sense if you want your user to be able to enter more sophisticated searches, but REs are complicated enough that they make a questionable default UI. I would just use `in` for simple substring matching, as in `if word_search in line:`.

The remaining `if` statements are fine, and in some cases this is the best thing to do. However, in situations that would call for (say) `case` statements, it's often better to use a dictionary. Of course, here you've got ranges, so we have to be a bit more clever.

Let's start with a dictionary, `first_lines`, that maps the line number on which each book starts to the name of that book. As is probably obvious, this should precede the loop so we aren't redefining the dictionary every time through.
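A minimal sketch of such a dictionary, assuming the start lines are the lower bounds of the ranges in the question's code (only the four books the question covers):

```python
# Maps the line number where each book starts to the book's name.
# The numbers are the range starts taken from the question's code.
first_lines = {
    36: "Genesis",
    4812: "Exodus",
    8867: "Leviticus",
    11749: "Numbers",
}

# An exact start line can be looked up directly...
print(first_lines[4812])
# ...but an arbitrary line number inside a book is not a key.
print(first_lines.get(5000))
```

The second lookup returns `None`, which is why a direct dictionary lookup is not enough on its own.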
Now we have to map `ln` to one of these dictionary values. But the odds are good that `ln` isn't equal to any of the dictionary's keys, so we can't plug it directly into the dictionary. We could use a `for` loop to iterate over the dictionary keys in sorted order (`for key in sorted(first_lines)`), store the previous key in `prev_key`, test whether `key > ln`, and if so, return `prev_key`. But there's actually a much nicer way to do it in Python. Instead of writing a normal loop, we filter the keys, using either the built-in function `filter` or a list comprehension to remove the values that are larger than `ln`. Then we find the `max` of what is left: `first_line = max(filter(lambda x: x < ln, first_lines))`.

Here `first_lines` acts like an unordered list of its keys; in general, you can iterate over the keys in a dictionary just as you would a list, with the caveat that the keys can take any order. `lambda` is a way to define a short function: this function takes `x` as an argument and returns the result of `x < ln`. We have to do it this way because `filter` wants a function as its first argument. It returns a list containing all the values from `first_lines` that give a `True` result (in Python 3 it returns a lazy iterator instead, which `max` accepts just as well).

Since this can be a bit hard to read, especially when `lambda` is involved, we're probably better off using a list comprehension here: `max([x for x in first_lines if x < ln])`. List comprehensions are quite readable and intuitive for most people.

We can even leave out the brackets in this case, since we're passing it directly to a function: `max(x for x in first_lines if x < ln)`. Python interprets this as something called a "generator expression", which is akin to a list comprehension but calculates the values on the fly, instead of storing them in a list up front.
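The three spellings can be compared side by side. This is a sketch written for Python 3; the value of `ln` is a made-up example, and the comparison uses `<=` rather than the strict `<` in the text, on the assumption that a match on a book's very first line should be attributed to that book:

```python
first_lines = {36: "Genesis", 4812: "Exodus", 8867: "Leviticus", 11749: "Numbers"}
ln = 5000  # hypothetical line number of a search hit

# 1. filter with a lambda: keep the start lines that are not after ln
with_filter = max(filter(lambda x: x <= ln, first_lines))

# 2. list comprehension: same test, builds the whole list first
with_listcomp = max([x for x in first_lines if x <= ln])

# 3. generator expression: same test, values produced on the fly
with_genexp = max(x for x in first_lines if x <= ln)

print(with_filter, with_listcomp, with_genexp)  # 4812 4812 4812
print(first_lines[with_genexp])                 # Exodus
```

All three raise `ValueError` when `ln` comes before the first book, because `max` is then given an empty sequence; in Python 3.4 and later, `max(..., default=None)` avoids that.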
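Now to get the name of the book, all you have to do is use `first_line` as a key: `bibook = first_lines[first_line]`. Putting these pieces together gives the final version: iterate over the file directly, match with `in`, and look up the book through `first_lines`.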
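One way the pieces might fit together, as a sketch rather than the answer's exact code. It is written for Python 3, the names `FIRST_LINES` and `search` are made up for illustration, and the search is wrapped in a function so it can be tried on any iterable of lines:

```python
# Start line of each book, taken from the ranges in the question.
FIRST_LINES = {36: "Genesis", 4812: "Exodus", 8867: "Leviticus", 11749: "Numbers"}


def search(lines, word, first_lines=FIRST_LINES):
    """Yield (line number, line text, book name) for each line containing word."""
    for ln, line in enumerate(lines):
        if word in line:
            # Largest start line that is not after ln.
            # default=None covers hits in the preamble before the first book.
            first_line = max((x for x in first_lines if x <= ln), default=None)
            book = first_lines.get(first_line, "")
            yield ln, line.rstrip("\n"), book


# Small in-memory demo with a made-up start line instead of the real file.
demo_lines = ["preamble\n", "Genesis heading\n", "let there be light\n"]
for ln, text, book in search(demo_lines, "light", {1: "Genesis"}):
    print("\nLine: {0}\nString: {1}\nBook: {2}\n".format(ln, text, book))
```

To run it against the real text, pass an open file object instead of the list, for example `with open("KJV.txt") as f: results = list(search(f, word))`. Unlike the question's code, this prints the whole matching line rather than only the matched string.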
This is just a slightly changed version of your code.

Edit: Now that I've read your example file, I think separating out the leading "1:2" part of each line (the chapter:verse reference) and using it to work out the book and line number would be a better option.
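Splitting off that leading reference could look roughly like this. It is a sketch that assumes each verse line begins with `chapter:verse` followed by whitespace, as in the "1:2" example; the name `parse_verse` is hypothetical, and the real file may need a looser pattern:

```python
import re

# Assumed layout: "<chapter>:<verse> <text>", e.g. "1:2 And the earth was ..."
VERSE_REF = re.compile(r"^(\d+):(\d+)\s+(.*)")


def parse_verse(line):
    """Return (chapter, verse, text) or None if the line has no leading reference."""
    m = VERSE_REF.match(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)


print(parse_verse("1:2 And the earth was without form, and void;"))
print(parse_verse("and darkness was upon the face of the deep."))
```

A line that returns `None` is either a continuation of the previous verse or a heading. If every book in the file restarts at 1:1 (an assumption worth checking against the text), seeing that reference again could be used as the signal that a new book has begun.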
is better written as:
An easy way to avoid the `elif`s is a loop. It's also much more efficient to test whether a number is in a range with `start <= ln < stop` instead of using `range`: in Python 2, `range` returns a list, and Python has to compare `ln` against each element.
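A loop over a table of ranges might look like the following sketch. The bounds are copied from the question's code, and the names `BOOKS` and `book_for_line` are made up for illustration:

```python
# (start, stop, name) for each book; stop is exclusive, as with range().
BOOKS = [
    (36, 4809, "Genesis"),
    (4812, 8859, "Exodus"),
    (8867, 11741, "Leviticus"),
    (11749, 15713, "Numbers"),
]


def book_for_line(ln, books=BOOKS):
    """Return the name of the book containing line ln, or '' if none does."""
    for start, stop, name in books:
        # Two integer comparisons instead of building and scanning a list.
        if start <= ln < stop:
            return name
    return ""


print(book_for_line(5000))  # Exodus
print(book_for_line(4810))  # falls in the gap between Genesis and Exodus
```

Adding another book then means adding one tuple to the table rather than another `elif` branch.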
You could try tracking the current book as you go. The books appear one after the other, so you only need to record which book you're currently looking at. Also, your approach of checking whether the line number is in a `range` is very expensive, since for each matching line you construct each range and then do a linear scan to see whether the line number appears in it.

One of the comments pointed out that you may be able to take a more data-driven approach, rather than hard-coding the book positions. Does each of the books start with a line or lines in a recognisable format? If so, you could try checking for this and recording which book you're currently looking at.
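Both ideas can be sketched in a few lines. The first function tracks the current book from a list of hard-coded start lines taken from the question; the second is the data-driven variant and assumes a caller-supplied mapping from heading lines to book names. The function names and the headings in the demo are hypothetical, not taken from the real file:

```python
BOOK_STARTS = [(36, "Genesis"), (4812, "Exodus"), (8867, "Leviticus"), (11749, "Numbers")]


def search_tracking_book(lines, word, book_starts=BOOK_STARTS):
    """Single pass: remember the current book and advance it at each start line."""
    current_book = ""
    next_index = 0  # index of the next book we have not reached yet
    for ln, line in enumerate(lines):
        while next_index < len(book_starts) and ln >= book_starts[next_index][0]:
            current_book = book_starts[next_index][1]
            next_index += 1
        if word in line:
            yield ln, line.rstrip("\n"), current_book


def search_by_headings(lines, word, headings):
    """Data-driven variant: switch books whenever a known heading line appears."""
    current_book = ""
    for ln, line in enumerate(lines):
        text = line.strip()
        if text in headings:
            current_book = headings[text]
        elif word in line:
            yield ln, text, current_book


# Demo on made-up lines; real headings would have to be read off KJV.txt.
demo = ["Book One\n", "a light\n", "Book Two\n", "more light\n"]
print(list(search_by_headings(demo, "light", {"Book One": "One", "Book Two": "Two"})))
```

Neither version builds a range or scans a list per match; each line costs at most a comparison or a dictionary lookup.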