I'm trying to set ranges in a text file in order to associate search results with specific chapters

Posted 2024-11-19 08:18:28


I know that there are more practical approaches to this problem (a database: MySQL, Oracle, etc.), and I have a MySQL db file (KJV Bible) that I can search via PHP code. However, I want to open the Bible.txt file in Python, search for certain strings, and return the line and line number. Additionally (the challenge for me), I also want to return the book in which the line was found (from a flat file). I've been reading and trying to become more familiar with Python. Unfortunately, I still lack the knowledge and skill set to solve problems effectively and efficiently. Here is what I came up with: I thought that if I used range to set the beginning and end of a book (as line numbers), I could hard-code a name for each book (e.g. range(38, 4805): every line in this range is Genesis). This seems to work; I only tried a few books. But the code is very verbose (elif statements). Does anyone know a more efficient approach? Below is a sample of the code I wrote to try a few books; the KJV.txt file may be obtained from Project Gutenberg.

 import os
 import sys
 import re

 word_search = raw_input(r'Enter a word to search: ')
 book = open("KJV.txt", "r")
 regex = re.compile(word_search)
 bibook = ''

 for i, line in enumerate(book.readlines()):
     result = regex.search(line)
     ln = i
     if result:
         if ln in range(36, 4809):
            bibook = 'Genesis'
         elif ln in range(4812, 8859):
            bibook = 'Exodus'
         elif ln in range(8867, 11741):
            bibook =  'Leviticus'
         elif ln in range(11749, 15713):
            bibook = 'Numbers'

         template = "\nLine: {0}\nString: {1}\nBook: {2}\n"
         output = template.format(ln, result.group(), bibook)
         print output


温暖的光 2024-11-26 08:18:28


This is a very solid start. I've got some suggestions, though.

First, your use of readlines is a bit inefficient. readlines creates a new list of lines from the file -- it stores the whole file in memory. But you don't have to do that; if all you want to do is iterate over the lines in a file, you can just say for line in file, or in your case:

for i, line in enumerate(book):

Alternatively, if you really do want to store the file in memory, perhaps for repeated searching, save the result of readlines to a variable:

booklines = book.readlines()
for i, line in enumerate(booklines):

You can also store the text as a single string with read, though that's not so helpful in this case, since you'd still have to split it:

booktxt = book.read()
booklines = booktxt.splitlines()
for i, line in enumerate(booklines):

Second, I would say rather than using i as the index variable and then saving it separately to ln, just use a meaningful variable name up front. ln is fine, line_number is clearer but verbose, lineno is a nice compromise. Let's stick with ln here since we all know what it means.

for ln, line in enumerate(book):

Third, as utdemir pointed out in the comments, you don't really need regex for this. Possibly it makes sense if you want your user to be able to enter more sophisticated searches, but REs are complicated enough that they make a questionable default ui. I would just use in for simple substring matching, as in:

    if word_search in line:
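To make the difference concrete, here is a minimal sketch (Python 3) comparing the two kinds of test. The helper names `substring_match` and `regex_match` are made up for this illustration; `re.search` and `re.escape` are the standard library calls. It shows that a character such as `.` is literal for `in` but is pattern syntax for a regex unless the input is escaped.

```python
import re

def substring_match(needle, line):
    """Plain substring test: every character in needle is taken literally."""
    return needle in line

def regex_match(pattern, line):
    """Regex test: characters such as '.' or '(' are pattern syntax."""
    return re.search(pattern, line) is not None

line = "1:1 In the beginning God created the heaven and the earth."

# With `in`, "earth." only matches the literal text "earth."
print(substring_match("earth.", line))
print(substring_match("earth.", "the earthy smell"))

# As a regex, '.' matches any character, so "earthy" matches too ...
print(regex_match("earth.", "the earthy smell"))

# ... unless the user input is escaped first, which makes it literal again.
print(regex_match(re.escape("earth."), "the earthy smell"))
```

If you do keep the regex, escaping the input this way is one option for users who only want to type plain words.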

The remaining if statements are fine, and in some cases, this is the best thing to do. However, often in situations that would call for (say) case statements, it's actually better to use a dictionary. Of course, here you've got ranges, so we have to be a bit more clever.

Let's start with a dictionary that maps each book's first line number to its name. As is probably obvious, this should precede the loop so we aren't redefining the dictionary every time through.

first_lines = {36: 'Genesis', 4812: 'Exodus', 8867: 'Leviticus', 11749: 'Numbers'}

Now we have to map ln to one of these dictionary values. But the odds are good that ln isn't equal to any of the above numbers, and so we can't plug it directly into the dictionary. We could use a for loop to iterate over the dictionary keys (for key in first_lines), store the previous key in prev_key, test whether ln > key, and if so, return prev_key. But there's actually a much nicer way to do it in Python. Instead of writing a normal loop, we filter the keys, using either the built-in function filter or a list comprehension to keep only the values that are no larger than ln. Then we find the max.

first_line = max(filter(lambda l: l <= ln, first_lines))

Here first_lines acts like an unordered list of its keys; in general, you can iterate over the keys in a dictionary just as you would a list, with the caveat that the keys can take any order. lambda is a way to define a short function: this function takes l as an argument and returns the result of l <= ln (<= rather than <, so that a match on the very first line of a book is still assigned to that book). We have to do it this way because filter wants a function as its first argument. In Python 2 it returns a list containing all the values from first_lines that give a True result.

Since this can be a bit hard to read, especially when lambda is involved, we're probably better off using a list comprehension here. List comprehensions are quite readable and intuitive for most people.

first_line = max([l for l in first_lines if l <= ln])

We can even leave out the brackets in this case, since we're passing it directly to a function. Python interprets this as something called a "generator expression", which is akin to a list comprehension but calculates the values on the fly, instead of storing them in a list up front.

first_line = max(l for l in first_lines if l <= ln)

Now to get the name of the book, all you have to do is use first_line as a key:

bibook = first_lines[first_line]
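As a variation on the same idea (not part of the answer above), the lookup can also be sketched with the standard library's `bisect` module, which does a binary search over a sorted list of start lines instead of scanning every key. This is a Python 3 sketch; `BOOK_STARTS`, `START_LINES` and `book_for_line` are names invented here, and the line numbers are the ones from the question. It also shows one way to handle a match that falls above the first book, where `max` would otherwise be given an empty sequence.

```python
from bisect import bisect_right

# (first line, book) pairs sorted by first line; numbers taken from the question.
BOOK_STARTS = [(36, 'Genesis'), (4812, 'Exodus'),
               (8867, 'Leviticus'), (11749, 'Numbers')]
START_LINES = [start for start, _ in BOOK_STARTS]

def book_for_line(ln, default=''):
    """Return the book whose first line is the largest one <= ln."""
    idx = bisect_right(START_LINES, ln)  # how many start lines are <= ln
    if idx == 0:
        return default  # ln is above the first book (e.g. the file header)
    return BOOK_STARTS[idx - 1][1]

print(book_for_line(36))    # Genesis
print(book_for_line(5000))  # Exodus
print(book_for_line(10))    # empty string: before the first book
```

With only a handful of books the speed difference is negligible; the main gain is that the boundary cases are handled in one place.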

The final outcome:

word_search = raw_input(r'Enter a word to search: ')
book = open("KJV.txt", "r")
# first line number of each book -> book name
first_lines = {36: 'Genesis', 4812: 'Exodus', 8867: 'Leviticus', 11749: 'Numbers'}

for ln, line in enumerate(book):
    if word_search in line:
        # books starting at or before this line; empty for matches above line 36
        starts = [l for l in first_lines if l <= ln]
        bibook = first_lines[max(starts)] if starts else ''

        template = "\nLine: {0}\nString: {1}\nBook: {2}\n"
        output = template.format(ln, word_search, bibook)
        print output


柠栀 2024-11-26 08:18:28


Just a slightly changed version of your code.

word_search = raw_input(r'Enter a word to search: ')

with open("KJV.txt", "r") as book:
    # using with closes the file for you, even if an error occurs.
    bibook = ''
    for pos, line in enumerate(book):
        # a file object is already iterable, so I don't think we need readlines.
        if word_search in line:
            # the books follow one another, so checking the upper limit is enough.
            # a comparison is also a lot faster than `in range(...)`.
            if pos < 4809:
                bibook = 'Genesis'
            elif pos < 8859:
                bibook = 'Exodus'
            elif pos < 11741:
                bibook = 'Leviticus'
            else:
                bibook = 'Numbers'
            # you can use string templates, but I don't think they're needed here
            out = "\nLine: {0}\nString: {1}\nBook: {2}".format(
                pos, line, bibook)

            print(out)

Edit:

Now I've read your example file. I think separating the leading "1:2" part of each line and using it to work out the book and line number would be a better option.
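A rough sketch of that idea in Python 3, assuming each verse line begins with a `chapter:verse` marker followed by whitespace (for example `1:2 And the earth ...`). That format is an assumption based on the "1:2" mentioned above, so check it against the real file; `VERSE_RE` and `split_reference` are names invented for this example.

```python
import re

# Assumed format: a verse line begins with "<chapter>:<verse> ", e.g. "1:2 And ..."
VERSE_RE = re.compile(r'^\s*(\d+):(\d+)\s+(.*)')

def split_reference(line):
    """Return (chapter, verse, text) for a verse line, or None otherwise."""
    m = VERSE_RE.match(line)
    if m is None:
        return None  # continuation line, heading, blank line, ...
    return int(m.group(1)), int(m.group(2)), m.group(3)

print(split_reference("1:2 And the earth was without form, and void;"))
print(split_reference("and darkness was upon the face of the deep."))
```

A marker of 1:1 would then signal the start of a new book, provided the books appear in a known order.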


寻找我们的幸福 2024-11-26 08:18:28


     if ln in range(36, 4809):
        bibook = 'Genesis'
     elif ln in range(4812, 8859):
        bibook = 'Exodus'
     elif ln in range(8867, 11741):
        bibook =  'Leviticus'
     elif ln in range(11749, 15713):
        bibook = 'Numbers'

is better written as:

#      (start, end, book)
tab = [(36, 4809, 'Genesis'), 
       (4812, 8859, 'Exodus'),
       (8867, 11741, 'Leviticus'),
       (11749, 15713, 'Numbers')]
for start, end, book in tab:
    if start <= ln < end:
        bibook = book
        break


傲鸠 2024-11-26 08:18:28


An easy way to avoid the elifs is a loop. It's also much more efficient to test whether a number is in a range with start <= ln < stop instead of ln in range(start, stop): in Python 2, range returns a list, and Python has to compare ln against each element.

import os
import sys
import re


word_search = raw_input(r'Enter a word to search: ')
book = open("KJV.txt", "r")
regex = re.compile(word_search)
bibook = ''

bookranges = [
    ((36, 4809),  'Genesis'),
    ((4812, 8859), 'Exodus'),
    ((8867, 11741), 'Leviticus'),
    ((11749, 15713), 'Numbers')
]


for ln, line in enumerate(book.readlines()):
    result = regex.search(line)
    if result:
        for (start, stop), bibook in bookranges:
            if start <= ln < stop:
                # found the book, so end the loop and use it below
                break
        else:
            # didn't find any range that matches.
            bibook = 'Somewhere between books'

        template = "\nLine: {0}\nString: {1}\nBook: {2}\n"
        output = template.format(ln, result.group(), bibook)
        print output
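As a quick check that the chained comparison covers exactly the same lines as the original range test, here is a small Python 3 sketch; `in_half_open` is a helper name made up for this example. Note that in Python 3 `range` is a lazy sequence and `ln in range(start, stop)` no longer builds a list, so the cost argument above applies to Python 2, which is what the code in this thread targets.

```python
def in_half_open(ln, start, stop):
    """True when start <= ln < stop, the same lines range(start, stop) covers."""
    return start <= ln < stop

# The two tests agree on every boundary of the Genesis range from the question.
for ln in (35, 36, 4808, 4809):
    assert in_half_open(ln, 36, 4809) == (ln in range(36, 4809))

print(in_half_open(36, 36, 4809))    # True: start is included
print(in_half_open(4809, 36, 4809))  # False: stop is excluded
```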


情话已封尘 2024-11-26 08:18:28


You could try something like this. Note that the books appear one after the other, so you only need to record which is the book that you're currently looking at. Also, your approach with checking if the line number is in a range is very expensive, since for each line in the text file, you construct each range, and then do a linear scan to see if the line number appears in it.

books = [("Introduction",36),("Genesis",4809),("Exodus",8859),
         ("Leviticus",11741),("Numbers",15713)]

import os
import sys
import re

word_search = raw_input(r'Enter a word to search: ')
book = open("KJV.txt", "r")
bookIndex = 0
bookEnd = books[bookIndex][1]

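for lineNum, line in enumerate(book):
    # move on to the next book, but stop at the last entry so the index stays
    # valid; extend `books` to cover the rest of the file
    if lineNum > bookEnd and bookIndex < len(books) - 1:
        bookIndex += 1
        bookEnd = books[bookIndex][1]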
    if word_search in line:
        template = "\nLine: {0}\nString: {1}\nBook: {2}\n"
        output = template.format(lineNum, line, books[bookIndex][0])
        print output

One of the comments pointed out that you may be able to take a more data-driven approach, rather than hard-coding the book positions. Does each of the books start with a line or lines in a recognisable format? If so, you could try checking for this and recording which book you're currently looking at.
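A minimal sketch of what that could look like in Python 3, assuming each book is introduced by a heading line that can be matched exactly. The heading strings in `headings` below are hypothetical placeholders, so check the real file for the actual wording; `search_with_books` is a name invented for this example.

```python
def search_with_books(lines, word, headings):
    """Yield (line number, book, line) for each line containing word.

    headings maps an exact heading line (stripped) to a book name.
    """
    current = ''  # no book seen yet
    for ln, line in enumerate(lines):
        stripped = line.strip()
        if stripped in headings:
            current = headings[stripped]  # entering a new book
            continue
        if word in line:
            yield ln, current, stripped

# Hypothetical heading text; the real file may word these differently.
headings = {
    'The First Book of Moses: Called Genesis': 'Genesis',
    'The Second Book of Moses: Called Exodus': 'Exodus',
}
sample = [
    'The First Book of Moses: Called Genesis',
    '1:1 In the beginning God created the heaven and the earth.',
    'The Second Book of Moses: Called Exodus',
    '1:1 Now these are the names of the children of Israel,',
]
for hit in search_with_books(sample, 'the', headings):
    print(hit)
```

Because `lines` can be any iterable, an open file object can be passed in directly, and no line numbers need to be hard-coded.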
