How can I search for a string in Python across line breaks, but return the exact line where the string was found?

Posted on 2024-11-26 21:24:16

I have a bunch of PDF files that I need to search for a set of keywords, and I have to extract the exact line where each keyword is found. I first used xpdf's pdftotext to convert the files to plain text. (I tried Solr, but had a tough time tailoring the output/schema to suit my requirement.)

import sys

file_name = sys.argv[1]
searched_string = sys.argv[2].lower()

# Collect (line number, line) pairs whose text contains the search string, ignoring case.
with open(file_name) as fh:
    result = [(number, line) for number, line in enumerate(fh, start=1)
              if searched_string in line.lower()]

for number, line in result:
    print(number, line.rstrip("\n"))

ThinkCode:~$ python find_string.py sample.txt "String Extraction"

The problem is that this misses cases where the search string is broken across the end of a line:

If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem

If I search for 'String Extraction' with the code above, I miss this occurrence. What is the most efficient way to handle this without making two copies of the text file (one searched line by line to get the line number, and another with line breaks removed to catch keywords that span two lines)?

Much appreciated guys!

橘和柠 2024-12-03 21:24:16

~~Note: Some considerations without any code, but I think they belong to an answer rather than to a comment.~~

My idea would be to search only for the first keyword; if a match is found, search for the second. If the match is at the end of a line, this lets you take the next line into account and concatenate lines only when a match was found in the first place*.
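
A minimal sketch of that first idea, written for illustration and not taken from the answer's own code. The function name `find_phrase` is hypothetical. It assumes matching is case-insensitive, that runs of whitespace count as a single space, and that a match crosses at most one line break. Only the tail of the previous line and the head of the current line are ever joined, and only when the previous line contains the first keyword.

```python
def find_phrase(lines, phrase):
    """Yield 1-based numbers of the lines where `phrase` starts.

    A match may sit inside one line or straddle a single line break.
    """
    words = phrase.lower().split()
    if not words:
        return
    target = " ".join(words)
    first = words[0]
    reach = len(target) - 1      # most characters a straddling match can use per side
    prev, prev_hit = "", False   # normalized previous line, and whether it matched in-line
    for number, raw in enumerate(lines, start=1):
        line = " ".join(raw.lower().split())   # lowercase, collapse whitespace
        # Cheap pre-check: join text only if the previous line holds the first keyword.
        if not prev_hit and first in prev:
            tail = prev[max(0, len(prev) - reach):]
            if target in tail + " " + line[:reach]:
                yield number - 1               # the match starts on the previous line
        prev_hit = target in line
        if prev_hit:
            yield number
        prev = line


sample = [
    "If you are going to index large binary files, remember to change the\n",
    "size limits. String\n",
    "Extraction is a common problem\n",
]
print(list(find_phrase(sample, "String Extraction")))  # prints [2]
```

Because `lines` can be an open file object, this reads the file once and keeps only two lines in memory at a time.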

Edit:

I coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:

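def iterwords(fh):
    # Yield a (line_number, word) tuple for every word; line_number is 0-based.
    for number, line in enumerate(fh):
        # str.split() with no arguments splits on whitespace runs and skips blank lines.
        for word in line.split():
            yield number, word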

It iterates over the file handle and yields a (line_number, word) tuple for each word in the file.

The matching afterwards becomes pretty easy; you can find my implementation as a gist on GitHub. It can be run as follows:

python search.py 'multi word search string' file.txt

There is one main concern with the linked code, for which I didn't code a workaround, for both performance and complexity reasons. Can you figure it out? (Spoiler: try searching for a sentence whose first word appears twice in a row in the file.)
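
One possible way around the repeated-first-word case is to compare a fixed-size sliding window of recent words against the search phrase, so a failed partial match never needs to be restarted. The sketch below is not the linked gist. `find_phrase_words` is a hypothetical name, and the copy of `iterwords` used here numbers lines from 1. It matches whole words case-insensitively, so punctuation attached to a word (such as "limits.") is treated as part of that word.

```python
import io
from collections import deque


def iterwords(fh):
    # Yield a (line_number, word) tuple for every word; line numbers start at 1 here.
    for number, line in enumerate(fh, start=1):
        for word in line.split():
            yield number, word


def find_phrase_words(fh, phrase):
    """Yield the 1-based line number of the first word of each match."""
    wanted = [w.lower() for w in phrase.split()]
    if not wanted:
        return
    # The deque keeps only the last len(wanted) words, dropping the oldest automatically.
    window = deque(maxlen=len(wanted))
    for number, word in iterwords(fh):
        window.append((number, word.lower()))
        if [w for _, w in window] == wanted:
            yield window[0][0]


# The first word of the phrase appears twice in a row, then the phrase spans a line break.
sample = io.StringIO("size limits. String String\nExtraction is a common problem\n")
print(list(find_phrase_words(sample, "String Extraction")))  # prints [1]
```

The window costs one list comparison per word, which is a trade of some speed for simplicity.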

* I didn't run any tests myself, but this article and the Python wiki suggest that string concatenation is not that efficient in Python (I don't know how current that information is).
