How can I search for a string in Python with line breaks removed, but still return the exact line where the string was found?
I have a bunch of PDF files that I need to search for a set of keywords, and I have to extract the exact line where each keyword is found. I first used xpdf's pdftotext to convert the files to text. (I tried Solr but had a tough time tailoring the output/schema to suit my requirements.)
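import sys

file_name = sys.argv[1]
searched_string = sys.argv[2]

# Collect (line number, line) pairs for every line that contains the
# search string, ignoring case.
with open(file_name) as f:
    result = [(line_number, line) for line_number, line in enumerate(f, start=1)
              if searched_string.lower() in line.lower()]

for line_number, line in result:
    print(line_number, line, end="")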
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem with this is the case where the search string is broken across the end of a line:
If you are going to index large binary files, remember to change the size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this keyword with the code above. What is the most efficient way of handling this without making two copies of the text file (one for searching the keyword to get the line number, and another with line breaks removed to catch keywords that span two lines)?
Much appreciated guys!
Comments (3)
Note: Some considerations without any code, but I think they belong in an answer rather than a comment. My idea would be to search only for the first keyword; if a match is found, search for the second. If the first match is at the end of a line, this lets you take the next line into consideration, and you only do line concatenation when a match is found in the first place*.
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
It iterates over the file handler and produces a (line_number, word) tuple for each word in the file.
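The original snippet did not survive on this page, so here is a minimal sketch of what such a generator might look like, based only on the description above. The name `iter_words` and the sample text are assumptions for illustration, not the author's code.

```python
import io


def iter_words(lines):
    """Yield a (line_number, word) tuple for every whitespace-separated word.

    `lines` can be an open file object or any iterable of strings.
    (Hypothetical helper; the answer's own snippet is not shown here.)
    """
    for line_number, line in enumerate(lines, start=1):
        for word in line.split():
            yield line_number, word


# io.StringIO stands in for an opened text file.
sample = io.StringIO(
    "remember to change the size limits. String\n"
    "Extraction is a common problem\n"
)
words = list(iter_words(sample))
print(words[6:8])  # [(1, 'String'), (2, 'Extraction')]
```

Because each word carries its own line number, a phrase that wraps onto the next line still shows up as consecutive items in the stream.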
The matching afterwards becomes pretty easy; you can find my implementation as a gist on GitHub. It can be run as follows:
There is one main concern with the linked code; I didn't code a workaround, for both performance and complexity reasons. Can you figure it out? (Spoiler: try searching for a sentence whose first word appears twice in a row in the file.)
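One way the repeated-first-word case might be handled is to compare a sliding window of the last few words against the phrase, instead of tracking a single "how far have I matched" counter. This is a sketch under that assumption, not the linked gist; `iter_words` and `find_phrase` are hypothetical names, and matching is on whole words, so punctuation glued to a word (for example "limits.") has to appear in the phrase too.

```python
from collections import deque


def iter_words(lines):
    """Yield (line_number, lowercased word) for each word in the input."""
    for line_number, line in enumerate(lines, start=1):
        for word in line.split():
            yield line_number, word.lower()


def find_phrase(lines, phrase):
    """Return (start_line, end_line) for each occurrence of `phrase`.

    Words are compared case-insensitively; a phrase may span line breaks.
    """
    target = phrase.lower().split()
    if not target:
        return []
    # The deque keeps only the most recent len(target) words.
    window = deque(maxlen=len(target))
    hits = []
    for item in iter_words(lines):
        window.append(item)
        if [word for _, word in window] == target:
            hits.append((window[0][0], window[-1][0]))
    return hits


text = [
    "the the string\n",
    "extraction is a common problem\n",
]
print(find_phrase(text, "the String Extraction"))  # [(1, 2)]
```

The window is re-checked after every word, so a false start on the first "the" does not hide the real match that begins on the second one.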
* I didn't perform any testing on my own, but this article and the Python wiki suggest that string concatenation is not that efficient in Python (I don't know how current that information is).
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then searching that resultant line. Then you'd assign line2 to line1, get a new line2, and repeat the process.
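A rough sketch of the rolling two-line idea, under some assumptions of mine: the line break stands in for a single space, matching is case-insensitive, and only the end of line1 plus the start of line2 are joined so that a match lying wholly inside one line is not counted again as a wrapped match. The function name `find_allowing_wrap` is made up for illustration.

```python
def find_allowing_wrap(lines, needle):
    """Return (line_number, wrapped) pairs for each line where `needle` is found.

    `wrapped` is True when the match starts on that line and continues
    onto the next one. Assumes the line break replaced exactly one space.
    """
    needle = needle.lower()
    if not needle:
        return []
    keep = len(needle) - 1  # most characters a wrapped match can use per side
    hits = []
    line1 = None
    number = 0
    for number, raw in enumerate(lines, start=1):
        line2 = raw.strip().lower()
        if line1 is not None:
            if needle in line1:
                hits.append((number - 1, False))
            # line3 joins just the boundary region of the two lines.
            line3 = line1[max(0, len(line1) - keep):] + " " + line2[:keep]
            if needle in line3:
                hits.append((number - 1, True))
        line1 = line2
    # The last line has no successor, so check it on its own.
    if line1 is not None and needle in line1:
        hits.append((number, False))
    return hits


text = [
    "remember to change the size limits. String\n",
    "Extraction is a common problem\n",
    "string extraction again\n",
]
print(find_allowing_wrap(text, "String Extraction"))  # [(1, True), (3, False)]
```

Only two lines are held in memory at a time, so this works on a file object directly and needs no second copy of the text.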
Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE
Then use \s to represent all white space (including new lines).
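A small sketch of the regex route, assuming the whole file fits in memory as one string. Note that `\s` already matches newlines without any flag; `re.MULTILINE` only changes how `^` and `$` behave, so it is left out here. The helper name `find_phrase_regex` is hypothetical, and the line number is recovered by counting newlines before the match.

```python
import re


def find_phrase_regex(text, phrase):
    """Return (line_number, matched_text) for each occurrence of `phrase`.

    Any run of whitespace in the text, including a newline, is accepted
    wherever the phrase has a space.
    """
    words = phrase.split()
    if not words:
        return []
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words), re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        # Line number of the match start = newlines before it, plus one.
        line_number = text.count("\n", 0, match.start()) + 1
        hits.append((line_number, match.group(0)))
    return hits


text = (
    "remember to change the size limits. String\n"
    "Extraction is a common problem\n"
)
print(find_phrase_regex(text, "string extraction"))
# [(1, 'String\nExtraction')]
```

In a script, `text` would come from something like `open(file_name).read()`, which keeps a single copy of the file while still giving line numbers.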