Python: loop through the lines of a file; if a line equals a line in another file, return the original line
Text file 1 has the following format:
'WORD': 1
'MULTIPLE WORDS': 1
'WORD': 2
etc.
I.e., a word followed by a colon and then a number.
Text file 2 has the following format:
'WORD'
'WORD'
etc.
I need to extract single words (i.e., only WORD not MULTIPLE WORDS) from File 1 and, if they match a word in File 2, return the word from File 1 along with its value.
I have some poorly functioning code:
    import re

    def GetCounts(file1, file2):
        target_contents = open(file1).readlines()    # file 1 as list --> 'WORD': n
        match_me_contents = open(file2).readlines()  # file 2 as list --> 'WORD'
        ls_stripped = [x.strip('\n') for x in match_me_contents]  # get rid of newlines
        match_me_as_regex = re.compile("|".join(ls_stripped))
        for line in target_contents:
            first_column = line.split(':')[0]  # get the first item in line.split
            number = line.split(':')[1]        # get the number associated with the word
            if len(first_column.split()) == 1:  # get single word, no multiple words
                """ Does the word from target contents match the word
                from match_me contents? If so, return the line from
                target_contents"""
                if re.findall(match_me_as_regex, first_column):
                    print first_column, number

    # OUTPUT: WORD, n
    #         WORD, n
    #         etc.
Because of the use of regex, the output is shoddy. The code will return 'asset, 2', for example, since re.findall() will match 'set' from match_me. I need to match the target word against the entire word from match_me to block the bad output resulting from partial regex matches.
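The partial match described above can be reproduced in isolation. This is a minimal sketch, assuming a hypothetical match list that contains 'set' and 'cat' (stand-ins for the stripped lines of file 2, without quotes):

```python
import re

# Hypothetical stand-in for the stripped lines of file 2.
match_me = ["set", "cat"]
pattern = re.compile("|".join(match_me))

# The alternation matches anywhere inside the string, so "asset" counts as a hit.
partial_hits = pattern.findall("asset")

# An exact comparison against the whole word does not have this problem.
exact_hit = "asset" in set(match_me)

print(partial_hits)  # ['set']
print(exact_hit)     # False
```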
If file2 is not humongous, slurp its lines into a set.
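A minimal sketch of the set approach, assuming both files use the formats shown in the question (quotes included on both sides). The function name `get_counts` and the list-of-tuples return value are illustrative choices, not taken from the answer:

```python
def get_counts(file1, file2):
    """Return (word, count) pairs from file1 whose word also appears in file2."""
    # Load every non-empty line of file2 into a set for fast exact-match lookups.
    with open(file2) as f:
        allowed = {line.strip() for line in f if line.strip()}

    results = []
    with open(file1) as f:
        for line in f:
            # Split on the last colon: "'WORD': 1" -> "'WORD'", ":", " 1".
            word, sep, count = line.rpartition(":")
            if not sep:
                continue  # skip lines without a colon
            word = word.strip()
            # Keep single words only, and require an exact match, so
            # "'asset'" no longer matches "'set'".
            if len(word.split()) == 1 and word in allowed:
                results.append((word, count.strip()))
    return results
```

Because membership in a set is an equality test, partial matches cannot occur, and lookup time does not grow with the number of words in file2.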
I guess by "poorly functioning" you mean speed-wise? Because I tested it and it does appear to work.
You could make things more efficient by making a set of the words in file2. Then, instead of calling findall, you'd check whether the word is in the set. That also feels cleaner than a regex.
It seems like this may simply be a special case of grep. If file2 is essentially a list of patterns, and the output format is the same as file1, then you might be able to do this with grep alone, using its -w option, which tells grep to match only whole words.
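For comparison, here is a rough Python analogue of whole-word matching. It is not grep and not the answerer's command: each word is escaped and wrapped in `\b` word boundaries. The function name and sample data are hypothetical, and the words are assumed to be given without their surrounding quotes:

```python
import re

def whole_word_matches(lines, words):
    """Return the lines that contain any of `words` as a whole word."""
    # re.escape keeps characters like '.' or '|' in a word from acting as regex
    # syntax; \b on both sides stops 'set' from matching inside 'asset'.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    return [line for line in lines if pattern.search(line)]

lines = ["'asset': 2", "'set': 1", "'tea set': 3"]
print(whole_word_matches(lines, ["set"]))  # ["'set': 1", "'tea set': 3"]
```

Note that whole-word matching still lets the multi-word entry through, so the single-word filter from the question would still be needed.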
This is how I'd do this. I don't have a Python interpreter on hand, so there may be a couple of typos.
One of the main things you should remember when coming to Python (especially if coming from Perl) is that regular expressions are usually a bad idea: the string methods are powerful and very fast.
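As an illustration of the string-method point, one line of file 1 can be taken apart without any regex. This is a sketch, assuming the `'WORD': n` format from the question with an integer count; the helper name `parse_line` is hypothetical:

```python
def parse_line(line):
    """Split "'WORD': n" into (word, number) using only string methods."""
    # partition splits at the first ': ' and never raises on a missing separator.
    word, sep, number = line.strip().partition(": ")
    if not sep:
        return None  # not in the expected format
    # Remove the surrounding single quotes, if present.
    return word.strip("'"), int(number)

print(parse_line("'asset': 2\n"))           # ('asset', 2)
print(parse_line("'multiple words': 1\n"))  # ('multiple words', 1)
```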
My two input files are file1.txt and file2.txt.
If file2.txt is guaranteed not to have multiple words on a line, then there is no need to explicitly filter these from the first file; the membership test takes care of it.
If file2.txt might contain multiple words, simply modify the test in the loop to also check for them.
Note I haven't bothered stripping the quotes from the words. I'm not sure if this is necessary; it depends on whether the input is guaranteed to have them or not. It would be trivial to add.
Something else you should consider is case-sensitivity. If lowercase and uppercase words should be treated as the same, then you should convert all input to uppercase (or lowercase, it does not matter which) prior to doing any testing.
EDIT: It would probably be more efficient to remove multiple words from the set of allowed words, rather than doing the check on every line of file1.
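A sketch that puts the ideas in this answer together: multi-word entries are removed from the allowed set once, up front, and both sides are lower-cased before comparison. It works on in-memory lists of lines rather than files. The function name, the `ignore_case` parameter, and the sample data are hypothetical, and whether case should be folded at all is an assumption about the data:

```python
def filter_counts(file1_lines, file2_lines, ignore_case=True):
    """Yield lines of file1 whose word is a single-word entry of file2."""
    def normalize(text):
        text = text.strip()
        return text.lower() if ignore_case else text

    # Drop multi-word (and blank) entries once here instead of checking
    # every line of file1.
    allowed = {normalize(w) for w in file2_lines if len(w.split()) == 1}

    for line in file1_lines:
        word = line.split(":")[0]
        if normalize(word) in allowed:
            yield line.strip()

file1_lines = ["'SET': 1\n", "'asset': 2\n", "'tea set': 3\n"]
file2_lines = ["'set'\n", "'tea set'\n"]
print(list(filter_counts(file1_lines, file2_lines)))  # ["'SET': 1"]
```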
This is what I came up with.
Changes from your version:
number = line.split(': ')[1] now comes after the check for whether the word contains spaces, for efficiency, minor though the speed difference would be (almost certainly the bulk of the time would be spent checking whether a word is in the target).
The potential bugs would only occur, however, if the actual input is not in the format you presented.
Let's exploit the similarity of the file format to Python expression syntax.
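One way this could look, as a sketch rather than the answerer's code: each line of file 1 reads like a single dict entry, so wrapping it in braces and passing it to `ast.literal_eval` yields the word and its count, and each line of file 2 is already a string literal. The function name and sample data are hypothetical:

```python
import ast

def parse_counts(file1_lines, file2_lines):
    """Return (word, count) pairs from file1 whose word appears in file2."""
    # Each file2 line, e.g. "'set'", evaluates to a plain str.
    allowed = {ast.literal_eval(line.strip()) for line in file2_lines if line.strip()}

    results = []
    for line in file1_lines:
        if not line.strip():
            continue
        # "{'set': 1}" is a valid dict literal with exactly one item.
        (word, count), = ast.literal_eval("{" + line.strip() + "}").items()
        if len(word.split()) == 1 and word in allowed:
            results.append((word, count))
    return results

print(parse_counts(["'set': 1\n", "'asset': 2\n", "'set': 5\n"], ["'set'\n"]))
# [('set', 1), ('set', 5)]
```

Parsing line by line, rather than joining the whole file into one dict literal, keeps repeated words such as 'set' above; a single dict would retain only the last count for each word.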