Python: loop over the lines of a file and, if a line matches a line in another file, return the original line

Posted on 2024-12-01 12:37:09

Text file 1 has the following format:

'WORD': 1
'MULTIPLE WORDS': 1
'WORD': 2

etc.

I.e., a word separated by a colon followed by a number.

Text file 2 has the following format:

'WORD'
'WORD'

etc.

I need to extract single words (i.e., only WORD not MULTIPLE WORDS) from File 1 and, if they match a word in File 2, return the word from File 1 along with its value.

I have some poorly functioning code:

import re

def GetCounts(file1, file2):
    target_contents  = open(file1).readlines()  #file 1 as list--> 'WORD': n
    match_me_contents = open(file2).readlines()   #file 2 as list -> 'WORD'
    ls_stripped = [x.strip('\n') for x in match_me_contents]  #get rid of newlines

    match_me_as_regex= re.compile("|".join(ls_stripped))   

    for line in target_contents:
        first_column = line.split(':')[0]  #get the first item in line.split
        number = line.split(':')[1]   #get the number associated with the word
        if len(first_column.split()) == 1: #get single word, no multiple words 
            """ Does the word from target contents match the word
            from match_me contents?  If so, return the line from  
            target_contents"""
            if re.findall(match_me_as_regex, first_column):  
                print first_column, number

# OUTPUT: WORD, n
#         WORD, n
#         etc.

Because of the regex, the output is shoddy. For example, the code will return 'asset, 2', since re.findall() matches the 'set' from match_me inside 'asset'. I need to match the target word against the entire word from match_me to prevent the bad output that results from partial regex matches.
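
The partial-match problem described above can be reproduced in a few lines. This is only a minimal sketch in Python 3 syntax, assuming the words from file 2 are plain strings without quotes; build_pattern is a hypothetical helper name, not something from the original code. It escapes each word and compares re.search with re.fullmatch.

```python
import re

def build_pattern(words):
    """Compile an alternation of literal words (hypothetical helper)."""
    # re.escape keeps characters such as '.' or '+' from acting as regex syntax.
    return re.compile("|".join(re.escape(w) for w in words))

pattern = build_pattern(["set", "word"])

# search (like findall) accepts a match anywhere inside the string,
# which is why 'asset' is reported when 'set' is in the word list.
partial = bool(pattern.search("asset"))

# fullmatch requires the whole string to equal one of the alternatives.
whole = bool(pattern.fullmatch("asset"))
exact = bool(pattern.fullmatch("set"))

print(partial, whole, exact)
```

If a regex is not otherwise needed, a plain set membership test gives the same whole-word behaviour.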

Comments (8)

旧人 2024-12-08 12:37:09

If file2 is not humongous, slurp them into a set:

file2=set(open("file2").read().split())
for line in open("file1"):
    if line.split(":")[0].strip("'") in file2:
        print line

七堇年 2024-12-08 12:37:09

I guess by "poorly functioning" you mean speed wise? Because I tested and it does appear to work.

You could make things more efficient by making a set of the words in file2:

word_set = set(ls_stripped)

And then instead of findall you'd see if it's in the set:

in_set = just_word in word_set

Also feels cleaner than a regex.
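
Put together, the set-based idea might look like the following. This is a sketch in Python 3 syntax under a few assumptions: the quotes are literally present in both files and should be ignored, every count is an integer, and get_counts is a hypothetical name. It takes any iterable of lines, so open file objects work as well as the in-memory files used here.

```python
import io

def get_counts(target_lines, match_lines):
    """Return (word, count) pairs for single words that also appear in match_lines."""
    word_set = {line.strip().strip("'") for line in match_lines if line.strip()}
    results = []
    for line in target_lines:
        # rpartition splits on the last colon, so a colon inside the word survives.
        word, sep, number = line.strip().rpartition(":")
        if not sep:
            continue  # blank or malformed line without a colon
        word = word.strip().strip("'")
        # Keep single words only, then test for an exact match in the set.
        if len(word.split()) == 1 and word in word_set:
            results.append((word, int(number)))
    return results

file1 = io.StringIO("'set': 1\n'asset': 2\n'two words': 3\n'set': 4\n")
file2 = io.StringIO("'set'\n'other'\n")
pairs = get_counts(file1, file2)
print(pairs)
```

Because membership is an exact string comparison, 'asset' is not reported just because 'set' is in the second file.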

无声静候 2024-12-08 12:37:09

It seems like this may simply be a special case of grep. If file2 is essentially a list of patterns, and the output format is the same as file1, then you might be able to just do this:

grep -wf file2 file1

The -w tells grep to match only whole words.
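
For comparison, here is a rough Python 3 approximation of whole-word line filtering. It is a sketch, not an exact equivalent: grep treats the lines of the pattern file as regular expressions unless -F is given, whereas this version escapes them and treats them as literal strings. whole_word_lines is a hypothetical name, and the patterns are assumed to be non-empty and unquoted.

```python
import re

def whole_word_lines(patterns, lines):
    """Keep lines that contain any pattern as a whole word."""
    alternation = "|".join(re.escape(p) for p in patterns)
    # The lookarounds require a non-word character (or a line edge) on both sides.
    regex = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return [line for line in lines if regex.search(line)]

lines = ["'set': 1", "'asset': 2", "'tea set': 3"]
kept = whole_word_lines(["set"], lines)
print(kept)
```

Note that whole-word matching still keeps the 'tea set' line, because 'set' is a whole word inside it. If multi-word entries must be excluded, they need a separate check.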

一世旳自豪 2024-12-08 12:37:09

This is how I'd do this. I don't have a Python interpreter on hand, so there may be a couple of typos.

One of the main things you should remember when coming to Python (especially if coming from Perl) is that regular expressions are usually a bad idea: the string methods are powerful and very fast.

def GetCounts(file1, file2):
    data = {}
    for line in open(file1):
        try:
            word, n = line.rsplit(':', 1)
        except ValueError: # not enough values
            #some kind of input error, go to next line
            continue
        n = int(n.strip())
        if word[0] == word[-1] == "'":
            word = word[1:-1]
        data[word] = n

    for line in open(file2):
        word = line.strip()
        if word[0] == word[-1] == "'":
            word = word[1:-1]
        if word in data:
            print word, data[word]

梦在深巷 2024-12-08 12:37:09
import re
from operator import methodcaller

re_target = re.compile(r"^'([a-z]+)': +(\d+)", re.M|re.I)
match_me_contents = open(file2).read().splitlines()
match_me_contents = set(map(methodcaller('strip', "'"), match_me_contents))

res = []
for match in re_target.finditer(open(file1).read()):
    word, value = match.groups()
    if word in match_me_contents:
        res.append((word, value))

追星践月 2024-12-08 12:37:09

My two input files:

file1.txt:

'WORD': 1
'MULTIPLE WORDS': 1
'OTHER': 2

file2.txt:

'WORD'
'NONEXISTENT'

If file2.txt is guaranteed not to have multiple words on a line, then there is no need to explicitly filter these from the first file. This will be done by the membership test:

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f)

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(':')

        # This assumes that strings with a space (multiple words) do not exist in
        # the second file.
        if word in allowed_words:
            print word, count

And running this gives:

$ python extract.py
'WORD' 1

If file2.txt might contain multiple words, simply modify the test in the loop:

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f)

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(':')

        # This prevents multiple words from being selected.
        if word in allowed_words and ' ' not in word:
            print word, count

Note I haven't bothered stripping the quotes from the words. I'm not sure if this is necessary - it depends on whether the input is guaranteed to have them or not. It would be trivial to add that step.

Something else you should consider is case-sensitivity. If lowercase and uppercase words should be treated as the same, then you should convert all input to uppercase (or lowercase, it does not matter which) prior to doing any testing.
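
A case-insensitive variant could be sketched as follows, in Python 3 syntax. The assumptions here are mine: filter_counts is a hypothetical name, quotes are left on the words as in the code above, and str.casefold() is used in place of upper() or lower(). Only the comparison is normalized, so the word is returned exactly as it appears in the first file.

```python
def filter_counts(target_lines, allowed_lines):
    """Case-insensitive membership test that returns words as written in file 1."""
    # Normalize the allowed words once, when the set is built.
    allowed = {line.strip().casefold() for line in allowed_lines if line.strip()}
    kept = []
    for line in target_lines:
        word, sep, count = line.strip().partition(":")
        # Normalize the candidate the same way before the lookup.
        if sep and word.strip().casefold() in allowed:
            kept.append((word.strip(), count.strip()))
    return kept

result = filter_counts(["'Word': 1", "'OTHER': 2"], ["'WORD'", "'nonexistent'"])
print(result)
```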

EDIT: It would probably be more efficient to remove multiple words from the set of allowed words, rather than doing the check on every line of file1:

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f if ' ' not in word.strip())

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(':')

        # Check if the word is allowed.
        if word in allowed_words:
            print word, count

等风来 2024-12-08 12:37:09

This is what I came up with:

def GetCounts(file1, file2):
    target_contents  = open(file1).readlines()  #file 1 as list--> 'WORD': n
    match_me_contents = set(open(file2).read().split('\n'))   #file 2 as list -> 'WORD'  
    for line in target_contents:
        word = line.split(': ')[0]  #get the first item in line.split
        if " " not in word:
            number = line.split(': ')[1]   #get the number associated with the word
            if word in match_me_contents:  
                print word, number

Changes from your version:

  • Moved to set from a regex
  • Went to split instead of readlines to get rid of newlines without extra processing
  • Changed from splitting the word into words and checking if the length of that is one to simply checking if a space is in the "word" directly
    • This could cause a bug if the "space" isn't an actual space character, though. It could be fixed by using a regex for "\s" or an equivalent, at the cost of a performance penalty.
  • Added a space into line.split(': ') so that number won't be prefixed with a space
    • This could cause a bug if there is not a space before the number.
  • Moved number = line.split(': ')[1] to after the check for spaces in the word, for efficiency, minor though the speed difference would be (almost certainly the bulk of the time is spent checking whether a word is in the target)

However, these potential bugs would only occur if the actual input is not in the format you presented.
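
Both potential bugs listed above can be avoided without a regex. A minimal sketch in Python 3 syntax, with parse_line as a hypothetical helper: str.split() with no argument splits on any run of whitespace (tabs included), and partition(':') followed by strip() does not care whether a space follows the colon.

```python
def parse_line(line):
    """Split "'WORD': n" into (word, number); return None for multi-word or malformed lines."""
    word, sep, number = line.partition(":")
    if not sep:
        return None  # no colon on this line
    word = word.strip()
    # split() with no argument treats spaces, tabs and other whitespace alike.
    if len(word.split()) != 1:
        return None
    return word, number.strip()

a = parse_line("'WORD': 1\n")
b = parse_line("'WORD':1\n")
c = parse_line("'TWO\tWORDS': 3\n")
print(a, b, c)
```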

仅一夜美梦 2024-12-08 12:37:09

Let's exploit the similarity of the file format to Python expression syntax:

from ast import literal_eval
with open("file1") as f:
  # Note: building one dict means a repeated word keeps only its last value.
  word_values = literal_eval('{' + ','.join(line for line in f) + '}')
with open("file2") as f:
  expected_words = set(literal_eval(line) for line in f)
word_values = {k: v for (k, v) in word_values.items() if k in expected_words}
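
The question's sample shows the same word on more than one line ('WORD': 1 and 'WORD': 2), and a single dict literal keeps only the last value for a repeated key. A possible variation, sketched in Python 3 syntax with the hypothetical helper parse_pairs, evaluates each line on its own so that every occurrence is kept. It assumes each non-blank line is a valid 'WORD': n pair.

```python
from ast import literal_eval

def parse_pairs(lines):
    """Parse each "'WORD': n" line separately so repeated words are preserved."""
    pairs = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Wrapping a single line in braces yields a one-item dict literal.
        pairs.extend(literal_eval("{" + line.strip() + "}").items())
    return pairs

file1_lines = ["'WORD': 1\n", "'MULTIPLE WORDS': 1\n", "'WORD': 2\n"]
expected_words = {literal_eval(line) for line in ["'WORD'\n", "'OTHER'\n"]}
matches = [(k, v) for k, v in parse_pairs(file1_lines) if k in expected_words]
print(matches)
```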