awk in Python: how do I use an awk script in a Python class?

Posted on 2024-12-11 12:07:52

I am trying to run an awk script using python, so I can process some data.

Is there any way to get an awk script to run in a Python class without using the system class to invoke it as a shell process? The framework where I run these Python scripts does not allow subprocess calls, so I am stuck either figuring out how to convert my awk script to Python or, if it is possible, running the awk script inside Python.

Any suggestions? My awk script basically reads a text file and isolates the protein blocks that contain a specific chemical compound, printing them out to a different file. (The text file is output generated by our framework; I've added an example of what it looks like below.)

    buildProtein compoundA compoundB
    begin fusion
    Calculate : (lots of text here on multiple lines)
    (more lines)
    Final result - H20: value CO2: value Compound: value 
    Other Compounds X: Value Y: value Z:value

    [...another similar block]

So, for example, if I build a protein, I need to check whether CH3COOH appears among the compounds in the final result line. If it does, I have to take the whole block, starting from the command "buildProtein" up to the beginning of the next block, and save it to a file. Then I move on to the next block and check whether it also has the compound I am looking for; if it does not, I skip to the next one, and so on until the end of the file. (The file has multiple occurrences of the compound I search for; sometimes the matching blocks are contiguous, and other times they alternate with blocks that do not contain the compound.)

Any help is more than welcome; I have been banging my head against this for weeks, and after finding this site I decided to ask for some help.

Thanks in advance for your kindness!

莳間冲淡了誓言ζ 2024-12-18 12:07:52

If you cannot use the subprocess module, your best bet is to recode your AWK script in Python. For that, the fileinput module is a great transition tool with an AWK-like feel.

眼睛会笑 2024-12-18 12:07:52

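Python's re module can help, or, if you would rather not bother with regular expressions and just need some quick field separation, you can use the built-in str methods .split() and .find().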
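
To make the two options concrete, here is a small sketch that pulls the compound names out of a "Final result" line both ways. The sample line (with CH3COOH added) and the variable names are assumptions for illustration; the real file format may need a different pattern.

```python
import re

# Hypothetical line in the shape shown in the question
line = "Final result - H20: value CO2: value CH3COOH: value"

# Option 1: regular expression, every run of word characters followed by a colon
names_re = re.findall(r'(\w+):', line)

# Option 2: plain string methods, no regular expressions
prefix = "Final result - "
names_split = []
if line.find(prefix) == 0:                      # the line starts with the prefix
    for token in line[len(prefix):].split():    # whitespace-separated fields
        if token.endswith(':'):                 # "CO2:" is a name, "value" is not
            names_split.append(token.rstrip(':'))

print(names_re)
print(names_split)
print('CH3COOH' in names_split)
```

The regular expression also copes with entries written without a space (such as `Z:value`), while the split-based version assumes a space after each colon.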

终止放荡 2024-12-18 12:07:52

I have barely started learning AWK, so I can't offer any advice on that front. However, here is some Python code that does what you need:

class ProteinIterator:
    """Iterate over a build file, yielding one Protein per 'buildProtein' block."""
    def __init__(self, file):
        self.file = open(file, 'r')
        self.first_line = self.file.readline()
    def __iter__(self):
        return self
    def __next__(self):
        """Return the next protein build."""
        if not self.first_line:     # reached end of file
            self.file.close()       # closing an already closed file is harmless
            raise StopIteration
        file = self.file
        protein_data = [self.first_line]
        while True:
            line = file.readline()
            # the next block header, or EOF (empty string), ends the current block
            if line.startswith('buildProtein ') or not line:
                self.first_line = line
                break
            protein_data.append(line)
        return Protein(protein_data)

class Protein:
    def __init__(self, data):
        self._data = data
        # default to empty tuples so a block missing a line still has every attribute
        self.initial_compounds = self.final_compounds = self.other_compounds = ()
        for line in data:
            if line.startswith('buildProtein '):
                self.initial_compounds = tuple(line[13:].split())
            elif line.startswith('Final result - '):
                pieces = line[15:].split()[::2]   # every other piece is a name
                self.final_compounds = tuple(p.rstrip(':') for p in pieces)
            elif line.startswith('Other Compounds '):
                pieces = line[16:].split()[::2]   # every other piece is a name
                self.other_compounds = tuple(p.rstrip(':') for p in pieces)
    def __repr__(self):
        return "Protein(%s)" % self._data[0].strip()
    @property
    def data(self):
        """The full text of the block, exactly as read from the file."""
        return ''.join(self._data)

What we have here is an iterator for the buildProtein text file which returns one protein at a time as a Protein object. This Protein object is smart enough to know its inputs, final results, and other results. You may have to modify some of the code if the actual text in the file is not exactly as represented in the question. Following is a short test of the code with example usage:

if __name__ == '__main__':
    test_data = """\
buildProtein compoundA compoundB
begin fusion
Calculate : (lots of text here on multiple lines)
(more lines)
Final result - H20: value CO2: value Compound: value 
Other Compounds X: Value Y: value Z: value"""

    with open('testPI.txt', 'w') as f:   # make sure the file is flushed and closed before reading
        f.write(test_data)
    for protein in ProteinIterator('testPI.txt'):
        print(protein.initial_compounds)
        print(protein.final_compounds)
        print(protein.other_compounds)
        print()
        if 'CO2' in protein.final_compounds:
            print(protein.data)

I didn't bother saving values, but you can add that in if you like. Hopefully this will get you going.
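
For the saving step the question asks about, something along these lines might work. It is a self-contained sketch rather than an extension of the classes above: the helper names (`iter_blocks`, `has_compound`, `extract_blocks`), the file names, and the sample data are all made up, and it assumes every block starts with a `buildProtein ` line and that matching blocks go into a single output file.

```python
import os
import tempfile

def iter_blocks(lines, header='buildProtein '):
    """Yield one list of lines per block; a block starts at each header line."""
    block = []
    for line in lines:
        if line.startswith(header) and block:
            yield block
            block = []
        if block or line.startswith(header):   # ignore anything before the first header
            block.append(line)
    if block:
        yield block

def has_compound(block, compound, marker='Final result - '):
    """True if `compound` is named on the block's final result line."""
    for line in block:
        if line.startswith(marker):
            tokens = line[len(marker):].split()
            # "CO2:" and "Z:value" both count as names; bare values are skipped
            names = [tok.split(':', 1)[0] for tok in tokens if ':' in tok]
            return compound in names
    return False

def extract_blocks(src_path, dst_path, compound):
    """Copy every block containing `compound` to dst_path; return how many were kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for block in iter_blocks(src):
            if has_compound(block, compound):
                dst.writelines(block)
                kept += 1
    return kept

# Hypothetical sample: blocks 1 and 3 contain CH3COOH, block 2 does not
sample = (
    "buildProtein compoundA compoundB\n"
    "begin fusion\n"
    "Final result - H20: value CO2: value CH3COOH: value\n"
    "Other Compounds X: Value Y: value Z:value\n"
    "buildProtein compoundC compoundD\n"
    "begin fusion\n"
    "Final result - H20: value CO2: value\n"
    "buildProtein compoundE compoundF\n"
    "Final result - CH3COOH: value\n"
)
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'build.txt')
dst_path = os.path.join(workdir, 'matches.txt')
with open(src_path, 'w') as f:
    f.write(sample)
kept = extract_blocks(src_path, dst_path, 'CH3COOH')
with open(dst_path) as f:
    saved = f.read()
print(kept)
```

Because the file is read one line at a time and only the current block is held in memory, this should behave much like the awk version on large inputs.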
