在源代码中编码 Blosum62

发布于 2024-08-24 01:04:47 字数 299 浏览 4 评论 0原文

我正在尝试使用“Needleman -Wunsch”的“全局对齐”算法来实现蛋白质成对序列对齐。

我不清楚如何在源代码中包含“Blosum62 Matrix”来进行评分或填充二维矩阵?

我用谷歌搜索发现大多数人建议使用包含标准“Blosum62 Matrix”的平面文件。这是否意味着我需要从这个平面文件中读取并填充我编码的“Blosum62 Martrix”?

另外,另一种方法可能是使用一些数学公式并将其包含在您的编程逻辑中来构造“Blosum62 Matrix”。但不是非常确定这个选项。

谢谢

I am trying to implement protein pairwise sequence alignment using "Global Alignment" algorithm by 'Needleman -Wunsch'.

I am not clear about how to include 'Blosum62 Matrix' in my source code to do the scoring or to fill the two-dimensional matrix?

I have googled and found that most people suggested to use flat file which contains the standard 'Blosum62 Matrix'. Does it mean that I need to read from this flat file and fill my coded "Blosum62 Martrix' ?

Also, the other approach could be is to use some mathematical formula and include it in your programming logic to construct 'Blosum62 Matrix'. But not very sure about this option.

Any ideas or insights are appreciated.

Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

琴流音 2024-08-31 01:04:48

了解您使用的语言会很有帮助,这样我们就可以帮助您使用正确的术语,但我所做的是使用地图的地图(如果您使用 Python,则使用字典)。

这是我在 Groovy 中的代码示例,但它对于其他语言来说相当可移植:

def blosum62 = [
Cys:[Cys:9, Ser:-1, Thr:-1, Pro:-3, Ala:0,  Gly:-3, Asn:-3, Asp:-3, Glu:-4, Gln:-3, His:-3, Arg:-3, Lys:-3, Met:-1, Ile:-1, Leu:-1, Val:-1, Phe:-2, Tyr:-2, Trp:-2],
Ser:[Cys:-1,Ser:4,  Thr:1,  Pro:-1, Ala:1,  Gly:0,  Asn:1,  Asp:0,  Glu:0,  Gln:0,  His:-1, Arg:-1, Lys:0,  Met:-1, Ile:-2, Leu:-2, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Thr:[Cys:-1,Ser:1,  Thr:4,  Pro:1,  Ala:-1, Gly:1,  Asn:0,  Asp:1,  Glu:0,  Gln:0,  His:0,  Arg:-1, Lys:0,  Met:-1, Ile:-2, Leu:-2, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Pro:[Cys:-3,Ser:-1, Thr:1,  Pro:7,  Ala:-1, Gly:-2, Asn:-1, Asp:-1, Glu:-1, Gln:-1, His:-2, Arg:-2, Lys:-1, Met:-2, Ile:-3, Leu:-3, Val:-2, Phe:-4, Tyr:-3, Trp:-4],
Ala:[Cys:0, Ser:1,  Thr:-1, Pro:-1, Ala:4,  Gly:0,  Asn:-1, Asp:-2, Glu:-1, Gln:-1, His:-2, Arg:-1, Lys:-1, Met:-1, Ile:-1, Leu:-1, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Gly:[Cys:-3,Ser:0,  Thr:1,  Pro:-2, Ala:0,  Gly:6,  Asn:-2, Asp:-1, Glu:-2, Gln:-2, His:-2, Arg:-2, Lys:-2, Met:-3, Ile:-4, Leu:-4, Val:0,  Phe:-3, Tyr:-3, Trp:-2],
Asn:[Cys:-3,Ser:1,  Thr:0,  Pro:-2, Ala:-2, Gly:0,  Asn:6,  Asp:1,  Glu:0,  Gln:0,  His:-1, Arg:0,  Lys:0,  Met:-2, Ile:-3, Leu:-3, Val:-3, Phe:-3, Tyr:-2, Trp:-4],
Asp:[Cys:-3,Ser:0,  Thr:1,  Pro:-1, Ala:-2, Gly:-1, Asn:1,  Asp:6,  Glu:2,  Gln:0,  His:-1, Arg:-2, Lys:-1, Met:-3, Ile:-3, Leu:-4, Val:-3, Phe:-3, Tyr:-3, Trp:-4],
Glu:[Cys:-4,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:2,  Glu:5,  Gln:2,  His:0,  Arg:0,  Lys:1,  Met:-2, Ile:-3, Leu:-3, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Gln:[Cys:-3,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:0,  Glu:2,  Gln:5,  His:0,  Arg:1,  Lys:1,  Met:0,  Ile:-3, Leu:-2, Val:-2, Phe:-3, Tyr:-1, Trp:-2],
His:[Cys:-3,Ser:-1, Thr:0,  Pro:-2, Ala:-2, Gly:-2, Asn:1,  Asp:1,  Glu:0,  Gln:0,  His:8,  Arg:0,  Lys:-1, Met:-2, Ile:-3, Leu:-3, Val:-2, Phe:-1, Tyr:2,  Trp:-2],
Arg:[Cys:-3,Ser:-1, Thr:-1, Pro:-2, Ala:-1, Gly:-2, Asn:0,  Asp:-2, Glu:0,  Gln:1,  His:0,  Arg:5,  Lys:2,  Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Lys:[Cys:-3,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:-1, Glu:1,  Gln:1,  His:-1, Arg:2,  Lys:5,  Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Met:[Cys:-1,Ser:-1, Thr:-1, Pro:-2, Ala:-1, Gly:-3, Asn:-2, Asp:-3, Glu:-2, Gln:0,  His:-2, Arg:-1, Lys:-1, Met:5,  Ile:1,  Leu:2,  Val:-2, Phe:0,  Tyr:-1, Trp:-1],
Ile:[Cys:-1,Ser:-2, Thr:-2, Pro:-3, Ala:-1, Gly:-4, Asn:-3, Asp:-3, Glu:-3, Gln:-3, His:-3, Arg:-3, Lys:-3, Met:1,  Ile:4,  Leu:2,  Val:1,  Phe:0,  Tyr:-1, Trp:-3],
Leu:[Cys:-1,Ser:-2, Thr:-2, Pro:-3, Ala:-1, Gly:-4, Asn:-3, Asp:-4, Glu:-3, Gln:-2, His:-3, Arg:-2, Lys:-2, Met:2,  Ile:2,  Leu:4,  Val:3,  Phe:0,  Tyr:-1, Trp:-2],
Val:[Cys:-1,Ser:-2, Thr:-2, Pro:-2, Ala:0,  Gly:-3, Asn:-3, Asp:-3, Glu:-2, Gln:-2, His:-3, Arg:-3, Lys:-2, Met:1,  Ile:3,  Leu:1,  Val:4,  Phe:-1, Tyr:-1, Trp:-3],
Phe:[Cys:-2,Ser:-2, Thr:-2, Pro:-4, Ala:-2, Gly:-3, Asn:-3, Asp:-3, Glu:-3, Gln:-3, His:-1, Arg:-3, Lys:-3, Met:0,  Ile:0,  Leu:0,  Val:-1, Phe:6,  Tyr:3,  Trp:1],
Tyr:[Cys:-2,Ser:-2, Thr:-2, Pro:-3, Ala:-2, Gly:-3, Asn:-2, Asp:-3, Glu:-2, Gln:-1, His:2,  Arg:-2, Lys:-2, Met:-1, Ile:-1, Leu:-1, Val:-1, Phe:3,  Tyr:7,  Trp:2],
Trp:[Cys:-2,Ser:-3, Thr:-3, Pro:-4, Ala:-3, Gly:-2, Asn:-4, Asp:-4, Glu:-3, Gln:-2, His:-2, Arg:-3, Lys:-3, Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:1,  Tyr:2,  Trp:11]
]

使用它,您只需调用

def score = blosum62[Arg][Cys]
println("Substituting Arg by Cys gives " + score)

It would help to know what language you're working in so we can help you with the correct terms, but what I did was use a map of maps (or dictionaries if you're using Python).

Here's an example of my code in Groovy, but it's fairly portable to other languages:

def blosum62 = [
Cys:[Cys:9, Ser:-1, Thr:-1, Pro:-3, Ala:0,  Gly:-3, Asn:-3, Asp:-3, Glu:-4, Gln:-3, His:-3, Arg:-3, Lys:-3, Met:-1, Ile:-1, Leu:-1, Val:-1, Phe:-2, Tyr:-2, Trp:-2],
Ser:[Cys:-1,Ser:4,  Thr:1,  Pro:-1, Ala:1,  Gly:0,  Asn:1,  Asp:0,  Glu:0,  Gln:0,  His:-1, Arg:-1, Lys:0,  Met:-1, Ile:-2, Leu:-2, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Thr:[Cys:-1,Ser:1,  Thr:4,  Pro:1,  Ala:-1, Gly:1,  Asn:0,  Asp:1,  Glu:0,  Gln:0,  His:0,  Arg:-1, Lys:0,  Met:-1, Ile:-2, Leu:-2, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Pro:[Cys:-3,Ser:-1, Thr:1,  Pro:7,  Ala:-1, Gly:-2, Asn:-1, Asp:-1, Glu:-1, Gln:-1, His:-2, Arg:-2, Lys:-1, Met:-2, Ile:-3, Leu:-3, Val:-2, Phe:-4, Tyr:-3, Trp:-4],
Ala:[Cys:0, Ser:1,  Thr:-1, Pro:-1, Ala:4,  Gly:0,  Asn:-1, Asp:-2, Glu:-1, Gln:-1, His:-2, Arg:-1, Lys:-1, Met:-1, Ile:-1, Leu:-1, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Gly:[Cys:-3,Ser:0,  Thr:1,  Pro:-2, Ala:0,  Gly:6,  Asn:-2, Asp:-1, Glu:-2, Gln:-2, His:-2, Arg:-2, Lys:-2, Met:-3, Ile:-4, Leu:-4, Val:0,  Phe:-3, Tyr:-3, Trp:-2],
Asn:[Cys:-3,Ser:1,  Thr:0,  Pro:-2, Ala:-2, Gly:0,  Asn:6,  Asp:1,  Glu:0,  Gln:0,  His:-1, Arg:0,  Lys:0,  Met:-2, Ile:-3, Leu:-3, Val:-3, Phe:-3, Tyr:-2, Trp:-4],
Asp:[Cys:-3,Ser:0,  Thr:1,  Pro:-1, Ala:-2, Gly:-1, Asn:1,  Asp:6,  Glu:2,  Gln:0,  His:-1, Arg:-2, Lys:-1, Met:-3, Ile:-3, Leu:-4, Val:-3, Phe:-3, Tyr:-3, Trp:-4],
Glu:[Cys:-4,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:2,  Glu:5,  Gln:2,  His:0,  Arg:0,  Lys:1,  Met:-2, Ile:-3, Leu:-3, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Gln:[Cys:-3,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:0,  Glu:2,  Gln:5,  His:0,  Arg:1,  Lys:1,  Met:0,  Ile:-3, Leu:-2, Val:-2, Phe:-3, Tyr:-1, Trp:-2],
His:[Cys:-3,Ser:-1, Thr:0,  Pro:-2, Ala:-2, Gly:-2, Asn:1,  Asp:1,  Glu:0,  Gln:0,  His:8,  Arg:0,  Lys:-1, Met:-2, Ile:-3, Leu:-3, Val:-2, Phe:-1, Tyr:2,  Trp:-2],
Arg:[Cys:-3,Ser:-1, Thr:-1, Pro:-2, Ala:-1, Gly:-2, Asn:0,  Asp:-2, Glu:0,  Gln:1,  His:0,  Arg:5,  Lys:2,  Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Lys:[Cys:-3,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:-1, Glu:1,  Gln:1,  His:-1, Arg:2,  Lys:5,  Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Met:[Cys:-1,Ser:-1, Thr:-1, Pro:-2, Ala:-1, Gly:-3, Asn:-2, Asp:-3, Glu:-2, Gln:0,  His:-2, Arg:-1, Lys:-1, Met:5,  Ile:1,  Leu:2,  Val:-2, Phe:0,  Tyr:-1, Trp:-1],
Ile:[Cys:-1,Ser:-2, Thr:-2, Pro:-3, Ala:-1, Gly:-4, Asn:-3, Asp:-3, Glu:-3, Gln:-3, His:-3, Arg:-3, Lys:-3, Met:1,  Ile:4,  Leu:2,  Val:1,  Phe:0,  Tyr:-1, Trp:-3],
Leu:[Cys:-1,Ser:-2, Thr:-2, Pro:-3, Ala:-1, Gly:-4, Asn:-3, Asp:-4, Glu:-3, Gln:-2, His:-3, Arg:-2, Lys:-2, Met:2,  Ile:2,  Leu:4,  Val:3,  Phe:0,  Tyr:-1, Trp:-2],
Val:[Cys:-1,Ser:-2, Thr:-2, Pro:-2, Ala:0,  Gly:-3, Asn:-3, Asp:-3, Glu:-2, Gln:-2, His:-3, Arg:-3, Lys:-2, Met:1,  Ile:3,  Leu:1,  Val:4,  Phe:-1, Tyr:-1, Trp:-3],
Phe:[Cys:-2,Ser:-2, Thr:-2, Pro:-4, Ala:-2, Gly:-3, Asn:-3, Asp:-3, Glu:-3, Gln:-3, His:-1, Arg:-3, Lys:-3, Met:0,  Ile:0,  Leu:0,  Val:-1, Phe:6,  Tyr:3,  Trp:1],
Tyr:[Cys:-2,Ser:-2, Thr:-2, Pro:-3, Ala:-2, Gly:-3, Asn:-2, Asp:-3, Glu:-2, Gln:-1, His:2,  Arg:-2, Lys:-2, Met:-1, Ile:-1, Leu:-1, Val:-1, Phe:3,  Tyr:7,  Trp:2],
Trp:[Cys:-2,Ser:-3, Thr:-3, Pro:-4, Ala:-3, Gly:-2, Asn:-4, Asp:-4, Glu:-3, Gln:-2, His:-2, Arg:-3, Lys:-3, Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:1,  Tyr:2,  Trp:11]
]

Using this you can just call

def score = blosum62[Arg][Cys]
println("Substituting Arg by Cys gives " + score)
微暖i 2024-08-31 01:04:48

您可以随时从 NCBI 网站下载该矩阵:

ftp://ftp.ncbi.nih .gov/blast/matrices/BLOSUM62

其他矩阵也可以从父目录中获得。

我从未见过使用矩阵计算实现 Needleman-Wunsch 。将矩阵包含在代码中或作为单独的文件要容易得多。

您可以在此处找到一些如何计算 BLOSUM 矩阵的详细信息,例如:http://en.wikipedia.org/维基/BLOSUM

You can always download the matrix from NCBI web site:

ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62

Other matrices are also available from the parent directory.

I never saw implementation of Needleman-Wunsch with matrix calculation. It's much easier just to include the matrix in the code or as a separate file.

You can find some details how BLOSUM matrices are calculated for example here: http://en.wikipedia.org/wiki/BLOSUM.

寄居者 2024-08-31 01:04:48

您无法像 PAM 矩阵那样从另一个 blosum 矩阵中推断出 blosum 矩阵:所有 blosum 都是根据不同的数据集计算出来的,并且它们之间不相关。

例如,PAM250 矩阵只是 PAM1 矩阵与其自身相乘 250 倍;但这对于 BLOSUM 来说并非如此,例如,您无法从 BLOSUM64 推断出 BLOSUM80。

You can't infer a blosum matrix from another as you can do for PAM ones: all the blosum are calculated from a different set of data and are not correlated within theirselves.

For example, a PAM250 Matrix is just a PAM1 matrix multiplied 250 times by itself; but this is not true for BLOSUMs, and you can't infer BLOSUM80 from BLOSUM64, for example.

眸中客 2024-08-31 01:04:48

是的,您可以将 blosum 矩阵实现为硬连线代码段,您可能会因此获得一些速度。但你肯定会失去灵活性。我建议编写一个 NCBI 格式的阅读器,例如返回 SubstitutionMatrix 数据类型。然后你可以将这样的矩阵作为对象传递。

SubstitutionMatrix 对象可以保存一个 2D 矩阵和负责解码氨基酸名称的“东西”,例如散列数组。根据您选择的语言,您还可以使用枚举来表示氨基酸类型。在这种情况下,您可以直接使用它们来寻址二维数组。

希望这是清楚的,如果您喜欢/需要,我可以写更多细节。

Yes, you can implement a blosum matrix as hardwired piece of code, you might gain some speed with this. But definitely you loose flexibility. I would recommend writing a reader for NCBI format, e.g returning SubstitutionMatrix data type. Then you can pass around such a matrix as an object.

SubstitutionMatrix object may hold a 2D matrix and "something" responsible for decoding amino acid names, e.g. a hashing array. Depending on the language you choose, you may also use enums to represent amino acid types. In such a case you can use them directly to address the 2D array.

Hopefully this is clear, I can write more details if you like/need.

初与友歌 2024-08-31 01:04:48

以下是从链接 ftp://ftp.ncbi.nih 解析 blosum62 文件的示例。 Java 中的 gov/blast/matrices/BLOSUM62

创建类解析:

private String[] rowData;
private final int MARTRIX_SIZE = 24;
private String[][] matrix = new String[MARTRIX_SIZE][MARTRIX_SIZE];
private HashMap<String, Integer> index = new HashMap<>();
private String filePath = "blosum62.txt";

public void blosumMartix() {
    File fileObj = new File(filePath);
    Scanner input;
    boolean readLine = true;
    int rowCounter = 0;

    try {
        input = new Scanner(fileObj);
        while (input.hasNext()) {
            String lineReader = input.nextLine().replaceAll("  ", " ");
            if (!lineReader.substring(0, 1).equals("#")) {
                if (readLine) {
                    readLine = false;
                    continue;
                }
                rowData = lineReader.split(" ");
                for (int i = 1; i < rowData.length; i++) {
                    matrix[rowCounter][i - 1] = rowData[i];
                }
                index.put(rowData[0], rowCounter);
                rowCounter++;
            }
        }

    } catch (FileNotFoundException ex) {
        ex.printStackTrace();
    }
}

现在你想要计算示例 A 和 * 的成本,它返回 -4,你应该为此编写方法:

public int getDistance(String strS1, String strS2) {
    try {
        return getDistance(matrixIndex.get(strS1), matrixIndex.get(strS2));
    } catch (Exception ex) {
        System.out.println("Key out of range, check your string input");
        System.exit(0);
    }
    return 0;
}

private int getDistance(int charS1, int charS2) {
    if (charS1 < 0 || charS1 > matrix[0].length) {
        System.out.println("Gap out of range");
        System.exit(1);
    }
    if (charS2 < 0 || charS2 > matrix[0].length) {
        System.out.println("Gap out of range");
        System.exit(2);
    }

    return Integer.parseInt(matrix[charS1][charS2]);
}

最后在主方法中

Parsing parsing = new Parsing();
parsing.blosumMartix();
String result = parsing.getDistance("A", "*");
System.out.print(result);

这将打印 - 4.

Here is example of parsing the blosum62 file from this link ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 in Java.

Create class Parsing:

private String[] rowData;
private final int MARTRIX_SIZE = 24;
private String[][] matrix = new String[MARTRIX_SIZE][MARTRIX_SIZE];
private HashMap<String, Integer> index = new HashMap<>();
private String filePath = "blosum62.txt";

public void blosumMartix() {
    File fileObj = new File(filePath);
    Scanner input;
    boolean readLine = true;
    int rowCounter = 0;

    try {
        input = new Scanner(fileObj);
        while (input.hasNext()) {
            String lineReader = input.nextLine().replaceAll("  ", " ");
            if (!lineReader.substring(0, 1).equals("#")) {
                if (readLine) {
                    readLine = false;
                    continue;
                }
                rowData = lineReader.split(" ");
                for (int i = 1; i < rowData.length; i++) {
                    matrix[rowCounter][i - 1] = rowData[i];
                }
                index.put(rowData[0], rowCounter);
                rowCounter++;
            }
        }

    } catch (FileNotFoundException ex) {
        ex.printStackTrace();
    }
}

Now you want to calculate the the cost of example A and * which returns -4 you should write method for this:

public int getDistance(String strS1, String strS2) {
    try {
        return getDistance(matrixIndex.get(strS1), matrixIndex.get(strS2));
    } catch (Exception ex) {
        System.out.println("Key out of range, check your string input");
        System.exit(0);
    }
    return 0;
}

private int getDistance(int charS1, int charS2) {
    if (charS1 < 0 || charS1 > matrix[0].length) {
        System.out.println("Gap out of range");
        System.exit(1);
    }
    if (charS2 < 0 || charS2 > matrix[0].length) {
        System.out.println("Gap out of range");
        System.exit(2);
    }

    return Integer.parseInt(matrix[charS1][charS2]);
}

Finally in main method

Parsing parsing = new Parsing();
parsing.blosumMartix();
String result = parsing.getDistance("A", "*");
System.out.print(result);

This will print -4.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文