One-hot encoding with ambiguity (nucleotide sequences)

Posted on 2025-01-23 00:17:20

Nucleotide sequences (or DNA sequences) generally consist of 4 bases, A, T, G and C, which makes one-hot encoding a very nice, easy and efficient way of representing them for machine learning purposes.

sequence = "AAATGCC"  # the columns of the encoding are in A, T, G, C order
ohe_sequence = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]]

But when you take into account RNA sequences and the mistakes that sometimes occur in sequencing machines, the letters UYRWSKMDVHBXN are added. One-hot encoding this full alphabet gives a matrix with 17 rows (one per letter), of which the last 13 are generally all 0's.

This is very inefficient and does not convey the biological meaning that these extra (ambiguous) letters have.
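
To make the inefficiency concrete, here is a minimal sketch of the naive encoding. The alphabet ordering (the four bases first, then the extra letters) and the name `naive_one_hot` are assumptions for illustration, not part of the original question. The array is laid out with one row per position, so the 17 letters appear as 17 columns.

```python
import numpy as np

# Assumed alphabet order: the four bases first, then the extra letters.
ALPHABET = "ATGCUYRWSKMDVHBXN"
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}

def naive_one_hot(sequence):
    """One-hot encode over all 17 letters; returns shape (len(sequence), 17)."""
    encoded = np.zeros((len(sequence), len(ALPHABET)))
    encoded[np.arange(len(sequence)), [INDEX[s] for s in sequence]] = 1
    return encoded

ohe = naive_one_hot("ANTUYCC")
print(ohe.shape)  # (7, 17)
```

In this layout U, Y and N each get a column of their own, so nothing in the encoding says that U behaves like T or that Y is either C or T.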

For example:

T and U are interchangeable
Y means there is either a C or a T
N and X mean there can be any of the 4 bases (ATGC)

And so I have made a dictionary that represents this biological meaning:

nucleotide_dict = {'A': 'A', 'T':'T', 'U':'T', 'G':'G', 'C':'C', 'Y':['C', 'T'], 'R':['A', 'G'], 'W':['A', 'T'], 'S':['G', 'C'], 'K':['T', 'G'], 'M':['C', 'A'], 'D':['A', 'T', 'G'], 'V':['A', 'G', 'C'], 'H':['A', 'T', 'C'], 'B':['T', 'G', 'C'], 'X':['A', 'T', 'G', 'C'], 'N':['A', 'T', 'G', 'C']}

But I can't seem to figure out how to make an efficient one-hot encoding script (or whether there is a way to do this with the scikit-learn module) that uses this dictionary to get a result like this:

sequence = "ANTUYCC"
ohe_sequence = [[1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]]

# or even better:
ohe_sequence = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0.5, 0, 0.5], [0, 0, 0, 1], [0, 0, 0, 1]]

梦幻的心爱 2025-01-30 00:17:20

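This was fun! I think you can do this using a dictionary with the appropriate values. I added the scikit-learn class because you mentioned you were using that. Note that the columns here are in A, G, C, T order rather than the A, T, G, C order used in the question. See the transform below: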

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

nucleotide_dict = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
    "Y": [0, 0, 1, 1],
    "R": [1, 1, 0, 0],
    "W": [1, 0, 0, 1],
    "S": [0, 1, 1, 0],
    "K": [0, 1, 0, 1],
    "M": [1, 0, 1, 0],
    "D": [1, 1, 0, 1],
    "V": [1, 1, 1, 0],
    "H": [1, 0, 1, 1],
    "B": [0, 1, 1, 1],
    "X": [1, 1, 1, 1],
    "N": [1, 1, 1, 1],
    "-": [0, 0, 0, 0],
}


class NucleotideEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, norm=True):
        self.norm = norm

    def fit(self, X, y=None):
        return self

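    def transform(self, X, y=None):
        # X has shape (n_samples, 1); all sequences must have the same length.
        def encode(row):
            # row holds a single string; look up every letter in it.
            encoded = np.array([nucleotide_dict[base] for base in row[0]], dtype=float)
            if self.norm:
                # Divide each position by its number of candidate bases.
                # A gap ("-") sums to 0 and therefore produces NaN here.
                encoded = encoded / encoded.sum(axis=1, keepdims=True)
            return encoded

        return np.apply_along_axis(encode, 1, X)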


samples = np.array([["AAATGCC"], ["ANTUYCC"]])
print(NucleotideEncoder().fit_transform(samples))

Output:

[[[1.   0.   0.   0.  ]
  [1.   0.   0.   0.  ]
  [1.   0.   0.   0.  ]
  [0.   0.   0.   1.  ]
  [0.   1.   0.   0.  ]
  [0.   0.   1.   0.  ]
  [0.   0.   1.   0.  ]]

 [[1.   0.   0.   0.  ]
  [0.25 0.25 0.25 0.25]
  [0.   0.   0.   1.  ]
  [0.   0.   0.   1.  ]
  [0.   0.   0.5  0.5 ]
  [0.   0.   1.   0.  ]
  [0.   0.   1.   0.  ]]]
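
The transformer returns a 3-D array of shape (n_samples, sequence_length, 4), while scikit-learn estimators generally expect a 2-D feature matrix. One way to bridge that, sketched below under the assumption that a flat feature vector per sequence is acceptable for the model, is to merge the position and channel axes. The `encoded` array is a stand-in for the transformer output, not real data.

```python
import numpy as np

# Stand-in for the transformer output: 2 sequences, 7 positions, 4 channels.
encoded = np.zeros((2, 7, 4))
encoded[:, :, 0] = 1  # pretend every position is "A"

# Merge positions and channels into a single feature axis per sample.
flat = encoded.reshape(len(encoded), -1)
print(flat.shape)  # (2, 28)
```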

硪扪都還晓 2025-01-30 00:17:20

I like the latter approach, as it corresponds more closely to the real meaning: e.g. Y does not mean C and T, but one of C or T. If no further information is available, assuming equal probabilities (i.e. weights) seems reasonable. Of course, this non-standard encoding needs to be reflected in the choice of loss function down the road.
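
As an illustration of what "reflected in the loss function" could mean, here is a minimal sketch of a cross-entropy that accepts soft targets. It assumes the model outputs a probability distribution over the four bases for every position. The function name `soft_cross_entropy` and the example numbers are hypothetical, and this is only one possible choice of loss.

```python
import numpy as np

def soft_cross_entropy(targets, probs, eps=1e-12):
    """Mean cross-entropy between soft targets and predicted probabilities.

    Both arrays have shape (sequence_length, 4) and each row sums to 1.
    """
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

# Targets for "AY" in A, T, G, C order: A is certain, Y is C or T.
targets = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 0.5, 0.0, 0.5]])
# Hypothetical model output for the same two positions.
probs = np.array([[0.9, 0.05, 0.03, 0.02],
                  [0.1, 0.4, 0.1, 0.4]])
print(soft_cross_entropy(targets, probs))
```

With hard one-hot targets this reduces to the usual categorical cross-entropy, so unambiguous positions are treated exactly as before.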

To answer your question: you could precompute the mapping from the letters to the encoding and then create an encode function that takes a sequence as input and returns the encoded sequence as a (len(sequence), 4)-shaped np.array, as follows:

import numpy as np

nucleotide_dict = {'A':'A', 'T':'T', 'U':'T', 'G':'G', 'C':'C', 'Y':['C', 'T'], 'R':['A', 'G'], 'W':['A', 'T'], 'S':['G', 'C'], 'K':['T', 'G'], 'M':['C', 'A'], 'D':['A', 'T', 'G'], 'V':['A', 'G', 'C'], 'H':['A', 'T', 'C'], 'B':['T', 'G', 'C'], 'X':['A', 'T', 'G', 'C'], 'N':['A', 'T', 'G', 'C']}
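# Column order of the encoding: A, T, G, C.
index_mapper = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
# Precompute one length-4 vector per letter.
mapper_dict = dict()
for k, v in nucleotide_dict.items():
    encoding = np.zeros(4)
    # Split the probability evenly over the candidate bases.
    p = 1 / len(v)
    encoding[[index_mapper[i] for i in v]] = p
    mapper_dict[k] = encoding

def encode(sequence):
    # Look up each letter; the result has shape (len(sequence), 4).
    return np.array([mapper_dict[s] for s in sequence])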

This seems to produce the desired result and is probably somewhat efficient.
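
If the per-letter Python loop in `encode` ever becomes a bottleneck, one possible variation is to put the precomputed vectors into a lookup table indexed by ASCII code and encode a whole sequence with a single indexing operation. This is a sketch, not part of the original answer: the names `nucleotide_sets`, `lookup` and `encode_fast` are made up here, and it assumes upper-case ASCII input.

```python
import numpy as np

# Same letter-to-bases mapping as above, written as strings for brevity.
nucleotide_sets = {
    'A': 'A', 'T': 'T', 'U': 'T', 'G': 'G', 'C': 'C',
    'Y': 'CT', 'R': 'AG', 'W': 'AT', 'S': 'GC', 'K': 'TG', 'M': 'CA',
    'D': 'ATG', 'V': 'AGC', 'H': 'ATC', 'B': 'TGC', 'X': 'ATGC', 'N': 'ATGC',
}
base_index = {'A': 0, 'T': 1, 'G': 2, 'C': 3}

# Row ord(letter) of the table holds that letter's encoding; other rows stay 0.
lookup = np.zeros((256, 4))
for letter, bases in nucleotide_sets.items():
    lookup[ord(letter), [base_index[b] for b in bases]] = 1 / len(bases)

def encode_fast(sequence):
    """Encode an ASCII sequence with a single fancy-indexing operation."""
    codes = np.frombuffer(sequence.encode('ascii'), dtype=np.uint8)
    return lookup[codes]

print(encode_fast('AYSDX'))
```

One trade-off of this design is that unknown characters silently become all-zero rows instead of raising a KeyError, which may or may not be what you want.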

An example:

print(repr(encode('AYSDX')))

prints the following:

array([[1.        , 0.        , 0.        , 0.        ],
       [0.        , 0.5       , 0.        , 0.5       ],
       [0.        , 0.        , 0.5       , 0.5       ],
       [0.33333333, 0.33333333, 0.33333333, 0.        ],
       [0.25      , 0.25      , 0.25      , 0.25      ]])

爱的那么颓废 2025-01-30 00:17:20

Another version that allows for different length sequences:

import numpy as np

nucleotide_dict = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
    "Y": [0, 0, 1, 1],
    "R": [1, 1, 0, 0],
    "W": [1, 0, 0, 1],
    "S": [0, 1, 1, 0],
    "K": [0, 1, 0, 1],
    "M": [1, 0, 1, 0],
    "D": [1, 1, 0, 1],
    "V": [1, 1, 1, 0],
    "H": [1, 0, 1, 1],
    "B": [0, 1, 1, 1],
    "X": [1, 1, 1, 1],
    "N": [1, 1, 1, 1],
    "-": [0, 0, 0, 0],
}

norm = True
samples = np.array(["AAATGCC", "ANTUYCC", "".join(list(nucleotide_dict.keys()))[:-1]])


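def nucleotide_encode(samples, norm=True):
    # Encode each sequence on its own, so sequences may differ in length.
    def encode(seq):
        encoded = np.array([nucleotide_dict[base] for base in seq], dtype=float)
        if norm:
            totals = encoded.sum(axis=1, keepdims=True)
            # Gap positions ("-") sum to 0; keep them as all-zero rows.
            encoded = np.divide(encoded, totals, out=np.zeros_like(encoded), where=totals > 0)
        return encoded

    return [encode(seq) for seq in samples]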


for i, j in zip(samples, nucleotide_encode(samples, norm=norm)):
    print(i)
    print(j)

Yields:

AAATGCC
[[1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]
ANTUYCC
[[1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [0.   0.   0.5  0.5 ]
 [0.   0.   1.   0.  ]
 [0.   0.   1.   0.  ]]
AGCTUYRWSKMDVHBXN
[[1.         0.         0.         0.        ]
 [0.         1.         0.         0.        ]
 [0.         0.         1.         0.        ]
 [0.         0.         0.         1.        ]
 [0.         0.         0.         1.        ]
 [0.         0.         0.5        0.5       ]
 [0.5        0.5        0.         0.        ]
 [0.5        0.         0.         0.5       ]
 [0.         0.5        0.5        0.        ]
 [0.         0.5        0.         0.5       ]
 [0.5        0.         0.5        0.        ]
 [0.33333333 0.33333333 0.         0.33333333]
 [0.33333333 0.33333333 0.33333333 0.        ]
 [0.33333333 0.         0.33333333 0.33333333]
 [0.         0.33333333 0.33333333 0.33333333]
 [0.25       0.25       0.25       0.25      ]
 [0.25       0.25       0.25       0.25      ]]
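
Because the function returns a list of arrays with different shapes, a model that needs one rectangular batch will still require padding. A minimal sketch, assuming zero rows are an acceptable padding value (they match the encoding of the "-" gap above); the helper name `pad_encoded` and the example values are hypothetical:

```python
import numpy as np

def pad_encoded(encoded, max_len=None):
    """Stack (length_i, 4) arrays into one (n, max_len, 4) array, zero-padded."""
    if max_len is None:
        max_len = max(len(e) for e in encoded)
    batch = np.zeros((len(encoded), max_len, 4))
    for i, e in enumerate(encoded):
        batch[i, :len(e)] = e
    return batch

# Two already-encoded sequences of lengths 2 and 3 (values are illustrative).
encoded = [
    np.array([[1, 0, 0, 0], [0, 0, 0, 1]], dtype=float),
    np.array([[0.25, 0.25, 0.25, 0.25], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float),
]
batch = pad_encoded(encoded)
print(batch.shape)  # (2, 3, 4)
```

Passing a `max_len` shorter than the longest sequence raises an error in this sketch rather than truncating.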
