One-hot encoding with ambiguity (nucleotide sequences)
Nucleotide sequences (or DNA sequences) generally comprise 4 bases, ATGC, which makes for a very nice, easy and efficient way of encoding them for machine learning purposes.
sequence = "AAATGCC"
ohe_sequence = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]]
But when you take into account RNA sequences and mistakes that can sometimes occur in sequencing machines, the letters UYRWSKMDVHBXN are added. When one-hot encoding this, you end up with a matrix of 17 rows, of which the last 13 rows are generally all 0's.
This is very inefficient and does not convey the biological meaning that these extra (ambiguous) letters have.
For example:
- T and U are interchangeable
- Y means the base is either C or T
- N and X mean the base can be any of the 4 bases (ATGC)
And so I have made a dictionary that represents this biological meaning:
nucleotide_dict = {'A': 'A', 'T':'T', 'U':'T', 'G':'G', 'C':'C', 'Y':['C', 'T'], 'R':['A', 'G'], 'W':['A', 'T'], 'S':['G', 'C'], 'K':['T', 'G'], 'M':['C', 'A'], 'D':['A', 'T', 'G'], 'V':['A', 'G', 'C'], 'H':['A', 'T', 'C'], 'B':['T', 'G', 'C'], 'X':['A', 'T', 'G', 'C'], 'N':['A', 'T', 'G', 'C']}
But I can't seem to figure out how to write an efficient one-hot-encoding script (or whether there is a way to do this with the scikit-learn module) that uses this dictionary to get a result like this:
sequence = "ANTUYCC"
ohe_sequence = [[1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]]
# or even better:
ohe_sequence = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0.5, 0, 0.5], [0, 0, 0, 1], [0, 0, 0, 1]]
Comments (3)
This was fun! I think you can do this using a dictionary with the appropriate values. I added the scikit-learn class because you mentioned you were using that. See the transform method below:
Output:
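The code and output for this answer are not shown above. As a rough sketch only (not the answerer's original code), a scikit-learn style transformer built on the question's dictionary might look like the following. The class name AmbiguousNucleotideEncoder, the weighted flag and the ATGC column order are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AmbiguousNucleotideEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: encodes each letter as a 4-element row."""

    BASES = "ATGC"  # assumed column order, taken from the question's examples
    # Same meaning as the question's nucleotide_dict, written as strings.
    CODES = {
        "A": "A", "T": "T", "U": "T", "G": "G", "C": "C",
        "Y": "CT", "R": "AG", "W": "AT", "S": "GC", "K": "TG", "M": "CA",
        "D": "ATG", "V": "AGC", "H": "ATC", "B": "TGC",
        "X": "ATGC", "N": "ATGC",
    }

    def __init__(self, weighted=True):
        # weighted=True gives rows like [0, 0.5, 0, 0.5]; False gives [0, 1, 0, 1]
        self.weighted = weighted

    def fit(self, X, y=None):
        # Nothing to learn: the mapping is fixed by the dictionary.
        return self

    def transform(self, X):
        # Build the letter -> row table once per call.
        table = {}
        for letter, bases in self.CODES.items():
            row = np.array([float(b in bases) for b in self.BASES])
            table[letter] = row / row.sum() if self.weighted else row
        # One (len(sequence), 4) array per input sequence.
        return [np.array([table[ch] for ch in seq.upper()]) for seq in X]


sequences = ["ANTUYCC"]
encoder = AmbiguousNucleotideEncoder(weighted=False)
print(encoder.fit(sequences).transform(sequences)[0])
```

With weighted=False this should reproduce the first ohe_sequence from the question; with weighted=True each ambiguous row is divided by the number of possible bases. A letter outside the dictionary raises a KeyError in this sketch.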
I like the latter approach, as it more closely corresponds to the real meaning: e.g. Y does not mean C and T, but one of C or T. If no further information is available, assuming equal probabilities (i.e. weights) seems reasonable. Of course this non-standard encoding needs to be reflected by the choice of loss function down the road.
To answer your question: you could precompute the mapping from the letters to the encoding and then create an encode function that takes a sequence as input and returns the encoded sequence as a (len(sequence), 4)-shaped np.array as follows:
This seems to produce the desired result and is probably somewhat efficient.
An example:
prints the following:
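The answer's own code and printed output are missing from this page. A minimal sketch of the idea it describes (precompute one row per letter, then look rows up in an encode function) could look like this. The names BASES, NUCLEOTIDE_DICT, build_lookup and LOOKUP are assumptions for illustration, and the ATGC column order is taken from the question's examples.

```python
import numpy as np

BASES = "ATGC"  # assumed column order

# Same content as the question's nucleotide_dict, written as strings of bases.
NUCLEOTIDE_DICT = {
    "A": "A", "T": "T", "U": "T", "G": "G", "C": "C",
    "Y": "CT", "R": "AG", "W": "AT", "S": "GC", "K": "TG", "M": "CA",
    "D": "ATG", "V": "AGC", "H": "ATC", "B": "TGC",
    "X": "ATGC", "N": "ATGC",
}


def build_lookup():
    """Precompute one 4-element row of equal weights per letter."""
    lookup = {}
    for letter, bases in NUCLEOTIDE_DICT.items():
        row = np.zeros(len(BASES))
        for base in bases:
            row[BASES.index(base)] = 1.0
        lookup[letter] = row / row.sum()  # equal probability per possible base
    return lookup


LOOKUP = build_lookup()


def encode(sequence):
    """Return the sequence as a (len(sequence), 4)-shaped np.array."""
    return np.array([LOOKUP[letter] for letter in sequence.upper()])


print(encode("ANTUYCC"))
```

For "ANTUYCC" the rows should match the weighted ohe_sequence from the question, e.g. N becomes [0.25, 0.25, 0.25, 0.25] and Y becomes [0, 0.5, 0, 0.5]. Because the table is built once, encoding a sequence is just one dictionary lookup per letter.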
Another version that allows for different length sequences:
Yields:
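The code for this version is also missing here. One possible way to handle sequences of different lengths, shown purely as a sketch, is to pad shorter sequences with all-zero rows up to the longest one. The function name encode_batch, the zero padding and the ATGC column order are assumptions, not necessarily what the answerer did.

```python
import numpy as np

BASES = "ATGC"  # assumed column order

# Same content as the question's nucleotide_dict, written as strings of bases.
CODES = {
    "A": "A", "T": "T", "U": "T", "G": "G", "C": "C",
    "Y": "CT", "R": "AG", "W": "AT", "S": "GC", "K": "TG", "M": "CA",
    "D": "ATG", "V": "AGC", "H": "ATC", "B": "TGC",
    "X": "ATGC", "N": "ATGC",
}

# Letter -> row of equal weights over the bases the letter may stand for.
LOOKUP = {
    letter: np.array([(b in bases) / len(bases) for b in BASES])
    for letter, bases in CODES.items()
}


def encode_batch(sequences, pad_value=0.0):
    """Encode sequences into one (n_sequences, max_len, 4) array.

    Positions past the end of a shorter sequence are filled with pad_value.
    """
    max_len = max(len(seq) for seq in sequences)
    out = np.full((len(sequences), max_len, len(BASES)), pad_value)
    for i, seq in enumerate(sequences):
        for j, letter in enumerate(seq.upper()):
            out[i, j] = LOOKUP[letter]
    return out


batch = encode_batch(["ANTUYCC", "AT"])
print(batch.shape)  # (2, 7, 4)
print(batch[1])
```

Padding with zeros keeps the batch as a single rectangular array; a downstream model would then need a mask (or the original lengths) to ignore the padded positions.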