I have an algorithm to one-hot encode MinHashed genomes, and I am seeking opinions on whether I have constructed it correctly given the nature of MinHashing. A collaborator and I disagree on this, and we are trying to find the correct approach.
I have used Mash (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) to MinHash a database of raw genetic sequence reads (FASTQ files) for 1,000 samples. In summary, each sample yields a sketch of 2,000 hashes (hash values), where each hash encodes one 21-mer (a k-mer with k = 21) over the nucleotide alphabet {A, T, C, G}.
I one-hot encode these sketches by comparing the hashes in each new sketch to the hashes in a database built from previously processed samples. If a hash from the new sketch is already in the database, the new sample gets a 1 in that hash's column. If the hash is not in the database, we add a column for it, with a 1 for the current sample and a 0 for all previous samples. I believe this produces an accurate one-hot encoding.
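To make the procedure above concrete, here is a minimal sketch of the encoding in Python. It assumes each sketch is already available as a list of integer hash values; the function name `encode_sketches` and the toy values are hypothetical and not part of Mash.

```python
def encode_sketches(sketches):
    """Build a presence/absence matrix from an iterable of sketches.

    Each sketch is an iterable of integer hash values (assumed input format).
    Returns (matrix, column_of), where column_of maps hash value -> column index.
    """
    column_of = {}  # hash value -> column index, grows as new hashes appear
    rows = []       # one set of column indices per sample

    for sketch in sketches:
        cols = set()
        for h in sketch:
            if h not in column_of:
                # Unseen hash: add a new column to the "database".
                column_of[h] = len(column_of)
            cols.add(column_of[h])
        rows.append(cols)

    # Expand to a dense 0/1 matrix; earlier samples get 0 in later-added columns.
    n_cols = len(column_of)
    matrix = [[1 if c in cols else 0 for c in range(n_cols)] for cols in rows]
    return matrix, column_of


# Toy example: two samples, each with three hash values, in different orders.
sketches = [[3, 8, 15], [15, 3, 42]]
matrix, column_of = encode_sketches(sketches)
print(matrix)  # [[1, 1, 1, 0], [1, 0, 1, 1]]
```

Note that the position of a hash inside its sketch never enters this encoding; only membership does.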
My collaborator believes the order of the hashes within a sketch matters. If this is true, then comparison against the database of previous hashes is only valid when the hash in the new sample has the same index in the 2,000-length vector as the previous hash it is being compared to.
My understanding of MinHashing is that, assuming no hash collisions, each hash represents a unique k-mer. Sorting the sketch in ascending order of hash value is largely for randomization, so it is not important to compare hashes at the same index; what matters is whether any of the hashes in one sketch are present in the others.
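The two views can be contrasted on toy data. The sketch below assumes that each sketch is a "bottom-s" set, meaning the s smallest hash values produced by a single hash function, which is my reading of the Mash paper. All function names and values here are hypothetical.

```python
def positional_matches(a, b):
    """Count positions where two equal-length sketches hold the same hash."""
    return sum(x == y for x, y in zip(a, b))


def shared_hashes(a, b):
    """Count hashes present in both sketches, ignoring position."""
    return len(set(a) & set(b))


def bottom_s_jaccard(a, b, s):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the s smallest hashes of the union, then count how many of
    those appear in both sketches.
    """
    set_a, set_b = set(a), set(b)
    merged = sorted(set_a | set_b)[:s]
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


# Two sorted sketches of size 4 that share three hashes, shifted by one slot.
a = [2, 5, 9, 14]
b = [5, 9, 14, 20]

print(positional_matches(a, b))   # 0
print(shared_hashes(a, b))        # 3
print(bottom_s_jaccard(a, b, 4))  # 0.75
```

In this toy case a single extra small hash in one sample shifts every later index, so the index-by-index comparison reports no matches even though three of the four hashes are shared.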
This feels quite niche and is difficult to explain in writing, so please let me know if any clarification is needed. Thanks!