Processing chromosomal data in Ruby
Say I have a file of chromosomal data that I'm processing with Ruby:
#Base_ID Segment_ID Read_Depth
1                   100
2                   800
3        seg1       1900
4        seg1       2700
5                   1600
6                   2400
7                   200
8                   15000
9        seg2       300
10       seg2       400
11       seg2       900
12                  1000
13                  600
...
I'm sticking each row into a hash of arrays, with my keys taken from column 2, Segment_ID, and my values from column 3, Read_Depth, giving me
mr_hashy = {
"seg1" => [1900, 2700],
"" => [100, 800, 1600, 2400, 200, 15000, 1000, 600],
"seg2" => [300, 400, 900],
}
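For reference, a minimal sketch of how that grouping step might look. It assumes whitespace-separated rows in which a primer row simply has two fields instead of three; the names `SAMPLE` and `group_by_segment` are made up for illustration and are not from my actual script.

```ruby
# Sample rows from the question; primer rows carry no Segment_ID field.
SAMPLE = <<~ROWS
  #Base_ID Segment_ID Read_Depth
  1 100
  2 800
  3 seg1 1900
  4 seg1 2700
  5 1600
  6 2400
  7 200
  8 15000
  9 seg2 300
  10 seg2 400
  11 seg2 900
  12 1000
  13 600
ROWS

# Group Read_Depth values by Segment_ID.
# Assumption: a row with only two fields has an empty Segment_ID.
def group_by_segment(text)
  groups = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    next if line.start_with?("#") || line.strip.empty? # skip header and blank lines
    fields = line.split
    segment_id = fields.size == 3 ? fields[1] : ""
    groups[segment_id] << Integer(fields.last)
  end
  groups
end

mr_hashy = group_by_segment(SAMPLE)
```

All primer rows end up under the `""` key, which is the problem described next.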
A primer, which is a small segment consisting of two consecutive rows in the above data, precedes and follows each regular segment. Regular segments have a non-empty string value for Segment_ID and vary in length, while rows with an empty string in the second column are parts of primers. Primer segments always have the same length, 2. In the data above, Base_IDs 1, 2, 5, 6, 7, 8, 12 and 13 are parts of primers, so there are four primer segments in total.
What I'd like to do is, upon encountering a line with an empty string in column 2, Segment_ID, add its Read_Depth to the appropriate element in my hash, so that each segment also contains the primer before it and the primer after it. For instance, my desired result from the data above would look like
mr_hashy = {
"seg1" => [100, 800, 1900, 2700, 1600, 2400],
"seg2" => [200, 15000, 300, 400, 900, 1000, 600],
}
Comments (3)
Second-ish refactor. I think this is clean, elegant, and most of all complete. It's easy to read with no hardcoded field lengths or ugly RegEx. I vote mine as the best! Yay! I'm the best, yay! ;)
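A minimal sketch in that spirit (no field widths, no regex), not necessarily the exact listing. It assumes that a primer row splits into two whitespace-separated fields, that every primer is exactly `PRIMER_LENGTH` rows, and that each segment's rows are contiguous; `SAMPLE` and `attach_primers` are illustrative names.

```ruby
SAMPLE = <<~ROWS
  #Base_ID Segment_ID Read_Depth
  1 100
  2 800
  3 seg1 1900
  4 seg1 2700
  5 1600
  6 2400
  7 200
  8 15000
  9 seg2 300
  10 seg2 400
  11 seg2 900
  12 1000
  13 600
ROWS

PRIMER_LENGTH = 2 # assumption taken from the question: primers are always 2 rows

def attach_primers(text)
  segments = {}
  pending = []   # primer depths waiting for the segment that follows them
  current = nil  # depth array of the most recently seen regular segment
  trailing = 0   # primer rows already appended after `current`

  text.each_line do |line|
    next if line.start_with?("#") || line.strip.empty?
    fields = line.split
    depth = Integer(fields.last)

    if fields.size == 3
      segment_id = fields[1]
      unless segments.key?(segment_id)
        # New segment: the buffered primer rows are its leading primer.
        segments[segment_id] = pending
        pending = []
        trailing = 0
      end
      current = segments[segment_id]
      current << depth
    elsif current && trailing < PRIMER_LENGTH
      # First primer rows after a segment are its trailing primer.
      current << depth
      trailing += 1
    else
      # Otherwise the row belongs to the next segment's leading primer.
      pending << depth
    end
  end
  segments # primer rows never followed by a segment are dropped
end

mr_hashy = attach_primers(SAMPLE)
# mr_hashy == { "seg1" => [100, 800, 1900, 2700, 1600, 2400],
#               "seg2" => [200, 15000, 300, 400, 900, 1000, 600] }
```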
I used this as my data file:
Which outputs:
Here's some Ruby code (nice practice example :P). I'm assuming fixed-width columns, which appears to be the case with your input data. The code keeps track of which depth values are primer values until it finds 4 of them, after which it will know the segment id.
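Roughly, the idea can be sketched like this. The column widths (9 and 11 characters) are my guess from the header line, not something the question states, and `ROWS`, `SAMPLE_LINES` and `segments_with_primers` are just illustrative names.

```ruby
# Assumed fixed-width layout: Base_ID in columns 0..8, Segment_ID in 9..19,
# Read_Depth from column 20 onward.
ID_WIDTH = 9
SEGMENT_WIDTH = 11

ROWS = [
  [1, "", 100], [2, "", 800], [3, "seg1", 1900], [4, "seg1", 2700],
  [5, "", 1600], [6, "", 2400], [7, "", 200], [8, "", 15000],
  [9, "seg2", 300], [10, "seg2", 400], [11, "seg2", 900],
  [12, "", 1000], [13, "", 600]
].freeze

# Render the sample rows as fixed-width text lines.
SAMPLE_LINES = ROWS.map do |id, segment, depth|
  format("%-#{ID_WIDTH}s%-#{SEGMENT_WIDTH}s%s", id, segment, depth)
end

def segments_with_primers(lines)
  segments = {}
  primers = []      # primer depths seen since the last flush
  body = []         # depths of the regular segment currently being read
  segment_id = nil

  lines.each do |line|
    next if line.start_with?("#") || line.strip.empty?
    id_field = line[ID_WIDTH, SEGMENT_WIDTH].to_s.strip
    depth = Integer(line[(ID_WIDTH + SEGMENT_WIDTH)..-1].strip)

    if id_field.empty?
      primers << depth
      if primers.size == 4
        # Two primer rows before and two after: the segment is complete.
        segments[segment_id] = primers.first(2) + body + primers.last(2)
        primers = []
        body = []
      end
    else
      segment_id = id_field
      body << depth
    end
  end
  segments
end

mr_hashy = segments_with_primers(SAMPLE_LINES)
# mr_hashy == { "seg1" => [100, 800, 1900, 2700, 1600, 2400],
#               "seg2" => [200, 15000, 300, 400, 900, 1000, 600] }
```

A segment is only written to the hash once all four of its primer values have been seen, so a truncated file would silently lose its last segment.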
Output on input provided: