Problem with the split command
I have a problem using the split command.
The input string is as follows:
080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = 42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36
I want to grab '2027' from this string.
My command is: line.split(':',4)[1].split()[0]
However, it doesn't work. The output is '1'.
Then I switched to line.split(':',4)
and the output is still '1', so I can see that the first split is already going wrong.
However, when I try line.split(':',1), I get the expected result:
1:8:1649:2027 83 chr10 42038185 255 36M = 42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36
I'm confused by this split command! (I asked a similar question before, and the split command worked that time.)
Thanks
Comments (7)
It appears that what you want is
The numeric parameter to split indicates the maximum number of splits that will occur. So you have:
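The code that originally followed this sentence is missing here. As a minimal sketch of what the call returns, assuming the question's input is stored in a variable named line:

```python
# The question's input string, split across literals only for readability.
line = (
    "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = "
    "42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC "
    "@?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36"
)

# At most 4 splits means at most 5 pieces; everything after the
# fourth ':' stays together in the last piece.
parts = line.split(':', 4)
for index, part in enumerate(parts):
    print(index, repr(part))

# Index 1 is simply the second piece of that list.
print(parts[1])
```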
If you pull element [1] out of this return value, you get '1'. I don't see why you are surprised by this.
Since you are allowing up to 4 splits, and the item you want will be the last one, the subscript you want is [4]:
Then you can split that on space and get element [0] from it to produce your result.
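The two steps just described can be sketched as follows. The variable names last_piece and value are only illustrative, not part of the original answer:

```python
line = (
    "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = "
    "42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC "
    "@?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36"
)

# Step 1: take the last of the (at most) five pieces.
last_piece = line.split(':', 4)[4]

# Step 2: split that piece on whitespace and keep the first token.
value = last_piece.split()[0]
print(value)
```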
You get the same result if you don't pass a split limit value at all:
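A sketch of the version without a limit, again assuming the input is in line. The list is longer because the colons inside the trailing NM:i:0 and MD:Z:36 tags are split as well, but index 4 still begins with the wanted number:

```python
line = (
    "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = "
    "42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC "
    "@?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36"
)

# No maxsplit: every ':' in the string is a split point.
pieces = line.split(':')
value_no_limit = pieces[4].split()[0]
print(len(pieces), value_no_limit)
```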
Try this:
I'm not sure why you're trying to access the token containing 2027 like this:
rather than this:
I think that you might be confused about how split works. The last parameter to the Python split function is the maximum number of splits to perform.
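The snippets from this answer did not survive, so the following is only an illustration of the point about the maximum number of splits, not the answerer's own code. It shows how the limit changes the number of pieces, and why a limit of 1 leaves '2027' buried inside the second piece:

```python
line = (
    "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = "
    "42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC "
    "@?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36"
)

# maxsplit = n gives at most n + 1 pieces.
counts = {n: len(line.split(':', n)) for n in (1, 2, 4)}
print(counts)

# With a limit of 1, the second piece is the whole remainder of the line.
remainder = line.split(':', 1)[1]
print(remainder[:20])
```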
The second argument to split is the maximum number of splits to perform, so you probably don't want to be using it in this case. To access the 5th element after performing the split, do this:
Anyway, what you probably want is to first split by whitespace (you can do this by passing no arguments), and then split by colons. This can be done on one line like this:
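The one-liner itself is missing from this copy of the answer. A plausible reconstruction, assuming the input is in line and the wanted value is the fifth colon-separated field of the first whitespace-separated token:

```python
line = (
    "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = "
    "42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC "
    "@?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36"
)

# split() with no arguments splits on runs of whitespace;
# the first token is then split on ':' and indexed.
value = line.split()[0].split(':')[4]
print(value)
```

Splitting on whitespace first keeps the colons in the trailing tags out of the picture entirely.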
You can use this instead:
Split on the white space first. Then split the first element in the resultant list based on the separator (here: ':').
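Spelled out step by step, this might look like the sketch below. The names fields, read_name and name_parts are made up for illustration, and taking the last colon field with [-1] assumes the wanted number is always the final field of the first token:

```python
line = (
    "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = "
    "42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC "
    "@?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36"
)

fields = line.split()               # split on whitespace first
read_name = fields[0]               # first element of the resulting list
name_parts = read_name.split(':')   # then split that element on ':'
value = name_parts[-1]              # assumption: the number is the last field
print(name_parts, value)
```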
Do you have to use split? I ask because I've found regex to be a much better tool when I just need to grab a specific substring. It's not the easiest thing to learn and does appear very unapproachable at first, but you pay the price of learning it only once, and it is an investment worth making. :)
The Python homepage has a good introduction to it.
P.S. 2027 will be matched by the following regex: .*?:([0-9]+)\s+
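A small sketch of applying that regex with the standard re module, assuming the input is in line. The lazy .*? keeps extending until it reaches a colon that is followed by digits and then whitespace, which first happens at ':2027 ':

```python
import re

line = (
    "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = "
    "42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC "
    "@?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36"
)

match = re.search(r".*?:([0-9]+)\s+", line)
# re.search returns None when nothing matches, so guard before using it.
value = match.group(1) if match else None
print(value)
```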
I presume that you will be extracting information from strings many times in the future. If so, my advice is to learn to use regular expressions; sooner or later you will need them.
Otherwise you'll have to learn and use a specialized library for processing strings in the field of genomics.
Simple solution to your present problem with the module re:
If there are blanks before the fourth ':', the regex pattern will be:
(:[^:]+) matches a ':' followed by as many characters other than ':' as may follow.
{3} says that this match must be performed 3 times.
Then the fourth ':' must be encountered, followed by the searched number, matched by \d+. There is no need to indicate that there must be a blank after the number, because \d+ stops matching as soon as a non-digit character is encountered.
Parentheses define groups. Here the desired number is captured by the second group.
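The patterns themselves are missing from this copy of the answer. Assembling the pieces described above gives something like (:[^:]+){3}:(\d+), which is a reconstruction and not necessarily the answerer's exact pattern. A sketch of using it, assuming the input is in line:

```python
import re

line = (
    "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 83 chr10 42038185 255 36M = "
    "42037995 -225 GCCAGGTTTAATAAATTATTTATAGAATACTGCATC "
    "@?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A NM:i:0 MD:Z:36"
)

# Reconstructed pattern (an assumption): three ':field' groups,
# then the fourth ':' followed by the digits we want.
pattern = re.compile(r"(:[^:]+){3}:(\d+)")

match = pattern.search(line)
# Group 1 holds only the last repetition of the repeated group;
# group 2 holds the number.
value = match.group(2) if match else None
print(value)
```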