Problem with the split command

Posted 2024-11-28 22:04:26

I have a problem using the split command.
The input string is as follows:

080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10   42038185    255 36M =   42037995    -225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36

I want to grab '2027' from this string.
My command is: line.split(':',4)[1].split()[0]
However, it doesn't work. The output is '1'.

Then I switched to line.split(':',4)
The output is still '1', so I can see that the first split is already the problem.

However, when I try line.split(':',1), I get the expected result:

1:8:1649:2027   83  chr10   42038185    255 36M =   42037995-225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36

I'm confused by this split command! (I asked a similar question before, and the split command worked that time.)
Thanks

青衫负雪 2024-12-05 22:04:26

It appears that what you want is

line.split(':',4)[4].split()[0]

The numeric parameter to split indicates the maximum number of splits that will occur. So you have:

>>> line='080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 ...'
>>> line.split(':',4)
['080821_HWI-EAS301_0002_30ALBAAXX', '1', '8', '1649', '2027 ...']

If you pull element [1] out of this return value, you get '1'. I don't see why you are surprised by this.

Since you are allowing up to 4 splits, and the item you want will be the last one, the subscript you want is [4]:

>>> line.split(':',4)[4]
'2027 ...'

Then you can split that on whitespace and take element [0] from the result to get your answer.

You get the same result if you don't pass a split limit value at all:

>>> line.split(':')[4].split()[0]
'2027'
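
To make the effect of the split limit concrete, here is a minimal sketch. It assumes a shortened stand-in for the record (only the first three columns, separated by two spaces); the variable names are illustrative and not from the original post.

```python
# Shortened stand-in for the full record (an assumption for brevity).
line = "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10"

# maxsplit caps the number of cuts, so the list has at most maxsplit + 1 items
# and everything after the last cut stays together in the final item.
for maxsplit in (1, 2, 4):
    parts = line.split(":", maxsplit)
    print(maxsplit, len(parts), repr(parts[-1]))

# With four cuts, item [4] starts with the wanted field; a whitespace
# split then isolates it.
field = line.split(":", 4)[4].split()[0]
print(field)  # 2027
```

Index [1] is always the piece between the first and second colon, whatever limit is passed, which is why the original command returned '1'.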

悲念泪 2024-12-05 22:04:26

Try this:

#!/usr/bin/env python3

line = '080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10   42038185    255 36M =   42037995    -225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'

# Split on ':' everywhere, take the fifth piece, then keep its first whitespace-delimited token.
print(line.split(':')[4].split()[0])

I'm not sure why you're trying to access the token containing 2027 like this:

line.split(':',4)

rather than this:

line.split(':')[4]

I think that you might be confused about how split works. The last parameter to the Python split function is the maximum number of splits to perform.

故事和酒 2024-12-05 22:04:26

The second argument to split is the maximum number of splits to perform, so you probably don't want to use it in this case. To access the 5th element after performing the split, do this:

line.split(":")[4]

Anyway, what you probably want is to first split by whitespace (you can do this by using no arguments), and then split by colons. This can be done on one line like this:

line.split()[0].split(":")[4]
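
Splitting on whitespace first keeps the colons in later columns (such as NM:i:0) out of the picture. The sketch below wraps the idea in a small helper; the name read_name_field is hypothetical, and the sample record is shortened and assumed to be tab-separated (split() with no arguments handles any whitespace either way).

```python
def read_name_field(line, index):
    """Return one colon-separated field from the first whitespace-delimited column."""
    first_column = line.split()[0]          # '080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027'
    return first_column.split(":")[index]


# Shortened record (sequence and quality columns omitted for brevity).
line = ("080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027\t83\tchr10\t42038185"
        "\t255\t36M\t=\t42037995\t-225\tNM:i:0\tMD:Z:36")

print(read_name_field(line, 4))    # 2027
print(read_name_field(line, -1))   # 2027, the last field of the first column
```

Using index -1 avoids counting colons at all, provided the wanted value is always the last field of the first column.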

策马西风 2024-12-05 22:04:26

You can use this instead:

s.split()[0].split(':')[4]

℉絮湮 2024-12-05 22:04:26

Split on whitespace first. Then split the first element of the resulting list on the separator (here: ':').

line.split()[0].split(':')[4]

你げ笑在眉眼 2024-12-05 22:04:26

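Do you have to use split?

I ask because I find regular expressions to be a better tool when I only need to grab a specific substring. They are not the easiest thing to learn, and they look very unapproachable at first, but you only pay the cost of learning them once, and it is a worthwhile investment. :)

The Python home page has a good introduction to them.

P.S. 2027 will be matched by the following regular expression: .*?:([0-9]+)\s+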
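
As an illustrative sketch of how that pattern could be applied with Python's re module (the shortened sample line and the variable names are assumptions, not part of the answer):

```python
import re

# Shortened stand-in for the full record.
line = "080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10"

# The lazy prefix .*? grows one character at a time until what follows it
# is ':' + digits + whitespace, which first happens at ':2027  '.
pattern = re.compile(r".*?:([0-9]+)\s+")

match = pattern.match(line)
number = match.group(1) if match else None
print(number)  # 2027
```

Checking the match object before calling group() avoids an AttributeError on lines that do not contain the expected field.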

黯淡〆 2024-12-05 22:04:26

I presume you will be doing numerous extractions of information from strings in the future. If so, my advice is to learn regular expressions; you will not be able to avoid them.

Otherwise you'll have to learn and use a specialized library for processing strings in the field of genomics.

A simple solution to your present problem using the re module:

line = '''080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10
42038185    255 36M =   42037995    -225
GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?
DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'''

import re

print(re.search(r':(\d+) ', line).group(1))

If there are blanks before the fourth ':', the regex pattern becomes:

line = '''080821_HWI-EAS301_0002_30AL BAAXX:1:8     :1649:2027  83  chr10
42038185    255 36M =   42037995    -225
GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?
DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'''

import re

print(re.search(r'(:[^:]+){3}:(\d+)', line).group(2))

(:[^:]+) matches a ':' followed by one or more characters other than ':'

{3} says that this match must occur 3 times

then the fourth ':' must be encountered, followed by the number we are looking for, matched by \d+ ; there is no longer any need to require a blank after the number, because \d+ stops matching as soon as a non-digit character is encountered

Parentheses define groups. Here the desired number is captured by the second group
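
To see what each group actually holds, here is a small sketch on a shortened version of the line with blanks. The non-capturing variant (?:...) is an addition of mine rather than part of the answer; it leaves the number as the only group.

```python
import re

# Shortened stand-in for the record with blanks before the fourth ':'.
line = "080821_HWI-EAS301_0002_30AL BAAXX:1:8     :1649:2027  83  chr10"

m = re.search(r"(:[^:]+){3}:(\d+)", line)
# A repeated group keeps only its last repetition, so group(1) is ':1649'.
print(repr(m.group(1)))
print(m.group(2))        # 2027

# Same pattern with a non-capturing group: the number becomes group(1).
m2 = re.search(r"(?::[^:]+){3}:(\d+)", line)
print(m2.group(1))       # 2027
```

Raw strings (the r prefix) keep backslashes such as \d from being interpreted as string escapes by Python itself.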
