How do I convert a list of intervals into a list of the numbers contained in those intervals?
I have a large file looking like this:
esup_255_3 transdecoder 7655 8192
esup_6093_1 transdecoder 2732 2774
esup_25727_1 transdecoder 1 60
...
with columns 3 and 4 representing intervals of numbers.
I am trying to modify this file to have the list of numbers comprised within the intervals, listed in a different column (here in column 5) as follows:
esup_255_3 transdecoder 7655 8192 7655
esup_255_3 transdecoder 7655 8192 7656
esup_255_3 transdecoder 7655 8192 7657
esup_255_3 transdecoder 7655 8192 ...
esup_255_3 transdecoder 7655 8192 8192
esup_6093_1 transdecoder 2732 2774 2732
esup_6093_1 transdecoder 2732 2774 2733
esup_6093_1 transdecoder 2732 2774 ....
esup_6093_1 transdecoder 2732 2774 2774
... and so on...
I think Perl may be helpful with this, but I am very new to it. I am only proficient in bash, and here I cannot seem to find the right way to obtain what I need.
Comments (1)
Something like this?
When I run it on your snippet it prints out 641 lines:
An explanation follows. Let's start with the options:
We'll take them right to left. The `-e` (for "execute" or "evaluate") just tells Perl that the next thing on the command line is the code to run, so it won't be looking for code on standard input.

The `-n` tells it to automatically iterate over its input line-by-line; it acts as though there's a `while (<>) { ... }` loop wrapped around the actual code. Inside the body of the loop the current line will be found in the topic variable `$_`.

The `-l` tells it to strip the newlines off the input and automatically append one to each string printed out; this basically takes newlines out of the picture and simplifies the logic.

So the program will read the input line-by-line and run the code that is given as the argument to `-e` on each line. Let's look at that code, which starts with this statement:

The regular expression doesn't have an explicit string to match against, so it automatically matches against `$_`, which has the current line. It must match the whole line (because of the `^` at the beginning and `$` at the end). The actual line value is also captured because of the outermost parentheses, so it will be the first item returned by the match, which is assigned to the variable `$line`.

The first part of the line can be anything at all (since `.*` matches everything), so we're really looking at the way the string ends instead of the way it starts. The first item of interest is any whitespace character (`\s`), which is there to make sure we don't miss any of the following numbers. Specifically, we're looking for one or more digits (`\d+`), which the parentheses capture, so that value will also be returned by the match; it's the second capture, so it goes into the second variable in the assignment, `$from`. After those digits we look for more whitespace (at least one whitespace character is required but any number is allowed) followed by another sequence of digits; this second set of digits is again captured and returned, so it winds up in the last variable, `$to`. Finally we allow the last set of digits to be followed by any amount of optional trailing space.

So after reading your first line, `$_ = "esup_255_3 transdecoder 7655 8192 "`, the match + assignment will set `$line` to a copy of that whole string, `$from` to `7655`, and `$to` to `8192`.

Then we come to the output. This line:
Is a shorter way of writing this loop:
Which means it loops over the whole numbers from `$from` to `$to`, reusing `$_` as the loop control variable (which is why we had to copy the current line into `$line`). For each value in the range, it prints out a copy of the whole line, followed by a tab and the current number.