Nextflow: Transform input from fromFilePairs into a tuple (map, list_pair_1, list_pair_2)

Posted on 2025-02-08 11:13:52

In my Nextflow workflow, I need to process files similar to the below example.

a.vcf.gz
a.vcf.gz.tbi
b.vcf.gz
b.vcf.gz.tbi
c.vcf.gz
c.vcf.gz.tbi

In particular, I need to create a channel which will output them with this structure:

[
    ["id": "test"], 
    ["a.vcf.gz", "b.vcf.gz", "c.vcf.gz"], 
    ["a.vcf.gz.tbi", "b.vcf.gz.tbi", "c.vcf.gz.tbi"]
]

That is, a single tuple containing one map, one list of the *.vcf.gz files and one list of the *.vcf.gz.tbi files.

My problem is that, from my reading of the documentation, it's not evident how to create it from a channel that emits items sequentially in groups of three.

For simplicity, I collect the files from pairs using Channel.fromFilePairs:

ch_input = Channel
    .fromFilePairs("*{.vcf.gz,.vcf.gz.tbi}")

This is where I got stuck. The closest I've got was by scrapping fromFilePairs and using groupTuple:

ch_input = Channel
    .fromPath("*.vcf.gz*")
    .map { file ->
        // Label each file by type: VCFs end in "gz", everything else is treated as an index
        def value = file.extension == "gz" ? "vcf" : "tbi"
        [value, file]
    }
    .groupTuple()

ch_input.view()

Which gives:

[tbi, [/Users/einar/Coding/a.vcf.gz.tbi, /Users/einar/Coding/c.vcf.gz.tbi, /Users/einar/Coding/einar/b.vcf.gz.tbi]]
[vcf, [/Users/einar/Coding/b.vcf.gz, /Users/einar/Coding/a.vcf.gz, /Users/einar/Coding/c.vcf.gz]]

Which still is far away from what I'd like and more fragile because it relies on file extensions.

Channel.multiMap is close to what I want, however it generates multiple channels, while instead I need a single channel.

How can this be done properly?

EDIT:

This is another attempt, which gets what I want, however it looks kind of hacky and fragile to me:

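ch_input = Channel
    .fromPath("*.vcf*")
    .map { file ->
        // Key each file by its extension ("gz" or "tbi")
        [file.extension, file]
    }
    .groupTuple()
    .map { it ->
        // Swap the extension key for a shared meta map
        def fmeta = ["id": "test"]
        [fmeta, it[1].flatten()]
    }
    .groupTuple()
    .map { it ->
        // Unpack the two grouped lists into separate tuple elements
        [it[0], it[1][0], it[1][1]]
    }

ch_input.view()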

Answer from 如梦初醒的夏天, 2025-02-15 11:13:52

To get what you want, you need the collect operator, which gives you a value channel:

Channel
    .fromFilePairs( '/path/to/files/*.vcf.gz{,.tbi}' )
    .collect { sample, indexed_vcf -> [ indexed_vcf ] }
    .map { 
        def fmeta = [ "id": "test" ]

        [ fmeta, it*.first(), it*.last() ] 
    } 
    .view()
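
To see why the final map step produces the requested shape, it can help to look at the spread-dot calls on their own. The sketch below is plain Groovy and needs no Nextflow; the strings are made-up stand-ins for the real file paths. It assumes each collected element is a [vcf, index] pair in that order, which, as far as I understand the documentation, holds because fromFilePairs sorts the files within each pair.

```groovy
// Stand-in for the value emitted by collect: a list of [vcf, index] pairs.
// Plain strings are used here instead of real file paths (assumption).
def pairs = [
    ['a.vcf.gz', 'a.vcf.gz.tbi'],
    ['b.vcf.gz', 'b.vcf.gz.tbi'],
    ['c.vcf.gz', 'c.vcf.gz.tbi']
]

// Spread-dot calls the method on every element and gathers the results.
def vcfs    = pairs*.first()
def indexes = pairs*.last()

assert vcfs    == ['a.vcf.gz', 'b.vcf.gz', 'c.vcf.gz']
assert indexes == ['a.vcf.gz.tbi', 'b.vcf.gz.tbi', 'c.vcf.gz.tbi']

// transpose() builds both lists in a single call.
assert pairs.transpose() == [vcfs, indexes]

// Same shape as the tuple requested in the question.
def fmeta = [id: 'test']
println([fmeta, vcfs, indexes])
```

In a real run the pairs may not arrive in alphabetical order, but each VCF stays at the same position as its index in the two lists, because both are derived from the same list of pairs.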

It's difficult to say without the details, but usually you don't need to separate out the index files from the actual VCF files. If this channel is to be used directly as process input, my preference would be to alter the input declaration so that I could use something like this instead:

Channel
    .fromPath( '/path/to/files/*.vcf.gz{,.tbi}' )
    .collect()
    .map { 
        def fmeta = ["id": "test"]

        [ fmeta, it ]
    } 
    .view()
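
As a rough illustration of what such an input declaration might look like, here is a minimal sketch. The process name MERGE_VCFS, the output file name and the bcftools command are all hypothetical and only show the idea: every VCF and index is staged into the same task directory, so a tool can find each index next to its VCF. It also assumes that more than one file is matched, so that the path input arrives as a list.

```groovy
// Hypothetical process: the name MERGE_VCFS and the bcftools call are
// illustrative assumptions, not part of the original question or answer.
process MERGE_VCFS {
    input:
    // One meta map plus all VCFs and indexes, staged side by side
    tuple val(meta), path(files)

    output:
    tuple val(meta), path("${meta.id}.merged.vcf.gz")

    script:
    // Pass only the VCFs on the command line; the .tbi files sit next to them
    def vcfs = files.findAll { it.name.endsWith('.vcf.gz') }.join(' ')
    """
    bcftools merge --output-type z --output ${meta.id}.merged.vcf.gz ${vcfs}
    """
}

workflow {
    ch_input = Channel
        .fromPath( '/path/to/files/*.vcf.gz{,.tbi}' )
        .collect()
        .map { files -> [ [ id: 'test' ], files ] }

    MERGE_VCFS( ch_input )
}
```

The trade-off is that the process has to tell VCFs from indexes itself (here by file name), in exchange for a simpler channel with a single list of files.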
