Snakemake rule log/benchmark wildcards do not match output wildcards after a checkpoint
I am running a Snakemake workflow with a checkpoint, from which I gather a previously unknown number of output files. Snakemake should then create one job per gathered file for the next rule, using part of each gathered checkpoint file's path as that rule's wildcard. It all works fine, unless I want that rule to also create log and/or benchmark files, at which point it throws:
SyntaxError:
Not all output, log and benchmark files of rule plasmid_spades contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
File "/path/to/Snakefile", line N, in <module>
These are the relevant parts of the workflow:
WCS = None
...
def gather_checkpoint_output(wildcards):
    ck_output = checkpoints.checkpoint_rule.get(**wildcards).output[0]
    global WCS
    WCS, = glob_wildcards(os.path.join(ck_output, "{wc}", "{wc}.file"))
    return expand(os.path.join(ck_output, "{wc}", "{wc}.file"), wc=WCS)

def gather_some_rule_after_checkpoint_out(wildcards):
    rule_output = checkpoints.checkpoint_rule.get(**wildcards).output[0]
    WCS2, = glob_wildcards(os.path.join(rule_output, "{wc}", "{wc}.file"))
    return expand(os.path.join("some", "{wc}", "path", "output.file"), wc=WCS2)
...
localrules: all

rule all:
    input:
        gather_checkpoint_output,
        gather_some_rule_after_checkpoint_out

...

rule some_rule_after_checkpoint:
    input:
        input = gather_checkpoint_output
    output:
        out_dir = directory(expand(os.path.join("some", "{wc}", "dir"), wc=WCS)),
        output = expand(os.path.join("some", "{wc}", "path", "output.file"), wc=WCS)
    log:
        os.path.join("logs", "some", "path", "{wc}_rule.log")
    benchmark:
        os.path.join("logs", "some", "path", "{wc}_rule_benchmark.tsv")
Is the problem that it evaluates the log/benchmark wildcard at the beginning (when WCS = None), while the output will be re-evaluated by the checkpoint functions? Although a rule's wildcards are based on the output's wildcards, I think. I tried lambda functions, expand(), etc., to specifically get the wildcards for the logs from the (hopefully re-evaluated) WCS, but that is apparently not permitted. Am I overlooking something obvious here, or is the entire construction wrong somehow?
Answer:
Your issue might be an avatar of the infamous and very common "wrong use of expand" general problem.
In the some_rule_after_checkpoint rule, the log and benchmark directives contain a wildcard, while the output directive doesn't. Indeed, you need to be well aware of the fact that the wildcards in the output file name patterns are expanded, resulting in a list of fully resolved file names. This confuses Snakemake: what value of the wildcard should it use to determine the names of the log and benchmark files if there is no wildcard in the output file names? Wildcard values in a rule are determined by matching this rule's output file name patterns with fully resolved input file names in a downstream rule.

You should likely not use expand in the outputs of some_rule_after_checkpoint, since the expanding is already done in the input of the all rule. With non-expanded output file name patterns in some_rule_after_checkpoint, each different file in the expanded input of rule all will trigger one instance of the some_rule_after_checkpoint rule, for which the value of the wildcard will be determined based on a pattern matching between the output file patterns in some_rule_after_checkpoint and the desired "fully resolved" input for all. This wildcard will then enable Snakemake to generate the corresponding log and benchmark files.
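Concretely, the rule could be rewritten along these lines. This is only a sketch, not the asker's actual workflow: the input pattern assumes the checkpoint writes into a directory here called checkpoint_out, and the shell command (process_file) is a placeholder; the output, log, and benchmark paths follow the question's layout, each containing the same {wc} wildcard:

```python
rule some_rule_after_checkpoint:
    input:
        # One input file per job; {wc} is resolved by matching the
        # expanded input of rule all against the output patterns below.
        os.path.join("checkpoint_out", "{wc}", "{wc}.file")
    output:
        # No expand() here: plain patterns, all sharing the {wc} wildcard.
        out_dir = directory(os.path.join("some", "{wc}", "dir")),
        out_file = os.path.join("some", "{wc}", "path", "output.file")
    log:
        os.path.join("logs", "some", "path", "{wc}_rule.log")
    benchmark:
        os.path.join("logs", "some", "path", "{wc}_rule_benchmark.tsv")
    shell:
        # Placeholder command; stderr is redirected into the log file.
        "process_file {input} > {output.out_file} 2> {log}"
```

The expansion over all discovered values of wc then lives solely in gather_some_rule_after_checkpoint_out, which feeds rule all; Snakemake instantiates one some_rule_after_checkpoint job per value, and log and benchmark resolve consistently with the outputs.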