计算图中子图的出现次数
我有一个 3 维数据集,描述了可以用图表表示的基因相互作用。数据集样本为:
a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
“+”表示左侧基因正向调节右侧基因。在这些数据中,我想计算一个基因(例如,x)正向调节另一个基因(例如,y),y 反过来正向调节另一个基因(例如,z)的子图。此外,z也受到x的正向调节。上图中有两种这样的情况。我想最好使用 awk 来执行此搜索,但任何脚本语言都可以。我很抱歉提出了一个过于具体的问题,并提前感谢您的帮助。
I have a 3 dimensional dataset that describes the gene interactions which can be formulated as a graph. The sample of dataset is:
a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
'+' indicates that a gene on the left side positively regulates the gene on the right. In this data I want to count the sub-graph where a gene (say, x) positively regulates another gene (say, y), y in turn positively regulates yet another gene (say, z). Furthermore, z is also positively regulated by x. There are two such cases in above graph. I want to perform this search preferably using awk but any scripting language is fine. My apologies for being a too specific question and thanks in advance for the help.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
注意:请参阅下面有关 Graphviz 的信息。
这应该为您提供一个起点:
编辑:此版本处理由多个字符描述的基因。
在
BEGIN
子句中,将regdelim
设置为数据中未出现的字符。我省略了负数据的处理代码。
输出:
编辑 2:
下面的版本允许您搜索任意组合。它概括了原始版本中使用的技术,因此不需要重复代码。它还修复了其他一些
错误限制。可以这样调用它来进行详尽的搜索:
对于此数据:
调用新版本脚本的 shell 循环的输出将如下所示:
Edit 3:
Graphviz
另一种方法是使用 Graphviz。 DOT 语言可以描述图形和
gvpr
,这是一种“类似 AWK”1 编程语言,可以分析和操作 DOT 文件。给定问题中所示格式的输入数据,您可以使用以下 AWK 程序将其转换为 DOT:
要运行的命令如下所示:
然后您可以使用以下命令创建上面的图形:
我在这个答案中使用了上面给出的扩展数据。
要对指定的子图类型进行详尽的搜索,您可以使用以下
gvpr
程序:要运行它,您可以使用:
输出将与上面的 AWK/shell 组合的输出类似(在“编辑 2”下):
1 宽松地说。
Note: See the information regarding Graphviz below.
This should give you a starting point:
Edit: This version handles genes that are described by more than one character.
In the
BEGIN
clause, setregdelim
to a character that doesn't appear in your data.I've omitted the processing code for the minus data.
Output:
Edit 2:
The version below allows you to search for arbitrary combinations. It generalizes the technique used in the original version so no code needs to be duplicated. It also fixes a couple of other
bugslimitations.This can be called like this to do an exhaustive search:
For this data:
The output of the shell loop calling the new version of the script would look like this:
Edit 3:
Graphviz
Another approach would be to use Graphviz. The DOT language can describe the graph and
gvpr
, which is an "AWK-like"1 programming language, can analyze and manipulate DOT files.Given the input data in the format as shown in the question, you can use the following AWK program to convert it to DOT:
The command to run would be something like this:
You can then create the graphic above using:
I used the extended data as given above in this answer.
To do an exhaustive search for the type of subgraphs you specified, you can use the following
gvpr
program:To run it, you could use:
The output would be similar to that from the AWK/shell combination above (under "Edit 2"):
1 Loosely speaking.
下面是使用 Perl 的非常规方法。
正则表达式引擎是一个强大的工具。对于您的问题,我们描述传递子图的结构,然后允许生成的机器查找所有子图。
输出:
使用相同技术的更通用的工具是可能的。例如,假设您希望能够指定基因模式,例如
a+b+c;a+c
或g1+g2-g3;g1+g3
。我希望这些模式的含义是显而易见的。在前面的内容中,我指定了最低版本 5.10.0,因为代码使用
//
和词法$_
。该代码构造了将评估代码的动态正则表达式,这是use re 'eval'
pragma 启用的。标识符是一个或多个“单词字符”的序列,即字母、数字或下划线。
给定一个接受变量名称的正则表达式,
unique_vars
会在某个规范中搜索所有变量名称并返回它们而不重复。将基因模式编译成正则表达式有点麻烦。它动态生成一个搜索目标和正则表达式,其形式与上面的静态形式相同。
多次出现逗号分隔变量的第一部分让正则表达式引擎尝试每个基因的每个可能值。然后,前瞻
(?=...)
扫描图形,查找具有所需属性的边。如果所有前瞻都成功,我们就会记录命中。最后的奇怪的正则表达式
(?!)
是一个无条件失败,迫使匹配器回溯并尝试与不同的基因进行匹配。因为它是无条件的,所以引擎将评估所有可能性。同时从多个线程调用同一个闭包可能会产生奇怪的结果。
阅读数据集并让用户了解任何问题。
现在我们将一切付诸实践:
给定
graphs.txt
的内容,然后运行程序,我们得到以下输出:
An unconventional approach using Perl is below.
The regex engine is a powerful tool. For your problem, we describe the structure of a transitive subgraph and then allow the resulting machine to go find all of them.
Output:
A more general tool using this same technique is possible. For example, let's say you want to be able to specify gene patterns such as
a+b+c;a+c
org1+g2-g3;g1+g3
. I hope the meanings of these patterns are obvious.In the front matter, I specify a minimum version of 5.10.0 because the code uses
//
and lexical$_
. The code constructs dynamic regexes that will evaluate code, which theuse re 'eval'
pragma enables.An identifier is a sequence of one or more “word characters,” i.e., letters, digits, or underscores.
Given a regex that accepts variable names,
unique_vars
searches some specification for all variable names and returns them without repetition.Compiling a gene pattern into a regex is a little hairy. It dynamically generates a search target and regex with the same form as the static one above.
The first part with multiple occurrences of comma-separated variables lets the regex engine try each possible value for each gene. Then the lookaheads,
(?=...)
, scan the graph looking for edges with the desired properties. If all the lookaheads succeed, we record the hit.The strange regex
(?!)
at the end is an unconditional failure that forces the matcher to backtrack and attempt the match with different genes. Because it's unconditional, the engine will evaluate all possibilities.Calling the same closure from multiple threads concurrently will likely produce strange results.
Read the dataset and let the user know about any problems.
Now we set it all into motion:
Given
graphs.txt
with contentsand then running the program, we get the following output:
我假设“计算子图”的意思是计算子图中的节点。如果这就是您需要的,您可以使用任何脚本语言,并且必须存储图形,首先,通过创建存储图形的结构或类,节点结构/类应该如下所示(这不符合任何语言的语法,这只是您的应用程序的计划):
其中 color = 0(颜色的默认值意味着您之前没有访问过该节点),标题将为 'a', 'b', 'c ', 等等。 minusNodeSet 是存储这些节点的节点集,其中负顶点指向我们的节点,plusNodeSet 是存储这些节点的节点集,其中正顶点指向我们的节点。
现在,我们有了一个架构,应该在深度优先算法中使用它:
如果我误解了你的问题,请告诉我,以便能够将我的答案编辑为更有用的答案。
I assume that by "count the sub-graph" you mean counting the nodes in a sub-graph. If that's what you need, you can use any scripting language and will have to store the graph, first of all, by creating a structure or class where you store your graph, the node structure/class should look like this (this is not conforming the syntax of any language, this is only a plan for your application):
Where color = 0 (the defaul value of color means you haven't visited this node before), title will be 'a', 'b', 'c', and so on. minusNodeSet is a Set of Nodes where those nodes are stored, where a minus vertice points from our Node, plusNodeSet is a Set of Nodes where those nodes are stored, where a plus vertice points from our Node.
Now, we have an architecture and should use it in a depth-first algoritm:
If I misunderstood your question, please, tell me, to be able to edit my answer to be a more useful one.
我的其他中的正则表达式的结构答案类似于列表单子处理。受此启发,对传递子图的搜索如下作为 literate Haskell。将此答案复制并粘贴到扩展名为
.lhs
的文件中以获取工作程序。请务必用空行将由前导>
标记的代码部分括起来。感谢您提出有趣的问题!
前面的一点:
基因的名称可以是任何字符串,对于给定的
Gene
g,类型为PosReg
的函数应该给我们所有g正向调节的基因。从您问题中指定的图表中,我们需要基因的三元组,以便“正调节”关系是传递的,并且
子图
描述了所需的属性。首先,从图中选择一个任意基因x。接下来,选择 x 正向调节的基因 y 之一。为了保持传递性,z 必须是 x 和 y 正向调节的基因。通过decode中的简单解析器,我们提取了图中的基因列表和一个PosReg函数,该函数给出了受其他基因正向调节的所有基因。
最后,主程序将它们粘合在一起。对于找到的每个子图,将其打印到标准输出。
输出:
The structure of the regex in my other answer resembles list-monad processing. Given that inspiration, a search for the transitive subgraphs is below as a literate Haskell. Copy-and-paste this answer to a file with the extension
.lhs
to get a working program. Be sure to surround the code sections, marked by leading>
, with empty lines.Thanks for the fun problem!
A bit of front matter:
The name of a gene can be any string, and for a given
Gene
g, a function of typePosReg
should give us all the genes that g positively regulates.From a graph specified as in your question, we want triples of genes such that the is-positively-regulated-by relation is transitive, and
subgraphs
describes the desired properties. First, pick an arbitrary gene x from the graph. Next, choose one of the genes y that x positively regulates. For the transitive property to hold, z must be a gene that both x and y positively regulate.With the simple parser in
decode
, we distill the list of genes in the graph and aPosReg
function that gives all genes positively regulated by some other gene.Finally, the main program glues it all together. For each subgraph found, print it to the standard output.
Output: