计算图中子图的出现次数

发布于 2024-10-02 05:10:58 字数 305 浏览 7 评论 0原文

我有一个 3 维数据集，描述了可以用图表表示的基因相互作用。数据集样本为：

a + b  
b + c  
c - f  
b - d  
a + c  
f + g  
g + h  
f + h

“+”表示左侧基因正向调节右侧基因。在这些数据中，我想计算一个基因（例如，x）正向调节另一个基因（例如，y），y 反过来正向调节另一个基因（例如，z）的子图。此外，z也受到x的正向调节。上图中有两种这样的情况。我想最好使用 awk 来执行此搜索，但任何脚本语言都可以。我很抱歉提出了一个过于具体的问题，并提前感谢您的帮助。

原文

I have a 3 dimensional dataset that describes the gene interactions which can be formulated as a graph. The sample of dataset is:

a + b  
b + c  
c - f  
b - d  
a + c  
f + g  
g + h  
f + h

'+' indicates that a gene on the left side positively regulates the gene on the right. In this data I want to count the sub-graph where a gene (say, x) positively regulates another gene (say, y), y in turn positively regulates yet another gene (say, z). Furthermore, z is also positively regulated by x. There are two such cases in above graph. I want to perform this search preferably using awk but any scripting language is fine. My apologies for being a too specific question and thanks in advance for the help.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

七堇年 2024-10-09 05:10:58

注意：请参阅下面有关 Graphviz 的信息。

这应该为您提供一个起点：

编辑：此版本处理由多个字符描述的基因。

awk '
    BEGIN { regdelim = "|" }
    {
        delim=""
        if ($2 == "+") {
            if (plus[$1]) delim=regdelim
            plus[$1]=plus[$1] delim $3
        }
        else
            if ($2 == "-") {
            if (minus[$1]) delim=regdelim
                minus[$1]=minus[$1] delim $3
            }
    }
    END {
        for (root in plus) {
            split(plus[root],regs,regdelim)
            for (reg in regs) {
                if (plus[regs[reg]] && plus[root] ~ plus[regs[reg]]) {
                    print "Match: ", root, "+", regs[reg], "+", plus[regs[reg]]
                }
            }
        }
    }
' inputfile

在 BEGIN 子句中，将 regdelim 设置为数据中未出现的字符。

我省略了负数据的处理代码。

输出：

Match:  a + b + c
Match:  f + g + h

编辑 2：

下面的版本允许您搜索任意组合。它概括了原始版本中使用的技术，因此不需要重复代码。它还修复了其他一些错误限制。

#!/bin/bash
# written by Dennis Williamson - 2010-11-12
# for http://stackoverflow.com/questions/4161001/counting-the-occurrence-of-a-sub-graph-in-a-graph
# A (AB) B, A (AC) C, B (BC) C - where "(XY)" represents a + or a - 
# provided by the positional parameters $1, $2 and $3
# $4 carries the data file name and is referenced at the end of the script
awk -v AB=$1 -v AC=$2 -v BC=$3 '
    BEGIN { regdelim = "|" }
    {
        if ($2 == AB) {
            if (regAB[$1]) delim=regdelim; else delim=""
            regAB[$1]=regAB[$1] delim $3
        }
        if ($2 == AC) {
            if (regAC[$1]) delim=regdelim; else delim=""
            regAC[$1]=regAC[$1] delim $3
        }
        if ($2 == BC) {
            if (regBC[$1]) delim=regdelim; else delim=""
            regBC[$1]=regBC[$1] delim $3
        }
    }
    END {
        for (root in regAB) {
            split(regAB[root],ABarray,regdelim)
            for (ABindex in ABarray) {
                split(regAC[root],ACarray,regdelim)
                for (ACindex in ACarray) {
                    split(regBC[ABarray[ABindex]],BCarray,regdelim)
                    for (BCindex in BCarray) {
                        if (ACarray[ACindex] == BCarray[BCindex]) {
                            print "    Match:", root, AB, ABarray[ABindex] ",", root, AC, ACarray[ACindex] ",", ABarray[ABindex], BC, BCarray[BCindex]
                        }
                    }
                }
            }
        }
    }
' "$4"

可以这样调用它来进行详尽的搜索：

for ab in + -; do for ac in + -; do for bc in + -; do echo "Searching: $ab$ac$bc"; ./searchgraph $ab $ac $bc inputfile; done; done; done

对于此数据：

a - e
a + b
b + c
c - f
m - n
b - d
a + c
b - e
l - n
f + g
b + i
g + h
l + m
f + h
a + i
a - j
k - j
a - k

调用新版本脚本的 shell 循环的输出将如下所示：

Searching: +++
    Match: a + b, a + c, b + c
    Match: a + b, a + i, b + i
    Match: f + g, f + h, g + h
Searching: ++-
Searching: +-+
Searching: +--
    Match: l + m, l - n, m - n
    Match: a + b, a - e, b - e
Searching: -++
Searching: -+-
Searching: --+
Searching: ---
    Match: a - k, a - j, k - j

Edit 3:

Graphviz

另一种方法是使用 Graphviz。 DOT 语言可以描述图形和gvpr，这是一种“类似 AWK”¹ 编程语言，可以分析和操作 DOT 文件。

给定问题中所示格式的输入数据，您可以使用以下 AWK 程序将其转换为 DOT：

#!/usr/bin/awk -f
BEGIN {
    print "digraph G {"
    print "    size=\"5,5\""
    print "    ratio=.85"
    print "    node [fontsize=24 color=blue penwidth=3]"
    print "    edge [fontsize=18 labeldistance=5 labelangle=-8 minlen=2 penwidth=3]"
    print "    {rank=same; f l}"
    m  = "-"    # ASCII minus/hyphen as in the source data
    um = "−"    # u2212 minus: − which looks better on the output graphic
    p  = "+"
}

{
    if ($2 == m) { $2 = um; c = lbf = "red"; arr=" arrowhead = empty" }
    if ($2 == p) { c = lbf = "green3"; arr="" }
    print "    " $1, "->", $3, "[taillabel = \"" $2 "\" color = \"" c "\" labelfontcolor = \"" lbf "\"" arr "]"
}
END {
    print "}"
}

要运行的命令如下所示：

$ ./dat2dot data.dat > data.dot

然后您可以使用以下命令创建上面的图形：

$ dot -Tpng -o data.png data.dot

我在这个答案中使用了上面给出的扩展数据。

要对指定的子图类型进行详尽的搜索，您可以使用以下 gvpr 程序：

BEGIN {
    edge_t AB, BC, AC;
}

E {
    AB = $;
    BC = fstedge(AB.head);
    while (BC && BC.head.name != AB.head.name) {
        AC = isEdge(AB.tail,BC.head,"");
        if (AC) {
            printf("%s %s %s, ", AB.tail.name, AB.taillabel, AB.head.name);
            printf("%s %s %s, ", AC.tail.name, AC.taillabel, AC.head.name);
            printf("%s %s %s\n", BC.tail.name, BC.taillabel, BC.head.name);
        }
        BC = nxtedge(BC, AB.head);
    }
}

要运行它，您可以使用：

$ gvpr -f groups.g data.dot | sort -k 2,2 -k 5,5 -k 8,8

输出将与上面的 AWK/shell 组合的输出类似（在“编辑 2”下）：

a + b, a + c, b + c
a + b, a + i, b + i
f + g, f + h, g + h
a + b, a − e, b − e
l + m, l − n, m − n
a − k, a − j, k − j

¹_{宽松地说。}

Note: See the information regarding Graphviz below.

This should give you a starting point:

Edit: This version handles genes that are described by more than one character.

awk '
    BEGIN { regdelim = "|" }
    {
        delim=""
        if ($2 == "+") {
            if (plus[$1]) delim=regdelim
            plus[$1]=plus[$1] delim $3
        }
        else
            if ($2 == "-") {
            if (minus[$1]) delim=regdelim
                minus[$1]=minus[$1] delim $3
            }
    }
    END {
        for (root in plus) {
            split(plus[root],regs,regdelim)
            for (reg in regs) {
                if (plus[regs[reg]] && plus[root] ~ plus[regs[reg]]) {
                    print "Match: ", root, "+", regs[reg], "+", plus[regs[reg]]
                }
            }
        }
    }
' inputfile

In the BEGIN clause, set regdelim to a character that doesn't appear in your data.

I've omitted the processing code for the minus data.

Output:

Match:  a + b + c
Match:  f + g + h

Edit 2:

The version below allows you to search for arbitrary combinations. It generalizes the technique used in the original version so no code needs to be duplicated. It also fixes a couple of other ~~bugs~~limitations.

#!/bin/bash
# written by Dennis Williamson - 2010-11-12
# for http://stackoverflow.com/questions/4161001/counting-the-occurrence-of-a-sub-graph-in-a-graph
# A (AB) B, A (AC) C, B (BC) C - where "(XY)" represents a + or a - 
# provided by the positional parameters $1, $2 and $3
# $4 carries the data file name and is referenced at the end of the script
awk -v AB=$1 -v AC=$2 -v BC=$3 '
    BEGIN { regdelim = "|" }
    {
        if ($2 == AB) {
            if (regAB[$1]) delim=regdelim; else delim=""
            regAB[$1]=regAB[$1] delim $3
        }
        if ($2 == AC) {
            if (regAC[$1]) delim=regdelim; else delim=""
            regAC[$1]=regAC[$1] delim $3
        }
        if ($2 == BC) {
            if (regBC[$1]) delim=regdelim; else delim=""
            regBC[$1]=regBC[$1] delim $3
        }
    }
    END {
        for (root in regAB) {
            split(regAB[root],ABarray,regdelim)
            for (ABindex in ABarray) {
                split(regAC[root],ACarray,regdelim)
                for (ACindex in ACarray) {
                    split(regBC[ABarray[ABindex]],BCarray,regdelim)
                    for (BCindex in BCarray) {
                        if (ACarray[ACindex] == BCarray[BCindex]) {
                            print "    Match:", root, AB, ABarray[ABindex] ",", root, AC, ACarray[ACindex] ",", ABarray[ABindex], BC, BCarray[BCindex]
                        }
                    }
                }
            }
        }
    }
' "$4"

This can be called like this to do an exhaustive search:

for ab in + -; do for ac in + -; do for bc in + -; do echo "Searching: $ab$ac$bc"; ./searchgraph $ab $ac $bc inputfile; done; done; done

For this data:

a - e
a + b
b + c
c - f
m - n
b - d
a + c
b - e
l - n
f + g
b + i
g + h
l + m
f + h
a + i
a - j
k - j
a - k

The output of the shell loop calling the new version of the script would look like this:

Searching: +++
    Match: a + b, a + c, b + c
    Match: a + b, a + i, b + i
    Match: f + g, f + h, g + h
Searching: ++-
Searching: +-+
Searching: +--
    Match: l + m, l - n, m - n
    Match: a + b, a - e, b - e
Searching: -++
Searching: -+-
Searching: --+
Searching: ---
    Match: a - k, a - j, k - j

Edit 3:

Graphviz

Another approach would be to use Graphviz. The DOT language can describe the graph and gvpr, which is an "AWK-like"¹ programming language, can analyze and manipulate DOT files.

Given the input data in the format as shown in the question, you can use the following AWK program to convert it to DOT:

#!/usr/bin/awk -f
BEGIN {
    print "digraph G {"
    print "    size=\"5,5\""
    print "    ratio=.85"
    print "    node [fontsize=24 color=blue penwidth=3]"
    print "    edge [fontsize=18 labeldistance=5 labelangle=-8 minlen=2 penwidth=3]"
    print "    {rank=same; f l}"
    m  = "-"    # ASCII minus/hyphen as in the source data
    um = "−"    # u2212 minus: − which looks better on the output graphic
    p  = "+"
}

{
    if ($2 == m) { $2 = um; c = lbf = "red"; arr=" arrowhead = empty" }
    if ($2 == p) { c = lbf = "green3"; arr="" }
    print "    " $1, "->", $3, "[taillabel = \"" $2 "\" color = \"" c "\" labelfontcolor = \"" lbf "\"" arr "]"
}
END {
    print "}"
}

The command to run would be something like this:

$ ./dat2dot data.dat > data.dot

You can then create the graphic above using:

$ dot -Tpng -o data.png data.dot

I used the extended data as given above in this answer.

To do an exhaustive search for the type of subgraphs you specified, you can use the following gvpr program:

BEGIN {
    edge_t AB, BC, AC;
}

E {
    AB = $;
    BC = fstedge(AB.head);
    while (BC && BC.head.name != AB.head.name) {
        AC = isEdge(AB.tail,BC.head,"");
        if (AC) {
            printf("%s %s %s, ", AB.tail.name, AB.taillabel, AB.head.name);
            printf("%s %s %s, ", AC.tail.name, AC.taillabel, AC.head.name);
            printf("%s %s %s\n", BC.tail.name, BC.taillabel, BC.head.name);
        }
        BC = nxtedge(BC, AB.head);
    }
}

To run it, you could use:

$ gvpr -f groups.g data.dot | sort -k 2,2 -k 5,5 -k 8,8

The output would be similar to that from the AWK/shell combination above (under "Edit 2"):

a + b, a + c, b + c
a + b, a + i, b + i
f + g, f + h, g + h
a + b, a − e, b − e
l + m, l − n, m − n
a − k, a − j, k − j

¹_{Loosely speaking.}

回复收藏 0 原文

香草可樂 2024-10-09 05:10:58

下面是使用 Perl 的非常规方法。

#! /usr/bin/perl

use warnings;
use strict;

my $graph = q{
  a + c
  b + c
  c - f
  b - d
  a + b
  f + g
  g + h
  f + h
};

my $nodes = join ",", sort keys %{ { map +($_ => 1), $graph =~ /(\w+)/g } };
my $search = "$nodes:$nodes:$nodes:$graph";

my $subgraph = qr/
  \A  .*?  (?<x>\w+)  .*:
      .*?  (?<y>\w+)  .*:
      .*?  (?<z>\w+)  .*:
  (?= .*^\s*  \k<x>  \s*  \+  \s*  \k<y>  \s*$)
  (?= .*^\s*  \k<y>  \s*  \+  \s*  \k<z>  \s*$)
  (?= .*^\s*  \k<x>  \s*  \+  \s*  \k<z>  \s*$)
  (?{ print "x=$+{x}, y=$+{y}, z=$+{z}\n" })
  (?!)
/smx;

$search =~ /$subgraph/;

正则表达式引擎是一个强大的工具。对于您的问题，我们描述传递子图的结构，然后允许生成的机器查找所有子图。

输出：

x=a, y=b, z=c
x=f, y=g, z=h

使用相同技术的更通用的工具是可能的。例如，假设您希望能够指定基因模式，例如 a+b+c;a+c 或 g1+g2-g3;g1+g3。我希望这些模式的含义是显而易见的。

在前面的内容中，我指定了最低版本 5.10.0，因为代码使用 // 和词法 $_。该代码构造了将评估代码的动态正则表达式，这是 use re 'eval' pragma 启用的。

#! /usr/bin/perl

use warnings;
use strict;

use 5.10.0;
use re 'eval';

标识符是一个或多个“单词字符”的序列，即字母、数字或下划线。

my $ID = qr/\w+/;

给定一个接受变量名称的正则表达式，unique_vars 会在某个规范中搜索所有变量名称并返回它们而不重复。

sub unique_vars {
  my($_,$pattern) = @_;
  keys %{ { map +($_ => undef), /($pattern)/g } };
}

将基因模式编译成正则表达式有点麻烦。它动态生成一个搜索目标和正则表达式，其形式与上面的静态形式相同。

多次出现逗号分隔变量的第一部分让正则表达式引擎尝试每个基因的每个可能值。然后，前瞻 (?=...) 扫描图形，查找具有所需属性的边。如果所有前瞻都成功，我们就会记录命中。

最后的奇怪的正则表达式 (?!) 是一个无条件失败，迫使匹配器回溯并尝试与不同的基因进行匹配。因为它是无条件的，所以引擎将评估所有可能性。

同时从多个线程调用同一个闭包可能会产生奇怪的结果。

sub compile_gene_pattern {
  my($dataset,$pattern) = @_;
  my @vars   = sort +unique_vars $pattern, qr/[a-z]\d*/;  # / for SO hilite
  my $nodes  = join ",", sort +unique_vars $dataset, $ID;
  my $search = join("", map "$_:", ($nodes) x @vars) . "\n"
             . $dataset;

  my $spec = '\A' . "\n" . join("", map ".*?  (?<$_>$ID)  .*:\n", @vars);
  for (split /;/, $pattern) {
    while (s/^($ID)([-+])($ID)/$3/) {
      $spec .= '(?= .*^\s*  ' .
               ' \b\k<' .           $1  . '>\b ' .
               ' \s*'   . quotemeta($2) . '\s* ' .
               ' \b\k<' .           $3  . '>\b ' .
               ' \s*$)' . "\n";
    }
  }
  my %hits;
  $spec .= '(?{ ++$hits{join "-", @+{@vars}} })' . "\n" .
           '(?!) # backtrack' . "\n";

  my $nfa = eval { qr/$spec/smx } || die "$0: INTERNAL: bad regex:\n$@";
  sub {
    %hits = ();  # thread-safety? :-(
    (my $_ = $search) =~ /$nfa/;
    map [split /-/], sort keys %hits;
  }
}

阅读数据集并让用户了解任何问题。

sub read_dataset {
  my($path) = @_;

  open my $fh, "<", $path or die "$0: open $path: $!";

  local $/ = "\n";
  local $_;
  my $graph;

  my @errors;
  while (<$fh>) {
    next if /^\s*#/ || /^\s*$/;

    if (/^ \s* $ID \s* [-+] \s* $ID \s* $/x) {
      $graph .= $_;
    }
    else {
      push @errors, "$.: syntax error";
    }
  }

  return $graph unless @errors;

  die map "$0: $path:$_\n", @errors;
}

现在我们将一切付诸实践：

my $graphs = shift // "graphs.txt";
my $dataset = read_dataset $graphs;

my $ppp = compile_gene_pattern $dataset, "a+b+c;a+c";
print "@$_\n" for $ppp->();

my $pmp = compile_gene_pattern $dataset, "g1+g2-g3;g1+g3";
print "@$_\n" for $pmp->();

给定 graphs.txt 的内容

a + b  
b + c  
c - f  
b - d  
a + c  
f + g  
g + h  
f + h

foo + bar
bar - baz
foo + baz

，然后运行程序，我们得到以下输出：

a b c
f g h
foo bar baz

An unconventional approach using Perl is below.

#! /usr/bin/perl

use warnings;
use strict;

my $graph = q{
  a + c
  b + c
  c - f
  b - d
  a + b
  f + g
  g + h
  f + h
};

my $nodes = join ",", sort keys %{ { map +($_ => 1), $graph =~ /(\w+)/g } };
my $search = "$nodes:$nodes:$nodes:$graph";

my $subgraph = qr/
  \A  .*?  (?<x>\w+)  .*:
      .*?  (?<y>\w+)  .*:
      .*?  (?<z>\w+)  .*:
  (?= .*^\s*  \k<x>  \s*  \+  \s*  \k<y>  \s*$)
  (?= .*^\s*  \k<y>  \s*  \+  \s*  \k<z>  \s*$)
  (?= .*^\s*  \k<x>  \s*  \+  \s*  \k<z>  \s*$)
  (?{ print "x=$+{x}, y=$+{y}, z=$+{z}\n" })
  (?!)
/smx;

$search =~ /$subgraph/;

The regex engine is a powerful tool. For your problem, we describe the structure of a transitive subgraph and then allow the resulting machine to go find all of them.

Output:

x=a, y=b, z=c
x=f, y=g, z=h

A more general tool using this same technique is possible. For example, let's say you want to be able to specify gene patterns such as a+b+c;a+c or g1+g2-g3;g1+g3. I hope the meanings of these patterns are obvious.

In the front matter, I specify a minimum version of 5.10.0 because the code uses // and lexical $_. The code constructs dynamic regexes that will evaluate code, which the use re 'eval' pragma enables.

#! /usr/bin/perl

use warnings;
use strict;

use 5.10.0;
use re 'eval';

An identifier is a sequence of one or more “word characters,” i.e., letters, digits, or underscores.

my $ID = qr/\w+/;

Given a regex that accepts variable names, unique_vars searches some specification for all variable names and returns them without repetition.

sub unique_vars {
  my($_,$pattern) = @_;
  keys %{ { map +($_ => undef), /($pattern)/g } };
}

Compiling a gene pattern into a regex is a little hairy. It dynamically generates a search target and regex with the same form as the static one above.

The first part with multiple occurrences of comma-separated variables lets the regex engine try each possible value for each gene. Then the lookaheads, (?=...), scan the graph looking for edges with the desired properties. If all the lookaheads succeed, we record the hit.

The strange regex (?!) at the end is an unconditional failure that forces the matcher to backtrack and attempt the match with different genes. Because it's unconditional, the engine will evaluate all possibilities.

Calling the same closure from multiple threads concurrently will likely produce strange results.

sub compile_gene_pattern {
  my($dataset,$pattern) = @_;
  my @vars   = sort +unique_vars $pattern, qr/[a-z]\d*/;  # / for SO hilite
  my $nodes  = join ",", sort +unique_vars $dataset, $ID;
  my $search = join("", map "$_:", ($nodes) x @vars) . "\n"
             . $dataset;

  my $spec = '\A' . "\n" . join("", map ".*?  (?<$_>$ID)  .*:\n", @vars);
  for (split /;/, $pattern) {
    while (s/^($ID)([-+])($ID)/$3/) {
      $spec .= '(?= .*^\s*  ' .
               ' \b\k<' .           $1  . '>\b ' .
               ' \s*'   . quotemeta($2) . '\s* ' .
               ' \b\k<' .           $3  . '>\b ' .
               ' \s*$)' . "\n";
    }
  }
  my %hits;
  $spec .= '(?{ ++$hits{join "-", @+{@vars}} })' . "\n" .
           '(?!) # backtrack' . "\n";

  my $nfa = eval { qr/$spec/smx } || die "$0: INTERNAL: bad regex:\n$@";
  sub {
    %hits = ();  # thread-safety? :-(
    (my $_ = $search) =~ /$nfa/;
    map [split /-/], sort keys %hits;
  }
}

Read the dataset and let the user know about any problems.

sub read_dataset {
  my($path) = @_;

  open my $fh, "<", $path or die "$0: open $path: $!";

  local $/ = "\n";
  local $_;
  my $graph;

  my @errors;
  while (<$fh>) {
    next if /^\s*#/ || /^\s*$/;

    if (/^ \s* $ID \s* [-+] \s* $ID \s* $/x) {
      $graph .= $_;
    }
    else {
      push @errors, "$.: syntax error";
    }
  }

  return $graph unless @errors;

  die map "$0: $path:$_\n", @errors;
}

Now we set it all into motion:

my $graphs = shift // "graphs.txt";
my $dataset = read_dataset $graphs;

my $ppp = compile_gene_pattern $dataset, "a+b+c;a+c";
print "@$_\n" for $ppp->();

my $pmp = compile_gene_pattern $dataset, "g1+g2-g3;g1+g3";
print "@$_\n" for $pmp->();

Given graphs.txt with contents

a + b  
b + c  
c - f  
b - d  
a + c  
f + g  
g + h  
f + h

foo + bar
bar - baz
foo + baz

and then running the program, we get the following output:

a b c
f g h
foo bar baz

回复收藏 0 原文

流星番茄 2024-10-09 05:10:58

我假设“计算子图”的意思是计算子图中的节点。如果这就是您需要的，您可以使用任何脚本语言，并且必须存储图形，首先，通过创建存储图形的结构或类，节点结构/类应该如下所示（这不符合任何语言的语法，这只是您的应用程序的计划）：

Node {color = 0; title = ""; minusNodeSet = null; plusNodeSet = null}

其中 color = 0（颜色的默认值意味着您之前没有访问过该节点），标题将为 'a', 'b', 'c '，等等。 minusNodeSet 是存储这些节点的节点集，其中负顶点指向我们的节点，plusNodeSet 是存储这些节点的节点集，其中正顶点指向我们的节点。

现在，我们有了一个架构，应该在深度优先算法中使用它：

int depth_first(Node actualNode)
{
    if (actualNode.color == 1)
    return;

    number = 1;
    actualNode.color = 1;
    foreach actualNode.nodeSet as node do
        if (node.color == 0)
            number = number + depth_first(node);
    return number;
}

如果我误解了你的问题，请告诉我，以便能够将我的答案编辑为更有用的答案。

I assume that by "count the sub-graph" you mean counting the nodes in a sub-graph. If that's what you need, you can use any scripting language and will have to store the graph, first of all, by creating a structure or class where you store your graph, the node structure/class should look like this (this is not conforming the syntax of any language, this is only a plan for your application):

Node {color = 0; title = ""; minusNodeSet = null; plusNodeSet = null}

Where color = 0 (the defaul value of color means you haven't visited this node before), title will be 'a', 'b', 'c', and so on. minusNodeSet is a Set of Nodes where those nodes are stored, where a minus vertice points from our Node, plusNodeSet is a Set of Nodes where those nodes are stored, where a plus vertice points from our Node.

Now, we have an architecture and should use it in a depth-first algoritm:

int depth_first(Node actualNode)
{
    if (actualNode.color == 1)
    return;

    number = 1;
    actualNode.color = 1;
    foreach actualNode.nodeSet as node do
        if (node.color == 0)
            number = number + depth_first(node);
    return number;
}

If I misunderstood your question, please, tell me, to be able to edit my answer to be a more useful one.

回复收藏 0 原文

仙女山的月亮 2024-10-09 05:10:58

我的其他中的正则表达式的结构答案类似于列表单子处理。受此启发，对传递子图的搜索如下作为 literate Haskell。将此答案复制并粘贴到扩展名为 .lhs 的文件中以获取工作程序。请务必用空行将由前导 > 标记的代码部分括起来。

感谢您提出有趣的问题！

前面的一点：

> {-# LANGUAGE ViewPatterns #-}

> module Main where

> import Control.Monad (guard)
> import Data.List (nub)
> import Data.Map (findWithDefault,fromListWith,toList)

基因的名称可以是任何字符串，对于给定的 Gene g，类型为 PosReg 的函数应该给我们所有g正向调节的基因。

> type Gene = String
> type PosReg = Gene -> [Gene]

从您问题中指定的图表中，我们需要基因的三元组，以便“正调节”关系是传递的，并且子图描述了所需的属性。首先，从图中选择一个任意基因x。接下来，选择 x 正向调节的基因 y 之一。为了保持传递性，z 必须是 x 和 y 正向调节的基因。

> subgraphs :: String -> [(Gene,Gene,Gene)]
> subgraphs g = do
>   x <- choose
>   y <- posRegBy x
>   z <- posRegBy y
>   guard $ z `elem` posRegBy x
>   return (x,y,z)
>   where (choose,posRegBy) = decode g

通过decode中的简单解析器，我们提取了图中的基因列表和一个PosReg函数，该函数给出了受其他基因正向调节的所有基因。

> decode :: String -> ([Gene], PosReg)
> decode g =
>   let pr = fromListWith (++) $ go (lines g)
>       gs = nub $ concatMap (\(a,b) -> a : b) $ toList pr
>   in (gs, (\x -> findWithDefault [] x pr))
>   where
>     go ((words -> [a, op, b]):ls)
>       | op == "+" = (a,[b]) : go ls
>       | otherwise = go ls
>     go _ = []

最后，主程序将它们粘合在一起。对于找到的每个子图，将其打印到标准输出。

> main :: IO ()
> main = mapM_ (putStrLn . show) $ subgraphs graph
>   where graph = "a + b\n\
>                 \b + c\n\
>                 \c - f\n\
>                 \b - d\n\
>                 \a + c\n\
>                 \f + g\n\
>                 \g + h\n\
>                 \f + h\n"

输出：

("a","b","c")
("f","g","h")

The structure of the regex in my other answer resembles list-monad processing. Given that inspiration, a search for the transitive subgraphs is below as a literate Haskell. Copy-and-paste this answer to a file with the extension .lhs to get a working program. Be sure to surround the code sections, marked by leading >, with empty lines.

Thanks for the fun problem!

A bit of front matter:

> {-# LANGUAGE ViewPatterns #-}

> module Main where

> import Control.Monad (guard)
> import Data.List (nub)
> import Data.Map (findWithDefault,fromListWith,toList)

The name of a gene can be any string, and for a given Gene g, a function of type PosReg should give us all the genes that g positively regulates.

> type Gene = String
> type PosReg = Gene -> [Gene]

From a graph specified as in your question, we want triples of genes such that the is-positively-regulated-by relation is transitive, and subgraphs describes the desired properties. First, pick an arbitrary gene x from the graph. Next, choose one of the genes y that x positively regulates. For the transitive property to hold, z must be a gene that both x and y positively regulate.

> subgraphs :: String -> [(Gene,Gene,Gene)]
> subgraphs g = do
>   x <- choose
>   y <- posRegBy x
>   z <- posRegBy y
>   guard $ z `elem` posRegBy x
>   return (x,y,z)
>   where (choose,posRegBy) = decode g

With the simple parser in decode, we distill the list of genes in the graph and a PosReg function that gives all genes positively regulated by some other gene.

> decode :: String -> ([Gene], PosReg)
> decode g =
>   let pr = fromListWith (++) $ go (lines g)
>       gs = nub $ concatMap (\(a,b) -> a : b) $ toList pr
>   in (gs, (\x -> findWithDefault [] x pr))
>   where
>     go ((words -> [a, op, b]):ls)
>       | op == "+" = (a,[b]) : go ls
>       | otherwise = go ls
>     go _ = []

Finally, the main program glues it all together. For each subgraph found, print it to the standard output.

> main :: IO ()
> main = mapM_ (putStrLn . show) $ subgraphs graph
>   where graph = "a + b\n\
>                 \b + c\n\
>                 \c - f\n\
>                 \b - d\n\
>                 \a + c\n\
>                 \f + g\n\
>                 \g + h\n\
>                 \f + h\n"

Output: