根据排序标准选择最佳集合

发布于 2024-09-10 10:11:57 字数 626 浏览 3 评论 0原文

我得到了一个字符串和一组规则,这些规则通过一个在这里并不重要的过程来选择有效的子字符串。给定所有有效子字符串的枚举,我必须根据一组排名标准找到最佳的子字符串集,例如:

  1. 子字符串不得重叠
  2. 所有字符必须是子字符串的一部分(如果可能)
  3. 使用尽可能少的不同子字符串
  4. 等 。

例如,给定字符串 abc 和子字符串 [a, ab, bc],根据上述规则的最佳子字符串集是 [a, bc] ]

目前,我正在通过标准的朴素算法来执行此操作,该算法枚举所有可能的子字符串集,然后迭代它们以找到最佳候选者。问题在于,随着字符串长度和子字符串数量的增加,可能的集合数量呈指数增长。对于 50 个子字符串(此应用程序完全有可能),要枚举的集合数量为 2^50,这极其令人望而却步。

似乎应该有一种方法来避免生成许多明显会失败的集合,或者通过算法收敛到最佳集合,而不必盲目地生成每个候选集合。有哪些选择?

请注意,对于此应用程序,使用提供统计而非绝对保证的算法可能是可以接受的,例如击中非最佳候选者的 n% 机会,其中 n code> 适当小。

I am given a string, and a set of rules which select valid substrings by a process which isn't important here. Given an enumeration of all valid substrings, I have to find the optimum set of substrings according to a set of ranked criteria, such as:

  1. Substrings may not overlap
  2. All characters must be part of a substring if possible
  3. Use as few different substrings as possible
  4. etc.

For example, given the string abc and the substrings [a, ab, bc], the optimal set of substrings by the preceding rules is [a, bc].

Currently I'm doing this by a standard naive algorithm of enumerating all possible sets of substrings, then iterating over them to find the best candidate. The problem is that as the length of the string and the number of substrings goes up, the number of possible sets increases exponentially. With 50 substrings (well within possibility for this app), the number of sets to enumerate is 2^50, which is extremely prohibitive.

It seems like there should be a way to avoid generating many of the sets that will obviously be losers, or to algorithmically converge on the optimum set without having to blindly generate every candidate. What options are there?

Note that for this application it may be acceptable to use an algorithm that offers a statistical rather than absolute guarantee, such as an n% chance of hitting a non-optimal candidate, where n is suitably small.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

木落 2024-09-17 10:11:57

在我看来,需要一个树结构。

基本上,您的初始分支位于所有子字符串上,然后是除第一轮中使用的子字符串之外的所有子字符串,等等,一直到底部。你是对的,这个分支到 2^50,但如果你使用 ab-pruning 来快速终止明显较差的分支,然后添加一些记忆来修剪你见过的情况,然后你才能大大加快速度。

你可能需要进行大量的人工智能学习才能获得这一切,但关于 ab 修剪和换位表的维基百科页面将为你提供一个开始。

编辑:
是的,你说得对,可能还不够清楚。
假设您的示例“ABABABAB BABABABA”带有子字符串{“ABAB”,“BABA”}。
如果您将评估函数设置为简单地将浪费的字符视为坏字符,那么树将像这样:

ABAB (eval=0)
  ABAB (eval=0)
    ABAB (eval=2 because we move past/waste a space char and a B)
      [missing expansion]
    BABA (eval=1 because we only waste the space)
      ABAB (eval=2 now have wasted the space above and a B at this level)
      BABA (eval=1 still only wasted the space)*
  BABA (eval=1 prune here because we already have a result that is 1)
BABA (eval=1 prune here for same reason)

*最佳解决方案

我怀疑简单的“浪费的字符”在这个不平凡的示例中是不够的,但它确实在这里修剪了一半的树。

Looks to me like a tree structure is needed.

Basically your initial branching is on all the substrings, then all but the one you used in the first round etc all the way to the bottom. You're right in that this branches to 2^50 but if you use ab-pruning to quickly terminate branches that are obviously inferior and then add some memoization to prune situations you've seen before you could speed up considerably.

You'll probably have to do a fair amount of AI learning to get it all but wikipedia pages on ab-pruning and transposition tables will get you a start.

edit:
Yep you're right, probably not clear enough.
Assuming your example "ABABABAB BABABABA" with substrings {"ABAB","BABA"}.
If you set your evaluation function to simply treat wasted characters as bad the tree will go something like this:

ABAB (eval=0)
  ABAB (eval=0)
    ABAB (eval=2 because we move past/waste a space char and a B)
      [missing expansion]
    BABA (eval=1 because we only waste the space)
      ABAB (eval=2 now have wasted the space above and a B at this level)
      BABA (eval=1 still only wasted the space)*
  BABA (eval=1 prune here because we already have a result that is 1)
BABA (eval=1 prune here for same reason)

*best solution

I suspect the simple 'wasted chars' isn't enough in the non trivial example but it does prune half the tree here.

空心空情空意 2024-09-17 10:11:57

这是 Haskell 中的一个可行的解决方案。我将唯一的子字符串称为符号,并将一次出现的子字符串的关联称为位置。我还将标准 3(“使用尽可能少的不同子字符串”)解释为“使用尽可能少的符号”,而不是“使用尽可能少的位置”。

这是一种动态规划方法;实际的修剪是由于记忆而发生的。理论上,一个智能的 haskell 实现可以为你做到这一点,(但是还有其他方法在您包装 makeFindBest 的地方),我建议使用位字段来表示使用的符号,并仅使用一个整数来表示剩余的字符串。优化是可能的,因为:给定使用相同符号集的字符串 S1 和 S2 的最佳解决方案,如果将 S1 和 S2 连接起来,则可以以类似的方式连接两个解决方案,并且新的解决方案将是最佳的。因此,对于输入字符串的每个分区,makeFindBest 只需要在后缀上针对前缀中使用的每个可能的符号集进行一次评估。

我还按照丹尼尔的回答中的建议集成了分支定界修剪;这利用了评估函数,跳过的字符越多,评估函数就越差。处理的字符数的成本是单调的,因此,如果我们发现一组仅浪费 alpha 字符的展示位置,那么我们再也不会尝试跳过超过 alpha 的字符> 字符。

其中n是字符串长度,m是符号数量,最坏的情况是O(m^n),而m是O(2^n)。请注意,删除约束 3 将使事情变得更快:记忆化只需要通过剩余的字符串进行参数化,该字符串是 O(n) 缓存,而不是 O(n em> * 2^m)!

使用字符串搜索/匹配算法,例如 Aho-Corasick 的字符串匹配算法,将我在这里使用的consume/drop 1模式从指数改进为二次。然而,这本身并不能避免匹配组合的阶乘增长,而这正是动态规划的帮助之处。

另请注意您的第四个“等”。如果标准以某种方式限制问题,从而可以进行更积极的修剪,或者需要回溯,那么它可能会极大地改变问题!

module Main where

import List
import Maybe
import System.Environment

type Symbol = String
type Placement = String

-- (remaining, placement or Nothing to skip one character)
type Move = (String, Maybe Placement)

-- (score, usedsymbols, placements)
type Solution = (Int, [Symbol], [Placement])

-- invoke like ./a.out STRING SPACE-SEPARATED-SYMBOLS ...
-- e.g. ./a.out "abcdeafghia" "a bc fg"
-- output is a list of placements
main = do
  argv <- System.Environment.getArgs
  let str = head argv
      symbols = concat (map words (tail argv))
  (putStr . show) $ findBest str symbols
  putStr "\n"

getscore :: Solution -> Int
getscore (sc,_,_) = sc

-- | consume STR SYM consumes SYM from the start of STR.  returns (s, SYM)
-- where s is the rest of STR, after the consumed occurrence, or Nothing if
-- SYM isnt a prefix of STR.
consume :: String -> Symbol -> Maybe Move
consume str sym = if sym `isPrefixOf` str
                  then (Just (drop (length sym) str, (Just sym)))
                  else Nothing

-- | addToSoln SYMBOLS P SOL incrementally updates SOL with the new SCORE and
-- placement P
addToSoln :: [Symbol] -> Maybe Placement -> Solution -> Solution
addToSoln symbols Nothing (sc, used, ps) = (sc - (length symbols) - 1, used, ps)
addToSoln symbols (Just p) (sc, used, ps) = 
  if p `elem` symbols
  then (sc - 1, used `union` [p], p : ps)
  else (sc, used, p : ps)

reduce :: [Symbol] -> Solution -> Solution -> [Move] -> Solution
reduce _ _ cutoff [] = cutoff
reduce symbols parent cutoff ((s,p):moves) =
    let sol = makeFindBest symbols (addToSoln symbols p parent) cutoff s
        best = if (getscore sol) > (getscore cutoff)
               then sol
               else cutoff
    in reduce symbols parent best moves

-- | makeFindBest SYMBOLS PARENT CUTOFF STR searches for the best placements
-- that can be made on STR from SYMBOLS, that are strictly better than CUTOFF,
-- and prepends those placements to PARENTs third element.
makeFindBest :: [Symbol] -> Solution -> Solution -> String -> Solution
makeFindBest _ cutoff _ "" = cutoff
makeFindBest symbols parent cutoff str =
  -- should be memoized by (snd parent) (i.e. the used symbols) and str
  let moves = if (getscore parent) > (getscore cutoff)
              then (mapMaybe (consume str) symbols) ++ [(drop 1 str, Nothing)]
              else (mapMaybe (consume str) symbols)
  in reduce symbols parent cutoff moves

-- a solution that makes no placements
worstScore str symbols = -(length str) * (1 + (length symbols))

findBest str symbols =
  (\(_,_,ps) -> reverse ps)
  (makeFindBest symbols (0, [], []) (worstScore str symbols, [], []) str)

Here's a working solution in Haskell. I have called the unique substrings symbols, and an association of one occurrence of the substrings a placement. I have also interpreted criterion 3 ("Use as few different substrings as possible") as "use as few symbols as possible", as opposed to "use as few placements as possible".

This is a dynamic programming approach; the actual pruning occurs due to the memoization. Theoretically, a smart haskell implementation could do it for you, (but there are other ways where you wrap makeFindBest), I'd suggest using a bitfield to represent the used symbols and just an integer to represent the remaining string. The optimisation is possible from the fact that: given optimal solutions for the strings S1 and S2 that both use the same set of symbols, if S1 and S2 are concatenated then the two solutions can be concatenated in a similar manner and the new solution will be optimal. Hence for each partition of the input string, makeFindBest need only be evaluated once on the postfix for each possible set of symbols used in the prefix.

I've also integrated branch-and-bound pruning as suggested in Daniel's answer; this makes use of an evaluation function which becomes worse the more characters skipped. The cost is monotonic in the number of characters processed, so that if we have found a set of placements that wasted only alpha characters, then we never again try to skip more than alpha characters.

Where n is the string length and m is the number of symbols, the worst case is O(m^n) naively, and m is O(2^n). Note that removing constraint 3 would make things much quicker: the memoization would only need to be parameterized by the remaining string which is an O(n) cache, as opposed to O(n * 2^m)!

Using a string search/matching algorithm such as Aho-Corasick's string matching algorithm, improves the consume/drop 1 pattern I use here from exponential to quadratic. However, this by itself doesn't avoid the factorial growth in the combinations of the matches, which is where the dynamic programming helps.

Also note that your 4th "etc." criteria could possibly change the problem a lot if it constrains the problem in a way that makes it possible to do more aggressive pruning, or requires backtracking!

module Main where

import List
import Maybe
import System.Environment

type Symbol = String
type Placement = String

-- (remaining, placement or Nothing to skip one character)
type Move = (String, Maybe Placement)

-- (score, usedsymbols, placements)
type Solution = (Int, [Symbol], [Placement])

-- invoke like ./a.out STRING SPACE-SEPARATED-SYMBOLS ...
-- e.g. ./a.out "abcdeafghia" "a bc fg"
-- output is a list of placements
main = do
  argv <- System.Environment.getArgs
  let str = head argv
      symbols = concat (map words (tail argv))
  (putStr . show) $ findBest str symbols
  putStr "\n"

getscore :: Solution -> Int
getscore (sc,_,_) = sc

-- | consume STR SYM consumes SYM from the start of STR.  returns (s, SYM)
-- where s is the rest of STR, after the consumed occurrence, or Nothing if
-- SYM isnt a prefix of STR.
consume :: String -> Symbol -> Maybe Move
consume str sym = if sym `isPrefixOf` str
                  then (Just (drop (length sym) str, (Just sym)))
                  else Nothing

-- | addToSoln SYMBOLS P SOL incrementally updates SOL with the new SCORE and
-- placement P
addToSoln :: [Symbol] -> Maybe Placement -> Solution -> Solution
addToSoln symbols Nothing (sc, used, ps) = (sc - (length symbols) - 1, used, ps)
addToSoln symbols (Just p) (sc, used, ps) = 
  if p `elem` symbols
  then (sc - 1, used `union` [p], p : ps)
  else (sc, used, p : ps)

reduce :: [Symbol] -> Solution -> Solution -> [Move] -> Solution
reduce _ _ cutoff [] = cutoff
reduce symbols parent cutoff ((s,p):moves) =
    let sol = makeFindBest symbols (addToSoln symbols p parent) cutoff s
        best = if (getscore sol) > (getscore cutoff)
               then sol
               else cutoff
    in reduce symbols parent best moves

-- | makeFindBest SYMBOLS PARENT CUTOFF STR searches for the best placements
-- that can be made on STR from SYMBOLS, that are strictly better than CUTOFF,
-- and prepends those placements to PARENTs third element.
makeFindBest :: [Symbol] -> Solution -> Solution -> String -> Solution
makeFindBest _ cutoff _ "" = cutoff
makeFindBest symbols parent cutoff str =
  -- should be memoized by (snd parent) (i.e. the used symbols) and str
  let moves = if (getscore parent) > (getscore cutoff)
              then (mapMaybe (consume str) symbols) ++ [(drop 1 str, Nothing)]
              else (mapMaybe (consume str) symbols)
  in reduce symbols parent cutoff moves

-- a solution that makes no placements
worstScore str symbols = -(length str) * (1 + (length symbols))

findBest str symbols =
  (\(_,_,ps) -> reverse ps)
  (makeFindBest symbols (0, [], []) (worstScore str symbols, [], []) str)
天荒地未老 2024-09-17 10:11:57

这听起来像是一个动态编程问题。您可以在其上找到许多好的资源,但要点是生成子问题的集合,然后通过组合最佳子解决方案来构建“更大”的最佳解决方案。

This smells like a dynamic programming problem. You can find a number of good sources on it, but the gist is that you generate a collection of subproblems, and then build up "larger" optimal solutions by combining optimal subsolutions.

许久 2024-09-17 10:11:57

这是一个用 C++ 重写的答案,使用 Aho-Corasick 字符串匹配算法Dijkstra 算法。这应该更接近您的目标语言 C#。

Aho-Corasick 步骤根据模式集构造一个自动机(基于后缀树),然后使用该自动机查找输入字符串中的所有匹配项。然后,Dijkstra 算法将这些匹配视为 DAG 中的节点,并向字符串末尾移动,寻找成本最低的路径。

这种方法更容易分析,因为它只是结合了两种易于理解的算法。

构造 Aho-Corasick 自动机是模式长度的线性时间,然后搜索是输入字符串 + 匹配的累积长度的线性时间。

假设 STL 高效,Dijkstra 算法的运行时间为 O(|E| + |V| log |V|)。该图是一个 DAG,其中顶点对应于匹配项或跳过的字符串。边缘权重是使用额外模式或跳过字符的惩罚。如果两个匹配相邻且不重叠,则它们之间存在边。如果这是 m 和与某些匹配 m2 重叠的另一个匹配 m2 之间可能的最短跳跃,则存在从匹配 m 到跳跃的边em>m3 从与跳跃相同的位置开始(唷!)。 Dijkstra 算法的结构确保了当我们到达输入字符串末尾时,最佳答案是第一个找到的答案(它实现了 Daniel 隐式建议的剪枝)。

#include <iostream>
#include <queue>
#include <vector>
#include <list>
#include <string>
#include <algorithm>
#include <set>

using namespace std;

static vector<string> patterns;
static string input;
static int skippenalty;

struct acnode {
      acnode() : failure(NULL), gotofn(256) {}
      struct acnode *failure;
      vector<struct acnode *> gotofn;
      list<int> outputs; // index into patterns global
};

void
add_string_to_trie(acnode *root, const string &s, int sid)
{
   for (string::const_iterator p = s.begin(); p != s.end(); ++p) {
      if (!root->gotofn[*p])
     root->gotofn[*p] = new acnode;
      root = root->gotofn[*p];
   }
   root->outputs.push_back(sid);
}

void
init_tree(acnode *root)
{
   queue<acnode *> q;
   unsigned char c = 0;
   do {
      if (acnode *u = root->gotofn[c]) {
         u->failure = root;
         q.push(u);
      } else
         root->gotofn[c] = root;
   } while (++c);
   while (!q.empty()) {
      acnode *r = q.front();
      q.pop();

      do {
         acnode *u, *v;
         if (!(u = r->gotofn[c]))
            continue;
         q.push(u);
         v = r->failure;
         while (!v->gotofn[c])
            v = v->failure;
         u->failure = v->gotofn[c];
         u->outputs.splice(u->outputs.begin(), v->gotofn[c]->outputs);
      } while (++c);
   }
}

struct match { int begin, end, sid; };

void
ahocorasick(const acnode *state, list<match> &out, const string &str)
{
   int i = 1;
   for (string::const_iterator p = str.begin(); p != str.end(); ++p, ++i) {
      while (!state->gotofn[*p])
         state = state->failure;
      state = state->gotofn[*p];
      for (list<int>::const_iterator q = state->outputs.begin();
           q != state->outputs.end(); ++q) {
         struct match m = { i - patterns[*q].size(), i, *q };
         out.push_back(m);
      }
   }
}

////////////////////////////////////////////////////////////////////////
bool operator<(const match& m1, const match& m2)
{
   return m1.begin < m2.begin
      || (m1.begin == m2.end && m1.end < m2.end);
}

struct dnode {
      int usedchars;
      vector<bool> usedpatterns;
      int last;
};

bool operator<(const dnode& a, const dnode& b) {
   return a.usedchars > b.usedchars
      || (a.usedchars == b.usedchars && a.usedpatterns < b.usedpatterns);
}
bool operator==(const dnode& a, const dnode& b) {
   return a.usedchars == b.usedchars
      && a.usedpatterns == b.usedpatterns;
}

typedef priority_queue<pair<int, dnode>,
               vector<pair<int, dnode> >,
               greater<pair<int, dnode> > > mypq;

void
dijkstra(const vector<match> &matches)
{
   typedef vector<match>::const_iterator mIt;
   vector<bool> used(patterns.size(), false);
   dnode initial = { 0, used, -1 };
   mypq q;

   set<dnode> last;
   dnode d;

   q.push(make_pair(0, initial));
   while (!q.empty()) {
      int cost = q.top().first;
      d = q.top().second;
      q.pop();

      if (last.end() != last.find(d)) // we've been here before
         continue;

      last.insert(d);
      if (d.usedchars >= input.size()) {
         break; // found optimum
      }

      match m = { d.usedchars, 0, 0 };      
      mIt mp = lower_bound(matches.begin(), matches.end(), m);

      if (matches.end() == mp) {
         // no more matches, skip the remaining string
         dnode nextd = d;
         d.usedchars = input.size();
         int skip = nextd.usedchars - d.usedchars;
         nextd.last = -skip;

         q.push(make_pair(cost + skip * skippenalty, nextd));
         continue;
      }

      // keep track of where the shortest match ended; we don't need to
      // skip more than this.
      int skipmax = (mp->begin == d.usedchars) ? mp->end : mp->begin + 1;
      while (mp != matches.end() && mp->begin == d.usedchars) {
         dnode nextd = d;
         nextd.usedchars = mp->end;
         int extra = nextd.usedpatterns[mp->sid] ? 0 : 1; // extra pattern
         int nextcost = cost + extra;
         nextd.usedpatterns[mp->sid] = true;
         nextd.last = mp->sid * 2 + extra; // encode used pattern
         q.push(make_pair(nextcost, nextd));
         ++mp;
      }

      if (mp == matches.end() || skipmax <= mp->begin)
         continue;

      // skip
      dnode nextd = d;
      nextd.usedchars = mp->begin;
      int skip = nextd.usedchars - d.usedchars;
      nextd.last = -skip;
      q.push(make_pair(cost + skip * skippenalty, nextd));
   }

   // unwind
   string answer;
   while (d.usedchars > 0) {
      if (0 > d.last) {
         answer = string(-d.last, '*') + answer;
         d.usedchars += d.last;
      } else {
         answer = "[" + patterns[d.last / 2] + "]" + answer;
         d.usedpatterns[d.last / 2] = !(d.last % 2);
         d.usedchars -= patterns[d.last / 2].length();
      }

      set<dnode>::const_iterator lp = last.find(d);
      if (last.end() == lp) return; // should not happen
      d.last = lp->last;
   }
   cout << answer;
}

int
main()
{
   int n;
   cin >> n; // read n patterns
   patterns.reserve(n);
   acnode root;
   for (int i = 0; i < n; ++i) {
      string s;
      cin >> s;
      patterns.push_back(s);
      add_string_to_trie(&root, s, i);
   }
   init_tree(&root);

   getline(cin, input); // eat the rest of the first line
   getline(cin, input);
   cerr << "got input: " << input << endl;
   list<match> matches;
   ahocorasick(&root, matches, input);

   vector<match> vmatches(matches.begin(), matches.end());
   sort(vmatches.begin(), vmatches.end());
   skippenalty = 1 + patterns.size();

   dijkstra(vmatches);
   return 0;
}

这是一个包含 52 个单字母模式的测试文件(编译然后在标准输入上使用测试文件运行):

52 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz

This is an answer rewritten to use the Aho-Corasick string-matching algorithm and Dijkstra's algorithm, in C++. This should be a lot closer to your target language of C#.

The Aho-Corasick step constructs an automaton (based on a suffix tree) from the set of patterns, and then uses that automaton to find all matches in the input string. Dijkstra's algorithm then treats those matches as nodes in a DAG, and moves toward the end of the string looking for the lowest cost path.

This approach is a lot easier to analyze, as it's simply combining two well-understood algorithms.

Constructing the Aho-Corasick automaton is linear time in the length of the patterns, and then the search is linear in the input string + the cumulative length of the matches.

Dijkstra's algorithm runs in O(|E| + |V| log |V|) assuming an efficient STL. The graph is a DAG, where vertices correspond to matches or to runs of characters that are skipped. Edge weights are the penalty for using an extra pattern or for skipping characters. An edge exists between two matches if they are adjacent and non-overlapping. An edge exists from a match m to a skip if that is the shortest possible skip between m and another match m2 that overlaps with some match m3 starting at the same place as the skip (phew!). The structure of Dijkstra's algorithm ensures that the optimal answer is the first one to be found by the time we reach the end of the input string (it achieves the pruning Daniel suggested implicitly).

#include <iostream>
#include <queue>
#include <vector>
#include <list>
#include <string>
#include <algorithm>
#include <set>

using namespace std;

static vector<string> patterns;
static string input;
static int skippenalty;

struct acnode {
      acnode() : failure(NULL), gotofn(256) {}
      struct acnode *failure;
      vector<struct acnode *> gotofn;
      list<int> outputs; // index into patterns global
};

void
add_string_to_trie(acnode *root, const string &s, int sid)
{
   for (string::const_iterator p = s.begin(); p != s.end(); ++p) {
      if (!root->gotofn[*p])
     root->gotofn[*p] = new acnode;
      root = root->gotofn[*p];
   }
   root->outputs.push_back(sid);
}

void
init_tree(acnode *root)
{
   queue<acnode *> q;
   unsigned char c = 0;
   do {
      if (acnode *u = root->gotofn[c]) {
         u->failure = root;
         q.push(u);
      } else
         root->gotofn[c] = root;
   } while (++c);
   while (!q.empty()) {
      acnode *r = q.front();
      q.pop();

      do {
         acnode *u, *v;
         if (!(u = r->gotofn[c]))
            continue;
         q.push(u);
         v = r->failure;
         while (!v->gotofn[c])
            v = v->failure;
         u->failure = v->gotofn[c];
         u->outputs.splice(u->outputs.begin(), v->gotofn[c]->outputs);
      } while (++c);
   }
}

struct match { int begin, end, sid; };

void
ahocorasick(const acnode *state, list<match> &out, const string &str)
{
   int i = 1;
   for (string::const_iterator p = str.begin(); p != str.end(); ++p, ++i) {
      while (!state->gotofn[*p])
         state = state->failure;
      state = state->gotofn[*p];
      for (list<int>::const_iterator q = state->outputs.begin();
           q != state->outputs.end(); ++q) {
         struct match m = { i - patterns[*q].size(), i, *q };
         out.push_back(m);
      }
   }
}

////////////////////////////////////////////////////////////////////////
bool operator<(const match& m1, const match& m2)
{
   return m1.begin < m2.begin
      || (m1.begin == m2.end && m1.end < m2.end);
}

struct dnode {
      int usedchars;
      vector<bool> usedpatterns;
      int last;
};

bool operator<(const dnode& a, const dnode& b) {
   return a.usedchars > b.usedchars
      || (a.usedchars == b.usedchars && a.usedpatterns < b.usedpatterns);
}
bool operator==(const dnode& a, const dnode& b) {
   return a.usedchars == b.usedchars
      && a.usedpatterns == b.usedpatterns;
}

typedef priority_queue<pair<int, dnode>,
               vector<pair<int, dnode> >,
               greater<pair<int, dnode> > > mypq;

void
dijkstra(const vector<match> &matches)
{
   typedef vector<match>::const_iterator mIt;
   vector<bool> used(patterns.size(), false);
   dnode initial = { 0, used, -1 };
   mypq q;

   set<dnode> last;
   dnode d;

   q.push(make_pair(0, initial));
   while (!q.empty()) {
      int cost = q.top().first;
      d = q.top().second;
      q.pop();

      if (last.end() != last.find(d)) // we've been here before
         continue;

      last.insert(d);
      if (d.usedchars >= input.size()) {
         break; // found optimum
      }

      match m = { d.usedchars, 0, 0 };      
      mIt mp = lower_bound(matches.begin(), matches.end(), m);

      if (matches.end() == mp) {
         // no more matches, skip the remaining string
         dnode nextd = d;
         d.usedchars = input.size();
         int skip = nextd.usedchars - d.usedchars;
         nextd.last = -skip;

         q.push(make_pair(cost + skip * skippenalty, nextd));
         continue;
      }

      // keep track of where the shortest match ended; we don't need to
      // skip more than this.
      int skipmax = (mp->begin == d.usedchars) ? mp->end : mp->begin + 1;
      while (mp != matches.end() && mp->begin == d.usedchars) {
         dnode nextd = d;
         nextd.usedchars = mp->end;
         int extra = nextd.usedpatterns[mp->sid] ? 0 : 1; // extra pattern
         int nextcost = cost + extra;
         nextd.usedpatterns[mp->sid] = true;
         nextd.last = mp->sid * 2 + extra; // encode used pattern
         q.push(make_pair(nextcost, nextd));
         ++mp;
      }

      if (mp == matches.end() || skipmax <= mp->begin)
         continue;

      // skip
      dnode nextd = d;
      nextd.usedchars = mp->begin;
      int skip = nextd.usedchars - d.usedchars;
      nextd.last = -skip;
      q.push(make_pair(cost + skip * skippenalty, nextd));
   }

   // unwind
   string answer;
   while (d.usedchars > 0) {
      if (0 > d.last) {
         answer = string(-d.last, '*') + answer;
         d.usedchars += d.last;
      } else {
         answer = "[" + patterns[d.last / 2] + "]" + answer;
         d.usedpatterns[d.last / 2] = !(d.last % 2);
         d.usedchars -= patterns[d.last / 2].length();
      }

      set<dnode>::const_iterator lp = last.find(d);
      if (last.end() == lp) return; // should not happen
      d.last = lp->last;
   }
   cout << answer;
}

int
main()
{
   int n;
   cin >> n; // read n patterns
   patterns.reserve(n);
   acnode root;
   for (int i = 0; i < n; ++i) {
      string s;
      cin >> s;
      patterns.push_back(s);
      add_string_to_trie(&root, s, i);
   }
   init_tree(&root);

   getline(cin, input); // eat the rest of the first line
   getline(cin, input);
   cerr << "got input: " << input << endl;
   list<match> matches;
   ahocorasick(&root, matches, input);

   vector<match> vmatches(matches.begin(), matches.end());
   sort(vmatches.begin(), vmatches.end());
   skippenalty = 1 + patterns.size();

   dijkstra(vmatches);
   return 0;
}

Here is a test file with 52 single-letter patterns (compile and then run with the test file on stdin):

52 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文