当前位置：文江博客话题详情

C# algorithm data-structures text-segmentation

解析连续字符串中的单词

发布于 2024-11-16 08:40:51 字数 290 浏览 5 评论 0原文

如果 a 有一个包含单词且没有空格的字符串，那么鉴于我有一个包含这些单词的字典/列表，我应该如何解析这些单词？

例如，如果我的字符串是“thisisastringwithwords”，我如何使用字典来创建输出“这是一个带单词的字符串”？

我听说使用数据结构 Tries 可能会有所帮助，但也许有人可以帮助编写伪代码？例如，我在想也许你可以将字典索引到特里结构中，然后沿着特里结构跟踪每个字符；问题是，我不熟悉如何在（伪）代码中执行此操作。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

赏烟花じ飞满天 2024-11-23 08:40:51

我假设您想要一种有效的解决方案，而不是反复检查文本是否以字典单词开头的明显解决方案。

如果字典足够小，我想你可以尝试修改标准 KMP 算法。基本上，在你的字典上构建一个有限状态机，它逐个字符地消耗文本并生成构造的单词。

编辑：看来我正在重新发明尝试。

回复收藏 0 原文

箜明 2024-11-23 08:40:51

我已经做过类似的事情了。您不能使用简单的字典。结果会很混乱。这取决于您是否只需执行一次或作为整个程序执行此操作。

我的解决方案是：

连接到数据库
字典列表中的单词（例如
例如在线词典）
过滤词典中的长单词和短单词，并检查是否要修剪内容（例如，不要使用只有一个字符的单词，如'I'）
从短单词开始，然后比较您的单词bigString 与数据库字典。

现在您需要创建一个“可能性表”。因为很多词可以放进100%却是错误的。单词越长，您就越确定这个单词是正确的。

它是 CPU 密集型的，但它可以在结果中精确工作。
假设您正在使用一个包含 10,000 个单词的小字典，其中 3,000 个单词的长度为 8 个字符，您需要在开始时将您的 bigString 与所有 3,000 个单词进行比较，只有找到结果，才允许继续进行下一个词。如果您的 bigString 有 200 个字符，则需要大约（2000 个字符/8 个平均字符）= 至少 250 个完整循环进行比较。

对于我来说，我还在比较中对拼写错误的单词进行了小的验证。

程序示例（请勿复制粘贴）

    Dim bigString As String = "helloworld.thisisastackoverflowtest!"

    Dim dictionary As New List(Of String) 'contains the original words. lets make it case insentitive
    dictionary.Add("Hello")
    dictionary.Add("World")
    dictionary.Add("this")
    dictionary.Add("is")
    dictionary.Add("a")
    dictionary.Add("stack")
    dictionary.Add("over")
    dictionary.Add("flow")
    dictionary.Add("stackoverflow")
    dictionary.Add("test")
    dictionary.Add("!")


    For Each word As String In dictionary
        If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real)
        word = word.ToLower 'make it case insentitive
    Next

    Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result

    Dim i As Integer = 0 'start at the beginning
    Dim Found As Boolean = False
    Do
        For Each word In dictionary
            If bigString.IndexOf(word, i) > 0 Then
                ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value 
                Found = True
            End If
        Next
        If Found = True Then
            i += ResultComparer(BestWordWithBestWeight).Length
        Else
            i += 1
        End If
    Loop

I already did something similar. You cannot use a simple dictionary. The result will be messy. It depends if you only have to do this once or as whole program.

My solution was to:

Connect to a database with working
words from a dictionary list (for
example online dictionary)
Filter long and short words in dictionary and check if you want to trim stuff (for example don't use words with only one character like 'I')
Start with short words and compare your bigString with the database dictionary.

Now you need to create a "table of possibility". Because a lot of words can fit into 100% but are wrong. As longer the word as more sure you are, that this word is the right one.

It is cpu intensive but it can work precise in the result.
So lets say, you are using a small dictionary of 10,000 words and 3,000 of them are with a length of 8 characters, you need to compare your bigString at start with all 3,000 words and only if result was found, it is allowed to proceed to the next word. If you have 200 characters in your bigString you need about (2000chars / 8 average chars) = 250 full loops minimum with comparation.

For me, I also did a small verification of misspelled words into the comparation.

example of procedure (don't copy paste)

    Dim bigString As String = "helloworld.thisisastackoverflowtest!"

    Dim dictionary As New List(Of String) 'contains the original words. lets make it case insentitive
    dictionary.Add("Hello")
    dictionary.Add("World")
    dictionary.Add("this")
    dictionary.Add("is")
    dictionary.Add("a")
    dictionary.Add("stack")
    dictionary.Add("over")
    dictionary.Add("flow")
    dictionary.Add("stackoverflow")
    dictionary.Add("test")
    dictionary.Add("!")


    For Each word As String In dictionary
        If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real)
        word = word.ToLower 'make it case insentitive
    Next

    Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result

    Dim i As Integer = 0 'start at the beginning
    Dim Found As Boolean = False
    Do
        For Each word In dictionary
            If bigString.IndexOf(word, i) > 0 Then
                ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value 
                Found = True
            End If
        Next
        If Found = True Then
            i += ResultComparer(BestWordWithBestWeight).Length
        Else
            i += 1
        End If
    Loop

回复收藏 0 原文

会傲 2024-11-23 08:40:51

我告诉过你，这似乎是一项不可能完成的任务。但是你可以看看这个相关的SO问题 - 它可能对你有帮助。

回复收藏 0 原文

深者入戏 2024-11-23 08:40:51

如果你确定你在字典中拥有该短语的所有单词，你可以使用该算法：

String phrase = "thisisastringwithwords";
String fullPhrase = "";
Set<String> myDictionary;
do {
    foreach(item in myDictionary){
        if(phrase.startsWith(item){
            fullPhrase += item + " ";
            phrase.remove(item);
            break;
        }
    }
} while(phrase.length != 0);

有很多复杂的情况，例如，某些项目以相同的方式开始，因此代码将更改为使用一些树搜索，BST 等。

If you are sure you have all the words of the phrase in the dictionary, you can use that algo:

String phrase = "thisisastringwithwords";
String fullPhrase = "";
Set<String> myDictionary;
do {
    foreach(item in myDictionary){
        if(phrase.startsWith(item){
            fullPhrase += item + " ";
            phrase.remove(item);
            break;
        }
    }
} while(phrase.length != 0);

There are so many complications, like, some items starting equally, so the code will be changed to use some tree search, BST or so.

回复收藏 0 原文

感情洁癖 2024-11-23 08:40:51

这正是尝试以编程方式解析诸如中文之类的单词之间没有空格的语言时遇到的问题。适用于这些语言的一种方法是首先根据标点符号拆分文本。这将为您提供短语。接下来，您将迭代这些短语，并尝试从词典中最长单词的长度开始将它们分解为单词。假设长度为 13 个字符。取出该短语的前 13 个字符，看看它是否在您的字典中。如果是这样，暂时将其视为正确的单词，继续该短语并重复。否则，将子字符串缩短为 12 个字符，然后是 11 个字符，等等。

这种方法效果非常好，但并不完美，因为我们不小心对出现在前面的单词产生了偏见。消除这种偏差并仔细检查结果的一种方法是从短语末尾开始重复该过程。如果你得到同样的断词，你可能会说它很好。如果不是，则有重叠的词段。例如，当您从末尾开始解析示例短语时，您可能会得到（向后强调）

words with string a Isis th

首先，单词 Isis（埃及女神）似乎是正确的单词。然而，当您发现“th”不在您的字典中时，您就知道附近存在分词问题。通过使用非对齐序列“thisis”的前向分段结果“this is”来解决此问题，因为这两个单词都在字典中。

这个问题的一个不太常见的变体是相邻单词共享一个可以任意方向的序列。如果你有一个像“archand”（编造一些东西）这样的序列，它应该是“arc hand”还是“arch and”？确定的方法是对结果应用语法检查器。无论如何，应该对整个文本进行此操作。

This is the exact problem one has when trying to programmatically parse languages like Chinese where there are no spaces between words. One method that works with those languages is to start by splitting text on punctuation. This gives you phrases. Next you iterate over the phrases and try to break them into words starting with the length of the longest word in your dictionary. Let's say that length is 13 characters. Take the first 13 characters from the phrase and see if it is in your dictionary. If so, take it as a correct word for now, move forward in the phrase and repeat. Otherwise, shorten your substring to 12 characters, then 11 characters, etc.

This works extremely well, but not perfectly because we've accidentally put in a bias towards words that come first. One way to remove this bias and double check your result is to repeat the process starting at the end of the phrase. If you get the same word breaks you can probably call it good. If not, you have an overlapping word segment. For example, when you parse your sample phrase starting at the end you might get (backwards for emphasis)

words with string a Isis th

At first, the word Isis (Egyptian Goddess) appears to be the correct word. When you find that "th" is not in your dictionary, however, you know there is a word segmentation problem nearby. Resolve this by going with the forward segmentation result "this is" for the non-aligned sequence "thisis" since both words are in the dictionary.

A less common variant of this problem is when adjacent words share a sequence which could go either way. If you had a sequence like "archand" (to make something up), should it be "arc hand" or "arch and"? The way to determine is to apply a grammar checker to the results. This should be done to the whole text anyway.

回复收藏 0 原文

天涯离梦残月幽梦 2024-11-23 08:40:51

好吧，我会尝试一下。对于您的问题来说，完美的数据结构是（正如您所说的特里树）由字典中的单词组成。 trie 最好被可视化为 DFA，一个很好的状态机，你可以从一个状态开始到每个新角色的下一个。这在代码中确实很容易做到，Java（ish）风格的类将是：

Class State 
{
   String matchedWord;
   Map<char,State> mapChildren;
}

从这里开始，构建 trie 就很容易了。它就像一个有根树结构，每个节点都有多个子节点。每个孩子都会在一次角色转变时被拜访。使用HashMap 类型的结构可以缩短查找字符到下一个State 映射的时间。或者，如果您的字母表只有 26 个字符，那么 26 的固定大小数组 也可以达到目的。

现在，假设所有这些都有意义，您有一个尝试，您的问题仍然没有完全解决。这是你开始做正则表达式引擎所做的事情的地方，沿着 trie 走下去，跟踪与字典中的整个单词匹配的状态（这就是我在 matchedWord 中的作用） >State 结构），如果当前路径遇到死胡同，请使用一些回溯逻辑跳转到先前的匹配状态。我知道它的一般情况，但考虑到特里结构，其余的就相当简单了。

Ok, I will make a hand wavy attempt at this. The perfect(ish) data structure for your problem is (as you've said a trie) made up of the words in the dictionary. A trie is best visualised as a DFA, a nice state machine where you go from one state to the next on every new character. This is really easy to do in code, a Java(ish) style class for this would be :

Class State 
{
   String matchedWord;
   Map<char,State> mapChildren;
}

From hereon, building the trie is easy. Its like having a rooted tree structure with each node having multiple children. Each child is visited on one character transition. The use of a HashMap kind of structure trims down time to look up character to next State mappings. Alternately if all you have are 26 characters for the alphabet, a fixed size array of 26 would do the trick as well.

Now, assuming all of that made sense, you have a trie, your problem still isn't fully solved. This is where you start doing things like regular expressions engines do, walk down the trie, keep track of states which match to a whole word in the dictionary (thats what I had the matchedWord for in the State structure), use some backtracking logic to jump to a previous match state if the current trail hits a dead end. I know its general but given the trie structure, the rest is fairly straightforward.

回复收藏 0 原文

辞别 2024-11-23 08:40:51

如果您有单词字典并且需要快速实现，则可以通过动态编程在 O(n^2) 时间内有效地解决这个问题，假设字典查找为 O(1)。下面是一些 C# 代码，可以改进子字符串提取和字典查找。

public static String[] StringToWords(String str, HashSet<string> words)
{      
  //Index of char - length of last valid word
  int[] bps = new int[str.Length + 1];

  for (int i = 0; i < bps.Length; i++)      
    bps[i] = -1;

  for (int i = 0; i < str.Length; i++)
  {
    for (int j = i + 1; j <= str.Length ; j++)
    {
      if (bps[j] == -1)
      {
        //Destination cell doesn't have valid backpointer yet
        //Try with the current substring
        String s = str.Substring(i, j - i);
        if (words.Contains(s))
          bps[j] = i;
      }
    }        
  }      

  //Backtrack to recovery sequence and then reverse 
  List<String> seg = new List<string>();
  for (int bp = str.Length; bps[bp] != -1 ;bp = bps[bp])      
    seg.Add(str.Substring(bps[bp], bp - bps[bp]));      
  seg.Reverse();
  return seg.ToArray();
}

使用 /usr/share/dict/words 中的单词列表构建一个 hastset 并进行测试，

foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict))
    Console.WriteLine(s);

得到输出“t hi 是一个包含单词的字符串”。因为正如其他人指出的那样，该算法将返回有效的分段（如果存在），但这可能不是您期望的分段。短单词的存在会降低分段质量，如果两个有效的子分段进入一个元素，您可能可以添加启发式以支持较长的单词。

有更复杂的方法，有限状态机和语言模型可以生成多个分段并应用概率排名。

If you have dictionary of words and need a quick implmentation this can be solved efficiently with dynamic programming in O(n^2) time, assuming the dictionary lookups are O(1). Below is some C# code, the substring extraction could and dictionary lookup could be improved.

public static String[] StringToWords(String str, HashSet<string> words)
{      
  //Index of char - length of last valid word
  int[] bps = new int[str.Length + 1];

  for (int i = 0; i < bps.Length; i++)      
    bps[i] = -1;

  for (int i = 0; i < str.Length; i++)
  {
    for (int j = i + 1; j <= str.Length ; j++)
    {
      if (bps[j] == -1)
      {
        //Destination cell doesn't have valid backpointer yet
        //Try with the current substring
        String s = str.Substring(i, j - i);
        if (words.Contains(s))
          bps[j] = i;
      }
    }        
  }      

  //Backtrack to recovery sequence and then reverse 
  List<String> seg = new List<string>();
  for (int bp = str.Length; bps[bp] != -1 ;bp = bps[bp])      
    seg.Add(str.Substring(bps[bp], bp - bps[bp]));      
  seg.Reverse();
  return seg.ToArray();
}

Building a hastset with the word list from /usr/share/dict/words and testing with

foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict))
    Console.WriteLine(s);

I get the output "t hi sis a string with words". Because as others have pointed out this algorithm will return a valid segmentation (if one exists), however this may not be the segmentation you expect. The presence of short words is reducing the segmentation quality, you might be able to add heuristic to favour longer words if two valid sub-segmentation enter an element.

There are more sophisticated methods that finite state machines and language models that can generate multiple segmentations and apply probabilistic ranking.

回复收藏 0 原文

~没有更多了~