从字符串生成子字符串的组合

发布于 2024-12-08 15:08:53 字数 656 浏览 2 评论 0 原文

我试图为给定的单词生成所有可能的音节组合。识别音节的过程在这里并不相关,但所有组合的生成给我带来了问题。我认为这可能可以在几行中递归地完成(尽管任何其他方式都可以),但我很难让它工作。有人可以帮忙吗?

    // how to test a syllable, just for the purpose of this example
    bool IsSyllable(string possibleSyllable) 
    {
        return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$");
    }

    List<string> BreakIntoSyllables(string word)
    {
       // the code here is what I'm trying to write 
       // if 'word' is "misunderstand" , I'd like this to return
       //  => {"mis","und","er","stand"},{ "mis","un","der","stand"}
       // and for any other combinations to be not included
    }

I'm trying to is generate all possible syllable combinations for a given word. The process for identifying what's a syllable isn't relevant here, but it's the generating of all combinations that's giving me a problem. I think this is probably possible to do recursively in a few lines I think (though any other way is fine), but I'm having trouble getting it working. Can anyone help ?

    // how to test a syllable, just for the purpose of this example
    bool IsSyllable(string possibleSyllable) 
    {
        return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$");
    }

    List<string> BreakIntoSyllables(string word)
    {
       // the code here is what I'm trying to write 
       // if 'word' is "misunderstand" , I'd like this to return
       //  => {"mis","und","er","stand"},{ "mis","un","der","stand"}
       // and for any other combinations to be not included
    }

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

万人眼中万个我 2024-12-15 15:08:53

尝试从这里开始:

var word = "misunderstand";

Func<string, bool> isSyllable =
    t => Regex.IsMatch(t, "^(mis|und|un|der|er|stand)$");

var query =
    from i in Enumerable.Range(0, word.Length)
    from l in Enumerable.Range(1, word.Length - i)
    let part = word.Substring(i, l)
    where isSyllable(part)
    select part;

这将返回:

misunderstand-results

这至少对开始有帮助吗?


编辑:我进一步思考了这个问题,并提出了这几个查询:

Func<string, IEnumerable<string[]>> splitter = null;
splitter =
    t =>
        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        let e = t.Substring(n)
        from g in (new [] { new [] { e } }).Concat(splitter(e))
        select (new [] { s }).Concat(g).ToArray();

var query =
    from split in (new [] { new [] { word } }).Concat(splitter(word))
    where split.All(part => isSyllable(part))
    select split;

现在 query 返回:

misunderstanding-results2

让我知道现在是否解决了。

Try starting with this:

var word = "misunderstand";

Func<string, bool> isSyllable =
    t => Regex.IsMatch(t, "^(mis|und|un|der|er|stand)$");

var query =
    from i in Enumerable.Range(0, word.Length)
    from l in Enumerable.Range(1, word.Length - i)
    let part = word.Substring(i, l)
    where isSyllable(part)
    select part;

This returns:

misunderstand-results

Does that help to begin with at least?


EDIT: I thought a bit further about this problem an came up with this couple of queries:

Func<string, IEnumerable<string[]>> splitter = null;
splitter =
    t =>
        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        let e = t.Substring(n)
        from g in (new [] { new [] { e } }).Concat(splitter(e))
        select (new [] { s }).Concat(g).ToArray();

var query =
    from split in (new [] { new [] { word } }).Concat(splitter(word))
    where split.All(part => isSyllable(part))
    select split;

Now query return this:

misunderstanding-results2

Let me know if that's nailed it now.

过期以后 2024-12-15 15:08:53

通常此类问题可以使用 尝试 来解决。我将基于如何创建特里树来实现特里树在 c# 中(但请注意我已经重写了它)。

var trie = new Trie(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" });
//var trie = new Trie(new[] { "u", "n", "q", "u", "e", "s", "t", "i", "o", "n", "a", "b", "l", "e", "un", "qu", "es", "ti", "on", "ab", "le", "nq", "ue", "st", "io", "na", "bl", "unq", "ues", "tio", "nab", "nqu", "est", "ion", "abl", "que", "stio", "nab" });

var word = "unquestionable";

var parts = new List<List<string>>();

Split(word, 0, trie, trie.Root, new List<string>(), parts);

//

public static void Split(string word, int index, Trie trie, TrieNode node, List<string> currentParts, List<List<string>> parts)
{   
    // Found a syllable. We have to split: one way we take that syllable and continue from it (and it's done in this if).
    // Another way we ignore this possible syllable and we continue searching for a longer word (done after the if)
    if (node.IsTerminal)
    {
        // Add the syllable to the current list of syllables
        currentParts.Add(node.Word);

        // "covered" the word with syllables
        if (index == word.Length)
        {
            // Here we make a copy of the parts of the word. This because the currentParts list is a "working" list and is modified every time.
            parts.Add(new List<string>(currentParts));
        }
        else
        {
            // There are remaining letters in the word. We restart the scan for more syllables, restarting from the root.
            Split(word, index, trie, trie.Root, currentParts, parts);
        }

        // Remove the syllable from the current list of syllables
        currentParts.RemoveAt(currentParts.Count - 1);
    }

    // We have covered all the word with letters. No more work to do in this subiteration
    if (index == word.Length)
    {
        return;
    }

    // Here we try to find the edge corresponding to the current character

    TrieNode nextNode;

    if (!node.Edges.TryGetValue(word[index], out nextNode))
    {
        return;
    }

    Split(word, index + 1, trie, nextNode, currentParts, parts);
}

public class Trie
{
    public readonly TrieNode Root = new TrieNode();

    public Trie()
    {
    }

    public Trie(IEnumerable<string> words)
    {
        this.AddRange(words);
    }

    public void Add(string word)
    {
        var currentNode = this.Root;

        foreach (char ch in word)
        {
            TrieNode nextNode;

            if (!currentNode.Edges.TryGetValue(ch, out nextNode))
            {
                nextNode = new TrieNode();
                currentNode.Edges[ch] = nextNode;
            }

            currentNode = nextNode;
        }

        currentNode.Word = word;
    }

    public void AddRange(IEnumerable<string> words)
    {
        foreach (var word in words)
        {
            this.Add(word);
        }
    }
}

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>();
    public string Word { get; set; }

    public bool IsTerminal
    {
        get
        {
            return this.Word != null;
        }
    }
}

word 是您感兴趣的字符串,parts 将包含可能的音节列表的列表(将其设为 List,但是很容易做到这一点,而不是 parts.Add(new List(currentParts)); 编写。 parts.Add(currentParts.ToArray()); 并将所有 List> 更改为 List 我将添加一个 Enigmativity 响应的变体,它实际上比他的更快,因为它会立即丢弃

错误的音节,而不是稍后过滤它们。如果您喜欢它,您应该给他一个。 +1,因为没有他的想法,这个变体是不可能的,但请注意,“正确”的解决方案是使用 Trie(s) :-)

Func<string, bool> isSyllable = t => Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$");

Func<string, IEnumerable<string[]>> splitter = null;
splitter =
    t =>
        (
        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        where isSyllable(s)
        let e = t.Substring(n)
        let f = splitter(e)
        from g in f
        select (new[] { s }).Concat(g).ToArray()
        )
        .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

var parts = splitter(word).ToList();

解释:

        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        where isSyllable(s)

我们计算 a 的所有可能的音节。单词,从长度 1 到单词的长度 - 1 并检查它是否是音节。我们直接剔除非音节。稍后将检查作为音节的完整单词。

        let e = t.Substring(n)
        let f = splitter(e)

我们搜索字符串剩余部分的音节

        from g in f
        select (new[] { s }).Concat(g).ToArray()

,并将找到的音节与“当前”音节链接起来。请注意,我们正在创建许多无用的数组。如果我们接受 IEnumerable> 作为结果,我们可以拿走这个 ToArray

(我们可以一起重写许多行,删除许多let

        from g in splitter(t.Substring(n))
        select (new[] { s }).Concat(g).ToArray()

但为了清楚起见,我们不会这样做)

并且我们将“当前”音节与找到的音节连接起来。

        .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

在这里,我们可以稍微重建一下查询,以便不使用此 Concat 并创建空数组,但这会有点复杂(我们可以将整个 lambda 函数重写为 isSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction)

最后,如果整个单词是一个音节,我们将整个单词添加为一个音节音节。否则我们Concat一个空数组(所以没有Concat

Normally this type of problems is solved using Tries. I will base my implementation of a Trie on How to create a trie in c# (but notice that I have rewritten it).

var trie = new Trie(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" });
//var trie = new Trie(new[] { "u", "n", "q", "u", "e", "s", "t", "i", "o", "n", "a", "b", "l", "e", "un", "qu", "es", "ti", "on", "ab", "le", "nq", "ue", "st", "io", "na", "bl", "unq", "ues", "tio", "nab", "nqu", "est", "ion", "abl", "que", "stio", "nab" });

var word = "unquestionable";

var parts = new List<List<string>>();

Split(word, 0, trie, trie.Root, new List<string>(), parts);

//

public static void Split(string word, int index, Trie trie, TrieNode node, List<string> currentParts, List<List<string>> parts)
{   
    // Found a syllable. We have to split: one way we take that syllable and continue from it (and it's done in this if).
    // Another way we ignore this possible syllable and we continue searching for a longer word (done after the if)
    if (node.IsTerminal)
    {
        // Add the syllable to the current list of syllables
        currentParts.Add(node.Word);

        // "covered" the word with syllables
        if (index == word.Length)
        {
            // Here we make a copy of the parts of the word. This because the currentParts list is a "working" list and is modified every time.
            parts.Add(new List<string>(currentParts));
        }
        else
        {
            // There are remaining letters in the word. We restart the scan for more syllables, restarting from the root.
            Split(word, index, trie, trie.Root, currentParts, parts);
        }

        // Remove the syllable from the current list of syllables
        currentParts.RemoveAt(currentParts.Count - 1);
    }

    // We have covered all the word with letters. No more work to do in this subiteration
    if (index == word.Length)
    {
        return;
    }

    // Here we try to find the edge corresponding to the current character

    TrieNode nextNode;

    if (!node.Edges.TryGetValue(word[index], out nextNode))
    {
        return;
    }

    Split(word, index + 1, trie, nextNode, currentParts, parts);
}

public class Trie
{
    public readonly TrieNode Root = new TrieNode();

    public Trie()
    {
    }

    public Trie(IEnumerable<string> words)
    {
        this.AddRange(words);
    }

    public void Add(string word)
    {
        var currentNode = this.Root;

        foreach (char ch in word)
        {
            TrieNode nextNode;

            if (!currentNode.Edges.TryGetValue(ch, out nextNode))
            {
                nextNode = new TrieNode();
                currentNode.Edges[ch] = nextNode;
            }

            currentNode = nextNode;
        }

        currentNode.Word = word;
    }

    public void AddRange(IEnumerable<string> words)
    {
        foreach (var word in words)
        {
            this.Add(word);
        }
    }
}

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>();
    public string Word { get; set; }

    public bool IsTerminal
    {
        get
        {
            return this.Word != null;
        }
    }
}

word is the string you are interested in, parts will contain the list of lists of possible syllables (it would probably be more correct to make it a List<string[]>, but it's quite easy to do it. Instead of parts.Add(new List<string>(currentParts)); write parts.Add(currentParts.ToArray()); and change all the List<List<string>> to List<string[]>.

I'll add a variant of Enigmativity response thas is theretically faster than his because it discards wrong syllables immediately instead of post-filtering them later. If you like it, you should give him a +1, because without his idea, this variant wouldn't be possible. But note that it's still an hack. The "right" solution is in using Trie(s) :-)

Func<string, bool> isSyllable = t => Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$");

Func<string, IEnumerable<string[]>> splitter = null;
splitter =
    t =>
        (
        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        where isSyllable(s)
        let e = t.Substring(n)
        let f = splitter(e)
        from g in f
        select (new[] { s }).Concat(g).ToArray()
        )
        .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

var parts = splitter(word).ToList();

An explanation:

        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        where isSyllable(s)

We calculate all the possible syllables of a word, from length 1 to the length of the word - 1 and check if it's a syllable. We weed out directly the non-syllables. The full word as a syllable will be checked later.

        let e = t.Substring(n)
        let f = splitter(e)

We search the syllables of the remaining part of the string

        from g in f
        select (new[] { s }).Concat(g).ToArray()

And we chain the found syllables with the "current" syllable. Note that we are creating many useless arrays. If we accept to have an IEnumerable<IEnumerable<string>> as the result, we can take away this ToArray.

(we could rewrite many rows together, deleting many let, like

        from g in splitter(t.Substring(n))
        select (new[] { s }).Concat(g).ToArray()

but we won't do it for clarity)

And we concatenate the "current" syllable with the found syllables.

        .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

Here we could rebuild the query a little so to not use this Concat and create empty arrays, but it would be a little complex (we could rewrite the entire lambda function as a isSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction)

In the end, if the whole word is a syllable, we add the whole word as a syllable. Otherwise we Concat an empty array (so no Concat)

陌伤ぢ 2024-12-15 15:08:53

说实话,您可能在扩展此数据时遇到问题,我不确定您的数据集有多大,但基于简单的“这是aa音节吗?”的解决方案您需要为每个单词调用大约 0(n*n) 的“音节检测”例程,其中 n = 单词中的字符数(如果这没有意义,则意味着对于大型数据集来说它可能会很慢!) 。这没有考虑到检测算法的可扩展性,当您添加更多音节时,检测算法也可能会变慢。 .

我知道您说过识别什么是音节或不是音节的过程是不相关的,但是可以说您可以更改它以使其更像自动完成功能,即传递音节的开头并让它告诉你从这一点起所有可能的音节将更具可扩展性。如果性能失控,请考虑将其替换为 trie

You might have problems with scaling this to be honest, I'm not sure how big your dataset is but a solution based on a simple 'is this a a syllable?' you will need to call your 'syllable detection' routine roughly 0(n*n) for each word where n = the number of characters in the word (if this is meaningless, it means that it's likely to be slow for large datasets!). This doesn't take into account the scalability of the detection algorithm which might also slow down as you add more syllables. .

I know you have said your process for identifying what's a syllable or not isn't relevant, but lets say you can change it to have it work more like autocompletion, ie pass in the beginning of a syllable and have it tell you all the syllables that are possible from this point would be much more scalable. Have a look at replacing it with a trie if performance gets out of hand.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文