从字符串生成子字符串的组合
我试图为给定的单词生成所有可能的音节组合。识别音节的过程在这里并不相关,但所有组合的生成给我带来了问题。我认为这可能可以在几行中递归地完成(尽管任何其他方式都可以),但我很难让它工作。有人可以帮忙吗?
// how to test a syllable, just for the purpose of this example
bool IsSyllable(string possibleSyllable)
{
return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$");
}
List<string> BreakIntoSyllables(string word)
{
// the code here is what I'm trying to write
// if 'word' is "misunderstand" , I'd like this to return
// => {"mis","und","er","stand"},{ "mis","un","der","stand"}
// and for any other combinations to be not included
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
尝试从这里开始:
这将返回:
这至少对开始有帮助吗?
编辑:我进一步思考了这个问题,并提出了这几个查询:
现在
query
返回:让我知道现在是否解决了。
Try starting with this:
This returns:
Does that help to begin with at least?
EDIT: I thought a bit further about this problem an came up with this couple of queries:
Now
query
return this:Let me know if that's nailed it now.
通常此类问题可以使用 尝试 来解决。我将基于如何创建特里树来实现特里树在 c# 中(但请注意我已经重写了它)。
word
是您感兴趣的字符串,parts
将包含可能的音节列表的列表(将其设为List
,但是很容易做到这一点,而不是parts.Add(new List(currentParts));
编写。parts.Add(currentParts.ToArray());
并将所有List
更改为>
List
我将添加一个 Enigmativity 响应的变体,它实际上比他的更快,因为它会立即丢弃错误的音节,而不是稍后过滤它们。如果您喜欢它,您应该给他一个。 +1,因为没有他的想法,这个变体是不可能的,但请注意,“正确”的解决方案是使用 Trie(s) :-)
解释:
我们计算 a 的所有可能的音节。单词,从长度 1 到单词的长度 - 1 并检查它是否是音节。我们直接剔除非音节。稍后将检查作为音节的完整单词。
我们搜索字符串剩余部分的音节
,并将找到的音节与“当前”音节链接起来。请注意,我们正在创建许多无用的数组。如果我们接受
IEnumerable>
作为结果,我们可以拿走这个ToArray
。(我们可以一起重写许多行,删除许多
let
,但为了清楚起见,我们不会这样做)
并且我们将“当前”音节与找到的音节连接起来。
在这里,我们可以稍微重建一下查询,以便不使用此 Concat 并创建空数组,但这会有点复杂(我们可以将整个 lambda 函数重写为 isSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction)
最后,如果整个单词是一个音节,我们将整个单词添加为一个音节音节。否则我们
Concat
一个空数组(所以没有Concat
)Normally this type of problems is solved using Tries. I will base my implementation of a Trie on How to create a trie in c# (but notice that I have rewritten it).
word
is the string you are interested in,parts
will contain the list of lists of possible syllables (it would probably be more correct to make it aList<string[]>
, but it's quite easy to do it. Instead ofparts.Add(new List<string>(currentParts));
writeparts.Add(currentParts.ToArray());
and change all theList<List<string>>
toList<string[]>
.I'll add a variant of Enigmativity response thas is theretically faster than his because it discards wrong syllables immediately instead of post-filtering them later. If you like it, you should give him a +1, because without his idea, this variant wouldn't be possible. But note that it's still an hack. The "right" solution is in using Trie(s) :-)
An explanation:
We calculate all the possible syllables of a word, from length 1 to the length of the word - 1 and check if it's a syllable. We weed out directly the non-syllables. The full word as a syllable will be checked later.
We search the syllables of the remaining part of the string
And we chain the found syllables with the "current" syllable. Note that we are creating many useless arrays. If we accept to have an
IEnumerable<IEnumerable<string>>
as the result, we can take away thisToArray
.(we could rewrite many rows together, deleting many
let
, likebut we won't do it for clarity)
And we concatenate the "current" syllable with the found syllables.
Here we could rebuild the query a little so to not use this
Concat
and create empty arrays, but it would be a little complex (we could rewrite the entire lambda function as aisSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction
)In the end, if the whole word is a syllable, we add the whole word as a syllable. Otherwise we
Concat
an empty array (so noConcat
)说实话,您可能在扩展此数据时遇到问题,我不确定您的数据集有多大,但基于简单的“这是aa音节吗?”的解决方案您需要为每个单词调用大约 0(n*n) 的“音节检测”例程,其中 n = 单词中的字符数(如果这没有意义,则意味着对于大型数据集来说它可能会很慢!) 。这没有考虑到检测算法的可扩展性,当您添加更多音节时,检测算法也可能会变慢。 .
我知道您说过识别什么是音节或不是音节的过程是不相关的,但是可以说您可以更改它以使其更像自动完成功能,即传递音节的开头并让它告诉你从这一点起所有可能的音节将更具可扩展性。如果性能失控,请考虑将其替换为 trie 。
You might have problems with scaling this to be honest, I'm not sure how big your dataset is but a solution based on a simple 'is this a a syllable?' you will need to call your 'syllable detection' routine roughly 0(n*n) for each word where n = the number of characters in the word (if this is meaningless, it means that it's likely to be slow for large datasets!). This doesn't take into account the scalability of the detection algorithm which might also slow down as you add more syllables. .
I know you have said your process for identifying what's a syllable or not isn't relevant, but lets say you can change it to have it work more like autocompletion, ie pass in the beginning of a syllable and have it tell you all the syllables that are possible from this point would be much more scalable. Have a look at replacing it with a trie if performance gets out of hand.