How to measure the complexity of a string?
I have a few long strings (~1,000,000 characters each). Each string contains only symbols from a defined alphabet, for example
A = {1,2,3}
Sample strings
string S1 = "1111111111 ..."; //[meta complexity] = 0
string S2 = "1111222333 ..."; //[meta complexity] = 10
string S3 = "1213323133 ..."; //[meta complexity] = 100
Q: What kind of measure can I use to quantify the complexity of these strings? I can see that S1 is less complex than S3, but how can I determine that programmatically from .NET? Any algorithm, or a pointer to a tool or literature, would be greatly appreciated.
Edit
I tried Shannon entropy, but it turned out not to be useful for me: it gives the same H value for the sequences AAABBBCCC, ABCABCABC, ACCCBABAB, and BBACCABAC.
This is what I ended up doing
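The entropy limitation mentioned in the edit can be checked with a short sketch. This is an illustration only, not the asker's own code: it is written in Python rather than .NET for brevity, and the function name `shannon_entropy` is made up for this example. It computes the order-0 (per-symbol frequency) Shannon entropy, which ignores symbol order, so all four sequences score the same.

```python
from collections import Counter
from math import log2


def shannon_entropy(s: str) -> float:
    """Order-0 Shannon entropy in bits per symbol (hypothetical helper).

    Only symbol frequencies are used, so the ordering of symbols has no effect.
    """
    n = len(s)
    if n == 0:
        return 0.0
    counts = Counter(s)
    # H = sum over symbols of p * log2(1/p), with p = count / n
    return sum((c / n) * log2(n / c) for c in counts.values())


# The four sequences from the question: each contains three A, three B, three C.
samples = ["AAABBBCCC", "ABCABCABC", "ACCCBABAB", "BBACCABAC"]
entropies = [shannon_entropy(s) for s in samples]

for s, h in zip(samples, entropies):
    print(f"{s}: H = {h:.4f} bits/symbol")
```

Because every sequence has the same symbol counts, each value equals log2(3), which is why this measure cannot separate an ordered string from a shuffled one.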
Comments (1)
Compressing the strings using a standard technique such as zip gives a good indication of their complexity.
Good compression rate ≈ lower complexity
Bad compression rate ≈ higher complexity
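A minimal sketch of this idea, assuming DEFLATE (the algorithm behind zip) is an acceptable compressor. It is written in Python with the standard `zlib` module rather than .NET; in .NET, the `System.IO.Compression` namespace provides comparable DEFLATE streams. The helper name `compression_ratio`, the test length of 100,000 characters, and the generated sample strings are assumptions made for illustration, not data from the question.

```python
import random
import zlib


def compression_ratio(s: str) -> float:
    """Compressed size divided by original size (hypothetical helper).

    A lower ratio means the string compresses well, i.e. lower complexity.
    """
    raw = s.encode("ascii")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


n = 99_999  # assumed length, divisible by 3; shorter than the question's strings to keep the demo fast
third = n // 3
rng = random.Random(0)  # fixed seed so the demo is repeatable

s1 = "1" * n                                      # one repeated symbol
s2 = "1" * third + "2" * third + "3" * third      # three long runs
s_periodic = "123" * third                        # ordered, repeating pattern
s3 = "".join(rng.choices("123", k=n))             # symbols in random order

for name, s in [("S1", s1), ("S2", s2), ("periodic", s_periodic), ("S3", s3)]:
    print(f"{name}: ratio = {compression_ratio(s):.4f}")
```

Unlike per-symbol entropy, the ratio reacts to ordering: the periodic string and the random string have nearly identical symbol frequencies, yet the periodic one compresses far better. The ratio is only a practical proxy, and its absolute values depend on the compressor and on string length, so it is best used to compare strings of similar size.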