Approaches to reverse sprintf/format
I have to heuristically determine the format pattern strings by analyzing the formatted results.
For example I have these strings:
You have 3 unread messages.
You have 10 unread messages.
I'm sorry, Dave. I'm afraid I can't do that.
I'm sorry, Frank. I'm afraid I can't do that.
This statement is false.
I want to derive these format strings:
You have %s unread messages.
I'm sorry, %s. I'm afraid I can't do that.
This statement is false.
Which approaches and/or algorithms could help me here?
My first thought was to use machine learning, but my gut tells me this could be a rather classic problem.
Some additional requirements:
- The type of the parameter is irrelevant, i.e. I don't need to know whether the parameter was originally %s or %d, or whether it was padded or aligned.
- There can be more than one parameter (or none at all).
- Typically the data consists of thousands of formatted strings, but only tens of format patterns.
Comments (1)
Cluster the strings by some metric of similarity (I'd try the length of the longest common subsequence, LCS). Determining the number of clusters is the hard part if you don't know it beforehand.
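The clustering step could be sketched roughly as follows. This is only one possible reading of the suggestion: the normalisation by the longer string's length, the 0.7 threshold, and the greedy "compare against the first member" strategy are all assumptions made for illustration, and the function names are hypothetical. A threshold sidesteps having to pick the number of clusters up front, at the cost of having to tune it.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (O(len(a)*len(b)))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """LCS length divided by the longer string's length, a value in [0, 1]."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def cluster(strings, threshold=0.7):
    """Greedy single pass: put each string into the first cluster whose
    representative (its first member) is similar enough, else open a new one.
    The threshold is an assumed tuning knob, not something from the answer."""
    clusters = []
    for s in strings:
        for members in clusters:
            if similarity(s, members[0]) >= threshold:
                members.append(s)
                break
        else:
            clusters.append([s])
    return clusters


if __name__ == "__main__":
    data = [
        "You have 3 unread messages.",
        "I'm sorry, Dave. I'm afraid I can't do that.",
        "You have 10 unread messages.",
        "This statement is false.",
        "I'm sorry, Frank. I'm afraid I can't do that.",
    ]
    for members in cluster(data):
        print(members)
```

With thousands of strings and only tens of patterns, each string is compared against at most one representative per cluster, so the cost stays far below a full pairwise comparison.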
Within each cluster, determine the LCS of all strings in it, recording the positions of the gaps that occur. Replace the gaps with %s. (You may want to build a function that returns an LCS-based format string and fold/reduce that over the cluster.)
The above is a greedy algorithm that, given {foobar, fooBaR}, produces foo%sa%s. You may want to replace any pair of occurrences of %s separated by a single character (or a single non-whitespace char, etc.) by a single %s, recursively.
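A minimal sketch of the fold/reduce idea and the follow-up clean-up, under stated assumptions: templates are kept as lists of characters with a sentinel (WILD, here None) standing for a gap, and the names merge, derive_template, collapse and render are hypothetical helpers, not anything from the answer. Literal percent signs are escaped as %% when rendering, which is an addition of this sketch.

```python
from functools import reduce

WILD = None  # sentinel for a parameter slot; never equal to a character


def merge(template, string):
    """Fold one string into a template, keeping the LCS and marking gaps."""
    a, b = list(template), list(string)
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]; WILD matches nothing.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] is not WILD and a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    out = []

    def gap():
        # Emit one WILD per run of skipped elements.
        if not out or out[-1] is not WILD:
            out.append(WILD)

    i = j = 0
    while i < n and j < m:
        if a[i] is not WILD and a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            gap()
            i += 1
        else:
            gap()
            j += 1
    if i < n or j < m:  # leftover tail on either side is also a gap
        gap()
    return out


def derive_template(cluster):
    """Reduce merge over all strings of one (non-empty) cluster."""
    return reduce(merge, cluster[1:], list(cluster[0]))


def collapse(template):
    """Replace WILD, <single char>, WILD by a single WILD, repeatedly."""
    out = []
    for token in template:
        out.append(token)
        if len(out) >= 3 and out[-1] is WILD and out[-3] is WILD:
            del out[-2:]
    return out


def render(template):
    """Turn a token list into a format string, escaping literal '%'."""
    return "".join("%s" if t is WILD else t.replace("%", "%%") for t in template)


if __name__ == "__main__":
    raw = derive_template(["foobar", "fooBaR"])
    print(render(raw))            # foo%sa%s
    print(render(collapse(raw)))  # foo%s
```

The collapse step matters for the question's own data: Dave and Frank share the letter a, so the raw template contains %sa%s where a single %s is wanted. Restricting the rule to non-whitespace separators, as the answer suggests, would keep two parameters separated by a space from being merged.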