是否可以计算正则表达式和字符串之间的编辑距离?
如果是这样,请解释一下如何。
回复:什么是距离 - “两个字符串之间的距离定义为将一个字符串转换为另一个字符串所需的最小编辑次数。”
例如,xyz 到 XYZ 需要 3 次编辑,因此字符串 xYZ 更接近 XYZ 和 xyz。
如果模式是 [0-9]{3} 或例如 123,则 a23 会比 ab3 更接近该模式。
如何找到正则表达式和不匹配字符串之间的最短距离?
以上是Damerau–Levenshtein距离算法。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以使用有限状态机有效地完成此操作(即时间线性)。如果您使用转换器,您甚至可以相当紧凑地编写转换规范,并进行比简单插入或删除更细致的转换 - 请参阅维基百科的 有限状态传感器作为起点,以及 FSA 工具包或 FSA6 等软件(其具有不完全稳定的网络演示)。有很多用于 FSA 操作的库;我不想建议前两个是您唯一或最好的选择,只是我听说过的两个。
但是,如果您只想进行高效的近似搜索,则存在一个不太灵活但已为您实现的选项:TRE ,它有一个近似匹配函数,它返回匹配的成本 - 即,从您的角度来看,到比赛的距离。
You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want the efficient, approximate searching, a less flexibly but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.
如果你的意思是最接近的匹配字符串和样本之间具有最小编辑距离的字符串,那么我很确定它可以完成,但是你必须自己将正则表达式转换为 DFA,然后尝试匹配,并且每当某些事情失败了,非确定性地继续,就好像它已经通过一样,并跟踪数字差异。你可以使用 A* 搜索或类似的东西,但效率很低(O(2^n) 最坏情况)
If you mean the string with the smallest levenshtein distance between the closest matched string and a sample, then I'm pretty sure it can be done, but you'd have to convert the Regex to a DFA yourself, then try to match and whenever something fails, non-deterministically continue as if it had passed and keep track of the number differences. you could use A* search or something similar for this, it would be quite inefficient though (O(2^n) worst case)