How to automatically create a pattern based on real data?
I have many vendors in my database, and they all differ in some aspect of their data. I'd like to create a data validation rule that is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if a user inputs the string 'XZ-217' for vendor B, the algorithm should compare it with previous data and say: this string is not similar to vendor B's previous data.
Is there a good way or tool to achieve such a comparison? The answer could be a generic algorithm or a Perl module.
Edit:
The "similarity" is hard to define, I agree. But I'd like to find an algorithm that could analyze the previous ca. 100 samples and then compare the outcome of that analysis with new data. Similarity may be based on length, on the use of characters/numbers, on string creation patterns, on a similar beginning/end/middle, or on the presence of separators.
I feel it is not an easy task, but on the other hand, I think it has very wide use. So I hoped there would already be some hints.
Comments (4)
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
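As a rough illustration of how a string metric could be applied to the vendor example, here is a minimal sketch of Levenshtein (edit) distance. It is written in plain Python rather than with Text::Levenshtein, and the helper name looks_similar and the threshold max_distance=2 are assumptions made for this example, not part of any module.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def looks_similar(candidate, samples, max_distance=2):
    """Hypothetical rule: accept if within max_distance edits of any earlier sample."""
    return min(levenshtein(candidate, s) for s in samples) <= max_distance


vendor_b = ["1276", "1899", "22711"]
print(looks_similar("XZ-217", vendor_b))  # False
print(looks_similar("1277", vendor_b))    # True
```

Note that edit distance compares literal characters, so a well-formed but numerically distant value such as "5000" would also be rejected for vendor B under this threshold. That limitation is one reason to look at the shape of the string instead of its exact characters.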
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
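The original listing is not reproduced here. As a rough illustration of the idea only (not the author's code), the sketch below assumes the three zone types are ASCII letters, digits, and everything else. It records the length range of each zone per string shape, can widen an existing profile with a new sample, and turns the profile into a regex. It is written in Python for brevity, and all function names (zones, expand, learn, to_regex) are hypothetical.

```python
import re
import string
from itertools import groupby

# Assumed zone types; the original answer does not say which three it used.
CLASSES = {"alpha": "[A-Za-z]", "digit": "[0-9]", "other": "[^A-Za-z0-9]"}


def zone_type(ch):
    if ch in string.ascii_letters:
        return "alpha"
    if ch in string.digits:
        return "digit"
    return "other"


def zones(text):
    """Split text into runs of one zone type, e.g. 'XZ-4' -> alpha:2, other:1, digit:1."""
    return [(kind, len(list(run))) for kind, run in groupby(text, key=zone_type)]


def expand(profiles, text):
    """Add one sample: widen the length ranges of its shape, or start a new shape."""
    runs = zones(text)
    shape = tuple(kind for kind, _ in runs)
    lengths = [n for _, n in runs]
    if shape in profiles:
        profiles[shape] = [(min(lo, n), max(hi, n))
                           for (lo, hi), n in zip(profiles[shape], lengths)]
    else:
        profiles[shape] = [(n, n) for n in lengths]
    return profiles


def learn(samples):
    """Build a profile dict {shape: [(min_len, max_len), ...]} from earlier samples."""
    profiles = {}
    for sample in samples:
        expand(profiles, sample)
    return profiles


def to_regex(profiles):
    """One alternative per shape; intended to be used with fullmatch()."""
    alternatives = []
    for shape, ranges in profiles.items():
        parts = [f"{CLASSES[kind]}{{{lo},{hi}}}" for kind, (lo, hi) in zip(shape, ranges)]
        alternatives.append("".join(parts))
    return re.compile("(?:" + "|".join(alternatives) + ")")


vendor_b = to_regex(learn(["1276", "1899", "22711"]))
print(bool(vendor_b.fullmatch("XZ-217")))  # False
print(bool(vendor_b.fullmatch("1300")))    # True
```

In a larger application, one profile per vendor could be stored and expanded whenever a rejected value is confirmed as valid by a user.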
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
code:
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
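The approach described above can be sketched as follows. This is a hedged reconstruction in Python, not the answer's original code: it splits each good value into runs of letters, digits, or other characters, keeps a run as a literal when every sample agrees on it, and otherwise generalizes it to a character class with a length range. Runs that are missing from some samples are treated as an optional tail, which reflects the stated restriction that optional parts may only come at the end. The names tokens, column_pattern, and build_regex are hypothetical, and the list of good values is assumed to be non-empty.

```python
import re
import string
from itertools import groupby

CLASSES = {"alpha": "[A-Za-z]", "digit": "[0-9]", "other": "[^A-Za-z0-9]"}


def kind(ch):
    if ch in string.ascii_letters:
        return "alpha"
    if ch in string.digits:
        return "digit"
    return "other"


def tokens(text):
    """'12-75' -> ['12', '-', '75']"""
    return ["".join(run) for _, run in groupby(text, key=kind)]


def column_pattern(values):
    """Pattern for one token position, given every token seen at that position."""
    if len(set(values)) == 1:
        return re.escape(values[0])  # all samples agree: keep the literal text
    kinds = {kind(v[0]) for v in values}
    cls = CLASSES[kinds.pop()] if len(kinds) == 1 else "."  # mixed kinds: any character
    lo, hi = min(map(len, values)), max(map(len, values))
    return f"{cls}{{{lo},{hi}}}"


def build_regex(good_values):
    rows = [tokens(v) for v in good_values]
    required = min(len(r) for r in rows)  # positions present in every sample
    longest = max(len(r) for r in rows)
    columns = [[r[i] for r in rows if i < len(r)] for i in range(longest)]
    parts = [column_pattern(c) for c in columns]
    tail = ""
    for part in reversed(parts[required:]):  # nest the optional parts: (?:a(?:b)?)?
        tail = f"(?:{part}{tail})?"
    return re.compile("".join(parts[:required]) + tail)


for name, values in [("A", ["XZ-4", "XZ-23", "XZ-217"]),
                     ("B", ["1276", "1899", "22711"]),
                     ("C", ["12-4", "12-75", "12"])]:
    rx = build_regex(values)
    print(name, rx.pattern, bool(rx.fullmatch("XZ-217")))
```

Because the optional tail is nested, the pattern learned for vendor C also accepts "12-" on its own. Tightening that would need extra bookkeeping about which tail lengths were actually observed.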
If there was a Tie::StringApproxHash module, it would fit the bill here. I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash. The former is more important; the latter would make light work of coding.
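To make the idea of a hash with approximate keys concrete, here is a minimal Python sketch. It does not use String::Approx or Tie::RegexpHash; instead it assumes Python's standard difflib as the fuzzy matcher, and the class name ApproxDict and the cutoff of 0.7 are hypothetical choices for this example. Earlier values are stored as keys with the vendor as the value, and a lookup with an unseen string falls back to the closest stored key.

```python
import difflib


class ApproxDict(dict):
    """dict whose failed lookups fall back to the closest stored key (hypothetical helper)."""

    def __init__(self, *args, cutoff=0.7, **kwargs):
        super().__init__(*args, **kwargs)
        self.cutoff = cutoff  # minimum difflib similarity ratio, between 0 and 1

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the exact key is absent.
        close = difflib.get_close_matches(key, list(self.keys()), n=1, cutoff=self.cutoff)
        if not close:
            raise KeyError(key)
        return dict.__getitem__(self, close[0])


seen = ApproxDict({
    "XZ-4": "A", "XZ-23": "A", "XZ-217": "A",
    "1276": "B", "1899": "B", "22711": "B",
})

# A new value entered for vendor B that most resembles vendor A's data:
print(seen["XZ-218"])  # A
```

A validation step could then compare the vendor returned by the lookup with the vendor the user selected, and treat a KeyError as "resembles nothing seen so far".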