如何像Excel那样发现并分析相似的模式?
当您键入具有特定模式的 3 行并将该列一直向下拖动时,您就知道了 Excel 中的功能,Excel 会尝试为您延续该模式。
例如
键入...
- test-1
- test-2
- test-3
Excel 将继续执行:
- test-4
- test-5
- test-n...
对于其他一些模式(例如日期和日期)同样适用很快。
我正在尝试完成类似的事情,但我也想处理更多特殊情况,例如:
- test-blue-somethingelse
- test-yellow-somethingelse
- test-red-somethingelse
现在基于此条目我想说模式是:
- < strong>test-[DYNAMIC]-something
继续使用其他颜色的[DYNAMIC]完全是另一回事,我现在并不关心这一点。我最感兴趣的是检测模式中的[动态]部分。
我需要从大量池条目中检测到这一点。假设您有 10,000 个具有此类模式的字符串,并且您希望根据相似性对这些字符串进行分组,并检测文本的哪一部分在不断变化 ([DYNAMIC])。
文档分类在这种情况下可能很有用,但我不知道从哪里开始。
更新:
我忘了提及,也可以有多个[动态]模式。
例如:
- test_[DYNAMIC]12[DYNAMIC2]
我认为这并不重要,但我计划在 .NET 中实现它,但任何有关要使用的算法的提示都将是相当重要的有帮助。
You know the functionality in Excel when you type 3 rows with a certain pattern and drag the column all the way down Excel tries to continue the pattern for you.
For example
Type...
- test-1
- test-2
- test-3
Excel will continue it with:
- test-4
- test-5
- test-n...
Same works for some other patterns such as dates and so on.
I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:
- test-blue-somethingelse
- test-yellow-somethingelse
- test-red-somethingelse
Now based on this entries I want say that the pattern is:
- test-[DYNAMIC]-something
Continue the [DYNAMIC] with other colours is whole another deal, I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.
I need to detect this from a large of pool entries. Assume that you got 10.000 strings with this kind of patterns, and you want to group these strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).
Document classification can be useful in this scenario but I'm not sure where to start.
UPDATE:
I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.
Such as:
- test_[DYNAMIC]12[DYNAMIC2]
I don't think it's important but I'm planning to implement this in .NET but any hint about the algorithms to use would be quite helpful.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
一旦您开始考虑查找以下形式的模式的动态部分:
....
没有任何其他假设,那么您将需要查找您提供的示例字符串的最长公共子序列。例如,如果我有test-123-abc
和test-48953-defg
那么 LCS 将是test-
和-< /代码>。动态部分将是 LCS 结果之间的差距。然后,您可以在适当的数据结构中查找动态部分。
找到超过 2 个字符串的 LCS 的问题是非常昂贵的,这将是你的问题的瓶颈。以牺牲准确性为代价,您可以使这个问题变得容易处理。例如,您可以在所有字符串对之间执行 LCS,并将具有相似 LCS 结果的字符串组分组在一起。然而,这意味着某些模式将无法被正确识别。
当然,如果您可以对字符串施加进一步的限制,就像 Excel 那样,它似乎只允许
形式的模式,那么所有这一切都可以避免。As soon as you start considering finding dynamic parts of patterns of the form :
<const1><dynamic1><const2><dynamic2>....
without any other assumptions then you would need to find the longest common subsequence of the sample strings you have provided. For example if I havetest-123-abc
andtest-48953-defg
then the LCS would betest-
and-
. The dynamic parts would then be the gaps between the result of the LCS. You could then look up your dynamic part in an appropriate data structure.The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.
Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form
<const><dynamic>
.找到 [dynamic] 并不是什么大不了的事,你可以用 2 个字符串来做到这一点 - 只需从开头开始,当它们开始不等于时停止,从末尾做同样的事情,瞧 - 你得到了你的 [dynamic]
类似(伪代码 - 有点):
关于你的 10k 数据集。您需要对每个组合调用此(或者可能是更优化的版本)来找出您的模式(10k x 10k 调用)。然后按模式对结果进行排序(即保存开始和结束并按这些字段排序)
finding [dynamic] isnt that big of deal, you can do that with 2 strings - just start at the beginning and stop when they start not-being-equals, do the same from the end, and voila - you got your [dynamic]
something like (pseudocode - kinda):
About your 10k data-sets. You would need to call this (or maybe a little more optimized version) with each combination to figure out your patten (10k x 10k calls). and then sort the result by pattern (ie. save the begin and the ending and sort by these fields)
我认为你需要计算类似 Levenshtein 距离之类的东西,以找到相似的组字符串,然后在每组相似的字符串中,您可以使用典型的类似 diff 的算法来识别动态部分。
I think what you need is to compute something like the Levenshtein distance, to find the group of similar strings, and then in each group of similar strings, you indentify the dynamic part in a typical diff-like algorithm.
不管你信不信,谷歌文档在这类事情上可能比 excel 更好。
谷歌收集了大量关于集合的数据 - 例如,在您给出的示例中,它会识别蓝色、红色、黄色......作为集合“颜色”的一部分。它具有比 Excel 更完整的模式识别能力,因此更有可能继续该模式。
Google docs might be better than excel for this sort of thing, believe it or not.
Google has collected massive amounts of data on sets - for example the in the example you gave it would recognise the blue, red, yellow ... as part of the set 'colours'. It has far more complete pattern recognition than Excel so would stand a better chance of continuing the pattern.