Spotting similarities and patterns within a string - Python
This is the use case I'm trying to figure this out for.
I have a list of spam subscriptions to a service and they are killing conversion rate and other usability studies.
The emails inserted look like the following:
roger[...]_surname[...]@hotmail.com
What would be your suggestions on spotting these entries by using an automated script? It feels a little more complicated than it actually looks.
Help would be very much appreciated!
Comments (3)
I don't think you can easily check for this. It's not likely to be a simple string matching problem that you can throw a regular expression at because I would guess that your use of the name 'Roger' was just an example, and that any number of names can appear in that position. You could also run one of the regular expressions supplied by the other posters, parameterising it with every permutation of obvious first name and last name. This will probably take somewhere between "too long" and "forever", and will flag up plenty of false positives.
Another approach, which works with the pattern you posted above, would be to take the last 4 letters of the username, and compare them against something. Spotting characters that are random as opposed to arranged sensibly (given a specific language) can be done by training a Markov Chain on legitimate text which can then allow you to calculate the probability of a given 4 letters appearing in that order in that language. For random letters, this probability will typically come in far lower than for a legitimate name (although if there are special characters or digits in there, all bets are off).
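To make the Markov idea concrete, here is a minimal sketch of a letter-bigram model. It is only an illustration under assumptions: the training list is a toy set of names I made up, the function names are hypothetical, and add-one smoothing is my own choice for handling unseen letter pairs.

```python
import math
from collections import defaultdict

def train_bigram_model(words):
    """Count letter-to-letter transitions in a list of legitimate words."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        word = word.lower()
        for a, b in zip(word, word[1:]):
            counts[a][b] += 1
    return counts

def avg_log_prob(text, counts, alphabet_size=26):
    """Average log-probability per transition, using add-one smoothing."""
    text = text.lower()
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        row = counts.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + alphabet_size))
    return total / len(pairs)

# Toy training data (assumption); a real model needs a large list of genuine names.
names = ["roger", "smith", "johnson", "williams", "anderson",
         "thompson", "robertson", "harrison"]
model = train_bigram_model(names)

def last4_score(email):
    """Score the last 4 characters of the username; lower means more random-looking."""
    username = email.split("@")[0]
    return avg_log_prob(username[-4:], model)
```

You would then pick a threshold by looking at the scores of addresses you already know to be genuine, and flag anything that falls well below it.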
Another way might be to use a Bayesian filter (e.g. something like Reverend in Python, though there are others) trained on the last 4 letters of legitimate email addresses. This would probably spot 95% of the ones which were just random, provided you made the data usable, e.g. by submitting not just the 4 letters but each of the 2-letter and 3-letter substrings inside it, to capture the context of each letter. I don't think this would work as well as the Markov-style method though.
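As a rough illustration of the substring idea, here is a small hand-rolled naive Bayes sketch rather than Reverend itself, so nothing below is Reverend's API. The class name, the "ham"/"spam" labels and the toy training tails are all assumptions, and I am assuming you also have some known spam tails to train the second class on.

```python
import math
from collections import Counter

def substrings(tail):
    """All 2-, 3- and 4-letter substrings of a short string."""
    return [tail[i:i + n] for n in (2, 3, 4) for i in range(len(tail) - n + 1)]

class TinyBayes:
    """Hypothetical two-class naive Bayes over substring features."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.docs = Counter()

    def train(self, label, tail):
        self.counts[label].update(substrings(tail))
        self.docs[label] += 1

    def spam_log_odds(self, tail):
        """Positive means the tail looks more like the spam examples."""
        vocab = len(set(self.counts["ham"]) | set(self.counts["spam"])) + 1
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for label, sign in (("spam", 1), ("ham", -1)):
            total = sum(self.counts[label].values())
            for feat in substrings(tail):
                # Add-one smoothing so unseen substrings do not zero out the score.
                score += sign * math.log((self.counts[label][feat] + 1) / (total + vocab))
        return score

# Toy training data (assumption): last 4 letters of known-good and known-spam usernames.
bayes = TinyBayes()
for tail in ["nson", "mith", "iams", "rson"]:
    bayes.train("ham", tail)
for tail in ["xqzv", "kjwq", "zzpx", "qvbn"]:
    bayes.train("spam", tail)
```

With this little data the scores mean very little; the point is only to show how the 2-, 3- and 4-letter substrings become the features the filter learns from.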
Whatever check you do, you can cut false positives by only submitting certain email addresses for it (e.g. only those at webmail addresses which contain an underscore, with at least 3 characters before the underscore and 5 characters after it).
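That pre-filter could look something like the sketch below. The domain list and the function name are placeholders I picked for illustration, and I am assuming the split happens at the first underscore in the username.

```python
# Hypothetical list of webmail domains; extend it to whatever you actually see.
WEBMAIL_DOMAINS = {"hotmail.com", "gmail.com", "yahoo.com"}

def is_candidate(email):
    """Return True only for addresses worth passing to the more expensive check."""
    local, sep, domain = email.lower().rpartition("@")
    if not sep or domain not in WEBMAIL_DOMAINS:
        return False
    # Split the username at the first underscore.
    before, underscore, after = local.partition("_")
    return bool(underscore) and len(before) >= 3 and len(after) >= 5
```

Anything that fails this test is simply left alone, so ordinary addresses never reach the scoring step at all.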
But ultimately, you can never know whether it's a spam address or a real one for sure until it gets used for one purpose or the other. So if possible I'd suggest giving up on trying to analyse the content and fix the problem somewhere else. In what way are they killing conversion rate? If you're counting these dummy accounts in some sort of metric, you'd be best off adding a verification stage first and only caring about metrics for accounts that pass verification. Some people really do have addresses like [email protected], after all.
I don't think you can do more than flag it as a potential problem, by checking for the pattern with regular expressions, if that's the pattern that the spammer is using repeatedly.
Looks like they're using 2 lower-case alphabetic characters after "roger", so I've built that in. Not sure how you'd go about matching what dictionary of surnames they're using, so matching the last part (which appears to be surname then 4 lower-case alphabetic characters) might be hard, though you could perhaps use a pattern which assumes that all their surnames have at least one character in them.
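The original pattern did not survive on this page, so the following is only a reconstruction from the description above, not necessarily the regex this answer posted: "roger", exactly 2 lower-case letters, an underscore, then a surname of at least one letter followed by 4 more lower-case letters (so at least 5 in total). The names FLAG_RE and looks_suspicious are hypothetical.

```python
import re

# Reconstructed from the prose (assumption): roger + 2 letters + "_" +
# surname (1 or more letters) + 4 letters, at hotmail.com.
FLAG_RE = re.compile(r"^roger[a-z]{2}_[a-z]{5,}@hotmail\.com$")

def looks_suspicious(email):
    """Flag the address for review; a match is a hint, not proof of spam."""
    return FLAG_RE.match(email) is not None
```

Since a real person could match this too, it is probably safer to use the result to flag entries for review than to delete them outright.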
Sounds like a job for regular expressions.
(If you've never used regular expressions, here's a quick rundown of what that means: '^' matches the beginning of the string and '$' matches the end, so we're requiring that everything between those symbols is a pattern describing the entire string. '[a-z]' matches any lower-case letter, and '+' means "one or more times", so '[a-z]+' matches one or more lower-case letters. Putting it all together, our regex matches if the string can be described as "the beginning of the string, followed by the letters roger, followed by one or more lower-case letters, followed by an underscore, followed by one or more lower-case letters, followed by @hotmail.com, followed by the end of the string." If the regex matches, the email address fits the pattern you described in your question.)
Of course, if he catches on and changes up his pattern (for example, by switching first names), this method will fail and you'll have to fall back on more traditional spam-prevention techniques like employing a CAPTCHA.
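For reference, here is how the pattern described in the rundown might be written and used in Python. The regex is rebuilt from that description, so treat it as a sketch; escaping the dot in the domain and the names SPAM_RE and matches_pattern are my own assumptions.

```python
import re

# Built from the rundown above: start, "roger", one or more lower-case letters,
# an underscore, one or more lower-case letters, "@hotmail.com", end.
SPAM_RE = re.compile(r"^roger[a-z]+_[a-z]+@hotmail\.com$")

def matches_pattern(email):
    """True if the address fits the pattern described in the question."""
    return SPAM_RE.match(email) is not None

emails = [
    "rogerxy_smithabcd@hotmail.com",
    "alice@example.com",
]
flagged = [e for e in emails if matches_pattern(e)]
```

Running the list comprehension over your subscription list gives you the entries to look at first.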