检测相似电子邮件地址的最佳方法?

发布于 2024-09-01 15:23:19 字数 3317 浏览 7 评论 0 原文

我有大约 20,000 个电子邮件地址的列表,其中一些我知道是为了绕过“每封电子邮件 1 个”限制的欺诈性尝试,例如 [电子邮件受保护][电子邮件受保护][电子邮件受保护]等等。我想找到类似的电子邮件地址进行评估。目前,我正在使用 Levenshtein 算法来对照列表中的其他电子邮件检查每封电子邮件,并报告编辑距离小于 2 的任何电子邮件。但是,这非常慢。有更有效的方法吗?

我现在使用的测试代码是:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Threading;

namespace LevenshteinAnalyzer
{
    class Program
    {
        const string INPUT_FILE = @"C:\Input.txt";
        const string OUTPUT_FILE = @"C:\Output.txt";

        static void Main(string[] args)
        {
            var inputWords = File.ReadAllLines(INPUT_FILE);
            var outputWords = new SortedSet<string>();

            for (var i = 0; i < inputWords.Length; i++)
            {
                if (i % 100 == 0) 
                    Console.WriteLine("Processing record #" + i);

                var word1 = inputWords[i].ToLower();
                for (var n = i + 1; n < inputWords.Length; n++)
                {
                    if (i == n) continue;
                    var word2 = inputWords[n].ToLower();

                    if (word1 == word2) continue;
                    if (outputWords.Contains(word1)) continue;
                    if (outputWords.Contains(word2)) continue;
                    var distance = LevenshteinAlgorithm.Compute(word1, word2);

                    if (distance <= 2)
                    {
                        outputWords.Add(word1);
                        outputWords.Add(word2);
                    }
                }
            }

            File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray());
            Console.WriteLine("Found {0} words", outputWords.Count);
        }
    }
}

编辑:我试图捕捉的一些内容如下:

[电子邮件受保护]
[电子邮件受保护]
[电子邮件受保护]
[电子邮件受保护]
[电子邮件受保护]
[电子邮件受保护]
[电子邮件受保护]
[电子邮件受保护]
[电子邮件受保护]

I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as [email protected], [email protected], [email protected], etc. I want to find similar email addresses for evaluation. Currently I'm using a Levenshtein algorithm to check each e-mail against the others in the list and report any with an edit distance of less than 2. However, this is painstakingly slow. Is there a more efficient approach?

The test code I'm using now is:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Threading;

namespace LevenshteinAnalyzer
{
    class Program
    {
        const string INPUT_FILE = @"C:\Input.txt";
        const string OUTPUT_FILE = @"C:\Output.txt";

        static void Main(string[] args)
        {
            var inputWords = File.ReadAllLines(INPUT_FILE);
            var outputWords = new SortedSet<string>();

            for (var i = 0; i < inputWords.Length; i++)
            {
                if (i % 100 == 0) 
                    Console.WriteLine("Processing record #" + i);

                var word1 = inputWords[i].ToLower();
                for (var n = i + 1; n < inputWords.Length; n++)
                {
                    if (i == n) continue;
                    var word2 = inputWords[n].ToLower();

                    if (word1 == word2) continue;
                    if (outputWords.Contains(word1)) continue;
                    if (outputWords.Contains(word2)) continue;
                    var distance = LevenshteinAlgorithm.Compute(word1, word2);

                    if (distance <= 2)
                    {
                        outputWords.Add(word1);
                        outputWords.Add(word2);
                    }
                }
            }

            File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray());
            Console.WriteLine("Found {0} words", outputWords.Count);
        }
    }
}

Edit: Some of the stuff I'm trying to catch looks like:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

执妄 2024-09-08 15:23:19

您可以首先对要相互比较的电子邮件应用一些优先级。

性能限制的一个关键原因是将每个地址与每个其他电子邮件地址。 优先级是提高此类搜索算法性能的关键。

例如,您可以对具有相似长度(+/- 一定数量)的所有电子邮件进行存储,并首先比较该子集。您还可以从电子邮件中删除所有特殊字符(数字、符号),并在减少后找到相同的字符。

您可能还想从数据创建一个特里树,而不是逐行处理它,并使用它来查找共享一组通用后缀/前缀的所有电子邮件,并从该减少中驱动您的比较逻辑。从您提供的示例来看,您似乎正在寻找一个地址的一部分可能作为另一个地址中的子字符串出现的地址。 尝试(和后缀树)是执行这些类型搜索的有效数据结构。

优化此算法的另一种可能方法是使用电子邮件帐户的创建日期(假设您知道)。如果创建了重复的电子邮件,则它们可能会在短时间内相互创建 - 这可能会帮助您减少查找重复项时执行的比较次数。

You could start by applying some prioritization to which emails to compare to one another.

A key reason for the performance limitations is the O(n2) performance of comparing each address to every other email address. Prioritization is the key to improving performance of this kind of search algorithm.

For instance, you could bucket all emails that have a similar length (+/- some amount) and compare that subset first. You could also strip all special charaters (numbers, symbols) from emails and find those that are identical after that reduction.

You may also want to create a trie from the data rather than processing it line by line, and use that to find all emails that share a common set of suffixes/prefixes and drive your comparison logic from that reduction. From the examples you provided, it looks like you are looking for addresses where a part of one address could appear as a substring within another. Tries (and suffix trees) are an efficient data structure for performing these types of searches.

Another possible way to optimize this algorithm would be to use the date when the email account is created (assuming you know it). If duplicate emails are created they would likely be created within a short period of time of one another - this may help you reduce the number of comparisons to perform when looking for duplicates.

浅语花开 2024-09-08 15:23:19

好吧,您可以进行一些优化,假设 Levenshtein 差异是您的瓶颈。

1) 当编辑距离为 2 时,电子邮件之间的长度将在 2 个字符以内,因此不必费心进行距离计算,除非 abs(length(email1)-length(email2))< /code> <= 2

2) 同样,距离为 2 时,不同的字符不会超过 2 个,因此您可以对电子邮件中的字符进行哈希集,并取并集的长度减去两者相交的长度。 (我相信这是一个SymmetricExceptWith)如果结果是> 2、跳到下一个比较。

或者

编写您自己的编辑距离算法。如果您只对长度感兴趣< k、可以优化运行时间。请参阅维基百科页面上的“可能的改进”:http://en.wikipedia.org/wiki/Levenshtein_distance

Well you can make some optimizations, assuming that the Levenshtein difference is your bottleneck.

1) With a Levenshtein distance of 2, the emails are going to be within 2 characters length of one another, so don't bother to do the distance calculations unless abs(length(email1)-length(email2)) <= 2

2) Again, with a distance of 2, there are not going to be more than 2 characters different, so you can make HashSets of the characters in the emails, and take the length of the union minus the length of the intersection of the two. (I believe this is a SymmetricExceptWith) If the result is > 2, skip to the next comparison.

OR

Code your own Levenshtein distance algorithm. If you are only interested in lengths < k, you can optimize the run time. See "Possible Improvements" on the Wikipedia page: http://en.wikipedia.org/wiki/Levenshtein_distance.

感悟人生的甜 2024-09-08 15:23:19

您可以添加一些优化:

1) 保留已知欺诈行为的列表,并首先与该列表进行比较。当你开始使用你的算法后,你可能会比点击主列表更快地点击这个列表。

2)首先对列表进行排序。它不会花费太长的时间(相比之下)并且会增加首先匹配字符串前面的机会。首先按域名排序,然后按用户名排序。也许将每个域放在自己的存储桶中,然后排序并与该域进行比较。

3) 一般考虑剥离域。 [电子邮件受保护][email protected] 永远不会触发您的标志。

You could add a few optimizations:

1) Keep a list of known frauds and compare to that first. After you get going in your algorithm, you might be able hit against this list faster than you hit the main list.

2) Sort the list first. It won't take too long (in comparison) and will increase the chance of matching the front of the string first. Have it sort by domain name first, then by username. Perhaps put each domain in its own bucket, then sort and also compare against that domain.

3) Consider stripping the domain in general. [email protected] and [email protected] will never trigger your flag.

月野兔 2024-09-08 15:23:19

如果您可以定义到某个 k 维空间的合适映射,以及该空间的合适范数,则这可以简化为 所有最近邻问题,可以在 O(n log n) 时间内解决。

然而,找到这样的映射可能很困难。也许有人会接受这个部分答案并运行它。

If you can define a suitable mapping to some k-dimensional space, and a suitable norm on that space, this reduces to the All Nearest Neighbours Problem which can be solved in O(n log n) time.

Finding such a mapping, however, might be difficult. Maybe someone will take this partial answer and run with it.

另类 2024-09-08 15:23:19

为了完整起见,您还应该考虑电子邮件地址的语义,具体如下:

  1. Gmail 将 user.nameusername 视为相同,因此两者是属于同一用户的有效电子邮件地址。其他服务也可以这样做。 LBushkin 去除特殊字符的建议会有所帮助此处。

  2. 子地址如果用户明智的话,可能会触发您的过滤器到此为止。您希望在比较之前删除子地址数据。

Just for completeness, you should consider the semantics of email addresses as well, in terms of:

  1. Gmail treats user.name and username as being the same, so both are valid email addresses belonging to the same user. Other services may do this as well. LBushkin's suggestion to strip special characters would help here.

  2. Sub-adrressing can potentially trip your filter if users wise up to it. You'd want to drop the sub-address data before comparison.

彩扇题诗 2024-09-08 15:23:19

您可能需要查看完整的数据集,看看具有欺骗性电子邮件的帐户之间是否存在其他共性。

我不知道您的应用程序是做什么的,但如果还有其他关键点,请使用它们来过滤您要比较的地址。

You might want to look at the full data set to see if there is other commonality between accounts that have spoofed emails.

i don't know what your application does, but if there are other key points, then use those to filter down what addresses you are going to compare.

裂开嘴轻声笑有多痛 2024-09-08 15:23:19

首先将所有内容排序到哈希表中。关键应该是电子邮件的域名; “gmail.com”。如上所述,从值中删除特殊字符。

然后相互检查所有 gmail.com。这应该快得多。不要比较长度相差超过 3 个字符的事物。

第二步,相互检查所有键,并在那里进行分组。 (例如,gmail.com == googlemail.com。)

Sort everything into a hashtable first. The key should be the domain name of the email; "gmail.com". Strip out special characters from the values, as was mentioned above.

Then check all the gmail.com's against one another. That should be much faster. Do not compare things that are more than 3 characters different in length.

As a second step, check all the keys against one another, and develop groupings there. (gmail.com == googlemail.com, for example.)

只涨不跌 2024-09-08 15:23:19

我同意其他人关于比较电子邮件地址没有帮助的评论,因为用户也可以创建具有欺骗性的外观不同的地址。

我认为最好采用其他解决方案,例如限制您每小时/每天可以写下的电子邮件数量,或者限制您收到这些地址和发送给用户之间的时间。基本上,以一种每天发送一些邀请比较舒服的方式来解决,但发送很多邀请却需要 PITA。我想如果大多数用户必须花费相对较长的时间才能获得免费赠品,他们会忘记/放弃这样做。

I agree with others comments about comparing email addresses not being to helpful, since users could just aswell create fraudulent disimilar looking addresses.

I think a better to come with other solutions, such as limiting the amount of emails you can write down per hour/day, or the time between those addresses being received by you and being sent to the users. Basically, work it out in a way where it is comfortable to send to send a few invites per day, but a PITA to send out many. I guess most users would forget/give up to do it if they had to do it through out a relatively long period of time in order to get their freebies.

爱要勇敢去追 2024-09-08 15:23:19

有什么方法可以检查创建电子邮件的人的 IP 地址吗?这将是一种简单的方法来确定或至少为您提供有关不同电子邮件地址是否来自同一个人的附加信息。

Is there any way you can do a check on the IP address of the person creating the email. That would be a simple way to determine, or at least give you added information about whether the different email addresses have come from the same person.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文