Flagging possibly identical users in an account management system
I am working on a possible architecture for an abuse detection mechanism in an account management system. I want to detect possible duplicate users based on correlated fields within a table. To keep the problem simple, let's say I have a USER table with the following fields:
Name
Nationality
Current Address
Login
Interests
It is quite possible that one user has created multiple records in this table, and there may be a pattern in how this user created his/her accounts. What would it take to mine this table and flag records that may be duplicates? Another concern is scale. If we have, say, a million users, taking each user and matching it against all the remaining users is computationally unrealistic. And what if the records are distributed across machines in different geographic locations?
What techniques can I use to solve this problem? I have tried to pose the question in a technology-agnostic way, in the hope that people can offer multiple perspectives.
Thanks
Comments (1)
The answer really depends upon how you model your users and what constitutes a duplicate.
There could be a user who names every account after a different Harry Potter character. Good luck finding that pattern :)
If you are looking for records that are approximately similar, try this simple approach:
Hash each word in the record and keep the minimum hash value (the min-hash). Do this for k different hash functions, then concatenate the k minimum values. What you get is a signature for near-duplicate detection.
To be clear, let's say a record has words w1...wn and your hash functions are h1...hk.
let m_i = min_j h_i(w_j)
and the signature is S = m1.m2.m3....mk
The cool thing about this signature is that similar records are likely to end up with the same signature. For a single, well-behaved hash function, the chance that two records share the same minimum equals the Jaccard similarity J of their word sets, so two records whose word sets overlap by 90% agree on each m_i about 90% of the time, and on the full signature with probability of roughly J^k. Hence, instead of looking for near duplicates, you look for exact duplicates among the signatures. If you want more matches, decrease k; if you are getting too many false positives, increase k.
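As a rough illustration of the signature idea above, here is a minimal Python sketch. It is not a definitive implementation: the function names (`word_hash`, `minhash_signature`, `record_words`, `candidate_duplicates`), the use of seeded BLAKE2b as the family h1...hk, the whitespace tokenization, and the default k=4 are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def word_hash(word, seed):
    """h_seed(word): a seeded 64-bit hash (assumed hash family, stable across runs)."""
    digest = hashlib.blake2b(
        word.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(words, k=4):
    """Return S = (m_1, ..., m_k) where m_i = min_j h_i(w_j)."""
    unique = set(words)
    if not unique:
        raise ValueError("record has no words")
    return tuple(min(word_hash(w, i) for w in unique) for i in range(k))


def record_words(record):
    """Lower-cased, whitespace-separated tokens from all field values."""
    return " ".join(str(v) for v in record.values()).lower().split()


def candidate_duplicates(records, k=4):
    """Group record ids by exact signature; return groups with more than one id."""
    buckets = defaultdict(list)
    for record_id, record in records.items():
        buckets[minhash_signature(record_words(record), k)].append(record_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


users = {
    1: {"name": "John Smith", "address": "12 Oak Street Springfield"},
    2: {"name": "SMITH John", "address": "Springfield 12 Oak Street"},
    3: {"name": "Alice Jones", "address": "7 Elm Road Shelbyville"},
}
print(candidate_duplicates(users))
```

Because the grouping is a single pass that only needs each record's own signature, signatures could be computed wherever the records live and only the signatures compared centrally, which avoids the all-pairs comparison. Records that differ in a few words may still land in different buckets; that is the trade-off controlled by k.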
Of course, there is also the approach of using implicit features of users, such as their IP addresses, cookies, etc.
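A small sketch of the implicit-feature idea, assuming you log events as dictionaries with hypothetical `login`, `ip` and `cookie` keys (these field names and the `accounts_sharing` helper are made up for illustration):

```python
from collections import defaultdict


def accounts_sharing(events, key):
    """Map each value of `key` (e.g. an IP) to the logins seen with it,
    keeping only values shared by more than one login."""
    seen = defaultdict(set)
    for event in events:
        seen[event[key]].add(event["login"])
    return {value: logins for value, logins in seen.items() if len(logins) > 1}


events = [
    {"login": "jsmith", "ip": "203.0.113.7", "cookie": "abc"},
    {"login": "john.s", "ip": "203.0.113.7", "cookie": "abc"},
    {"login": "alice", "ip": "198.51.100.2", "cookie": "xyz"},
]
print(accounts_sharing(events, "ip"))
```

A shared IP or cookie is only a hint (offices and households share addresses), so it is probably best treated as one more signal to combine with the signature match rather than as proof.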