Calculating the probability that a string is randomized? - Python

Posted on 2024-09-01 12:25:35

This is related to a question I asked earlier (question).

I have a list of manually created strings such as:

lucy87

gordan_king

fancy_unicorn77

joplucky_kanga90

base_belong_to_narwhals

and a list of randomized strings:

johnkdf

pancake90kgjd

fancy_jagookfk

manhattanljg


What gives away that the last set of strings is randomized is the presence of sequences such as 'kjg', 'jgf', 'lkd', ...

Any clever way I could separate strings that contain these apparently randomized strings from the crowd?

I guess that this plays a lot on the fact that certain characters are more likely to be placed next to others (e.g. 'co', 'ka', 'ja', ...).


Any ideas on this one? Kylotan mentioned Reverend, but I am not sure if it can be used for such a purpose.

Help would be much appreciated!

Comments (5)

二智少女猫性小仙女 2024-09-08 12:25:35

This is just a thought. I've never tried it myself...

Build a Bloom filter by hashing every (overlapping) 4-letter sequence found in a dictionary. Test a string by counting how many 4-letter sequences in the string don't hit the filter. The more misses, the more likely it is that the string contains random junk.

Try tuning the size of the Bloom filter and the number of letters per sequence.

Also note (thanks @MihaiD) that you should include a dictionary of names, preferably from multiple languages, in the Bloom filter to minimise false positives.
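To make the idea above concrete, here is a minimal sketch of it in Python. Everything in it is an assumption for illustration: the `BloomFilter` class, the filter size, the number of hash functions, and the tiny word list are all made up, and a real run would load a full dictionary plus name lists.

```python
import hashlib
import re


class BloomFilter:
    """A tiny Bloom filter backed by a bytearray (the sizes are arbitrary)."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(text, n=4):
    """Yield every overlapping n-letter sequence of text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def build_filter(words, n=4):
    """Add every overlapping n-gram of every dictionary word to a new filter."""
    bloom = BloomFilter()
    for word in words:
        for gram in ngrams(word.lower(), n):
            bloom.add(gram)
    return bloom


def miss_ratio(text, bloom, n=4):
    """Fraction of n-grams (taken from runs of letters only) that miss the filter."""
    grams = [g for run in re.findall(r"[a-z]+", text.lower())
             for g in ngrams(run, n)]
    if not grams:
        return 0.0  # too short to judge
    misses = sum(1 for g in grams if g not in bloom)
    return misses / len(grams)


# Hypothetical stand-in for a real dictionary plus name lists.
bloom = build_filter(["fancy", "unicorn", "pancake", "lucky"])
for name in ["fancy_unicorn77", "pancake90kgjd"]:
    print(name, miss_ratio(name, bloom))
```

A Bloom filter never reports a miss for something that was added, so known sequences always pass. The trade-off is the occasional false hit, which is why the filter size is worth tuning.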

笑看君怀她人 2024-09-08 12:25:35

What scores do you get if you run the strings through something like TextCat? (I've seen a few different implementations of TextCat; maybe there's a Python one already out there; if not, it's not a hard algorithm -- it's the data that's important.)

I'm thinking that if you strip the numbers out, the first set of strings will be closer to the "English" result in TextCat than the ones with random stuff in them.

How much closer and whether you might be able to use the TextCat data -- which is fundamentally based on which letters tend to be next to each other in particular languages -- to "pass" or "fail" a string is going to need some experimentation, though...
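For anyone who wants to experiment without a TextCat install, here is a rough sketch of the rank-based n-gram comparison that this kind of tool is built on. It is not TextCat's actual code or data: the function names are made up, word-boundary padding is left out for brevity, and the reference text is a placeholder for a large English sample.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the n-grams (1..max_n letters) of text, most frequent first."""
    counts = Counter()
    for run in re.findall(r"[a-z]+", text.lower()):  # digits etc. are dropped
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so output is stable.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]


def out_of_place(candidate, reference):
    """Sum of rank differences between two profiles.

    An n-gram that is absent from the reference gets the maximum penalty,
    which is the length of the reference profile.
    """
    ref_rank = {gram: rank for rank, gram in enumerate(reference)}
    max_penalty = len(reference)
    return sum(abs(rank - ref_rank[gram]) if gram in ref_rank else max_penalty
               for rank, gram in enumerate(candidate))


# Placeholder reference text; a real profile needs far more data than this.
english = ngram_profile("fancy unicorn pancake lucky kangaroo base belong narwhals")
for name in ["fancy_unicorn77", "pancake90kgjd"]:
    profile = ngram_profile(name)
    # Divide by profile length so long and short names are comparable.
    print(name, out_of_place(profile, english) / max(len(profile), 1))
```

Lower distances mean "closer to the reference language". Where to put the pass/fail cut-off is exactly the part that needs experimentation.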

指尖上的星空 2024-09-08 12:25:35

Try using a vanilla Bayes classifier. It should be enough for the general case.
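As a rough illustration, a naive Bayes classifier over letter pairs could look like the sketch below. The choice of bigrams as features, the add-one smoothing, and the function names are all assumptions made for this example. The training lists are just the strings from the question, which is far too little data for real use.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Overlapping letter pairs taken from each run of letters in text."""
    return [run[i:i + 2]
            for run in re.findall(r"[a-z]+", text.lower())
            for i in range(len(run) - 1)]


def train(labelled):
    """labelled maps a class name to a non-empty list of example strings."""
    counts = {label: Counter() for label in labelled}
    docs = {label: len(examples) for label, examples in labelled.items()}
    for label, examples in labelled.items():
        for text in examples:
            counts[label].update(bigrams(text))
    vocab = set().union(*counts.values())
    return counts, docs, vocab


def classify(text, model):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label in counts:
        total = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)  # class prior
        for gram in bigrams(text):
            score += math.log((counts[label][gram] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Training data copied from the question; much too small for real use.
model = train({
    "manual": ["lucy87", "gordan_king", "fancy_unicorn77",
               "joplucky_kanga90", "base_belong_to_narwhals"],
    "random": ["johnkdf", "pancake90kgjd", "fancy_jagookfk", "manhattanljg"],
})
print(classify("lucky_narwhal", model))
```

Because the randomized strings are mostly ordinary words with a short junk tail, a whole-string classifier like this may need a lot of examples before the junk bigrams outweigh the normal ones.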

噩梦成真你也成魔 2024-09-08 12:25:35

It seems to me like you are trying to write code to recognize one particular set of small things some spammer does to a string to get past your filters. What I don't see is what stops them, after all your hard work, from making a 10-second tweak to their algorithm and defeating your new filter.

无所谓啦 2024-09-08 12:25:35

Some time ago I read a short article about random name generation in which they did the following: they built a table containing the information you already pointed at: "I guess that this plays a lot on the fact that certain characters are more likely to be placed next to others".

So what they did was read a whole dictionary and determine which letters are more likely to be placed next to each other. I do not know how many letters in a row they considered. Maybe you should try more than just two consecutive letters, say somewhere between 3 and 6.

Now I suggest you build such a table (maybe in a better data structure) that contains all "valid" consecutive letter combinations (and maybe their likelihoods), and check whether the name being tested contains (almost) only such "valid" consecutive letters.
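A minimal sketch of such a table, assuming a plain nested counter keyed by the preceding letters: the function names, the context length of 2, and the add-one smoothing are choices made for this example and do not come from the article. The word list is a placeholder for a real dictionary.

```python
import math
import re
from collections import Counter, defaultdict


def build_table(words, order=2):
    """Count which letter follows each `order`-letter context in the word list."""
    table = defaultdict(Counter)
    for word in words:
        for run in re.findall(r"[a-z]+", word.lower()):
            for i in range(len(run) - order):
                table[run[i:i + order]][run[i + order]] += 1
    return table


def avg_log_likelihood(text, table, order=2, alphabet_size=26):
    """Mean smoothed log-probability of each letter given the letters before it.

    Returns 0.0 when the text has no run of letters longer than `order`,
    meaning there is no evidence either way; handle such strings separately.
    """
    logs = []
    for run in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(run) - order):
            context, nxt = run[i:i + order], run[i + order]
            seen = table.get(context, Counter())
            # Add-one smoothing keeps unseen combinations from giving log(0).
            prob = (seen[nxt] + 1) / (sum(seen.values()) + alphabet_size)
            logs.append(math.log(prob))
    return sum(logs) / len(logs) if logs else 0.0


# Placeholder word list; the article read a whole dictionary.
table = build_table(["pancake", "fancy", "unicorn", "lucky", "kangaroo"])
for name in ["fancy_unicorn77", "pancake90kgjd"]:
    print(name, avg_log_likelihood(name, table))
```

Scores closer to zero mean the letter combinations look more like the dictionary. Raising `order` towards the 3 to 6 range suggested above makes the table stricter but needs a much larger word list.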
