Detecting retweets using computationally cheap Python hashing algorithms
In order to be able to detect RTs of a particular tweet, I plan to store hashes of each formatted tweet in the database.
What hashing algorithm should I use? Cryptographic strength is of course not essential. I just need a minimal way of storing the data as something which can then be compared for sameness, in an efficient way.
My first attempt at this was using md5 hashes. But I figured there could be hashing algorithms that are much more efficient, as security is not required.
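For concreteness, the md5 attempt mentioned above can be sketched as follows (the helper name is mine):

```python
import hashlib

def tweet_hash(text):
    """Hash a formatted tweet to a fixed-size hex digest for storage."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

h = tweet_hash("RT @user: some tweet text")
# identical tweets always produce identical digests
assert h == tweet_hash("RT @user: some tweet text")
```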
Comments (7)
Do you really need to hash at all? Twitter messages are short enough (and disk space cheap enough) that it may be better to just store the whole message, rather than eating up clock cycles to hash it.
I am not familiar with Python (sorry, Ruby guy typing here), however you could try a few things.
Assumptions:
You will likely be storing hundreds of thousands of Tweets over time, so comparing one hash against "every record" in the table will be inefficient. Also, RTs are not always carbon copies of the original tweet. After all, the original author's name is usually included and takes up some of the 140-character limit. So perhaps you could use a solution that matches more accurately than a "dumb" hash?
Tagging & Indexing
Tag and index the component parts of the message in a standard way. This could include treating hashed #...., at-marked @.... and URL strings as "tags". After removing noise words and punctuation, you could also treat the remaining words as tags too.
Fast Searching
Databases are terrible at finding multiple group membership very quickly (I'll assume you're using either Mysql or Postgresql, which are terrible at this). Instead try one of the free-text engines like Sphinx Search. They are very, very fast at resolving multiple group membership (i.e. checking if keywords are present).
Using Sphinx or similar, we search on all of the "tags" we extracted. This will probably return a smallish result set of "potential original Tweets". Then compare them one by one using a similarity-matching algorithm (here is one in Python: http://code.google.com/p/pylevenshtein/).
Now let me warmly welcome you to the world of text mining.
Good luck!
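The tag-extraction and pairwise-comparison steps above can be sketched like this. This is only an illustration under my own assumptions: the noise-word list is a toy one, and I use the standard library's `difflib.SequenceMatcher` as a stand-in for pylevenshtein's edit distance:

```python
import re
from difflib import SequenceMatcher

NOISE_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}  # toy stop list

def extract_tags(tweet):
    """Treat #hashtags, @mentions, URLs and remaining non-noise words as tags."""
    tokens = re.findall(r"https?://\S+|[#@]?\w+", tweet.lower())
    return {t for t in tokens if t not in NOISE_WORDS}

def similarity(a, b):
    """Ratio in [0, 1]; stands in for pylevenshtein's edit distance."""
    return SequenceMatcher(None, a, b).ratio()

original = "Check out my new blog post http://example.com #python"
retweet = "RT @alice: Check out my new blog post http://example.com #python"

# the candidate shares most of its tags with the original...
shared = extract_tags(original) & extract_tags(retweet)
assert shared
# ...and the raw strings score high on similarity
assert similarity(original, retweet) > 0.8
```

In a real setup, `extract_tags` would feed the Sphinx index, and `similarity` would only run over the small candidate set that the index search returns.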
I echo Chris' comment about not using a hash at all (your database engine can hopefully index 140-character fields efficiently).
If you did want to use a hash, MD5 would be my first choice as well (16 bytes), followed by SHA-1 (20 bytes).
Whatever you do, don't use sum-of-characters. I can't immediately come up with a function that would have more collisions (all anagrams hash the same), plus it's slower!
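A quick sketch illustrating both points above: the digest sizes of MD5 and SHA-1, and why a character sum collides on every pair of anagrams:

```python
import hashlib

def char_sum(text):
    """The sum-of-characters 'hash' being warned against."""
    return sum(ord(c) for c in text)

# digest sizes: MD5 is 16 bytes, SHA-1 is 20 bytes
assert hashlib.md5(b"tweet").digest_size == 16
assert hashlib.sha1(b"tweet").digest_size == 20

# any two anagrams collide under a character sum...
assert char_sum("listen") == char_sum("silent")
# ...while a real hash tells them apart
assert hashlib.md5(b"listen").digest() != hashlib.md5(b"silent").digest()
```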
There are a few issues here. First, RT's are not always identical. Some people add a comment. Others change the URL for tracking. Others add in the person that they are RT'ing (which may or may not be the originator).
So if you are going to hash the tweet, you need to boil it down to the meat of the tweet, and only hash that. Good luck.
Above, someone mentioned that with 32-bits, you will start having collisions at about 65K tweets. Of course, you could have collisions on tweet #2. The full 32-bit space is about 4.3 billion values (2^32), not 65K, but the 65K figure is not unreasonable either: by the birthday paradox, a collision becomes likely once you have stored on the order of sqrt(2^32) ≈ 65K hashes. So you have somewhat more room there, but less than the size of the space suggests.
A better algorithm might be to try to derive the "unique" parts of the tweet, and fingerprint it. It's not a hash, it's a fingerprint of a few key words that define uniqueness.
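The fingerprinting idea can be sketched like this. It is only one possible interpretation, with names and heuristics of my own choosing: strip the RT prefix, @mentions and URLs, then keep a few of the longest remaining words as the "unique" part:

```python
import re

def fingerprint(tweet, n_words=4):
    """Boil a tweet down to its 'meat': drop the RT prefix, @mentions and
    URLs, then keep the longest remaining words as a sorted tuple."""
    text = re.sub(r"^RT\s+", "", tweet, flags=re.IGNORECASE)
    text = re.sub(r"@\w+:?|https?://\S+", "", text)
    words = re.findall(r"\w+", text.lower())
    return tuple(sorted(sorted(words, key=len, reverse=True)[:n_words]))

original = "Big news: our paper got accepted! http://example.com"
retweet = "RT @bob: Big news: our paper got accepted! http://t.co/xyz"

# both variants boil down to the same fingerprint,
# even though the attribution and the tracking URL differ
assert fingerprint(original) == fingerprint(retweet)
```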
Well, tweets are only 140 characters long, so you could even store the entire tweet in the database...
but if you really want to "hash" them somehow, a simple way would be to just take the sum of the ASCII values of all the characters in the tweet:
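For instance (a minimal sketch; the helper names are mine), including the verification step described below:

```python
def sum_hash(tweet):
    """Sum of the character code points -- cheap, but collision-prone."""
    return sum(ord(c) for c in tweet)

stored = {}  # sum-hash -> list of tweets with that hash

def is_duplicate(tweet):
    """On a hash match, compare the actual texts, since collisions are likely."""
    for candidate in stored.get(sum_hash(tweet), []):
        if candidate == tweet:
            return True
    stored.setdefault(sum_hash(tweet), []).append(tweet)
    return False

assert not is_duplicate("hello twitter")
assert is_duplicate("hello twitter")      # exact repeat is caught
assert not is_duplicate("twitter hello")  # anagram collides but is not equal
```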
Of course, whenever you have a match of hashes, you should check the tweets themselves for sameness, because the probability of finding two tweets that give the same "sum-hash" is probably non-negligible.
You are trying to hash a string, right? Builtin types can be hashed right away, just do
hash("some string")
and you get some int. It's the same function Python uses for dictionaries, so it is probably the best choice.
Python's shelve module? http://docs.python.org/library/shelve.html