Search by hash?
I had an idea for a search engine that would index items on the web as other search engines do, but would store only each file's title, URL, and a hash of its contents.
This would make it easy to find an item on the web if you already had a copy and didn't know where it came from, or if you wanted to know every place it appeared.
It would be most useful for non-textual items such as images, executables, and archives.
Does something like this already exist?
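To make the idea concrete, here is a minimal sketch of such an index, assuming SHA-256 as the content hash and an in-memory dictionary as the store. The names `HashIndex`, `content_hash`, and the example URLs are hypothetical, chosen only for illustration; a real crawler and storage layer are out of scope.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw file contents (assumed hash choice)."""
    return hashlib.sha256(data).hexdigest()


class HashIndex:
    """Maps a content hash to every (title, url) where that content was seen."""

    def __init__(self):
        self._locations = defaultdict(list)

    def add(self, title: str, url: str, data: bytes) -> None:
        # Only the hash is kept as the key; the content itself is never stored.
        self._locations[content_hash(data)].append((title, url))

    def find(self, data: bytes) -> list:
        # Return all known locations of byte-identical content.
        return list(self._locations.get(content_hash(data), []))


# Hypothetical crawl results: the same image bytes seen at two URLs.
index = HashIndex()
index.add("Logo", "http://example.com/logo.png", b"\x89PNG fake image bytes")
index.add("Mirror of logo", "http://example.org/img/a.png", b"\x89PNG fake image bytes")
index.add("Installer", "http://example.com/setup.exe", b"MZ fake executable")

matches = index.find(b"\x89PNG fake image bytes")
for title, url in matches:
    print(title, url)
```

Note that an exact hash like this only matches byte-identical copies; a resized or re-encoded image would hash differently.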
Comments (5)
Well, for images there's http://tineye.com, which goes one better and finds similar images too.
It's not a bad idea. Sometimes I stumble upon a file and try to figure out where it came from :) But how are you going to track an item's sources? Content can be obtained by various means: a web browser, a download manager, or simply copying from a network share.
If I understand your proposal right, http://bitzi.com/ has done this for a while.
Check out the Wikipedia page on locality-sensitive hashing. There's also a good page on it hosted at MIT.
In general, there are several flavors available: hashes for strings (such as simhash), for sets or 0/1 features (such as min-wise hashes), and for real-valued vectors.
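As an illustration of the set flavor, here is a rough sketch of min-wise hashing, assuming character 3-grams as the set elements and salted SHA-256 as the family of hash functions. Both choices, and the function names, are mine rather than anything prescribed above. The fraction of signature positions on which two sets agree estimates their Jaccard similarity.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Split text into the set of its overlapping k-character substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """For each salted hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha256(salt + item.encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
sig_near = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
sig_far = minhash_signature(shingles("completely unrelated sentence here"))

print(estimate_similarity(sig_a, sig_near))  # high: the texts differ by a small edit
print(estimate_similarity(sig_a, sig_far))   # low: the texts share almost no 3-grams
```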
So far, the main trick for numerical hashes is basically dimension reduction. For strings, the idea is to come up with a representation that is robust to minor edits.
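One well-known instance of dimension reduction for real vectors is random hyperplane hashing (sign random projection). It is offered here as an illustration, not necessarily the technique meant above. Each vector is projected onto a few random directions and only the sign of each projection is kept, so vectors at a small angle tend to get signatures with a small Hamming distance. The function names and parameters below are hypothetical.

```python
import random


def make_hyperplanes(dim: int, num_bits: int, seed: int = 0) -> list:
    """Draw num_bits random Gaussian directions in dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes) -> int:
    """Pack the signs of the projections onto each hyperplane into an integer."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dim=4, num_bits=64)
v = [1.0, 2.0, 3.0, 4.0]
sig_v = signature(v, planes)
sig_scaled = signature([2.0, 4.0, 6.0, 8.0], planes)        # same direction as v
sig_near = signature([1.0, 2.0, 3.0, 4.1], planes)          # small angle to v
sig_opposite = signature([-1.0, -2.0, -3.0, -4.0], planes)  # opposite direction

print(hamming(sig_v, sig_scaled), hamming(sig_v, sig_near), hamming(sig_v, sig_opposite))
```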
I'm also doing a little research in this field, although I guess Stack Overflow might not be the right place for nascent work.
The question seems to focus on exact-match hashes, which we understand better than nearest-neighbor approaches and which are indeed worthwhile, especially if people can share tags and other metadata that way.
As @rjmunro notes, hash-based searching is a popular idea in the P2P world, and Bitzi did pretty much this. They have since shut down, and their Bitpedia (Digital Media Encyclopedia) is no longer hosted there, though at least some of it is still available at Archive.org.
Bitzi also produced software such as Bitcollider (SourceForge.net) and the Magnet URI scheme, which specifies a file by its hash and is thus a content-based identifier. Various applications support searching various databases via Magnet URIs, as described on the scheme's Wikipedia page.
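For a sense of what a content-based identifier looks like, here is a small sketch that builds a magnet link from a file's bytes. It assumes the `urn:sha1` form, where the `xt` parameter carries a Base32-encoded SHA-1 digest and `dn` carries a display name. Other hash URNs exist, and the helper name `magnet_from_bytes` is hypothetical.

```python
import base64
import hashlib
from urllib.parse import quote


def magnet_from_bytes(data: bytes, display_name: str) -> str:
    """Build a magnet link that identifies content by its SHA-1 hash."""
    digest = hashlib.sha1(data).digest()                 # 20 raw bytes
    b32 = base64.b32encode(digest).decode("ascii")       # 32 chars, no padding
    return f"magnet:?xt=urn:sha1:{b32}&dn={quote(display_name)}"


link = magnet_from_bytes(b"fake audio bytes", "my song.mp3")
print(link)
```

Because the link depends only on the bytes, two people holding the same file produce the same `xt` value regardless of where they got it or what they named it.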
The same idea is popular in the password-cracking scene; see, e.g., findmyhash, a Python script that cracks hashes using online services.
Going a step further, I think it would be great if there were databases and online repositories identifying content by hash and providing tags and other metadata about the content from various perspectives. Then I could leave my music collection in its pristine state (no wasted backup space and time) but still tag the files myself and add other metadata via external tag databases. If my applications knew how to fetch the tags, that would seem much better than the current system, where we modify and copy around big files just to move tags from, e.g., my desktop to my phone.
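The external tag database idea could look roughly like the sketch below, which keeps tags in a separate store keyed by a hash of the file contents so the files themselves are never modified. The `TagStore` class, the source names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class TagStore:
    """Keeps tags outside the files, keyed by a hash of the file contents."""

    def __init__(self):
        self._tags = {}  # content hash -> {source name -> {tag: value}}

    @staticmethod
    def key(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def add(self, data: bytes, source: str, **tags) -> None:
        # Record tags from one perspective (source) without touching the file.
        self._tags.setdefault(self.key(data), {}).setdefault(source, {}).update(tags)

    def lookup(self, data: bytes) -> dict:
        # Merge tags from all sources; sources added later override earlier ones.
        merged = {}
        for source_tags in self._tags.get(self.key(data), {}).values():
            merged.update(source_tags)
        return merged


song = b"fake audio bytes"
store = TagStore()
store.add(song, "external-db", artist="Some Artist", title="Some Song")
store.add(song, "me", rating="5")
tags = store.lookup(song)
print(tags)
```

Any device holding the same bytes computes the same key, so moving tags from desktop to phone means syncing this small store rather than copying the files.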
See a related idea in Metadata Independent Hashing for Media Identification & P2P Transfer Optimisation (pdf).