How do you build a torrent file indexer?
I am curious about the technology behind a search engine like torrentz.com. From what I could observe, it doesn't host any torrent files, but rather connects you to other servers that do.
- You search for keywords, and it brings up a list of potential titles matching your search.
- You then pick one of these, and it gives you another list of servers that may host the corresponding torrent file.
What I'm interested in particularly is the strategy behind gathering and indexing all that content:
How do they collect and then aggregate the data?
Is it a submission-based service, where each of these servers submits its content for indexing?
Is it a crawling algorithm? If so, how do you even start crawling a site like piratebay.org?
Do they have access to these other servers' databases?
My knowledge of the BitTorrent protocol is not very deep, and the documentation I found online pointed me more toward the processes involved in building a tracker service, which isn't exactly what I'm interested in. Any insight and recommended reading material is appreciated.
Comments (1)
To begin with, index their RSS feeds and gather data from them. The next step would be indexing the pages of the portals themselves (Mininova, TPB, etc.), but be aware that you can get banned (by IP) for doing so, since crawling generates a large volume of requests to their servers (I don't think they would be too happy about that).
That said, I doubt that they have access to other servers' databases; more likely it's crawling plus RSS.
Another thing you can do: when somebody queries for an item that isn't in your database, run the query against the main BT portals, cache the results in your DB, and then display them. If another user then makes the same query (a pretty common scenario), you can show them the cached data plus any new data from RSS.
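As a rough illustration of the RSS step, here is a minimal sketch in Python using only the standard library. It assumes the portals publish plain RSS 2.0 with `<item>`, `<title>`, `<link>` and `<pubDate>` elements; the function names, the in-memory index layout and the sample feed (with its `portal-a.example` URLs) are made up for the example. Fetching the feed over HTTP, and throttling those requests so you don't get banned, are left out.

```python
import xml.etree.ElementTree as ET


def parse_feed(xml_text):
    """Return a list of {title, link, published} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if not title or not link:
            continue  # skip items that cannot be indexed
        published = (item.findtext("pubDate") or "").strip()
        entries.append({"title": title, "link": link, "published": published})
    return entries


def add_to_index(index, entries, source):
    """Merge entries into index: normalized title -> set of (source, link).

    Returns how many new (source, link) pairs were added, so repeated
    polling of the same feed does not create duplicates.
    """
    added = 0
    for entry in entries:
        key = " ".join(entry["title"].lower().split())
        hosts = index.setdefault(key, set())
        before = len(hosts)
        hosts.add((source, entry["link"]))
        added += len(hosts) - before
    return added


# Hypothetical feed content, standing in for what a portal might serve.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example portal</title>
<item>
  <title>Ubuntu 9.04 Desktop ISO</title>
  <link>http://portal-a.example/t/1.torrent</link>
  <pubDate>Mon, 01 Jun 2009 10:00:00 GMT</pubDate>
</item>
<item>
  <title>Debian 5.0 netinst</title>
  <link>http://portal-a.example/t/2.torrent</link>
</item>
</channel></rss>"""

if __name__ == "__main__":
    index = {}
    new_links = add_to_index(index, parse_feed(SAMPLE_FEED), "portal-a")
    print(new_links, sorted(index))
```

Keying the index on a normalized title is only one possible choice; it is what lets the same title from several portals end up under a single entry with a list of hosts.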
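The query-on-miss idea could look roughly like the sketch below. `CachedSearch` and `query_portal` are hypothetical names: `query_portal` stands for whatever function actually asks a portal for results. The expiry time (`ttl_seconds`) is my own addition, not something described above, so that cached results are eventually refreshed.

```python
import time


class CachedSearch:
    """Answer queries from a local cache, asking the portal only on a miss."""

    def __init__(self, query_portal, ttl_seconds=3600, clock=time.time):
        self._query_portal = query_portal  # hypothetical: query -> list of results
        self._ttl = ttl_seconds            # assumption: cached results expire
        self._clock = clock
        self._cache = {}                   # normalized query -> (fetched_at, results)

    def search(self, query, fresh_from_rss=()):
        key = " ".join(query.lower().split())
        now = self._clock()
        hit = self._cache.get(key)
        if hit is None or now - hit[0] > self._ttl:
            # Miss (or stale entry): query the portal once and remember the answer.
            results = list(self._query_portal(key))
            self._cache[key] = (now, results)
        else:
            results = hit[1]
        # Show cached data plus anything newer that came in via RSS, without duplicates.
        merged, seen = [], set()
        for result in list(results) + list(fresh_from_rss):
            if result not in seen:
                seen.add(result)
                merged.append(result)
        return merged


if __name__ == "__main__":
    def fake_portal(query):
        # Stand-in for a real portal lookup.
        return ["http://portal-a.example/t/1.torrent"]

    searcher = CachedSearch(fake_portal)
    print(searcher.search("Ubuntu ISO"))
    print(searcher.search("ubuntu  iso", ["http://portal-b.example/t/9.torrent"]))
```

The second call is served from the cache because both queries normalize to the same key, so the portal is contacted only once.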