Suggestions for building a search engine with Django
I'm new to web crawling. I'm going to build a search engine whose crawler saves Rapidshare links together with the URL of the page where each link was found.
In other words, I'm going to build a website similar to filestube.com.
After some searching, I found that Scrapy works with Django. I also tried to find information about integrating Nutch with Django, but found nothing.
I hope you can give me some suggestions for building this kind of website, especially the crawler.
The best-known pluggable app for this is Django-Haystack, which lets you connect to several search backends.
Haystack gives you an API that looks like Django's own QuerySet syntax for querying these search engines, each of which has its own API and dialect.
If you're just after scraping tools, then whichever one you use (BeautifulSoup or Scrapy), you'll be on your own: you write the Python code that parses what you want to parse and then populates your Django models.
This can even live in separate Python scripts, exposed as custom management commands (modules in your app's management/commands/ package).
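As a rough illustration of the "parse, then populate" step, here is a minimal sketch that pulls Rapidshare links out of one page and pairs each with the URL it was found on. It uses only the standard library so it runs without dependencies; in practice BeautifulSoup or Scrapy would do the parsing, and each pair would be saved to a Django model. The names `RapidshareLinkParser` and `extract_rapidshare_links` and the sample URLs are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class RapidshareLinkParser(HTMLParser):
    """Collect <a href> values that point at rapidshare.com (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        absolute = urljoin(self.page_url, href)
        host = urlparse(absolute).netloc.lower()
        if host == "rapidshare.com" or host.endswith(".rapidshare.com"):
            self.links.append(absolute)


def extract_rapidshare_links(page_url, html):
    """Return (file_url, source_url) pairs found in one page."""
    parser = RapidshareLinkParser(page_url)
    parser.feed(html)
    return [(link, page_url) for link in parser.links]


# Made-up sample page; a real crawler would fetch the HTML first.
sample_html = """
<a href="http://rapidshare.com/files/123/example.zip">download</a>
<a href="/about">about</a>
<a href="http://example.org/other">elsewhere</a>
"""
pairs = extract_rapidshare_links("http://example.com/post/1", sample_html)
print(pairs)
```

Each returned pair maps naturally onto one row of a model with a file URL field and a source URL field.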
If you have a lot of files to search, you will probably need an index that is rebuilt frequently and allows fast searches without hitting the Django ORM.
Using a Solr index, for example, lets you create additional fields on the fly, such as virtual fields derived from your real model's fields (e.g. splitting an author's first and last name, or adding an uppercased file title field).
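To make the idea of virtual fields concrete, here is a small sketch of how an index document might be derived from a stored record. It does not talk to Solr; it only shows the shape of the transformation. The function name and all field names are assumptions for illustration. With Haystack, this kind of derivation would typically live in `prepare_<field>` methods on a search index class.

```python
def build_index_document(record):
    """Turn one stored record into a flat document for a search index.

    The record layout (id, title, author) is hypothetical.
    """
    # Split on the first space only; a single-word name yields an empty last name.
    first, _, last = record["author"].partition(" ")
    return {
        "id": record["id"],
        "title": record["title"],
        "author": record["author"],
        # Derived ("virtual") fields that exist only in the index:
        "author_firstname": first,
        "author_lastname": last,
        "title_upper": record["title"].upper(),
    }


doc = build_index_document({"id": 1, "title": "Example file", "author": "Jane Doe"})
print(doc)
```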
Of course, if you don't need fast indexing, keyword boosting, or semantic analysis, you can still do a classic full-text search over a couple of Django model fields.
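In Django, a search like this is usually written with `Q` objects and `icontains` lookups, for example `Q(title__icontains=term) | Q(source_url__icontains=term)`. The sketch below shows the same idea with the standard-library `sqlite3` module so that it runs without a Django project. The table name, column names, and sample rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for the table behind a Django model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_link (id INTEGER PRIMARY KEY, title TEXT, source_url TEXT)"
)
conn.executemany(
    "INSERT INTO file_link (title, source_url) VALUES (?, ?)",
    [
        ("Ubuntu ISO", "http://example.com/linux"),
        ("Holiday photos", "http://example.com/ubuntu-party"),
        ("Lecture notes", "http://example.com/school"),
    ],
)


def search(conn, term):
    """Return titles of rows whose title or source URL contains the term.

    SQLite's LIKE is case-insensitive for ASCII by default. Wildcard
    characters (% and _) inside the term are not escaped in this sketch.
    """
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT title FROM file_link "
        "WHERE title LIKE ? OR source_url LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [title for (title,) in rows]


results = search(conn, "ubuntu")
print(results)
```

A substring scan like this reads every row, which is fine for small tables but is the reason a dedicated index becomes attractive as the data grows.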
Have you looked at DjangoItem? It's an experimental Scrapy feature, but it's known to work.