Real-time web search (in .NET)
How would you guys go about creating a "real-time" search engine on the .NET platform? Near real-time search of the web is so popular nowadays, and I was hoping you guys would help me brainstorm some ideas. I might try to build a prototype eventually, but mostly this is just "mental training".
The requirements are:
- .NET platform, IIS, MS SQL Server or Lucene.Net (file system)
- input data to be indexed are only keywords plus some meta information - no further processing required
- data are grouped by keywords and ordered by number of occurrences of the keywords
- no historic data are kept (data older than some fixed amount of time are discarded or moved to some other data store)
Not knowing much about the subject matter, this is what I've come up with so far:
Data are fed to the system through a web service. Since the data are already in the form of keywords, no further processing is performed. The web service saves the data to the database. A SELECT query is run at fixed intervals to return data (for example, we query the incoming data for the past hour and run the query every second). Grouping and sorting are performed in memory to offload SQL Server. Old data in the database are discarded every couple of minutes.
I'm not sure how SQL Server would handle that if many new rows were being added constantly.
Grouped and sorted data are then displayed.
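To make the "group and sort in memory" step concrete, here is a minimal sketch. It is written in Python only to keep it short; the same shape maps onto a dictionary plus a sort in .NET. The function name `top_keywords`, the `(keyword, received_at)` row shape, and the sample data are assumptions for illustration, not part of the design above.

```python
from collections import Counter
from datetime import datetime, timedelta

def top_keywords(rows, now, window=timedelta(hours=1), limit=10):
    """Group (keyword, received_at) rows by keyword and rank them by count.

    Only rows received within `window` before `now` are counted.
    """
    cutoff = now - window
    counts = Counter(
        keyword for keyword, received_at in rows if received_at >= cutoff
    )
    # most_common() returns (keyword, count) pairs, highest count first.
    return counts.most_common(limit)

# Hypothetical rows, as they might come back from the periodic SELECT.
now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    ("lucene", now - timedelta(minutes=5)),
    ("iis", now - timedelta(minutes=10)),
    ("lucene", now - timedelta(minutes=20)),
    ("sql", now - timedelta(hours=2)),  # older than the window, so ignored
]
print(top_keywords(rows, now))  # [('lucene', 2), ('iis', 1)]
```

Re-reading a full hour of rows every second repeats most of the work each time, so this only illustrates the grouping step, not a recommended polling strategy.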
I'm sure you guys have more experience and better ideas for this kind of thing.
Regards,
Ondrej
Comments (2)
From your description of the system, a bare-bones database schema might look like the following:
keyword
- id (primary key)
- keyword (unique)
input
- id (primary key)
- data (text)
input_keyword
- id (primary key)
- input_id (foreign key)
- keyword_id (foreign key)
- count (integer; the number of times keyword with id keyword_id appears in input with id input_id)
- expiration_date (timestamp; at regular intervals, all entries that have expired need to be deleted)
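A rough, runnable sketch of that schema, using Python's built-in `sqlite3` module purely so it can be tried without a server; on MS SQL Server the column types and the upsert syntax would differ. The helper names (`record`, `purge_expired`, `ranked_keywords`) and the choice to store `expiration_date` as an integer Unix timestamp are my assumptions, not something the schema above dictates.

```python
import sqlite3

SCHEMA = """
CREATE TABLE keyword (
    id INTEGER PRIMARY KEY,
    keyword TEXT NOT NULL UNIQUE
);
CREATE TABLE input (
    id INTEGER PRIMARY KEY,
    data TEXT
);
CREATE TABLE input_keyword (
    id INTEGER PRIMARY KEY,
    input_id INTEGER NOT NULL REFERENCES input(id),
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    count INTEGER NOT NULL,
    expiration_date INTEGER NOT NULL  -- Unix timestamp (assumption)
);
"""

def record(conn, data, keyword_counts, expires_at):
    """Store one input together with its per-keyword occurrence counts."""
    input_id = conn.execute(
        "INSERT INTO input (data) VALUES (?)", (data,)
    ).lastrowid
    for word, count in keyword_counts.items():
        # Create the keyword row only if it does not exist yet.
        conn.execute("INSERT OR IGNORE INTO keyword (keyword) VALUES (?)", (word,))
        (keyword_id,) = conn.execute(
            "SELECT id FROM keyword WHERE keyword = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT INTO input_keyword (input_id, keyword_id, count, expiration_date) "
            "VALUES (?, ?, ?, ?)",
            (input_id, keyword_id, count, expires_at),
        )

def purge_expired(conn, now):
    """Delete link rows whose expiration time has passed."""
    conn.execute("DELETE FROM input_keyword WHERE expiration_date <= ?", (now,))

def ranked_keywords(conn):
    """Return (keyword, total occurrences) pairs, most frequent first."""
    return conn.execute(
        "SELECT k.keyword, SUM(ik.count) AS total "
        "FROM input_keyword ik JOIN keyword k ON k.id = ik.keyword_id "
        "GROUP BY k.keyword ORDER BY total DESC, k.keyword"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record(conn, "first input", {"lucene": 2, "iis": 1}, expires_at=100)
record(conn, "second input", {"lucene": 1}, expires_at=200)
print(ranked_keywords(conn))  # [('lucene', 3), ('iis', 1)]
purge_expired(conn, now=150)
print(ranked_keywords(conn))  # [('lucene', 1)]
```

Note that `purge_expired` only removes `input_keyword` rows; orphaned `input` and `keyword` rows would need their own cleanup.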
Data operations would be as follows:
On a highly trafficked system, your database will be hit quite often. Since you are really only using the database for the convenience of performing SELECT operations across these tables, and since the data is very short-lived, you might be better off just using an in-memory data structure to replace the "keyword" and "input_keyword" tables to eliminate hits to disk. This may require more complex application code, but it may be worth it on a busy system.
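One possible shape for such an in-memory structure, as a hedged sketch: a queue of timestamped events plus a running count per keyword, so expiring old data only touches the oldest entries. The class name `KeywordWindow` and its methods are made up for illustration. It assumes events arrive in non-decreasing timestamp order and ignores thread safety, which a real web service would have to handle with locking or a concurrent collection.

```python
from collections import Counter, deque

class KeywordWindow:
    """Keeps keyword counts for events newer than `ttl` seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.events = deque()    # (timestamp, keyword), oldest first
        self.counts = Counter()  # keyword -> occurrences still in the window

    def add(self, keyword, timestamp):
        """Record one occurrence of `keyword` at `timestamp`."""
        self.events.append((timestamp, keyword))
        self.counts[keyword] += 1

    def expire(self, now):
        """Drop events at least `ttl` seconds old and decrement their counts."""
        while self.events and self.events[0][0] <= now - self.ttl:
            _, keyword = self.events.popleft()
            self.counts[keyword] -= 1
            if self.counts[keyword] == 0:
                del self.counts[keyword]

    def top(self, now, limit=10):
        """Return up to `limit` (keyword, count) pairs, highest count first."""
        self.expire(now)
        return self.counts.most_common(limit)

window = KeywordWindow(ttl=3600)
window.add("lucene", timestamp=0)
window.add("iis", timestamp=10)
window.add("lucene", timestamp=20)
print(window.top(now=30))    # [('lucene', 2), ('iis', 1)]
print(window.top(now=3605))  # the event at timestamp 0 has expired by now
```

The trade-off is the one already mentioned: nothing survives a process restart, and the application code has to do the bookkeeping the database used to do.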
This site is not really for brainstorming, or to help you design applications.
You may want to post this on http://answers.onstartups.com/ to see what requirements and suggestions people come up with, and whether a real-time web search makes any business sense.
But you would need to work out how you could be faster than Google.