How do you evaluate a search engine?
I am a student carrying out a study to enhance a search engine's existing algorithm.
I want to know how I can evaluate the improved search engine, so that I can quantify how much the algorithm was improved.
How should I go about comparing the old and new algorithms?
Thanks
This is normally done by creating a test suite of questions and then evaluating how well the search responses answer those questions. In some cases the responses should be unambiguous (if you type slashdot into a search engine you expect to get slashdot.org as your top hit), so you can think of these as a class of hard queries with 'correct' answers.
Most other queries are inherently subjective. To minimise bias you should ask multiple users to try your search engine and rate the results for comparison with the original. Here is an example of a computer science paper that does something similar:
http://www.cs.uic.edu/~liub/searchEval/SearchEngineEvaluation.htm
Regarding specific comparison of the algorithms, although it may seem obvious, what you measure depends on what you are interested in knowing. For example, you can compare computational efficiency, memory usage, crawling overhead, or time to return results. If you are trying to produce very specific behaviour, such as running specialist searches (e.g. a literature search) with certain parameters, then you need to test this explicitly.
Heuristics for relevance are also a useful check. For example, when someone uses search terms that are probably 'programming-related', do you tend to get more results from stackoverflow.com? Would your search results be better if you did? If you are providing a set of trust weightings for specific sites or domains (e.g. rating .edu or .ac.uk domains as more trustworthy for technical results), then you need to test the effectiveness of these weightings.
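The hard-query idea above can be sketched as a small test harness: a suite of navigational queries, each with one known "correct" top hit. The `old_search` and `new_search` functions below are hypothetical stand-ins for the two rankers being compared; each is assumed to return an ordered list of result URLs for a query.

```python
# Hypothetical navigational queries with an unambiguous "correct" answer.
HARD_QUERIES = {
    "slashdot": "slashdot.org",
    "stack overflow": "stackoverflow.com",
}

def top_hit_accuracy(search_fn, queries):
    """Fraction of queries whose known-correct URL is the #1 result."""
    hits = sum(1 for q, url in queries.items() if search_fn(q)[:1] == [url])
    return hits / len(queries)

# Stand-in rankers, purely for illustration:
old_search = lambda q: ["example.com", HARD_QUERIES[q]]  # correct URL ranked 2nd
new_search = lambda q: [HARD_QUERIES[q], "example.com"]  # correct URL ranked 1st
print(top_hit_accuracy(old_search, HARD_QUERIES))  # 0.0
print(top_hit_accuracy(new_search, HARD_QUERIES))  # 1.0
```

Subjective queries don't fit a harness like this; for those, collect ratings from several users instead, as described above.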
First, let me start out by saying: kudos to you for attempting to apply traditional research methods to search engine results. Many SEOs have done this before you, and they generally keep it to themselves, since sharing "amazing findings" usually means you can no longer exploit them or keep the upper hand. That said, I will share some pointers and things to look for as best I can.
Different searches execute different algorithms.
Broad Searches
For instance, in a broad-term search, engines tend to return a variety of result types, and which of them are thrown into the mix can vary based on the word.
Example: Cats returns images of cats and news; Shoes returns local shopping for shoes. (This is based on my Chicago IP on October 6th.)
The goal in returning results for a broad term is to provide a little bit of everything for everyone so that everyone is happy.
Regional Modifiers
Generally, any time a regional term is attached to a search, it modifies the results greatly. If you search for "Chicago web design", because the word Chicago is attached, the results will start with a block of top-10 regional results (these are the one-liners to the right of the map); after that, listings display in the general "result fashion".
The results in the "top ten local" tend to be drastically different from those in the organic listings below. This is because the local results (from Google Maps) rely on entirely different data for ranking.
Example: Having a phone number on your website with a Chicago area code will help in local results... but NOT in the general results. Same with addresses, yellow-pages listings, and so forth.
Results Speed
Currently (as of 10/06/09) Google is beta testing "Caffeine". The main highlight of this engine build is that it returns results in almost half the time. Although you may not consider Google slow now... speeding up an algorithm is important when millions of searches happen every hour.
Reducing Spam Listings
We have all experienced a search that was riddled with spam. Over the last 10+ years, one of the largest battles online has been between search engine optimizers and search engines. Gaming Google (and other engines) is highly profitable, and it is what Google spends most of its time combating.
A good example is the new release of Google Caffeine (http://www2.sandbox.google.com/). So far, my research and that of a few others in the SEO field finds this to be the first build in over 5 years to put more weight on on-site elements (such as keywords, internal site linking, etc.) than prior builds. Before this, each "release" seemed to favor inbound links more and more... this is the first to take a step back towards "content".
Ways to test an algorithm.
Compare two builds of the same engine. This is currently possible by comparing Caffeine (see the link above, or search for "google caffeine") and the current Google.
Compare local results in different regions. Try to find search terms, like web design, that return local results without a local keyword modifier. Then use a proxy (found via Google) to search from various locations. You will want to make sure you know the proxy's location (find a site on Google that tells you your IP address's geo-IP ZIP code or city). Then you can see how different regions return different results.
Warning... DON'T pick the term locksmith... and be wary of any terms that return LOTS of spammy listings. Google Local is fairly easy to spam, especially in competitive markets.
Do as mentioned in a prior answer: compare how many "click backs" users require to find a result. You should know that, currently, no major engine uses "bounce rates" as an indicator of a site's accuracy. This is PROBABLY because it would be EASY to make your result look like it has a bounce rate in the 4-8% range without actually having one that low... in other words, it would be easy to game.
Track how many search variations users need on average, for a given term, in order to find the desired result. This is a good indicator of how well an engine smart-guesses the query type (as mentioned WAY up in this answer).
Disclaimer: these views are based on my industry experience as of October 6th, 2009. One thing about SEO and engines is that they change EVERY DAY. Google could release Caffeine tomorrow, and that would change a lot... that said, this is the fun of SEO research!
Cheers
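The last two test ideas above (click-backs and query variations) can be approximated from session logs. A sketch, where each session is a hypothetical list of queries one user tried before finding what they wanted:

```python
# Hypothetical session logs: each inner list is the sequence of queries a
# user issued before clicking a satisfying result. Fewer reformulations on
# the new build suggests the engine is "smart guessing" intent better.
old_sessions = [
    ["cats", "cat pictures"],
    ["shoes"],
    ["web design", "web design chicago", "chicago web design"],
]
new_sessions = [
    ["cats"],
    ["shoes"],
    ["web design", "chicago web design"],
]

def avg_reformulations(sessions):
    """Average number of extra queries needed beyond the first attempt."""
    return sum(len(s) - 1 for s in sessions) / len(sessions)

print(avg_reformulations(old_sessions))  # 1.0
print(avg_reformulations(new_sessions))  # ~0.33
```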
In order to evaluate something, you have to define what you expect from it. This will help to define how to measure it.
Then, you'll be able to measure the improvement.
Concerning a search engine, I guess that you might be able to measure its ability to find things, and its accuracy in returning what is relevant.
It's an interesting challenge.
I don't think you will find a final mathematical solution if that is your goal. In order to rate a given algorithm, you require standards and goals that must be accomplished.
For example, if your goal is to improve the process of page ranking then decide if you are judging the efficiency of the algorithm or the accuracy. Judging efficiency means that you time your code for a consistent large data set and record results. You would then work with your algorithm to improve the time.
If your goal is to improve accuracy then you need to define what is "inaccurate". If you search for "Cup" you can only say that the first site provided is the "best" if you yourself can accurately define what is the best answer for "Cup".
My suggestion for you would be to narrow the scope of your experiment. Define one or two qualities of a search engine that you feel need refinement and work towards improving them.
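The efficiency side of this could be sketched as below, assuming your ranking routine can be called on a fixed data set. The `old_rank`/`new_rank` functions here are placeholders, not real ranking code:

```python
import time

def time_algorithm(fn, dataset, repeats=5):
    """Best-of-N wall-clock time for fn(dataset); best-of reduces noise
    from other processes running on the machine."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(dataset)
        best = min(best, time.perf_counter() - start)
    return best

# A consistent, large-ish data set, and two stand-in "algorithms":
dataset = list(range(100_000))
old_rank = lambda d: sorted(d, key=lambda x: -x)  # placeholder for old code
new_rank = lambda d: sorted(d, reverse=True)      # placeholder for new code
print(f"old: {time_algorithm(old_rank, dataset):.4f}s")
print(f"new: {time_algorithm(new_rank, dataset):.4f}s")
```

The key point is that both versions must be timed on the same data set, or the comparison says nothing.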
In the comments you've said: "I have heard about a way to measure the quality of search engines by counting how many times a user needs to click the back button before finding the link he wants, but I can't use this technique because you need users to test your search engine, and that is a headache in itself". Well, if you put your engine on the web for free for a few days and advertise a little, you will probably get at least a couple dozen tries. Serve these users the old or new version at random, and measure those clicks.
Other possibility: assume Google is by definition perfect, and compare your answers to its answers for certain queries. (Maybe the sum of the distances of your top ten links to their counterparts at Google; for example, if your second link is Google's twelfth link, that is a distance of 10.) That's a huge assumption, but far easier to implement.
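That rank-distance idea could be sketched like this, with a hypothetical penalty for URLs the reference ranking does not contain at all:

```python
def rank_distance(mine, reference, missing_penalty=20):
    """Sum over my top-10 URLs of |rank in reference - my rank|.
    URLs absent from the reference ranking cost `missing_penalty`
    (an arbitrary choice here, not part of the original suggestion)."""
    total = 0
    for my_rank, url in enumerate(mine[:10], start=1):
        if url in reference:
            total += abs(reference.index(url) + 1 - my_rank)
        else:
            total += missing_penalty
    return total

mine = ["a.com", "b.com", "c.com"]
reference = ["b.com", "a.com", "x.com", "c.com"]  # stand-in for Google's list
print(rank_distance(mine, reference))  # |2-1| + |1-2| + |4-3| = 3
```

Lower totals mean your ranking tracks the reference more closely; comparing the old and new algorithm's totals over the same query set gives a single (if crude) number.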
Information scientists commonly use precision and recall as two competing measures of quality for an information retrieval system (like a search engine).
So you could measure your search engine's performance relative to Google's by, for example, counting the number of relevant results in the top 10 (call that precision) and the number of important pages for that query that you think should have been in the top 10 but weren't (call that recall).
You'll still need to compare the results from each search engine by hand on some set of queries, but at least you'll have one metric to evaluate them on. And the balance of these two is important too: otherwise you can trivially get perfect precision by not returning any results or perfect recall by returning every page on the web as a result.
The Wikipedia article on precision and recall is quite good (and defines the F-measure which takes into account both).
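Under those definitions, the arithmetic looks like the sketch below, assuming you have hand-judged the results yourself (there is no automatic oracle here):

```python
def precision_recall_f1(returned_relevant, returned_total, missed_relevant):
    """Precision, recall and F-measure from hand-judged counts:
    returned_relevant  - relevant results in your top N
    returned_total     - size of the result list judged (e.g. 10)
    missed_relevant    - relevant pages that should have been there but weren't
    """
    precision = returned_relevant / returned_total
    recall = returned_relevant / (returned_relevant + missed_relevant)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Say 7 of the top 10 were relevant, and 3 relevant pages missed the top 10:
p, r, f = precision_recall_f1(7, 10, 3)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.70 R=0.70 F1=0.70
```

Averaging these over a query set gives one number per algorithm to compare, and the F-measure keeps the precision/recall trade-off honest.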
I have had to test a search engine professionally. This is what I did.
The search included fuzzy logic. The user would type into a web page "Kari Trigger", and the search engine would retrieve entries like "Gary Trager", "Trager, C", "Corey Trager", etc, each with a score from 0->100 so that I could rank them from most likely to least likely.
First, I re-architected the code so that it could be executed apart from the web page, in a batch mode that used a big file of search queries as input. For each line in the input file, the batch mode would write out the top search result and its score. I harvested thousands of actual search queries from our production system and ran them through the batch setup to establish a baseline.
From then on, each time I modified the search logic, I would run the batch again and then diff the new results against the baseline. I also wrote tools to make it easier to see the interesting parts of the diff. For example, I didn't really care if the old logic returned "Corey Trager" as an 82 and the new logic returned it as an 83, so my tools would filter those out.
I could not have accomplished as much by hand-crafting test cases. I just wouldn't have had the imagination and insight to have created good test data. The real world data was so much richer.
So, to recap:
1) Create a mechanism that lets you diff the results of running new logic versus the results of prior logic.
2) Test with lots of realistic data.
3) Create tools that help you work with the diff, filtering out the noise, enhancing the signal.
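Steps 1 and 3 together might look like this sketch, where `old_engine` and `new_engine` are placeholders that return a (top result, score) pair per query:

```python
def diff_runs(queries, old_engine, new_engine, threshold=5):
    """Run every query through both builds and keep only the interesting
    differences: a changed top result, or a score shift above `threshold`.
    Small score wobbles (e.g. 82 -> 83) are filtered out as noise."""
    interesting = []
    for q in queries:
        old_hit, old_score = old_engine(q)
        new_hit, new_score = new_engine(q)
        if old_hit != new_hit or abs(new_score - old_score) > threshold:
            interesting.append((q, old_hit, old_score, new_hit, new_score))
    return interesting

# Stand-in engines for illustration:
queries = ["kari trigger", "corey trager"]
old_engine = lambda q: ("Corey Trager", 82)
new_engine = lambda q: ("Corey Trager", 83)
print(diff_runs(queries, old_engine, new_engine))  # [] -- a 1-point shift is noise
```

In a real setup the query list would come from harvested production logs, and the baseline run would be saved to a file rather than recomputed.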
You have to clearly identify positive and negative qualities such as how fast one gets the answer they are seeking or how many "wrong" answers they get on the way there. Is it an improvement if the right answer is #5 but the results are returned 20 times faster? Things like that will be different for each application. The correct answer may be more important in a corporate knowledge base search but a fast answer may be needed for a phone support application.
Without such parameters, no test can be claimed a success.
Embrace the fact that the quality of search results are ultimately subjective. You should have multiple scoring algorithms for your comparison: The old one, the new one, and a few control groups (e.g. scoring by URI length or page size or some similarly intentionally broken concept). Now pick a bunch of queries that exercise your algorithms, say a hundred or so. Let's say you end up with 4 algorithms total. Make a 4x5 table, displaying the first 5 results of a query across each algorithm. (You could do top ten, but the first five are way more important.) Be sure to randomize which algorithm appears in each column. Then plop a human in front of this thing and have them pick which of the 4 result sets they like best. Repeat across your entire query set. Repeat for as many more humans as you can stand. This should give you a fair comparison based on total wins for each algorithm.
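That procedure could be sketched as follows; `pick_best` stands in for the human rater, and the algorithms here are dummy result generators (including an intentionally broken control):

```python
import random

def run_trial(query, algorithms, pick_best):
    """Show the top-5 lists from each algorithm in shuffled column order
    (so the rater can't tell which is which) and return the winner's name."""
    names = list(algorithms)
    random.shuffle(names)
    columns = [algorithms[n](query)[:5] for n in names]
    winner_index = pick_best(columns)  # in practice, a human picks a column
    return names[winner_index]

algorithms = {
    "old": lambda q: [f"old-{i}" for i in range(10)],
    "new": lambda q: [f"new-{i}" for i in range(10)],
    "by_url_length": lambda q: [f"ctl-{i}" for i in range(10)],  # control group
}

# Stand-in "rater" that always picks the first column shown; because columns
# are shuffled, this is effectively a random pick, so wins spread out evenly.
tally = {}
for _ in range(100):
    winner = run_trial("cats", algorithms, pick_best=lambda cols: 0)
    tally[winner] = tally.get(winner, 0) + 1
print(tally)
```

If the new algorithm cannot beat the intentionally broken control in total wins, that says more than any single metric would.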
http://www.bingandgoogle.com/
Create an app like this one that compares and extracts the data. Then run a test with 50 different things you need to look for, and compare against the results you want.