Enterprise search: has anyone developed with FAST ESP? What do you think?

Posted on 2024-07-12 10:02:12

I work for a Scandinavian yellow pages company. The company is looking at moving its bespoke search technology over to FAST ESP.

Like all big, expensive systems with relatively few installations, it is difficult to get feedback on the strengths and weaknesses of the system.

Are there any stackoverflowers who have experience with FAST ESP and want to share?

Comments (14)

友欢 2024-07-19 10:02:12

:) I am a search architect who has been developing and integrating search engine technology since 1997, starting in my days as a Lycos software engineer.

We use FAST ESP as the search engine that powers http://thomasnet.com. I've been working with ESP since 2003 (then known as FDS 3.2).

FAST ESP is extremely flexible and can deal with indexing many document types (html, pdf, word, etc). It has a very robust crawler for web documents and you can use their intermediary FastXML format to load custom document formats into the system or use their Content APIs.

One of my favorite parts of the engine is its Document Processing Pipeline which lets you make use of dozens of out-of-the-box processing plugins as well as using a Python API to write your own custom document processing stages. An example of a custom stage we wrote was one that looks at a website URL and tries to identify which company it belongs to so additional metadata can be attached to a web document.
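
As a rough illustration of the kind of custom stage described above, here is a minimal sketch in plain Python. It does not use the actual ESP document processing API: the `process` function, the dict-shaped document, and the `COMPANY_BY_DOMAIN` table are all assumptions made up for this example. Only the idea (look at the URL, attach company metadata) comes from the description.

```python
from urllib.parse import urlparse

# Hypothetical lookup table; a real deployment would load this from a database.
COMPANY_BY_DOMAIN = {
    "example.com": "Example Industrial Supply",
    "acme.test": "Acme Valves",
}


def company_for_url(url):
    """Return the company that owns the URL's host, or None if it is unknown."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Try the full host first, then progressively shorter parent domains,
    # so that "www.example.com" still matches "example.com".
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in COMPANY_BY_DOMAIN:
            return COMPANY_BY_DOMAIN[candidate]
    return None


def process(document):
    """Attach a 'company' field to a dict-shaped document when the URL is recognized."""
    company = company_for_url(document.get("url", ""))
    if company is not None:
        document["company"] = company
    return document
```

Calling `process({"url": "http://www.example.com/products"})` adds a `company` field to the document, while a document with an unrecognized URL passes through unchanged.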

It has a very robust programming/integration SDK in several popular languages (C++/C#/Java) for adding content and performing queries as well as fetching system status and managing cluster services.

ESP has a query language called FAST Query Language (FQL) that is very robust and allows you to do basic Boolean searches (AND, OR, NOT) as well as phrase and term proximity searches. In addition to that, it has something called "scope search" which can be used to search document metadata (XML) that has a format that can vary from document to document.

In terms of performance, it scales fairly linearly. If you benchmark it on one machine, adding a second machine will generally double performance. You can run the system on one machine (only recommended for development), or many (for production). It is fault-tolerant (it can still serve some results if one of your load-balanced indices goes offline) and it has full fail-over support (one or more critical machines could die or be taken offline for maintenance and the system will continue to function properly).

So, it's very powerful. The documentation nowadays is pretty good. So, you ask, what are the downsides?

Well, if the data you need to make searchable has a format that changes frequently, that might be a pain. ESP has something called an "Index Profile", which is basically a config file it uses to determine which document fields are important and should be used for indexing. Everything fed into ESP is a "document", even if you're loading database table rows into it. Each document has several fields, typical fields being: title, body, keywords, headers, documentvectors, processingtime, etc. You can specify as many of your own custom fields as you wish.

If your content maintains mostly the same format (like web documents), it's not a big issue. But if you have to make big changes to which fields should be indexed and how they should be treated, you probably need to edit the Index Profile. Some changes to the index profile are "Hot Updates", meaning you can make the change without interrupting service. But some of the bigger changes are "Cold Updates", which require a full data refeed and reindexing before the change takes effect. Depending on the size of your dataset and how many machines are in your cluster, this operation could take hours or days. Cold Updates are a pain to schedule unless you have plenty of cash for extra hardware that you can bring online while your production systems are performing a cold update and reloading the data. Having to do that on production clusters more than once or twice a year requires a fair amount of planning to get right with minimal or zero downtime.

For your case, I doubt your data formats will change very frequently. If you need to make minor tweaks to it, you can add additional metadata to scope fields to side-step the need to do any full data reloads.

Most of the trouble you'll probably encounter is the initial learning curve of using the product. Once you get a development cluster (or node) doing what you want, and if you don't have to make significant changes to indexed field configs frequently, it is a very stable and dependable search engine to use. For your application it sounds like a good match. For smaller companies or startups, there are open-source options that are not as expensive up front and that should suffice if you don't need as much performance or durability.

I hope you find this assessment helpful. :)

Sincerely,
Michael McIntosh
Senior Search Architect @ TnR Global

放肆 2024-07-19 10:02:12

During 2008-2009 I had a job at the Russian yellow pages (yell.ru) as a "Search Engine Engineer". My primary responsibility was to work with the FAST ESP system. I wrote and maintained custom document processors (custom stages) for our specific data, as well as some "glue" code for the data-pushing pipeline. As for FAST ESP, I have "mixed" feelings about it. Here are some downsides.

  1. It is an expensive product. Aside from the one-time initial payment, you must pay an annual (and noticeable) license fee or your server will stop working. Our mistake was to acquire a (relatively) low-cost license with a very limited "max requests per second" rate (10 queries per second maximum). Although we were told several times that this was just a "business limitation", it was actually a hard technical limit on the server's peak throughput. This peak limit ruined our performance, so we switched back to a temporary "evaluation" license that (surprisingly!) had no performance limitation at all (just a time limit).

  2. Documentation is good but not very deep in technical details. It is impossible to do anything really tricky just by reading the documentation; the details are simply not there. Once we were told that we needed to contact their "solution department" (and buy a "solution") because the task was not meant to be done by customers.

  3. Some parts are surprisingly tricky and buggy. Some examples: there were several problems with non-English symbols when loading custom dictionaries. Sometimes the system became slow and unresponsive if we loaded it with a bunch of phrases with custom boost values.

  4. There are some strange technical limits here and there. For example, we could have only 8 different boost values assigned to searchable fields.
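
Regarding the query-rate cap in point 1: before choosing a license tier it may help to measure what peak rate your current traffic actually reaches. Below is a minimal sketch, assuming you can extract query timestamps (in epoch seconds) from your existing search logs. The function name and the log format are hypothetical and have nothing to do with ESP itself.

```python
from collections import Counter


def peak_qps(timestamps):
    """Return the largest number of queries that fall into any one-second bucket.

    timestamps: iterable of query arrival times in epoch seconds (int or float).
    """
    # Bucket each query by the whole second it arrived in.
    buckets = Counter(int(t) for t in timestamps)
    if not buckets:
        return 0
    return max(buckets.values())
```

This counts queries per calendar second rather than over a sliding window, so it can slightly underestimate short bursts that straddle a second boundary. If the result is anywhere near a cap such as 10 queries per second, the cheaper license is probably a poor fit.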

In general, we had a tough time trying to meet our users' needs with FAST ESP as the underlying search engine for our site. Finally the system was replaced with another (open source) solution and I was fired ;-) End of story.

抽个烟儿 2024-07-19 10:02:12

The FAST ESP technology is solid, but you will want to bear in mind that it is really a search platform (hence "ESP"), not an out-of-the-box search experience. The quality of your results is directly related to the quality of your index, which means you really need to tune your document processing pipeline and index profile for your content.

There are no hard and fast rules for this; you really need to understand the platform and your content. It does take time and a lot of trial and error. Also, it is resource hungry so you cannot skimp on hardware. If you have the time and resources to do it right it will work great, but a halfway job will be no better (and possibly worse) than something out of the box or even Lucene.

情魔剑神 2024-07-19 10:02:12

@Michael McIntosh: To avoid a cold update you could add generic fields to the index. For example you add 5 generic integers, 5 strings and 5 dates. When you need to suddenly introduce a new integer you can use the "padding" you already have, for example igeneric1.

After a while you may want to do a cold update and then you consolidate these fields and give them proper names etc.
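
The bookkeeping for this "padding" trick can be sketched as follows. This is only an illustration in plain Python under stated assumptions: the `GenericFieldPool` class and the logical field names are hypothetical, and the slot naming merely follows the `igeneric1` example above. Declaring the generic fields in the index profile itself is not shown.

```python
class GenericFieldPool:
    """Tracks which pre-declared generic index fields are already in use."""

    def __init__(self, prefix, count):
        # e.g. prefix "i" and count 5 gives igeneric1 .. igeneric5
        self._free = [f"{prefix}generic{i}" for i in range(1, count + 1)]
        self._assigned = {}

    def assign(self, logical_name):
        """Reserve a spare generic field for a new logical field and return its name."""
        if logical_name in self._assigned:
            return self._assigned[logical_name]
        if not self._free:
            raise RuntimeError("no spare generic fields left; a cold update is needed")
        slot = self._free.pop(0)
        self._assigned[logical_name] = slot
        return slot

    def to_index_fields(self, record):
        """Rename logical field names in a record to their generic slots."""
        return {self._assigned.get(key, key): value for key, value in record.items()}
```

For example, after `pool = GenericFieldPool("i", 5)` and `pool.assign("employee_count")`, feeding code can call `pool.to_index_fields(record)` so that `employee_count` lands in `igeneric1`. The same mapping is what you would later use to rename fields during the consolidating cold update.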

心在旅行 2024-07-19 10:02:12

I've been supporting FAST ESP for some years now. Overall 4/5.

My experience to date is that the FAST ESP platform is rock solid, but the various connectors have some quirks.

The Lotus Notes connector is especially poor, and periodically breaks when more than 100,000 documents are indexed.

Other quirks can be fairly major, such as the File Traverser not reflecting a document update when a file's NTFS permissions are updated. It means everybody can see a document they shouldn't be able to see, which is a serious security problem.

I echo the sentiments of others here: FAST ESP is very good, but it's certainly not an 'out-of-box' solution. Expect to invest a good 3 to 12 months implementing it, but you'll be rewarded with a very powerful engine.

靖瑶 2024-07-19 10:02:12

We've implemented a large number of FAST ESP applications, and on all occasions ESP has proved to be a very stable, high-performance platform, as long as you invest in the relatively higher implementation costs upfront. With regard to the yellow pages question: we implemented and manage the largest online directories site in the US using ESP, and it handles huge QPS (queries per second). As mentioned by others, the key alternative technologies (for example Google and Solr/Lucene) are also very capable, and your choice really depends on tech/user requirements and budget.

葬花如无物 2024-07-19 10:02:12

Just commenting on a previous answer (I don't have enough reputation to reply directly on answers) about filetraverser not picking up ACL changes:

You can get the file traverser to pick up file permission changes by enabling the aclplugin

filetraverser -c <collection> -r <dir> -M -E -J $FASTSEARCH/etc/filetraverser/aclplugin.xml

The only annoying thing we found is that you can only use one plugin at a time, so you can't use both the acl and lazy plugins natively. We solved this by creating a custom plugin which just called both of them.
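
The wrapper idea is the usual composite pattern. The sketch below shows it in plain Python only to make the shape clear. It is not the filetraverser plugin interface: the `on_file` method, its arguments, and the `TagPlugin` stand-in are all hypothetical.

```python
class CompositePlugin:
    """Runs several plugins in order and presents them as a single plugin."""

    def __init__(self, *plugins):
        self._plugins = plugins

    def on_file(self, path, metadata):
        # Pass the metadata through each wrapped plugin in turn.
        for plugin in self._plugins:
            metadata = plugin.on_file(path, metadata)
        return metadata


class TagPlugin:
    """Stand-in for a real plugin: it only records that it ran."""

    def __init__(self, tag):
        self._tag = tag

    def on_file(self, path, metadata):
        updated = dict(metadata)
        updated["seen_by"] = metadata.get("seen_by", []) + [self._tag]
        return updated
```

With `combined = CompositePlugin(TagPlugin("acl"), TagPlugin("lazy"))`, the host only has to load `combined`, and each file still passes through both wrapped plugins in the order given.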

笑着哭最痛 2024-07-19 10:02:12

I worked on projects for the GE Search group for about 4 years. They use FAST ESP, and I found it very flexible and powerful, with good multilingual document handling. It allowed me to search documents in at least 20 different languages, including Japanese, Chinese, and Korean alongside Latin languages and Greek. I was also able to index documents from databases, crawlers (XML based), PDF, Office, etc., all mixed in a single search, with the search engine combining different collections of data. It also allowed me to create different pipelines for different kinds of content ingestion.

追风人 2024-07-19 10:02:12

I'm in the process of implementing FAST ESP for a few corporate intranet sites (large company). I've worked a little with search technology (Verity back in the late 90s).

Luckily, I took the FAST ESP developer courses before we really got started. The courses were really easy and if you're a quick study, you can probably just do the online classes. The biggest benefit in these for me was getting a heads-up on the API before the project started. After a quick look and a few programming labs using the API, I realized there was quite a bit that I would have to code.

I'm mostly disappointed in the API. FAST ESP was purchased by MS less than a year ago, so hopefully they'll get some help in cleaning up the .NET API. The .NET API feels like someone just clicked a button and made a COM wrapper to interface with the native Java servlets. The API naming conventions and methods are easy enough to orient yourself to (as long as you remember that all FAST ESP collections/arrays are 1-based instead of 0-based). However, I believe they could do a lot of work here. The Java API looked pretty much like all of the other Java APIs that I've seen and worked with. The naming conventions and structure look like a standard Java API, probably because FAST ESP is a Java-based search engine and their developers are Java software engineers and not .NET software engineers.
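
The 1-based indexing is an easy source of off-by-one bugs, so it can be worth hiding behind a small helper. A minimal sketch in Python (chosen only for brevity; the real API is .NET or Java): `OneBasedResults`, `count()`, and `item()` are hypothetical stand-ins, not names from the FAST ESP API.

```python
class OneBasedResults:
    """Stand-in for a 1-based collection like the ones described above."""

    def __init__(self, items):
        self._items = list(items)

    def count(self):
        return len(self._items)

    def item(self, index):
        # Valid indexes run from 1 to count(), inclusive.
        if not 1 <= index <= len(self._items):
            raise IndexError(f"index {index} is outside 1..{len(self._items)}")
        return self._items[index - 1]


def iter_one_based(collection):
    """Yield every item of a 1-based collection in order."""
    for index in range(1, collection.count() + 1):
        yield collection.item(index)
```

Application code can then loop with `for hit in iter_one_based(results)` and never touch an index directly.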

At first, since I was using ASP.NET, I developed a set of web controls that mimic the MS SharePoint web controls functionality. In the classroom and all ASP.NET examples, everything was inline ASP.NET coding with no or very little "code-behind" coding. Yahoo! Developer Network has some nice design patterns for designing search interfaces, results, pagers, etc.

Overall, so far it works pretty well. We're still in the development phase and are going to start beta testing our site within the next few weeks. FQL (FAST Query Language) is a bit overcomplicated; our users will probably complain that the language isn't "Google-like" enough for them. If you search for some FQL PDF files, you'll be able to preview the language. You can also just use simple searches (all terms, any terms, etc.).

If there's anything specific you'd like to know, just ask and I'll try to get the information. We're using FAST ESP in a VM environment, which they say isn't supported, but it's working fine and the benchmark results are okay for us.

洒一地阳光 2024-07-19 10:02:12

FAST ESP is good, at least when compared to the Google Search Appliance. But then, which enterprise search engine to choose is entirely up to the requirements.

不可一世的女人 2024-07-19 10:02:12

@anand, you can use the FAST ESP .NET API. There are PDF documents, sample code, and API reference material with the install.

柠檬色的秋千 2024-07-19 10:02:12

@anand: You can choose between the .NET Content API or do everything via HTTP/XML and style the XML as you wish.

一向肩并 2024-07-19 10:02:12

Is there any material on writing advanced document processing plugins, e.g. doing custom information extraction from the content? I've heard it's done in Python, but there seems to be no material out there to learn how to actually do it.

且行且努力 2024-07-19 10:02:12

Besides FAST ESP, you have only two other possible options: Autonomy's IDOL platform (AFAIK) and Apache Solr.
