App Engine - easy text search

Posted on 2024-09-27 03:51:35


I was hoping to implement an easy but effective text search for App Engine that I could use until official text search capabilities for App Engine are released. I see there are libraries out there, but it's always a hassle to install something new. I'm wondering if this is a valid strategy:

1) Break each property that needs to be text-searchable into a set(list) of text fragments
2) Save record with these lists added
3) When searching, just use equality filters on the list properties

For example, if I had a record:

{
  firstName="Jon";
  lastName="Doe";
}

I could save a property like this:

{
  firstName="Jon";
  lastName="Doe";

  // not case sensitive:
  firstNameSearchable=["j","o","n","jo","on","jon"];
  lastNameSearchable=["d","o","e","do","oe","doe"];
}
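
As a rough illustration of step 1, here is a minimal Python sketch that builds the fragment list for one value. The function name `all_substrings` is made up for this example and is not part of any App Engine API. Note that a string of n characters has up to n(n+1)/2 distinct substrings, so the list grows quickly with length.

```python
def all_substrings(text):
    """Return every distinct lowercase substring of text, sorted."""
    lowered = text.lower()
    n = len(lowered)
    # Every start index i paired with every end index j > i.
    return sorted({lowered[i:j] for i in range(n) for j in range(i + 1, n + 1)})


first_name_searchable = all_substrings("Jon")
last_name_searchable = all_substrings("Doe")
```

The resulting lists would be stored on the record as extra list properties next to `firstName` and `lastName`.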

Then to search, I could do this and expect it to return the above record:

//pseudo-code:
SELECT person 
WHERE firstNameSearchable=="jo" AND
      lastNameSearchable=="oe"

Is this how text searches are implemented? How do you keep the index from getting out of control, especially if you have a paragraph or something? Is there some other compression strategy that is usually used? I suppose if I just want something simple, this might work, but it's nice to know the problems that I might run into.

Update:

Ok, so it turns out this concept is probably legitimate. This blog post also refers to it: http://googleappengine.blogspot.com/2010/04/making-your-app-searchable-using-self.html

Note: the source code in the blog post above does not work with the current version of Lucene. I installed the older version (2.9.3) as a quick fix, since Google is supposed to come out with their own text search for App Engine soon enough anyway.

The solution suggested in the response below is a nice quick fix, but due to Bigtable's limitations it only works if you are querying on one field, because you can only use inequality operators on one property in a query:

db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")

If you want to query on more than one property, you can save a list of index entries for each property and use equality filters instead. In my case, I'm using this for some auto-suggest functionality on small text fields, not actually searching for word and phrase matches in a document (you can use the blog post's implementation above for that). It turns out this is pretty simple and I don't really need a library for it. Also, I anticipate that if someone is searching for "Larry" they'll start by typing "La..." as opposed to starting in the middle of the word: "arry". So if the property is a person's name or something similar, the index only holds the substrings that start with the first letter, and the index for "Larry" would just be {"l", "la", "lar", "larr", "larry"}.
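
The prefix-only index for names could be built with something like the sketch below. The helper name `prefixes` is hypothetical and chosen only for this illustration.

```python
def prefixes(text):
    """Return every lowercase prefix of text, shortest first."""
    lowered = text.lower()
    return [lowered[:i] for i in range(1, len(lowered) + 1)]


first_name_search_index = prefixes("Larry")
```

This stores only n entries for an n-character value, rather than every substring.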

I did something different for data like phone numbers, where you may want to search starting from the beginning or from the middle digits. In this case, I stored every substring of length 3 or more, so the phone number "123-456-7890" would be: {"123", "234", "345", ... "123456789", "234567890", "1234567890"}, a total of (10*(10+1)/2)-(10+9) = 36 index entries. Actually, what I did was a little more complex in order to remove some substrings that were unlikely to be used, but you get the idea.
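
A minimal sketch of the phone-number variant, assuming the separators are stripped before indexing. The function name `digit_substrings` and its `min_len` parameter are invented for this example, and the extra pruning of unlikely substrings mentioned above is not reproduced.

```python
def digit_substrings(phone, min_len=3):
    """Return all substrings of the digits in phone with at least min_len digits."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    n = len(digits)
    # Start at i, end at j, with j - i >= min_len.
    return {digits[i:j] for i in range(n) for j in range(i + min_len, n + 1)}


phone_number_search_index = digit_substrings("123-456-7890")
entry_count = len(phone_number_search_index)
```

For ten digits, lengths 3 through 10 contribute 8 + 7 + ... + 1 = 36 entries, fewer if the number contains repeated runs.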

Then your query would be:
(pseudo-code)
SELECT * FROM Person WHERE
firstNameSearchIndex == "lar" AND
phonenumberSearchIndex == "1234"

The way App Engine works is that an equality filter on a list property matches if the query value equals any one of the values stored in that list.
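
The matching rule can be imitated in plain Python to see which searches would hit. This is only a local simulation with made-up data, not datastore code; `matches` is a hypothetical helper.

```python
def matches(entity, filters):
    """True if every filter value appears in the entity's list for that property."""
    return all(value in entity.get(prop, []) for prop, value in filters.items())


person = {
    "firstNameSearchIndex": ["l", "la", "lar", "larr", "larry"],
    "phonenumberSearchIndex": ["123", "1234", "234", "2345"],
}

hit = matches(person, {"firstNameSearchIndex": "lar",
                       "phonenumberSearchIndex": "1234"})
# "arry" is not a stored prefix, so a mid-word search finds nothing.
miss = matches(person, {"firstNameSearchIndex": "arry"})
```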


Comments (1)

憧憬巴黎街头的黎明 2024-10-04 03:51:35


In practice, this won't scale. A string of n characters has up to n(n+1)/2 distinct substrings, so a 500-character string could need around 125,000 index entries to capture all of them. That is far more than is practical to write for a single entity.

Implementations like search.SearchableModel create one index entry per word, which is a bit more realistic. You can't search for arbitrary substrings, but there is a trick that lets you match prefixes:

From the docs:

db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2",
            "abc", u"abc" + u"\ufffd")

This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
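
To see why the range trick works, the same bounds can be applied to an ordinary sorted Python list. This is a local sketch using the standard `bisect` module, not datastore code; `prefix_range` and the sample values are made up for the illustration.

```python
import bisect


def prefix_range(sorted_values, prefix):
    """Return the values v with prefix <= v < prefix + u"\ufffd"."""
    lo = bisect.bisect_left(sorted_values, prefix)
    hi = bisect.bisect_left(sorted_values, prefix + u"\ufffd")
    return sorted_values[lo:hi]


values = sorted(["ab", "abc", "abcd", "abd", "abcz", "b"])
found = prefix_range(values, "abc")
```

Because the values are sorted, everything that begins with the prefix sits in one contiguous slice, which is what lets a single pair of inequality filters on one property select it.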
