在 SQL 中按子字符串查找字符串的最快方法?
我有一个巨大的表,有两列:Id 和 Title。 Id 是 bigint,我可以自由选择 Title 列的类型:varchar、char、text 等。列标题包含随机文本字符串,例如“abcdefg”、“q”、“allyourbasebelongtous”,最多 255 个字符。
我的任务是通过给定的子字符串获取字符串。子串也有随机长度,可以是字符串的开头、中间或结尾。最明显的执行方法是:
SELECT * FROM t LIKE '%abc%'
我不关心 INSERT,我只需要进行快速选择。我该怎么做才能尽快执行搜索?
我使用 MS SQL Server 2008 R2,据我所知,全文搜索是没用的。
I have huge table with 2 columns: Id and Title. Id is bigint and I'm free to choose type of Title column: varchar, char, text, whatever. Column Title contains random text strings like "abcdefg", "q", "allyourbasebelongtous" with maximum of 255 chars.
My task is to get strings by given substring. Substrings also have random length and can be start, middle or end of strings. The most obvious way to perform it:
SELECT * FROM t LIKE '%abc%'
I don't care about INSERT, I need only to do fast selects. What can I do to perform search as fast as possible?
I use MS SQL Server 2008 R2, full text search will be useless, as far as I see.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
如果您不关心存储,那么您可以创建另一个包含部分标题条目的表,从每个子字符串开始(每个普通标题最多 255 个条目)。
这样,您可以对这些子字符串进行索引,并且仅匹配字符串的开头,应该会大大提高性能。
if you dont care about storage, then you can create another table with partial Title entries, beginning with each substring (up to 255 entries per normal title ).
in this way, you can index these substrings, and match only to the beginning of the string, should greatly improve performance.
如果您想使用比 Randy 的答案更少的空间,并且数据中有相当多的重复,您可以创建一个 N 叉树数据结构,其中每条边都是下一个字符,并将数据中的每个字符串和尾随子字符串挂在其上。
您按深度优先顺序对节点进行编号。然后,您可以为每条记录创建一个最多包含 255 行的表,其中包含记录的 ID 以及树中与字符串或尾随子字符串匹配的节点 ID。然后,当您进行搜索时,您会找到代表您正在搜索的字符串(以及所有尾随子字符串)的节点 ID,并进行范围搜索。
If you want to use less space than Randy's answer and there is considerable repetition in your data, you can create an N-Ary tree data structure where each edge is the next character and hang each string and trailing substring in your data on it.
You number the nodes in depth first order. Then you can create a table with up to 255 rows for each of your records, with the Id of your record, and the node id in your tree that matches the string or trailing substring. Then when you do a search, you find the node id that represents the string you are searching for (and all trailing substrings) and do a range search.
听起来你已经排除了所有好的选择。
您已经知道您的查询
不会使用索引,它每次都会进行全表扫描。
如果您确定该字符串位于字段的开头,则可以
使用 Title 上的索引。
您确定全文搜索对您没有帮助吗?
根据您的业务需求,我有时会使用以下逻辑:
LIKE 'abc%'
),这将使用索引。LIKE '%abc%'
)取决于您的需要,当然,但我在可以首先显示最简单和最常见的结果的情况下使用了它,只有在必要时才转向更困难的查询。
Sounds like you've ruled out all good alternatives.
You already know that your query
won't use an index, it will do a full table scan every time.
If you were sure that the string was at the beginning of the field, you could do
which would use an index on Title.
Are you sure full text search wouldn't help you here?
Depending on your business requirements, I've sometimes used the following logic:
LIKE 'abc%'
) first, which will use an index.LIKE '%abc%'
)Depends on what you need, of course, but I've used this in situations where I can show the easiest and most common results first, and only move on to the more difficult query when necessary.
您可以在表中添加另一个计算列:titleLength as len(title) PERSISTED。这将存储“标题”列的长度。对此创建索引。
另外,添加另一个名为:ReverseTitle 的计算列作为 Reverse(title) PERSISTED。
现在,当有人搜索关键字时,检查关键字的长度是否与标题长度相同。如果是这样,请执行“=”搜索。如果关键字的长度小于 titleLength 的长度,则执行 LIKE。但首先做一个标题,如“abc%”,然后做一个反向标题,如“cba%”。与 Brad 的方法类似 - 即仅在需要时才执行下一个困难查询。
另外,如果 80-20 规则适用于您的关键字/子字符串(即,如果大多数搜索都针对少数关键字),那么您还可以考虑进行某种缓存。例如:假设您发现许多用户搜索关键字“abc”,并且此关键字搜索返回 id 为 20、22、24、25 的记录 - 您可以将其存储在单独的表中并为其建立索引。
现在,当有人搜索新关键字时,首先查看此“缓存”表,看看之前的用户是否已经执行过该搜索。如果是这样,则无需再次查看主表。只需从“缓存”表返回结果即可。
您还可以将上述内容与 SQL Server TextSearch 结合起来。 (假设您有充分的理由不使用它)。但您仍然可以首先使用文本搜索来筛选结果集。然后使用 TExt Search 返回的 ID 作为参数以及关键字对表运行 SQL 查询以获得准确的结果。
所有这一切显然都是假设您必须使用 SQL。如果没有,您可以探索 Apache Solr 之类的东西。
You can add another calculated column on the table: titleLength as len(title) PERSISTED. This would store the length of the "title" column. Create an index on this.
Also, add another calculated column called: ReverseTitle as Reverse(title) PERSISTED.
Now when someone searches for a keyword, check if the length of keyword is same as titlelength. If so, do a "=" search. If length of keyword is less than the length of the titleLength, then do a LIKE. But first do a title LIKE 'abc%', then do a reverseTitle LIKE 'cba%'. Similar to Brad's approach - ie you do the next difficult query only if required.
Also, if the 80-20 rules applies to your keywords/ substrings (ie if most of the searches are on a minority of the keywords), then you can also consider doing some sort of caching. For eg: say you find that many users search for the keyword "abc" and this keyword search returns records with ids 20, 22, 24, 25 - you can store this in a separate table and have this indexed.
And now when someone searches for a new keyword, first look in this "cache" table to see if the search was already performed by an earlier user. If so, no need to look again in main table. Simply return results from "cache" table.
You can also combine the above with SQL Server TextSearch. (assuming you have a valid reason not to use it). But you could nevertheless use Text search first to shortlist the result set. and then run a SQL query against your table to get exact results using the Ids returned by the TExt Search as a parameter along with your keyword.
All this is obviously assuming you have to use SQL. If not, you can explore something like Apache Solr.
创建索引视图 sql 中有一个新功能,可以在您需要搜索的列上创建索引,并在搜索后使用该视图,这将为您提供更快的结果。
Create index view there is new feature in sql create index on the column that you need to search and use that view after in your search that will give your more faster result.
由于数据的原因,字符集会影响搜索性能
内存和磁盘上的大小。瓶颈通常是 I/O。
您的字符字段而不是全文,这更快。不要
在 select 语句中选择不必要的列。
The charset influences the search performance because of the data
size on both ram and disk. The bottleneck is often I/O.
your char field rather than full text, which is faster. Do not
select unnecessary columns in your select statement.
做一件事,在特定列上使用主键以簇形式对其进行索引。
然后使用任何方法(通配符或=或任何)进行搜索,它将进行最佳搜索,因为表已经处于聚集形式,所以它知道他可以找到哪里(因为列已经处于排序形式)
Do one thing, use primary key on specific column & index it in cluster form.
Then search using any method (wild card or = or any), it will search optimally because the table is already in clustered form, so it knows where he can find (because column is already in sorted form)