SQL anagram efficiency and logic?
I have an SQL db with about 200,000 words. I need a query that solves a kind of anagram problem. The difference is that I need all the possible words that could be made with the input characters. For example, if you input ofdg, it should output the words do, go, and dog. Can you estimate how long a query like this would take? How can I make it faster and more efficient? Also, in general, how long does it take SQL to parse a 200,000-row database?
Comments (2)
To solve this problem, the first thing you need to do is reduce every word to what Scrabble players call an alphagram: all the letters in the word, but in alphabetical order. So do, go and dog make do, go and dgo. Of course, any given alphagram may correspond to more than one word, so, for example, the alphagram dgo corresponds to both the words dog and god.
The next thing you need to do is construct a table keyed on the alphagram plus a sequence number, with a single attribute field: the word.
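As a rough illustration of the two steps above, here is a minimal Python sketch that computes alphagrams and fills a lookup table. It uses SQLite only because it ships with Python; the table name `alphagram_words`, the column name `seq`, and the helper names are assumptions made for this example, not something the answer specifies.

```python
import sqlite3


def alphagram(word: str) -> str:
    """Return the letters of `word` in alphabetical order."""
    return "".join(sorted(word.lower()))


def build_lookup(conn: sqlite3.Connection, words) -> None:
    """Create and fill a table keyed on (alphagram, sequence number)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alphagram_words ("
        " alphagram TEXT NOT NULL,"
        " seq INTEGER NOT NULL,"
        " word TEXT NOT NULL,"
        " PRIMARY KEY (alphagram, seq))"
    )
    counters = {}  # how many words we have seen so far per alphagram
    rows = []
    for word in sorted(set(words)):
        key = alphagram(word)
        counters[key] = counters.get(key, 0) + 1
        rows.append((key, counters[key], word))
    conn.executemany("INSERT INTO alphagram_words VALUES (?, ?, ?)", rows)
    conn.commit()


# Tiny demo list; a real run would load the full 200,000-word list once.
conn = sqlite3.connect(":memory:")
build_lookup(conn, ["do", "go", "dog", "god"])
rows = conn.execute(
    "SELECT alphagram, seq, word FROM alphagram_words ORDER BY alphagram, seq"
).fetchall()
print(rows)
```

The composite primary key doubles as the index that later lookups by alphagram would use.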
Word lists tend to be static. For example, the two Scrabble word lists in the English-speaking world change about every 5 years or so. So you construct this lookup table beforehand. Building it is O(n) and it is a sunk cost. That is, you do it once and store it, so it is not counted against the cost of the query. You have to do this beforehand. It makes absolutely no sense to build such an index on the fly every time a query comes in.
You may be wondering "What is all this about Scrabble?" The answer is that your figure of 200,000 words falls neatly between the two approved tournament word lists in the English-speaking world. The US National Scrabble Association's Official Tournament and Club Word List (2006) contains 178,691 words, and the international list, maintained by the World English Scrabble Players' Association, contains 246,691.
When you get a query you reduce the supplied word to a bunch of alphagrams. Input odfg makes the alphagrams do, fo, go, df, dg, fg, dfo, dgo, fgo, dfg and dfgo (which is a pretty programming problem in pure SQL, so I have to assume there is a PHP or Python or JavaScript front-end that will do that for you). Then you do the lookup in the database. The cost of each lookup should be approximately O(log2 n), in other words pretty damn immediate. That sort of query is what relational databases are good at.
BTW, your example output is poor. Alphagram dfgo with what Scrabble players call 'build' (all possible subsets) makes do, od, of, go, dog, god and fog.
(I hate to have to do this rigmarole, but Hasbro's lawyers are touchy, so: Scrabble is a registered trademark owned in the USA by Hasbro, Inc.; in Canada by Hasbro Canada Corporation; and throughout the rest of the world by J. W. Spear & Sons, a Mattel Company.)
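The query step described above (expand the input into every sub-alphagram, then look those up) might be sketched as follows in Python. This is only an illustration: the table layout, the index name, the demo word list, and the function names `sub_alphagrams` and `find_words` are all assumptions, and SQLite stands in for whatever database is actually in use.

```python
import sqlite3
from itertools import combinations


def sub_alphagrams(letters: str, min_len: int = 2) -> list:
    """All distinct alphagrams of `min_len` or more letters drawn from `letters`."""
    pool = sorted(letters.lower())
    found = set()
    for size in range(min_len, len(pool) + 1):
        # Combinations of a sorted pool come out sorted, so each one
        # is already an alphagram; the set removes duplicates.
        for combo in combinations(pool, size):
            found.add("".join(combo))
    return sorted(found, key=lambda s: (len(s), s))


def find_words(conn: sqlite3.Connection, letters: str, chunk_size: int = 500) -> list:
    """Look up every word whose alphagram is a sub-alphagram of `letters`."""
    keys = sub_alphagrams(letters)
    words = []
    # Query in chunks to stay below SQLite's bound-parameter limit.
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        sql = f"SELECT word FROM alphagram_words WHERE alphagram IN ({placeholders})"
        words.extend(row[0] for row in conn.execute(sql, chunk))
    return sorted(words, key=lambda w: (len(w), w))


# Demo data only; "cat" is there to show that non-matching words are skipped.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alphagram_words (alphagram TEXT, word TEXT)")
conn.execute("CREATE INDEX idx_alphagram ON alphagram_words (alphagram)")
demo = ["do", "od", "of", "go", "dog", "god", "fog", "cat"]
conn.executemany(
    "INSERT INTO alphagram_words VALUES (?, ?)",
    [("".join(sorted(w)), w) for w in demo],
)

result = find_words(conn, "odfg")
print(result)
```

Each key in the `IN` list is an equality match on the indexed column, which is where the roughly logarithmic cost per lookup comes from.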
Well, the number of possible letter arrangements of a word of length n is n!. Obviously you have a few more options, as you want the shorter words as well, but that does not change the general O(n!) relationship much. So a simple algorithm that tries all combinations and looks them up in the database will have that as its complexity. Making the algorithm more efficient obviously means reducing the search space, for which there are a few options.
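To put rough numbers on that growth, the Python sketch below counts the candidates a brute-force search would try and compares them with one possible way of shrinking the search space: looking up unordered letter subsets, which assumes the words are stored under sorted-letter keys. The function names are made up for this example, and both counts assume all input letters are distinct and that only words of two or more letters matter.

```python
from math import comb, perm


def ordered_arrangements(n: int, min_len: int = 2) -> int:
    """Candidate strings when every ordering of every subset is tried."""
    return sum(perm(n, k) for k in range(min_len, n + 1))


def letter_subsets(n: int, min_len: int = 2) -> int:
    """Candidate keys when only unordered letter subsets are looked up."""
    return sum(comb(n, k) for k in range(min_len, n + 1))


for n in (4, 6, 8):
    print(n, ordered_arrangements(n), letter_subsets(n))
```

The first count grows factorially, the second only as 2^n, so the gap widens quickly with input length.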
How long it takes to look something up in a 200,000-row table depends on what kind of data is stored in there, in what format, and what indexes you have on that table.
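One way to see the effect of an index is to ask the database for its query plan before and after creating one. The sketch below does this in Python with SQLite's `EXPLAIN QUERY PLAN`; the table `words`, its columns, and the index name are assumptions for illustration. The exact plan wording differs between SQLite versions, and other databases have their own equivalents.

```python
import sqlite3


def plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return SQLite's query plan for `sql` as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (alphagram TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [("dgo", "dog"), ("dgo", "god"), ("do", "do")],
)

query = "SELECT word FROM words WHERE alphagram = ?"

# Without an index the plan is a scan of the whole table.
before = plan(conn, query, ("dgo",))

# With an index on the lookup column the plan becomes an index search.
conn.execute("CREATE INDEX idx_words_alphagram ON words (alphagram)")
after = plan(conn, query, ("dgo",))

print("before:", before)
print("after: ", after)
```

A full scan touches every row on each query, while an index search touches only a handful of index pages, which is the difference that matters at 200,000 rows.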