I have stored about 7 million biological protein sequences in a TEXT field of a MySQL table (using the InnoDB storage engine and the latin1_swedish_ci collation).
The sequences stored in MySQL are simple strings of uppercase English letters, like this:
MSTWQVYRLLMEYCSCLDNKTPNAFAKWCSSRKIKFLQADYFRKRPKHCDEGTGRYRSIYVMKKEYLGDIVRKITN
A TEXT field looks essential because the sequences range from a minimum of 1 byte to an unknown maximum (the longest of the 7 million stored records is 23089 bytes, but future records will likely exceed that).
The maximum key size for VARCHAR or TEXT in MySQL is 767 bytes, which means that only the leftmost 767 bytes can be indexed.
The LIKE operator cannot use this index to efficiently retrieve a substring from the entire TEXT field.
So, is there any way to index the entire TEXT field to efficiently search for substrings inside it?
Comments (1)
You're hoping to store, and then search for substrings, in alphabetic protein sequences.
MySQL / MariaDB's search capabilities, both LIKE '%CSCLDNKTPNAFAKW%' and FULLTEXT, are not suitable for this application, sorry to say. Why not?
Searches with LIKE '%CSCLDN%' will be absurdly slow. (% in LIKE strings is the wildcard operator.) Because the pattern starts with a wildcard, the server cannot seek into an index and has to examine every row, so a prefix index on your column won't help make things faster.
FULLTEXT works on natural-language sequences of words, not the long unbroken strings of characters used to represent protein or DNA sequences.
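To make the point about the leading wildcard concrete, here is a minimal Python sketch. It is only an analogy, not how MySQL is implemented: a sorted list stands in for a B-tree index, and the function names and sample rows are made up for illustration.

```python
from bisect import bisect_left

# Toy stand-in for a B-tree index: the indexed values kept in sorted order.
# The sample sequences are hypothetical.
sequences = sorted([
    "MSTWQVYRLLMEYCSCLDNKTPNAFAKW",
    "CSCLDNKTPNAF",
    "AAKTPNAF",
    "MKKEYLGDIVRKITN",
])


def prefix_search(index, prefix):
    """Emulate LIKE 'prefix%': binary-search to the first candidate, then read neighbours."""
    start = bisect_left(index, prefix)
    hits = []
    for value in index[start:]:
        if not value.startswith(prefix):
            break  # sorted order guarantees no later value can match
        hits.append(value)
    return hits


def substring_search(index, needle):
    """Emulate LIKE '%needle%': sort order says nothing about interior characters,
    so every value has to be inspected."""
    return [value for value in index if needle in value]


print(prefix_search(sequences, "CSCLDN"))
print(substring_search(sequences, "CSCLDN"))
```

The prefix search touches only the matching neighbourhood of the sorted list, while the substring search has to visit all entries. With 7 million rows of long text, that full visit is what makes the leading-wildcard LIKE so slow.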
The PostgreSQL RDBMS has a feature called trigram indexes. When you use it you can search long TEXT objects with LIKE '%ACTG%'-style filters with decent performance. Trigram indexes come from the pg_trgm extension and are declared as GIN or GiST indexes on the column, using the gin_trgm_ops or gist_trgm_ops operator class. But before you do that you'll have to switch over to using PostgreSQL.
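For intuition about why a trigram index helps with substring search, here is a simplified Python sketch of the idea. It is not PostgreSQL's implementation (the real extension differs in details, for example it pads word boundaries), and the function names and sample rows are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character windows in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(rows):
    """Map each trigram to the set of row ids whose sequence contains it."""
    index = defaultdict(set)
    for row_id, seq in rows.items():
        for gram in trigrams(seq):
            index[gram].add(row_id)
    return index


def search(rows, index, needle):
    """Emulate LIKE '%needle%' using the trigram index as a pre-filter."""
    grams = trigrams(needle)
    if not grams:
        # Needles shorter than 3 characters yield no trigrams: fall back to a scan.
        candidates = set(rows)
    else:
        # Only rows containing every trigram of the needle can possibly match.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Candidates may be false positives (trigrams present but not contiguous),
    # so each one is rechecked against the actual sequence.
    return sorted(rid for rid in candidates if needle in rows[rid])


# Hypothetical sample rows keyed by row id.
rows = {
    1: "MSTWQVYRLLMEYCSCLDNKTPNAFAKW",
    2: "MKKEYLGDIVRKITN",
    3: "KTPNAFCSCLDN",
}
index = build_index(rows)
print(search(rows, index, "CSCLDNKTP"))
print(search(rows, index, "KTPNAF"))
```

The index narrows the search to a small candidate set, and only those candidates get the expensive substring check. Longer search strings contribute more trigrams and therefore filter better, which suits queries like the 15-letter pattern above.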