What is the most efficient data type for storing categorical string variables in MySQL?
I have a table with about 50k rows and multiple columns. Some columns have the data type VARCHAR but store a limited set of distinct values: categorical strings.
I'm having some performance issues with this table, so I'm refactoring the data types. I did my research and found that SET and ENUM are no better than VARCHAR, since there will be a lookup table overhead.
What should I do?
Comments (1)
I guess by "categorical" you mean those columns have a "controlled vocabulary" – a limited set of possible values.
Here are some things you can do to make this table serve you more efficiently. You don't have to do them all. I list them in order of difficulty (difficulty for me, at any rate).
Put indexes on the column or columns you will use in WHERE clauses when querying. Doing this is very likely to solve your performance issues: 50k rows is not tiny, but it is small.
Good index choices are an art. Check out https://use-the-index-luke.com for an introduction. Or, ask another question here if you have performance problems with certain queries.
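As a minimal sketch of the indexing step, assuming a hypothetical table items whose status column holds one of the categorical strings (neither name comes from the question), it could look something like this:

```sql
-- Hypothetical table and column names, for illustration only.
-- Index the categorical column that appears in WHERE clauses.
CREATE INDEX idx_items_status ON items (status);

-- Check whether the optimizer actually uses the index for a typical query.
EXPLAIN SELECT id FROM items WHERE status = 'active';
```

If the key column of the EXPLAIN output shows idx_items_status, the query is using the index. With only a handful of distinct values the optimizer may still prefer a full scan, so it is worth checking rather than assuming.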
If possible, and if necessary, declare those columns with COLLATE latin1_bin. That makes them shorter (latin1 stores one byte per character) and makes looking them up faster. This won't work if your categorical values are in Arabic or some other language that needs Unicode.
Make a new table. Maybe call it category, and give it an INT UNSIGNED column for category_id and a VARCHAR column for category_name. Then, in your main table, use INT UNSIGNED columns rather than VARCHAR columns: treat the new table as a lookup table, and the columns in your main table as numeric references to that table. This approach is often used in large (megarow) tables to save RAM and disk space, and to formalize the "controlled vocabulary" of your categories. But I suspect it may be overkill for your app.
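The collation change and the lookup table could be sketched roughly as follows. Every name here apart from category, category_id and category_name is hypothetical (items, status, the constraint names), and the sketch assumes both tables share the same character set and collation so the join on the name column compares cleanly:

```sql
-- Hypothetical names: items, status, uq_category_name, fk_items_category.

-- Option A: keep the VARCHAR but make it single-byte with binary comparison.
-- Only safe when every stored value fits in latin1.
ALTER TABLE items
  MODIFY status VARCHAR(32) CHARACTER SET latin1 COLLATE latin1_bin;

-- Option B: move the vocabulary into a lookup table.
CREATE TABLE category (
  category_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  category_name VARCHAR(64)  NOT NULL,
  UNIQUE KEY uq_category_name (category_name)
);

-- Seed the lookup table from the values already in the main table.
INSERT INTO category (category_name)
SELECT DISTINCT category FROM items WHERE category IS NOT NULL;

-- Add the numeric reference column and fill it in.
ALTER TABLE items ADD COLUMN category_id INT UNSIGNED NULL;

UPDATE items AS i
JOIN category AS c ON c.category_name = i.category
SET i.category_id = c.category_id;

-- Once the application reads category_id, drop the old string column
-- and enforce the reference with a foreign key.
ALTER TABLE items
  DROP COLUMN category,
  ADD CONSTRAINT fk_items_category
    FOREIGN KEY (category_id) REFERENCES category (category_id);
```

Queries that need the readable name then join items to category on category_id. Try this on a copy of the table first; the final ALTER is not something you can easily undo.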
Your conclusions about SETs and ENUMs match my experience. Plus, adding values to ENUMs in a production database can be a shockingly expensive operation.
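If you do end up with an ENUM, one way to guard against that cost is to ask for an in-place change explicitly, so MySQL raises an error instead of quietly copying the whole table. This is only a sketch: the priority column is hypothetical and assumed to be currently defined as ENUM('low', 'medium', 'high') NOT NULL on an InnoDB table.

```sql
-- Hypothetical ENUM column, for illustration only.
-- Append a member at the end of the list; the full definition must be repeated.
ALTER TABLE items
  MODIFY priority ENUM('low', 'medium', 'high', 'urgent') NOT NULL,
  ALGORITHM=INPLACE, LOCK=NONE;
```

As far as I know, appending at the end of the list usually qualifies for an in-place change as long as the column's storage size stays the same, while inserting a value in the middle or reordering the list forces a table rebuild.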