Normalize the database or not? Read-only MyISAM table, performance is the main priority (MySQL)
I'm importing data into a future database that will have a single static MyISAM table (it will only be read from). I chose MyISAM because, as far as I understand, it's faster for my requirements (I'm not very experienced with MySQL / SQL at all).
That table will have various columns such as ID, Name, Gender, Phone, Status... and Country, City, Street columns. Now the question is, should I create tables (e.g. Country: Country_ID, Country_Name) for the last 3 columns and refer to them in the main table by ID (normalize...[?]), or just store them as VARCHAR in the main table (having duplicates, obviously)?
My primary concern is speed. Since the table won't be written to, data integrity is not a priority. The only actions will be selecting a specific row or searching for rows that match certain criteria.
Would searching by the Country, City and/or Street columns (and possibly other columns in the same search) be faster if I simply use VARCHAR?
EDIT: The table has about 30 columns and about 10 million rows.
Comments (3)
It can be faster to search if you normalize, as the database will only have to compare an integer instead of a string. The table data will also be smaller, which makes it faster to search as more can be loaded into memory at once.
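As a rough illustration of the normalized layout, here is a minimal sketch. The table names, column names, and column sizes (`country`, `city`, `person`, and so on) are assumptions made up for this example, not taken from the question, and `street` is left as a plain VARCHAR for brevity.

```sql
-- Hypothetical lookup tables; every name and size here is illustrative.
CREATE TABLE country (
    country_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    country_name VARCHAR(64) NOT NULL,
    PRIMARY KEY (country_id),
    UNIQUE KEY uq_country_name (country_name)
) ENGINE=MyISAM;

CREATE TABLE city (
    city_id    MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    country_id SMALLINT UNSIGNED NOT NULL,
    city_name  VARCHAR(64) NOT NULL,
    PRIMARY KEY (city_id),
    KEY idx_city_name (city_name)
) ENGINE=MyISAM;

-- Main table: stores a small integer key instead of repeated strings.
CREATE TABLE person (
    id      INT UNSIGNED NOT NULL,
    name    VARCHAR(100) NOT NULL,
    city_id MEDIUMINT UNSIGNED NOT NULL,
    street  VARCHAR(100) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_person_city (city_id)
) ENGINE=MyISAM;

-- The string comparison happens once in the small lookup tables;
-- the 10-million-row table is then matched on the integer city_id.
SELECT p.id, p.name, p.street
FROM person AS p
JOIN city    AS ci ON ci.city_id    = p.city_id
JOIN country AS co ON co.country_id = ci.country_id
WHERE co.country_name = 'Canada'
  AND ci.city_name    = 'Toronto';
```

Whether this actually beats the flat layout depends on the data and the queries, so it is worth measuring both on a realistic sample.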
If your tables are indexed correctly then it will be very fast either way - you probably won't notice a significant difference.
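For comparison, a flat (denormalized) table can also be indexed for these searches. The sketch below assumes a hypothetical `person_flat` table and a composite index; the names and sizes are illustrative only.

```sql
-- Hypothetical flat table: the location strings live in the main table.
CREATE TABLE person_flat (
    id      INT UNSIGNED NOT NULL,
    name    VARCHAR(100) NOT NULL,
    country VARCHAR(64)  NOT NULL,
    city    VARCHAR(64)  NOT NULL,
    street  VARCHAR(100) NOT NULL,
    PRIMARY KEY (id),
    -- Composite index: usable for country, country+city,
    -- or country+city+street (leftmost-prefix rule).
    KEY idx_country_city_street (country, city, street)
) ENGINE=MyISAM;

-- Can use the composite index because it filters on the leading columns.
SELECT id, name
FROM person_flat
WHERE country = 'Canada'
  AND city    = 'Toronto';
```

A search on `city` alone would not be able to use this particular index, so the column order (or an additional index) should follow the searches that are actually run.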
You might also want to look at a full text search if you find yourself writing LIKE '%foo%', as the latter won't be able to use an index and will result in a full table scan.
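To make the contrast concrete, here is a small sketch using MyISAM's FULLTEXT index. The `person_ft` table and its columns are hypothetical. Note that full-text search matches words (or word prefixes in boolean mode), not arbitrary substrings, and it is subject to settings such as the minimum indexed word length, so it is not a drop-in replacement for every LIKE pattern.

```sql
-- Hypothetical table with a FULLTEXT index on the street column.
CREATE TABLE person_ft (
    id     INT UNSIGNED NOT NULL,
    name   VARCHAR(100) NOT NULL,
    street VARCHAR(100) NOT NULL,
    PRIMARY KEY (id),
    FULLTEXT KEY ft_street (street)
) ENGINE=MyISAM;

-- Leading wildcard: a regular index cannot be used, so every row is examined.
SELECT id, name
FROM person_ft
WHERE street LIKE '%maple%';

-- Full-text alternative: matches rows containing a word starting with "maple".
SELECT id, name
FROM person_ft
WHERE MATCH(street) AGAINST('+maple*' IN BOOLEAN MODE);
```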
I'll try to give you something more than the usual "It Depends" answer.
#1 - Everything is fast for small N - if you have fewer than 100,000 rows, just load it flat, index it as you need to and move on to something higher priority.
Keeping everything flat in one table is faster for reading everything (all columns), but to seek or search into it you usually need indexes. If your data is very large, with redundant City and Country information, it might be better to have surrogate foreign keys into separate tables, but there is no hard-and-fast rule.
This is why some kind of data modeling principle is almost always used, either traditional normalized (e.g. Entity-Relationship) or dimensional (e.g. Kimball). The rules or methodologies in both cases are designed to help you model the data without having to anticipate every use case. Obviously, knowing all the usage patterns will bias your data model towards supporting them, so a lot of aggregation and analysis is a strong indicator to use a denormalized dimensional model.
So it really depends a lot on your data profile (row width and row count) and usage patterns.
I don't have much more than the usual "It Depends" answer, unfortunately.
Go with as much normalization as you need for the searches you actually do. If you never actually search for people who live on Elm Street in Sacramento or on Maple Avenue in Denver, any effort to normalize those columns is pretty much wasted. Ordinarily you would normalize something like that to avoid update errors, but you've stated that data integrity is not a risk.
Watch your slow query log like a hawk! That will tell you what you need to normalize. Do EXPLAIN on those queries and determine whether you can add an index to improve them or whether you need to normalize.
I've worked with some data models that we would call "hyper-normalized." They were in all the proper normal forms, but often for things that just didn't need it for how we used the data. Those kinds of data models are difficult to understand with a casual glance, and they can be very annoying.
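The slow-query-log and EXPLAIN workflow mentioned above might look roughly like this. The `person_demo` table, the index name, and the one-second threshold are assumptions chosen for illustration; changing global variables also requires the appropriate privileges.

```sql
-- Turn on the slow query log and log statements slower than 1 second
-- (the threshold is an arbitrary example value).
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- Hypothetical table standing in for the real one.
CREATE TABLE person_demo (
    id     INT UNSIGNED NOT NULL,
    city   VARCHAR(64)  NOT NULL,
    street VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=MyISAM;

-- Inspect the plan of a query taken from the slow log.
-- On a populated table, type = ALL with key = NULL indicates a full scan.
EXPLAIN
SELECT id FROM person_demo
WHERE city = 'Denver' AND street = 'Maple Avenue';

-- Try an index first, then re-check the plan before restructuring tables.
ALTER TABLE person_demo ADD INDEX idx_city_street (city, street);

EXPLAIN
SELECT id FROM person_demo
WHERE city = 'Denver' AND street = 'Maple Avenue';
```

If the second plan shows the new index under `key` and far fewer estimated rows, the index may be all that is needed; normalization is the next step only if it is not.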