SQL - moving to vertical records
I have a table (2 million rows) in Informix v11.10, replicated (50+ node) environment
Basic layout is like so:
ID (PK) (int)
division (int)
company (int)
feature1 char(20)
feature2 int
...
feature200 char(2)
There are several issues I have with the current layout:
There are 200 "features" associated with this record but maybe 5-10 of them at any given time are not default/null (different for each record).
An update to all records for a company would sometimes mean updating 100k rows which chokes replication and isn't easy to manage.
So I changed the table to look like this:
ID (int)
ID_TYPE (ID, division, or company)
Feature_name
Feature_value
And created another table with only:
ID (int)
division (int)
company (int)
So for, say, ID #1 there would be 10 rows in the table, the associated division might have a few records, and the company might have a few. An ID record would "override" any record with the same feature_name that matches the division, and a division record would override any company record.
I created a function that, when you pass in an ID and a feature_name, queries by company, then by division, then by ID, and returns the feature value according to the override logic above. (Basically an ordered foreach loop.)
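The ordered lookup described above can be sketched as a single query with a precedence sort instead of three separate queries. This is a minimal illustration using sqlite3 as a stand-in for Informix; the table and column names (entities, features, owner_id) are assumptions, not the poster's actual schema:

```python
import sqlite3

# Stand-in schema: entities(id, division, company) plus the vertical
# features(owner_id, id_type, feature_name, feature_value) table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, division INTEGER, company INTEGER);
    CREATE TABLE features (owner_id INTEGER, id_type TEXT,
                           feature_name TEXT, feature_value TEXT);
    INSERT INTO entities VALUES (1, 10, 100);
    INSERT INTO features VALUES (100, 'company',  'feature1', 'company-default');
    INSERT INTO features VALUES (10,  'division', 'feature1', 'division-value');
    INSERT INTO features VALUES (100, 'company',  'feature2', 'company-only');
""")

def my_func(feature_name, entity_id):
    """Most specific match wins: ID beats division beats company."""
    row = con.execute("""
        SELECT f.feature_value
          FROM entities e
          JOIN features f
            ON f.feature_name = ?
           AND (   (f.id_type = 'ID'       AND f.owner_id = e.id)
                OR (f.id_type = 'division' AND f.owner_id = e.division)
                OR (f.id_type = 'company'  AND f.owner_id = e.company))
         WHERE e.id = ?
         ORDER BY CASE f.id_type WHEN 'ID' THEN 1
                                 WHEN 'division' THEN 2
                                 ELSE 3 END
         LIMIT 1""", (feature_name, entity_id)).fetchone()
    return row[0] if row else None

print(my_func("feature1", 1))  # division-value (division overrides company)
print(my_func("feature2", 1))  # company-only (only a company row exists)
```

In Informix the `LIMIT 1` would be a `SELECT FIRST 1`, but the precedence idea is the same.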
Then I created a view looking like:
select
  my_func('feature1', ID) as feature1,
  my_func('feature2', ID) as feature2,
  ...
  my_func('feature200', ID) as feature200
from table
Now the issue is that I'm hitting the table 200 * 3 (for ID, company, division) times for each feature, which is just not going to work; it pegs the CPU. The new number of records is around 20 million, and it takes up much less space.
Any thoughts? I feel like I'm missing use of a temp table somewhere that would keep it from needing to hit the 20 million row table 600 times.
Comments (2)
You shouldn't be hitting your table 200*3 times for each feature, but for each row of your view - this is because your view includes 200 calls to my_func for each row (one per feature).
This begs the question: are you ever going to need to access all 200 features simultaneously? From what has been written in the question it sounds as though any given ID is likely to be using only a small subset of features - any queries that are specific to particular features should probably be accessing my_func directly (instead of via the view) for those features.
On the other hand, where it is essential to retrieve all 200 features, basing the view on 200 calls to my_func will guarantee 600 logical accesses per row retrieved. Instead, I suggest rewriting the view to access the feature table directly, grouping by ID, with each feature derived by a MAX(CASE WHEN ...)-type construct. This would still result in 600 physical rows being read, but at most 3 logical reads per view row returned - I would expect this to perform significantly better.
My common sense tells me you should normalize into two separate tables.