规范化表：在一系列行中查找唯一列 (Oracle 10.x)

发布于 2024-09-25 12:21:35 字数 1738 浏览 10 评论 0原文

我有一个具有以下结构的表：

WorkerPersons
-------------------------------
ID          (PK)
PersonID    (Indicates which version of Person the record describes)
SomeColumn1 (data specific to Worker)
SomeColumn2 (data specific to Person)
....
SomeColumnN
-------------------------------

正如您所看到的，它是一个非规范化的表，它在一个表中同时保存 Worker 和 Person（以及一个 Person 的许多版本）数据。我的愿望是规范化该表，但是，由于该表包含大量数据（许多列），我需要确定哪些列应进入 Workers 表，哪些列应进入 Persons 表。结果应该是这样的：

Workers                 Persons
----------------------- ---------------------
ID                      ID
PersonID (now a FK)     PersonColumn1
WorkerColumn1           PersonColumn2
WorkerColumn2           ...
...                     PersonColumnN
WorkerColumnN
----------------------- ---------------------

为此，我需要分析哪些数据在所有唯一 Person 中的 Person 范围有所不同（在 WorkerPersons 中由 PersonID 分隔）。例如：

WorkerPersons
-------------------------------------------------------
ID      PersonID      Column1      Column2      Column3
-------------------------------------------------------
1       PersonA       10.1         John Doe     Single
2       PersonA       10.1         John Doe     Single
3       PersonA       10.1         John Doe     Married
4       PersonB       09.2         Sully        Single
5       PersonB       09.2         Sullivan     Single

在本例中，PersonA 有 3 个版本，PersonB 有 2 个版本。 Column1 值在所有版本的 Person 中始终相同，我们可以将该列移至表 Worker。但是 Column 2 和 Column3 值在不同版本的 Person 中发生变化，因此这些值应该移动到 Person 表中。

想象不到，我有大约 10 个这样的表需要标准化，每个表大约有 40 列。每个表保存大约 500k 到 5m 行。

我需要一个脚本来帮助我分析哪些列要移动到哪里。我需要一个脚本来输出整个表中唯一 Person 范围内发生变化的所有列。我不知道如何做到这一点。我尝试使用 LAG 分析函数来与下一行进行比较，但到底如何输出更改的列超出了我的范围。

请指教。

最好的祝愿，安德鲁

原文

I have a table with the following structure:

WorkerPersons
-------------------------------
ID          (PK)
PersonID    (Indicates which version of Person the record describes)
SomeColumn1 (data specific to Worker)
SomeColumn2 (data specific to Person)
....
SomeColumnN
-------------------------------

As you can see, it's a denormalized table, which holds both Worker and Person (and many versions of one Person) data in one table. My wish is to normalize that table, however, as the table holds a lot of data (many many columns), I need to be sure which columns should go to Workers table and which columns to Persons table. The outcome should be like this:

Workers                 Persons
----------------------- ---------------------
ID                      ID
PersonID (now a FK)     PersonColumn1
WorkerColumn1           PersonColumn2
WorkerColumn2           ...
...                     PersonColumnN
WorkerColumnN
----------------------- ---------------------

To do that, I need to analyze which data differs in scope of Person over all unique Persons (wich are separated by PersonID in WorkerPersons). For example:

WorkerPersons
-------------------------------------------------------
ID      PersonID      Column1      Column2      Column3
-------------------------------------------------------
1       PersonA       10.1         John Doe     Single
2       PersonA       10.1         John Doe     Single
3       PersonA       10.1         John Doe     Married
4       PersonB       09.2         Sully        Single
5       PersonB       09.2         Sullivan     Single

In this case, there are 3 versions on PersonA and 2 versions of PersonB. Column1 values are always the same over all versions of Person, and we can move that column to table Worker. But Column 2 and Column3 values change over different versions of Person, so those values should be moved to Person table.

No imagine, I have about 10 tables like this that need to be normalized, with about 40 columns in each. Eeach table holds about 500k to 5m rows.

I need a script that helps me analyse which columns to move where. I need a script that outputs all columns that change in scope of unique Person over the whole table. I've no ideas however how to do that. I experimented with LAG analytical function to compare against the next row but how in the world to output changed columns is beyond me.

Please advise.

Best wishes,
Andrew

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

魔 2024-10-02 12:21:35

由于 10 个表并不是很多，这里是（某种）

for each table_name in tables
  for each column_name in columns
    case (exists (select 1
          from table_name
          group by PersonID
          having min(column_name) = max(column_name))
       when true then 'Worker'
       when false then 'Person'
    end case
  end for
end for

带有信息模式和动态查询的伪代码，您可以制作上述正确的 PL/SQL 或采用核心查询并用您最喜欢的语言编写脚本。

编辑：
上面假设 column_name 中没有 NULL。

编辑2：
核心查询的其他变体可以是，

SELECT 1
FROM 
(SELECT COUNT(DISTINCT column_name) AS distinct_values_by_pid
FROM table_name
GROUP BY PersonID) T
HAVING MIN(distinct_values_by_pid) = MAX(distinct_values_by_pid)

如果每个 PersonID 的所有值都相同，则将返回一行。
（此查询也存在 NULL 问题，但我认为 NULL 是一个单独的问题；出于上述查询的目的，您始终可以将 NULL 转换为某个域外值）

上述查询也可以写为

SELECT MIN(c1)=MAX(c1), MIN(c2)=MAX(c2), ...
FROM 
(SELECT COUNT(DISTINCT column_name_1) AS c1, COUNT(DISTINCT column_name_2) AS c2, ...
FROM table_name
GROUP BY PersonID) T

Which will test multiple同时对于属于“Workers”的列返回 true，对于应该进入“Persons”的列返回 false。

Since 10 tables is not a lot, here is (some sort of) pseudo code

for each table_name in tables
  for each column_name in columns
    case (exists (select 1
          from table_name
          group by PersonID
          having min(column_name) = max(column_name))
       when true then 'Worker'
       when false then 'Person'
    end case
  end for
end for

with information schema and dynamic queries you could make the above proper PL/SQL or take the core query and script it in your favourite language.

EDIT:
The above assumes no NULLs in column_name.

EDIT2:
Other variants of the core query can be

SELECT 1
FROM 
(SELECT COUNT(DISTINCT column_name) AS distinct_values_by_pid
FROM table_name
GROUP BY PersonID) T
HAVING MIN(distinct_values_by_pid) = MAX(distinct_values_by_pid)

Which will return a row if all values per PersonID are the same.
(this query also has problems with NULLS, but I consider NULLs a separate issue; you can always cast a NULL to some out-of-domain value for purposes of the above query)

The above query can be also written as

SELECT MIN(c1)=MAX(c1), MIN(c2)=MAX(c2), ...
FROM 
(SELECT COUNT(DISTINCT column_name_1) AS c1, COUNT(DISTINCT column_name_2) AS c2, ...
FROM table_name
GROUP BY PersonID) T

Which will test multiple columns at the same time returning true for columns that belong to 'Workers' and false for columns that should go into 'Persons'.

回复收藏 0 原文

天邊彩虹 2024-10-02 12:21:35

谢谢，但我通过让 Excel 在表架构信息上创建一系列选择来解决这个问题。它生成的最终查询是一长串选择，但它有效（尽管运行了一个多小时）。 “核心查询”（实际上是 Excel 中创建核心查询的公式）：

=IF(AND(C17<>"CLOB";C17<>"NCLOB");"SELECT '"&A17&".'||initcap('"&B17&"') description,
decode(count(*),0,'SAME OVE VERSIONS','DIFFERENT OVER VERSIONS') values FROM (SELECT 
objektid, count(DISTINCT nvl("&B17&","&IF(C17="DATE";"'01.02.0004'";IF(C17="VARCHAR2"
;"'!#¤¤%¤(%#¤%AS'";"-1234561"))&")) OVER (PARTITION BY objectid) arv FROM "&A17&") 
WHERE number > 1 union all";"SELECT '"&A17&".'||initcap('"&B17&"') description, 'CLOB
field' values from dual union all")

Thanks, but I solved it by letting Excel create series of selects over table schema information. The final query that it generated was a long list of selects but it works (although it run over a hour). The "core query" (actually a formula in Excel to create to core query):

=IF(AND(C17<>"CLOB";C17<>"NCLOB");"SELECT '"&A17&".'||initcap('"&B17&"') description,
decode(count(*),0,'SAME OVE VERSIONS','DIFFERENT OVER VERSIONS') values FROM (SELECT 
objektid, count(DISTINCT nvl("&B17&","&IF(C17="DATE";"'01.02.0004'";IF(C17="VARCHAR2"
;"'!#¤¤%¤(%#¤%AS'";"-1234561"))&")) OVER (PARTITION BY objectid) arv FROM "&A17&") 
WHERE number > 1 union all";"SELECT '"&A17&".'||initcap('"&B17&"') description, 'CLOB
field' values from dual union all")

回复收藏 0 原文

~没有更多了~