返回介绍

Basic concepts of database normalization

发布于 2025-02-25 23:43:38 字数 4595 浏览 0 评论 0 收藏 0

In which we convert a dataframe into a normalized database.

names = ['ann', 'bob', 'ann', 'bob', 'carl', 'delia', 'ann']
tests = ['wbc', 'wbc', 'rbc', 'rbc', 'wbc', 'rbc', 'platelets']
values1 = [10, 11.2, 300, 204, 9.8, 340, 125]
values2 = [10.6, 13.2, 322, 214, 10.3, 343, 145]
df = pd.DataFrame([names, tests, values1, values2]).T
df.columns = ['names', 'tests', 'values1', 'values2']
df
 namestestsvalues1values2
0annwbc1010.6
1bobwbc11.213.2
2annrbc300322
3bobrbc204214
4carlwbc9.810.3
5deliarbc340343
6annplatelets125145
# names are put into their own table so there is no dubplication

name_table = pd.DataFrame(df['names'].unique(), columns=['name'])
name_table['name_id'] = name_table.index
columns = ['name_id', 'name']
name_table[columns]
 name_idname
00ann
11bob
22carl
33delia
# tests are put inot their own table so there is no duplication

test_table = pd.DataFrame(df['tests'].unique(), columns=['test'])
test_table['test_id'] = test_table.index
columns = ['test_id', 'test']
test_table[columns]
 test_idtest
00wbc
11rbc
22platelets
# the values1 and values2 correspond to visit 1 and 2, so
# we create a visits table

visit_table = pd.DataFrame([1,2], columns=['visit'])
visit_table['visit_id'] = visit_table.index
columns = ['visit_id', 'visit']
visit_table[columns]
 visit_idvisit
001
112
# finally, we link each value to a triple(name_id, test_id, visit_id)

value_table = pd.DataFrame([
    [0,0,0,10], [1,0,0,11.2], [0,1,0,300], [1,1,0,204], [2,0,0,9.8], [3,1,0,340], [0,2,0,125],
   [0,0,1,10.6], [1,0,1,13.2], [0,1,1,322], [1,1,1,214], [2,0,1,10.3], [3,1,1,343], [0,2,1,145]
], columns=['name_id', 'test_id', 'visit_id', 'value'])
value_table
 name_idtest_idvisit_idvalue
000010.0
110011.2
2010300.0
3110204.0
42009.8
5310340.0
6020125.0
700110.6
810113.2
9011322.0
10111214.0
1120110.3
12311343.0
13021145.0

At the end of the normalizaiton, we have gone from 1 dataframe with multiple redundancies to 4 tables with unique entries in each row. This organization helps maintain data integrity and is necesssary for effficeincy as the number of test values grows, possibly into millions of rows. As we have seen, we can use SQL queries to recreate the origianl dataformat if that is more convenient for analysis.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据
    我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
    原文