How can I reconstruct a relational database from .csv files?
So I've got a little over 1.5TB of data, which I have been told came from some sort of relational database. Unfortunately, somewhere along the line, everything that makes a relational database relational was stripped out, and what we're left with is almost fifty .csv files, each one corresponding to one of the tables in the database. There's no schema, no notes, and only one piece of helpful documentation. Compounding the problem, we don't have access to the people who maintain the database and can't ask them what X_ATTRIB_14 in CX_CUST_O means.
We DO have a file that lists the name and datatype of every attribute in every table, so we at least know what the attributes are. It looks something like:
TABLE_NAME
ROW_ID..................VARCHAR2(32)
CREATED.................VARCHAR2(16)
LAST_MODIFIED...........DATE
etc.
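A listing like this can be turned into a machine-readable column catalogue before touching the data. The sketch below is only an illustration: it assumes each table starts with a bare table name on its own line, followed by one NAME....TYPE line per column. `parse_schema` and `COLUMN_LINE` are hypothetical names, and the real file may need a different pattern.

```python
import re

# Assumed layout: a bare table name, then one "NAME....TYPE" line per column.
# The leader between name and type is two or more dots and/or spaces.
COLUMN_LINE = re.compile(r"^(\w+)[.\s]{2,}(\w+(?:\(\d+(?:,\d+)?\))?)$")


def parse_schema(text):
    """Return {table_name: [(column_name, datatype), ...]}."""
    tables = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = COLUMN_LINE.match(line)
        if match:
            if current is None:
                raise ValueError(f"column line before any table name: {line!r}")
            tables[current].append((match.group(1), match.group(2)))
        else:
            # Any line that is not a column line is treated as a table name.
            current = line
            tables[current] = []
    return tables
```

With the catalogue in a dictionary, questions such as "which tables have a column of type VARCHAR2(32) other than ROW_ID" become simple lookups.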
The data is very patchy: many attributes have no data in any row, and there are lots of unhelpful names like ATTRIB_3. Each table does have a ROW_ID, CREATED, and CREATED_BY, as well as a CONFLICT_ID. Then there are the data fields, which are usually only partially filled, with lots of empty values.
So far I've done some basic pre-processing: eliminating attributes that have no values, and checking whether simple naming conventions give us a hint, for example whether X_ATTRIB_3 in TABLE_A is the same as X_ATTRIB_3 in TABLE_B. It appears that the only common attributes are trivial ones. LONGITUDE, for example, is common across a few tables, but that is unlikely to help. I'm unsure how to proceed; the sheer size of the data makes it impossible to look at everything by hand.
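One way to put numbers on that patchiness is to stream each file once and count the non-empty values per column. This is a minimal sketch, assuming each .csv has a header row and that whitespace-only cells count as empty; `fill_counts` is a hypothetical helper name. It reads row by row, so memory use does not grow with file size.

```python
import csv
from collections import Counter


def fill_counts(lines):
    """Stream one CSV (header row assumed) and count non-empty values per column.

    `lines` is any iterable of text lines, for example an open file object.
    Returns (row_count, {column_name: non_empty_count}).
    """
    reader = csv.DictReader(lines)
    counts = Counter({name: 0 for name in reader.fieldnames or []})
    rows = 0
    for row in reader:
        rows += 1
        for name, value in row.items():
            # name is None for surplus cells; value is None for missing cells.
            if name is not None and value is not None and value.strip():
                counts[name] += 1
    return rows, dict(counts)
```

Typical use would be `with open(path, newline="") as f: rows, counts = fill_counts(f)`; columns whose count is 0 are the ones with no data at all.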
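The name-matching step described above can be sketched as follows. It is only an illustration: `shared_columns` is a hypothetical helper, and the `ignore` set is an assumption used to leave out housekeeping columns such as ROW_ID that every table has. A shared name is only a hint, since nothing guarantees that X_ATTRIB_3 means the same thing in two tables.

```python
from collections import defaultdict


def shared_columns(headers_by_table, ignore=()):
    """Map each column name found in two or more tables to those tables.

    headers_by_table: {table_name: iterable of column names}
    ignore: column names to skip (for example housekeeping columns).
    """
    tables_by_column = defaultdict(set)
    for table, columns in headers_by_table.items():
        for column in columns:
            if column not in ignore:
                tables_by_column[column].add(table)
    return {
        column: sorted(tables)
        for column, tables in tables_by_column.items()
        if len(tables) > 1
    }
```

The input dictionary can be built from the header row of each .csv file or from the attribute listing.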
Are there any tools or techniques that have been developed that would aid in reconstructing the relationships between the tables, or is this sufficiently rare that we're going to have to start from scratch?
Thank you for your time.
Comments (1)
It is very difficult to rebuild relationships for complex data structures without intimate knowledge of the nature of the data itself. The relationships would have to be rebuilt by hand, and to do that correctly the data should NOT be modified. In most database schema designs, the designer creates relationships over record IDs and other easily indexable data, typically arbitrary numbers. The first thing to do is to load the content into a database in raw form, AS IS! Then, based on your heuristic understanding of the data, create relationships that make sense for the queries you intend to run. You probably need professional assistance to do this :-)
You are actually better off NOT using any automated build tools to process your raw data; loss of information can be very subtle when you dump data to a format like CSV and then try to rebuild.
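As a rough illustration of loading the data as is and then testing one hand-picked guess at a time, here is a minimal sketch. Everything in it is an assumption for demonstration: SQLite is used only to keep the example self-contained (1.5TB would more likely go into a server database), every column is stored as unmodified TEXT, and `load_raw`, `containment`, and the table and column names in the usage note are hypothetical. The containment figure does not create or prove a relationship; it only shows how many distinct values in one column also occur as a ROW_ID in another table, which may help decide what to inspect by hand.

```python
import csv
import sqlite3


def quote(identifier):
    """Quote a table or column name for use in SQLite statements."""
    return '"' + identifier.replace('"', '""') + '"'


def load_raw(conn, table, lines):
    """Load one CSV (header row assumed) into TEXT columns, values unchanged.

    Rows whose cell count differs from the header make sqlite3 raise an
    error instead of being silently repaired.
    """
    reader = csv.reader(lines)
    header = next(reader)
    columns = ", ".join(f"{quote(name)} TEXT" for name in header)
    conn.execute(f"CREATE TABLE {quote(table)} ({columns})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {quote(table)} VALUES ({placeholders})", reader)


def containment(conn, child_table, child_column, parent_table, parent_column="ROW_ID"):
    """Fraction of distinct non-empty child values that occur in the parent column."""
    total, matched = conn.execute(
        f"""
        SELECT COUNT(*), COUNT(p.v)
        FROM (SELECT DISTINCT {quote(child_column)} AS v
              FROM {quote(child_table)}
              WHERE {quote(child_column)} <> '') AS c
        LEFT JOIN (SELECT DISTINCT {quote(parent_column)} AS v
                   FROM {quote(parent_table)}) AS p
          ON p.v = c.v
        """
    ).fetchone()
    return matched / total if total else 0.0
```

For example, after loading CX_CUST_O and a second table, `containment(conn, "CX_ORDER", "CUST_ID", "CX_CUST_O")` would report how much of the made-up CUST_ID column resolves to a customer ROW_ID. A value near 1.0 is a candidate worth checking by hand, not a confirmed foreign key.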