Generate a table schema by inspecting an Excel (CSV) file and import the data
How would I go about creating a MySQL table schema by inspecting an Excel (or CSV) file?
Are there any ready-made Python libraries for the task?
Column headers would be sanitized into column names. Datatypes would be estimated based on the contents of each spreadsheet column. When done, the data would be loaded into the table.
I have an Excel file of ~200 columns that I want to start normalizing.
5 Answers
Use the xlrd module; start here. [Disclaimer: I'm the author.] xlrd classifies cells into text, number, date, boolean, error, blank, and empty. It distinguishes dates from numbers by inspecting the format associated with the cell (e.g. "dd/mm/yyyy" versus "0.00").
The job of writing code that wades through user-entered data and decides what DB datatype to use for each column is not something that can be easily automated. You should be able to eyeball the data, assign types like integer, money, text, date, datetime, time, etc., and write code to check your guesses. Note that you need to be able to cope with things like numeric or date data entered in text fields (it can look OK in the GUI). You need a strategy for handling cells that don't fit the "estimated" datatype. You need to validate and clean your data. Make sure you normalize text strings (strip leading/trailing whitespace, replace multiple whitespaces with a single space). Excel text is (BMP-only) Unicode; don't bash it into ASCII or "ANSI" -- work in Unicode and encode as UTF-8 to put it in your database.
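A minimal sketch of the per-column inspection described above, using xlrd's cell-type constants. The sanitize rule, the type-mapping choices, and the function names are illustrative assumptions, not from this answer; note also that xlrd 2.x reads only .xls files, so .xlsx needs an older xlrd or a different reader.

```python
import re
import xlrd

def sanitize(header):
    """Turn a spreadsheet header into a safe MySQL column name (illustrative rule)."""
    name = re.sub(r"\W+", "_", str(header).strip().lower()).strip("_")
    return name or "col"

def guess_column_types(path):
    """Map each column to a guessed MySQL type from xlrd's cell classification."""
    book = xlrd.open_workbook(path)
    sheet = book.sheet_by_index(0)
    headers = [sanitize(h) for h in sheet.row_values(0)]
    guesses = {}
    for col in range(sheet.ncols):
        # Collect xlrd's classification for every cell below the header row
        types = {sheet.cell_type(row, col) for row in range(1, sheet.nrows)}
        types -= {xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK}
        if types == {xlrd.XL_CELL_DATE}:
            sql_type = "DATETIME"
        elif types == {xlrd.XL_CELL_NUMBER}:
            values = [sheet.cell_value(r, col) for r in range(1, sheet.nrows)
                      if sheet.cell_type(r, col) == xlrd.XL_CELL_NUMBER]
            sql_type = "BIGINT" if all(v == int(v) for v in values) else "DOUBLE"
        elif types == {xlrd.XL_CELL_BOOLEAN}:
            sql_type = "TINYINT(1)"
        else:
            sql_type = "TEXT"  # mixed or text-only columns fall back to TEXT
        guesses[headers[col]] = sql_type
    return guesses
```

The output is only a first guess to be checked against the data, exactly as the answer advises.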
Quick and dirty workaround with phpmyadmin:
As far as I know, there is no tool that can automate this process (I would love for someone to prove me wrong as I've had this exact problem before).
When I did this, I came up with two options:
(1) Manually create the columns in the db with the appropriate types and then import, or
(2) Write some kind of filter that could "figure out" what data types the columns should be.
I went with the first option mainly because I didn't think I could actually write a program to do the type inference.
If you do decide to write a type inference tool/conversion, here are a couple of issues you may have to deal with:
(1) Excel dates are actually stored as the number of days since December 31st, 1899; how does one then infer that a column contains dates as opposed to ordinary numeric data (a population count, for example)?
(2) For text fields, do you just make the columns of type varchar(n) where n is the longest entry in that column, or do you make it an unbounded char field if one of the entries is longer than some upper limit? If so, what's a good upper limit?
(3) How do you automatically convert a float to a decimal with the correct precision and without losing any decimal places?
Obviously, this doesn't mean that you won't be able to (I'm a pretty bad programmer). I hope you do, because it'd be a really useful tool to have.
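If you do write such a tool, here is a rough sketch of one way to handle issues (2) and (3) for a column of raw string values. The 255-character VARCHAR cutoff and the helper names are illustrative assumptions, not from this answer.

```python
from decimal import Decimal

def text_type(values, varchar_limit=255):
    """Issue (2): bounded VARCHAR if every entry is short enough, else TEXT."""
    longest = max((len(v) for v in values), default=1)
    return "VARCHAR(%d)" % longest if longest <= varchar_limit else "TEXT"

def decimal_type(values):
    """Issue (3): derive DECIMAL(precision, scale) from the widest value seen."""
    whole_digits = scale = 0
    for v in values:
        t = Decimal(v).as_tuple()
        frac = max(0, -t.exponent)                     # digits after the decimal point
        whole = max(1, len(t.digits) + t.exponent)     # digits before the decimal point
        scale = max(scale, frac)
        whole_digits = max(whole_digits, whole)
    return "DECIMAL(%d,%d)" % (whole_digits + scale, scale)

# Example: decimal_type(["12.5", "3.141", "100"]) -> "DECIMAL(6,3)"
```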
Pandas can return a schema:
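A minimal sketch of the idea: pandas infers a dtype for every column while reading the file, and to_sql can create the MySQL table from those dtypes and load the rows. The file name and the connection string are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

# Let pandas infer a dtype for every column while reading the file
df = pd.read_csv("data.csv")
print(df.dtypes)  # the inferred "schema": int64, float64, object, ...

# to_sql creates the table from those dtypes and loads the data in one step
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
df.to_sql("my_table", engine, index=False, if_exists="fail")
```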
References:
pandas.read_csv
pandas.DataFrame
Just for (my) reference, I documented below what I did:
(1) Saved the Excel data as CSV so it could be loaded with LOAD DATA INFILE.
(2) CREATE TABLE with all columns as TEXT, except for the primary key.
(3) LOAD DATA LOCAL INFILE to load all the CSV data into the TEXT fields.
(4) Based on the output of PROCEDURE ANALYSE, I was able to ALTER TABLE to give the columns the right types and lengths. PROCEDURE ANALYSE returns ENUM for any column with few distinct values, which is not what I needed, but I found that useful later for normalization. Eye-balling 200 columns was a breeze with PROCEDURE ANALYSE; the output of phpMyAdmin's "propose table structure" was junk.
(5) For normalization: SELECT DISTINCT on a column and INSERT the results into a separate table. I first added an FK column to the old table; just after each INSERT I took the new row's ID and UPDATEd the FK column. When the loop finished I dropped the old column, leaving only the FK column. Similarly with multiple dependent columns. It was much faster than I expected.
(6) Ran python manage.py inspectdb, copied the output to models.py and added all those ForeignKey fields, as FKs do not exist on MyISAM.
(7) Wrote a little Python: views.py, urls.py, a few templates... TADA.
Created https://blocl.uk/schools based on data from https://www.get-information-schools.service.gov.uk/Downloads
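A rough sketch of the normalization loop in step (5), assuming pymysql and illustrative table/column names (schools, school_type) with placeholder credentials; none of these names come from the original answer.

```python
import pymysql

conn = pymysql.connect(host="localhost", user="user", password="password", database="mydb")
cur = conn.cursor()

# Lookup table that will hold the distinct values of the column being normalized
cur.execute("CREATE TABLE school_type (id INT AUTO_INCREMENT PRIMARY KEY, name TEXT)")
# FK column is added to the old table first, as described above
cur.execute("ALTER TABLE schools ADD COLUMN school_type_id INT")

cur.execute("SELECT DISTINCT school_type FROM schools")
for (value,) in cur.fetchall():
    cur.execute("INSERT INTO school_type (name) VALUES (%s)", (value,))
    new_id = cur.lastrowid  # ID of the row just inserted
    cur.execute("UPDATE schools SET school_type_id = %s WHERE school_type = %s",
                (new_id, value))

# When the loop is done, drop the old column, leaving only the FK
cur.execute("ALTER TABLE schools DROP COLUMN school_type")
conn.commit()
conn.close()
```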