Generate a table schema by inspecting an Excel (or CSV) file and import the data

Posted 2024-09-06 07:45:51 · 168 characters · 4 views


How would I go about creating a MySQL table schema by inspecting an Excel (or CSV) file?
Are there any ready-made Python libraries for the task?

Column headers would be sanitized into column names. Datatypes would be estimated based on the contents of each spreadsheet column. When done, the data would be loaded into the table.

I have an Excel file of ~200 columns that I want to start normalizing.

Answers (5)

清秋悲枫 2024-09-13 07:45:51


Use the xlrd module; start here. [Disclaimer: I'm the author]. xlrd classifies cells into text, number, date, boolean, error, blank, and empty. It distinguishes dates from numbers by inspecting the format associated with the cell (e.g. "dd/mm/yyyy" versus "0.00").

The job of programming some code to wade through user-entered data to decide on what DB datatype to use for each column is not something that can be easily automated. You should be able to eyeball the data and assign types like integer, money, text, date, datetime, time, etc. and write code to check your guesses. Note that you need to be able to cope with things like numeric or date data entered in text fields (which can look OK in the GUI). You need a strategy to handle cells that don't fit the "estimated" datatype. You need to validate and clean your data. Make sure you normalize text strings (strip leading/trailing whitespace, replace multiple whitespace characters with a single space). Excel text is (BMP-only) Unicode; don't bash it into ASCII or "ANSI" -- work in Unicode and encode as UTF-8 to put it in your database.
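The string-normalization advice above can be sketched in a few lines of Python (the header-sanitizing rule is my own illustration, not part of the answer):

```python
import re

def normalize_text(value):
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    return re.sub(r"\s+", " ", value.strip())

def sanitize_column_name(header):
    """Turn a spreadsheet header into a safe SQL column name (illustrative rules):
    lowercase, non-word runs become underscores, empty headers get a placeholder."""
    name = normalize_text(header).lower()
    name = re.sub(r"[^\w]+", "_", name).strip("_")
    return name or "unnamed"
```

Working in `str` (Unicode) throughout and only calling `.encode("utf-8")` at the database boundary follows the answer's advice.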

如果没有 2024-09-13 07:45:51


Quick and dirty workaround with phpmyadmin:

  • Create a table with the right number of columns. Make sure the data fits the columns.
  • Import the CSV into the table.
  • Use the "Propose table structure" feature.

止于盛夏 2024-09-13 07:45:51


As far as I know, there is no tool that can automate this process (I would love for someone to prove me wrong as I've had this exact problem before).
When I did this, I came up with two options:
(1) Manually create the columns in the db with the appropriate types and then import, or
(2) Write some kind of filter that could "figure out" what data types the columns should be.
I went with the first option mainly because I didn't think I could actually write a program to do the type inference.

If you do decide to write a type inference tool/conversion, here are a couple of issues you may have to deal with:
(1) Excel dates are actually stored as the number of days since December 31st, 1899; how does one infer, then, that a column contains dates as opposed to some other piece of numerical data (population, for example)?
(2) For text fields, do you just make the columns of type varchar(n), where n is the longest entry in that column, or do you make it an unbounded character field if one of the entries exceeds some upper limit? If so, what's a good upper limit?
(3) How do you automatically convert a float to a decimal with the correct precision and without losing any places?
Obviously, this doesn't mean that you won't be able to (I'm a pretty bad programmer). I hope you do, because it'd be a really useful tool to have.
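For what it's worth, each of the three issues can be prototyped in a few lines of Python. This is a rough heuristic sketch, not a finished inference tool; the DECIMAL sizing rule and the VARCHAR cap are my own assumptions:

```python
from datetime import date, timedelta
from decimal import Decimal, InvalidOperation

# Day-number origin as described in point (1); real Excel serials also
# carry the 1900 leap-year quirk, which is ignored here.
EXCEL_EPOCH = date(1899, 12, 31)

def excel_serial_to_date(serial):
    """Interpret an Excel day number as a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)

def is_int(v):
    try:
        int(v)
        return True
    except ValueError:
        return False

def is_decimal(v):
    try:
        Decimal(v)
        return True
    except InvalidOperation:
        return False

def infer_column_type(values, varchar_cap=255):
    """Guess a MySQL type for a column of CSV strings: ints first, then
    fixed-point DECIMAL (never float, so no places are lost -- point 3),
    else VARCHAR(longest), falling back to TEXT past the cap (point 2)."""
    if all(is_int(v) for v in values):
        return "INT"
    if all(is_decimal(v) for v in values):
        digits = max(len(v.replace("-", "").replace(".", "")) for v in values)
        scale = max(len(v.partition(".")[2]) for v in values)
        return f"DECIMAL({digits},{scale})"
    longest = max(len(v) for v in values)
    return f"VARCHAR({longest})" if longest <= varchar_cap else "TEXT"
```

Point (1) remains a judgment call: a column of serials around 44000 could be recent dates or could be populations, which is exactly why the answer recommends eyeballing the data.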

故人爱我别走 2024-09-13 07:45:51


Pandas can return a schema:

pandas.read_csv('data.csv').dtypes
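To go from those inferred dtypes to an actual MySQL schema, a rough mapping can be sketched; the dtype-kind-to-type table below is my own assumption, not something pandas provides:

```python
import io

import pandas as pd

# Hypothetical mapping from pandas dtype "kind" codes to MySQL column types.
DTYPE_TO_MYSQL = {
    "i": "BIGINT",    # integer
    "f": "DOUBLE",    # floating point
    "b": "BOOL",      # boolean
    "M": "DATETIME",  # datetime64
    "O": "TEXT",      # object (strings)
}

def mysql_schema(df, table):
    """Render the DataFrame's dtypes as a CREATE TABLE statement."""
    cols = ",\n  ".join(
        "`%s` %s" % (col, DTYPE_TO_MYSQL.get(dtype.kind, "TEXT"))
        for col, dtype in df.dtypes.items()
    )
    return "CREATE TABLE `%s` (\n  %s\n);" % (table, cols)

csv = "id,price,name\n1,9.99,foo\n2,19.5,bar\n"
df = pd.read_csv(io.StringIO(csv))
print(mysql_schema(df, "products"))
```

pandas can also create and populate the table in one step with `DataFrame.to_sql` (via SQLAlchemy), which performs a similar dtype mapping internally.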


溺ぐ爱和你が 2024-09-13 07:45:51


Just for (my) reference, I documented below what I did:

  1. xlrd is practical; however, I just saved the Excel data as CSV so that I could use LOAD DATA INFILE.
  2. I copied the header row and started writing the import and normalization script.
  3. The script runs CREATE TABLE with all columns as TEXT, except for the primary key.
  4. It then queries MySQL with LOAD DATA LOCAL INFILE, loading all the CSV data into the TEXT fields.
  5. Based on the output of PROCEDURE ANALYSE, I was able to ALTER TABLE to give the columns the right types and lengths. PROCEDURE ANALYSE returns ENUM for any column with few distinct values, which is not what I needed, but that proved useful later for normalization. Eyeballing 200 columns was a breeze with PROCEDURE ANALYSE; the output of phpMyAdmin's "Propose table structure" was junk by comparison.
  6. I wrote some normalization, mostly using SELECT DISTINCT on columns and INSERTing the results into separate tables. I first added an FK column to the old table; right after each INSERT, I took its ID and UPDATEd the FK column. When the loop finished, I dropped the old column, leaving only the FK column. Likewise for multiple dependent columns. It was much faster than I expected.
  7. I ran (Django's) python manage.py inspectdb, copied the output to models.py, and added all the ForeignKey fields, since FKs do not exist on MyISAM. Wrote a little Python: views.py, urls.py, a few templates... TADA.
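Steps 3–5 above can be sketched as generated SQL. Table and file names here are invented, and note that PROCEDURE ANALYSE exists only up to MySQL 5.7 (it was removed in 8.0):

```python
def create_table_sql(table, columns):
    """Step 3: all data columns as TEXT, plus a surrogate primary key."""
    cols = ",\n  ".join(f"`{c}` TEXT" for c in columns)
    return (
        f"CREATE TABLE `{table}` (\n"
        f"  `id` INT NOT NULL AUTO_INCREMENT,\n"
        f"  {cols},\n"
        "  PRIMARY KEY (`id`)\n"
        ");"
    )

def load_data_sql(table, csv_path):
    """Step 4: bulk-load the CSV, skipping the header row."""
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' INTO TABLE `{table}`\n"
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
        "LINES TERMINATED BY '\\n' IGNORE 1 LINES;"
    )

def analyse_sql(table):
    """Step 5: ask MySQL to suggest optimal column types."""
    return f"SELECT * FROM `{table}` PROCEDURE ANALYSE();"

print(create_table_sql("schools", ["urn", "name"]))
print(load_data_sql("schools", "schools.csv"))
print(analyse_sql("schools"))
```

Loading everything as TEXT first and tightening the types afterwards, as the answer describes, sidesteps the type-inference problem discussed in the other answers.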

Created https://blocl.uk/schools
based on data from https://www.get-information-schools.service.gov.uk/Downloads
