python中CSV数据的数据类型识别/猜测

发布于 2024-11-26 11:05:29 字数 810 浏览 5 评论 0原文

我的问题是处理大型 CSV 文件中的数据。

我正在寻找最有效的方法来根据该列中找到的值确定(即猜测)该列的数据类型。我可能正在处理非常混乱的数据。因此,算法应该具有一定的容错性。

下面是一个示例:

arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']               # ==> recognize as float
arr2 = ['1', '11', '-1345.67', '0', '22']                    # ==> regognize as int
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011'] # ==> recognize as date
arr4 = ['Dog', 'Cat', '0.13', 'Mouse']                       # ==> recognize as str

底线:我正在寻找一个 python 包或一种算法,可以检测

  • CSV 文件的架构,甚至可以检测
  • 单个列的数据类型 作为数组

猜测类型的方法当前表示为字符串 的数据也朝着类似的方向发展。 不过,我担心性能,因为我可能正在处理许多大型电子表格(数据源自何处)

My problem is in the context of processing data from large CSV files.

I'm looking for the most efficient way to determine (that is, guess) the data type of a column based on the values found in that column. I'm potentially dealing with very messy data. Therefore, the algorithm should be error-tolerant to some extent.

Here's an example:

arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']               # ==> recognize as float
arr2 = ['1', '11', '-1345.67', '0', '22']                    # ==> regognize as int
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011'] # ==> recognize as date
arr4 = ['Dog', 'Cat', '0.13', 'Mouse']                       # ==> recognize as str

Bottom line: I'm looking for a python package or an algorithm that can detect either

  • the schema of a CSV file, or even better
  • the data type of an individual column
    as an array

Method for guessing type of data represented currently represented as strings goes in a similar direction.
I'm worried about performance, though, since I'm possibly dealing with many large spreadsheets (where the data stems from)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

网名女生简单气质 2024-12-03 11:05:29

您可能对这个 python 库感兴趣,它可以为您对 CSV 和 XLS 文件进行这种类型猜测:

它很乐意扩展到非常大的文件,流式传输互联网上的数据还有

一个更简单的包装库,其中包括一个名为 dataconverters 的命令行工具: http://okfnlabs.org/dataconverters/ (以及在线服务:https://github.com/okfn/dataproxy!)

进行类型猜测的核心算法是这里:https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164

You may be interested in this python library which does exactly this kind of type guessing on CSVs and XLS files for you:

It happily scales to very large files, to streaming data off the internet etc.

There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164

风月客 2024-12-03 11:05:29

经过一些思考后,这就是我自己设计算法的方式:

  • 出于性能原因:为每列(例如 1%)取样,为
  • 样本中的每个单元格运行正则表达式匹配,检查数据类型
  • 选择基于频率分布的列的适当数据类型

出现的两个问题:

  • 足够的样本量是多少?对于小数据集?对于大数据集?
  • 根据频率分布选择数据类型的足够高的阈值是多少?

After putting some thought into it, this is how I would design the algorithm myself:

  • For performance reasons: take a sample for each column (say, 1%)
  • run a regex match for each cell in the sample, checking for the data type
  • Choose the appropriate data type for the column based on the frequency distribution

The two questions that arise:

  • What's a sufficient sample size? For small data sets? For large data sets?
  • What's a high enough threshold for selecting a data type based on the frequency distribution?
陌上青苔 2024-12-03 11:05:29

也许 csvsql 在这里可能有用?不知道它有多高效,但肯定可以完成从 csv 生成 sql 创建表语句的工作。

$ csvsql so_many_columns.csv  >> sql_create_table_with_char_types.txt

Maybe csvsql could be useful here? No idea how efficient it is but definitely gets the job done for generating sql create table statements out of csvs.

$ csvsql so_many_columns.csv  >> sql_create_table_with_char_types.txt
◇流星雨 2024-12-03 11:05:29

您可以尝试使用正则表达式进行预解析。例如:

import re
pattern = re.compile(r'^-?\d+.{1}\d+

这样,您可以创建一个正则表达式字典,并尝试每个字典,直到找到匹配项。

myregex = {int: r'^-?\d+

不要忘记开头的“^”和结尾的“$”,如果没有,正则表达式可以匹配部分字符串并返回一个对象。

希望这有帮助:)

) data = '123.42' print pattern.match(data) # ----> object data2 = 'NOT123.42GONNA31.4HAPPEN' print pattern.match(data2) # ----> None

这样,您可以创建一个正则表达式字典,并尝试每个字典,直到找到匹配项。


不要忘记开头的“^”和结尾的“$”,如果没有,正则表达式可以匹配部分字符串并返回一个对象。

希望这有帮助:)

, float: r'^\d+.{1}\d+

不要忘记开头的“^”和结尾的“$”,如果没有,正则表达式可以匹配部分字符串并返回一个对象。

希望这有帮助:)

) data = '123.42' print pattern.match(data) # ----> object data2 = 'NOT123.42GONNA31.4HAPPEN' print pattern.match(data2) # ----> None

这样,您可以创建一个正则表达式字典,并尝试每个字典,直到找到匹配项。

不要忘记开头的“^”和结尾的“$”,如果没有,正则表达式可以匹配部分字符串并返回一个对象。

希望这有帮助:)

, ....} for key, reg in myregex.items(): to_del = [] for index, data in enumerate(arr1): if re.match(reg,data): d = key(data) # You will need to insert data differently depending on function ....#---> do something to_del.append(data) # ---> delete this when you can from arr1

不要忘记开头的“^”和结尾的“$”,如果没有,正则表达式可以匹配部分字符串并返回一个对象。

希望这有帮助:)

) data = '123.42' print pattern.match(data) # ----> object data2 = 'NOT123.42GONNA31.4HAPPEN' print pattern.match(data2) # ----> None

这样,您可以创建一个正则表达式字典,并尝试每个字典,直到找到匹配项。

不要忘记开头的“^”和结尾的“$”,如果没有,正则表达式可以匹配部分字符串并返回一个对象。

希望这有帮助:)

You could try a pre parse using regex. For example:

import re
pattern = re.compile(r'^-?\d+.{1}\d+

This way you could do a dictionary of regex and try each of them until you find a match

myregex = {int: r'^-?\d+

Don't forget the '^' at the beggining and the '$' at the end, if not the regex could match part of the string and return an object.

Hope this helps :)

) data = '123.42' print pattern.match(data) # ----> object data2 = 'NOT123.42GONNA31.4HAPPEN' print pattern.match(data2) # ----> None

This way you could do a dictionary of regex and try each of them until you find a match


Don't forget the '^' at the beggining and the '$' at the end, if not the regex could match part of the string and return an object.

Hope this helps :)

, float: r'^\d+.{1}\d+

Don't forget the '^' at the beggining and the '$' at the end, if not the regex could match part of the string and return an object.

Hope this helps :)

) data = '123.42' print pattern.match(data) # ----> object data2 = 'NOT123.42GONNA31.4HAPPEN' print pattern.match(data2) # ----> None

This way you could do a dictionary of regex and try each of them until you find a match

Don't forget the '^' at the beggining and the '$' at the end, if not the regex could match part of the string and return an object.

Hope this helps :)

, ....} for key, reg in myregex.items(): to_del = [] for index, data in enumerate(arr1): if re.match(reg,data): d = key(data) # You will need to insert data differently depending on function ....#---> do something to_del.append(data) # ---> delete this when you can from arr1

Don't forget the '^' at the beggining and the '$' at the end, if not the regex could match part of the string and return an object.

Hope this helps :)

) data = '123.42' print pattern.match(data) # ----> object data2 = 'NOT123.42GONNA31.4HAPPEN' print pattern.match(data2) # ----> None

This way you could do a dictionary of regex and try each of them until you find a match

Don't forget the '^' at the beggining and the '$' at the end, if not the regex could match part of the string and return an object.

Hope this helps :)

寂寞花火° 2024-12-03 11:05:29

我在c#中解决了同样的问题。
这就是我构建示例集的方式:
对于 CSV 中的每一列,我选择了具有最长值的行以及具有最短值的行。
然后,我构建了一个包含前 50 个非空行的数组。
所以我的样本至少有 0 行,最多 50 行,覆盖了一列中的整个范围。
之后,我会尝试从最宽的定义解析到最窄的定义:

if (value is String) then thisType = String;

如果(值为 DateTime)则 thisType 为 DateTime;

如果(值为 Decimal)则 thisType 为 Decimal;

如果(值为 Integer)则 thisType 为 Integer;

if (value is Boolean) then thisType is Boolean; 如果 (value is Boolean) then thisType is Boolean;

我在 C# 中使用 TryParse ,但我确信其他语言也会有类似的方法使用。

I solved the same problem in c#.
This is how I built the sample set:
For every column in the CSV, I selected the row with the longest value, as well as the row with the shortest value.
I then built an array with the 1st 50 non-empty rows.
So my samples had at least 0 and at most 50 rows which covered the whole range in a column.
After that, I would try to parse from widest definition to narrowest:

if (value is String) then thisType = String;

if (value is DateTime) then thisType is DateTime;

if (value is Decimal) then thisType is Decimal;

if (value is Integer) then thisType is Integer;

if (value is Boolean) then thisType is Boolean;

I use TryParse in C#, but I am sure other languages would have similar methods to use.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文