Data type recognition/guessing of CSV data in Python

Posted on 2024-11-26 11:05:29


My problem is in the context of processing data from large CSV files.

I'm looking for the most efficient way to determine (that is, guess) the data type of a column based on the values found in that column. I'm potentially dealing with very messy data. Therefore, the algorithm should be error-tolerant to some extent.

Here's an example:

arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']               # ==> recognize as float
arr2 = ['1', '11', '-1345.67', '0', '22']                    # ==> recognize as int
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011'] # ==> recognize as date
arr4 = ['Dog', 'Cat', '0.13', 'Mouse']                       # ==> recognize as str

Bottom line: I'm looking for a Python package or an algorithm that can detect either

the schema of a CSV file, or even better
the data type of an individual column as an array

The question "Method for guessing type of data currently represented as strings" goes in a similar direction.
I'm worried about performance, though, since I'm possibly dealing with many large spreadsheets (which is where the data stems from).


网名女生简单气质 2024-12-03 11:05:29

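You may be interested in the messytables Python library, which does exactly this kind of type guessing on CSV and XLS files for you.

It happily scales to very large files, to streaming data off the internet, and so on.

There is also an even simpler wrapper library that includes a command-line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164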


风月客 2024-12-03 11:05:29

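After putting some thought into it, this is how I would design the algorithm myself:

For performance reasons: take a sample of each column (say, 1%)
Run a regex match on each cell in the sample, checking for the data type
Choose the appropriate data type for the column based on the frequency distribution

The two questions that arise:

What is a sufficient sample size? For small data sets? For large data sets?
What is a high enough threshold for selecting a data type based on the frequency distribution?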
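The three steps above could be sketched roughly as follows. This is only an illustration, not a tested answer to the two open questions: the patterns, the function names `guess_cell_type` and `guess_column_type`, the 1% sample fraction, the minimum sample of 100 cells, and the 0.6 threshold are all assumptions chosen so that the four example arrays from the question come out as expected.

```python
import random
import re
from collections import Counter

# Hypothetical patterns, tried in order; the first one that matches
# the whole cell decides the cell's type.
PATTERNS = [
    ("int", re.compile(r"-?\d+")),
    ("float", re.compile(r"-?\d*\.\d+")),
    ("date", re.compile(
        r"\d{1,2}/\d{1,2}/\d{4}"
        r"|[A-Za-z]{3} \d{1,2} \d{4}, \d{2}:\d{2}:\d{2}")),
]


def guess_cell_type(cell):
    """Return the label of the first pattern matching the whole cell, else 'str'."""
    text = cell.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "str"


def guess_column_type(values, sample_fraction=0.01, min_sample=100,
                      threshold=0.6, seed=0):
    """Guess a column type from a random sample of its non-empty cells."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "str"
    # Sample a fraction of the column, but never fewer than min_sample
    # cells (or the whole column if it is smaller than that).
    size = min(len(non_empty),
               max(min_sample, int(len(non_empty) * sample_fraction)))
    sample = random.Random(seed).sample(non_empty, size)
    # Frequency distribution of the per-cell guesses.
    counts = Counter(guess_cell_type(v) for v in sample)
    label, hits = counts.most_common(1)[0]
    # Fall back to 'str' when no type is dominant enough.
    return label if hits / len(sample) >= threshold else "str"


arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011']
print(guess_column_type(arr1), guess_column_type(arr3))
```

Empty cells are skipped before sampling, so a sparse column is judged only on the values it actually contains. A stray value such as '-' or '-1345.67' is tolerated as long as the dominant type stays above the threshold.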


陌上青苔 2024-12-03 11:05:29

Maybe csvsql could be useful here? I have no idea how efficient it is, but it definitely gets the job done for generating SQL CREATE TABLE statements out of CSV files.

$ csvsql so_many_columns.csv  >> sql_create_table_with_char_types.txt


◇流星雨 2024-12-03 11:05:29

You could try a pre-parse using regular expressions. For example:

import re

# Anchored pattern: optional sign, digits, a literal dot, digits
pattern = re.compile(r'^-?\d+\.\d+$')
data = '123.42'
print(pattern.match(data))   # ----> match object
data2 = 'NOT123.42GONNA31.4HAPPEN'
print(pattern.match(data2))  # ----> None

This way you could build a dictionary of regexes and try each of them until you find a match:

myregex = {int: r'^-?\d+$', float: r'^-?\d+\.\d+$'}  # .... (further types go here)

for key, reg in myregex.items():
    to_del = []
    for index, data in enumerate(arr1):
        if re.match(reg, data):
            d = key(data)  # You will need to convert data differently depending on the type
            # .... ---> do something with d
            to_del.append(data)  # ---> delete this from arr1 when you can

Don't forget the '^' at the beginning and the '$' at the end; without them the regex could match only part of the string and still return a match object.

Hope this helps :)
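To illustrate the point about '^' and '$', here is a small sketch comparing an unanchored and an anchored version of a float pattern on a value with trailing junk. The variable names are made up for this example, and `re.fullmatch` is shown as an alternative from the standard library that was not part of the answer.

```python
import re

unanchored = re.compile(r"-?\d+\.\d+")
anchored = re.compile(r"^-?\d+\.\d+$")

value = "123.42abc"

# match() only anchors at the start, so the prefix "123.42" is accepted.
print(bool(unanchored.match(value)))      # True
# With '$' the trailing "abc" makes the match fail.
print(bool(anchored.match(value)))        # False
# fullmatch() requires the whole string to match, even without anchors.
print(bool(unanchored.fullmatch(value)))  # False
```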
