How to read CSV files with varying numbers of columns using Python

Posted 2025-01-30 19:07:51

import glob
import pandas as pd

files = glob.glob("Data/*.csv")
df = pd.concat(pd.read_csv(f) for f in files)
print(df)

I get an error that says: "ParserError: Error tokenizing data. C error: Expected 39 fields in line 273, saw 40". Then, as per this question: import csv with different number of columns per row using Pandas, I tried passing in the names of the columns and using StringIO
and BytesIO, but got errors like: "TypeError: initial_value must be str or None, not list" or "TypeError: a bytes-like object is required, not 'list'". I am looking at over 20 CSV files.
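One way around this ParserError is to tell `read_csv` up front how many columns the widest row has, so the parser never encounters an unexpected 40th field. A minimal, self-contained sketch (the inline data here is a made-up stand-in for one of the ragged files):

```python
import io
import pandas as pd

# A tiny stand-in for a ragged file: the last row has an extra field.
raw = "a,b,c\n1,2,3\n4,5,6,7\n"

# Count the fields in each line to find the widest row, then supply
# that many column names so no row has more fields than expected.
col_count = max(len(line.split(",")) for line in raw.splitlines())
df = pd.read_csv(io.StringIO(raw), header=None, names=range(col_count))
print(df.shape)  # (3, 4): shorter rows are padded with NaN
```

Rows with fewer fields are padded with NaN on the right, so every file parses without error.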

Comments (2)

小梨窩很甜 2025-02-06 19:07:51

It looks like you have not tried all the solutions, as you actually had an answer in the link you shared:
https://stackoverflow.com/a/57824142/8805842
If you inspect the last row/last column cell in your .csv file, you will see why you get the error.

Solution (a simple copy/paste from your question's link), with 2 more rows added to remove unwanted/empty columns:

    import pandas as pd

    ### Loop over the data lines
    with open("storm_data_search_results.csv", "r") as temp_f:
        # get the number of columns in each line
        col_count = [len(l.split(",")) for l in temp_f.readlines()]

    ### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
    column_names = [i for i in range(max(col_count))]

    ### Read the csv
    df = pd.read_csv("storm_data_search_results.csv", header=None, delimiter=",", names=column_names)

    # my addition
    df.columns = df.iloc[0]   # use the first row as headers
    df = df.iloc[1:, 0:39]    # drop the header row and keep only the named columns
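The snippet above handles a single file; for the questioner's 20+ files, the same per-file column count can be computed before each `read_csv` call and the frames concatenated. A sketch with two small made-up files of different widths:

```python
import glob
import os
import tempfile
import pandas as pd

# Two stand-in files with different column counts
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "a.csv"), "w") as fh:
    fh.write("x,y\n1,2\n")
with open(os.path.join(tmp, "b.csv"), "w") as fh:
    fh.write("x,y,z\n3,4,5\n")

frames = []
for f in sorted(glob.glob(os.path.join(tmp, "*.csv"))):
    # Find the widest row of this file before parsing it
    with open(f) as fh:
        col_count = max(len(line.split(",")) for line in fh)
    frames.append(pd.read_csv(f, header=None, names=range(col_count)))

# Columns 0..max align across files; missing cells become NaN
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (4, 3): two header rows + two data rows, 3 columns
```

Since the files are read with `header=None`, the header line of each file becomes a data row; it can be dropped afterwards, as in the answer above.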

Update
OMG,
be careful... the data they provide in the .csv is actually not structured properly... just scroll all the way down.
If you can use any other source, use it; otherwise, if you do not need the "comments", you can drop them.

白鸥掠海 2025-02-06 19:07:51

Assuming that the problem comes from text fields that are multiline and can easily get messed up...
...you can remove them using a RegEx: re.subn(r'(".*?")', "_______________", xx, xx.count('"'), re.DOTALL)

Also, assuming constant headers in all files, you can process everything as text and then parse it once.


import glob
import re
from io import StringIO
import pandas as pd

files = glob.glob("Data/*.csv")

# Read headers from the first file
headers = open(files[0]).read().split('\n', 1)[0].split(',')

# Read all files and remove their header lines
xx = [open(ff).read().split('\n', 1)[1] for ff in files]

# Remove the comments fields
dd = [re.sub(r'(".*?")', "__", x, count=x.count('"'), flags=re.DOTALL) for x in xx]

# Load as CSV
df = pd.read_csv(StringIO(''.join(dd)), names=headers)
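To see why this works, here is the same regex substitution applied to a single made-up record whose quoted "comment" field spans two lines and would otherwise be split into two bogus CSV rows:

```python
import re
from io import StringIO
import pandas as pd

# One record whose quoted field spans two physical lines
body = 'id,comment,score\n1,"first line\nsecond line",5\n'

header, rest = body.split("\n", 1)
# re.DOTALL lets ".*?" match across the embedded newline, so the whole
# quoted field (newline included) collapses into the placeholder "__"
cleaned = re.sub(r'(".*?")', "__", rest, count=rest.count('"'), flags=re.DOTALL)
df = pd.read_csv(StringIO(cleaned), names=header.split(","))
print(df)  # a single row: 1, __, 5
```

Note that pandas can also handle quoted multiline fields natively when the quoting is well-formed; the substitution above is only needed because the source data is messy.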