Loading data into a DataFrame after finding the first blank line

Posted 2025-01-24 12:57:45


What is the best way to find the first blank line in a file when the input is sometimes a .csv and sometimes a .xls? A blank line is guaranteed to exist, but its row number varies from file to file. Every input file starts with a block of header rows at the top, and the length of that block varies by a line or two, so I cannot simply skip a fixed 4, 5, or 6 rows. My goal is to read everything after that block into a DataFrame: the line right after the first blank line is where the data I want begins. What I am missing is a way to skip this variable number of rows. I have a small function that identifies the file type: it returns True for an xls file and False for a CSV file. In my example file below, the first blank row is row 7.

1: CSV

The code below reads forever, and I have to interrupt execution to make the program quit. A key point: when I run f.readline() and look at the output line by line, the loop passes right over the blank line because it is not '\n' as expected. Instead it is always something like ',,,,,,,,,,\n', and the number of commas is not consistent across my many CSV files. How can I identify this as a blank line without tweaking the code every time the number of commas in the first blank row changes?

import pandas as pd

file = 'input_file.csv'
f = open(file)

while f.readline() not in ('\n'):
    pass

final_df = pd.read_csv(f, header=None)

Example file.

report
random info
more info
Project number  111111
Order number
Plates          Plate1  Plate2  Plate3

DNA \ Assay     id1     id2     id3
Name1           C:C     G:G     T:C
Name2           C:C     G:G     C:C
Name3           C:C     G:G     T:C

Current output of the readline call when it reaches the blank line:

',,,,,,,,,,\n'

final_df expected output

DNA \ Assay     id1     id2     id3
Name1           C:C     G:G     T:C
Name2           C:C     G:G     C:C
Name3           C:C     G:G     T:C

2: XLS

When the files are in xls format, they look exactly the same as the example file above, so that example applies to this case without changes.

My idea for reading the files when they are input as xls is to do the following:

import tempfile

import pandas as pd

df = pd.read_excel(file)

# Round-trip through a temporary CSV file so the rows can be read line by line
f = tempfile.NamedTemporaryFile()
df.to_csv(f)
f.seek(0)
line = str(f.readline()).strip()

The current output of print(line) is:

b',report,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46\n'

I don't want to keep reading the file this way if there is another way to find the first blank line with pd.read_excel.

The expected output is the same as listed above for final_df.

Ideally I would use something like final_df = pd.read_csv(line) to produce final_df, but that does not work.

DNA \ Assay     id1     id2     id3
Name1           C:C     G:G     T:C
Name2           C:C     G:G     C:C
Name3           C:C     G:G     T:C


Comments (1)

柏林苍穹下 2025-01-31 12:57:46


I think the easiest way to handle this, especially considering you might have either csv or xls files, is to read all of the data first and clean it afterwards. Something like this might help, and the same approach works for both formats:

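import pandas as pd

df = pd.read_excel(file)  # use pd.read_csv(file) instead when the input is a CSV
# Index label of the first row whose first column is empty (the blank row)
new_line = min(df[df.iloc[:, 0].isnull()].index)
df.columns = df.iloc[new_line + 1]  # the row after the blank row holds the column names
df = df.iloc[new_line + 2:, :]      # keep only the rows below that header row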

Essentially you read the whole file, find the first empty line, and reconstruct the dataframe starting from the "new_line".
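The same read-then-clean idea can be wrapped in one helper that does not care how many commas the blank row contains. The following is only a sketch under some assumptions: the function name `frame_after_first_blank_row` and the inline sample data are made up for illustration, the file is read with `header=None` so that every row (including the top block) arrives as data, and the blank row is assumed to be followed by a header row and at least one data row.

```python
import io

import pandas as pd


def frame_after_first_blank_row(raw: pd.DataFrame) -> pd.DataFrame:
    """Return the table that starts right after the first all-empty row.

    `raw` is assumed to have been read with header=None.
    """
    blank = raw.isna().all(axis=1)  # True for rows like ',,,,,' or ''
    if not blank.any():
        raise ValueError("no blank row found")
    start = int(blank.to_numpy().argmax()) + 1  # position of the header row
    # Header row plus data rows, without the all-empty padding columns
    table = raw.iloc[start:].dropna(axis=1, how="all")
    body = table.iloc[1:].reset_index(drop=True)
    body.columns = table.iloc[0].tolist()
    return body


# Illustrative sample shaped like the example file, padded with extra commas
sample = (
    "report,,,,,\n"
    "random info,,,,,\n"
    "more info,,,,,\n"
    "Project number,111111,,,,\n"
    "Order number,,,,,\n"
    "Plates,Plate1,Plate2,Plate3,,\n"
    ",,,,,\n"
    "DNA \\ Assay,id1,id2,id3,,\n"
    "Name1,C:C,G:G,T:C,,\n"
    "Name2,C:C,G:G,C:C,,\n"
    "Name3,C:C,G:G,T:C,,\n"
)

# skip_blank_lines=False keeps truly empty lines as all-NaN rows as well
raw = pd.read_csv(io.StringIO(sample), header=None, skip_blank_lines=False)
result = frame_after_first_blank_row(raw)
print(result)
```

For an xls input, the only change should be how `raw` is produced, for example `raw = pd.read_excel(file, header=None)`, with the file-type check from the question deciding which reader to call. Testing for a row where every cell is NaN, rather than comparing against a fixed string, is what removes the dependence on the number of commas.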
