Using pandas.read_csv() with malformed CSV data

Posted on 2025-02-10 21:00:12

This is a conceptual question, so there is no code or reproducible example.

I am processing data pulled from a database which contains records from automated processes. The regular record contains 14 fields, with a unique ID, and 13 fields containing metrics, such as the date of creation, the time of execution, the customer ID, the type of job, and so on. The database accumulates records at the rate of dozens a day, and a couple of thousand per month.

Sometimes, the processes result in errors, which result in malformed rows. Here is an example:

id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/

The requirements are to (1) populate a dashboard from the metrics and (2) catalog the types of errors. Ideally, I would use read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two dataframes from the output. Bad lines can be reliably detected by the keyword "failed". I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a pure pandas solution.

Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.

苦笑流年记忆 2025-02-17 21:00:12

You can load your CSV file into a DataFrame and apply a filter:

import pandas as pd

# header=None: the file has no header row
df = pd.read_csv("your_file.csv", header=None)
# True for every row in which at least one cell contains "failed"
df_filter = df.apply(lambda row: row.astype(str).str.contains("failed").any(), axis=1)

df_fail = df[df_filter]   # DataFrame of "failed" rows
df_good = df[~df_filter]  # DataFrame of "non-failed" rows

You need to make sure that the keyword never appears in your regular (non-error) data.

PS: There might be more efficient ways to do this.
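For anyone who wants to try the filter idea without a file on disk, here is a minimal, self-contained sketch. The sample text is hypothetical, modeled on the rows in the question, and the `names=list(range(14))` argument is an addition not in the answer above: it assumes the regular record always has 14 fields, so short rows are padded with NaN even if a malformed row happens to come first. The last step, taking the last non-null cell of each failed row as its error message, is one possible way to approach the "catalog the errors" requirement, not the only one.

```python
import io

import pandas as pd

# Hypothetical sample data modeled on the rows in the question.
SAMPLE = '''id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed"
id3,m01,m02,"NO SUCH JOB error, failed"
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded"
'''

# Assumption: a regular record has exactly 14 fields. Naming 14 columns up
# front makes pandas pad shorter rows with NaN regardless of row order.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=list(range(14)))

# A row counts as "failed" if any of its cells contains the keyword.
is_failed = df.apply(
    lambda col: col.astype(str).str.contains("failed", regex=False)
).any(axis=1)

df_good = df[~is_failed]
df_fail = df[is_failed]

# The error message is the last non-null cell of each failed row.
error_messages = df_fail.apply(lambda row: row.dropna().iloc[-1], axis=1)

# Tally of each distinct error message, e.g. as input for a bar chart.
error_counts = error_messages.value_counts()
```

Here `df_good` can feed the dashboard and `error_counts` the error catalog. Grouping by date would additionally need to know which column holds the creation date, which the question does not specify.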

雨的味道风的声音 2025-02-17 21:00:12

This approach reads the entire CSV into a single column, then uses a mask that identifies the failed rows to split the data into good and failed dataframes.

Read the entire CSV into a single column

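import io
import pandas as pd

# sim_csv: path or file-like object for the raw CSV. One very wide
# fixed-width column keeps every line as a single string.
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)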

Build a mask identifying the failed rows

fail_msk = dfs[0].str.contains('failed')

Use that mask to split out and build separate dataframes

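# Select column 0 explicitly: .squeeze() would collapse a single-row frame to a bare string
# names=list(range(14)) pads short rows, so failed rows parse whatever their order
df_good = pd.read_csv(io.StringIO('\n'.join(dfs.loc[~fail_msk, 0])), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs.loc[fail_msk, 0])), header=None, names=list(range(14)))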
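The same split-then-parse idea can also be sketched without read_fwf, by separating the raw lines in plain Python and handing each group to read_csv. This is only an illustration under a few assumptions: the helper name `split_csv_by_keyword` and its parameters are made up for this example, a regular record is assumed to have 14 fields, and no quoted field is assumed to contain a line break (otherwise splitting on lines would cut a record in half). As far as I know, on_bad_lines would not help with this data: pandas documents a "bad line" as one with too many fields, and the malformed rows here have too few.

```python
import io

import pandas as pd

# Hypothetical sample; a short failed row is placed before a full-length one
# on purpose, to show that the parse does not depend on row order.
SAMPLE = '''id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13
id3,m01,m02,"NO SUCH JOB error, failed"
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed"
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded"
'''


def split_csv_by_keyword(text, keyword="failed", n_fields=14):
    """Return (good, failed) DataFrames, splitting raw lines on a keyword."""
    good_lines, failed_lines = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        (failed_lines if keyword in line else good_lines).append(line)

    def parse(lines):
        columns = list(range(n_fields))
        if not lines:
            # Nothing to parse: return an empty frame with the same columns.
            return pd.DataFrame(columns=columns)
        # Fixed column names pad short rows with NaN.
        return pd.read_csv(io.StringIO("\n".join(lines)), header=None, names=columns)

    return parse(good_lines), parse(failed_lines)


df_good, df_fail = split_csv_by_keyword(SAMPLE)
```

For a file on disk, the text could come from something like `pathlib.Path("your_file.csv").read_text()`. So read_csv itself returns one dataframe per call, but two calls on pre-split lines give the two dataframes the question asks for.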
