How to count how many delimiters there are in each row in Python

Published 2025-02-11 18:06:41 · 2115 words · 1 view · 0 comments


I produce a query with 13 columns of values. Every single one of these values is entered manually. That means there is roughly a 10% chance or less that a row is entered incorrectly. However, that is not the issue. The issue is that certain special characters are sometimes entered that can wreak havoc on the database. I need to filter/remove this content from the CSV file.

Here is a simple sample of the CSV file output:

TypeOfEntry;Schoolid;Schoolyear;Grade;Classname;ClassId;firstname;lastname;Gender;nationality;Street;housenumber;Email;

;;;;;;;;;;;;; (first data line empty, 13 semicolons per row)

U;98645;2022;4;4AG;59845;John;Bizley;Male;United Kingdom;Canterburrystreet; 15a; [email protected];
U;98645;2022;4;4AG;59847;Alice;Schmidt;Female;United Kingdom;Milton street; 2/3; [email protected];

Now, on rare occasions, someone might want to add a second email. This is not allowed, but they do it anyway, and what's worse, they add a semicolon with it. That means that when the CSV is loaded, some rows exceed 13 columns.

U;98645;2022;5;6CD;59845;Billy;Snow;Male;United Kingdom;Freedom street; 2a; [email protected];[email protected];

Therefore, to solve this problem, I need to count the number of delimiters in each row, and if I find a row that exceeds that count, I need to clear the excess data, even if it means losing that data for that particular person. In other words, everything after the 13th column needs to be removed.
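One direct way to do this is to count the separators in each raw line before pandas ever sees the file. A minimal sketch, assuming every valid row ends with a trailing `;` as in the sample above (the helper name `truncate_row` and the sample addresses are illustrative, not from the original post):

```python
EXPECTED = 13  # every valid row carries exactly 13 semicolons, one per field

def truncate_row(line: str, expected: int = EXPECTED) -> str:
    """Drop every field beyond the expected count, keeping the trailing separator."""
    if line.count(";") <= expected:
        return line
    return ";".join(line.split(";")[:expected]) + ";"

# A row with an illegal second email and its extra semicolon:
bad = "U;98645;2022;5;6CD;59845;Billy;Snow;Male;United Kingdom;Freedom street; 2a; a@b.com;c@d.com;"
print(truncate_row(bad))  # everything after the 13th field is gone
```

Applying this line by line (e.g. while reading the file) guarantees pandas never sees more than 13 columns.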

Here is my code sample in Python. You will also notice that I am filtering other special characters from the CSV file.

import pandas as pd
from datetime import datetime

data = pd.read_csv("schooldata.csv", sep=';')
data.columns = ['TypeOfEntry', 'Schoolid', 'Schoolyear', 'Grade', 'Classname', 'ClassId',
                'Firstname', 'Lastname', 'Gender', 'Nationality', 'Street', 'Housenumber', 'Email']
date = datetime.now().strftime("%Y_%m_%d")

data = data.convert_dtypes()

# Raw strings avoid invalid-escape warnings; '^' must be escaped to match a literal caret.
rep_chars = r'°|\^|!|"|\(|\)|\?'
rep_chars2 = r"'|`|´|\*|#"

data = data.replace(rep_chars, '', regex=True)
data = data.replace(rep_chars2, '', regex=True)
data = data.replace(r'\+', '-', regex=True)

print(data.head())
print(data.dtypes)
data.to_csv(f'scoolexport_{date}.csv', sep=';', date_format='%Y%m%d', index=False)
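As a quick check of what those character filters actually do, the same `replace` calls can be run on a tiny in-memory frame (the sample values below are made up for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    "Street": ["Freedom° street!", "Milton (street)?"],
    "Housenumber": ["2a", "15+b"],
})

# Same patterns as above: strip the forbidden characters, map '+' to '-'.
sample = sample.replace(r'°|\^|!|"|\(|\)|\?', '', regex=True)
sample = sample.replace(r'\+', '-', regex=True)
print(sample.values.tolist())  # [['Freedom street', '2a'], ['Milton street', '15-b']]
```

With `regex=True`, `DataFrame.replace` substitutes every match inside each string value rather than requiring a whole-cell match.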



Comments (1)

木格 2025-02-18 18:06:41


A very basic approach, but maybe it will be enough:

import pandas as pd

df = pd.read_csv(r"C:\Test\test.csv", sep=';')

data = df.iloc[:, :13].copy()  # the first 13 columns, to use in later code

# Any excess data lands after column 13.
excessive_data = df.iloc[:, 13:].copy().reset_index(drop=True)
if not excessive_data.empty:
    # Rows where at least one overflow column holds a value.
    pos = excessive_data[excessive_data.notnull().any(axis=1)].index.tolist()
    print(f"excessive data is present in rows index: {pos}")
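The split above only reports the overflow; to actually enforce the 13-column limit, the kept slice can be exported with the same separator. A sketch using an illustrative 14-column frame (the column names `c0`..`c13` and the in-memory buffer are made up, not from the answer):

```python
import io
import pandas as pd

EXPECTED = 13

# Illustrative frame with one column too many.
cols = [f"c{i}" for i in range(EXPECTED + 1)]
df = pd.DataFrame([list(range(EXPECTED + 1))], columns=cols)

data = df.iloc[:, :EXPECTED].copy()     # keep only the first 13 columns
buf = io.StringIO()
data.to_csv(buf, sep=';', index=False)  # the overflow column never reaches the export
print(buf.getvalue().splitlines()[0])   # header row: c0;c1;...;c12
```

In the real script, `buf` would simply be the output filename passed to `to_csv`.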