如何计算python中的每行有多少个分隔符
我产生一个具有13列值的查询。这些值的每个单个都是手动输入的。这意味着输入行的可能性大约少于10%。但是,这不是问题。问题有时是输入某些特殊字符,可能会对数据库造成破坏。我需要从CSV文件过滤/删除此内容,
这是CSV文件输出的一个简单示例
TypeOfEntry;Schoolid;Schoolyear;Grade;Classname;ClassId;firstname;lastname;Gender;nationality;Street;housenumber;Email;
;;;;;;;;;;;;; (1st line empty, 13 semicolons per row)
U;98645;2022;4;4AG;59845;John;Bizley;Male;United Kingdom;Canterburrystreet; 15a; [email protected];
U;98645;2022;4;4AG;59847;Alice;Schmidt;Female;United Kingdom;Milton street; 2/3; [email protected];
,有时在极少数情况下,有时有人可能想添加第二封电子邮件,但他们仍然这样做,而更糟的是,更糟糕的是在其中添加半隆。这意味着当加载CSV时,会有超过13列的行。
u; 98645; 2022; 5; 6cd; 59845; Billy; Billy; Snow; Male; Male; United Kingdom; Freedom Street; 2a; /电子邮件保护“ class =” __ cf_email__'data-cfemail =“ 2f6d46464343565c4140581d1f1f1b6f1b6f47405b47405b424e4643014c433014c4042”
> ,如果我确实找到了通过这一数字的行,我也需要清除该超过的数据,即使这意味着为该特定人丢失该数据。因此,这意味着需要删除13列之后的所有内容。
这是我在Python中的代码示例。您还会注意到我正在过滤CSV文件中的其他特殊字符。
import pandas as pd
from datetime import datetime
data = pd.read_csv("schooldata.csv", sep = ';')
data.columns = ['TypeOfEntry','Schoolid','Schoolyear','Grade','Classname','ClassId','Firstname','Lastname','Gender','Nationality','Street','Housenumber','Email']
date = datetime. now(). strftime("%Y_%m_%d")
data = data.convert_dtypes()
#df = data.dataframe()
rep_chars = '°|^|!|"|\(|\)|\?'
rep_chars2 = r'\'|\`|\´|\*|#'
data = data.replace(rep_chars, '', regex=True)
data = data.replace(rep_chars2, '', regex=True)
data = data.replace('\+', '-', regex=True)
print(data.head())
print(data .dtypes)
data.to_csv(f'scoolexport_{date}.csv', sep= ';', date_format='%Y%m%d', index=False)
I produce a query with 13 columns of values. Every single ones of these values are manually entered. That means there is roughly less than 10% chance that the rows entered are wrong. However that is not the issue. the issue is sometimes certain special characters are entered that can cause havoc to the database. I need to filter/remove this content from the CSV file
Here is a simple sample of the output of the CSV file
TypeOfEntry;Schoolid;Schoolyear;Grade;Classname;ClassId;firstname;lastname;Gender;nationality;Street;housenumber;Email;
;;;;;;;;;;;;; (1st line empty, 13 semicolons per row)
U;98645;2022;4;4AG;59845;John;Bizley;Male;United Kingdom;Canterburrystreet; 15a; [email protected];
U;98645;2022;4;4AG;59847;Alice;Schmidt;Female;United Kingdom;Milton street; 2/3; [email protected];
Now in rare occasions sometimes someone might want to add a second email which is not allowed but they still do it and whats worse they add a semicolon to it. Meaning that when the csv is loaded there are rows that surpass 13 columns.
U;98645;2022;5;6CD;59845;Billy;Snow;Male;United Kingdom;Freedom street; 2a; [email protected];[email protected];
Therefore to solve this problem I need to count the number of deliemters there are in each row, and if I do find a row that passed that count, I need to clear that excessve data even if it means losing that data for that particular person. So that means everything after the 13 column needs to be removed.
Here is my code sample in python. You will also notice that I am filtering other special characters from the csv file.
import pandas as pd
from datetime import datetime
data = pd.read_csv("schooldata.csv", sep = ';')
data.columns = ['TypeOfEntry','Schoolid','Schoolyear','Grade','Classname','ClassId','Firstname','Lastname','Gender','Nationality','Street','Housenumber','Email']
date = datetime. now(). strftime("%Y_%m_%d")
data = data.convert_dtypes()
#df = data.dataframe()
rep_chars = '°|^|!|"|\(|\)|\?'
rep_chars2 = r'\'|\`|\´|\*|#'
data = data.replace(rep_chars, '', regex=True)
data = data.replace(rep_chars2, '', regex=True)
data = data.replace('\+', '-', regex=True)
print(data.head())
print(data .dtypes)
data.to_csv(f'scoolexport_{date}.csv', sep= ';', date_format='%Y%m%d', index=False)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
非常非常基本的侵犯,但也许足够:
very very basic aproach, but maybe will be enough: