清理使用逗号分隔符的 csv 的最佳方法使用 Excel 的双引号 (") 文本限定符。带有额外的数据错误使数据进入错误的字段

发布于 2025-01-10 17:01:44 字数 670 浏览 3 评论 0 原文

我一直遇到数据位于错误字段的问题。我有一些大型 csv 文件,在加载到 QLIK 之前必须手动更新它们。 csv 有一个逗号 (,) 分隔符 &双引号 (") 文本限定符。某些数据具有额外的字符,会导致文本字段中出现数字,反之亦然。有人可以建议解决此问题的最佳/最快方法吗?删除不需要的 " 并救救我从手动删除引号并粘贴到数百条记录的正确字段。我在下面创建了虚拟数据。

请注意,我可用于清理 csv 的工具有点有限。或者您能否建议为此所需的最佳工具/应用程序?只是不确定从哪里开始

记事本:

ID,T_No,T_Type,T_Date,T_Name,T_TNo,

2,256,House,30/05/2021,Airport,75.1,

3,268,Hotel,31/05/2021,Hotel Antel""",76.1

4,269,House,31/05/2021,Bank of USA,"LA Branch""""",77.1

EXCEL: [在此处输入图像描述][1]

非常感谢任何帮助。

谢谢 [1]: https://i.sstatic.net/vyYAT.png

I have been having issues where my data is in the wrong fields. I have a few large csv files that I have to manually update before loading into QLIK. The csv's have a comma (,) delimiter & double quote (") text qualifier. Some data has extra characters that throw it off and results in numeric numbers in text fields and vice versa. Can someone please advise the best/fastest way to combat this? To remove the unwanted " and save me from manually deleting quotes and pasting to correct fields for hundreds of records. I have created dummy data below.

Please note I am bit limited with the tools I have available to clean the csv. Or could you please advise the best tools/applications needed for this? Just unsure where to start

IN NOTEPAD:

ID,T_No,T_Type,T_Date,T_Name,T_TNo,

2,256,House,30/05/2021,Airport,75.1,

3,268,Hotel,31/05/2021,Hotel Antel""",76.1

4,269,House,31/05/2021,Bank of USA,"LA Branch""""",77.1

IN EXCEL:
[enter image description here][1]

Any assistance is greatly appreciated.

Thank you
[1]: https://i.sstatic.net/vyYAT.png

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

亢潮 2025-01-17 17:01:44

如果问题仅出在 T_Name 列,您可以将模式设置为 CsvMode.NoEscape,使用 ClassMap 获取您知道的字段您可以毫无问题地获取,然后使用一些逻辑来确定 T_Name 列的结束位置和 T_TNo 列的开始位置。这段代码中有很多地方可能会被破坏,具体取决于其余数据的样子,但它至少应该给你一些想法。

void Main()
{
    var text = new StringBuilder();
    text.AppendLine("ID,T_No,T_Type,T_Date,T_Name,T_TNo,");
    text.AppendLine("2,256,House,30/05/2021,Airport,75.1,");
    text.AppendLine("3,268,Hotel,31/05/2021,Hotel Antel\"\"\",76.1");
    text.AppendLine("4,269,House,31/05/2021,Bank of USA,\"LA Branch\"\"\"\"\",77.1");
    
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        Mode = CsvMode.NoEscape
    };
    
    using (var reader = new StringReader(text.ToString()))
    using (var csv = new CsvReader(reader, config))
    {
        var options = new TypeConverterOptions { Formats = new[] { "dd/MM/yyyy" } };
        csv.Context.TypeConverterOptionsCache.AddOptions<DateTime>(options);
        csv.Context.RegisterClassMap<MyClassMap>();
        
        var records = new List<MyClass>();
        
        csv.Read();
        csv.ReadHeader();
        
        while (csv.Read())
        {
            var record = csv.GetRecord<MyClass>();
            var name = string.Empty;
            
            int i = 4;
            var finished = false;

            while (!finished)
            {
                var field = csv.GetField(i);

                if (i == 4)
                {
                    record.Name = field.Replace("\"", "");
                    i++;
                    continue;
                }
                
                var isNumber = float.TryParse(field, out var number);

                if (!isNumber)
                {
                    record.Name += ", " + field.Replace("\"", "");
                    i++;
                    continue;
                }
                
                record.TNumber = number;
                finished = true;
            }
            
            records.Add(record);
        }
        records.Dump();
    }
}

public class MyClassMap : ClassMap<MyClass>
{
    public MyClassMap()
    {
        Map(x => x.Id).Name("ID");
        Map(x => x.Number).Name("T_No");
        Map(x => x.Type).Name("T_Type");
        Map(x => x.Date).Name("T_Date");
    }
}

public class MyClass
{
    public int Id { get; set; }
    public int Number { get; set; }
    public string Type { get; set; }
    public DateTime Date { get; set; }
    public string Name { get; set; }
    public float TNumber { get; set; }
}

If the issue is just with the T_Name column, you could set the mode to CsvMode.NoEscape, use the ClassMap to get the fields you know you can get without issue and then use some logic to figure out where the T_Name column ends and the T_TNo column starts. There is a lot that could break in this code, depending on what the rest of the data looks like, but it should at least give you some ideas.

void Main()
{
    var text = new StringBuilder();
    text.AppendLine("ID,T_No,T_Type,T_Date,T_Name,T_TNo,");
    text.AppendLine("2,256,House,30/05/2021,Airport,75.1,");
    text.AppendLine("3,268,Hotel,31/05/2021,Hotel Antel\"\"\",76.1");
    text.AppendLine("4,269,House,31/05/2021,Bank of USA,\"LA Branch\"\"\"\"\",77.1");
    
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        Mode = CsvMode.NoEscape
    };
    
    using (var reader = new StringReader(text.ToString()))
    using (var csv = new CsvReader(reader, config))
    {
        var options = new TypeConverterOptions { Formats = new[] { "dd/MM/yyyy" } };
        csv.Context.TypeConverterOptionsCache.AddOptions<DateTime>(options);
        csv.Context.RegisterClassMap<MyClassMap>();
        
        var records = new List<MyClass>();
        
        csv.Read();
        csv.ReadHeader();
        
        while (csv.Read())
        {
            var record = csv.GetRecord<MyClass>();
            var name = string.Empty;
            
            int i = 4;
            var finished = false;

            while (!finished)
            {
                var field = csv.GetField(i);

                if (i == 4)
                {
                    record.Name = field.Replace("\"", "");
                    i++;
                    continue;
                }
                
                var isNumber = float.TryParse(field, out var number);

                if (!isNumber)
                {
                    record.Name += ", " + field.Replace("\"", "");
                    i++;
                    continue;
                }
                
                record.TNumber = number;
                finished = true;
            }
            
            records.Add(record);
        }
        records.Dump();
    }
}

public class MyClassMap : ClassMap<MyClass>
{
    public MyClassMap()
    {
        Map(x => x.Id).Name("ID");
        Map(x => x.Number).Name("T_No");
        Map(x => x.Type).Name("T_Type");
        Map(x => x.Date).Name("T_Date");
    }
}

public class MyClass
{
    public int Id { get; set; }
    public int Number { get; set; }
    public string Type { get; set; }
    public DateTime Date { get; set; }
    public string Name { get; set; }
    public float TNumber { get; set; }
}
霊感 2025-01-17 17:01:44

如果您可以使用 C#(有免费版本),您可以处理该文件并修复不良记录。我会首先弄清楚是否存在问题,如果存在则再解决。找出名称字段的开始和结束位置,并解决引号问题。

这将是一个很好的起点:

private void UpdateCsv()
{
    var lines = System.IO.File.ReadAllLines("your file");
    var updatedLines = new List<string>();

    foreach (var line in lines)
    {
        //fixes issue with your first example
        var newLine = line.TrimEnd(',');
        var fixedString = "";

        if (newLine.Split(",").Length == 6)  //indicates there are no issues
        {
            fixedString = newLine;
        }
        else
        {

            //get the start of the name field
            var startName = IndexOfOccurence(newLine, ",", 4) + 1;
            
            //get the end of the name field
            var endName = newLine.LastIndexOf(',') + 1;

            //populate a new string to hold the fixed data
            fixedString = newLine.Substring(0, startName);

            //parse the name field based on start and end
            var name = newLine.Substring(startName, endName - startName - 1);

            //get rid of starting and ending quotes
            name = name.TrimStart('"').TrimEnd('"');

            //what to do with quote in middle of string? escape or remove your choice uncomment your choice
            //name = name.Replace('"', ' '); //to remove
            //name = name.Replace("\"", "\"\""); //to escape

            //if name contains comma or quote then add quote, else not needed
            if (name.Contains(',') || name.Contains('"'))
            {
                fixedString += "\"" + name + "\"" + newLine.Substring(endName - 1);
            }
            else
            {
                fixedString += name + newLine.Substring(endName - 1);
            }
        }

        updatedLines.Add(fixedString);
    }

    //write out the updated data
    System.IO.File.WriteAllLines("your file", updatedLines);

}

private int IndexOfOccurence(string s, string match, int occurence)
{
    int i = 1;
    int index = 0;

    while (i <= occurence && (index = s.IndexOf(match, index + 1)) != -1)
    {
        if (i == occurence)
            return index;

        i++;
    }

    return -1;
} 

If you have access to C# (there is a free version) you could process the file and fix the bad records. I would do that by figuring out first if there is an issue, and if there is then. Figure out where the name field starts and ends and fix the issues with the quotes.

This would be a good starting point:

private void UpdateCsv()
{
    var lines = System.IO.File.ReadAllLines("your file");
    var updatedLines = new List<string>();

    foreach (var line in lines)
    {
        //fixes issue with your first example
        var newLine = line.TrimEnd(',');
        var fixedString = "";

        if (newLine.Split(",").Length == 6)  //indicates there are no issues
        {
            fixedString = newLine;
        }
        else
        {

            //get the start of the name field
            var startName = IndexOfOccurence(newLine, ",", 4) + 1;
            
            //get the end of the name field
            var endName = newLine.LastIndexOf(',') + 1;

            //populate a new string to hold the fixed data
            fixedString = newLine.Substring(0, startName);

            //parse the name field based on start and end
            var name = newLine.Substring(startName, endName - startName - 1);

            //get rid of starting and ending quotes
            name = name.TrimStart('"').TrimEnd('"');

            //what to do with quote in middle of string? escape or remove your choice uncomment your choice
            //name = name.Replace('"', ' '); //to remove
            //name = name.Replace("\"", "\"\""); //to escape

            //if name contains comma or quote then add quote, else not needed
            if (name.Contains(',') || name.Contains('"'))
            {
                fixedString += "\"" + name + "\"" + newLine.Substring(endName - 1);
            }
            else
            {
                fixedString += name + newLine.Substring(endName - 1);
            }
        }

        updatedLines.Add(fixedString);
    }

    //write out the updated data
    System.IO.File.WriteAllLines("your file", updatedLines);

}

private int IndexOfOccurence(string s, string match, int occurence)
{
    int i = 1;
    int index = 0;

    while (i <= occurence && (index = s.IndexOf(match, index + 1)) != -1)
    {
        if (i == occurence)
            return index;

        i++;
    }

    return -1;
} 
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文