Parsing a large CSV file, handling commas and quotes

Posted 2024-09-26 22:14:43

I need to load a large CSV file (>1 MB) and parse it.
Normally this is easy: split on line breaks first, then on commas.
The problem is that some entries contain strings with commas of their own. When the spreadsheet is converted to CSV, the entries containing commas are wrapped in quotes.

I've written a parser that first escapes all the commas inside these quoted strings, then splits the text on line breaks and then on commas, and finally unescapes the values again.

This is quite slow for such a long string, as I need to iterate through the whole string.
Does anyone know a faster or more optimised way of dealing with this?

Comments (3)

情魔剑神 2024-10-03 22:14:43

Have you had a look at csvlib yet? It is a parser library for ActionScript 3. It claims to be designed to properly handle quoted strings.

Hopefully, you are already enclosing your strings in quotes, especially the ones containing the commas. CSV parsers cannot distinguish a comma that is part of a string from a comma that separates two strings, unless the strings have quotes around them.

Good:

    "This string, has a comma", "This string doesn't"

Bad:

    This string, has a comma, this string doesn't
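
To make the difference concrete, here is a small sketch in Python (not ActionScript, and not csvlib itself) that runs the "Good" line through a naive comma split and through Python's standard `csv` module. The variable names are only illustrative. `skipinitialspace=True` is used because the example line has a space after the separating comma.

```python
import csv

# The "Good" line from above: both strings are wrapped in quotes.
line = '"This string, has a comma", "This string doesn\'t"'

# Naive approach: splitting on every comma also cuts inside the first string.
naive = line.split(",")

# Quote-aware approach: the comma inside the quotes stays part of the field.
# skipinitialspace=True ignores the space that follows the separating comma.
fields = next(csv.reader([line], skipinitialspace=True))

print(len(naive))   # number of pieces from the naive split
print(fields)       # fields from the quote-aware parse
```

The naive split yields three pieces, while the quote-aware parse yields the two intended fields with the quotes removed. For the "Bad" line no parser can recover the intent, since nothing marks which comma belongs to the text.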
听闻余生 2024-10-03 22:14:43

Processing the file in a single pass will reduce the time. This can be achieved by using a simple state machine to handle the complexity of commas embedded in the values.
Regards
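
As a rough illustration of the single-pass idea, here is a minimal state machine sketched in Python (the thread itself is about ActionScript, so treat this as a model to port rather than a drop-in). The function name `parse_csv` and its parameters are hypothetical. It assumes quotes inside a quoted field are escaped by doubling them (`""`), and it does not trim whitespace around fields.

```python
def parse_csv(text, delimiter=",", quote='"'):
    """Parse CSV text in one pass and return a list of rows (lists of strings)."""
    rows, row, field = [], [], []
    in_quotes = False   # state: are we inside a quoted field?
    started = False     # has the current record consumed any character?
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    field.append(quote)   # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False     # closing quote
            else:
                field.append(ch)          # commas and line breaks are literal here
        elif ch == quote:
            in_quotes = True
            started = True
        elif ch == delimiter:
            row.append("".join(field))    # end of field
            field = []
            started = True
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1                    # treat CRLF as one line break
            row.append("".join(field))    # end of field and end of record
            rows.append(row)
            row, field, started = [], [], False
        else:
            field.append(ch)
            started = True
        i += 1
    if started:                           # last record without a trailing line break
        row.append("".join(field))
        rows.append(row)
    return rows


rows = parse_csv('name,note\r\n"Smith, J","said ""hi"""\nx,\n')
print(rows)
```

Each character is examined exactly once, so there is no separate escape pass, split pass, and unescape pass over the whole string.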

迷你仙 2024-10-03 22:14:43
  • Add a reference to the Microsoft.VisualBasic assembly (yes, it says
    VisualBasic, but it works in C# just as well; remember that in the
    end it is all just IL)
  • Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the
    CSV file

Here is the sample code:

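    Imports Microsoft.VisualBasic.FileIO

    ' Inside a method. Using closes the parser even if an exception is thrown.
    Using parser As New TextFieldParser("C:\mar0112.csv")
        parser.TextFieldType = FieldType.Delimited
        parser.SetDelimiters(",")
        ' Quoted fields may contain commas (True is also the default)
        parser.HasFieldsEnclosedInQuotes = True

        While Not parser.EndOfData
            ' Read one row as an array of field values
            Dim fields() As String = parser.ReadFields()
            For Each field As String In fields
                ' TODO: Process field
            Next
        End While
    End Using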