CSV line continuation character to ignore a line break

Posted on 2025-01-11 21:16:04

I'm using Python to parse a .csv file in which most values contain line breaks. This isn't an issue, since values are enclosed in double quotes (").

However, I've noticed that during the construction of the .csv file at one point in time, long values were split into multiple lines (but kept within the same value), with an = character put at the end of one line to signify "the following line break is actually a concatenation". A minimal working example: the value

Hello, world!
How are you today?

could be represented as

"Hello, world!\n
How are you t=\n
oday?"

where \n denotes the one-byte line break character.

Does CSV have the concept of "line continuation characters"? The documentation of Python's csv library does not mention anything about it under the formatting section, and hence I wonder if this is common practice and if Python nevertheless has support. I know how to write a parser that concatenates these lines (a simple str.replace(v,"=\n","") probably suffices), but I'm just curious whether this is an idiosyncrasy of my file.
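For illustration, here is a minimal sketch of the workaround mentioned above. It assumes the file content looks exactly like the example; `raw` is hypothetical sample data, not taken from the actual file. It shows that the csv module returns the = and the line break as ordinary field content, and that a plain replace joins the two halves again.

```python
import csv
import io

# Hypothetical sample mirroring the example above: one quoted field with
# a real line break and one "=\n" continuation.
raw = '"Hello, world!\nHow are you t=\noday?"\n'

# newline="" leaves embedded line breaks untouched so the csv module
# can handle them inside the quoted field.
rows = list(csv.reader(io.StringIO(raw, newline="")))
value = rows[0][0]

# The "=" and the following line break are still part of the value.
print(repr(value))

# Naive fix: drop every "=" that is immediately followed by a line break.
joined = value.replace("=\n", "")
print(repr(joined))
```

This only removes the continuation markers; it does nothing about any other escaping the file might contain.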

Comments (1)

苦行僧 2025-01-18 21:16:04

This seems to be a feature not of CSV but of MIME (and since my dataset consists of e-mails, this answers my question).

This usage of equals characters is part of quoted-printable encoding, and can be handled by the quopri Python module. See this answer for more details.

Using this module is better than a simple str.replace(v, "=\n", ""), because e-mails can contain other quoted-printable tokens that need decoding and do not appear on line ends (e.g. =09 to represent a horizontal tab). With quopri, you would write:

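import quopri
v = ...  # the quoted-printable field value, as a str
# quopri works on bytes: encode first, then decode the result back to str
original = quopri.decodestring(v.encode("utf-8")).decode("utf-8")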
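
To make the difference concrete, here is a small sketch that compares the two approaches. The input string `v` is a hypothetical example (not from the original dataset) containing both a soft line break and an =09 token.

```python
import quopri

# Hypothetical quoted-printable input: "=09" encodes a horizontal tab,
# and "=\n" is a soft line break.
v = "Hello,=09world!\nHow are you t=\noday?"

# The plain replace removes the soft line break but leaves "=09" as-is.
naive = v.replace("=\n", "")

# quopri removes the soft line break and also decodes "=09" to a tab.
decoded = quopri.decodestring(v.encode("utf-8")).decode("utf-8")

print(repr(naive))
print(repr(decoded))
```

The final decode("utf-8") assumes the decoded bytes are UTF-8; an e-mail may declare a different charset, in which case that charset would probably be the one to use.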