CSV line continuation character for ignoring line breaks
I'm using Python to parse a .csv file that contains line breaks in most values. This isn't an issue, since values are delimited by ".
However, I've noticed that during the construction of the .csv file at one point in time, long values were split into multiple lines (but kept within the same value), with an = character put at the end of one line to signify "the following line break is actually a concatenation". A minimal working example: the value
Hello, world!
How are you today?
could be represented as
"Hello, world!\n
How are you t=\n
oday?"
where \n denotes the one-byte line break character.
Does CSV have the concept of "line continuation characters"? The documentation of Python's csv library does not mention anything about it under the formatting section, and hence I wonder if this is common practice and if Python nevertheless has support. I know how to write a parser that concatenates these lines (a simple str.replace(v,"=\n","") probably suffices), but I'm just curious whether this is an idiosyncrasy of my file.
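For reference, a minimal sketch of that naive approach, assuming the file is read with the standard csv module. The sample string `raw` and the second column are made up for illustration; only the "=\n" marker comes from the file described above.

```python
import csv
import io

# Hypothetical sample mirroring the example above: one quoted value with a
# real line break and one "=\n" continuation, plus a made-up second column.
raw = '"Hello, world!\nHow are you t=\noday?",second\r\n'

# newline="" lets the csv module see embedded line breaks unchanged.
reader = csv.reader(io.StringIO(raw, newline=""))

# Join the continued lines by dropping every "=\n" pair inside each value.
rows = [[value.replace("=\n", "") for value in row] for row in reader]
```

This only removes the continuation marker; any other use of = inside a value is left untouched.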
Comments (1)
This seems to be a feature not of CSV but of MIME (and since my dataset consists of e-mails, this answers my question).
This usage of equals characters is part of quoted-printable encoding, and can be handled by the quopri Python module. See this answer for more details.
Using this module is better than a simple str.replace(v, "=\n", ""), because e-mails can contain other quoted-printable tokens that need decoding and do not appear at line ends (e.g. =09 to represent a horizontal tab). With quopri, you would write:
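A minimal sketch of such a call, assuming the value is held in a str named v (as in the question) and that the text is UTF-8. The sample value, including the =09 token, is made up for illustration.

```python
import quopri

# Hypothetical value based on the question's example, with a soft line
# break ("=\n") and a quoted-printable tab ("=09") added for illustration.
v = "Hello, world!\nHow are you t=\noday?=09Fine."

# quopri operates on bytes, so encode first and decode the result afterwards.
# UTF-8 is an assumption; use the charset declared by the e-mail if known.
decoded = quopri.decodestring(v.encode("utf-8")).decode("utf-8")
```

Here the soft line break is removed, the real line break is kept, and =09 becomes a tab character.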