What is the most Pythonic way to normalize lineends in a string?

Posted on 2024-08-11 06:18:45

Given a text-string of unknown source, how does one best rewrite it to have a known lineend-convention?

I usually do:

lines = text.splitlines()
text = '\n'.join(lines)

... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!).

Edit

The one-liner for what I'm doing is of course:

'\n'.join(text.splitlines())

... that's not what I'm asking about.

The total number of lines should be the same afterwards, so no stripping of empty lines.

Testcases

Splitting

'a\nb\n\nc\nd'
'a\r\nb\r\n\r\nc\r\nd'
'a\rb\r\rc\rd'
'a\rb\n\rc\rd'
'a\rb\r\nc\nd'
'a\nb\r\nc\rd'

... should all result in 5 lines. In a mixed context, splitlines() assumes that '\r\n' is a single logical newline, which gives 4 lines for the last two test cases.
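The counts described above can be checked directly. This is a minimal sketch; the `cases` list simply repeats the six test strings from the question, and `line_counts` is a name chosen here for illustration.

```python
# The six test strings from the question.
cases = [
    'a\nb\n\nc\nd',
    'a\r\nb\r\n\r\nc\r\nd',
    'a\rb\r\rc\rd',
    'a\rb\n\rc\rd',
    'a\rb\r\nc\nd',
    'a\nb\r\nc\rd',
]

# splitlines() consumes '\r\n' as one terminator, even in mixed input,
# so the last two strings come out one line short of the desired 5.
line_counts = [len(case.splitlines()) for case in cases]
print(line_counts)  # [5, 5, 5, 5, 4, 4]
```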

Hm, a mixed context that contains '\r\n' can be detected by comparing the result of splitlines() and split('\n'), and/or split('\r')...
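One way to make that detection concrete is to count each terminator directly instead of comparing split results. The sketch below is only one interpretation of the requirement: the helper names (`lineend_counts`, `is_mixed`, `normalize_lineends`) are hypothetical, and the rule "in mixed input every '\r' and every '\n' is its own line break" is an assumption taken from the test cases, not an established convention.

```python
def lineend_counts(text):
    """Return the number of CRLF pairs, lone CRs and lone LFs in text."""
    crlf = text.count('\r\n')
    return {
        'crlf': crlf,
        'cr': text.count('\r') - crlf,  # CRs that are not part of a CRLF
        'lf': text.count('\n') - crlf,  # LFs that are not part of a CRLF
    }


def is_mixed(text):
    """True if more than one line-end convention occurs in text."""
    return sum(1 for n in lineend_counts(text).values() if n) > 1


def normalize_lineends(text, newline='\n'):
    """Rewrite text so that every line end becomes `newline`.

    Assumed heuristic: in mixed input each CR and each LF counts as a
    separate line break; in consistent input a CRLF pair is one break.
    """
    if is_mixed(text):
        # Every CR becomes its own LF, so a CRLF pair yields two breaks.
        text = text.replace('\r', '\n')
    else:
        # Consistent input: collapse CRLF first, then convert lone CRs.
        text = text.replace('\r\n', '\n').replace('\r', '\n')
    if newline != '\n':
        text = text.replace('\n', newline)
    return text


cases = [
    'a\nb\n\nc\nd',
    'a\r\nb\r\n\r\nc\r\nd',
    'a\rb\r\rc\rd',
    'a\rb\n\rc\rd',
    'a\rb\r\nc\nd',
    'a\nb\r\nc\rd',
]
# Each of the six test strings normalizes to the same 5-line text.
print([len(normalize_lineends(c).split('\n')) for c in cases])
```

A limitation of this heuristic: a file that is almost entirely CRLF but contains a single stray '\n' is classified as mixed, so every CRLF pair in it would turn into two line breaks. Whether that is acceptable depends on the data.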

Comments (3)

挖鼻大婶 2024-08-18 06:18:45
mixed.replace('\r\n', '\n').replace('\r', '\n')

should handle all possible variants.
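To illustrate the chained replace, here is a small sketch. The function names `to_lf` and `to_lf_re` are made up for this example, and the regular-expression variant is an alternative spelling rather than part of the answer. Note that both treat '\r\n' as a single line break, so the last two test cases from the question still produce 4 lines.

```python
import re


def to_lf(text):
    """Convert CRLF pairs and lone CRs to LF."""
    # Order matters: CRLF must go first, otherwise each pair
    # would be turned into two newlines by the '\r' replacement.
    return text.replace('\r\n', '\n').replace('\r', '\n')


def to_lf_re(text):
    """Same conversion in a single regular-expression pass."""
    return re.sub(r'\r\n?', '\n', text)


mixed = 'a\rb\r\nc\nd\r\n'
print(repr(to_lf(mixed)))                   # 'a\nb\nc\nd\n'
print(repr('\n'.join(mixed.splitlines())))  # 'a\nb\nc\nd'
```

Unlike `'\n'.join(text.splitlines())`, the replace chain keeps a trailing newline if the input had one.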

可遇━不可求 2024-08-18 06:18:45

... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!)

Actually it should work fine:

>>> s = 'hello world\nline 1\r\nline 2'

>>> s.splitlines()
['hello world', 'line 1', 'line 2']

>>> '\n'.join(s.splitlines())
'hello world\nline 1\nline 2'

What version of Python are you using?

EDIT: I still don't see how splitlines() is not working for you:

>>> s = '''\
... First line, with LF\n\
... Second line, with CR\r\
... Third line, with CRLF\r\n\
... Two blank lines with LFs\n\
... \n\
... \n\
... Two blank lines with CRs\r\
... \r\
... \r\
... Two blank lines with CRLFs\r\n\
... \r\n\
... \r\n\
... Three blank lines with a jumble of things:\r\n\
... \r\
... \r\n\
... \n\
... End without a newline.'''

>>> s.splitlines()
['First line, with LF', 'Second line, with CR', 'Third line, with CRLF', 'Two blank lines with LFs', '', '', 'Two blank lines with CRs', '', '', 'Two blank lines with CRLFs', '', '', 'Three blank lines with a jumble of things:', '', '', '', 'End without a newline.']

>>> print('\n'.join(s.splitlines()))
First line, with LF
Second line, with CR
Third line, with CRLF
Two blank lines with LFs


Two blank lines with CRs


Two blank lines with CRLFs


Three blank lines with a jumble of things:



End without a newline.

As far as I know splitlines() doesn't split the list twice or anything.

Can you paste a sample of the kind of input that's giving you trouble?
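One caveat if `splitlines()` is used for normalization on Python 3: on `str` it also splits at several other boundary characters (form feed, vertical tab, '\x85', '\u2028' and a few more), not only at '\r', '\n' and '\r\n'. The short sketch below, with an invented sample string, shows the difference from a plain replace-and-split.

```python
# Sample text with a form feed and a Unicode line separator (made up here).
text = 'a\x0cb\u2028c\nd'

# str.splitlines() treats '\x0c' and '\u2028' as line boundaries too.
by_splitlines = text.splitlines()

# Replacing only CR/LF variants leaves those characters untouched.
by_replace = text.replace('\r\n', '\n').replace('\r', '\n').split('\n')

print(len(by_splitlines))  # 4
print(len(by_replace))     # 2
```

On `bytes`, `splitlines()` only recognizes '\n', '\r' and '\r\n', so the two approaches agree there.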

多孤肩上扛 2024-08-18 06:18:45

Are there even more conventions than \r\n and \n? Simply replacing \r\n is enough if you don't need the lines.

only_newlines = mixed.replace('\r\n','\n')
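A quick check of this approach on one of the question's test strings suggests it covers only the CRLF case. In the sketch below, the sample value assigned to `mixed` is taken from the question; a lone '\r' (the old Mac convention) passes through unchanged.

```python
mixed = 'a\rb\r\nc\nd'

# Only CRLF pairs are rewritten; the lone CR after 'a' survives.
only_newlines = mixed.replace('\r\n', '\n')

print(repr(only_newlines))    # 'a\rb\nc\nd'
print('\r' in only_newlines)  # True
```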