What is the most Pythonic way to normalize line endings in a string?

Posted 2024-08-11 06:18:45, 648 characters, 12 views, 0 comments

Given a text string of unknown origin, how can one best rewrite it to use a known line-ending convention?

I usually do:

lines = text.splitlines()
text = '\n'.join(lines)

... but this doesn't handle "mixed" text files with utterly confused conventions (yes, they still exist!).

Edit

The one-liner for what I'm doing is, of course:

'\n'.join(text.splitlines())

... that's not what I'm asking about.

The total number of lines should be the same afterwards, so no stripping of empty lines.

Test cases

Splitting

'a\nb\n\nc\nd'
'a\r\nb\r\n\r\nc\r\nd'
'a\rb\r\rc\rd'
'a\rb\n\rc\rd'
'a\rb\r\nc\nd'
'a\nb\r\nc\rd'

... should all result in 5 lines. In a mixed context, splitlines() treats '\r\n' as a single logical newline, which gives 4 lines for the last two test cases.

Hm, a mixed context that contains '\r\n' can be detected by comparing the result of splitlines() with that of split('\n') and/or split('\r')...
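
One way to get the behaviour the test cases ask for, offered as a minimal sketch rather than a definitive answer: count the three conventions and, only when '\r\n' appears alongside a lone '\r' or a lone '\n', treat every '\r' and '\n' as its own line break. The function name normalize_newlines and this definition of "mixed" are assumptions made for illustration; they are not part of the question.

```python
import re

def normalize_newlines(text, newline="\n"):
    """Rewrite every line break in text to `newline` (hypothetical helper)."""
    crlf = text.count("\r\n")
    lone_cr = text.count("\r") - crlf   # CRs that are not followed by LF
    lone_lf = text.count("\n") - crlf   # LFs that are not preceded by CR
    # Assumption: the text is "mixed" when CRLF coexists with another convention.
    mixed = crlf > 0 and (lone_cr > 0 or lone_lf > 0)
    # Mixed text: each CR and each LF is a separate line break.
    # Otherwise: CRLF is one logical line break.
    pattern = r"[\r\n]" if mixed else r"\r\n|\r|\n"
    # A function is used as the replacement so `newline` is inserted literally.
    return re.sub(pattern, lambda match: newline, text)

cases = [
    "a\nb\n\nc\nd",
    "a\r\nb\r\n\r\nc\r\nd",
    "a\rb\r\rc\rd",
    "a\rb\n\rc\rd",
    "a\rb\r\nc\nd",
    "a\nb\r\nc\rd",
]

for case in cases:
    result = normalize_newlines(case)
    print(repr(result), len(result.split("\n")))
```

Under this definition all six test strings come out as five lines. The cost is that a genuine CRLF file containing a single stray '\n' would gain blank lines, so whether this rule is desirable depends on the data.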

挖鼻大婶 2024-08-18 06:18:45

mixed.replace('\r\n', '\n').replace('\r', '\n')

should handle all possible variants.
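
For comparison with the splitlines() one-liner from the question, here is a short sketch of where the two approaches differ. The helper names are made up for illustration. The chained replace keeps a trailing newline and only touches '\r' and '\n', whereas in Python 3 str.splitlines() drops the final line break and also splits on other separators such as form feed ('\x0c'). Both treat '\r\n' as a single line break.

```python
def normalize_by_replace(text):
    # Order matters: collapse CRLF first so its CR is not converted separately.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def normalize_by_splitlines(text):
    # The one-liner from the question, wrapped for comparison.
    return "\n".join(text.splitlines())

sample = "a\r\nb\rc\n"
print(repr(normalize_by_replace(sample)))       # 'a\nb\nc\n'
print(repr(normalize_by_splitlines(sample)))    # 'a\nb\nc'

form_feed = "x\x0cy"
print(repr(normalize_by_replace(form_feed)))     # 'x\x0cy'
print(repr(normalize_by_splitlines(form_feed)))  # 'x\ny'
```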


可遇━不可求 2024-08-18 06:18:45

... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!)

Actually it should work fine:

>>> s = 'hello world\nline 1\r\nline 2'

>>> s.splitlines()
['hello world', 'line 1', 'line 2']

>>> '\n'.join(s.splitlines())
'hello world\nline 1\nline 2'

What version of Python are you using?

EDIT: I still don't see how splitlines() is not working for you:

>>> s = '''\
... First line, with LF\n\
... Second line, with CR\r\
... Third line, with CRLF\r\n\
... Two blank lines with LFs\n\
... \n\
... \n\
... Two blank lines with CRs\r\
... \r\
... \r\
... Two blank lines with CRLFs\r\n\
... \r\n\
... \r\n\
... Three blank lines with a jumble of things:\r\n\
... \r\
... \r\n\
... \n\
... End without a newline.'''

>>> s.splitlines()
['First line, with LF', 'Second line, with CR', 'Third line, with CRLF', 'Two blank lines with LFs', '', '', 'Two blank lines with CRs', '', '', 'Two blank lines with CRLFs', '', '', 'Three blank lines with a jumble of things:', '', '', '', 'End without a newline.']

>>> print('\n'.join(s.splitlines()))
First line, with LF
Second line, with CR
Third line, with CRLF
Two blank lines with LFs


Two blank lines with CRs


Two blank lines with CRLFs


Three blank lines with a jumble of things:



End without a newline.

As far as I know, splitlines() doesn't split the text twice or anything like that.

Can you paste a sample of the kind of input that's giving you trouble?


多孤肩上扛 2024-08-18 06:18:45

Are there even more conventions than \r\n and \n? If you don't need the individual lines, simply replacing \r\n is enough.

only_newlines = mixed.replace('\r\n','\n')
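
A quick check of this suggestion against some of the question's test strings, as a sketch: replacing only '\r\n' leaves any lone '\r' (the classic Mac OS convention) untouched, so it covers the CRLF and LF cases only. The variable names are illustrative.

```python
cases = [
    "a\r\nb\r\n\r\nc\r\nd",  # CRLF only
    "a\rb\r\rc\rd",          # CR only
    "a\nb\r\nc\rd",          # mixed
]

for mixed in cases:
    only_newlines = mixed.replace("\r\n", "\n")
    # Any '\r' still present was not part of a '\r\n' pair.
    print(repr(only_newlines), "CR left over:", "\r" in only_newlines)
```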

About the author

〆一缕阳光ご
