Tabstop-aware len() and padding functions in Python

Posted on 2024-08-11 00:48:33

Python's len() and padding functions like str.ljust() are not tabstop-aware, i.e. they treat '\t' like any other single-width character and don't round the length up to the next multiple of the tab stop width. Example:

len('Bear\tnecessities\t')

is 17 instead of 24 (i.e. 4+(8-4)+11+(8-3))
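As a quick check of the arithmetic above, this small sketch compares plain len() with the width after str.expandtabs(). The tab width of 8 is the value assumed throughout the question; the variable names are only for illustration.

```python
# Compare the raw character count with the tab-expanded display width.
TABWIDTH = 8  # tab stop width assumed in the question

s = 'Bear\tnecessities\t'

raw_len = len(s)                            # each '\t' counts as one character
expanded_len = len(s.expandtabs(TABWIDTH))  # each '\t' advances to the next tab stop

print(raw_len, expanded_len)  # prints: 17 24
```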

and say I also want a function pad_with_tabs(s, maxlen) such that

pad_with_tabs('Bear', 15) == 'Bear\t\t'

Looking for simple implementations of these - compactness and readability first, efficiency second.
This is a basic but irritating question.
@gnibbler - can you show a purely Pythonic solution, even if it's say 20x less efficient?

Sure you could convert back and forth using str.expandtabs(TABWIDTH), but that's clunky.
Importing math to get TABWIDTH * int( math.ceil(len(s)*1.0/TABWIDTH) ) also seems like massive overkill.

I couldn't manage anything more elegant than the following:

TABWIDTH = 8

def pad_with_tabs(s,maxlen):
  s_len = len(s)
  while s_len < maxlen:
    s += '\t'
    s_len += TABWIDTH - (s_len % TABWIDTH)
  return s

and since Python strings are immutable, unless we want to monkey-patch our function into the string module to add it as a method, we must also assign the result of the function back to the variable:

s = pad_with_tabs(s, ...)

In particular, I couldn't get a clean approach using a list comprehension or str.join(...):

''.join([s, '\t' * ntabs])

without special-casing the cases where len(s) is less than an integer multiple of TABWIDTH, or where len(s) >= maxlen already.

Can anyone show better len() and pad_with_tabs() functions?

起风了 2024-08-18 00:48:34

I believe gnibbler's is the best for most practical cases. But anyway, here is a naive solution (not accounting for CR, LF, etc.) that computes the length of a string without creating an expanded copy:

def tab_aware_len(s, tabstop=8):
    pos = -1
    extra_length = 0
    while True:
        pos = s.find('\t', pos+1)
        if pos<0:
            return len(s) + extra_length
        extra_length += tabstop - (pos+extra_length) % tabstop - 1

It could be useful for huge strings or even memory-mapped files. And here is a slightly optimized padding function:

def pad_with_tabs(s, max_len, tabstop=8):
    length = tab_aware_len(s, tabstop)
    if length<max_len:
        s += '\t' * ((max_len-1)//tabstop + 1 - length//tabstop)
    return s
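
Since the length function above deliberately ignores CR and LF, here is a rough sketch of one way to handle multi-line text by restarting the column count on every line. The name tab_aware_line_widths is hypothetical, and unlike the original it does copy each line via splitlines(), although it still never builds a tab-expanded copy.

```python
def tab_aware_line_widths(text, tabstop=8):
    """Return the display width of each line in text without expanding tabs."""
    widths = []
    for line in text.splitlines():   # copies each line, but not a tab-expanded one
        col = 0                      # current display column on this line
        start = 0                    # index where the current run of plain characters begins
        while True:
            pos = line.find('\t', start)
            if pos < 0:
                col += len(line) - start        # remaining plain characters
                break
            col += pos - start                  # plain characters before the tab
            col += tabstop - col % tabstop      # the tab jumps to the next tab stop
            start = pos + 1
        widths.append(col)
    return widths


sample = 'Bear\tnecessities\t\nab\tc'
print(tab_aware_line_widths(sample))  # prints: [24, 9]
```

The result can be compared line by line with len(line.expandtabs(tabstop)) as a sanity check.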

行至春深 2024-08-18 00:48:34

TABWIDTH * int( math.ceil(len(s)*1.0/TABWIDTH) ) is indeed massive overkill; you can get the same result much more simply. For positive i and n, use:

def round_up_positive_int(i, n):
    return ((i + n - 1) // n) * n

This procedure works in just about any language I've ever used, after appropriate translation.

Then you can do next_pos = round_up_positive_int(len(s), TABWIDTH)
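
To illustrate that the integer formula agrees with the math.ceil expression from the question, here is a small check over a range of lengths. TABWIDTH = 8 is taken from the question; the loop bounds are arbitrary.

```python
import math

TABWIDTH = 8  # tab stop width from the question


def round_up_positive_int(i, n):
    # Round i up to the nearest multiple of n using integer arithmetic only.
    return ((i + n - 1) // n) * n


# Compare against the math.ceil version for a range of positive lengths.
for length in range(1, 40):
    via_ceil = TABWIDTH * int(math.ceil(length * 1.0 / TABWIDTH))
    assert round_up_positive_int(length, TABWIDTH) == via_ceil

print(round_up_positive_int(len('Bear'), TABWIDTH))  # prints: 8
```

Note that a length already on a multiple of n is returned unchanged (8 stays 8), whereas a tab character typed at that column would advance a full TABWIDTH.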

For a slight increase in the elegance of your code, instead of

while(s_len < maxlen):

use this:

while s_len < maxlen:

魂ガ小子 2024-08-18 00:48:34

Unfortunately I was unable to use the accepted answer "as is", so here is a slightly modified version in case someone runs into the same problem and finds this post via search:

from decimal import Decimal, ROUND_HALF_UP
TABWIDTH = 4

def pad_with_tabs(src, max_len):
    return src + "\t" * int(
        Decimal((max_len - len(src.expandtabs(TABWIDTH))) / TABWIDTH + 1).quantize(0, ROUND_HALF_UP))


def pad_fields(input):
    result = []
    longest = max(len(x) for x in input)
    for row in input:
        result.append(pad_with_tabs(row, longest))
    return result

The output list contains properly padded rows with the tab count rounded, so the resulting data has the same indentation level even in the corner .5 cases where the original answer adds no tab.
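
An integer-only alternative can be sketched under the assumption that the goal is for every row to end at the same tab stop, namely the first one past the widest row. The name pad_rows_to_common_stop is hypothetical and not part of any answer here, and the input is assumed to be a non-empty list of strings.

```python
TABWIDTH = 4  # same value as in the answer above


def pad_rows_to_common_stop(rows, tabwidth=TABWIDTH):
    """Append tabs so every row ends at the first tab stop past the widest row."""
    # Display width of each row once its own tabs are expanded.
    widths = [len(row.expandtabs(tabwidth)) for row in rows]
    # Index of the tab stop strictly beyond the widest row.
    target_stop = max(widths) // tabwidth + 1
    # Each appended tab moves a row to the next stop, so the number of tabs
    # is the difference between the target stop and the row's current stop.
    return [row + '\t' * (target_stop - width // tabwidth)
            for row, width in zip(rows, widths)]


rows = ['a', 'abcd', 'abcdefg', 'abcdefgh']
padded = pad_rows_to_common_stop(rows)
print([len(p.expandtabs(TABWIDTH)) for p in padded])  # prints: [12, 12, 12, 12]
```

Counting tab stops with floor division avoids Decimal and any rounding mode, because a row's current stop index is simply its width divided by the tab width, rounded down.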

萌无敌 2024-08-18 00:48:33

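TABWIDTH = 8

def my_len(s):
    # Display width of s once tabs are expanded to TABWIDTH-column stops.
    return len(s.expandtabs(TABWIDTH))

def pad_with_tabs(s, maxlen):
    # Floor division keeps the repeat count an integer on Python 3 as well.
    return s + "\t" * ((maxlen - len(s) - 1) // TABWIDTH + 1)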

Why did I use expandtabs()?
Well, it's fast:

$ python -m timeit '"Bear\tnecessities\t".expandtabs()'
1000000 loops, best of 3: 0.602 usec per loop
$ python -m timeit 'for c in "Bear\tnecessities\t":pass'
100000 loops, best of 3: 2.32 usec per loop
$ python -m timeit '[c for c in "Bear\tnecessities\t"]'
100000 loops, best of 3: 4.17 usec per loop
$ python -m timeit 'map(None,"Bear\tnecessities\t")'
100000 loops, best of 3: 2.25 usec per loop

Anything that iterates over your string is going to be slower, because just the iteration is ~4 times slower than expandtabs even when you do nothing in the loop.

$ python -m timeit '"Bear\tnecessities\t".split("\t")'
1000000 loops, best of 3: 0.868 usec per loop

Even just splitting on tabs takes longer. You'd still need to iterate over the split pieces and pad each item to the tab stop.
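
The timings above come from the command line on what appears to be Python 2. A rough way to repeat the comparison from a script on Python 3 is sketched below, using the standard timeit module. The map(None, ...) case is left out because that form only works on Python 2, and wrapping each statement in a function adds call overhead, so the absolute numbers are not comparable with the ones quoted above. The names S, N, candidates and results are illustrative.

```python
import timeit

S = "Bear\tnecessities\t"
N = 100000  # number of repetitions per candidate


def iterate(s):
    # Bare iteration over the characters, doing nothing with them.
    for c in s:
        pass


candidates = {
    "expandtabs": lambda: S.expandtabs(),
    "for loop": lambda: iterate(S),
    "list comprehension": lambda: [c for c in S],
    "split on tabs": lambda: S.split("\t"),
}

# timeit.timeit returns the total time in seconds for N calls.
results = {name: timeit.timeit(fn, number=N) for name, fn in candidates.items()}

for name, seconds in results.items():
    print("%-20s %.3f usec per call" % (name, seconds / N * 1e6))
```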
