确定源文件中使用的制表符宽度的良好启发式是什么?

发布于 2024-11-27 21:26:09 字数 2231 浏览 2 评论 0原文

我想确定源文件中使用空格缩进的制表符宽度。 对于具有特别规则缩进的文件来说,这并不难,其中前导空格仅用于缩进,始终以制表符宽度的倍数表示,并且缩进一次增加一级。 但许多文件都会与这种常规缩进有一些偏差,通常是为了某种形式的垂直对齐。因此,我正在寻找一种很好的启发式方法来估计使用的制表符宽度,从而允许出现不规则缩进的可能性。

这样做的动机是为 SubEthaEdit 编辑器编写扩展。遗憾的是,SubEthaEdit 不提供可用于脚本编写的制表符宽度,因此我将根据文本进行猜测。

合适的启发式应该:

  • 表现得足够好以供交互使用。我不认为这会成为问题,如果需要,可以仅使用文本的一部分。
  • 独立于语言。
  • 返回最长的合适标签宽度。例如,任何制表符宽度为四个空格的文件也可能是具有两个空格制表符的文件,如果每个缩进实际上是两倍的级别。显然,四个空间是正确的选择。
  • 如果压痕完全规则,则始终正确。

一些简化因素:

  • 可以假定至少一行是缩进的。
  • 可以假定制表符宽度至少为两个空格。
  • 可以安全地假设缩进仅用空格完成。并不是说我反对制表符——恰恰相反,我会首先检查是否有用于缩进的制表符并单独处理。这确实意味着混合制表符和空格的缩进可能无法正确处理,但我认为这并不重要。
  • 可以假设不存在仅包含空白的行。
  • 并非所有语言都需要正确处理。例如,像 lisp 和 go 这样的语言的成功或失败完全无关紧要,因为它们通常不会手工缩进。
  • 不需要完美。如果偶尔需要手动调整几行,世界也不会终结。

您会采取什么方法?您认为它的优点和缺点是什么?

如果您想在答案中提供工作代码,最好的方法可能是使用 shell 脚本从 stdin 读取源文件并将制表符宽度写入 stdout。伪代码或清晰的文字描述也可以。

一些结果

为了测试不同的策略,我们可以对语言发行版标准库中的文件应用不同的策略,因为它们可能遵循该语言的标准缩进。我将考虑 Python 2.7 和 Ruby 1.8 库(系统框架安装在 Mac OS X 10.7 上),它们的预期制表符宽度分别为 4 和 2。排除的是那些具有以制表符开头的行或没有以至少两个空格开头的行的文件。

Python:

                     Right  None  Wrong
Mode:                 2523     1    102
First:                2169     1    456
No-long (12):         2529     9     88
No-long (8):          2535    16     75
LR (changes):         2509     1    116
LR (indent):          1533     1   1092
Doublecheck (10):     2480    15    130
Doublecheck (20):     2509    15    101

Ruby:

                     Right  None  Wrong
Mode:                  594    29     51
First:                 578     0     54
No-long (12):          595    29     50
No-long (8):           597    29     48
LR (changes):          585     0     47
LR (indent):           496     0    136
Doublecheck (10):      610     0     22
Doublecheck (20):      609     0     23

在这些表中,“正确”应视为语言标准制表符宽度的确定,“错误”应视为不等于语言标准宽度的非零制表符宽度,“无”应视为零制表符- 宽度或无应答。 “模式”是选择缩进最常发生变化的策略; “First”是取第一条缩进线的缩进; “No-long”是FastAl的策略,排除缩进较大的行并采取模式,数字表示允许的最大缩进变化; “LR”是Patrick87基于线性回归的策略,其变体基于行间缩进的变化和行的绝对缩进; “Doublecheck”(无法抗拒双关语!)是 Mark 对 FastAl 策略的修改,限制可能的制表符宽度并检查模态值的一半是否也频繁出现,并有两个不同的阈值用于选择较小的宽度。

I would like to determine the tab width used in source files indented with spaces.
This is not hard for files with particularly regular indentation, where the leading spaces are only used for indentation, always in multiples of the tab width, and with indentation increasing one level at at time.
But many files will have some departure from this sort of regular indentation, generally for some form of vertical alignment. I'm thus looking for a good heuristic to estimate what tab width was used, allowing some possibility for irregular indentation.

The motivation for this is writing an extension for the SubEthaEdit editor. SubEthaEdit unfortunately doesn't make the tab width available for scripting, so I'm going to guess at it based on the text.

A suitable heuristic should:

  • Perform well enough for interactive use. I don't imagine this will be a problem, and just a portion of the text can be used if need be.
  • Be language independent.
  • Return the longest suitable tab width. For example, any file with a tab width of four spaces could also be a file with two-space tabs, if every indentation was actually by twice as many levels. Clearly, four spaces would be the right choice.
  • Always get it right if the indentation is completely regular.

Some simplifying factors:

  • At least one line can be assumed to be indented.
  • The tab width can be assumed to be at least two spaces.
  • It's safe to assume that indentation is done with spaces only. It's not that I have anything against tabs---quite the contrary, I'll check first if there are any tabs used for indentation and handle it separately. This does mean that indentation mixing tabs and spaces might not be handled properly, but I don't consider it important.
  • It may be assumed that there are no lines containing only whitespace.
  • Not all languages need to be handled correctly. For example, success or failure with languages like lisp and go would be completely irrelevant, since they're not normally indented by hand.
  • Perfection is not required. The world isn't going to end if a few lines occasionally need to be manually adjusted.

What approach would you take, and what do you see as its advantages and disadvantages?

If you want to provide working code in your answer, the best approach is probably to use a shell script that reads the source file from stdin and writes the tab width to stdout. Pseudocode or a clear description in words would be just fine, too.

Some Results

To test different strategies, we can apply different strategies to files in the standard libraries for language distributions, as they presumably follow the standard indentation for the language. I'll consider the Python 2.7 and Ruby 1.8 libraries (system framework installs on Mac OS X 10.7), which have expected tab widths of 4 and 2, respectively. Excluded are those files which have lines beginning with tab characters or which have no lines beginning with at least two spaces.

Python:

                     Right  None  Wrong
Mode:                 2523     1    102
First:                2169     1    456
No-long (12):         2529     9     88
No-long (8):          2535    16     75
LR (changes):         2509     1    116
LR (indent):          1533     1   1092
Doublecheck (10):     2480    15    130
Doublecheck (20):     2509    15    101

Ruby:

                     Right  None  Wrong
Mode:                  594    29     51
First:                 578     0     54
No-long (12):          595    29     50
No-long (8):           597    29     48
LR (changes):          585     0     47
LR (indent):           496     0    136
Doublecheck (10):      610     0     22
Doublecheck (20):      609     0     23

In these tables, "Right" should be taken as determination of the language-standard tab width, "Wrong" as a non-zero tab width not equal to the language-standard width, and "None" as zero tab-width or no answer. "Mode" is the strategy of selecting the most frequently occurring change in indentation; "First" is taking the indentation of the first indented line; "No-long" is FastAl's strategy of excluding lines with large indentation and taking the mode, with the number indicating the maximum allowed indent change; "LR" is Patrick87's strategy based on linear regression, with variants based on the change in indentation between lines and on the absolute indentation of lines; "Doublecheck" (couldn't resist the pun!) is Mark's modification of FastAl's strategy, restricting the possible tab width and checking whether half the modal value also occurs frequently, with two different thresholds for selecting the smaller width.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

厌倦 2024-12-04 21:26:09

好吧,由于您想要一个与语言无关的解决方案,我们将无法使用任何语法提示。尽管您说您不想要一个完美的解决方案,但这里有一个对于大多数语言都运行良好的解决方案。

实际上,我必须解决密码学中的类似问题,才能在 多字母密码 中获得正确的代码字长度。这种加密是基本的凯撒加密(字母表中的每个字母都移动n个字母),其中密码用于以不同的方式移动字母(第n个字母)明文的字母被 mod(nth, length(cryptword)) 密码的字母移动。选择的武器是自相关

该算法如下:

  1. 在行开头的空格结束后去除所有字符 - 保持行结束标记完好无损。
  2. 删除空白为零的行(因为它们只是空白行)
  3. 计算每行的空白宽度并将其保存在数组lengths
  4. 自相关中:循环直到最大估计数量 -可能相当高,如 32 或其他 - 当前迭代应为i。对于每次迭代,计算每个条目与第 i 个条目之间的距离。计算距离 = 0 的数量(第 n 个(n+i)th 条目的值相同),保存在键 i 的数组中嗯>。
  5. 您现在拥有一系列相同的出现次数。计算该数组的平均值,并删除该平均值附近的所有值(留下自相关的尖峰)。峰值将是最低值的倍数,最低值将是搜索到的用于缩进的空格数。

自相关是一个非常好的函数,可用于您想要检测数据流中的重复值的每种情况。它在信号处理中大量使用并且速度非常快(取决于信号重复的估计最大距离)。

是的,当时我用自相关破解了多表密文。 ;)

Okay, as you want a language-agnostic solution, we won't be able to use any syntactical hints. Although you said, that you don't want a perfect solution, here is one, that is working very well with most languages.

I actually had to solve a similar issue in cryptography to get the correct code word length in a polyalphabetic cipher. This kind of encryption is a basic Caesar-chiffre (each letter of the alphabet is moved by n letters), where the cryptword is used to move the letters differently (the nth letter of the clear text is moved by the mod(nth, length(cryptword)) letter of the cryptword). The weapon of choice is autocorrelation.

The algorithm would be like this:

  1. strip all characters after the whitespace at the beginning of a line has ended - leave the line-end markers intact.
  2. remove lines with zero whitespace (as they are only blank lines)
  3. Count the whitespace width for each line and save this in an array lengths
  4. Autocorrelation: loop until the maximum estimated number - may be fairly high like 32 or something - current iteration shall be i. For each iteration, calculate the distance between each entry and the ith entry. Count the number of distances = 0 (same values for the nth and (n+i)th entries), save in an array for the key i.
  5. You now have an array of same-pair-occurances. Calculate the mean of this array, and delete all values near this mean (leaving the spikes of the autocorrelation). The spikes will be multiples of the lowest value, which will be the searched number of spaces used for indentation.

The autocorrelation is a very nice function, usable for every situation, in which you want to detect repeating values in a stream of data. It is heavily used in signal processing and very fast (depending on the estimated maximum distance of signal repeats).

And yes, back then I cracked the polyalphabetic ciphertext with autocorrelation. ;)

沧笙踏歌 2024-12-04 21:26:09
  • 对于文件中的每一行
    • 如果缩进量比前一个多,则将差异添加到列表中
      • 如果>则丢弃12,这可能是续行
  • 生成列表
  • #1 中的 # 的频率表可能是您的答案。

编辑

我打开了 VB.Net(不是吗?:-) 这就是我的意思:

    Sub Main()
        Dim lines = IO.File.ReadAllLines("ProveGodExists.c")
        Dim previndent As Integer = 0
        Dim indent As Integer
        Dim diff As Integer
        Dim Diffs As New Dictionary(Of Integer, Integer)
        For Each line In lines
            previndent = indent
            indent = Len(line) - Len(LTrim(line))
            diff = indent - previndent
            If diff > 0 And diff < 13 Then
                If Diffs.ContainsKey(diff) Then
                    Diffs(diff) += 1
                Else
                    Diffs.Add(diff, 1)
                End If
            End If
        Next
        Dim freqtbl = From p In Diffs Order By p.Value Descending
        Console.WriteLine("Dump of frequency table:")
        For Each item In freqtbl
            Console.WriteLine(item.Key.ToString & " " & item.Value.ToString)
        Next
        Console.WriteLine("My wild guess at tab setting: " & freqtbl(0).Key.ToString)
        Console.ReadLine()
    End Sub

结果:

频率表转储:
4 748
8月22日
12 12
2 2
9 2
3 1
6 1
我对选项卡设置的疯狂猜测:4

希望有帮助。

  • For each line in the file
    • If indented more than the previous, add the difference to a list
      • discard if > 12, it's probably a line continuation
  • Generate a frequency table of the #s in the list
  • #1 is likely your answer.

edit

I have VB.Net open (didn't you? :-) Here's what I mean:

    Sub Main()
        Dim lines = IO.File.ReadAllLines("ProveGodExists.c")
        Dim previndent As Integer = 0
        Dim indent As Integer
        Dim diff As Integer
        Dim Diffs As New Dictionary(Of Integer, Integer)
        For Each line In lines
            previndent = indent
            indent = Len(line) - Len(LTrim(line))
            diff = indent - previndent
            If diff > 0 And diff < 13 Then
                If Diffs.ContainsKey(diff) Then
                    Diffs(diff) += 1
                Else
                    Diffs.Add(diff, 1)
                End If
            End If
        Next
        Dim freqtbl = From p In Diffs Order By p.Value Descending
        Console.WriteLine("Dump of frequency table:")
        For Each item In freqtbl
            Console.WriteLine(item.Key.ToString & " " & item.Value.ToString)
        Next
        Console.WriteLine("My wild guess at tab setting: " & freqtbl(0).Key.ToString)
        Console.ReadLine()
    End Sub

Results:

Dump of frequency table:
4 748
8 22
12 12
2 2
9 2
3 1
6 1
My wild guess at tab setting: 4

Hope that helps.

御弟哥哥 2024-12-04 21:26:09

您的选择(实际上)是 2、3、4、5、6、7、8。

我会使用 @FastAl 建议的内容扫描前 50-100 行左右。我可能倾向于盲目地从任何带有文本的行前面拉出空格计数,并计算空白字符串的长度。如果您有可用的正则表达式,则左修剪线和运行长度两次似乎是一种浪费。另外,我会执行 System.Math.abs(indent - previndent) 以便获得取消缩进的数据。正则表达式将是这样的:

row.matches('^( +)[^ ]') # grab all the spaces from line start to non-space.

一旦您获得了 7 个选项中计数最高的统计数据,就将其作为第一个猜测运行。对于 8、6 和 4,您应该检查 4 和 2、3 或 2 是否也有显着计数(第二名或超过 10% 或其他廉价启发式)。如果有很多 12(或 9)可能暗示 4(或 3)也是比 8(或 6)更好的选择。一次删除或添加超过 2 个级别(通常是折叠的结束括号)的情况非常罕见。

不相干的咕哝

我看到的一个问题是旧的 .c 代码尤其存在这种令人讨厌的模式:

code level 0
/* Fancy comments get weird spacing because there 
 * is an extra space beyond the *
 * looks like one space!
 */
  code indent (2 spaces)
  /* Fancy comments get weird spacing because there 
   * is an extra space beyond the *
   * looks like three spaces!
   */

code level 0
  code indent (2 spaces)
  /* comment at indent level 1
     With no stars you wind up with 2 spaces + 3 spaces.
  */

恶心。我不知道你是如何对待这样的评论标准的。对于像“c”这样的代码,您可能必须处理 2.0 版本中的特殊注释...但我现在会忽略它。

您的最后一个问题是处理与您的假设不符的行。我的建议是将它们“标记”到深度,然后保留多余的空格。如果您必须更正,我会这样做:rowtabdepth = heaven((rowspacecount - (tabwidth/2)) / tabwidth)

Your choices are (realistically) 2,3,4,5,6,7,8.

I'd scan the the first 50-100 lines or so using something like what @FastAl suggested. I'd probably lean toward just blindly pulling the spaces count from the front of any row with text and counting the length of the white space string. Left trimming lines and running length twice seems like a waste if you have regex available. Also, I'd do System.Math.abs(indent - previndent) so you get de-indent data. The regex would be this:

row.matches('^( +)[^ ]') # grab all the spaces from line start to non-space.

Once you've got a statistic for which of the 7 options has the highest count, run with it as the first guess. For 8, 6, and 4 you should check to see if there is also a significant count (2nd place or over 10% or some other cheapo heuristic) for 4 and 2, 3, or 2. If there are a lot of 12s (or 9s) that might hint that 4 (or 3) is a better choice than 8 (or 6) as well. Dropping or adding more than 2 levels at a time (usually collapsed ending brackets) is super rare.

Irrelevant mumbling

The one problem I see is that old .c code in particular has this nasty pattern going on in it:

code level 0
/* Fancy comments get weird spacing because there 
 * is an extra space beyond the *
 * looks like one space!
 */
  code indent (2 spaces)
  /* Fancy comments get weird spacing because there 
   * is an extra space beyond the *
   * looks like three spaces!
   */

code level 0
  code indent (2 spaces)
  /* comment at indent level 1
     With no stars you wind up with 2 spaces + 3 spaces.
  */

Yuck. I don't know how you deal with comment standards like that. For code that is "c" like you might have to deal with comments special in version 2.0... but I would just ignore it for now.

Your final issue is dealing with lines that don't match your assumptions. My suggestion would be to "tab" them to depth and then leave the extra spaces in place. If you have to correct I'd do this: rowtabdepth = ceiling((rowspacecount - (tabwidth/2)) / tabwidth)

风吹雪碎 2024-12-04 21:26:09

也许做类似的事情...

  1. 获取文件中所有制表符宽度的列表,
  2. 删除最不频繁的条目的 50%,
  3. 按升序对剩余条目进行排序,
  4. 计算(a,b)对的列表,其中 b 位于制表符宽度列表和 a 给出该制表符宽度的排名。
  5. 绘制最佳拟合线
  6. 最佳拟合线的斜率是选项卡宽度的猜测值。四舍五入到最接近的整数。

示例:

  1. 列表 = [4, 4, 6, 8, 8, 4, 4, 4, 8, 8, 12, 5, 11, 13, 12, 12]
  2. 列表 = [4, 4, 4, 4, 4, 8, 8, 8]
  3. 已排序
  4. [(1, 4), (1, 4), (1, 4), (1, 4), (1, 4), (2, 8), (2, 8), (2, 8)]
  5. 最佳拟合线是 b = 4a + 0 (R^2 = 0)
  6. 斜率为 4,所以这可能是制表符宽度。

Maybe do something like...

  1. get a list of all the tab widths in the file
  2. remove 50% of the entries which are least frequent
  3. sort the remaining entries in ascending order
  4. compute a list of (a, b) pairs where b's are in the list of tab widths and the a's give the rank of that tab width.
  5. plot a best-fit line
  6. the slope of the best-fit line is the guess for the tab width. round to the nearest integer.

Example:

  1. list = [4, 4, 6, 8, 8, 4, 4, 4, 8, 8, 12, 5, 11, 13, 12, 12]
  2. list = [4, 4, 4, 4, 4, 8, 8, 8]
  3. already sorted
  4. [(1, 4), (1, 4), (1, 4), (1, 4), (1, 4), (2, 8), (2, 8), (2, 8)]
  5. the best fit line is b = 4a + 0 (R^2 = 0)
  6. slope is 4, so this is probably the tab width.
贪恋 2024-12-04 21:26:09

对于您想要支持的每种语言,您需要进行一些解析:
1)排除注释(逐行或逐块,也许还嵌套?)
2)找到子块的开口(类C语言中的{,pascal中的begin,shell中的do等)

然后看看子块打开后,空间数量增加了多少。进行一些简单的统计 - 找到最频繁的值、最大值和最小值、平均值。这样您还可以看到压痕是否规则以及有多少。

For each langauge you want to support, you'll need to do a bit of parsing:
1) exclude comments (either line-wise or block-wise, maybe also nested?)
2) find openings of sub-block ({ in C-like languages, begin in pascal, do in shell etc.)

Then just see how much the number of spaces increase after sub-block has been opened. Make some simple statistics - to find most frequent value, maximum and minimum value, average value. This way you can also see if the indentation is regular or not and how much.

剩一世无双 2024-12-04 21:26:09

作为基线,可以简单地计算所有缩进增加,并将最频繁的增加作为制表符宽度。作为一个 shell 脚本,编写每个管道阶段都有小的操作,它可能看起来像这样:

#!/bin/sh

grep -v -E '^[[:space:]]*

此实现是 O(n log(n)) 其中 n 是数字文件中的行数,但可以在 O(n) 内轻松完成。

| sed 's/^\([[:space:]]*\).*/\1/' | awk '{ print length($0) }' | awk '$1 > prev { print $1 - prev } { prev = $1 }' | sort | uniq -c | sort -k1nr | awk '{ print $2 }' | head -n 1

此实现是 O(n log(n)) 其中 n 是数字文件中的行数,但可以在 O(n) 内轻松完成。

As a baseline, one could simply calculate all indentation increases, and take the most frequent increase as the tab width. As a shell script, written to have small actions per pipeline stage, it could look like this:

#!/bin/sh

grep -v -E '^[[:space:]]*

This implementation is O(n log(n)) where n is the number of lines in the file, but it could readily be done in O(n).

| sed 's/^\([[:space:]]*\).*/\1/' | awk '{ print length($0) }' | awk '$1 > prev { print $1 - prev } { prev = $1 }' | sort | uniq -c | sort -k1nr | awk '{ print $2 }' | head -n 1

This implementation is O(n log(n)) where n is the number of lines in the file, but it could readily be done in O(n).

淡看悲欢离合 2024-12-04 21:26:09

启发式:

  1. 获取从一行到下一行的所有缩进更改的列表,其中 > 0.
  2. 制作此列表中所有值的频率表。
  3. 取频率最高的值。

Python 脚本,采用文件名或标准输入并打印最佳缩进数:

#!/usr/bin/env python

import fileinput, collections

def leadingSpaceLen(line):
    return len(line) - len(line.lstrip())

def indentChange(line1, line2):
    return leadingSpaceLen(line2) - leadingSpaceLen(line1)

def indentChanges(lines):
    return [indentChange(line1, line2)
        for line1, line2 in zip(lines[:-1], lines[1:])]

def bestIndent(lines):
    f = collections.defaultdict(lambda: 0)
    for change in indentChanges(lines):
        if change > 0:
            f[change] += 1
    return max(f.items(), key=lambda x: x[1])[0]

if __name__ == '__main__':
    print bestIndent(tuple(fileinput.input()))

Heuristic:

  1. Get a list of all indention changes from a line to its next line which are > 0.
  2. Make a frequency table of all values in this list.
  3. Take the value with highest frequency.

Python script, takes filenames or stdin and prints best indent number:

#!/usr/bin/env python

import fileinput, collections

def leadingSpaceLen(line):
    return len(line) - len(line.lstrip())

def indentChange(line1, line2):
    return leadingSpaceLen(line2) - leadingSpaceLen(line1)

def indentChanges(lines):
    return [indentChange(line1, line2)
        for line1, line2 in zip(lines[:-1], lines[1:])]

def bestIndent(lines):
    f = collections.defaultdict(lambda: 0)
    for change in indentChanges(lines):
        if change > 0:
            f[change] += 1
    return max(f.items(), key=lambda x: x[1])[0]

if __name__ == '__main__':
    print bestIndent(tuple(fileinput.input()))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文