确定源文件中使用的制表符宽度的良好启发式是什么?
我想确定源文件中使用空格缩进的制表符宽度。 对于具有特别规则缩进的文件来说,这并不难,其中前导空格仅用于缩进,始终以制表符宽度的倍数表示,并且缩进一次增加一级。 但许多文件都会与这种常规缩进有一些偏差,通常是为了某种形式的垂直对齐。因此,我正在寻找一种很好的启发式方法来估计使用的制表符宽度,从而允许出现不规则缩进的可能性。
这样做的动机是为 SubEthaEdit 编辑器编写扩展。遗憾的是,SubEthaEdit 不提供可用于脚本编写的制表符宽度,因此我将根据文本进行猜测。
合适的启发式应该:
- 表现得足够好以供交互使用。我不认为这会成为问题,如果需要,可以仅使用文本的一部分。
- 独立于语言。
- 返回最长的合适标签宽度。例如,任何制表符宽度为四个空格的文件也可能是具有两个空格制表符的文件,如果每个缩进实际上是两倍的级别。显然,四个空间是正确的选择。
- 如果压痕完全规则,则始终正确。
一些简化因素:
- 可以假定至少一行是缩进的。
- 可以假定制表符宽度至少为两个空格。
- 可以安全地假设缩进仅用空格完成。并不是说我反对制表符——恰恰相反,我会首先检查是否有用于缩进的制表符并单独处理。这确实意味着混合制表符和空格的缩进可能无法正确处理,但我认为这并不重要。
- 可以假设不存在仅包含空白的行。
- 并非所有语言都需要正确处理。例如,像 lisp 和 go 这样的语言的成功或失败完全无关紧要,因为它们通常不会手工缩进。
- 不需要完美。如果偶尔需要手动调整几行,世界也不会终结。
您会采取什么方法?您认为它的优点和缺点是什么?
如果您想在答案中提供工作代码,最好的方法可能是使用 shell 脚本从 stdin
读取源文件并将制表符宽度写入 stdout
。伪代码或清晰的文字描述也可以。
一些结果
为了测试不同的策略,我们可以对语言发行版标准库中的文件应用不同的策略,因为它们可能遵循该语言的标准缩进。我将考虑 Python 2.7 和 Ruby 1.8 库(系统框架安装在 Mac OS X 10.7 上),它们的预期制表符宽度分别为 4 和 2。排除的是那些具有以制表符开头的行或没有以至少两个空格开头的行的文件。
Python:
Right None Wrong
Mode: 2523 1 102
First: 2169 1 456
No-long (12): 2529 9 88
No-long (8): 2535 16 75
LR (changes): 2509 1 116
LR (indent): 1533 1 1092
Doublecheck (10): 2480 15 130
Doublecheck (20): 2509 15 101
Ruby:
Right None Wrong
Mode: 594 29 51
First: 578 0 54
No-long (12): 595 29 50
No-long (8): 597 29 48
LR (changes): 585 0 47
LR (indent): 496 0 136
Doublecheck (10): 610 0 22
Doublecheck (20): 609 0 23
在这些表中,“正确”应视为语言标准制表符宽度的确定,“错误”应视为不等于语言标准宽度的非零制表符宽度,“无”应视为零制表符- 宽度或无应答。 “模式”是选择缩进最常发生变化的策略; “First”是取第一条缩进线的缩进; “No-long”是FastAl的策略,排除缩进较大的行并采取模式,数字表示允许的最大缩进变化; “LR”是Patrick87基于线性回归的策略,其变体基于行间缩进的变化和行的绝对缩进; “Doublecheck”(无法抗拒双关语!)是 Mark 对 FastAl 策略的修改,限制可能的制表符宽度并检查模态值的一半是否也频繁出现,并有两个不同的阈值用于选择较小的宽度。
I would like to determine the tab width used in source files indented with spaces.
This is not hard for files with particularly regular indentation, where the leading spaces are only used for indentation, always in multiples of the tab width, and with indentation increasing one level at at time.
But many files will have some departure from this sort of regular indentation, generally for some form of vertical alignment. I'm thus looking for a good heuristic to estimate what tab width was used, allowing some possibility for irregular indentation.
The motivation for this is writing an extension for the SubEthaEdit editor. SubEthaEdit unfortunately doesn't make the tab width available for scripting, so I'm going to guess at it based on the text.
A suitable heuristic should:
- Perform well enough for interactive use. I don't imagine this will be a problem, and just a portion of the text can be used if need be.
- Be language independent.
- Return the longest suitable tab width. For example, any file with a tab width of four spaces could also be a file with two-space tabs, if every indentation was actually by twice as many levels. Clearly, four spaces would be the right choice.
- Always get it right if the indentation is completely regular.
Some simplifying factors:
- At least one line can be assumed to be indented.
- The tab width can be assumed to be at least two spaces.
- It's safe to assume that indentation is done with spaces only. It's not that I have anything against tabs---quite the contrary, I'll check first if there are any tabs used for indentation and handle it separately. This does mean that indentation mixing tabs and spaces might not be handled properly, but I don't consider it important.
- It may be assumed that there are no lines containing only whitespace.
- Not all languages need to be handled correctly. For example, success or failure with languages like lisp and go would be completely irrelevant, since they're not normally indented by hand.
- Perfection is not required. The world isn't going to end if a few lines occasionally need to be manually adjusted.
What approach would you take, and what do you see as its advantages and disadvantages?
If you want to provide working code in your answer, the best approach is probably to use a shell script that reads the source file from stdin
and writes the tab width to stdout
. Pseudocode or a clear description in words would be just fine, too.
Some Results
To test different strategies, we can apply different strategies to files in the standard libraries for language distributions, as they presumably follow the standard indentation for the language. I'll consider the Python 2.7 and Ruby 1.8 libraries (system framework installs on Mac OS X 10.7), which have expected tab widths of 4 and 2, respectively. Excluded are those files which have lines beginning with tab characters or which have no lines beginning with at least two spaces.
Python:
Right None Wrong
Mode: 2523 1 102
First: 2169 1 456
No-long (12): 2529 9 88
No-long (8): 2535 16 75
LR (changes): 2509 1 116
LR (indent): 1533 1 1092
Doublecheck (10): 2480 15 130
Doublecheck (20): 2509 15 101
Ruby:
Right None Wrong
Mode: 594 29 51
First: 578 0 54
No-long (12): 595 29 50
No-long (8): 597 29 48
LR (changes): 585 0 47
LR (indent): 496 0 136
Doublecheck (10): 610 0 22
Doublecheck (20): 609 0 23
In these tables, "Right" should be taken as determination of the language-standard tab width, "Wrong" as a non-zero tab width not equal to the language-standard width, and "None" as zero tab-width or no answer. "Mode" is the strategy of selecting the most frequently occurring change in indentation; "First" is taking the indentation of the first indented line; "No-long" is FastAl's strategy of excluding lines with large indentation and taking the mode, with the number indicating the maximum allowed indent change; "LR" is Patrick87's strategy based on linear regression, with variants based on the change in indentation between lines and on the absolute indentation of lines; "Doublecheck" (couldn't resist the pun!) is Mark's modification of FastAl's strategy, restricting the possible tab width and checking whether half the modal value also occurs frequently, with two different thresholds for selecting the smaller width.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
好吧,由于您想要一个与语言无关的解决方案,我们将无法使用任何语法提示。尽管您说您不想要一个完美的解决方案,但这里有一个对于大多数语言都运行良好的解决方案。
实际上,我必须解决密码学中的类似问题,才能在 多字母密码 中获得正确的代码字长度。这种加密是基本的凯撒加密(字母表中的每个字母都移动n个字母),其中密码用于以不同的方式移动字母(第n个字母)明文的字母被 mod(nth, length(cryptword)) 密码的字母移动。选择的武器是自相关。
该算法如下:
自相关是一个非常好的函数,可用于您想要检测数据流中的重复值的每种情况。它在信号处理中大量使用并且速度非常快(取决于信号重复的估计最大距离)。
是的,当时我用自相关破解了多表密文。 ;)
Okay, as you want a language-agnostic solution, we won't be able to use any syntactical hints. Although you said, that you don't want a perfect solution, here is one, that is working very well with most languages.
I actually had to solve a similar issue in cryptography to get the correct code word length in a polyalphabetic cipher. This kind of encryption is a basic Caesar-chiffre (each letter of the alphabet is moved by n letters), where the cryptword is used to move the letters differently (the nth letter of the clear text is moved by the mod(nth, length(cryptword)) letter of the cryptword). The weapon of choice is autocorrelation.
The algorithm would be like this:
The autocorrelation is a very nice function, usable for every situation, in which you want to detect repeating values in a stream of data. It is heavily used in signal processing and very fast (depending on the estimated maximum distance of signal repeats).
And yes, back then I cracked the polyalphabetic ciphertext with autocorrelation. ;)
编辑
我打开了 VB.Net(不是吗?:-) 这就是我的意思:
结果:
希望有帮助。
edit
I have VB.Net open (didn't you? :-) Here's what I mean:
Results:
Hope that helps.
您的选择(实际上)是 2、3、4、5、6、7、8。
我会使用 @FastAl 建议的内容扫描前 50-100 行左右。我可能倾向于盲目地从任何带有文本的行前面拉出空格计数,并计算空白字符串的长度。如果您有可用的正则表达式,则左修剪线和运行长度两次似乎是一种浪费。另外,我会执行 System.Math.abs(indent - previndent) 以便获得取消缩进的数据。正则表达式将是这样的:
一旦您获得了 7 个选项中计数最高的统计数据,就将其作为第一个猜测运行。对于 8、6 和 4,您应该检查 4 和 2、3 或 2 是否也有显着计数(第二名或超过 10% 或其他廉价启发式)。如果有很多 12(或 9)可能暗示 4(或 3)也是比 8(或 6)更好的选择。一次删除或添加超过 2 个级别(通常是折叠的结束括号)的情况非常罕见。
不相干的咕哝
我看到的一个问题是旧的 .c 代码尤其存在这种令人讨厌的模式:
恶心。我不知道你是如何对待这样的评论标准的。对于像“c”这样的代码,您可能必须处理 2.0 版本中的特殊注释...但我现在会忽略它。
您的最后一个问题是处理与您的假设不符的行。我的建议是将它们“标记”到深度,然后保留多余的空格。如果您必须更正,我会这样做:
rowtabdepth = heaven((rowspacecount - (tabwidth/2)) / tabwidth)
Your choices are (realistically) 2,3,4,5,6,7,8.
I'd scan the the first 50-100 lines or so using something like what @FastAl suggested. I'd probably lean toward just blindly pulling the spaces count from the front of any row with text and counting the length of the white space string. Left trimming lines and running length twice seems like a waste if you have regex available. Also, I'd do
System.Math.abs(indent - previndent)
so you get de-indent data. The regex would be this:Once you've got a statistic for which of the 7 options has the highest count, run with it as the first guess. For 8, 6, and 4 you should check to see if there is also a significant count (2nd place or over 10% or some other cheapo heuristic) for 4 and 2, 3, or 2. If there are a lot of 12s (or 9s) that might hint that 4 (or 3) is a better choice than 8 (or 6) as well. Dropping or adding more than 2 levels at a time (usually collapsed ending brackets) is super rare.
Irrelevant mumbling
The one problem I see is that old .c code in particular has this nasty pattern going on in it:
Yuck. I don't know how you deal with comment standards like that. For code that is "c" like you might have to deal with comments special in version 2.0... but I would just ignore it for now.
Your final issue is dealing with lines that don't match your assumptions. My suggestion would be to "tab" them to depth and then leave the extra spaces in place. If you have to correct I'd do this:
rowtabdepth = ceiling((rowspacecount - (tabwidth/2)) / tabwidth)
也许做类似的事情...
示例:
Maybe do something like...
Example:
对于您想要支持的每种语言,您需要进行一些解析:
1)排除注释(逐行或逐块,也许还嵌套?)
2)找到子块的开口(类C语言中的
{
,pascal中的begin
,shell中的do
等)然后看看子块打开后,空间数量增加了多少。进行一些简单的统计 - 找到最频繁的值、最大值和最小值、平均值。这样您还可以看到压痕是否规则以及有多少。
For each langauge you want to support, you'll need to do a bit of parsing:
1) exclude comments (either line-wise or block-wise, maybe also nested?)
2) find openings of sub-block (
{
in C-like languages,begin
in pascal,do
in shell etc.)Then just see how much the number of spaces increase after sub-block has been opened. Make some simple statistics - to find most frequent value, maximum and minimum value, average value. This way you can also see if the indentation is regular or not and how much.
作为基线,可以简单地计算所有缩进增加,并将最频繁的增加作为制表符宽度。作为一个 shell 脚本,编写每个管道阶段都有小的操作,它可能看起来像这样:
此实现是
O(n log(n))
其中n
是数字文件中的行数,但可以在O(n)
内轻松完成。As a baseline, one could simply calculate all indentation increases, and take the most frequent increase as the tab width. As a shell script, written to have small actions per pipeline stage, it could look like this:
This implementation is
O(n log(n))
wheren
is the number of lines in the file, but it could readily be done inO(n)
.启发式:
Python 脚本,采用文件名或标准输入并打印最佳缩进数:
Heuristic:
Python script, takes filenames or stdin and prints best indent number: