Sort a text file by line length (including spaces)
I have a CSV file that looks like this
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
I need to sort it by line length, including spaces. The following command doesn't include spaces; is there a way to modify it so it will work for me?
cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'
Answer
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
In both cases, we have solved your stated problem by moving away from awk for your final cut.
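As a sketch of what these two pipelines can look like (not necessarily the answer's original code; testfile is a placeholder name, and the pieces follow the description here: awk prepends the length, sort -n orders the lines, -s keeps ties stable, cut drops the prefix):

# stable variant: equal-length lines keep the order they had in the input
awk '{ print length, $0 }' testfile | sort -n -s | cut -d' ' -f2-

# variant with the original (perhaps unintentional) sub-sorting of equal-length lines
awk '{ print length, $0 }' testfile | sort -n | cut -d' ' -f2-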
Lines of matching length - what to do in the case of a tie:
The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input. (Those who want more control of sorting these ties might look at sort's --key option.)
Why the question's attempted solution fails (awk line-rebuilding):
It is interesting to note the difference between:
They yield respectively
The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:
"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
"This forces awk to rebuild the record."
Test input including some lines of equal length:
The AWK solution from neillb is great if you really want to use awk and it explains why it's a hassle there, but if what you want is to get the job done quickly and don't care what you do it in, one solution is to use Perl's sort() function with a custom comparison routine to iterate over the input lines. Here is a one-liner:
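A sketch of such a one-liner (not necessarily the exact command from the answer; testfile is a placeholder and can be dropped to read STDIN):

perl -e 'print sort { length($a) <=> length($b) } <>' testfile   # shortest lines first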
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.
In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.
Benchmark results
Below are the results of a benchmark across solutions from other answers to this question.
Test method
Results
- perl solution took 11.2 seconds
- perl solution took 11.6 seconds
- awk solution #1 took 20 seconds
- awk solution #2 took 23 seconds
- awk solution took 24 seconds
- awk solution took 25 seconds
- bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.
Another perl solution
Try this command instead:
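One possible command in this vein (a hypothetical sketch, not necessarily the answer's original; it computes each length only once via a Schwartzian transform, with testfile as a placeholder):

perl -e 'print map { $_->[1] } sort { $a->[0] <=> $b->[0] } map { [ length, $_ ] } <>' testfile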
Python Solution
Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).
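A sketch of an equivalent command (assuming input on STDIN; not necessarily the answer's original one-liner, and it runs on both Python 2 and 3):

python -c "import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))" < testfile   # key=len counts the trailing newline too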
Benchmark:
Pure Bash:
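A sketch of one bash-only way to do it (illustrative, not necessarily the answer's code; data.csv is a placeholder): bucket lines by length in an indexed array, then print the buckets in ascending index order.

declare -a buckets
while IFS= read -r line; do
  buckets[${#line}]+="$line"$'\n'      # ${#line} counts every character, including spaces
done < data.csv
for len in "${!buckets[@]}"; do        # indices of an indexed array expand in ascending order
  printf '%s' "${buckets[len]}"
done

Appending to a bucket keeps equal-length lines in their input order.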
The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).
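A sketch of such an adjusted pipeline (testfile is a placeholder; not necessarily the answer's exact command):

awk '{ printf "%d:%s\n", length($0), $0 }' testfile | sort -n | sed 's/^[0-9]*://'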
The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:
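And a sketch of the alternative that keeps the question's awk formatting (length, space, line), with sed stripping the leading digits and space:

awk '{ print length, $0 }' testfile | sort -n | sed 's/^[0-9]* //'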
I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):
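Applied to the length-prefix pipeline used elsewhere on this page, the suggested change is just the sort flag (a sketch; testfile is a placeholder):

awk '{ print length, $0 }' testfile | sort -g | cut -d' ' -f2-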
With POSIX Awk:
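The answer's code isn't shown here; a hypothetical POSIX-awk-only sketch (bucket lines by length inside awk, then print the buckets in ascending order) could be:

awk '{
  len = length($0)
  bucket[len] = (len in bucket) ? bucket[len] RS $0 : $0   # append, keeping input order within a length
  if (len > max) max = len
} END {
  for (i = 0; i <= max; i++)
    if (i in bucket) print bucket[i]                       # ascending length
}' testfile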
Example
1) pure awk solution. Let's suppose that line length cannot be more than 1024, then
cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'
2) one liner bash solution assuming all lines have just 1 word, but it can be reworked for any case where all lines have the same number of words:
LINES=$(cat filename); for k in $LINES; do printf "$k "; echo $k | wc -L; done | sort -k2 | head -n 1 | cut -d " " -f1
Using Raku (formerly known as Perl6)
To reverse the sort, add .reverse in the middle of the chain of method calls, immediately after .sort(). Here's code showing that .chars includes spaces:
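A sketch of the kind of Raku one-liners being described (not the answer's original code; testfile is a placeholder):

raku -e '.say for lines.sort(*.chars)' testfile            # shortest lines first
raku -e '.say for lines.sort(*.chars).reverse' testfile    # longest lines first
raku -e 'say "ab".chars; say "a b".chars'                  # prints 2 then 3: the space is counted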
Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:
HTH.
https://raku.org
Here is a multibyte-compatible method of sorting lines by length. It requires:
- wc -m is available to you (macOS has it).
- LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
- testfile has a character encoding matching your locale (e.g., UTF-8).
Here's the full command:
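Assembling the pieces explained below, the command would look roughly like this (a reconstruction, with testfile as in the requirements above):

LC_ALL=UTF-8 awk '{
  l = $0; gsub(/\047/, "\047\"\047\"\047", l)       # copy the line and escape every single quote
  cmd = sprintf("echo \047%s\047 | wc -m", l)       # shell command that echoes the line into wc -m
  cmd | getline c; close(cmd)                       # run it, read the character count, close the pipe
  sub(/ */, "", c)                                  # trim the leading whitespace wc prints
  print c, $0                                       # prepend the count to the original line
}' testfile | sort -ns | cut -d" " -f2-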
Explaining part-by-part:
- l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single-quote in octal notation).
- cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
- cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
- close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
- sub(/ */, "", c); ← trims white space from the character count value returned by wc.
- { print c, $0 } ← prints the line's character count value, a space, and the original line.
- | sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintains stable sort order (-s).
- | cut -d" " -f2- ← removes the prepended character count values.
It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
Revisiting this one. This is how I approached it (count length of LINE and store it as LEN, sort by LEN, keep only the LINE):
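A sketch along those lines (LINE and LEN as in the description; testfile is a placeholder, not the answer's original command):

while IFS= read -r LINE; do
  LEN=${#LINE}                          # length of LINE, spaces included
  printf '%s %s\n' "$LEN" "$LINE"
done < testfile | sort -n -k1,1 | cut -d' ' -f2-   # sort by LEN, keep only the LINE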
In the vein of "prefix line with its length and feed it to sort -n", here's a "bash native" solution (no awk, perl, or, shudder, python):