Sort a text file by line length (including spaces)

Posted on 2024-11-05 17:51:56

I have a CSV file that looks like this

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st.                                        110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56

I need to sort it by line length, including spaces. The following command doesn't
include spaces. Is there a way to modify it so it will work for me?

cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'

Comments (14)

耀眼的星火 2024-11-12 17:51:56

Answer

< testfile awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-

Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:

< testfile awk '{ print length, $0 }' | sort -n | cut -d" " -f2-

In both cases, we have solved your stated problem by moving away from awk for your final cut.

Lines of matching length - what to do in the case of a tie:

The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.

(Those who want more control of sorting these ties might look at sort's --key option.)
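
As a rough sketch of the --key idea, the pipeline below sorts numerically on the length prefix and then breaks ties on the rest of the line in byte order. The function name sort_by_length_then_text and the sample input are made up for illustration, and the short -k form of --key is used.

```shell
#!/bin/sh
# Hypothetical helper: sort stdin by line length, ties broken by line text.
sort_by_length_then_text() {
  awk '{ print length, $0 }' |   # prefix each line with its length
    LC_ALL=C sort -k1,1n -k2 |   # key 1: length (numeric); key 2: rest of the line (byte order)
    cut -d" " -f2-               # drop the length prefix again
}

# Demo with three equal-length lines that arrive unsorted.
printf 'ccz\ncca\ng\nccb\n' | sort_by_length_then_text
```

Dropping the second key and adding -s instead gives back the stable behaviour described above.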

Why the question's attempted solution fails (awk line-rebuilding):

It is interesting to note the difference between:

echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'

They yield respectively

hello   awk   world
hello awk world
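
The same rebuilding is what breaks the question's pipeline. A minimal sketch, using an invented sample line, that strips the length prefix once with a field assignment and once with cut:

```shell
#!/bin/sh
line='a   b  c'   # sample input with runs of spaces (made up for this demo)

# Assigning to $1 makes awk rebuild $0 with OFS, so the runs of spaces collapse.
rebuilt=$(printf '%s\n' "$line" | awk '{ print length, $0 }' | awk '{ $1 = ""; print $0 }')

# cut only removes the first space-delimited field and leaves the rest untouched.
kept=$(printf '%s\n' "$line" | awk '{ print length, $0 }' | cut -d" " -f2-)

printf '[%s]\n[%s]\n' "$rebuilt" "$kept"
```

The brackets in the output make the leading space and the collapsed spacing of the rebuilt record visible.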

The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:

"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"

 $1 = $1   # force record to be reconstituted
 print $0  # or whatever else with $0

"This forces awk to rebuild the record."

Test input including some lines of equal length:

aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len.  Orig pos = 1
500 dd equal len.  Orig pos = 2
ccz
cca
ee A line with  some       spaces
1   dd equal len.  Orig pos = 3
ff
5   dd equal len.  Orig pos = 4
g
花期渐远 2024-11-12 17:51:56

The AWK solution from neillb is great if you really want to use awk, and it explains why it's a hassle there. But if you just want to get the job done quickly and don't care what you do it in, one solution is to use Perl's sort() function with a custom comparison routine over the input lines. Here is a one-liner:

perl -e 'print sort { length($a) <=> length($b) } <>'

You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.

In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.

过潦 2024-11-12 17:51:56

Benchmark results

Below are the results of a benchmark across solutions from other answers to this question.

Test method

  • 10 sequential runs on a fast machine, averaged
  • Perl 5.24
  • awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
  • The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)

Results

  1. Caleb's perl solution took 11.2 seconds
  2. my perl solution took 11.6 seconds
  3. neillb's awk solution #1 took 20 seconds
  4. neillb's awk solution #2 took 23 seconds
  5. anubhava's awk solution took 24 seconds
  6. Jonathan's awk solution took 25 seconds
  7. Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.

Another perl solution

perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
二货你真萌 2024-11-12 17:51:56

Try this command instead:

awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
娇妻 2024-11-12 17:51:56

Python Solution

Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).

python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'

Benchmark:

python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
real    0m5.308s
user    0m3.733s
sys     0m1.490s

perl -e 'print sort { length($a) <=> length($b) } <>'
real    0m8.840s
user    0m7.117s
sys     0m2.279s
把梦留给海 2024-11-12 17:51:56

Pure Bash:

declare -a sorted

while IFS= read -r line; do                       # IFS= and -r keep spaces and backslashes intact
  if [ -z "${sorted[${#line}]+set}" ]; then       # does line length already exist?
    sorted[${#line}]="$line"                      # element for new length
  else
    sorted[${#line}]+=$'\n'"$line"                # append to lines with equal length
  fi
done < data.csv

for key in "${!sorted[@]}"; do                    # iterate over existing indices
  printf '%s\n' "${sorted[$key]}"                 # print lines with equal length
done
月下伊人醉 2024-11-12 17:51:56

The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding the useless use of cat, UUOC).

awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'

The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:

awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
请止步禁区 2024-11-12 17:51:56

I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):

awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
爱她像谁 2024-11-12 17:51:56

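With POSIX Awk:

{
  c = length
  m[c] = (c in m) ? m[c] RS $0 : $0   # group lines by length, keeping input order
  if (c > max) max = c                # remember the longest length seen
} END {
  # for (c in m) visits indices in an unspecified order, so count upwards instead
  for (c = 0; c <= max; c++) if (c in m) print m[c]
}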

世界和平 2024-11-12 17:51:56

1) Pure awk solution. Suppose that no line is longer than 1024 characters; then

cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'

2) A one-liner bash solution, assuming every line is a single word (it can be reworked for any case where all lines have the same number of words):

LINES=$(cat filename); for k in $LINES; do printf '%s ' "$k"; echo "$k" | wc -L; done | sort -k2n | head -n 1 | cut -d " " -f1

Note that both commands print only the shortest line; they do not sort the whole file.

饮湿 2024-11-12 17:51:56

Using Raku (formerly known as Perl 6):

~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st.                                        110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56

To reverse the sort, add .reverse in the middle of the chain of method calls--immediately after .sort(). Here's code showing that .chars includes spaces:

~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0

Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:

~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
    
    real    0m1.308s
    user    0m1.213s
    sys 0m0.173s
    
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-  > /dev/null
    
    real    0m1.189s
    user    0m1.170s
    sys 0m0.050s

HTH.

https://raku.org

孤独难免 2024-11-12 17:51:56

Here is a multibyte-compatible method of sorting lines by length. It requires:

  1. wc -m is available to you (macOS has it).
  2. Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
  3. testfile has a character encoding matching your locale (e.g., UTF-8).

Here's the full command:

cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-

Explaining part-by-part:

  • l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single quote in octal notation).
  • cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
  • cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
  • close(cmd); ← closes the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
  • sub(/ */, "", c); ← trims white space from the character count value returned by wc.
  • { print c, $0 } ← prints the line's character count value, a space, and the original line.
  • | sort -ns ← sorts the lines numerically (-n) by the prepended character count values while maintaining a stable sort order (-s).
  • | cut -d" " -f2- ← removes the prepended character count values.

It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.

Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
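
Another option that avoids both the per-line subprocess and installing gawk is the shell's own ${#line} expansion, which in bash counts characters rather than bytes when the locale is a UTF-8 one. This is a different technique from the wc -m approach above. It is sketched here on the assumption that bash is the shell and the locale is already set as described; the function name sort_by_char_count and the sample input are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prefix each line with ${#line}, sort on it, strip it again.
sort_by_char_count() {
  while IFS= read -r line || [ -n "$line" ]; do   # also handles a last line without a newline
    printf '%d %s\n' "${#line}" "$line"           # character count, a space, the original line
  done | sort -n -s | cut -d" " -f2-              # numeric, stable sort; then drop the prefix
}

# Demo: two lines of equal length keep their input order.
printf 'ccc\na b\nzz\nq\n' | sort_by_char_count
```

Because the loop runs inside the shell, no external command is started per line; only sort and cut are spawned once for the whole input.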

离旧人 2024-11-12 17:51:56

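Revisiting this one. This is how I approached it (count the length of LINE in bytes and store it as LEN, sort by LEN, keep only the LINE). LEN goes in front of the line so that cut can strip it again without touching the spaces inside LINE:

cat test.csv | while IFS= read -r LINE; do LEN=$(printf '%s' "$LINE" | wc -c); printf '%d %s\n' "$LEN" "$LINE"; done | sort -n -s | cut -d ' ' -f 2-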
阿楠 2024-11-12 17:51:56

In the vein of "prefix each line with its length and feed it to sort -n", here's a "bash native" solution (no awk, perl, or (shudder) python):

cat FILE | while IFS= read -r line;
  do printf '%s\n' "${#line}:$line";
  done | sort -n | cut -d: -f2-