如何将文件分成相等的部分而不破坏单独的行?

发布于 2024-12-09 16:38:51 字数 188 浏览 0 评论 0原文

我想知道是否可以将文件分割成相等的部分(编辑: = 除了最后一个之外全部相等),而不破坏行?在 Unix 中使用 split 命令,行可能会分成两半。有没有一种方法可以将一个文件分成 5 个相等的部分,但仍然只包含整行(如果其中一个文件稍大或稍小也没有问题)?我知道我可以只计算行数,但我必须对 bash 脚本中的很多文件执行此操作。非常感谢!

I was wondering if it was possible to split a file into equal parts (edit: = all equal except for the last), without breaking the line? Using the split command in Unix, lines may be broken in half. Is there a way to, say, split up a file in 5 equal parts, but have it still only consist of whole lines (it's no problem if one of the files is a little larger or smaller)? I know I could just calculate the number of lines, but I have to do this for a lot of files in a bash script. Many thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

柒夜笙歌凉 2024-12-16 16:38:51

如果您的意思是相同数量的行, split 有一个选项:

split --lines=75

如果您需要知道 75 真正应该用于 < code>N 等份,其:

lines_per_part = int(total_lines + N - 1) / N

其中总行数可以通过 wc -l 获得。

请参阅以下脚本的示例:

#!/usr/bin/bash

# Configuration stuff

fspec=qq.c
num_files=6

# Work out lines per file.

total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))

# Split the actual file, maintaining lines.

split --lines=${lines_per_file} ${fspec} xyzzy.

# Debug information

echo "Total lines     = ${total_lines}"
echo "Lines  per file = ${lines_per_file}"    
wc -l xyzzy.*

此输出:

Total lines     = 70
Lines  per file = 12
  12 xyzzy.aa
  12 xyzzy.ab
  12 xyzzy.ac
  12 xyzzy.ad
  12 xyzzy.ae
  10 xyzzy.af
  70 total

split 的最新版本允许您使用 -n/--number< 指定多个 CHUNKS /代码> 选项。因此,您可以使用类似以下内容的内容:(

split --number=l/6 ${fspec} xyzzy.

ell-slash-6,意思是,而不是one-slash-6)。

这将为您提供大小大致相等的文件,并且没有中线分割。

我提到最后一点是因为它不会为您提供每个文件中大致相同的数,以及更多相同的字符数。

因此,如果您有一个 20 - 字符行和 19 个 1 字符行(总共 20 行)并拆分为五个文件,您很可能不会在每个文件中得到四行。

If you mean an equal number of lines, split has an option for this:

split --lines=75

If you need to know what that 75 should really be for N equal parts, its:

lines_per_part = int(total_lines + N - 1) / N

where total lines can be obtained with wc -l.

See the following script for an example:

#!/usr/bin/bash

# Configuration stuff

fspec=qq.c
num_files=6

# Work out lines per file.

total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))

# Split the actual file, maintaining lines.

split --lines=${lines_per_file} ${fspec} xyzzy.

# Debug information

echo "Total lines     = ${total_lines}"
echo "Lines  per file = ${lines_per_file}"    
wc -l xyzzy.*

This outputs:

Total lines     = 70
Lines  per file = 12
  12 xyzzy.aa
  12 xyzzy.ab
  12 xyzzy.ac
  12 xyzzy.ad
  12 xyzzy.ae
  10 xyzzy.af
  70 total

More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:

split --number=l/6 ${fspec} xyzzy.

(that's ell-slash-six, meaning lines, not one-slash-six).

That will give you roughly equal files in terms of size, with no mid-line splits.

I mention that last point because it doesn't give you roughly the same number of lines in each file, more the same number of characters.

So, if you have one 20-character line and 19 1-character lines (twenty lines in total) and split to five files, you most likely won't get four lines in every file.

一袭水袖舞倾城 2024-12-16 16:38:51

一个简单问题的简单解决方案:

split -n l/5 your_file.txt

这里不需要编写脚本。

man 文件中,CHUNKS 可能是:< /code>

l/N     split into N files without splitting lines

更新

并非所有 UNIX dist 都包含此标志。例如,它在 OSX 中不起作用。要使用它,您可以考虑替换Mac OS X 实用程序与 GNU 核心实用程序

A simple solution for a simple question:

split -n l/5 your_file.txt

no need for scripting here.

From the man file, CHUNKS may be:

l/N     split into N files without splitting lines

Update

Not all unix dist include this flag. For example, it will not work in OSX. To use it, you can consider replacing the Mac OS X utilities with GNU core utilities.

冷心人i 2024-12-16 16:38:51

该脚本甚至不是必需的,split(1) 支持开箱即用的所需功能:
split -l 75 auth.log auth.log。
上面的命令将文件分割成 75 行的块,并以以下形式输出文件: auth.log.aa, auth.log.ab, ...

wc -l< /code> 原始文件和输出给出:

  321 auth.log
   75 auth.log.aa
   75 auth.log.ab
   75 auth.log.ac
   75 auth.log.ad
   21 auth.log.ae
  642 total

The script isn't even necessary, split(1) supports the wanted feature out of the box:
split -l 75 auth.log auth.log.
The above command splits the file in chunks of 75 lines a piece, and outputs file on the form: auth.log.aa, auth.log.ab, ...

wc -l on the original file and output gives:

  321 auth.log
   75 auth.log.aa
   75 auth.log.ab
   75 auth.log.ac
   75 auth.log.ad
   21 auth.log.ae
  642 total
冷情妓 2024-12-16 16:38:51

split 在 coreutils 版本 8.8 中更新(2010 年 12 月 22 日宣布) --number 选项生成特定数量的文件。选项 --number=l/n 生成 n 个文件而不分割行。

coreutils 手册

split was updated in coreutils release 8.8 (announced 22 Dec 2010) with the --number option to generate a specific number of files. The option --number=l/n generates n files without splitting lines.

coreutils manual

深陷 2024-12-16 16:38:51

我制作了一个 bash 脚本,给出了多个部分作为输入,分割了

#!/bin/sh

parts_total="$2";
input="$1";

parts=$((parts_total))
for i in $(seq 0 $((parts_total-2))); do
  lines=$(wc -l "$input" | cut -f 1 -d" ")
  #n is rounded, 1.3 to 2, 1.6 to 2, 1 to 1
  n=$(awk  -v lines=$lines -v parts=$parts 'BEGIN { 
    n = lines/parts;
    rounded = sprintf("%.0f", n);
    if(n>rounded){
      print rounded + 1;
    }else{
      print rounded;
    }
  }');
  head -$n "$input" > split${i}
  tail -$((lines-n)) "$input" > .tmp${i}
  input=".tmp${i}"
  parts=$((parts-1));
done
mv .tmp$((parts_total-2)) split$((parts_total-1))
rm .tmp*

我使用 headtail 命令的文件,并存储在 tmp 文件中,用于分割文件

#10 means 10 parts
sh mysplitXparts.sh input_file 10

或使用 awk,其中 0.1 是 10% => 10 份,或 0.334 为 3 份

awk -v size=$(wc -l < input) -v perc=0.1 '{
  nfile = int(NR/(size*perc)); 
  if(nfile >= 1/perc){
    nfile--;
  } 
  print > "split_"nfile
}' input

I made a bash script, that given a number of parts as input, split a file

#!/bin/sh

parts_total="$2";
input="$1";

parts=$((parts_total))
for i in $(seq 0 $((parts_total-2))); do
  lines=$(wc -l "$input" | cut -f 1 -d" ")
  #n is rounded, 1.3 to 2, 1.6 to 2, 1 to 1
  n=$(awk  -v lines=$lines -v parts=$parts 'BEGIN { 
    n = lines/parts;
    rounded = sprintf("%.0f", n);
    if(n>rounded){
      print rounded + 1;
    }else{
      print rounded;
    }
  }');
  head -$n "$input" > split${i}
  tail -$((lines-n)) "$input" > .tmp${i}
  input=".tmp${i}"
  parts=$((parts-1));
done
mv .tmp$((parts_total-2)) split$((parts_total-1))
rm .tmp*

I used head and tail commands, and store in tmp files, for split the files

#10 means 10 parts
sh mysplitXparts.sh input_file 10

or with awk, where 0.1 is 10% => 10 parts, or 0.334 is 3 parts

awk -v size=$(wc -l < input) -v perc=0.1 '{
  nfile = int(NR/(size*perc)); 
  if(nfile >= 1/perc){
    nfile--;
  } 
  print > "split_"nfile
}' input
吃不饱 2024-12-16 16:38:51
var dict = File.ReadLines("test.txt")
               .Where(line => !string.IsNullOrWhitespace(line))
               .Select(line => line.Split(new char[] { '=' }, 2, 0))
               .ToDictionary(parts => parts[0], parts => parts[1]);


or 

    enter code here

line="[email protected][email protected]";
string[] tokens = line.Split(new char[] { '=' }, 2, 0);

ans:
tokens[0]=to
token[1][email protected][email protected]"
var dict = File.ReadLines("test.txt")
               .Where(line => !string.IsNullOrWhitespace(line))
               .Select(line => line.Split(new char[] { '=' }, 2, 0))
               .ToDictionary(parts => parts[0], parts => parts[1]);


or 

    enter code here

line="[email protected][email protected]";
string[] tokens = line.Split(new char[] { '=' }, 2, 0);

ans:
tokens[0]=to
token[1][email protected][email protected]"
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文