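Splitting the lines of a file into a 2D array, where the row elements are determined by string length because there are no delimiters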

Posted 2024-10-19 06:08:48, 2690 characters, 8 views, 0 comments

I have a data file being read in using file() and iterate over each row. I need to be able to split each string into an array of "columns". The problem is that the columns are not of equal width (60 chars, 24 chars, 16 chars), and all the functions I have found for this seem to expect columns of one fixed size.

This will be performed on a large data file quite regularly so optimal performance is desired.

Example of data.

XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX

Desired result:

array (
  0 => 
  array (
    0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX                                  ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  1 => 
  array (
    0 => 'XXXXXXXXX                                                   ',
    1 => 'XXX XXX                 ',
    2 => 'X         XXX',
  ),
  2 => 
  array (
    0 => 'XXXXXXXXXXXXXXX                                             ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  3 => 
  array (
    0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX                                  ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  4 => 
  array (
    0 => 'XXXXXXXXX                                                   ',
    1 => 'XXX XXX                 ',
    2 => 'X         XXX',
  ),
  5 => 
  array (
    0 => 'XXXXXXXXXXXXXXX                                             ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  6 => 
  array (
    0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX                                  ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  7 => 
  array (
    0 => 'XXXXXXXXX                                                   ',
    1 => 'XXX XXX                 ',
    2 => 'X         XXX',
  ),
  8 => 
  array (
    0 => 'XXXXXXXXXXXXXXX                                             ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
)


Comments (5)

转身泪倾城 2024-10-26 06:08:48

The straightforward method is to use substr() to split up the columns, here with the widths given in the question (60, 24, 16):

$rows = array();
foreach (file($fn, FILE_IGNORE_NEW_LINES) as $i => $line) {
    // substr() takes a start offset and a length
    $rows[$i] = array(substr($line, 0, 60), substr($line, 60, 24), substr($line, 84, 16));
}

Contrary to common wisdom, it may be faster to let PCRE split the whole file with a single regular expression, though that is worth benchmarking on your own data:

preg_match_all('/^(.{60})(.{24})(.{0,16})\K/m', file_get_contents($fn), $rows, PREG_SET_ORDER);

The disadvantage here is that each row contains an empty [0] (without the \K it would have contained the original line), and the data columns start at index [1].
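If the empty [0] entry is unwanted, it can be dropped after matching. The following is only a minimal sketch under stated assumptions: the sample data is made up, the widths 60/24/16 come from the question, and this variant leaves out \K and anchors on the end of the line instead, so [0] holds the full line before it is sliced off.

```php
<?php
// Assumed widths from the question: 60, 24 and up to 16 characters.
$pattern = '/^(.{60})(.{24})(.{0,16})$/m';

// Two made-up sample lines, padded to the assumed column widths.
$data = str_pad('AAAA', 60) . str_pad('BBB BBB', 24) . "CC DD\n"
      . str_pad('EE', 60) . str_pad('F', 24) . "G H\n";

preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

// Drop element 0 (the full match) so each row holds only the three columns.
$rows = array();
foreach ($matches as $m) {
    $rows[] = array_slice($m, 1);
}

var_export($rows);
```

The extra array_slice() pass costs some time, so whether this still beats the substr() loop is something to measure rather than assume.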

转角预定愛 2024-10-26 06:08:48

The only way you could reliably do this is if there is some separator already in the file.

explode() splits a string on a separator, so if you know your file columns are tab separated, you can use
explode("\t", $string)
to get an array of the columns. Note the double quotes: in single quotes, '\t' is a literal backslash followed by t rather than a tab.

Other than that, I can't think of a reliable way to pull out variable-sized columns without knowing the sizes beforehand.
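As a small illustration of the explode() approach, here is a sketch on a made-up tab-separated line (the variable names and the data are assumptions, not taken from the question). It also shows why the quoting of the separator matters.

```php
<?php
// A made-up tab-separated line; the "\t" escapes are real tab characters
// because the string is double-quoted.
$line = "alpha\tbeta\tgamma";

// Double quotes: splits on the tab character.
$columns = explode("\t", $line);

// Single quotes: '\t' is a backslash followed by "t", which does not occur
// in $line, so the whole line comes back as a single element.
$unsplit = explode('\t', $line);

var_export($columns);
```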

只为守护你 2024-10-26 06:08:48

After your comments on my previous answer, it appears that substr() is all you need.

If you know the width of each column for every line just do something like:

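// $lines is the array of lines read from the file, e.g. with file().
// The third argument of substr() is a length, not an end position;
// the offsets and lengths below use the widths from the question (60, 24, 16).
$rows = array();
foreach ($lines as $line)
{
  $columns = array();
  array_push($columns, substr($line, 0, 60));
  array_push($columns, substr($line, 60, 24));
  array_push($columns, substr($line, 84, 16));
  array_push($rows, $columns);
}
// Do something with your array of rows ($rows), each an array of columns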
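
The same idea can be written so that the start offsets are derived from a list of widths instead of being typed by hand. This is only a sketch: split_fixed_width() is a hypothetical helper name, the sample line is made up, and the widths are the ones stated in the question.

```php
<?php
/**
 * Split one line into columns of the given widths (hypothetical helper).
 */
function split_fixed_width($line, array $widths)
{
    $columns = array();
    $offset = 0;
    foreach ($widths as $width) {
        // The third argument of substr() is a length, not an end position.
        // The cast covers older PHP versions where substr() may return false.
        $columns[] = (string) substr($line, $offset, $width);
        $offset += $width;
    }
    return $columns;
}

// Made-up sample line padded to the widths from the question (60, 24, 16).
$line = str_pad('AAAA', 60) . str_pad('BBB', 24) . 'CC DD';
$columns = split_fixed_width($line, array(60, 24, 16));
var_export($columns);
```

Keeping the widths in one array means a format change only touches one place.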
把人绕傻吧 2024-10-26 06:08:48

Parsing the lines of a predictably formatted file is what fscanf() was designed to do. The format string below captures three runs of non-newline characters, of at most 60, 24 and 16 characters respectively.

If three substrings are captured, those values are saved as a 3-element row in the result array. The loop stops at the first line that does not yield all three.

If you don't want to unset() $row in the loop, you can explicitly key the assignments as $row[0], $row[1], $row[2].

Code:

$result = [];
$handle = fopen('data.txt', 'r');
while (fscanf($handle, "%60[^\n]%24[^\n]%16[^\n]", $row[], $row[], $row[]) === 3) {
    $result[] = $row;
    unset($row);
}
fclose($handle);
var_export($result);
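
The explicitly keyed variant mentioned above might look roughly like this. To keep the sketch self-contained it swaps in sscanf(), the string counterpart of fscanf(), and runs on made-up in-memory lines rather than on data.txt; unlike the loop above, it skips a line that does not yield three values instead of stopping.

```php
<?php
// Made-up sample lines padded to the widths from the question (60, 24, 16).
$lines = array(
    str_pad('AAAA', 60) . str_pad('BBB BBB', 24) . 'CC DD',
    str_pad('EE', 60) . str_pad('F', 24) . 'G H',
);

$result = array();
foreach ($lines as $line) {
    $row = array();
    // Explicit keys mean $row does not need to be unset between iterations.
    $count = sscanf($line, "%60[^\n]%24[^\n]%16[^\n]", $row[0], $row[1], $row[2]);
    if ($count === 3) {
        $result[] = $row;
    }
}
var_export($result);
```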
风吹雪碎 2024-10-26 06:08:48

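This is what I came up with, assuming that the column widths are not known ahead of time. It treats every character position that holds no X in any line as a column break, so a run of spaces inside a column that lines up across every row (as in the third column of the sample data) is also treated as a break.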

<?php

$data = 'XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX';

$dataLines = explode("\n", $data);

// detect column breaks
$numDataLines = count($dataLines);
$colBreaks = array();
$c = 0;
while (true) {
    $rowEnds = 0; // count how many rows have terminated in the current col.
    $notSet = 0; // a special case of $rowEnds, when the line no longer has     
                 // chars.
    // run down each column. if there are no X's, then it is a col break.
    for ($r = 0; $r < $numDataLines; ++$r) {
        if (!isset($dataLines[$r][$c])) {
            ++$notSet;
            ++$rowEnds;
        } elseif ($dataLines[$r][$c] != 'X') {
            ++$rowEnds;
        }
    }
    // if no lines have chars left, end the while loop. this counts as a col 
    // break.
    if ($notSet == $numDataLines) {
        $colBreaks[] = $c;
        break;
    }
    // if no X's were in the line, this is a col break.
    if ($rowEnds == $numDataLines) {
        $colBreaks[] = $c;
    }
    ++$c; // move on to the next col
}

// now that we have all the col breaks, we simply iterate over them and slice
// out the needed sections from each line to create the columns.
$dataCols = array();
$left = 0;
foreach ($colBreaks as $cb) {
    // skip empty cols
    if ($left == $cb) {
        $left = $cb + 1;
        continue;
    }
    $colLen = $cb - $left;
    $dataCol = array();
    echo "left: $left, len: $colLen, cb: $cb\n";
    foreach ($dataLines as $dl) {
        $dataCol[] = substr($dl, $left, $colLen);
    }
    $dataCols[] = implode("\n", $dataCol);
    $left += $colLen + 1;
}

// tada!
print_r($dataCols);
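
The break-detection step on its own can be sketched more compactly. The helper name blank_positions() and the sample lines are made up for illustration, and unlike the code above this version tests for spaces rather than for the literal character 'X'.

```php
<?php
/**
 * Return the character positions that hold a space (or nothing) in every line.
 * Hypothetical helper for illustration only.
 */
function blank_positions(array $lines)
{
    $maxLen = 0;
    foreach ($lines as $line) {
        $maxLen = max($maxLen, strlen($line));
    }

    $blank = array();
    for ($c = 0; $c < $maxLen; ++$c) {
        $allBlank = true;
        foreach ($lines as $line) {
            // A position past the end of a shorter line counts as blank.
            if (isset($line[$c]) && $line[$c] !== ' ') {
                $allBlank = false;
                break;
            }
        }
        if ($allBlank) {
            $blank[] = $c;
        }
    }
    return $blank;
}

// Made-up sample: two columns separated by a shared run of spaces.
$lines = array(
    'abc   de',
    'a     d',
);
var_export(blank_positions($lines));
```

Scanning every position of every line is far more work than slicing at known offsets, so this kind of detection is better run once to discover the widths than on every pass over a large file.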