将文件的行分割成二维,其中行元素由字符串长度确定,因为没有分隔符
我使用 file()
读取数据并迭代每一行。需要能够将字符串拆分为“列”数组。问题是列的宽度不均匀(60 个字符、24 个字符、16 个字符)。似乎所有执行此操作的函数都期望列具有固定大小。
这将定期对大型数据文件执行,因此需要最佳性能。
数据示例。
XXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXX XXX XXX X XXX
XXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXX XXX XXX X XXX
XXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXX XXX XXX X XXX
XXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
期望的结果:
array (
0 =>
array (
0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
1 =>
array (
0 => 'XXXXXXXXX ',
1 => 'XXX XXX ',
2 => 'X XXX',
),
2 =>
array (
0 => 'XXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
3 =>
array (
0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
4 =>
array (
0 => 'XXXXXXXXX ',
1 => 'XXX XXX ',
2 => 'X XXX',
),
5 =>
array (
0 => 'XXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
6 =>
array (
0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
7 =>
array (
0 => 'XXXXXXXXX ',
1 => 'XXX XXX ',
2 => 'X XXX',
),
8 =>
array (
0 => 'XXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
)
I have a data fine being read in using file()
and iterate over each row. Need to be able to split the string into an array of "columns". Problem is the columns are not even widths (60 chars, 24 chars, 16 chars). Seems like all the functions to do this expect that the columns are a fixed size.
This will be performed on a large data file quite regularly so optimal performance is desired.
Example of data.
XXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXX XXX XXX X XXX
XXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXX XXX XXX X XXX
XXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
XXXXXXXXX XXX XXX X XXX
XXXXXXXXXXXXXXX XXXXXXXXXXXXX XX XXXXXX
Desired result:
array (
0 =>
array (
0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
1 =>
array (
0 => 'XXXXXXXXX ',
1 => 'XXX XXX ',
2 => 'X XXX',
),
2 =>
array (
0 => 'XXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
3 =>
array (
0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
4 =>
array (
0 => 'XXXXXXXXX ',
1 => 'XXX XXX ',
2 => 'X XXX',
),
5 =>
array (
0 => 'XXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
6 =>
array (
0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
7 =>
array (
0 => 'XXXXXXXXX ',
1 => 'XXX XXX ',
2 => 'X XXX',
),
8 =>
array (
0 => 'XXXXXXXXXXXXXXX ',
1 => 'XXXXXXXXXXXXX ',
2 => 'XX XXXXXX',
),
)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
最简单的方法是使用 substr 来分割列:
但与常识相反,使用 PCRE 和正则表达式来分割字符串会更快:
这里的缺点是每行包含一个空的
[ 0]
(将包含原始行),数据列从索引[1]
开始。The straightforward method would be using substr to split up the columns:
But contrary to common wisdom it would be faster to use PCRE and a regular expression to split up the string:
The disadvantage here is that it each row contains an empty
[0]
(would have contained the original line), and the data columns start at index[1]
.可以可靠地做到这一点的唯一方法是文件中已有一些分隔符。
explode()
在分隔符上分割字符串,因此如果您知道文件列是制表符分隔的,您可以爆炸('\t',$string)
获取列的数组。
除此之外,我想不出可靠的方法可以让您在事先不知道大小的情况下拉出可变大小的列。
The only way you could reliably do this is if there is some separator already in the file.
explode()
splits strings on a separator, so if you know your file columns are tab separated, you canexplode('\t',$string)
to get an array of the columns.
Other than that, there's no reliable way I can think of that would let you pull out variable sized columns without previously knowing the size.
在您对我之前的答案发表评论之后,看来
substr()
就是您所需要的。如果您知道每行每列的宽度,只需执行以下操作:
After your comments on my previous answer, it appears that
substr()
is all you need.If you know the width of each column for every line just do something like:
解析可预测格式的文件的行是 fscanf() 的设计目的。捕获指定长度的非换行符; 3次。
如果捕获了 3 个子字符串,请将这些值保存为结果数组中的 3 元素行。
如果您不想
unset()
循环中的$row
,您可以将赋值显式设置为$row[0], $row[ 1],$row[2]
。代码:(演示)
Parsing the lines of a predictably formatted file is something that
fscanf()
was designed to do. Capture non-newline characters of designated length; 3 times.If 3 substrings are captured, save those values as a 3-element row in the result array.
If you didn't want to
unset()
the$row
in the loop, you could explicitly key the assignments as$row[0], $row[1], $row[2]
.Code: (Demo)
这就是我想出来的。我假设列宽提前未知。
This is what I came up with. I assumed that the column widths are not known ahead of time.