如何读取固定宽度数据？

发布于 2025-01-22 04:29:43 字数 514 浏览 3 评论 0原文

数据看起来

212253820000025000.00000002500.00000000375.00111120211105202117
212456960000000750.00000000075.00000000011.25111120211102202117
212387470000010000.00000001000.00000000150.00111120211105202117

需要添加分离器，例如

21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17

CSV文件长度高接近20000行，有可能要这样做

原文

data looks like

212253820000025000.00000002500.00000000375.00111120211105202117
212456960000000750.00000000075.00000000011.25111120211102202117
212387470000010000.00000001000.00000000150.00111120211105202117

need to add separator like

21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17

The CSV file length is high nearly 20000 rows are there is there any possibility to do

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

寄与心 2025-01-29 04:29:43

这个问题通常是关于阅读“ 固定宽度数据“。

如果您遇到此数据，则需要按行分析IT行，然后按列列。我将向您展示如何与Python一起做。

首先，您在评论中计算的列与样本输出不符。您似乎已经用2个字符的计数省略了最后一列。

您需要准确的列宽度来执行任务。我获取了您的示例数据并为您计算了这些列并得到了这些数字：

8, 13, 12, 12, 8, 8, 2

因此，我们将按行读取输入数据，对于每一行，我们将：

阅读8个字符并将其保存为一列，然后将13个字符保存为13个字符并将其保存为列，然后将其保存12个字符等...直到我们读取了所有指定的列宽度
时，我们将在线路上移动时，我们将使用变量来跟踪我们的位置beg 结束表示列在哪里开始（包含）及其结束的位置（独家）
第一列的末尾成为下一个开始的开始，依此类推，依此类推，
我们将存储这些列在列表（数组）中，这是
行末尾的新行，我们将将新行保存到所有行的列表中
，然后我们重复下一行的过程，

这是Python中的外观：

import pprint

Col_widths = [8, 13, 12, 12, 8, 8, 2]

all_rows = []
with open("data.txt") as in_file:
    for line in in_file:
        row = []
        beg = 0
        for width in Col_widths:
            end = beg + width
            col = line[beg:end]
            row.append(col)
            beg = end
        all_rows.append(row)

pprint.pprint(all_rows, width=100)

all_rows只是文本列表的列表：

[['21225382', '0000025000.00', '000002500.00', '000000375.00', '11112021', '11052021', '17'],
 ['21245696', '0000000750.00', '000000075.00', '000000011.25', '11112021', '11022021', '17'],
 ['21238747', '0000010000.00', '000001000.00', '000000150.00', '11112021', '11052021', '17']]

使用此方法，如果您误缩了列宽度或列数，则可以轻松地修改column_widths以匹配数据。

从这里开始，我们将使用Python的CSV模块来确保正确编写CSV文件：

import csv

with open("data.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerows(all_rows)

我的Data.csv文件看起来像：

21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17

This question is generally about reading "fixed width data".

If you're stuck with this data, you'll need to parse it line by line then column by column. I'll show you how to do this with Python.

First off, the columns you counted off in the comment do not match your sample output. You seemed to have omitted the last column with a count of 2 characters.

You'll need accurate column widths to perform the task. I took your sample data and counted the columns for you and got these numbers:

8, 13, 12, 12, 8, 8, 2

So, we'll read the input data line by line, and for every line we'll:

Read 8 chars and save it as a column, then 13 chars and save it as a column, then 12 chars, etc... till we've read all the specified column widths
As we move through the line we'll keep track of our position with the variables beg and end to denote where a column begins (inclusive) and where it ends (exclusive)
The end of the first column becomes the beginning of the next, and so on down the line
We'll store those columns in a list (array) that is the new row
At the end of the line we'll save the new row to a list of all the rows
Then, we'll repeat the process for the next line

Here's how this looks in Python:

import pprint

Col_widths = [8, 13, 12, 12, 8, 8, 2]

all_rows = []
with open("data.txt") as in_file:
    for line in in_file:
        row = []
        beg = 0
        for width in Col_widths:
            end = beg + width
            col = line[beg:end]
            row.append(col)
            beg = end
        all_rows.append(row)

pprint.pprint(all_rows, width=100)

all_rows is just a list of lists of text:

[['21225382', '0000025000.00', '000002500.00', '000000375.00', '11112021', '11052021', '17'],
 ['21245696', '0000000750.00', '000000075.00', '000000011.25', '11112021', '11022021', '17'],
 ['21238747', '0000010000.00', '000001000.00', '000000150.00', '11112021', '11052021', '17']]

With this approach, if you miscounted the column width or the number of columns you can easily modify the Column_widths to match your data.

From here we'll use Python's CSV module to make sure the CSV file is written correctly:

import csv

with open("data.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerows(all_rows)

and my data.csv file looks like:

21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17

回复收藏 0 原文

绅士风度i 2025-01-29 04:29:43

如果您可以访问命令行工具尴尬，则可以如下修复数据：

substr（）给出一部分字符串$ 0，这是整个行
您从char 1开始，然后指定第一列的宽度，8
下一个substr（），您再次使用$ 0，从9
开始在“最后一列的开始以及最后一列的宽度”

#!/bin/sh

# Col_widths = [8, 13, 12, 12, 8, 8, 2]

awk '{print substr($0,1,8) "," substr($0,9,13) "," substr($0,22,12) "," substr($0,34,12) "," substr($0,46,8) "," substr($0,54,8) "," substr($0,62,2)}' data.txt > data.csv

If you have access to the command-line tool awk, you can fix your data like the following:

substr() gives a portion of the string $0, which is the entire line
you start at char 1 then specify the width of your first column, 8
for the next substr(), you again use $0, you start at 9 (1+8 from the last substr), and give it the second column's width, 13
and repeat for each column, starting at "the start of the last column plus the last column's width"

#!/bin/sh

# Col_widths = [8, 13, 12, 12, 8, 8, 2]

awk '{print substr($0,1,8) "," substr($0,9,13) "," substr($0,22,12) "," substr($0,34,12) "," substr($0,46,8) "," substr($0,54,8) "," substr($0,62,2)}' data.txt > data.csv

回复收藏 0 原文

~没有更多了~