Creating neat CSV files by filtering data and unnecessary information out of TXT files

Posted 2025-01-31 05:56:16

I have an assignment to export neat CSV files that contain only the headers and the data; everything else must be filtered out. There are more than 500 text files.

Each file must be a separate CSV file, the format must be "YEAR-MONTH-DAY (ORIGINAL_FILE_NAME)".

An example of this is:
Original file: pm990902.b17

CSV file: 1999-09-02 (pm990902.b17).csv

I already have code for filtering the data:

import pandas as pd
import numpy as np
import glob

# Skip the first 192 lines (row indices 0-191) of each file
pred = lambda x: x in np.arange(0, 192, 1)
# Placeholder values that should be read as NaN
inval = [99999.9, 999.0, 999.9900, 999.9]
# Raw string so the backslashes in the Windows path are taken literally
files = glob.glob(r'C:\Users\Lenovo\Desktop\Python\Files\*')
for file in files:
    df = pd.read_csv(file, header=0, delim_whitespace=True, skiprows=pred,
                     engine='python', na_values=inval)
    df = df[1:]
    df.to_csv('Name of the new file.csv', index=False)

I still can't figure out how to build the new file name (the date part), which is the actual problem for me.

This is what the file looks like with the date in the first line:

*AAAAAAAAAAAAAAAAAAAAAAAAAA          zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz       05-JAN-2000 12:21:0005-JAN-2000 14:00:300102
160  2160
           1.00     1.0   1.00   1.00  1.0000   1.0   1.0     1.0    1.0000    1.0000   1.00  1.000   1.0   1.0  1.0000  1.0000
        9999.90 99999.0 999.90 999.00 99.9900 999.0 999.9 99999.9  999.9900  999.9900 999.90 99.990 999.9 999.9 99.9900 99.9900
Pressure [hPa]
Geopotential height [gpm]
Temperature [K]
Relative humidity [%]
Ozone partial pressure [mPa]
Horizontal wind direction [decimal degrees]
Horizontal wind speed [m/s]
GPS geometric height [m]
GPS longitude [decimal degrees E]
GPS latitude [decimal degrees N]
Internal temperature [K]
Ozone raw current [microA]
Battery voltage [V]
Pump current [mA]
Ozone mixing ratio per volume [ppm]
Ozone partial pressure uncertainty estimate [mPa]*

I can't attach the whole text file, but this is an example of the beginning of every text file.

So how can I get the desired date for the file name out of this line?

雾里花 2025-02-07 05:56:16

If the input files always have the same format, with the date/time elements always at the end of the line, you can split the line, and just take the third element from the end.

You can do this with negative indexing, as per w3schools

line = "*AAAAAAAAAAAAAAAAAAAAAAAAAA          zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz       05-JAN-2000 12:21:0005-JAN-2000 14:00:300102"

# with no arguments, split() splits on runs of whitespace
date_str = line.split()[-3]
print(date_str)

output

05-JAN-2000
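
The question asks for a YEAR-MONTH-DAY file name, while the split above yields 05-JAN-2000. One way to convert it is the standard library's datetime module. This is only a minimal sketch: the helper name to_iso_date is hypothetical, and it assumes the months are English abbreviations and that the program runs in a locale where %b matches them.

```python
from datetime import datetime


def to_iso_date(date_str: str) -> str:
    """Convert a 'DD-MON-YYYY' string such as '05-JAN-2000' to 'YYYY-MM-DD'."""
    # .title() turns 'JAN' into 'Jan', the usual form of an abbreviated month name
    parsed = datetime.strptime(date_str.title(), "%d-%b-%Y")
    return parsed.strftime("%Y-%m-%d")


line = "*AAAAAAAAAAAAAAAAAAAAAAAAAA          zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz       05-JAN-2000 12:21:0005-JAN-2000 14:00:300102"
date_str = line.split()[-3]
print(to_iso_date(date_str))  # 2000-01-05
```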

As for applying this to your logic, you'll need to change the line below to my code example further down:

    df.to_csv('Name of the new file.csv', index=False)

You need to import os as I use os.path and os.sep to get the resulting filename.

    # Write the CSV next to the source file, named "<date> (<original name>).csv"
    filename_orig = os.path.basename(file)
    filedir = os.path.dirname(file)
    df.to_csv(f"{filedir}{os.sep}{date_str} ({filename_orig}).csv", index=False)

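Note that this requires Python 3.6+ as I'm using f-strings.
Also note that you need to open each original file and read its first line yourself to get line; pandas skips that line in your read_csv call. Finally, date_str has the form 05-JAN-2000, so it still has to be converted if the file name must follow the YEAR-MONTH-DAY pattern from the question.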
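
Putting the pieces together, the sketch below reads the first line of a file, takes the third token from the end, converts it to YEAR-MONTH-DAY and builds the output path. The function name csv_path_for and the sample file name pm000105.b17 are hypothetical. It assumes every file starts with a header line laid out like the one in the question, with English month abbreviations.

```python
import os
import tempfile
from datetime import datetime


def csv_path_for(path: str) -> str:
    """Return '<dir>/<YYYY-MM-DD> (<original name>).csv' based on the file's first line."""
    with open(path, "r") as fh:
        first_line = fh.readline()
    # Third whitespace-separated token from the end, e.g. '05-JAN-2000'
    date_str = first_line.split()[-3]
    iso_date = datetime.strptime(date_str.title(), "%d-%b-%Y").strftime("%Y-%m-%d")
    new_name = f"{iso_date} ({os.path.basename(path)}).csv"
    return os.path.join(os.path.dirname(path), new_name)


# Small demonstration with a throwaway file that mimics the header line
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "pm000105.b17")
    with open(sample, "w") as fh:
        fh.write("AAAA          zzzz       05-JAN-2000 12:21:0005-JAN-2000 14:00:300102\n")
        fh.write("160  2160\n")
    print(os.path.basename(csv_path_for(sample)))  # 2000-01-05 (pm000105.b17).csv
```

Inside the loop from the question, the placeholder file name could then be replaced with df.to_csv(csv_path_for(file), index=False).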
