Get specific data from a txt file into a pandas DataFrame

Posted on 2025-01-16 08:04:24

I have data like this in a txt file:

Wed Mar 23 16:59:25 GMT 2022
      1 State
      1 ESTAB

Wed Mar 23 16:59:26 GMT 2022
      1 State
      1 ESTAB
      1 CLOSE-WAIT

Wed Mar 23 16:59:27 GMT 2022
      1 State
      1 ESTAB
      10 FIN-WAIT

Wed Mar 23 16:59:28 GMT 2022
      1 State
      1 CLOSE-WAIT
      102 ESTAB

I want to get a pandas DataFrame that looks like this:

timestamp | State | ESTAB | FIN-WAIT | CLOSE-WAIT
Wed Mar 23 16:59:25 GMT 2022 | 1 | 1 | 0 | 0
Wed Mar 23 16:59:26 GMT 2022 | 1 | 1 | 0 | 1
Wed Mar 23 16:59:27 GMT 2022 | 1 | 1 | 10 | 0
Wed Mar 23 16:59:28 GMT 2022 | 1 | 102 | 0 | 1

That is, the first line of each paragraph should go into the first column, timestamp. The other columns should be filled with the numbers, according to the string that follows each number. Each paragraph starts a new row.

How can I do this with pandas?

晌融 2025-01-23 08:04:24

First, you can parse the txt file into a list of lists. Each inner list holds the lines of one block, and the outer list holds the blocks:

import pandas as pd

with open('data.txt', 'r') as f:
    res = f.read()

# Split the text into blocks on blank lines, then split each block into stripped lines
records = [list(map(str.strip, block.strip().split('\n'))) for block in res.strip().split('\n\n')]

print(records)

[['Wed Mar 23 16:59:25 GMT 2022', '1 State', '1 ESTAB'], ['Wed Mar 23 16:59:26 GMT 2022', '1 State', '1 ESTAB', '1 CLOSE-WAIT'], ['Wed Mar 23 16:59:27 GMT 2022', '1 State', '1 ESTAB', '10 FIN-WAIT'], ['Wed Mar 23 16:59:28 GMT 2022', '1 State', '1 CLOSE-WAIT', '102 ESTAB']]

Then you can turn the list of lists into a list of dictionaries by setting each key and value manually:

l = []
for record in records:
    d = {}
    d['timestamp'] = record[0]
    for r in record[1:]:
        # Each line is "<count> <name>": the name becomes the key, the count the value
        value, key = r.split(' ', 1)
        d[key] = value

    l.append(d)

print(l)

[{'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'}, {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'}, {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'}, {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'}]

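Finally, you can pass this list of dictionaries to the DataFrame constructor and fill the NaN cells with 0. The counts parsed from the file stay strings here; only the filled-in zeros are integers.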

df = pd.DataFrame(l).fillna(0)

print(df)

                      timestamp State ESTAB CLOSE-WAIT FIN-WAIT
0  Wed Mar 23 16:59:25 GMT 2022     1     1          0        0
1  Wed Mar 23 16:59:26 GMT 2022     1     1          1        0
2  Wed Mar 23 16:59:27 GMT 2022     1     1          0       10
3  Wed Mar 23 16:59:28 GMT 2022     1   102          1        0
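
The three steps above can also be folded into one helper. The following is only a minimal sketch under a few assumptions: the names `parse_blocks` and `sample` are made up for illustration, the counts are converted to integers (the answer's code keeps them as strings), and blocks are split on any run of blank lines so that trailing or doubled blank lines do not produce empty records.

```python
import re

import pandas as pd


def parse_blocks(text: str) -> pd.DataFrame:
    """Turn blank-line-separated blocks into one row per timestamp."""
    rows = []
    # Split on one or more blank lines (which may contain spaces).
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # The first line of a block is the timestamp.
        row = {"timestamp": lines[0]}
        for line in lines[1:]:
            # Every other line is "<count> <name>".
            count, name = line.split(maxsplit=1)
            row[name] = int(count)
        rows.append(row)
    # Names missing from a block become NaN; fill with 0 and restore ints.
    df = pd.DataFrame(rows).fillna(0)
    count_cols = [c for c in df.columns if c != "timestamp"]
    return df.astype({c: int for c in count_cols})


# Hypothetical sample input, copied from the first two blocks of the question.
sample = """Wed Mar 23 16:59:25 GMT 2022
      1 State
      1 ESTAB

Wed Mar 23 16:59:26 GMT 2022
      1 State
      1 ESTAB
      1 CLOSE-WAIT
"""
result = parse_blocks(sample)
print(result)
```

For a file on disk, the same helper could be called as `parse_blocks(open('data.txt').read())`. The columns appear in the order in which the names are first seen in the file.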

泪是无色的血 2025-01-23 08:04:24

Try:

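import pandas as pd

# Read the text file into a single-column DataFrame, keeping blank lines as NaN rows
df = pd.read_csv("data.txt", header=None, skip_blank_lines=False)

# Extract the possible column names
df["Column"] = df[0].str.extract("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)")

# Remove the column names from the data
df[0] = df[0].str.replace("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)", "", regex=True)

# Drop the blank rows, then label the remaining unnamed rows as timestamps
df = df.dropna(how="all").fillna("timestamp")
# Number the blocks: each timestamp row starts a new one
df["Index"] = df["Column"].eq("timestamp").cumsum()

# Pivot the data to match the expected output structure
# (DataFrame.pivot requires keyword arguments in pandas 2.0 and later)
output = df.pivot(index="Index", columns="Column", values=0)

# Re-format the columns as needed
output = output.set_index("timestamp").astype(float).fillna(0).astype(int).reset_index()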

>>> output
Column                     timestamp  CLOSE-WAIT  ESTAB  FIN-WAIT  State
0       Wed Mar 23 16:59:25 GMT 2022           0      1         0      1
1       Wed Mar 23 16:59:26 GMT 2022           1      1         0      1
2       Wed Mar 23 16:59:27 GMT 2022           0      1        10      1
3       Wed Mar 23 16:59:28 GMT 2022           1    102         0      1
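
The approach above hardcodes the four state names in the regular expression. If the set of names is not known in advance, a variant of the same idea might look like the sketch below. It rests on assumptions that are not in the answer: every count line has the shape `<digits> <name>`, any other non-empty line is treated as a timestamp, and the text is held in a hypothetical `sample` string rather than read with `read_csv`.

```python
import pandas as pd

# Hypothetical sample input, copied from the last two blocks of the question.
sample = """Wed Mar 23 16:59:27 GMT 2022
      1 State
      1 ESTAB
      10 FIN-WAIT

Wed Mar 23 16:59:28 GMT 2022
      1 State
      1 CLOSE-WAIT
      102 ESTAB
"""

# One stripped, non-empty text line per element.
lines = pd.Series(sample.splitlines()).str.strip()
lines = lines[lines != ""]

# Count lines look like "<digits> <name>"; anything else is taken as a timestamp.
parts = lines.str.extract(r"^(?P<count>\d+)\s+(?P<name>\S+)$")
is_timestamp = parts["count"].isna()

# Number the blocks and carry each timestamp down to its count lines.
parts["block"] = is_timestamp.cumsum()
parts["timestamp"] = lines.where(is_timestamp).ffill()

# Keep only the count lines and pivot them into one row per block.
counts = parts[~is_timestamp].astype({"count": int})
output = (
    counts.pivot(index=["block", "timestamp"], columns="name", values="count")
    .fillna(0)
    .astype(int)
    .reset_index(level="timestamp")
    .rename_axis(index=None, columns=None)
)
print(output)
```

Pivoting on the block number together with the timestamp keeps the rows in file order even if two blocks happen to carry the same timestamp string. As in the answer above, the count columns come out in alphabetical order.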
