Custom tokenization using regex

Posted on 2025-01-31 04:56:45

I have the following text:

4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET

The format of the log line may change and more fields may be added to it, but they are all single words. The only tokens I want to join are the date and time.

I want to tokenize it to:

['4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']

I have used this regex, "\\[|\\]|\,|\\s+|\W:|=", which gives me this output:

['4/21/2021', '11:43:32', 'PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']

What changes should I make to the regex so that I get my desired output, with the entire date and time as one token?


Comments (3)

若能看破又如何 2025-02-07 04:56:45

You could just use a single regex pattern along with re.findall:

import re

inp = "4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"
# Group 1 is the whole timestamp; the remaining groups are the single-word fields
matches = re.findall(r'^(\S+ \S+ [AP]M) (\S+) (\S+) \[(\S+)\] (\w+)', inp)
print(matches)  # [('4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET')]
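
The pattern above expects exactly five fields, while the question says more single-word fields may be added. A minimal sketch of a more flexible variant follows; it assumes the timestamp always has the form M/D/YYYY H:MM:SS AM|PM, and the names TOKEN_RE and tokenize are illustrative, not from the question. The timestamp alternative is tried first, and any other run of characters that is not whitespace, a bracket, a comma, or '=' becomes its own token.

```python
import re

# The timestamp alternative comes first so it wins over the generic one.
TOKEN_RE = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M"  # date, time, AM/PM
    r"|[^\s\[\],=]+"  # any other single field, without delimiters
)


def tokenize(line):
    # No capture groups are used, so findall returns the full matches as a flat list.
    return TOKEN_RE.findall(line)


print(tokenize("4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"))
# ['4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']
```

Because the pattern matches tokens rather than the delimiters between them, extra fields at the end of the line are picked up without changing the regex.
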

绿光 2025-02-07 04:56:45

You can also use the Python ttp (Template Text Parser) module. Note that the template in this example matches only lines that contain PM. See the example:

from ttp import ttp
import json
import re

data_to_parse = "4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"

ttp_template = """
{{ TIME | PHRASE }} PM {{REST | PHRASE}}
"""

parser = ttp(data=data_to_parse, template=ttp_template)
parser.parse()

# Get the parsed result as a JSON string
results = parser.result(format='json')[0]

# Convert the JSON string to Python objects
result = json.loads(results)

rest_split = result[0]['REST'].split()

desired_data = []
desired_data.append(f"{result[0]['TIME']} PM")

pattern = r"\[(\S+)\]"  # Captures the text between [ and ], e.g. OUTPUT

for i in rest_split:
    if re.match(pattern, i):
        i = re.findall(pattern, i)
        desired_data.append(i[0])
        continue
    desired_data.append(i)

print(desired_data)

梦回旧景 2025-02-07 04:56:45

Why bother with regex?

s = '4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET'
s = s.replace('[', '').replace(']', '')  # the desired output has no square brackets

tags = [' AM ', ' PM ']
for t in tags:
    if t in s:
        # Split only at the first AM/PM marker so later fields stay intact
        start, end = s.split(t, 1)
        start = (start + t).strip()
        result = [start, *end.split()]
        break

print(result)
# ['4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']
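
In the same no-regex spirit, a shorter sketch is possible if one assumes the timestamp is always the first three whitespace-separated pieces of the line (date, time, AM/PM). That assumption and the function name tokenize are mine, not the question's. Splitting with a maxsplit of 3 peels off those three pieces and leaves the remaining fields for a second split.

```python
def tokenize(line):
    # Split off the first three whitespace-separated pieces (date, time, AM/PM);
    # everything after them stays in one string for a second split.
    parts = line.split(None, 3)
    timestamp = " ".join(parts[:3])
    rest = parts[3].split() if len(parts) > 3 else []
    # Strip surrounding square brackets from fields such as [OUTPUT].
    return [timestamp] + [field.strip("[]") for field in rest]


print(tokenize("4/21/2021 11:43:32 PM 0ED4 PACKET [OUTPUT] GET"))
# ['4/21/2021 11:43:32 PM', '0ED4', 'PACKET', 'OUTPUT', 'GET']
```

This version does not depend on finding an AM/PM marker, but it would mis-tokenize a line that uses a 24-hour time without AM/PM, since the field after the time would be folded into the timestamp.
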