I'm trying to clean up some data from web scraping.
This is an example of the information I'm working with:
Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)
And this is an example of what I'm trying to achieve:
Best Time
Adam Jones (w/ help) (6:34)
Best Time
Kenny Gobbin (2:38)
Personal Best
Matt Herrera (12:44)
No-record
Nick Elizabeth (19:04)
I want to add two newlines after the closing parenthesis of each time, but as the times are all different, I don't know how to search for and replace them. Also, numbers may sometimes occur outside of the times.
The closest I've come is by searching for numbers inside the parentheses with a colon to separate them, but I don't know how to replace that with the same information.
re.sub(r"\([0-9]+:[0-9]+\)", "\n\n", result)
Does anyone know how I can achieve this?
Notice that the place where you need to insert two newlines comes between a closing parenthesis and an alphabetic character. So, you can use:
re.sub(r"\)([A-Za-z])", r")\n\n\1", result)
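As a minimal sketch of that substitution, here it is applied to the sample text from the question. The line breaks inside `result` are an assumption based on how the sample is displayed, and the variable name `cleaned` is only for illustration.

```python
import re

# Sample text from the question; each label is fused onto the preceding time.
# The exact line breaks are an assumption based on how the sample is shown.
result = (
    "Best Time\n"
    "Adam Jones (w/ help) (6:34)Best Time\n"
    "Kenny Gobbin (a) (2:38)Personal Best\n"
    "Matt Herrera (12:44)No-record\n"
    "Nick Elizabeth (19:04)"
)

# Match ")" followed directly by a letter, then write the ")" back,
# add two newlines, and re-insert the captured letter with \1.
cleaned = re.sub(r"\)([A-Za-z])", r")\n\n\1", result)
print(cleaned)
```

Only a parenthesis that is immediately followed by a letter is affected, so `(w/ help) (6:34)` and `(a) (2:38)` stay on one line because a space follows their first closing parenthesis.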
Here's an explanation for how it works:
For the expression we're trying to match, we have r"\)([A-Za-z])":
\) matches a literal closing parenthesis.
[A-Za-z] matches a single alphabetic character.
[A-Za-z] in parentheses makes it a capture group that we refer to later.
For the replacement expression, we have r")\n\n\1":
)\n\n adds a closing parenthesis plus two newlines.
\1 refers to the capture group from earlier.
Intuitively, we capture the alphabetic character immediately after the closing parenthesis, and then add that same character back in through the replacement expression.
You can do it your way with a minimal change. You only have to know about grouping and add \g<0> right before the \n\n in the replacement. You can read about it in the official documentation in the section about search-and-replace. Here I used group 0 (the entire match) to insert the matched text again. Each set of () in the pattern is a capture group, numbered from left to right starting at 1; group 0 always refers to the whole match.
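A minimal sketch of this second approach, keeping the pattern from the question and changing only the replacement. The shortened `result` string and the name `spaced` are assumptions for illustration.

```python
import re

# Two lines from the question's sample, enough to show the effect.
result = (
    "Adam Jones (w/ help) (6:34)Best Time\n"
    "Kenny Gobbin (a) (2:38)Personal Best"
)

# \g<0> re-inserts the whole match (the parenthesised time),
# and the two newlines are appended after it.
spaced = re.sub(r"\([0-9]+:[0-9]+\)", r"\g<0>\n\n", result)
print(spaced)
```

Because the pattern requires digits around a colon, `(w/ help)` and `(a)` are left alone. Unlike the letter-based pattern, this one adds the two newlines after every time, including one that already sits at the end of a line or of the text.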