GNU awk: print the text between two patterns, excluding the patterns

Posted 2025-02-07 21:08:28, 1805 characters, 2 views, 0 comments

GNU AWK 5.1.1

I have an AWK expression below that I use to print the content between two patterns found in a .txt document, but I'm not particularly happy with it and would like something more elegant. Any suggestions? I don't want sed, just AWK.

I want to grab all the text between these tags but not including the tags.

<monDescription> and <endMonDescription> 

If I simply use:

awk '/<monDescription>/,/<endMonDescription>/' ~mydocument.txt 

It includes the <monDescription> and <endMonDescription> which I don't want.
So to fix this I pipe the AWK output to another AWK command using gsub:

awk '/<monDescription>/,/<endMonDescription>/' ~mydocument.txt | awk '{gsub(/<monDescription>|<endMonDescription>|DAVE:/, "")}1' | awk '{$1=$1;print}'

I also gsub "DAVE:", which is text that occurs before <monDescription> on the same line and that I don't want. It's tough to get just the clean text between the patterns, with nothing before or after them and without the patterns themselves, unless I resort to sloppy-looking piping. Suggestions?

Here's a sample of input text:

DAVE:   <monDescription>Lorem ipsum dolor sit amet, consectetuer
adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum
sociis natoque penatibus et magnis dis parturient montes, nascetur
ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu,
pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo,
fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo,
rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis
eu pede mollis pretium.<endMonDescription>

Expected output should be:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean
commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus
et magnis dis parturient montes, nascetur ridiculus mus. Donec quam
felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla
consequat massa quis enim. Donec pede justo, fringilla vel, aliquet
nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a,
venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium.

Comments (2)

泡沫很甜 2025-02-14 21:08:29

I would use GNU AWK for this task in the following way. Let file.txt contain

DAVE:   <monDescription>Lorem ipsum dolor sit amet, consectetuer
adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum
sociis natoque penatibus et magnis dis parturient montes, nascetur
ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu,
pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo,
fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo,
rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis
eu pede mollis pretium.<endMonDescription>

then

awk 'BEGIN{RS="<[^>]*onDescription>"}NR%2==0' file.txt

gives output

Lorem ipsum dolor sit amet, consectetuer
adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum
sociis natoque penatibus et magnis dis parturient montes, nascetur
ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu,
pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo,
fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo,
rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis
eu pede mollis pretium.

Explanation: I wrote a regular expression that matches both <monDescription> and <endMonDescription> (you might instead use the two tags joined by | if my regular expression gives false positives on your file), then I tell GNU AWK to use it as the record separator (RS) and to print only the even-numbered records. Disclaimer: this solution assumes that every starting tag has an ending tag, that tags are never nested, and that every ending tag has a starting tag somewhere before it.
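The stricter variant mentioned above, with the two tags joined by |, could look like the following sketch. The helper name between_tags and the shortened sample text are made up for illustration, and it assumes an awk that accepts a regular expression as RS (gawk and mawk do).

```shell
# Hypothetical helper: print only the text between the two tags.
# Assumes an awk that accepts a regex RS (for example gawk or mawk).
between_tags() {
    awk 'BEGIN { RS = "<monDescription>|<endMonDescription>" }
         NR % 2 == 0   # even-numbered records sit between a start and an end tag
        ' "$1"
}

# Shortened sample modelled on the input from the question.
sample=$(mktemp)
printf 'DAVE:   <monDescription>Lorem ipsum dolor sit amet,\nconsectetuer adipiscing elit.<endMonDescription>\n' > "$sample"

between_tags "$sample"
```

Because the separator now names both tags exactly, other tags ending in onDescription> no longer split records. The same assumptions about balanced, non-nested tags still apply.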

(tested in GNU Awk 5.0.1)
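A multi-character RS is not specified by POSIX, so for an awk without that extension a line-by-line sketch along these lines may help. This is a different technique from the RS approach above, not part of the tested answer. The helper name between_tags_portable and the flag name inside are made up, and it assumes at most one start tag and one end tag per line.

```shell
# Hypothetical helper using only POSIX awk features (no regex RS).
between_tags_portable() {
    awk '
        # Start tag on this line: delete everything up to and including it.
        sub(/.*<monDescription>/, "") { inside = 1 }
        # End tag while capturing: delete it and the rest of the line, print, stop.
        inside && sub(/<endMonDescription>.*/, "") { print; inside = 0; next }
        # Any other line seen while capturing is printed unchanged.
        inside
    ' "$1"
}

# Shortened sample with text before and after the tagged region.
sample=$(mktemp)
printf 'DAVE:   <monDescription>Lorem ipsum dolor sit amet,\nconsectetuer adipiscing elit.<endMonDescription>\ntrailing line\n' > "$sample"

between_tags_portable "$sample"
```

The sub() calls remove the tags and anything outside them, so no second awk or gsub stage is needed to drop "DAVE:".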

楠木可依 2025-02-14 21:08:29

If there's leading or trailing, well, anything, try this non-proprietary-awk solution:

INPUT (encapsulated, including trailing spaces)

<<<<{DAVE:   <monDescription>Lorem ipsum dolor sit amet, consectetuer
adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum
sociis natoque penatibus et magnis dis parturient montes, nascetur
ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu,
pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo,
fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo,
rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis
eu pede mollis pretium.<endMonDescription>      }>>>>

CODE

# This also trims out final \n :
#      remove ORS= part to 
#      retain trailing newline

{m,n,g}awk ++NF OFS= RS='^$' \
    ORS= FS='(^[^><]*)?[<](end)?[Mm]onDescription[>]([^><]*$)?'

OUTPUT (encapsulated)

<<<<{Lorem ipsum dolor sit amet, consectetuer
adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum
sociis natoque penatibus et magnis dis parturient montes, nascetur
ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu,
pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo,
fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo,
rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis
eu pede mollis pretium.}>>>>
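To see how a field separator of this shape carves up the input, here is a small sketch that assumes gawk is installed and uses the RS='^$' idiom to read the whole input as one record. The helper name show_fields and the shortened sample are made up. The gawk manual notes that awk implementations differ on what ^ means inside FS, so treat the behaviour as gawk-specific.

```shell
# Hypothetical helper: report how many fields the tag-based FS produces
# and show the middle field, which holds the text between the tags.
show_fields() {
    gawk -v RS='^$' \
         -v FS='(^[^><]*)?[<](end)?[Mm]onDescription[>]([^><]*$)?' '
        # With the whole input as one record, each tag (plus any tag-free text
        # before the first or after the last) acts as a field separator.
        { printf "NF=%d\n[%s]\n", NF, $2 }
    '
}

printf 'DAVE:   <monDescription>Lorem ipsum\ndolor sit amet.<endMonDescription>      \n' | show_fields
```

The first and last fields come out empty because the leading "DAVE:" text and the trailing spaces are swallowed by the separator itself, which is why rebuilding the record with an empty OFS leaves only the tagged text.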
