Using sed to drop strings with repeated and incremental characters?

Posted 2024-12-04 20:32:29, 737 characters, 0 views, 0 comments

I'm trying to use sed to drop strings containing repeated characters before appending them to a file.
So far I have the following, which drops strings with consecutive repetition like 'AA' or '22', but I'm struggling with repetition anywhere in the string and with incremental characters.

generic string generator | sed '/\([^A-Za-z0-9_]\|[A-Za-z0-9]\)\1\{1,\}/d' >> parsed string to file

I also want to drop strings that contain any repetition, like 'ABA',
as well as strings containing any ascending or descending characters, like 'AEF' or 'AFE'.

I'm assuming it would be easier to use multiple passes of sed to drop the unwanted strings.

** A little more information to try to avoid the XY problem mentioned. **

The character strings could be from 8 to 64 characters long, but in this instance I'm focusing on 8. I've also restricted the string generation to output only upper-case alphabetic strings (A-Z). This is for a few reasons, but mainly because I don't want the generated file to have a ridiculously huge footprint.

The first pass of sed drops unnecessary outputs like 'AAAAAAAA' and 'AAAAAAAB' from the stream, so the file starts with the strings 'ABABABAB' and 'ABABABAC'.

In the next pass I want to check that no character differs from the one before it by exactly one. So strings like 'ABABABAB' would be dropped, but 'ACACACAC' would pass through to the stream.

In the pass after that I want to drop strings that contain any repeated character anywhere in the string. So strings like 'ACACACAC' would be dropped, but 'ACEBDFHJ' would be written to the file.

Hope that helps.

Comments (1)

紫南 2024-12-11 20:32:29

To do what you're describing with sed, you'd need to run it many times (or give it many patterns). Since sed doesn't understand the concept of "this character is incremental from this other character", you need to spell out every possible combination:

sed '/AB/d'
sed '/BC/d'
sed '/CD/d'
sed '/DE/d'

etc.

For descending characters, the same thing:

sed '/BA/d'
sed '/CB/d'

In order to then drop strings with repeated characters, you can do something like this:

sed '/\(.\).*\1/d'
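
To see what that pattern matches, here is a small Python sketch of the same idea: capture one character, allow anything after it, then require the captured character again. The helper name `has_repeat` is made up for illustration, and Python's `re` module writes the group as `(.)` where sed's basic syntax uses `\(.\)`.

```python
import re

# Same idea as the sed pattern: capture one character, then find it again later.
REPEAT = re.compile(r"(.).*\1")

def has_repeat(s):
    """Return True if any character occurs more than once in s."""
    return REPEAT.search(s) is not None

# Sample strings taken from the question.
for sample in ("ACACACAC", "ABA", "ACEBDFHJ"):
    print(sample, has_repeat(sample))
```

Strings for which `has_repeat` is true are the ones the sed command would delete.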

The following should do the trick:

generic string generator |sed '/\(.\).*\1/d'|sed /BA/d|sed /AB/d|sed /CB/d|sed /BC/d|sed /DC/d|sed /CD/d|sed /ED/d|sed /DE/d|sed /FE/d|sed /EF/d|sed /GF/d|sed /FG/d|sed /HG/d|sed /GH/d|sed /IH/d|sed /HI/d|sed /JI/d|sed /IJ/d|sed /KJ/d|sed /JK/d|sed /LK/d|sed /KL/d|sed /ML/d|sed /LM/d|sed /NM/d|sed /MN/d|sed /ON/d|sed /NO/d|sed /PO/d|sed /OP/d|sed /QP/d|sed /PQ/d|sed /RQ/d|sed /QR/d|sed /SR/d|sed /RS/d|sed /TS/d|sed /ST/d|sed /UT/d|sed /TU/d|sed /VU/d|sed /UV/d|sed /WV/d|sed /VW/d|sed /XW/d|sed /WX/d|sed /YX/d|sed /XY/d|sed /ZY/d|sed /YZ/d

I only tested this on a few input samples, but they all seemed to work.
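
The list of pairs does not have to be typed by hand; it can be generated. The following is only a sketch under the question's assumption of an A-Z alphabet. The function names `adjacent_pairs` and `sed_command` are hypothetical, and the output uses sed's standard `-e` option to pass several delete expressions to one sed process instead of chaining many.

```python
import string

def adjacent_pairs(alphabet=string.ascii_uppercase):
    """Return every two-letter run that steps up or down by one, e.g. AB and BA."""
    pairs = []
    for lo, hi in zip(alphabet, alphabet[1:]):
        pairs.append(lo + hi)   # ascending, e.g. "AB"
        pairs.append(hi + lo)   # descending, e.g. "BA"
    return pairs

def sed_command(pairs):
    """Build one sed invocation with a delete expression per pair."""
    return "sed " + " ".join("-e /{}/d".format(p) for p in pairs)

pairs = adjacent_pairs()
print(len(pairs))           # 25 neighbouring letters, two directions each
print(sed_command(pairs))
```

The printed command can be pasted into the pipeline in place of the chain of separate sed calls for the letter pairs.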

Note that this is quite ungainly, and would be better done by something a little more sophisticated than sed. Here's a sample in Python:

def isvalid(x):
    # Reject strings in which any character occurs more than once.
    if len(set(x)) < len(x):
        return False
    # Reject strings in which two neighbouring characters differ by exactly one.
    for a in range(1, len(x)):
        if abs(ord(x[a]) - ord(x[a - 1])) == 1:
            return False
    return True

This is much more readable than the giant set of sed calls, and has the same functionality.
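
To use a check like this in place of the sed pipeline, it could be wrapped in a small line filter. The sketch below restates the check in self-contained form, using `len(set(x))` to count distinct characters. `filter_lines` is a hypothetical helper name, and the sample strings come from the question.

```python
def isvalid(x):
    # len(set(x)) counts distinct characters; fewer than len(x) means a repeat.
    if len(set(x)) < len(x):
        return False
    # Neighbouring characters must not differ by exactly one code point.
    return all(abs(ord(a) - ord(b)) != 1 for a, b in zip(x, x[1:]))

def filter_lines(lines):
    """Yield only the non-empty lines (newline stripped) that pass isvalid."""
    for line in lines:
        s = line.rstrip("\n")
        if s and isvalid(s):
            yield s

samples = ["AAAAAAAA", "ABABABAB", "ACACACAC", "ACEBDFHJ"]
print(list(filter_lines(samples)))  # ['ACEBDFHJ']
```

To filter a real stream, `filter_lines(sys.stdin)` (after `import sys`) could be looped over and each result printed, with the generator's output piped into the script.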
