Splitting a sequence of letters while retaining the original sequence positions

Posted 2025-01-26 03:07:17

I need to split the following sequence of letters into distinct chunks:

SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC

I used the following code, provided by another user, to achieve what I initially wanted, which was to split the sequence after every C:

library(dplyr)

TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

Test <- strsplit(TestSequence, "(?<=[C])", perl = TRUE) %>% unlist 

df <- data.frame(Fragment = Test) %>%
  mutate("position" = cumsum(nchar(Test)))

This allowed me to split the sequence after every C and retain its position in the sequence, for example C at positions 2, 11, etc.

Now I need to split the same sequence at different locations, which I can do using the following to split after P,A,G or S:

Test2 <- strsplit(TestSequence, "(?<=[P,A,G,S])", perl = TRUE) %>% unlist

This is fine if I want to split after a given character, but if I try to split before a character, for example D, I cannot seem to keep the D attached to the fragment. It is only retained if I split after the D.

I have tried every combination of lookbehind and lookahead I can think of. The following cuts both before and after every D, which isn't that useful:

Test3 <- strsplit(TestSequence, "(?=[D])", perl = TRUE) %>% unlist

Also, is there a way to retain the exact position of every C in the original sequence?

For example, if I were to split the test sequence after the initial K, I'd have the fragment SCDK. Could I have a separate column that tells me where the C was in the original sequence? As a second example, the next fragment would be SFNRGECSCDK, and that separate column would say the C was originally at position 11.

萧瑟寒风 2025-02-02 03:07:17

strsplit does not properly handle the zero-length matches produced by lookahead-only patterns.

In this case, you need to "anchor" the matches on the left, too. Either use a non-word boundary, or a lookbehind that disallows a match at the start of the string:

TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
strsplit(TestSequence, "\\B(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC"          "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"  
 
strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC"          "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"

See the online R demo.

The \B(?=D) pattern matches a location that is immediately preceded by a word char and immediately followed by D.

The (?<!^)(?=D) pattern matches a location that is not immediately preceded by the start-of-string position (i.e. not at the start of the string) and is immediately followed by D.
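To make the difference concrete, here is a small sketch that runs the lookahead-only pattern from the question next to the anchored one. The comments on why strsplit behaves this way reflect my understanding of how it scans the remaining text after each split, so treat them as an explanation under that assumption rather than a specification. The variable names `unanchored` and `anchored` are only for illustration.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# Lookahead only: after each split the remaining text starts with D,
# so the pattern matches again at offset 0 and strsplit peels off a
# single character (assumed mechanism, see the note above).
unanchored <- strsplit(TestSequence, "(?=D)", perl = TRUE)[[1]]
unanchored
# [1] "SC"         "D"          "KSFNRGECSC" "D"          "KSFNRGECSC"
# [6] "D"          "KSFNRGEC"

# Anchored on the left: no match is allowed at the start of the
# remaining text, so each D stays attached to its fragment.
anchored <- strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)[[1]]
anchored
# [1] "SC"          "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"

# In both cases the fragments still add up to the original sequence.
identical(paste(anchored, collapse = ""), TestSequence)
```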

Also, note that [P,A,G,S] matches P, A, G, S and a comma. You should use [PAGS] to match one of the letters.
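The question also asks how to keep the original position of every C for each fragment, which the patterns above do not cover. One possible approach, sketched below, is to derive each fragment's start and end from the cumulative fragment lengths and then look up which C positions fall inside that range. The helper `fragment_positions` and its column names are hypothetical and not part of the original answer; it assumes the split pattern is zero-length, so no characters are consumed.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# Hypothetical helper: split `x` with a zero-length `pattern` and report,
# for each fragment, its start/end in `x` and where `letter` occurs in `x`.
fragment_positions <- function(x, pattern, letter = "C") {
  fragments <- strsplit(x, pattern, perl = TRUE)[[1]]
  end   <- cumsum(nchar(fragments))
  start <- end - nchar(fragments) + 1
  # Positions of `letter` in the full, unsplit sequence
  hits <- which(strsplit(x, "")[[1]] == letter)
  data.frame(
    Fragment = fragments,
    start = start,
    end = end,
    # Comma-separated original positions that fall inside each fragment
    letter_positions = mapply(
      function(s, e) paste(hits[hits >= s & hits <= e], collapse = ","),
      start, end
    ),
    stringsAsFactors = FALSE
  )
}

res <- fragment_positions(TestSequence, "(?<=K)")
res$Fragment          # "SCDK" "SFNRGECSCDK" "SFNRGECSCDK" "SFNRGEC"
res$letter_positions  # "2" "11,13" "22,24" "33"
```

Note that the second fragment, SFNRGECSCDK, contains two C's, so the column lists both 11 and 13 rather than 11 alone. The same helper can be called with the corrected character class, for example `fragment_positions(TestSequence, "(?<=[PAGS])")`.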
