从字典中的子字符串生成字符串,同时最小化某些彼此相邻的字符

发布于 2025-01-13 15:20:15 字数 918 浏览 2 评论 0原文

我希望能够从包含子字符串的字典中生成一个字符串,其中我输入一个字符串,其中每个字符对应于字典的键,然后它从与该键相关的值中吐出一个新字符串。不过,我也想尽量减少某些角色彼此相邻的情况。

例如:

dict = {'I': ['ATA', 'ATC', 'ATT'], 'M': ['ATG'], 'T': ['ACA', 'ACC', 'ACG', 'ACT'], 'N':['AAC', 'AAT'], 'K': ['AAA', 'AAG'], 'S': ['AGC', 'AGT'], 'R': ['AGA', 'AGG']}

input_str = "IIMTSTTKRI"

输出将是与每个键关联的三个字符子字符串的字符串。但是可以使用许多 3 个字符子字符串,我想尽量减少彼此相邻的 G 和 C 的数量。

我目前有这个:

n = []

#make list of possible substrings for each character in string 
for i in str:
    if i in dict.keys():
       n.append(dict[i])
#generate all permutations 
p = [''.join(s) for s in itertools.product(*n)]

#if no consecutive GCs in a permutation add to list
ls = []
for i in p:
    q = i.count('GC')    
    if q == 0:
        ls.append(i)

哪个“有效”,但存在一些问题。第一个(次要的)是我必须假设连续的“GC”为 0,对于某些字符串来说这可能是不可能的。第二个(主要的)是它对于较长的字符串来说非常慢,因为它必须生成所有排列。

谁能提供一种提高速度的方法或替代方法?

I want to be able to generate a string from a dictionary containing substrings, whereby I input a string where each character corresponds to the key of the dictionary and it spits out a new string from the associated values to that key. However I also want to minimise certain characters being next to each other.

For example:

dict = {'I': ['ATA', 'ATC', 'ATT'], 'M': ['ATG'], 'T': ['ACA', 'ACC', 'ACG', 'ACT'], 'N':['AAC', 'AAT'], 'K': ['AAA', 'AAG'], 'S': ['AGC', 'AGT'], 'R': ['AGA', 'AGG']}

input_str = "IIMTSTTKRI"

The output would be a string of the three character substrings associated with each key.However there are many 3 character substrings that could be used, I would like to minimise the number of G's and C's that are next to one another.

I currently have this:

n = []

#make list of possible substrings for each character in string 
for i in str:
    if i in dict.keys():
       n.append(dict[i])
#generate all permutations 
p = [''.join(s) for s in itertools.product(*n)]

#if no consecutive GCs in a permutation add to list
ls = []
for i in p:
    q = i.count('GC')    
    if q == 0:
        ls.append(i)

Which 'works' but there are a couple of problems. The first (minor one) is that I have to assume the consective "GC" is 0 and for some strings that may not be possible. The second (major one), is its extremely slow for longer strings because it has to generate all permutations.

Can anyone provide a way to improve the speed or an alternative way?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

分分钟 2025-01-20 15:20:15

根据您的评论,您可以将问题视为最佳路径搜索(将您的问题视为一个图表,您必须遵循 input_str 中定义的路径,并且在每个顶点中,您必须从以下列表中进行选择定义的 3 字符宽字符串)。

有很多搜索算法,我的解决方案是使用 A*Star

from heapq import heappop, heappush

dct = {
    "I": ["ATA", "ATC", "ATT"],
    "M": ["ATG"],
    "T": ["ACA", "ACC", "ACG", "ACT"],
    "N": ["AAC", "AAT"],
    "K": ["AAA", "AAG"],
    "S": ["AGC", "AGT"],
    "R": ["AGA", "AGG"],
}


input_str = "IIMTSTTKRI"

def valid_moves(s):
    key = input_str[len(s) // 3]
    for i in dct[key]:
        yield s + i


def distance(s):
    return len(input_str) - (len(s) // 3)


def my_cost_func(_from, _to):
    return _to.count("GC")


def a_star(start, moves_func, h_func, cost_func):
    """
    Find a shortest sequence of states from start to a goal state
    (a state s with h_func(s) == 0).
    """

    frontier = [
        (h_func(start), start)
    ]  # A priority queue, ordered by path length, f = g + h
    previous = {
        start: None
    }  # start state has no previous state; other states will
    path_cost = {start: 0}  # The cost of the best path to a state.
    Path = lambda s: ([] if (s is None) else Path(previous[s]) + [s])
    while frontier:
        (f, s) = heappop(frontier)
        if h_func(s) == 0:
            return Path(s)
        for s2 in moves_func(s):
            g = path_cost[s] + cost_func(s, s2)
            if s2 not in path_cost or g < path_cost[s2]:
                heappush(frontier, (g + h_func(s2), s2))
                path_cost[s2] = g
                previous[s2] = s


path = a_star("", valid_moves, distance, my_cost_func)
print("Result:", path[-1])

这会打印:

Result: ATAATAATGACAAGTACAACAAAAAGAATAAGT

Based on your comments, you can look at the problem as a optimal path searching (Think about your problem as a graph where you must follow path defined in input_str and in each vertex you must chose from a list of defined 3-character wide strings).

There are many search algorithms, my solution is using A*Star:

from heapq import heappop, heappush

dct = {
    "I": ["ATA", "ATC", "ATT"],
    "M": ["ATG"],
    "T": ["ACA", "ACC", "ACG", "ACT"],
    "N": ["AAC", "AAT"],
    "K": ["AAA", "AAG"],
    "S": ["AGC", "AGT"],
    "R": ["AGA", "AGG"],
}


input_str = "IIMTSTTKRI"

def valid_moves(s):
    key = input_str[len(s) // 3]
    for i in dct[key]:
        yield s + i


def distance(s):
    return len(input_str) - (len(s) // 3)


def my_cost_func(_from, _to):
    return _to.count("GC")


def a_star(start, moves_func, h_func, cost_func):
    """
    Find a shortest sequence of states from start to a goal state
    (a state s with h_func(s) == 0).
    """

    frontier = [
        (h_func(start), start)
    ]  # A priority queue, ordered by path length, f = g + h
    previous = {
        start: None
    }  # start state has no previous state; other states will
    path_cost = {start: 0}  # The cost of the best path to a state.
    Path = lambda s: ([] if (s is None) else Path(previous[s]) + [s])
    while frontier:
        (f, s) = heappop(frontier)
        if h_func(s) == 0:
            return Path(s)
        for s2 in moves_func(s):
            g = path_cost[s] + cost_func(s, s2)
            if s2 not in path_cost or g < path_cost[s2]:
                heappush(frontier, (g + h_func(s2), s2))
                path_cost[s2] = g
                previous[s2] = s


path = a_star("", valid_moves, distance, my_cost_func)
print("Result:", path[-1])

This prints:

Result: ATAATAATGACAAGTACAACAAAAAGAATAAGT
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文