
Exercises

Published 2025-02-25 23:43:38

1. Write a function to find the complementary strand of a given DNA sequence. For example:

Given ATCGTTA, return TAGCAAT

Note: The complementary base pairs are A|T and C|G.

# YOUR CODE HERE

def complement(dna):
    """Return complementary strand given DNA sequence."""
    # string.maketrans is the Python 2 spelling; Python 3 uses str.maketrans
    import string
    table = string.maketrans('actgACTG', 'tgacTGAC')
    return dna.translate(table)

print complement('ATCGTTA')
TAGCAAT
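
The solution above is written for Python 2. As a minimal sketch, the same translation-table idea might look like this in Python 3, where `str.maketrans` replaces `string.maketrans`. The module-level name `_COMPLEMENT` is an illustrative choice, not part of the original solution.

```python
# Python 3 sketch of the same approach (assumption: only A, C, G, T in either case)
_COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')

def complement(dna):
    """Return the complementary strand (A<->T, C<->G), preserving case."""
    return dna.translate(_COMPLEMENT)

print(complement('ATCGTTA'))  # TAGCAAT
```

Characters outside the table are passed through unchanged, so any validation of the input would have to be added separately.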

2. Write a regular expression that matches each of the following:

  • Phone numbers with the format (919)-1234567 (e.g. (123)-9876543 should match, but not 234-1234567 or (123)-666666)
  • Email addresses such as john.doe@duke.edu (e.g. steve@gmail.com should match, but not steve@gmail)
  • DNA sequences with the motif A-C-T-G, where - indicates 0 or 1 other nucleotide (any of A, C, T or G)
# YOUR CODE HERE

import re

# Three digits in literal parentheses, a hyphen, then seven digits
phone_pat = re.compile(r'\(\d{3}\)-\d{7}')

for s in ['(123)-9876543', '234-1234567', '(123)-666666']:
    m = phone_pat.match(s)
    if m:
        print 'Matched', s
    else:
        print 'Not matched', s
Matched (123)-9876543
Not matched 234-1234567
Not matched (123)-666666
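
One caveat with the solution above is that `match` only anchors the start of the string, so a number with extra trailing digits would still be accepted. A minimal Python 3 sketch of a stricter check, using `re.fullmatch` (the helper name `is_phone` is hypothetical):

```python
import re

PHONE = re.compile(r'\(\d{3}\)-\d{7}')

def is_phone(s):
    """True only if the whole string is a phone number like (919)-1234567."""
    return PHONE.fullmatch(s) is not None

for s in ['(123)-9876543', '234-1234567', '(123)-666666', '(123)-98765432']:
    # match() accepts a valid prefix; fullmatch() also checks the end of the string
    print(s, PHONE.match(s) is not None, is_phone(s))
```

The last test string has eight digits after the hyphen: `match` reports a hit on its first seven, while `fullmatch` rejects it.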

Note: This is just for practice. Real email validators should not rely on regular expressions, because the rules for a valid email address are extremely complex; addresses should probably be checked with a parser.

# Local part: word characters with optional dotted parts; domain: at least one dot
email_pat = re.compile(r'\w+(\.\w+)*@(\w+\.)+\w+')

for s in ['john@', 'john.doe@duke.edu', 'steve@gmail.com', 'steve@gmail']:
    m = email_pat.match(s)
    if m:
        print 'Matched', s
    else:
        print 'Not matched', s
Not matched john@
Matched john.doe@duke.edu
Matched steve@gmail.com
Not matched steve@gmail
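
The note above suggests a parser rather than a regular expression. As a rough Python 3 sketch of that idea, the standard library's `email.utils.parseaddr` can split off the address before a coarse structural check. This is not a validator, and the helper name `looks_like_address` and its "dotted domain" rule are assumptions made for illustration.

```python
from email.utils import parseaddr

def looks_like_address(s):
    """Coarse check: parse, then require a local part and a dotted domain."""
    _, addr = parseaddr(s)                     # (display name, address)
    local, sep, domain = addr.rpartition('@')  # split on the last '@'
    return bool(local) and bool(sep) and '.' in domain

for s in ['john@', 'john.doe@duke.edu', 'John Doe <john.doe@duke.edu>', 'steve@gmail']:
    print(s, looks_like_address(s))
```

Unlike the regular expression, this also accepts an address wrapped in a display name, which may or may not be what the exercise intends.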

# [ACGT]? allows 0 or 1 nucleotide between the motif letters ('.' would match any character)
motif_pat = re.compile(r'A[ACGT]?C[ACGT]?T[ACGT]?G')

for s in ['GATTACA', 'ACTG', 'AACCTTGG', 'AAACCCTTTGGG']:
    m = motif_pat.match(s)
    if m:
        print 'Matched', s
    else:
        print 'Not matched', s
Not matched GATTACA
Matched ACTG
Matched AACCTTGG
Not matched AAACCCTTTGGG
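
The solution above only tests whether a sequence starts with the motif. If the motif may occur anywhere in a longer sequence, a Python 3 sketch along the following lines could report where it occurs. The helper name `find_motif` and the test sequences are illustrative, not from the original.

```python
import re

# A, C, T, G in order, each pair separated by at most one other nucleotide
MOTIF = re.compile(r'A[ACGT]?C[ACGT]?T[ACGT]?G')

def find_motif(seq):
    """Return (start, matched_text) for each non-overlapping motif hit in seq."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(seq)]

for seq in ['GATTACA', 'ACTG', 'TTACTGTT', 'GGATCTGAA']:
    print(seq, find_motif(seq))
```

`finditer` scans the whole string, so leading bases before the motif no longer prevent a hit.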

3. Download ‘Pride and Prejudice’ by Jane Austen from Project Gutenberg.

  • Remove all punctuation and convert to lower case
  • Count how many times the word ‘married’ appears
  • Count how often the words ‘daughter’ and ‘married’ appear in the same 10-word window
# YOUR CODE HERE

import os
import string
# Assumption: the original does not show where partition comes from.
# toolz.partition(n, seq) yields consecutive, non-overlapping n-tuples.
from toolz import partition

if not os.path.exists('pride_and_prejudice.txt'):
    ! curl 'http://www.gutenberg.org/cache/epub/1342/pg1342.txt' > 'pride_and_prejudice.txt'

with open('pride_and_prejudice.txt') as f:
    s = f.read()
    s = s.lower().translate(None, string.punctuation)

    words = s.split()
    size = 10
    windows = list(partition(size, words))
    print "'daughter' and 'married' appear %d times in the same 10-word window" % \
        sum('daughter' in window and 'married' in window for window in windows)
    # s.count counts substrings, so 'unmarried' is included;
    # words.count('married') would count whole words only
    print "The word 'married' appears %d times" % s.count('married')
'daughter' and 'married' appear 5 times in the same 10-word window
The word 'married' appears 61 times
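
To make the counting logic above easier to check without downloading the book, here is a small Python 3 sketch on an invented sample sentence. It assumes consecutive, non-overlapping windows; the function names `normalize`, `chunks` and `cooccurrences` are hypothetical. Unlike a partition that drops an incomplete final window, `chunks` keeps it.

```python
import string

def normalize(text):
    """Lower-case the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def chunks(words, size):
    """Split words into consecutive, non-overlapping windows of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def cooccurrences(words, a, b, size=10):
    """Count the windows that contain both word `a` and word `b`."""
    return sum(a in w and b in w for w in chunks(words, size))

# Invented sample text, not taken from the novel
sample = "Her daughter was married. The unmarried sister, a daughter too, stayed home!"
words = normalize(sample).split()

print(words.count('married'))              # whole-word count
print(normalize(sample).count('married'))  # substring count, also hits 'unmarried'
print(cooccurrences(words, 'daughter', 'married', size=4))
```

The first two prints differ on this sample, which illustrates why a substring count can overstate how often the word itself appears.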

4. Download “The Gutenberg Webster’s Unabridged Dictionary” from Project Gutenberg.

  • First extract all defined words (109561 words; note that the solution below does not reproduce this number)
  • Count the number of defined English words containing 3 or more vowels (aeiou)
  • Find all the longest palindromes (a palindrome is a word that is spelt the same forwards and backwards, e.g. ‘deified’)
# YOUR CODE HERE

# If you look at the plain text file,
# it is quite hard to figure out how to extract a defined word.
# We have more luck with the HTML file.

if not os.path.exists('websters.html'):
    ! curl 'www.gutenberg.org/cache/epub/29765/pg29765.html' > 'websters.html'
! head -n 400 websters.html | tail -n 30
# Notice that in the HTML, word definitions have the structure <p id="xxxxxxx">WORD<br/> or <p id="xxxxxxx">WORD NEWLINE

text = open('websters.html').read()
# Caution: the trailing [...] is a character class, so it matches any ONE of
# the characters < b r / > | \r \n +, not the alternatives "<br/>" or a newline.
# It is left unchanged because the counts below were produced with it.
word_pat = re.compile(r'<p id="id\d+">([A-Z]+)[<br/>|\r\n+]')

words = word_pat.findall(text)
# Count words with at least 3 vowels (the extracted headwords are upper case)
count = 0
for word in words:
    if sum(word.count(v) for v in 'AEIOU') >= 3:
        count += 1

print "Number of words is %d" % len(words)
print "Number of words with 3 or more vowels is %d" % count

palindromes = [word for word in words if word == word[::-1]]
max_len = max(map(len, palindromes))
print "Longest palindromes are", [p for p in palindromes if len(p) == max_len]
Number of words is 103020
Number of words with 3 or more vowels is 69210
Longest palindromes are ['MALAYALAM']
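
The three steps above (extract headwords, count vowels, find the longest palindromes) can be tried on a small in-memory snippet. The Python 3 sketch below uses an invented HTML fragment that imitates the markup described in the solution's comments, and an explicit alternation `(?:<br/>|\r?\n)` in place of the character class. The pattern name `HEADWORD`, the helper names and the sample entries are all hypothetical, and the real file may differ.

```python
import re

# Hypothetical snippet imitating the headword markup described above
html = ('<p id="id00001">DEIFIED<br/>\n'
        'De"i*fied, a.</p>\n'
        '<p id="id00002">Not a headword</p>\n'
        '<p id="id00003">MALAYALAM\n'
        'Ma`la*ya"lam, n.</p>\n'
        '<p id="id00004">SKY<br/>\n'
        'Sky, n.</p>\n')

# Upper-case word right after the opening tag, followed by <br/> or a line break
HEADWORD = re.compile(r'<p id="id\d+">([A-Z]+)(?:<br/>|\r?\n)')

def count_vowels(word):
    """Number of A, E, I, O, U characters in an upper-case word."""
    return sum(word.count(v) for v in 'AEIOU')

def longest_palindromes(words):
    """Return every palindrome that has the maximum palindrome length."""
    pals = [w for w in words if w == w[::-1]]
    if not pals:
        return []
    longest = max(len(w) for w in pals)
    return [w for w in pals if len(w) == longest]

words = HEADWORD.findall(html)
print(words)
print(sum(count_vowels(w) >= 3 for w in words))
print(longest_palindromes(words))
```

Because the alternation is stricter than the character class, running it on the full dictionary would not necessarily reproduce the counts reported above.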
