Exercises

Published 2025-02-25 23:43:38

1. Write a function to find the complementary strand given a DNA sequence. For example:

Given ATCGTTA, return TAGCAAT.

Note: The complementary base pairs are A|T and C|G.

# YOUR CODE HERE

def complement(dna):
    """Return complementary strand given DNA sequence."""
    # str.maketrans maps each base (in either case) to its complement
    table = str.maketrans('actgACTG', 'tgacTGAC')
    return dna.translate(table)

print(complement('ATCGTTA'))

TAGCAAT
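One property of a translation table is that characters it does not list pass through unchanged, so a stray character such as X would be copied silently into the result. A minimal sketch of a stricter variant is shown here; the names `COMPLEMENT` and `strict_complement` are hypothetical, and unlike the table-based solution it upper-cases its input rather than preserving case.

```python
# Hypothetical lookup table of complementary bases (A|T, C|G).
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def strict_complement(dna):
    """Return the complementary strand, raising ValueError on non-ACGT input."""
    try:
        # Upper-case first so that 'atcg' and 'ATCG' are treated alike.
        return ''.join(COMPLEMENT[base] for base in dna.upper())
    except KeyError as err:
        raise ValueError('not a DNA base: %s' % err) from None

print(strict_complement('ATCGTTA'))  # TAGCAAT
```

Whether to reject or pass through unknown characters depends on the data; sequences with ambiguity codes such as N would need extra entries in the table.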

2. Write regular expressions that match the following:

Phone numbers with the format (919)-1234567 (e.g. (123)-9876543 should match, but not 234-1234567 or (123)-666666)
Email addresses such as john.doe@duke.edu (e.g. steve@gmail.com should match, but not steve@gmail)
DNA sequences with the motif A-C-T-G, where - indicates 0 or 1 other nucleotide (any of A, C, T or G)

# YOUR CODE HERE

import re

# Literal parentheses around 3 digits, a hyphen, then 7 digits
phone_pat = re.compile(r'\(\d{3}\)-\d{7}')

for s in ['(123)-9876543', '234-1234567', '(123)-666666']:
    m = phone_pat.match(s)
    if m:
        print('Matched', s)
    else:
        print('Not matched', s)

Matched (123)-9876543
Not matched 234-1234567
Not matched (123)-666666
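Note that `match` only anchors the pattern at the start of the string, so a number with extra trailing digits would still be accepted. A small sketch of the difference, using `fullmatch` from the standard `re` module; the helper name `is_phone` is hypothetical.

```python
import re

phone_pat = re.compile(r'\(\d{3}\)-\d{7}')

def is_phone(s):
    """Return True only if the whole string is a phone number."""
    return phone_pat.fullmatch(s) is not None

for s in ['(123)-9876543', '(123)-98765432', '234-1234567', '(123)-666666']:
    # Second column: match() result, third column: fullmatch() result
    print(s, phone_pat.match(s) is not None, is_phone(s))
```

The second test string has eight digits after the hyphen: `match` accepts it because the first seven digits satisfy the pattern, while `fullmatch` rejects it.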

Note: This is just for practice. Real email validators should not rely on regular expressions, because the rules for a valid email address are extremely complex and are better checked with a parser.

# One name part, optionally more dot-separated parts, then @ and
# a domain that contains at least one dot
email_pat = re.compile(r'\w+(\.\w+)*@(\w+\.)+\w+')

for s in ['johm@', 'john.doe@duke.edu', 'steve@gmail.com', 'steve@gmail']:
    m = email_pat.match(s)
    if m:
        print('Matched', s)
    else:
        print('Not matched', s)

Not matched johm@
Matched john.doe@duke.edu
Matched steve@gmail.com
Not matched steve@gmail

# [ACGT]? allows 0 or 1 extra nucleotide between the motif bases
motif_pat = re.compile(r'A[ACGT]?C[ACGT]?T[ACGT]?G')

for s in ['GATTACA', 'ACTG', 'AACCTTGG', 'AAACCCTTTGGG']:
    m = motif_pat.match(s)
    if m:
        print('Matched', s)
    else:
        print('Not matched', s)

Not matched GATTACA
Matched ACTG
Matched AACCTTGG
Not matched AAACCCTTTGGG
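Because `match` only tests the start of the string, a motif that begins later in a sequence is not reported. A minimal sketch of scanning a whole sequence with `finditer` follows; the helper name `find_motifs` and the sample sequence are made up for illustration, and the optional base is restricted to `[ACGT]` as the exercise statement describes.

```python
import re

motif_pat = re.compile(r'A[ACGT]?C[ACGT]?T[ACGT]?G')

def find_motifs(seq):
    """Return (start, matched_text) for each non-overlapping motif hit in seq."""
    return [(m.start(), m.group()) for m in motif_pat.finditer(seq)]

print(find_motifs('GGACTGTTAGCTTGA'))  # [(2, 'ACTG'), (8, 'AGCTTG')]
```

`finditer` resumes scanning after the end of each hit, so overlapping occurrences are not reported.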

3. Download ‘Pride and Prejudice’ by Jane Austen from Project Gutenberg.

Remove all punctuation and convert to lower case
Count how many times the word ‘married’ appears
Count how often the words ‘daughter’ and ‘married’ appear in the same 10-word window

# YOUR CODE HERE

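import os
import string
# partition(n, seq) is assumed to be toolz.partition: it yields
# non-overlapping tuples of n items and drops a final incomplete one
from toolz import partition

if not os.path.exists('pride_and_prejudice.txt'):
    ! curl 'http://www.gutenberg.org/cache/epub/1342/pg1342.txt' > 'pride_and_prejudice.txt'

with open('pride_and_prejudice.txt') as f:
    s = f.read()
    # Lower-case the text and delete every punctuation character
    s = s.lower().translate(str.maketrans('', '', string.punctuation))

    words = s.split()
    size = 10
    windows = list(partition(size, words))
    print("'daughter' and 'married' appear %d times in the same 10-word window" %
          sum('daughter' in window and 'married' in window for window in windows))
    # s.count counts substring occurrences, so a word such as 'unmarried'
    # is also counted; words.count('married') would count whole words only
    print("The word 'married' appears %d times" % s.count('married'))

'daughter' and 'married' appear 5 times in the same 10-word window
The word 'married' appears 61 times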
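If `partition` in the solution above is the `toolz` function, the text is cut into non-overlapping chunks of 10 words, so a pair of words that straddles a chunk boundary is missed. A sliding-window variant could be sketched as follows; the function name `cooccurrence_count` and the sample sentence are hypothetical, and the sketch makes no claim about what count it would give for the novel.

```python
def cooccurrence_count(words, a, b, size=10):
    """Count sliding windows of `size` consecutive words containing both a and b."""
    # A text shorter than `size` yields no complete window.
    n_windows = max(len(words) - size + 1, 0)
    return sum(
        a in words[i:i + size] and b in words[i:i + size]
        for i in range(n_windows)
    )

sample = ('her eldest daughter was married last spring '
          'in the village church near town').split()
print(cooccurrence_count(sample, 'daughter', 'married'))  # 3
```

Overlapping windows count the same pair of words several times (three times in the sample), so the two approaches answer slightly different questions.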

4. Download “The Gutenberg Webster’s Unabridged Dictionary” from Project Gutenberg.

First extract all defined words (109561 words; note that the solution below does not reproduce this number)
Count the number of defined English words containing 3 or more vowels (aeiou)
Find all the longest palindromes (a palindrome is a word that is spelt the same forwards as backwards, e.g. ‘deified’)

# YOUR CODE HERE

# If you look at the plain text file,
# it is quite hard to figure out how to extract a defined word.
# We have more luck with the HTML file.

if not os.path.exists('websters.html'):
    ! curl 'www.gutenberg.org/cache/epub/29765/pg29765.html' > 'websters.html'

! head -n 400 websters.html | tail -n 30

# Notice that in the HTML, word definitions have the structure <p id="xxxxxxx">WORD</br> or <p id="xxxxxxx">WORD NEWLINE

text = open('websters.html').read()
# Note: the trailing [...] is a character class, so it matches any single one
# of the characters < b r / > | + CR LF after the word, not the string <br/>
word_pat = re.compile(r'<p id="id\d+">([A-Z]+)[<br/>|\r\n+]')

words = word_pat.findall(text)
# Count the words that contain 3 or more vowels
count = 0
for word in words:
    if sum(word.count(v) for v in 'AEIOU') >= 3:
        count += 1

print("Number of words is %d" % len(words))
print("Number of words with 3 or more vowels is %d" % count)

# Keep the words that read the same backwards, then pick the longest ones
palindromes = [word for word in words if word == word[::-1]]
max_len = max(len(p) for p in palindromes)
print("Longest palindromes are", [p for p in palindromes if len(p) == max_len])

Number of words is 103020
Number of words with 3 or more vowels is 69210
Longest palindromes are ['MALAYALAM']
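The pattern in the solution ends with `[<br/>|\r\n+]`, which is a character class (any one of the listed characters) rather than the alternation "`<br/>` or a newline". The difference can be seen on a small made-up snippet; the HTML below is hypothetical and is not taken from the dictionary file, so this sketch says nothing about which pattern gives the better word count on the real data.

```python
import re

# Hypothetical snippet: one word followed by <br/>, one by </p>, one by a newline.
html = ('<p id="id00001">APPLE<br/>\n'
        '<p id="id00002">BOB</p>\n'
        '<p id="id00003">CAT\n')

# Character class: a single '<' after the word is already enough.
char_class = re.compile(r'<p id="id\d+">([A-Z]+)[<br/>|\r\n+]')
# Alternation: the word must be followed by the string <br/> or by line breaks.
alternation = re.compile(r'<p id="id\d+">([A-Z]+)(?:<br/>|[\r\n]+)')

print(char_class.findall(html))   # ['APPLE', 'BOB', 'CAT']
print(alternation.findall(html))  # ['APPLE', 'CAT']
```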
