CompSciWeek6
Contents
- Beginning Python - skim. chapters 8-14 (use as reference material)
- see expecially urlopen on p. 300, forks and threads on p. 304
- Beginning Python - Chapter 15 (Web services)
Class 1: Effective Design
- Structured Code, Bioinformatics example from AOS Book
- Code Testing
- Source Code Versioning
- basic git
Class 2: Using HPC Resources
- Accessing binaries and libraries, using modules
- Using scratch space
- Submitting a job script
- Managing queued jobs
- Advanced scripting tips and tricks
- awk
Homework 4 (Due Fri., Oct. 10)
Please email the completed homework with the subject line "SciComp HW4, (your name)"
- Write example functions that use the advanced function notation from Beginning Python, Ch. 6 (see especially the example on p. 124).
- f(arg=default): the function should do nothing if the function is called as f(), and it should call arg.set_price(12) if it is called as f(type("InvItem", (), {"set_price":(lambda a,b: b)})())
- f(*arg): the function should return the number of arguments passed in the call f('a', 'b', 1, 5, {'t': [4]})
- f(**args): the function should return the value associated with the key "agent" in the call f(auto="DB5", lno=31337, agent="007")
- Write an example python class to represent a general inventory item. It should store its own name, and must contain the following methods: getCount(), returning the (arbitrary, fixed) number of items in inventory, and getPrice(), which computes the price using the formula price = price0 - k*log(count), where price0 and k are arbitrary, fixed variables belonging to the object.
- The article "Working with Big Data in Bioinformatics" describes software that reads lots of small strings and increments some counters for each string. The overall structure of their code contains a fast C++ library, a python wrapper, and python scripts. Describe which of those three categories you would place each of the following routines in, and why.
- A class that creates C++ objects representing counters for sequence data and that contains methods for translating the counts to numpy arrays.
- A script that creates a plot of the k-mer counts in a subset of the data.
- A function reading and parsing files containing genomic sequence data.
- A script installing the complete Khmer package, (compiling the C++ library, copying the python package, etc.)
- Explain (without trying to solve their problems) why each of the following quotes from the article might be relevant to the performance of their code:
- "We expected the highest traffic to be in the k-mer counting logic."
- "Redundant calls to the toupper function were present in the highest traffic regions of the code."
- "Input of genomic reads was performed line-by-line and on demand and without any readahead tuning."
- "A copy-by-value of the genomic read struct [was] performed for every parsed and valid genomic read."
Codes
Power function with logarithmic run time in n (linear in the size of n) <source lang="python> def pow(x, n): Returns the number x raised to the integer power, n.
>>> pow(2, 4) 16 >>> pow(3, 2) 9 >>> pow(5, 0) 1
complexity = O(log n) = O(m), where m = # digits in n if n < 1: return 1 # correct for n=0 elif n == 1: return x elif n % 2 == 0: hp = pow(x, n/2) return hp*hp else: # 3, 5, 7, ... hp = pow(x, (n-1)/2) return x*hp*hp </source>
Using the python-geocoder-0.2 interface to Google's web-API to get distances: <source lang="python"> from geocode.google import GoogleGeocoderClient from numpy import *
geocoder = GoogleGeocoderClient(False) # must specify sensor parameter explicitely
def to_xyz(phi, th):
c = cos(phi) return array([c*cos(th), c*sin(th), sin(phi)])
def to_polar(lat, lon): return (90-float(lat))*pi/180.0, float(lon)*pi/180.0
def dist(a, b): # distance in kilometers across a perfect sphere of radius 6370 km
return 6370*arccos(dot(to_xyz(*a), to_xyz(*b)))
def get_loc(name): result = geocoder.geocode(name) if result.is_success(): return to_polar(*result.get_location()) else: print "Geocoding failed" return (0.0, 0.0)
a = get_loc("Lowry Park Zoo") # spherical polar
b = get_loc("MOSI, Tampa, FL")
print dist(a, b) </source>