CompSciWeek6
Latest revision as of 14:02, 1 October 2014
- Beginning Python - skim chapters 8-14 (use as reference material)
  - see especially urlopen on p. 300, forks and threads on p. 304
- Beginning Python - Chapter 15 (Web services)
Class 1: Effective Design
- Structured Code, Bioinformatics example from AOS Book
- Code Testing
- Source Code Versioning
  - basic git
Class 2: Using HPC Resources
- Accessing binaries and libraries, using modules
- Using scratch space
- Submitting a job script
- Managing queued jobs
- Advanced scripting tips and tricks
  - awk
Homework 4 (Due Fri., Oct. 10)
Please email the completed homework with the subject line "SciComp HW4, (your name)"
- Write example functions that use the advanced function notation from Beginning Python, Ch. 6 (see especially the example on p. 124).
  - f(arg=default): the function should do nothing if it is called as f(), and it should call arg.set_price(12) if it is called as f(type("InvItem", (), {"set_price":(lambda a,b: b)})())
  - f(*arg): the function should return the number of arguments passed in the call f('a', 'b', 1, 5, {'t': [4]})
  - f(**args): the function should return the value associated with the key "agent" in the call f(auto="DB5", lno=31337, agent="007")
- Write an example Python class to represent a general inventory item. It should store its own name and must contain the following methods: getCount(), which returns the (arbitrary, fixed) number of items in inventory, and getPrice(), which computes the price using the formula price = price0 - k*log(count), where price0 and k are arbitrary, fixed variables belonging to the object.
- The article "Working with Big Data in Bioinformatics" (http://www.aosabook.org/en/posa/working-with-big-data-in-bioinformatics.html) describes software that reads many small strings and increments some counters for each string. The overall structure of their code consists of a fast C++ library, a Python wrapper, and Python scripts. For each of the following routines, state which of those three categories you would place it in, and why.
  - A class that creates C++ objects representing counters for sequence data and that contains methods for translating the counts to numpy arrays.
  - A script that creates a plot of the k-mer counts in a subset of the data.
  - A function reading and parsing files containing genomic sequence data.
  - A script installing the complete Khmer package (compiling the C++ library, copying the Python package, etc.)
- Explain (without trying to solve their problems) why each of the following quotes from the article might be relevant to the performance of their code:
  - "We expected the highest traffic to be in the k-mer counting logic."
  - "Redundant calls to the toupper function were present in the highest traffic regions of the code."
  - "Input of genomic reads was performed line-by-line and on demand and without any readahead tuning."
  - "A copy-by-value of the genomic read struct [was] performed for every parsed and valid genomic read."
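The three parameter notations named in the first problem can be tried out on unrelated toy functions. The sketch below is only an illustration: the names `greet`, `total`, `describe`, and `Stub` are made up here and are not the homework answers. The last part shows what the three-argument `type(...)` call in the problem statement does: it builds a throwaway class on the fly, and the trailing `()` instantiates it.

```python
def greet(name="world"):
    # 'name' takes its default value when the caller omits it.
    return "hello, " + name

def total(*values):
    # Extra positional arguments arrive packed in a tuple.
    return sum(values)

def describe(**fields):
    # Keyword arguments arrive packed in a dict; return its keys, sorted.
    return sorted(fields)

# type(name, bases, namespace) creates a class dynamically.
# The lambda becomes a method, so its first parameter plays the role of 'self'.
Stub = type("Stub", (), {"double": (lambda self, x: 2 * x)})
stub = Stub()

print(greet())             # hello, world
print(greet("class"))      # hello, class
print(total(1, 2, 3))      # 6
print(describe(b=1, a=2))  # ['a', 'b']
print(stub.double(21))     # 42
```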
Codes
Power function with logarithmic run time in n (linear in the size of n):
<source lang="python">
def pow(x, n):
    '''
    Returns the number x raised to the non-negative integer power, n.

    >>> pow(2, 4)
    16
    >>> pow(3, 2)
    9
    >>> pow(5, 0)
    1

    complexity = O(log n) = O(m), where m = # digits in n
    '''
    if n < 1:
        return 1 # correct for n=0 (negative n is not supported)
    elif n == 1:
        return x
    elif n % 2 == 0:
        hp = pow(x, n//2) # integer division keeps n an int
        return hp*hp
    else: # 3, 5, 7, ...
        hp = pow(x, (n-1)//2)
        return x*hp*hp
</source>
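One way to check the O(log n) claim is to count multiplications. The sketch below follows the same halving recursion as the power function above, but the names `fast_pow` and `naive_pow` and the one-element list used as a counter are assumptions made for this illustration.

```python
def fast_pow(x, n, counter):
    """Halving recursion; counter[0] tallies the multiplications performed."""
    if n < 1:
        return 1
    if n == 1:
        return x
    hp = fast_pow(x, n // 2, counter)  # for odd n, n // 2 == (n - 1) // 2
    if n % 2 == 0:
        counter[0] += 1
        return hp * hp
    counter[0] += 2
    return x * hp * hp

def naive_pow(x, n, counter):
    """Straightforward loop: one multiplication per unit of n."""
    result = 1
    for _ in range(n):
        result *= x
        counter[0] += 1
    return result

for n in (10, 100, 1000):
    fast, naive = [0], [0]
    assert fast_pow(3, n, fast) == naive_pow(3, n, naive) == 3 ** n
    # The fast count grows with the number of bits in n, the naive count with n.
    print(n, fast[0], naive[0])
```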
Testing the module above with Python's doctest:
<source lang="python">
#!/usr/bin/env python
if __name__=="__main__":
    import doctest, vector # assumes pow() is defined in vector
    doctest.testmod(vector)
</source>
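For experimenting without a separate module file, doctest can also run the examples from a single docstring. This is a minimal sketch using the standard `doctest.DocTestParser` and `doctest.DocTestRunner` classes; the function `square` and the helper `run_examples` are hypothetical names chosen for the illustration.

```python
import doctest

def square(x):
    """
    Returns x squared.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x * x

def run_examples(func):
    # Parse the docstring into a DocTest, giving it a namespace in which
    # the function can be looked up by name, then run it.
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)  # TestResults(failed=..., attempted=...)

results = run_examples(square)
print(results.failed, results.attempted)
```

A failing example would be reported on standard output and counted in `results.failed`.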
Using the python-geocoder-0.2 interface to Google's web API to get distances:
<source lang="python">
from geocode.google import GoogleGeocoderClient
from numpy import array, arccos, clip, cos, dot, pi, sin

geocoder = GoogleGeocoderClient(False) # must specify sensor parameter explicitly

def to_xyz(phi, th):
    # phi is the polar angle (colatitude) and th the azimuth, both in radians
    s = sin(phi)
    return array([s*cos(th), s*sin(th), cos(phi)])

def to_polar(lat, lon):
    # degrees of latitude/longitude -> (colatitude, azimuth) in radians
    return (90-float(lat))*pi/180.0, float(lon)*pi/180.0

def dist(a, b): # distance in kilometers across a perfect sphere of radius 6370 km
    # clip guards against rounding pushing the dot product outside [-1, 1]
    return 6370*arccos(clip(dot(to_xyz(*a), to_xyz(*b)), -1.0, 1.0))

def get_loc(name):
    result = geocoder.geocode(name)
    if result.is_success():
        return to_polar(*result.get_location())
    else:
        print("Geocoding failed")
        return (0.0, 0.0)

a = get_loc("Lowry Park Zoo") # spherical polar
b = get_loc("MOSI, Tampa, FL")

print(dist(a, b))
</source>
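Taking the arccos of a dot product loses precision when the two points are close together, which is the usual case for locations within one city. A common alternative is the haversine formula, sketched below with the same 6370 km spherical-Earth radius. The function name `haversine_km` is an assumption for this illustration, and the test points are idealized (equator and poles) rather than geocoded results, so the sketch runs without network access.

```python
from math import radians, sin, cos, asin, sqrt, pi

EARTH_RADIUS_KM = 6370.0  # same perfect-sphere radius as in the code above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat = p2 - p1
    dlon = radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    # min() guards against rounding pushing h slightly above 1
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, h)))

# Two points on the equator, 90 degrees of longitude apart:
# a quarter of the circumference, i.e. 6370 * pi / 2 km.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(quarter, EARTH_RADIUS_KM * pi / 2)

# Pole to pole: half the circumference.
print(haversine_km(90.0, 0.0, -90.0, 0.0), EARTH_RADIUS_KM * pi)
```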