Latest revision as of 15:02, 1 October 2014

Reading (shared with Week 7)

Beginning Python - skim chapters 8-14 (use as reference material)
- see especially urlopen on p. 300, forks and threads on p. 304
Beginning Python - Chapter 15 (Web services)

Class 1: Effective Design

Structured Code, Bioinformatics example from AOS Book
Code Testing
Source Code Versioning
- basic git

Class 2: Using HPC Resources

Accessing binaries and libraries, using modules
Using scratch space
Submitting a job script
Managing queued jobs
Advanced scripting tips and tricks
- awk

Homework 4 (Due Fri., Oct. 10)

Please email the completed homework with the subject line "SciComp HW4, (your name)"

Write example functions that use the advanced function notation from Beginning Python, Ch. 6 (see especially the example on p. 124).
1. f(arg=default): the function should do nothing if the function is called as f(), and it should call arg.set_price(12) if it is called as f(type("InvItem", (), {"set_price":(lambda a,b: b)})())
2. f(*arg): the function should return the number of arguments passed in the call f('a', 'b', 1, 5, {'t': [4]})
3. f(**args): the function should return the value associated with the key "agent" in the call f(auto="DB5", lno=31337, agent="007")
Write an example python class to represent a general inventory item. It should store its own name, and must contain the following methods: getCount(), returning the (arbitrary, fixed) number of items in inventory, and getPrice(), which computes the price using the formula price = price0 - k*log(count), where price0 and k are arbitrary, fixed variables belonging to the object.
The article "Working with Big Data in Bioinformatics" (http://www.aosabook.org/en/posa/working-with-big-data-in-bioinformatics.html) describes software that reads lots of small strings and increments some counters for each string. The overall structure of their code contains a fast C++ library, a python wrapper, and python scripts. For each of the following routines, state which of those three categories you would place it in, and why.
1. A class that creates C++ objects representing counters for sequence data and that contains methods for translating the counts to numpy arrays.
2. A script that creates a plot of the k-mer counts in a subset of the data.
3. A function reading and parsing files containing genomic sequence data.
4. A script installing the complete Khmer package (compiling the C++ library, copying the python package, etc.)
Explain (without trying to solve their problems) why each of the following quotes from the article might be relevant to the performance of their code:
1. "We expected the highest traffic to be in the k-mer counting logic."
2. "Redundant calls to the toupper function were present in the highest traffic regions of the code."
3. "Input of genomic reads was performed line-by-line and on demand and without any readahead tuning."
4. "A copy-by-value of the genomic read struct [was] performed for every parsed and valid genomic read."
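
As background for the first homework problem, the sketch below illustrates the notation on unrelated examples; it is not a solution. It shows what the three-argument form of type() in that problem builds, next to an ordinary class statement, and then one small function for each parameter form. The names InvItemExplicit, greet, total and describe are hypothetical and chosen only for illustration.

```python
# Three-argument type(name, bases, namespace) builds a class at run time.
InvItem = type("InvItem", (), {"set_price": (lambda a, b: b)})

# Roughly the same class written with an ordinary class statement (hypothetical name).
class InvItemExplicit(object):
    def set_price(self, price):
        return price

item = InvItem()
# In the lambda, a plays the role of self and b is the price passed in.
assert item.set_price(12) == InvItemExplicit().set_price(12) == 12

def greet(name="world"):   # default value is used when no argument is given
    return "hello, " + name

def total(*numbers):       # extra positional arguments arrive as a tuple
    return sum(numbers)

def describe(**fields):    # keyword arguments arrive as a dict
    return sorted(fields.items())
```

Calling greet() falls back to the default, total(1, 2, 3) receives the tuple (1, 2, 3), and describe(a=1, b=2) receives the dict {"a": 1, "b": 2}.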

Codes

Power function with logarithmic run time in n (linear in the size of n):
<source lang="python">
def pow(x, n):
    '''
    Returns the number x raised to the integer power, n.

    >>> pow(2, 4)
    16
    >>> pow(3, 2)
    9
    >>> pow(5, 0)
    1

    complexity = O(log n) = O(m), where m = # digits in n
    '''
    if n < 1:
        return 1 # correct for n=0
    elif n == 1:
        return x
    elif n % 2 == 0:
        hp = pow(x, n//2) # floor division keeps the exponent an integer
        return hp*hp
    else: # 3, 5, 7, ...
        hp = pow(x, (n-1)//2)
        return x*hp*hp
</source>
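
To make the complexity claim concrete, here is a minimal sketch that follows the same recursion but also counts multiplications. The function name pow_counted is hypothetical and not part of the course code. Each level of the recursion halves n and performs at most two multiplications, so the count stays below twice the number of binary digits of n.

```python
def pow_counted(x, n):
    """Return (x**n, number of multiplications used) for an integer n >= 0."""
    if n < 1:
        return 1, 0
    if n == 1:
        return x, 0
    # n // 2 equals (n - 1) // 2 when n is odd, so one recursive call covers both cases
    hp, mults = pow_counted(x, n // 2)
    if n % 2 == 0:
        return hp * hp, mults + 1
    return x * hp * hp, mults + 2

for n in (10, 1000, 10**6):
    value, mults = pow_counted(2, n)
    assert value == 2 ** n
    # the multiplication count grows with the number of bits in n, not with n itself
    assert mults <= 2 * n.bit_length()
```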

Testing the last module using python's doctest:
<source lang="python">
#!/usr/bin/env python

if __name__=="__main__":
    import doctest, vector # assumes pow() is defined in vector
    doctest.testmod(vector)
</source>
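
When only one function needs checking, doctest can also be driven directly instead of through testmod(). The sketch below is one way to do that with the standard DocTestFinder and DocTestRunner classes; the helper run_doctests and the example function square are hypothetical names used only for illustration.

```python
import doctest

def square(x):
    """
    Return x squared.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x * x

def run_doctests(func):
    """Run the doctests in func's docstring and return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # globs supplies the names that the docstring examples are allowed to use
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)

results = run_doctests(square)
assert results.failed == 0
```

The returned counts make it easy to fail a build script when any docstring example stops matching.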

Using the python-geocoder-0.2 interface to Google's web-API to get distances:
<source lang="python">
from geocode.google import GoogleGeocoderClient
from numpy import *

geocoder = GoogleGeocoderClient(False) # must specify sensor parameter explicitly

def to_xyz(phi, th): # phi = polar angle measured from the north pole, th = longitude (radians)
    s = sin(phi) # the polar angle takes sin() in the x-y plane and cos() along z
    return array([s*cos(th), s*sin(th), cos(phi)])

def to_polar(lat, lon): # degrees -> (polar angle, longitude) in radians
    return (90-float(lat))*pi/180.0, float(lon)*pi/180.0

def dist(a, b): # distance in kilometers across a perfect sphere of radius 6370 km
    # clip guards against rounding pushing the dot product just outside [-1, 1]
    return 6370*arccos(clip(dot(to_xyz(*a), to_xyz(*b)), -1.0, 1.0))

def get_loc(name):
    result = geocoder.geocode(name)
    if result.is_success():
        return to_polar(*result.get_location())
    else:
        print("Geocoding failed")
        return (0.0, 0.0)

a = get_loc("Lowry Park Zoo") # spherical polar
b = get_loc("MOSI, Tampa, FL")

print(dist(a, b))
</source>
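
The distance calculation can be checked without a network connection or the geocoder package. The sketch below works from latitude and longitude in degrees using only the standard math module; the names unit_vector and great_circle_km are hypothetical, and the test points are synthetic rather than real geocoding results.

```python
from math import radians, sin, cos, acos, pi

EARTH_RADIUS_KM = 6370.0  # same spherical radius as the code above

def unit_vector(lat_deg, lon_deg):
    """Cartesian unit vector for a latitude/longitude pair given in degrees."""
    lat, lon = radians(lat_deg), radians(lon_deg)
    return (cos(lat) * cos(lon), cos(lat) * sin(lon), sin(lat))

def great_circle_km(p, q):
    """Distance in km between two (lat, lon) points on a perfect sphere."""
    dot = sum(a * b for a, b in zip(unit_vector(*p), unit_vector(*q)))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding just outside [-1, 1]
    return EARTH_RADIUS_KM * acos(dot)

# Two equatorial points 90 degrees apart are a quarter circumference apart.
quarter = great_circle_km((0.0, 0.0), (0.0, 90.0))
assert abs(quarter - EARTH_RADIUS_KM * pi / 2) < 1e-6
```

Known cases such as the quarter circumference give a quick sanity check on the trigonometry before real coordinates are plugged in.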
