Sunday, January 31, 2010

Two sample t-test in Python


[UPDATE: I'm currently (November 2010) in the middle of a series of posts about Student's t test. The first six posts are here, here, here, here, here, and here.]

I'd like to do my teaching in Python as much as possible. In keeping with that, I plan to explore methods for analyzing microarray data using Python rather than R and Bioconductor. As a preliminary matter, I'm going to use Student's t-test (two sample) as implemented in PyCogent. A simple-minded approach, in a situation where the samples have already been classified, would be to filter the genes by their t-test statistic.

Here is a simple test. We draw a bunch of numbers from the normal distribution and keep them in a numpy array. We slide a window along the array, pulling out two small sets of n numbers at a time, and do a two sample t-test on them, recording the p-value. Since all the numbers come from the same normal distribution, we expect p < 0.05 about 5% of the time, on average.


# two sample t-tests
import numpy as np
import cogent.maths.stats.test as stats

def oneTrial(n, f=np.random.normal):
    N = 500                  # num of individual tests
    SZ = n*2*N               # need this many nums
    draw = f(loc=50, scale=3, size=SZ)
    counter = 0
    for i in range(0, SZ, n*2):
        nums1 = draw[i:i+n]
        nums2 = draw[i+n:i+2*n]
        t, prob = stats.t_two_sample(nums1, nums2)
        if prob < 0.05:  counter += 1
    return 1.0*counter / N

Now, we run the trial a bunch of times. For this run, I've chosen the sample size to be 3.


n = 3             # sample size
L = list()
for i in range(100):
    #if i and not i % 10:  print 'i =', i
    L.append(oneTrial(n))

print 'avg %3.4f, std %3.4f' % (np.mean(L), np.std(L)),
print 'min %3.4f, max %3.4f' % (min(L), max(L))

Output:


avg 0.0511, std 0.0095 min 0.0280, max 0.0780
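
scipy implements the same test, for anyone who would rather skip the PyCogent dependency. A minimal sketch (ttest_ind is scipy's independent two-sample t test, and should agree with stats.t_two_sample above):

import numpy as np
from scipy import stats as ss

nums1 = np.random.normal(loc=50, scale=3, size=3)
nums2 = np.random.normal(loc=50, scale=3, size=3)
t, prob = ss.ttest_ind(nums1, nums2)   # t statistic and two-tailed p-value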

Saturday, January 30, 2010

Checkerboard in Numpy

Here's a question on SO about how to make a checkerboard in Numpy (like you see in pdf viewers as the classic representation for "no pixels", i.e. transparency). I have no doubt that I had the correct answer, but I could not buy a vote. Even from a guy with 22.9 K rep. Sheesh.

We use hstack and vstack to stack "unit" squares of solid color:

>>> import numpy as np
>>> u = 15
>>> b = np.zeros(u**2)
>>> b.shape = (u,u)
>>> w = b + 0x99
>>>
>>> width = 20 # squares across of a single type
>>> row1 = np.hstack([w,b]*width)
>>> row2 = np.hstack([b,w]*width)
>>> board = np.vstack([row1,row2]*width)
>>> board.shape
(600, 600)


Now just slice it down to what's needed

>>> width_actual = 357
>>> height_actual = 239
>>> board = board[:height_actual,:width_actual]
>>> board.shape
(239, 357)
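
Another sketch, for comparison (my alternative, not from the SO thread): integer division plus broadcasting builds the board at its final size, with nothing to trim:

>>> import numpy as np
>>> u = 15
>>> h, w = 239, 357
>>> rows = np.arange(h) // u          # square index of each row
>>> cols = np.arange(w) // u          # square index of each column
>>> board = ((rows[:,np.newaxis] + cols) % 2) * 0x99
>>> board.shape
(239, 357)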

Broadcasting in Numpy

Array shapes matter when you combine arrays with arithmetic operations. Here is a simple array of 4 rows and 3 columns:

>>> from numpy import ones
>>> A = ones((4,3)) # 4 rows x 3 cols
>>> A.shape
(4, 3)
>>> A
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])


Here is a second array of 4 rows x 1 col

>>> ones((4,1))      # 4 rows x 1 col
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])


We can add them, and in the process the second array is "broadcast" to the same shape as the first one:

>>> A + ones((4,1))
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])


An array of a single row and 3 cols works the same:

>>> ones((1,3))      # 1 row x 3 cols
array([[ 1.,  1.,  1.]])
>>> A + ones((1,3))
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])


Now, it is possible to make a 1D array, e.g.:

>>> B = ones((3,))   # a 1D array
>>> B
array([ 1.,  1.,  1.])
>>> B.shape
(3,)


And again, we can add A + B:

>>> A + B
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])


But this doesn't work if the number of columns doesn't match:


>>> C = ones((4,))   # a 1D array
>>> C.shape
(4,)
>>> C
array([ 1.,  1.,  1.,  1.])
>>> A + C
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: shape mismatch: objects cannot be broadcast to a single shape


We need newaxis to add a dimension:

>>> from numpy import newaxis
>>> D = C[:,newaxis]
>>> D.shape
(4, 1)
>>> A + D
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])
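
The general rule, as I understand it: shapes are compared from the trailing dimension backward, and any dimension of size 1 gets stretched to match its partner. So a (4,1) array and a (3,) array broadcast against each other into a (4,3) result, an "outer sum":

>>> from numpy import arange, newaxis
>>> col = arange(4)[:,newaxis]   # shape (4, 1)
>>> row = arange(3)              # shape (3,), treated as (1, 3)
>>> col*10 + row
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])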

Cycling along a list

This question on Stack Overflow involved the problem of moving along a list and then, once you reach the end, cycling back to the beginning. One very simple way to do that is to use the % (mod) operator. If the length of the list isn't changing (and it isn't a good idea to change a list during iteration), you can cache the length of the list:

>>> L = list('abcde')
>>> N = len(L)
>>> rL = list()
>>> for i in range(12):
...     rL.append(L[i%N])
...
>>> print ''.join(rL)
abcdeabcdeab

Other answers used several functions from itertools:

from itertools import cycle


We can check out the docs (remember pydoc?):

$ pydoc -w itertools


>>> L = list('abcde')
>>> it = cycle(L)
>>> rL = list()
>>> for i in range(12):
...     rL.append(it.next())
...
>>> print ' '.join(rL)
a b c d e a b c d e a b

To get the elements in groups of two (advancing one index at a time):

>>> it = cycle(L)
>>> rL = list()
>>> x = it.next()
>>> for i in range(12):
...     y = it.next()
...     rL.append(x+y)
...     x = y
...
>>> print ' '.join(rL)
ab bc cd de ea ab bc cd de ea ab bc

Or, you can do a full itertools solution with the functions tee and izip as well as cycle.

tee makes two independent iterators for the sequence and izip zips them up (of course):

>>> from itertools import izip, cycle, tee
>>>
>>> def pairwise(seq):
...     a, b = tee(seq)
...     next(b)
...     return izip(a, b)
...
>>> L = list('abcde')
>>> rL = list()
>>> it = pairwise(cycle(L))
>>> for i in range(12):
...     rL.append(''.join(it.next()))
...
>>> print ' '.join(rL)
ab bc cd de ea ab bc cd de ea ab bc

Or use islice:

>>> from itertools import islice
>>> it = pairwise(cycle(L))
>>> [''.join(t) for t in islice(it,6)]
['ab', 'bc', 'cd', 'de', 'ea', 'ab']

Friday, January 29, 2010

Andrew Dalke's Slides for Introduction to Python

I came across a nice set of slides from a course that Andrew Dalke taught a few years ago. They give a great introduction to Python for people interested in Bioinformatics. Here is a link (scroll to the very bottom of the page).

Andrew has a blog, and a company with a presence on the web (where the slides are hosted), and he's on Stack Overflow as dalke.

Thanks, Andrew.

Viewing a nested list as a tree

The problem, as discussed in the last post, is to "flatten" a list of nested lists of unspecified structure. One can look at such an object as a tree.

I thought about writing code to do a "depth-first traversal" of this tree, but then I remembered PyCogent. It has functions that manipulate trees. All we have to do is to convert our data to a form that it can recognize:

>>> L = [[[1, 2, 3], [4, 5]], 6]
>>> `L`
'[[[1, 2, 3], [4, 5]], 6]'
>>> t = `L`.replace('[','(')
>>> t = t.replace(']',')') + ';'
>>> t
'(((1, 2, 3), (4, 5)), 6);'

It's easiest to load the data from a file, so first we write it:

>>> FH = open('temp.tree','w')
>>> FH.write(t)
>>> FH.close()
>>>
>>> from cogent import LoadTree
>>> tr = LoadTree('temp.tree')
>>> tr
Tree("(((1,2,3),(4,5)),6);")
>>>
>>> print tr.asciiArt()
                              /-1
                             |
                    /edge.0--|--2
                   |         |
          /edge.2--|          \-3
         |         |
         |         |          /-4
-root----|          \edge.1--|
         |                    \-5
         |
          \-6
>>>
>>> [str(tip) for tip in tr.iterTips()]
['1;', '2;', '3;', '4;', '5;', '6;']

Flatten a list in Python

I asked another question on Stack Overflow, this time about "flattening" a list of lists. If the input is

L = [[[1, 2, 3], [4, 5]], 6]

the output should be

[1, 2, 3, 4, 5, 6]

The (simple) solution that I liked the best is a generator (i.e. it returns an iterator). I changed the variable names slightly:


import collections
It = collections.Iterable

def flatten(L):
    for e in L:
        if isinstance(e, It) and not isinstance(e, basestring):
            for sub in flatten(e):
                yield sub
        else:
            yield e


In the interpreter:

>>> L = [[[1, 2, 3], [4, 5]], 6]
>>> flatten(L)
<generator object flatten at 0x100492140>
>>> list(flatten(L))
[1, 2, 3, 4, 5, 6]


As you can see, this works, and it also works if one of the elements is a string or unicode object (that's the reason for checking isinstance(e, basestring)).

>>> list(flatten([[['abc',2,3],[4,5]],6]))
['abc', 2, 3, 4, 5, 6]


Recursion will fail if the "depth" is too great. SO user unutbu came up with another version that doesn't use recursion, which Alex Martelli simplified:

def genflat(L, ltypes=collections.Sequence):
    L = list(L)
    while L:
        while L and isinstance(L[0], ltypes):
            L[0:1] = L[0]
        if L:  yield L.pop(0)


There's a bit of trickery here. Suppose we're just starting with

L = [[[1, 2, 3], [4, 5]], 6]

The first element, L[0], is [[1, 2, 3], [4, 5]]

so we go into the second while and we repeatedly do

L[0:1] = L[0]

>>> L = [[[1, 2, 3], [4, 5]], 6]
>>> L[0]
[[1, 2, 3], [4, 5]]
>>> L[:1]
[[[1, 2, 3], [4, 5]]]
>>> L[1:]
[6]


We continue until L[0] is not an iterable…

>>> L[0:1] = L[0]
>>> L
[[1, 2, 3], [4, 5], 6]
>>> L[0:1] = L[0]
>>> L
[1, 2, 3, [4, 5], 6]


Assigning L[0] to the slice L[0:1] removes one level of nesting!

Note that this will choke on a string element (we need to guard against that as we did in the first version).

And a little matter of style. For short-lived variables I prefer simple names: i (an index), N (a numerical constant), s (a string), L (a list), D (a dictionary). For many people, capitalized names (even single letters) should be reserved for class objects. This instinct is so strong that the people posting on SO followed my usage of a single letter for the list variable, but made it lowercase. Of course, a lowercase "l" is also a bad idea, because it can be confused with the digit "1", but they probably figured that was my problem.

Thursday, January 28, 2010

Sending data into a generator function

Generators and iterators are relatively new to Python. Here is a nice discussion of functional programming that has a lot of good information about them, as well as other topics.

There is some technical distinction between the terms. An iterator is something you can invoke next on, and a generator function or expression is something that returns an iterator. At least that much, I think I understand. So this first example is a simple generator function for the Fibonacci series:

>>> def fib():
...     a, b = 1,1
...     while True:
...         yield a
...         a, b = b, a+b
...
>>> it = fib()
>>> import itertools
>>> list(itertools.islice(it, 10))
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]


We use islice from itertools to get a slice from it. This setup has the disadvantage that we are computing two values ahead of the one we actually return, but it allows me to do the yield in only one place.

One great thing about iterators is that they allow you to deal with sequences that go on forever (like the Fibonacci sequence does) in a compact way. One can make a finite iterator like this (the explicit raise StopIteration is actually redundant; a generator raises it automatically when it falls off the end):

>>> def f():
...     for i in range(3):
...         yield 2**i
...     raise StopIteration
...
>>> it = f()
>>> for n in it: print n
...
1
2
4
>>> it = f()
>>> list(it)
[1, 2, 4]


The next example uses an "advanced" feature of iterators. It is possible to code it so that a value can be passed into the generator function by use of the send method. See here. In this contrived example, we simply advance the generator a given number of steps:

>>> def fib():
...     a, b = 1,1
...     while True:
...         value = (yield a)
...         try:
...             if value:
...                 value = int(value)
...         except ValueError:
...             value = None
...         if value:
...             for i in range(value):
...                 a, b = b, a+b
...         else:
...             a, b = b, a+b
...
>>> it = fib()
>>> it.next()
1
>>> it.send(5)
8
>>> import itertools
>>> list(itertools.islice(it, 10))
[13, 21, 34, 55, 89, 144, 233, 377, 610, 987]


The call to send resumes the generator: the (yield a) expression evaluates to the value we sent in, the body runs to the next yield, and send returns that newly yielded value.

A more serious use of send is described by Alex Martelli here (in the first answer to the question).

What should I know in Python?

This was posted on SO (and promptly closed!):

What questions should every good Python developer be able to answer?

Alex Martelli, author of the Python Cookbook and Python guru extraordinaire, gave these four questions:

• how do you sort a list of dicts by the value each has for key 'foo'?

There's a really nice page on sorting here. Recall that the built-in function sorted takes a keyword argument key. I usually define a function to feed to sorted,

>>> L = [{'foo':1}, {'foo':3}, {'foo':10}]
>>> def f(e): return e['foo']
...
>>> sorted(L, key=f)
[{'foo': 1}, {'foo': 3}, {'foo': 10}]

but the cool kids use a lambda (key=lambda d: d['foo']) or, better, operator.itemgetter:

>>> import operator
>>> sorted(L, key = operator.itemgetter('foo'))
[{'foo': 1}, {'foo': 3}, {'foo': 10}]

sorted also can do reverse:

>>> sorted(L, key=f, reverse=True)
[{'foo': 10}, {'foo': 3}, {'foo': 1}]




• how do you get the highest 10 integers out of a list of a million integers?

The sort method of lists can do this in an instant (here I settle for the top 5; a heapq sketch follows the output):

>>> import random
>>> N = int(1E6)
>>> L = [random.choice(xrange(N)) for i in range(N)]
>>> L.sort()
>>> L[-5:]
[999990, 999991, 999992, 999994, 999998]
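
For a million elements, heapq is arguably the better tool, since nlargest keeps a heap of only k items instead of sorting the whole list (a sketch, reusing L from above):

>>> import heapq
>>> heapq.nlargest(5, L)
[999998, 999994, 999992, 999991, 999990]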





• how do you sort a list of strings in case-insensitive alphabetical order?

Feed the string method str.lower as the key to sort:

>>> FH = open('/usr/share/dict/words')
>>> words = FH.read().strip().split('\n')
>>> FH.close()
>>> len(words)
234936
>>> L = [random.choice(words) for i in range(1000)]
>>> L.sort(key=str.lower)
>>> for w in L[10:15]:
...     print w
...
adenophthalmia
advenient
Aegisthus
afterpast
afterripening




• given a list of ints, how do you make a string with their str forms separated by spaces?

This one's trivial.

>>> L = range(10)
>>> print ' '.join([str(n) for n in L])
0 1 2 3 4 5 6 7 8 9
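
map gets there with a little less typing:

>>> print ' '.join(map(str, L))
0 1 2 3 4 5 6 7 8 9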

Wednesday, January 27, 2010

Python datetime module

Steven Lott recommends on SO to use the standard library. In this case, the module of interest is datetime. Normally you would use it something like this:

>>> import datetime
>>> d = datetime.date(2010, 1, 27)
>>> d
datetime.date(2010, 1, 27)
>>> t = datetime.time(12, 30)
>>> t
datetime.time(12, 30)
>>> datetime.datetime.combine(d, t)
datetime.datetime(2010, 1, 27, 12, 30)


The fields are year, month, day, hour, minute, and second, plus microseconds.

"A datetime object is a single object containing all the information from a date object and a time object. Like a date object, datetime assumes the current Gregorian calendar extended in both directions; like a time object, datetime assumes there are exactly 3600*24 seconds in every day."


What time is it now?

>>> n = datetime.datetime.now()
>>> n
datetime.datetime(2010, 1, 27, 17, 41, 47, 94035)
>>> n.__repr__()
'datetime.datetime(2010, 1, 27, 17, 41, 47, 94035)'
>>> print n
2010-01-27 17:41:47.094035
>>> n.ctime()
'Wed Jan 27 17:41:47 2010'


The format returned by print generally follows the ISO-8601 format, except for the microseconds, which I don't find in the wikipedia entry describing ISO-8601.

The method strftime() takes a large number of format codes to control its behavior.

>>> for c in 'aAbBcdfHIjmMyY':
...     fc = '%' + c
...     print fc, n.strftime(fc)
...
%a Wed
%A Wednesday
%b Jan
%B January
%c Wed Jan 27 17:41:47 2010
%d 27
%f 094035
%H 17
%I 05
%j 027
%m 01
%M 41
%y 10
%Y 2010
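
Going the other direction, strptime parses a string using the same codes (a quick sketch):

>>> s = '2010-01-27 17:41:47'
>>> datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
datetime.datetime(2010, 1, 27, 17, 41, 47)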


Another useful object is the timedelta.

"A timedelta object represents a duration, the difference between two dates or times."


>>> n = datetime.date.today()
>>> my_bd = datetime.date(1955, 10, 24)
>>> diff = n - my_bd
>>> diff
datetime.timedelta(19819)
>>> d = 7 + 30 + 31 + 27 # days since my last bd
>>> 54*365 + 14 + d # years (in days) + leap days
19819
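
A timedelta can be added back to a date, so the arithmetic checks both ways:

>>> my_bd + diff == n
True
>>> n + datetime.timedelta(days=100)
datetime.date(2010, 5, 7)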


One of the issues that always comes up is time zones. Time zones are complicated, because their behavior is variable and needs to be specified. So we need yet another class object to deal with them. I'll look at that another time.

Tuesday, January 26, 2010

Python csv module

I'm trying to do what Steven Lott recommends on SO, to use the standard library. He also often says:
"Use the source, Luke."

In this case, the module of interest is csv. We don't need the source here. Normally you would use it something like this:

import csv
fn = 'mydata.csv'
reader = csv.reader(open(fn, 'rb'))
for line in reader:
    dosomething(line)

csv.reader is initialized with a file object (what I learned to call a file "handle"), for which I typically use the variable name FH. But it will also work with any iterable of lines. So we can use it as shown here:

>>> import csv
>>> data = 'a,b\n1,2\n3,4\n'
>>> reader = csv.reader(data.split('\n'))
>>> for row in reader:
...     print row
...
['a', 'b']
['1', '2']
['3', '4']
[]

We should have done strip() on the data before split(); that would get rid of the extra element, which shows up as an empty list at the end of the output. csv also has cool DictReader (and DictWriter) objects:

>>> import csv
>>> data = 'a,b\n1,2\n3,4\n'
>>> reader = csv.DictReader(data.split('\n'))
>>> for D in reader:
...     print D
...
{'a': '1', 'b': '2'}
{'a': '3', 'b': '4'}

I don't know a way to get DictWriter to write the header. I guess you'd just do it separately with FH.write() when the file is first opened (but see the sketch below).

>>> import csv
>>> data = 'a,b\n1,2\n3,4\n'
>>> reader = csv.DictReader(data.split('\n'))
>>> L = [D for D in reader]
>>> print L
[{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]
>>> FH = open('result.csv', 'w')
>>> writer = csv.DictWriter(FH, fieldnames = ['a','b'])
>>> writer.writerows(L)
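
One workaround (a sketch): before calling writerows, write a header row by hand, using a dict that maps each fieldname to itself. (Python 2.7's DictWriter also grew a writeheader method that does exactly this.)

>>> writer.writerow(dict((fn, fn) for fn in ['a','b']))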

Closures in Python

A question came up on Stack Overflow about nested functions and closures.

According to wikipedia, 'a closure is a first-class function with free variables that are bound in the lexical environment. Such a function is said to be "closed over" its free variables.' An example should make it clearer.

def adder(n):
    def func(x):
        return x + n
    return func


adder is a so-called factory function. We exercise it like so:

>>> i = 2
>>> f = adder(i)
>>> f(10)
12
>>> j = 3
>>> g = adder(j)
>>> g(10)
13


So the question on SO was, "why does the nested function" (func, assigned to f) "remember the first value" of n?

One answer is: because that's the rule!

>>> hex(id(i))
'0x100202810'
>>> f
<function func at 0x100479050>
>>> f.func_closure
(<cell at 0x10048fde0: int object at 0x100202810>,)


f.func_closure is a tuple. The first item is a "cell" that contains the same int object passed to adder in constructing f.

>>> hex(id(f.func_closure[0].cell_contents))
'0x100202810'
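
The cell shows that the closure captures the variable itself, not a frozen copy of its value. With a mutable object, the nested function can even update state shared across calls (a sketch):

>>> def counter():
...     n = [0]                # a list, so the nested function can mutate it
...     def tick():
...         n[0] += 1
...         return n[0]
...     return tick
...
>>> c = counter()
>>> c(), c(), c()
(1, 2, 3)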


More about function objects here

Monday, January 25, 2010

Pydoc


Category: learn something every day.

In the Python interpreter you can do (docs):

>>> import itertools
>>> help(itertools)

and get what look like Unix man pages for the module. But I didn't know you can do:
$ pydoc -w itertools

from the command line and pydoc will write an html file to the current directory (or you can redirect). Very helpful!

Sunday, January 24, 2010

Permutations done right

In the last post (about dups in permutations), I forgot to mention the correct way to handle this problem. It is to use itertools.permutations (docs).

>>> from itertools import permutations as perms
>>> L = range(10)
>>> g = perms(L)
>>> rL = [g.next() for i in range(5000)]
>>> len(set(rL))
5000


These are in sorted order.

>>> rL[0]
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
>>> rL[1]
(0, 1, 2, 3, 4, 5, 6, 7, 9, 8)

You need to do random.shuffle if you want them in random order.

Perhaps there is a better way that I don't know? What I'd like is to get a generator that can give the next permutation at random, without the chance of a dup, and without constructing an actual list of all 10! elements.
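The best I can do (a sketch of my own, not a library routine) is to shuffle repeatedly and remember what has already been yielded; memory then grows with the number of draws rather than with 10!:

import random

def random_perms(L):
    seen = set()
    while True:
        random.shuffle(L)
        t = tuple(L)
        if t not in seen:      # skip anything already yielded
            seen.add(t)
            yield t

g = random_perms(range(10))
rL = [g.next() for i in range(5000)]
assert len(set(rL)) == 5000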

The Birthday Problem, revisited

I ran into the Birthday Problem in another context, and didn't recognize it. It's pretty embarrassing. (I've even blogged about the issue!) Here's the question on Stack Overflow.

Problem description: take a list of 10 elements and run random.shuffle on them in Python. There are a total of 10! permutations. Shuffle 5000 times, recording a copy of the list each time.

>>> import random
>>> random.seed(137)
>>> L = range(10)
>>> N = 5000
>>> rL = list()
>>>
>>> for i in range(N):
...     random.shuffle(L)
...     rL.append(L[:])
...
>>>

Then to be safe, check for duplicates among the results. Use set to check for dups, but it only works if the items are immutable (so substitute tuples for lists). There are 3 duplicates!

>>> rL = [tuple(e) for e in rL]
>>> len(set(rL))
4997
>>> tL = list()
>>> for i,t in enumerate(rL):
...     if t in tL:
...         print rL.index(t), i, t
...     tL.append(t)
...
290 3354 (9, 3, 1, 0, 6, 2, 5, 4, 8, 7)
3486 3807 (6, 4, 1, 5, 2, 0, 8, 7, 9, 3)
435 4265 (1, 4, 7, 8, 6, 9, 5, 0, 3, 2)

The number of permutations of 10 elements is much larger than 5000.

>>> 10*9*8*7*6*5*4*3*2*1
3628800

Confirm that this many duplicates is expected by using another method in the random module to sample directly from a space of the same size:

>>> import random
>>> random.seed(137)
>>> L = list()
>>> N = 5000
>>> for i in range(N):
...     L.append(random.choice(xrange(3628800)))
...
>>> len(set(L))
4995

Get a rough estimate of how likely this is:

>>> import random
>>> random.seed(1357)
>>> def oneTrial():
...     L = list()
...     for i in range(5000):
...         L.append(random.choice(xrange(3628800)))
...     return len(set(L)) == len(L)
...
>>> N = 1000
>>> L = [oneTrial() for i in range(N)]
>>> p = 1.0 * sum(L) / N
>>> print round(p,2)
0.04

Analysis:

One approach is to say that the probability that any two permutations are not the same is:

p = 1.0*3628799/3628800

Very large, but definitely not certainty (p = 1). The number of combinations of 2 elements drawn from 5000 is also large.

5000! / (2! * 4998!) = 5000 * 4999 / 2
>>> C = 5000 * 4999 / 2
>>> C
12497500

The probability that at least one combination is the same (shares a birthday) is then, treating the pairs as independent:


>>> 1 - p**C
0.96806256495611798

The other way is to consider a group of two (lists of 10 elements). The probability that they are not the same is (again)

p = 1.0*3628799/3628800

Now, consider a third list. Only 3628798 of the 3628800 orders would avoid duplicating either of the first two, so the total probability that all three are distinct is the product

p = (1.0*3628799/3628800) * (1.0*3628798/3628800)

And the probability of at least one dup is 1 - p. We extend this for a total of N-2 steps:

>>> N = 5000
>>> num = 3628799
>>> den = 3628800
>>> p = 1.0*num/den
>>> for i in range(N-2):
...     num -= 1
...     p *= 1.0*num/den
...
>>> 1-p
0.96811305757470723

Close enough.
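
The agreement is no accident: 1 - p**C with p = 1 - 1/M is approximately 1 - e**(-C/M) for large M, and the stepwise product has nearly the same exponent. A quick check of the approximation:

>>> import math
>>> M = 3628800
>>> C = 5000 * 4999 / 2
>>> round(1 - math.exp(-1.0 * C / M), 5)
0.96806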

Saturday, January 23, 2010

Timeit from the command line in Python

Python has a module called timeit, whose function should be obvious. It can be used simply from the command line.

python -m timeit [-n N] [-r N] [-s S] [-t] [-c] [-h] [statement ...]

The -s flag marks setup code that is executed only once; multiple -s [setup code] arguments can be given. Then come one or more statements, which are executed many times and the timing reported. I was curious about the relative speed of two approaches to grouping list elements (post here). I put the code for the two functions into a module test.py (the listing is at the end of the post).

Here is what I got:

$ python -m timeit -s 'import test' 'list(test.grouper(2,"abcdef"))'
100000 loops, best of 3: 5.34 usec per loop
$ python -m timeit -s 'import test' 'test.grouper(2,"abcdef")'
100000 loops, best of 3: 2.19 usec per loop
$ python -m timeit -s 'import test' 'list(test.chunks(2,"abcdef"))'
100000 loops, best of 3: 2.85 usec per loop
$ python -m timeit -s 'import test' 'test.chunks(2,"abcdef")'
1000000 loops, best of 3: 0.685 usec per loop

The itertools solution takes 2-3 times as long, and getting the actual list rather than a generator or iterator has overhead as well.

from itertools import izip_longest

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def chunks(n, L):
    """Yield successive n-sized chunks from L."""
    for i in xrange(0, len(L), n):
        yield L[i:i+n]
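
The same measurements can be scripted with the timeit module directly, rather than from the command line (a sketch):

import timeit

t = timeit.Timer('list(test.grouper(2,"abcdef"))', 'import test')
print min(t.repeat(repeat=3, number=100000))   # best of 3, total seconds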

Chunks of sequence in Python

I achieved enlightenment in Python this morning. It's quite exciting :)

I asked a question on Stack Overflow about generating "chunks" of sequence, which is an everyday thing in bioinformatics: a DNA seq -> groups of 3 nt per codon. I do this in a simple way (as discussed here). And one of the answers to this question (look for "chunks") is the same as my method but employs a generator.

However, many (most?) people prefer this solution (from the recipes section of the itertools docs)


def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)


Where the docs for izip_longest (renamed zip_longest in Python 3) show this:


def izip_longest(*args, **kwds):
    # izip_longest('ABCD', 'xy', fillvalue='-') --> Ax By C- D-


I've been trying to understand how this works. On the first line of grouper we have:

args = [iter(iterable)] * n

>>> iter('abc')
<iterator object at 0x10048b890>


So we're just making an iterator out of the iterable thing (string, list, tuple, set ...), and putting n copies of the same object into a list. That's the key---multiple copies of the same object.

>>> for obj in [iter('abc')]*3:
...     print obj
...
<iterator object at 0x10048b890>
<iterator object at 0x10048b890>
<iterator object at 0x10048b890>


The *args part is just Python-speak for being able to pass a list of arguments (whose length isn't specified) into a function and then have Python "unpack" the arguments and use them. You can't look at what *args does directly, but only as part of a function's arguments:


>>> *[iter('abc')]*3
  File "<stdin>", line 1
    *[iter('abc')]*3
    ^
SyntaxError: invalid syntax

>>> def popper(*args):
...     for e in args:
...         print e.next()
...
>>> popper(*[iter('abc')]*3)
a
b
c


So... when we feed izip_longest multiple copies of the same iterator object in a list, and it tries to do its thing:

izip_longest(*args)

To make the first "group", it starts by grabbing the first element in our <iterator object at 0x10048b890>, giving 'a', plus the first element from <iterator object at 0x10048b890>, which now gives 'b', plus the first element from <iterator object at 0x10048b890>, which gives 'c'.

Popping the "first" element from the same iterator object a second time gives the second element, because the first is already gone.

Don't know how clear that is to you, but it makes sense to me now. The key is multiples of the same iterator object.
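
To bring it back to codons, the recipe collapses to a one-liner (a quick check):

>>> from itertools import izip_longest
>>> [''.join(t) for t in izip_longest(*[iter('atggctaaa')]*3)]
['atg', 'gct', 'aaa']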

Thursday, January 21, 2010

Binary multiplication in Python

I came across a question on Stack Overflow that references binary multiplication. This is a fun problem for a beginning Python coder.

Let's say we want to multiply 3 x 5 in binary:

    00101  ( 5)
  x 00011  ( 3)
  -------
    00101  ( 5)
  + 01010  (10)
  -------
    01111  (15)


Let's talk generally by substituting m x n (so here m = 3 and n = 5).

No matter what the value of m, we end up adding together a series of bit-shifted versions of n; here, 00101 and 01010.

5 = 00101
3 = 00011 = 2**0 + 2**1
3 x 5 = (5 << 0) + (5 << 1)


For any m, we require the set of exponents i, j, k.. such that:

2**i + 2**j + 2**k + ... = m

[This has to be done from the "top down." First find the largest power of 2 that is not greater than m, record that exponent, and take m = m % 2**e to continue... It's tricky, and that's why I took the easy way out, below.]

Now we sum:

    (n << i)
  + (n << j)
  + (n << k)
  + ...
  = m x n


It's very helpful that Python will do bit-shifting on ints:

>>> 5 << 1
10

I spent a while debugging code to generate the needed exponents, before I decided to cheat (a bit) by using binary representation:

>>> bin(3)
'0b11'

The first function below returns the exponents i,j,k...
The second does the bit-shifting. For efficiency, we do the factoring on the smaller of the two operands.

import random

def binary_exps(n):
    # exponents i for which bit i of n is set
    L = list(bin(n)[2:])
    L.reverse()
    L = [i for i,c in enumerate(L) if c == '1']
    return L

def multiply(x,y):
    # x*y as a sum of bit-shifted copies of the larger operand
    if y > x:  x,y = y,x
    result = 0
    L = binary_exps(y)
    for e in L:
        result += x << e
    return result, L

if __name__ == '__main__':
    for i in range(1, 10):
        n = 2**i - 1
        L = binary_exps(n)
        print str(n).rjust(4), L
        assert n == sum([2**e for e in L])
    print

    for i in range(20):
        x = random.choice(range(1000))
        y = random.choice(range(1000))
        prod, L = multiply(x,y)
        print str(x).rjust(4), str(y).rjust(4), L
        assert prod == x * y

$ python multiply_binary.py
   1 [0]
   3 [0, 1]
   7 [0, 1, 2]
  15 [0, 1, 2, 3]
  31 [0, 1, 2, 3, 4]
  63 [0, 1, 2, 3, 4, 5]
 127 [0, 1, 2, 3, 4, 5, 6]
 255 [0, 1, 2, 3, 4, 5, 6, 7]
 511 [0, 1, 2, 3, 4, 5, 6, 7, 8]

 406  819 [1, 2, 4, 7, 8]
 902  817 [0, 4, 5, 8, 9]
 755  115 [0, 1, 4, 5, 6]
 963  170 [1, 3, 5, 7]
 526  836 [1, 2, 3, 9]
 236  228 [2, 5, 6, 7]
 504  448 [6, 7, 8]
 438  840 [1, 2, 4, 5, 7, 8]
 225  791 [0, 5, 6, 7]
 373  512 [0, 2, 4, 5, 6, 8]
 307  230 [1, 2, 5, 6, 7]
 672  565 [0, 2, 4, 5, 9]
 389  137 [0, 3, 7]
 444  406 [1, 2, 4, 7, 8]
 587  707 [0, 1, 3, 6, 9]
 412  246 [1, 2, 4, 5, 6, 7]
 390  307 [0, 1, 4, 5, 8]
 269  368 [0, 2, 3, 8]
 146  553 [1, 4, 7]
  90  182 [1, 3, 4, 6]

Tuesday, January 19, 2010

Python command line arguments

It's become clear to me that I need to learn more about the Python standard library. Lots of great stuff has been added. I did dir(str) today and saw methods that I never knew about.

In that vein, here is a short demo of optparse (docs). It should give you a good feeling for what's available (of course, there is more):

from optparse import OptionParser
parser = OptionParser()
parser.add_option('-f', '--file', dest='fn',
                  help='write report to FILE', metavar='FILE')
parser.add_option('-q', '--quiet',
                  action='store_false', dest='v', default=True,
                  help="don't print status messages to stdout")
parser.add_option('-n',
                  dest='n', type='int',
                  help="need to input a number")

options, args = parser.parse_args()
print options,
if args:  print args,
print

We can exercise it like this:

$ python options.py -f xyz
{'n': None, 'fn': 'xyz', 'v': True}
$ python options.py -f xyz abc def
{'n': None, 'fn': 'xyz', 'v': True} ['abc', 'def']
$ python options.py -fxyz
{'n': None, 'fn': 'xyz', 'v': True}
$ python options.py --file=xyz
{'n': None, 'fn': 'xyz', 'v': True}
$ python options.py -q
{'n': None, 'fn': None, 'v': False}
$ python options.py -n
Usage: options.py [options]

options.py: error: -n option requires an argument
$ python options.py -n 7
{'n': 7, 'fn': None, 'v': True}
$ python options.py --help
Usage: options.py [options]

Options:
  -h, --help            show this help message and exit
  -f FILE, --file=FILE  write report to FILE
  -q, --quiet           don't print status messages to stdout
  -n N                  need to input a number

Wednesday, January 13, 2010

Matplotlib in OS X (5)


Still fooling around with matplotlib. This example is similar to the first one (here) but goes further, and computes the stuff we'd need for principal component analysis.

I'm still floundering a bit: I haven't found out how to plot open circles, which is pretty amazing, since that's the default in R. 'markerfacecolor' and 'mfc' as mentioned in the docs for pyplot don't work with ax.scatter. My first attempt to adjust the axes was silently ignored too; it turns out that assigning to plt.axes (as I did at first) just rebinds a name, while calling plt.axis actually works. And most important, there are surely PCA routines to do this directly.

In the first part we do something very similar to the first example, except that the matrix is constructed from a list of lists, properly, so that we don't need a call to reset the shape to a 2-row by x-col one. Also, in anticipation of part 2, I've changed the example so that the y-axis is the one with most of the variance. These points are plotted as salmon diamonds. Then, the first matrix of points is rotated by 45 degrees as before, and these points are plotted as cyan squares.

In the second part of the code, we use np.cov to compute the covariance of the rotated matrix, and then call np.linalg.eig on it, to compute the eigenvalues and eigenvectors. The eigenvalues are not guaranteed to come back in any sorted order (who knows why?). To prove that the eigenvectors are correct, we rotate the matrix again using them and plot those points as well. As expected, they are now spread along the x-axis.

The output (formatted by the function rs) is:


     35.27    3073.52
     36.36    3179.63       0.99


These are the variances of the x and y-points on line 1, followed by the two eigenvalues and the fraction of the total variance explained by the second (large) eigenvalue. As expected, this is about 99%.

import random, math, sys
import matplotlib.pyplot as plt
import numpy as np
random.seed(1357)

def rs(n,r=2): return str(round(n,r)).rjust(10)

r = range(-10,10)
R = range(-100,100)
xL1 = [random.choice(r) for i in range(30)]
yL1 = [random.choice(R) for i in range(30)]
print rs(np.var(xL1)), rs(np.var(yL1))
m1 = np.array([xL1,yL1])

z = 1.0/math.sqrt(2)
t = np.array([[z,-z],[z,z]])
xL2,yL2 = np.dot(t,m1)
m2 = np.array([xL2,yL2])
#---------------------------------------------
cm = np.cov(m2)
evals,evecs = np.linalg.eig(cm)
# The eigenvalues are not necessarily ordered
evals.sort()
print rs(evals[0]), rs(evals[1]),
print rs(evals[-1]/sum(evals))
xL3,yL3 = np.dot(evecs,m2)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xL1,yL1,s=150,color='salmon',marker='d')
ax.scatter(xL2,yL2,s=150,color='cyan',marker='s')
ax.scatter(xL3,yL3,s=150,color='b',marker='o')
plt.axis((-150,150,-150,150))   # assigning to plt.axes, as I first did, is silently ignored
plt.savefig('example.png')
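
For the record, here is a sketch (my addition, not used in the plot above) of getting the same decomposition from np.linalg.svd on the centered data; the singular values come back sorted in decreasing order, so there is no eigenvalue-ordering worry:

import numpy as np

def pca_svd(m):
    # m has one row per variable, one column per observation
    centered = m - m.mean(axis=1).reshape(-1,1)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    n = m.shape[1]
    evals = s**2 / (n-1)            # variances along the principal axes
    scores = np.dot(U.T, centered)  # data in principal-axis coordinates
    return evals, U, scores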

Sunday, January 10, 2010

Z-scores

As in several recent posts, I've been exploring PyCogent (docs). In fact, I hope to contribute to the cookbook documentation---"PyCogent for dummies," in a sense. I'll post about it here if and when my stuff is accepted.

PyCogent uses Numpy extensively, and I'm not really familiar with that either, so I've been fiddling with it today. A goal for the future would be to explore how one would use Python (and Numpy and Scipy) for processing microarray data. I know that Bioconductor has become standard---in fact I gave a talk about it. But R is ... ugh.

So, the first thing you want to do with the data from a microarray experiment is to put it into an array :), and then, likely, normalize the data. And that leads to the question of how to deal with missing values---nan (not a number). The problem is that if you run numpy.mean or numpy.std on a 1D array containing an nan value, you get back nan.

I couldn't find a function to do this, so I decided to roll my own for fun. It doesn't have to be super efficient because you only do this once or twice, and Python is already plenty fast enough. And please, if anybody knows how to do this better, let me know (numpy's masked arrays, sketched at the bottom of the post, may be one answer).

At the bottom of the post is the code. This is the output:

[  2.29908899  -0.63725912  23.02275554  -5.0692318 ]
mean 4.9038 stdev 10.7848
[-0.24152091 -0.51378873 1.6800454 -0.92473577]
[-0.24152091 -0.51378873 1.6800454 -0.92473577]

[-13.41031373 -0.40077508 11.72402522 -3.76176703]
mean -1.4622 stdev 8.9868
[-1.32952082 0.11811049 1.46729289 -0.25588257]
[-1.32952082 0.11811049 1.46729289 -0.25588257]

In the two blocks above, we print two randomly chosen rows, with the mean and stdev. Then we manually calculate the z-score, and compare it to the output from the code. In the second part of the output, we find the first two rows that contain NaN's and check to see that they are handled correctly.

[ -9.29748571  10.472215            NaN  -4.96373691]
[-0.94695253 1.38312492 NaN -0.4361724 ]
[-12.34288901 NaN 11.73414048 -1.49869538]
[-1.18230539 NaN 1.26317611 -0.08087072]


from numpy import nan, isnan
from numpy import array, mean, std, random

def z_score(row):
    L = [n for n in row if not isnan(n)]
    m = mean(L)
    s = std(L)
    zL = [1.0 * (n - m) / s for n in L]
    if len(L) == len(row):  return zL
    # deal with nan
    retL = list()
    for n in row:
        if isnan(n):
            retL.append(nan)
        else:
            retL.append(zL.pop(0))
    assert len(zL) == 0
    return retL

def z_scores_by_row(A):
    retL = [z_score(r) for r in A]
    Z = array(retL)
    Z.shape = A.shape
    return Z

# utilities

def show(row):
    m = mean(row)
    s = std(row)
    print row
    print 'mean', round(m,4), 'stdev', round(s,4)
    zL = (row - m) / s
    print zL

def has_nan(row):
    for n in row:
        if isnan(n):  return True
    return False

#=======================================

if __name__ == '__main__':
    random.seed(137)
    N = 10000
    A = random.randn(N)
    A *= 10
    for i in range(20):
        A[random.randint(N)] = nan
    A.shape = (2500, 4)

    Z = z_scores_by_row(A)

    for i in range(2):
        j = random.randint(1000)
        #print j,
        show(A[j])
        print Z[j]
        print

    L = [(i,r) for i,r in enumerate(A) if has_nan(r)]
    for t in L[:2]:
        i = t[0]
        print A[i]
        print Z[i]
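
The masked-array idea mentioned at the top (a sketch; np.ma.masked_invalid masks the NaN positions, and masked statistics simply skip them):

import numpy as np

row = np.array([1.0, 2.0, np.nan, 4.0])
m = np.ma.masked_invalid(row)      # NaNs become masked entries
z = (m - m.mean()) / m.std()       # mean and std ignore masked entries
print z.filled(np.nan)             # back to a plain array, NaNs restored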

Thursday, January 7, 2010

Duly quoted

Some nice quotes on ideas
(from the numpy handbook; download, and one from Stroustrup)
Don’t worry about people stealing your ideas. If your ideas are any good, you’ll have to ram them down people’s throats.
—Howard Aiken, IBM engineer
No idea is so antiquated that it was not once modern; no idea is so modern that it will not someday be antiquated.
—Ellen Glasgow
Language shapes the way we think, and determines what we can think about.
—B. L. Whorf