Tuesday, March 9, 2010

Running Phylip as a process

This post is in the category of a "note to myself." I need to work harder on phylogenetics, and in particular, on maximum likelihood methods. I went to Joe Felsenstein's site to get Phylip.

The current version is 3.69. I downloaded the one compiled for OS X. It contains a directory full of executables with names like dnaml.app. If you double-click one of those, it will bring up a Terminal window and you can run the program interactively, specifying the necessary infile path and desired settings, etc. There is also a symbolic link to the actual executable which is inside the bundle.

I wanted more flexibility than this, for example, to run a program from any directory and have the output written there. As a test, I chose the dnadist program, which calculates distances for an alignment under different models. I put a link into the directory ~/bin:

ln -s ~/Software/phylip-3.69/exe/dnadist.app/Contents/MacOS/dnadist ~/bin/dnadist

Since I have ~/bin on my $PATH, if I'm on the Desktop I can just do:
$ dnadist

But we can do more than this. I would like to run the program as a separate process from Python. In order to do that I need a way to anticipate and respond to the interactive prompts. Phylip provides access, though it is a little clunky. We need to issue a command to the shell something like:

dnadist < responses > screenout &


We invoke the program with our input placed in a file. It doesn't seem to take an infile name on the command line in this mode. So we put that path (and it has to be a full path) on the first line of the file responses. The sequence file format must be .phy format. Luckily clustal will write this.

We need to anticipate that if the file 'output' exists, we'll be asked whether to remove it ('R'). And the various parameters get toggled in a funny way:

Settings for this run:
D Distance (F84, Kimura, Jukes-Cantor, LogDet)? F84
G Gamma distributed rates across sites? No
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
W Use weights for sites? No
F Use empirical base frequencies? Yes
L Form of distance matrix? Square
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes

Y to accept these or type the letter for one to change


Interactively, we would input 'D' twice in a row to progress to the third option. Finally, we need to do 'Y' to accept all the options. Here is a Python script to do all. Invoke it with 'python script.py' as usual.

import os,subprocess

fn = 'capno1.phy'
prog = 'dnadist'

responses = '.temp.txt'
FH = open(responses,'w')
current_dir = os.getcwd()
FH.write (current_dir + '/' + fn + '\n')

outfile = os.path.exists(current_dir + '/outfile')
if outfile: FH.write('R\n')

FH.write('D\n')
FH.write('D\n')
FH.write('Y\n')
FH.close()

cmd = prog
cmd += ' < ' + responses
cmd += ' > screenout &'

p = subprocess.Popen(cmd, shell=True)
pid,ecode = os.waitpid(p.pid, 0)
print ecode