Thursday, July 23, 2009

Phylogenetic Trees: parsing

Continuing toward the goal of being able to manipulate (and print) phylogenetic trees, I wrote a simple parser for the Newick format.

(Note: This has not been tested very much, although it does work on several examples that I tried. It is just a toy program for use in exploring what trees are about).

The parser works in three phases:

First, we process each character of the input and recognize the five special characters '():,;', generating a list of tokens. Disregarding newlines and other characters that are not valid, we accumulate standard characters in a buffer which is flushed when a special character is encountered.

Here is the first 7 lines of the output for the tree from last time:

(
(
Stenotrophomonas_maltophilia
:
0.07574
,
Kingella_oralis


In phase two, we make note of names and distances from each external node to its parental (internal) node. The external node information in the tree is replaced by a short name. Here is the output of the second stage:

external nodes:
E1 Stenotrophomonas_maltophilia 0.07574
E2 Kingella_oralis 0.08026
E3 Pseudomonas_aeruginosa 0.05950
E4 Salmonella_typhi 0.01297
E5 Escherichia_coli 0.01491
E6 Haemophilus_parainfluenzae 0.06113


The tree after this step:

((E1,E2):0.00827,E3,((E4,E5):0.03356,E6):0.03863);


(There is a trick in the second step. Because we are modifying the list of tree elements within this function, we process the external nodes in reverse order. This ensures that all the other indexes remain valid).

In the third and final phase, we break down the structure of the tree into its internal nodes. Each internal node gets a name, a list of its sub-nodes, and the distance to its parental node. The output is shown below. The information for the external and internal nodes will be written as text to disk and then processed further.

internal nodes:
I1 E1 E2 0.00827
I2 E4 E5 0.03356
I3 I2 E6 0.03863
I4 I1 E3 I3 0


After processing the first two internal nodes, the tree looks like this:

(I1,E3,(I2,E6):0.03863);


The full listing:

UC = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LC = UC.lower()
digits = '0123456789'
punc = '._- '
chars = UC + LC + digits + punc
special = '():,;'

def loadData(fn):
FH = open(fn)
data = FH.read().strip()
FH.close()
return data

def parseData(s):
rL = list()
temp = list()
for c in s:
if c in chars:
temp.append(c)
elif c in special:
if temp:
rL.append(''.join(temp))
temp = list()
rL.append(c)
return rL
#-------------------------------------------
def doExternalNodes(L):
iL = list()
eNodeL = list()
# get pos for ':' for external nodes
for i,e in enumerate(L):
if e == ':':
# internal nodes have '):'
if not L[i-1] == ')':
iL.append(i)
eNodeCount = len(iL)
# reverse to keep indexes valid below
for i in reversed(iL):
name = L[i-1]
dist = L[i+1]
n = 'E' + str(eNodeCount)
eNodeCount -= 1
eNodeL.append((n,name,dist))
# replace the node with n
L = L[:i-1] + [n] + L[i+2:]
return L, reversed(eNodeL)
#-------------------------------------------
def findInternalNode(L):
# every ')' comes at the end of an iNode
try: j = L.index(')')
except ValueError: return None
i = j
while L[i] != '(': i -= 1
return i,j

def doInternalNodes(L):
iNodeCount = 0
iNodeL = list()
t = findInternalNode(L)
while t:
i,j = t
j += 2 # move out to dist
iNodeCount += 1
n = 'I' + str(iNodeCount)
# root may not have a dist
try:
dist = L[j]
except IndexError:
dist = 0
assert L[j-1] == ';'
# may have more than two sub-nodes:
nodes = list()
for k in range(i+1,j-2,2):
#print k, L[k]
nodes.append(L[k])
iNodeL.append((n,nodes,dist))
L = L[:i] + [n] + L[j+1:]
t = findInternalNode(L)
return L, iNodeL
#-------------------------------------------
data = loadData(fn = 'seq.phy.txt')
L = parseData(data)
#for e in L: print e
#print
#-------------------------------------------
L,eNodeL = doExternalNodes(L)
print 'external nodes:'
for e in eNodeL:
print e[0],e[1],e[2]
print
print ''.join(L) + '\n'
#-------------------------------------------
L,iNodeL = doInternalNodes(L)
print 'internal nodes:'
for e in iNodeL:
print e[0],
for n in e[1]: print n,
print e[2]