Monday, May 12, 2008

Using formatdb

There are several reasons to run BLAST locally rather than over the internet. One is speed. However, the size of the non-redundant (nr) database is enormous (zipped files total ≈10 GB). For 1E3 or even 1E4 runs a day, it makes a lot of sense to let them maintain the database and do BLAST over the web. Particularly when one can bundle queries (I don't know the limits for this, but 6 rRNA sequences works fine). It says somewhere in the docs that you can submit a new job once you receive the 'RID.' Hey, just run the job overnight.

A more common use is to search a custom-made database. A database is simply a text file of FASTA formatted sequences. In order to use one, it must first be processed by the formatdb program provided with the BLAST programs. After formatdb runs, it creates three new files: *.nhr, *.nin and *.nsq. The original database file is no longer needed afterward.

The documentation that comes with formatdb looks forbidding. The program is sophisticated, but its use doesn't have to be complicated. I do not use a configuration (.formatdbrc) file. The following program employs a hard-coded path to the binary. One subtle point about file paths: the shell understands '~' to mean the user's home directory, but formatdb does not. You have to use the full path. Here I obtain that using a Foundation function called NSHomeDirectory(). If the program does not run for this reason, the logfile contains no hint about why. It's interesting that the blast programs can handle '~' correctly.

Minimal flags passed to formatdb include:

-i  path to database file
-p  protein? F
-o  parse SeqId and do fancy stuff? F
-l  path to write logfile

Here is a very nice tutorial that covers the whole process of installing BLAST locally.

import os, sys, subprocess
from Foundation import NSHomeDirectory

def run():
    prog = '~/bin/blast'
    prog += '/programs/blast-2.2.18/bin/formatdb'

    home = NSHomeDirectory()

    directory = home + '/Desktop/analysis.seq/db.all/'
    cmd = prog + ' -i' + directory + 'db.txt'
    cmd += ' -p F -o F -l '
    cmd += directory + 'logfile.txt'

    p = subprocess.Popen(cmd, shell=True)
    sts = os.waitpid(p.pid, 0)

if __name__ == "__main__":
run()