[Biojava-l] Parsing a BLAST file

Keith James kdj@sanger.ac.uk
05 Nov 2001 10:16:44 +0000


>>>>> "David" == David Waring <dwaring@u.washington.edu> writes:

[...]

    David> This parses the blast file and builds
    David> SequenceDBSearchResults into a list.  It is a little bit
    David> more complicated than that really. But this complication
    David> gives very great functionality. The SearchResultBuilder
    David> must have two things that you might not expect, a
    David> SequenceDB with all the query sequences that blast was
    David> called with, and a SequenceDBInstallation which contains a
    David> SequenceDB with the same name as that found in the blast
    David> output file, in the demo this is 'genome'. With these
    David> things in place you can get both the subject, and query
    David> sequences of any hit from the SequenceDBSearchResult. I
    David> included a little sample of how to do this below since it
    David> is not in the demo.

    David> But, you say, this is a blast against some foreign
    David> database, how can I have a SequenceDB with all this
    David> data? The truth is you do not really need it. You just need
    David> an empty SequenceDB with the correct name inside your
    David> SequenceDBInstallation. But then of course you cannot get
    David> the subject sequences from the search result.

Yeah, the added complexity issue has been bugging me since the
bootcamp. I'm just finishing off a dotplot-style viewer for pairwise
comparisons which has to read Blast/Fasta/whatever. As an end-user
application it's got to cope with this robustly (e.g. where the
sequence name of the query/subject or database may not match up with
the search output).

As you say, there are tricks to get round the problem. The tests don't
contain a copy of EMBL (!), but use a dummy SequenceDB in the way you
describe. In cases where a user has said "this was my query, no matter
what your code thinks" I use a SingleSequenceDB (which contains one
sequence and no ID list, so you always get back that sequence when
you request one). It's also possible to compact things like the
SequenceDBInstallation to anonymous inner classes which behave exactly
as you want (such as making assumptions about the identity of
sequences/databases which you wouldn't normally allow).
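
A minimal sketch of those tricks in plain Java, for illustration only.
Every type here (SimpleSeqDB, SimpleInstallation, EmptyNamedDB,
SingleSeqDB) is a hypothetical stand-in and not BioJava's actual
SequenceDB or SequenceDBInstallation interface. Sequences are plain
strings, and the single-sequence database is assumed to ignore whatever
ID it is asked for.

```java
import java.util.Collections;
import java.util.Set;

// All types below are hypothetical stand-ins, NOT BioJava's interfaces.
class DummyDBSketch {

    /** Minimal named sequence database; sequences are plain strings here. */
    interface SimpleSeqDB {
        String getName();
        Set<String> ids();
        String getSequence(String id);
    }

    /** Maps a database name, as found in search output, to a database. */
    interface SimpleInstallation {
        SimpleSeqDB getDB(String name);
    }

    /** Dummy database: carries the right name but holds no sequences. */
    static final class EmptyNamedDB implements SimpleSeqDB {
        private final String name;

        EmptyNamedDB(String name) {
            this.name = name;
        }

        public String getName() { return name; }

        public Set<String> ids() { return Collections.emptySet(); }

        public String getSequence(String id) {
            // Nothing is stored, so subject sequences cannot be retrieved.
            throw new IllegalArgumentException(
                "no sequence '" + id + "' in " + name);
        }
    }

    /** Holds one sequence, exposes no ID list, and returns it for any ID. */
    static final class SingleSeqDB implements SimpleSeqDB {
        private final String name;
        private final String sequence;

        SingleSeqDB(String name, String sequence) {
            this.name = name;
            this.sequence = sequence;
        }

        public String getName() { return name; }

        public Set<String> ids() { return Collections.emptySet(); }

        public String getSequence(String id) {
            // The requested ID is deliberately ignored (an assumption).
            return sequence;
        }
    }

    public static void main(String[] args) {
        // "This was my query, no matter what your code thinks."
        SimpleSeqDB query = new SingleSeqDB("query", "ACGTACGT");

        // Anonymous inner class that answers any database name with an
        // empty database of that name.
        SimpleInstallation install = new SimpleInstallation() {
            @Override
            public SimpleSeqDB getDB(String name) {
                return new EmptyNamedDB(name);
            }
        };

        System.out.println(query.getSequence("any-id-at-all"));
        System.out.println(install.getDB("genome").getName());
    }
}
```

The anonymous installation accepts any database name, which is exactly
the sort of assumption you wouldn't normally allow. It keeps parsing
going when the names in the search output don't match anything you hold
locally, at the cost of not being able to fetch subject sequences.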

Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Cambridge, UK