[Bioperl-l] Experiences from a newbie

Thu Dec 11 18:55:26 EST 2003

Hi,

I was not going to write an email, but I've read the Bioperl overview
and it encouraged me to do so.  The reason why I was not going to
write an email is that I don't know where to start with the
difficulties I had.  I don't want to blame anybody and usually I had
the feeling I didn't get some things to work because I am missing
something very little.

OK, here we go.  My background: lots of programming experience, but
not in perl, first time user of Bioperl.  Running on Linux.

The installation was easy as far as I remember.  Some things to add
like with almost every Linux program installation, but nothing
exciting as far as I remember.

The first point I got overwhelmed was in the tutorial: It was very
huge and covered a lot of stuff I didn't want to do, so I tried to
read only the parts I was interested in.

I read the section about the different objects BioPerl is using.  I
found this section rather confusing, as I don't understand, when to
use which object and why I had to care about them.  I just wanted to
read in a file and get some information out of it.

My task was to extract the translation information out of a GenBank
file that I got from NCBI, that describes a whole genome of a
bacterium.  The file has the nucleotide information and in the
FEATURES section all predicted proteins.  I wanted those proteins to
be written into a FASTA file.

So I used the following code:

use Bio::SeqIO;

$in  = Bio::SeqIO->new(-file => shift @ARGV,
		       -format => 'GenBank');
$out = Bio::SeqIO->new(-file => ">outputfilename",
		       -format => 'FASTA');
while ( my $seqobj = $in->next_seq() ) 
  {$out->write_seq($seqobj); }

which I got from the tutorial (III.2).

I run this script and got a FASTA file with the DNA.  Hm, well, that's
what's in the GenBank file and that's what I got, but not what I
wanted.  OK, my fault, let's get the FEATURES.

So I looked into the tutorial and went to section III.7.1 Representing
sequence annotations and found this line:

@allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features

adding this to my script yielded an error:

Can't locate object method "all_SeqFeatures" via package
"Bio::SeqIO::genbank" (perhaps you forgot to load
"Bio::SeqIO::genbank"?) at ./test.pl line 10.

Hm, the section meant probably something else with $seqobj.  Reviewing
the title, I saw, it should be a RichSeq object.  OK, how can I
translate my SeqIO object into a RichSeq object?  (And by the way, why
do I have to do it?  I mean, I've already read the file, why can't I
access the data?  This is where I think I am missing part of the
BioPerl philosophy or assuming too much.)

So I looked through the SeqIO manpage, but I didn't find anything that
told me, how I could get a RichSeq out of a SeqIO.  I looked through
the RichSeq manpage, but it can only create new RichSeqs, not reading
some out of SeqIO.  I went to the SeqIO HOWTO afterwards
(http://bioperl.org/HOWTOs/html/SeqIO.html).  (By the way, in chapter
4 there is a small bug in the very first working example.  It uses
only a single colon instead of double colons, like Bio:SeqIO instead
of Bio::SeqIO.)

At the description of the splitgb.pl script it tells me, that it uses
the species attribute.  Looks good, as I probably want the FEATURES
attribute.  OK, so I looked at the Seq manpage.  There I found a way
to get an AnnotationCollectionI, from which I can get an AnnotationI.
So I changed the body of the while loop above to the following, using
the code from the AnnotationCollectionI manual:

while ( my $seq = $in->next_seq() ) {
  my $ann = $seq->annotation();

  foreach my $key ( $ann->get_all_annotation_keys() ) {
    my @values = $ann->get_Annotations($key);
    foreach my $value ( @values ) {
      # value is an Bio::AnnotationI, and defines a "as_text" method
      print "Annotation ",$key," stringified value ",$value->as_text,"\n";
      # you can also use a generic hash_tree method for getting 
      # stuff out say into XML format
      #$hash_tree = $value->hash_tree();
    }
  }
}

The output is the following:

[andreas at hgt tmp]$ ./test.pl ~/research/genomes/Actinobacteria/NC_000962.gbk
Annotation origin stringified value Value:       
Annotation comment stringified value Comment: COMPLETENESS: full length.
Annotation reference stringified value Reference: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence
Annotation reference stringified value Reference: Direct Submission

Which looks a little bit like garbage to me.  There is no text
"Value:" in the whole file.  The ordering is surprisingly reversed.
The References only prints the title of the reference.  And the worst
thing is, that it omits the FEATURES section that I am actually
looking for, not to mention that it obviously misses all the other
annotations like LOCUS, DEFINITION, ACCESSION, etc.  But they might be
stored in some other object I am not aware of.

I gave up at this point.  I spend almost an hour for a task that
should be easy to accomplish.  In fact, in half the time probably, I
wrote a Scheme script that extracted me the information, as the
GenBank file is pretty structured.

But I did want to give BioPerl another chance.  Some time later, I had
run a blast on some databases and wanted to extract the results.  I
used something like:

  my $report = new Bio::SearchIO (-format => 'blast',
				  -file   => $file);

  while (my $result = $report->next_result ) {
    print "Next ($blastId) result (" . $result->num_hits . " hits)\n";
    my $hitNo = 0;
    while (my $hit = $result->next_hit) {
      $hitNo++;
      # do something with hit.
    }
  }

I found the SearchIO tutorial which was pretty cool, as it was the
first page that really gave me an overview of all the functions I can
call and what they will give me as a result (not only what object, but
what part of the BLAST result file) opposed to the manpages that were
not so informative and clear for me.

As I din't want to look at the actual sequences and only needed the
hits themselves, I told blast to omit the detailed listing of the
alignments.  This turned out to be a mistake, as for some reason that
I can never think of, I can only get HITS in $result, when I include
the alignments (which are accessed via HSP which I never use).  Again
something that I cannot access with my logic.

The bottom line for me is: BioPerl is valuable for me in so far that I
don't have to write a parser for BLAST output by myself.  But with all
its different kinds of objects, interfaces and collections I am
totally lost and either I spend an hour figuring out, how to do simple
tasks or I just write a small script that does it for me, as most
files are descent structured.

Although I am not completely satisfied with BioPerl I like the idea
itself and the ability to program over biological data.  I would
appreciate if somebody could point me to what I am missing or where I
have wrong assumptions.  Maybe a webpage that gives me the big picture
and that I've missed.  I always had the feeling that I am just missing
a little bit to understand why this has to work this (for me
complicated) way and not the other.

Thank you for your attention.

Best regards,

Andreas Bernauer.

PS:  While I reviewed my email, I realized how to get the FEATURES I
wanted: by using $seqobj->get_SeqFeatures() and
$feat->get_tag_values('translation').  So, finally, it worked.  I am
still wondering why it was so difficult for me to get to this.  The
task doesn't look so special to me.  Maybe I don't understand, what
BioPerl means by Annotation, Feature, Sequence, etc.  Is there a page
that gives me an overview of this?  I couldn't find one but maybe I
haven't tried hard enough.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 232 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20031211/2f2f85ef/attachment.bin