[Bioperl-l] Experiences from a newbie

Fri Dec 12 07:55:24 EST 2003

Andreas,

I found your comments quite reasonable. Bioperl has a lot to offer, but the
documentation is not exactly coherent. Personally I think that a good HOWTO
goes far to solve this problem as most people probably start to use Bioperl
because they have a specific task in mind.

I've started writing a HOWTO on Features and Annotations, I would appreciate
your taking a look at it and telling me what you think, here's the URL:

http://bioperl.org/HOWTOs/html/Feature-Annotation.html

Ignore the blank sections, please! Any comments would be appreciated, since
you're precisely the sort of user I'm trying to write for, someone with
substantial experience in one of those fields that comprises bioinformatics,
but not all.

Brian O.

-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of
andreas.bernauer at gmx.de
Sent: Thursday, December 11, 2003 6:55 PM
To: bioperl-l at bioperl.org
Subject: [Bioperl-l] Experiences from a newbie

Hi,

I was not going to write an email, but I've read the Bioperl overview
and it encouraged me to do so.  The reason why I was not going to
write an email is that I don't know where to start with the
difficulties I had.  I don't want to blame anybody and usually I had
the feeling I didn't get some things to work because I am missing
something very little.

OK, here we go.  My background: lots of programming experience, but
not in perl, first time user of Bioperl.  Running on Linux.

The installation was easy as far as I remember.  Some things to add
like with almost every Linux program installation, but nothing
exciting as far as I remember.

The first point I got overwhelmed was in the tutorial: It was very
huge and covered a lot of stuff I didn't want to do, so I tried to
read only the parts I was interested in.

I read the section about the different objects BioPerl is using.  I
found this section rather confusing, as I don't understand, when to
use which object and why I had to care about them.  I just wanted to
read in a file and get some information out of it.

My task was to extract the translation information out of a GenBank
file that I got from NCBI, that describes a whole genome of a
bacterium.  The file has the nucleotide information and in the
FEATURES section all predicted proteins.  I wanted those proteins to
be written into a FASTA file.

So I used the following code:

use Bio::SeqIO;

$in  = Bio::SeqIO->new(-file => shift @ARGV,
                       -format => 'GenBank');
$out = Bio::SeqIO->new(-file => ">outputfilename",
                       -format => 'FASTA');
while ( my $seqobj = $in->next_seq() )
  {$out->write_seq($seqobj); }

which I got from the tutorial (III.2).

I run this script and got a FASTA file with the DNA.  Hm, well, that's
what's in the GenBank file and that's what I got, but not what I
wanted.  OK, my fault, let's get the FEATURES.

So I looked into the tutorial and went to section III.7.1 Representing
sequence annotations and found this line:

@allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features

adding this to my script yielded an error:

Can't locate object method "all_SeqFeatures" via package
"Bio::SeqIO::genbank" (perhaps you forgot to load
"Bio::SeqIO::genbank"?) at ./test.pl line 10.

Hm, the section meant probably something else with $seqobj.  Reviewing
the title, I saw, it should be a RichSeq object.  OK, how can I
translate my SeqIO object into a RichSeq object?  (And by the way, why
do I have to do it?  I mean, I've already read the file, why can't I
access the data?  This is where I think I am missing part of the
BioPerl philosophy or assuming too much.)

So I looked through the SeqIO manpage, but I didn't find anything that
told me, how I could get a RichSeq out of a SeqIO.  I looked through
the RichSeq manpage, but it can only create new RichSeqs, not reading
some out of SeqIO.  I went to the SeqIO HOWTO afterwards
(http://bioperl.org/HOWTOs/html/SeqIO.html).  (By the way, in chapter
4 there is a small bug in the very first working example.  It uses
only a single colon instead of double colons, like Bio:SeqIO instead
of Bio::SeqIO.)

At the description of the splitgb.pl script it tells me, that it uses
the species attribute.  Looks good, as I probably want the FEATURES
attribute.  OK, so I looked at the Seq manpage.  There I found a way
to get an AnnotationCollectionI, from which I can get an AnnotationI.
So I changed the body of the while loop above to the following, using
the code from the AnnotationCollectionI manual:

while ( my $seq = $in->next_seq() ) {
  my $ann = $seq->annotation();

  foreach my $key ( $ann->get_all_annotation_keys() ) {
    my @values = $ann->get_Annotations($key);
    foreach my $value ( @values ) {
      # value is an Bio::AnnotationI, and defines a "as_text" method
      print "Annotation ",$key," stringified value ",$value->as_text,"\n";
      # you can also use a generic hash_tree method for getting
      # stuff out say into XML format
      #$hash_tree = $value->hash_tree();
    }
  }
}

The output is the following:

[andreas at hgt tmp]$ ./test.pl ~/research/genomes/Actinobacteria/NC_000962.gbk
Annotation origin stringified value Value:
Annotation comment stringified value Comment: COMPLETENESS: full length.
Annotation reference stringified value Reference: Deciphering the biology of
Mycobacterium tuberculosis from the complete genome sequence
Annotation reference stringified value Reference: Direct Submission

Which looks a little bit like garbage to me.  There is no text
"Value:" in the whole file.  The ordering is surprisingly reversed.
The References only prints the title of the reference.  And the worst
thing is, that it omits the FEATURES section that I am actually
looking for, not to mention that it obviously misses all the other
annotations like LOCUS, DEFINITION, ACCESSION, etc.  But they might be
stored in some other object I am not aware of.

I gave up at this point.  I spend almost an hour for a task that
should be easy to accomplish.  In fact, in half the time probably, I
wrote a Scheme script that extracted me the information, as the
GenBank file is pretty structured.

But I did want to give BioPerl another chance.  Some time later, I had
run a blast on some databases and wanted to extract the results.  I
used something like:

  my $report = new Bio::SearchIO (-format => 'blast',
                                  -file   => $file);

  while (my $result = $report->next_result ) {
    print "Next ($blastId) result (" . $result->num_hits . " hits)\n";
    my $hitNo = 0;
    while (my $hit = $result->next_hit) {
      $hitNo++;
      # do something with hit.
    }
  }

I found the SearchIO tutorial which was pretty cool, as it was the
first page that really gave me an overview of all the functions I can
call and what they will give me as a result (not only what object, but
what part of the BLAST result file) opposed to the manpages that were
not so informative and clear for me.

As I din't want to look at the actual sequences and only needed the
hits themselves, I told blast to omit the detailed listing of the
alignments.  This turned out to be a mistake, as for some reason that
I can never think of, I can only get HITS in $result, when I include
the alignments (which are accessed via HSP which I never use).  Again
something that I cannot access with my logic.

The bottom line for me is: BioPerl is valuable for me in so far that I
don't have to write a parser for BLAST output by myself.  But with all
its different kinds of objects, interfaces and collections I am
totally lost and either I spend an hour figuring out, how to do simple
tasks or I just write a small script that does it for me, as most
files are descent structured.

Although I am not completely satisfied with BioPerl I like the idea
itself and the ability to program over biological data.  I would
appreciate if somebody could point me to what I am missing or where I
have wrong assumptions.  Maybe a webpage that gives me the big picture
and that I've missed.  I always had the feeling that I am just missing
a little bit to understand why this has to work this (for me
complicated) way and not the other.

Thank you for your attention.

Best regards,

Andreas Bernauer.

PS:  While I reviewed my email, I realized how to get the FEATURES I
wanted: by using $seqobj->get_SeqFeatures() and
$feat->get_tag_values('translation').  So, finally, it worked.  I am
still wondering why it was so difficult for me to get to this.  The
task doesn't look so special to me.  Maybe I don't understand, what
BioPerl means by Annotation, Feature, Sequence, etc.  Is there a page
that gives me an overview of this?  I couldn't find one but maybe I
haven't tried hard enough.