[Bioperl-l] Experiences from a newbie
andreas.bernauer at gmx.de
andreas.bernauer at gmx.de
Thu Dec 11 18:55:26 EST 2003
Hi,
I was not going to write an email, but I've read the Bioperl overview
and it encouraged me to do so. The reason why I was not going to
write an email is that I don't know where to start with the
difficulties I had. I don't want to blame anybody and usually I had
the feeling I didn't get some things to work because I am missing
something very little.
OK, here we go. My background: lots of programming experience, but
not in perl, first time user of Bioperl. Running on Linux.
The installation was easy as far as I remember. Some things to add
like with almost every Linux program installation, but nothing
exciting as far as I remember.
The first point I got overwhelmed was in the tutorial: It was very
huge and covered a lot of stuff I didn't want to do, so I tried to
read only the parts I was interested in.
I read the section about the different objects BioPerl is using. I
found this section rather confusing, as I don't understand, when to
use which object and why I had to care about them. I just wanted to
read in a file and get some information out of it.
My task was to extract the translation information out of a GenBank
file that I got from NCBI, that describes a whole genome of a
bacterium. The file has the nucleotide information and in the
FEATURES section all predicted proteins. I wanted those proteins to
be written into a FASTA file.
So I used the following code:
use Bio::SeqIO;
$in = Bio::SeqIO->new(-file => shift @ARGV,
-format => 'GenBank');
$out = Bio::SeqIO->new(-file => ">outputfilename",
-format => 'FASTA');
while ( my $seqobj = $in->next_seq() )
{$out->write_seq($seqobj); }
which I got from the tutorial (III.2).
I run this script and got a FASTA file with the DNA. Hm, well, that's
what's in the GenBank file and that's what I got, but not what I
wanted. OK, my fault, let's get the FEATURES.
So I looked into the tutorial and went to section III.7.1 Representing
sequence annotations and found this line:
@allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
adding this to my script yielded an error:
Can't locate object method "all_SeqFeatures" via package
"Bio::SeqIO::genbank" (perhaps you forgot to load
"Bio::SeqIO::genbank"?) at ./test.pl line 10.
Hm, the section meant probably something else with $seqobj. Reviewing
the title, I saw, it should be a RichSeq object. OK, how can I
translate my SeqIO object into a RichSeq object? (And by the way, why
do I have to do it? I mean, I've already read the file, why can't I
access the data? This is where I think I am missing part of the
BioPerl philosophy or assuming too much.)
So I looked through the SeqIO manpage, but I didn't find anything that
told me, how I could get a RichSeq out of a SeqIO. I looked through
the RichSeq manpage, but it can only create new RichSeqs, not reading
some out of SeqIO. I went to the SeqIO HOWTO afterwards
(http://bioperl.org/HOWTOs/html/SeqIO.html). (By the way, in chapter
4 there is a small bug in the very first working example. It uses
only a single colon instead of double colons, like Bio:SeqIO instead
of Bio::SeqIO.)
At the description of the splitgb.pl script it tells me, that it uses
the species attribute. Looks good, as I probably want the FEATURES
attribute. OK, so I looked at the Seq manpage. There I found a way
to get an AnnotationCollectionI, from which I can get an AnnotationI.
So I changed the body of the while loop above to the following, using
the code from the AnnotationCollectionI manual:
while ( my $seq = $in->next_seq() ) {
my $ann = $seq->annotation();
foreach my $key ( $ann->get_all_annotation_keys() ) {
my @values = $ann->get_Annotations($key);
foreach my $value ( @values ) {
# value is an Bio::AnnotationI, and defines a "as_text" method
print "Annotation ",$key," stringified value ",$value->as_text,"\n";
# you can also use a generic hash_tree method for getting
# stuff out say into XML format
#$hash_tree = $value->hash_tree();
}
}
}
The output is the following:
[andreas at hgt tmp]$ ./test.pl ~/research/genomes/Actinobacteria/NC_000962.gbk
Annotation origin stringified value Value:
Annotation comment stringified value Comment: COMPLETENESS: full length.
Annotation reference stringified value Reference: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence
Annotation reference stringified value Reference: Direct Submission
Which looks a little bit like garbage to me. There is no text
"Value:" in the whole file. The ordering is surprisingly reversed.
The References only prints the title of the reference. And the worst
thing is, that it omits the FEATURES section that I am actually
looking for, not to mention that it obviously misses all the other
annotations like LOCUS, DEFINITION, ACCESSION, etc. But they might be
stored in some other object I am not aware of.
I gave up at this point. I spend almost an hour for a task that
should be easy to accomplish. In fact, in half the time probably, I
wrote a Scheme script that extracted me the information, as the
GenBank file is pretty structured.
But I did want to give BioPerl another chance. Some time later, I had
run a blast on some databases and wanted to extract the results. I
used something like:
my $report = new Bio::SearchIO (-format => 'blast',
-file => $file);
while (my $result = $report->next_result ) {
print "Next ($blastId) result (" . $result->num_hits . " hits)\n";
my $hitNo = 0;
while (my $hit = $result->next_hit) {
$hitNo++;
# do something with hit.
}
}
I found the SearchIO tutorial which was pretty cool, as it was the
first page that really gave me an overview of all the functions I can
call and what they will give me as a result (not only what object, but
what part of the BLAST result file) opposed to the manpages that were
not so informative and clear for me.
As I din't want to look at the actual sequences and only needed the
hits themselves, I told blast to omit the detailed listing of the
alignments. This turned out to be a mistake, as for some reason that
I can never think of, I can only get HITS in $result, when I include
the alignments (which are accessed via HSP which I never use). Again
something that I cannot access with my logic.
The bottom line for me is: BioPerl is valuable for me in so far that I
don't have to write a parser for BLAST output by myself. But with all
its different kinds of objects, interfaces and collections I am
totally lost and either I spend an hour figuring out, how to do simple
tasks or I just write a small script that does it for me, as most
files are descent structured.
Although I am not completely satisfied with BioPerl I like the idea
itself and the ability to program over biological data. I would
appreciate if somebody could point me to what I am missing or where I
have wrong assumptions. Maybe a webpage that gives me the big picture
and that I've missed. I always had the feeling that I am just missing
a little bit to understand why this has to work this (for me
complicated) way and not the other.
Thank you for your attention.
Best regards,
Andreas Bernauer.
PS: While I reviewed my email, I realized how to get the FEATURES I
wanted: by using $seqobj->get_SeqFeatures() and
$feat->get_tag_values('translation'). So, finally, it worked. I am
still wondering why it was so difficult for me to get to this. The
task doesn't look so special to me. Maybe I don't understand, what
BioPerl means by Annotation, Feature, Sequence, etc. Is there a page
that gives me an overview of this? I couldn't find one but maybe I
haven't tried hard enough.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 232 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20031211/2f2f85ef/attachment.bin
More information about the Bioperl-l
mailing list