[Bioperl-l] Parsing contig information from HTGs
simon andrews (BI)
simon.andrews@bbsrc.ac.uk
Tue, 11 Dec 2001 12:22:11 -0000
Dear All,
I'm trying to parse out contig information from EMBL HTG flatfiles using
BioPerl. I can read the file OK with Bio::SeqIO and get myself a Bio::Seq
object. The problem I'm having is getting at the contig info.
I don't think I can use the usual feature methods because, for some bizarre
reason, the contigs are often only identified in the comments section of the
entries, e.g.:
CC * 1 8591: contig of 8591 bp in length
CC * gap of unknown length
CC * 8592 28835: contig of 20244 bp in length
CC * gap of unknown length
CC * 28836 40356: contig of 11521 bp in length
CC * gap of unknown length
CC * 40357 58902: contig of 18546 bp in length
CC * gap of unknown length
CC * 58903 61812: contig of 2910 bp in length
CC * gap of unknown length
CC * 61813 71640: contig of 9828 bp in length
CC * gap of unknown length
CC * 71641 75199: contig of 3559 bp in length
CC * gap of unknown length
CC * 75200 91638: contig of 16439 bp in length.
...and this information *doesn't* appear in the feature table!!
Trying to parse this, I've found I can get the comments section from the
Bio::Seq object using:
    my $annot = $seq->annotation();
    foreach my $comment ($annot->each_Comment) {
        print $comment->text . "\n";
    }
...but the each_Comment iterator only returns one comment per database entry,
and this is a concatenation of all of the comment lines from the original
entry. Removing the line breaks makes the resulting string a lot harder to
process.
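
As a stopgap, the contig ranges can probably still be pulled out of the
concatenated string with a global regex match. The following is only a minimal
sketch: parse_contigs is a made-up helper name, and the sample string is my
guess at what the joined comment text looks like (the exact separators between
the original lines are an assumption). It relies only on the
"start end: contig of N bp in length" pattern shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): extract contig ranges from
# a comment string, whether or not the original line breaks survived.
# Returns a list of hash refs with start, end and length keys.
sub parse_contigs {
    my ($text) = @_;
    my @contigs;
    # Match every "START END: contig of N bp" occurrence in the string.
    while ($text =~ /\b(\d+)\s+(\d+):\s*contig of\s+(\d+)\s+bp/g) {
        push @contigs, { start => $1, end => $2, length => $3 };
    }
    return @contigs;
}

# Assumed shape of the concatenated comment text (separators are a guess).
my $text = '* 1 8591: contig of 8591 bp in length * gap of unknown length '
         . '* 8592 28835: contig of 20244 bp in length';

for my $c (parse_contigs($text)) {
    print "$c->{start}..$c->{end} ($c->{length} bp)\n";
}
```

In the loop from my snippet above, $comment->text would be passed in place of
the sample string. The gaps are not captured here; they can be inferred as
falling between consecutive contigs.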
So my questions are:
1) Is there a better way to get at the contig information through the
existing objects (wishful thinking?)?
2) Am I retrieving the comments the right way? If so, is there
a reason why the newlines are stripped upon processing? My assumption was
that the each_Comment iterator would give me back the original comments one
line at a time, which I could then process to extract the contig info.
This is all using BioPerl 0.7.0 (I think).
Any help is much appreciated.
Simon.
----
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute
simon.andrews@bbsrc.ac.uk
+44 (0)1223 496463