[Bioperl-l] CONTIG dealing

Thu Oct 19 02:46:14 UTC 2006

On Oct 18, 2006, at 4:58 PM, Nikki Appleby wrote:

>
> I have just entered the wonderful new world of BioPerl, so the  
> answer to my
> question may be obvious to any of the gurus reading this.
>
> I need to collect sequence features and ontology annotations. Here  
> goes.
>
> I am retrieving sequences from SwissProt via Bio::DB::SwissProt and
> get_Seq_by_id, for this example Q8RZV7. Once I have parsed it into  
> an RDBMS
> format that I am happy with I can get at the xref ids. In this  
> case, they
> are
>
> AP003451; BAB86144.1; -; Genomic_DNA.
> AP008207; BAF07116.1; -; Genomic_DNA.
> AB103395; BAC81207.1; -; mRNA.
>
> I can happily go off and fetch those from Bio::DB::GenBank (first  
> column),
> and Bio::DB::GenPept (second). All good, except...
>
> AP008207 is a contig. I don't want to get all of the features for  
> the entire
> thing, just the single contig that actually matches the original  
> sequence.
> It takes a couple of hours to get at it and then it gives me way  
> too much.
>
> I will come across this problem with other sequences. How do I (a)  
> find out
> if it is a contig without downloading it in it's entirety and (b)  
> extract
> the list of sequences that are about to be contigged together.
>
> I have searched the web for answers, including this list, but see  
> nothing.
> Help!
>
> Nikki Appleby.

The default setting for the retrieval format for GenBank is  
'gbwithparts' (which gets the full sequence at all times).  You can  
set this to 'gb' using request_format() to retrieve the sequence file  
with the contig information instead of the sequence, if it contains  
such (otherwise it just retrieves the sequence anyway).

However, I have noticed this particular file does not represent a  
true contig record but is the entire chromosome sequence.  The contig  
information is in the comments section, probably b/c the record is  
converted over.  You could just download the sequence record and run  
regexp to grab the comments section, then parse out the contigs (a  
pain) if you really want that.  Or you could try to find the  
equivalent GenBank record, such as the ones derived from the WGS  
records.

I did notice the list of dbxrefs in your swissprot record indicate  
three EMBL sequences.  If the order is consistent for the SwissProt  
entries you want, they probably represent:

The contig (what you want): AP003451; BAB86144.1; -; Genomic_DNA.
The supercontig (chromosome) : AP008207; BAF07116.1; -; Genomic_DNA.
The cDNA : AB103395; BAC81207.1; -; mRNA.

I checked the first one (AP003451), which seems to confirm this.

Since the chromosome supercontig is built from the smaller sequence  
contigs you could just grab the first EMBL dbxref instead of all of  
them.  It parses much faster than the chromosome file.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign