[Bioperl-l] Re: Bio::SeqIO

francis@cmmt.ubc.ca francis@cmmt.ubc.ca
Tue, 17 Apr 2001 16:26:48 -0700 (PDT)


> On Tue, 17 Apr 2001, Qi, Yanping wrote:
> 
> > Hi;
> > 
> > I am using the Bio::SeqIO module to convert an EMBL file to Genbank format.
> > I have found the DR (database crossreference) section did not appear in the
> > Genbank file.  Does the DR section get converted?  
> 
eb> I don't think GenBank do DR lines (or equivalent) in the main part 
eb> of the files, they just do dbref's in the features. 

correct

eb> Does anyone have any insight here on the list? I am not a genbank 
eb> format expert.

<_former_ genbank flat file expert hat on>

The only things that the databases (here I speak of DDBJ, EMBL and
GenBank) agree to exchange are all of the features (the FT lines), the
sequence (of course), and some of the lines in the header: ID, AC, SV,
DT, DE, KW, OS, OC (although there GenBank puts its own lineage, as far
as I recall), RN, RP, RA and RT. The respective databases are free to
organize the info in the header (which is what I call the part above
the FT section) with much more liberty than one would expect.

Obviously I'm in the camp that thinks that none of these lines are
useful, properly tagged, or meant to be parsed ... But that seems to be
what people want to do, as opposed to extracting the information from a
format that is already tagged properly (like ASN.1 :)  I know it's
pretty lonely in this camp ... So I will let you bioperlers try and
parse data elements that the international collaboration cannot really
exchange.

There is actually a reason for this. It has to do with the data source
(which database the record came from) and the timing of that DR line.
Note that it's a part of the record where one database would be writing
on another database's record. A pretty strict 'ownership' is enforced
(and needs to be enforced) so that GenBank doesn't write over records
generated by EMBL or DDBJ. But the DR line (written by EMBL people) is
done on all records, and is not something that is reparsed and redone
by the others.

If you like the DR line, it's a good reason for using EMBL instead of
GenBank flat files.  On the other hand, how useful is it (you should
ask) to know that a P57170 is encoded somewhere in 350 kb of bacterial
DNA? Are you not going to need a unique identifier (with a version) and
the coordinates on that DNA string anyways?
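
For anyone who wants the DR cross-references anyway, here is a minimal
sketch (plain Python, standard library only, not Bio::SeqIO) of pulling
them straight out of the EMBL flat file before it is converted. The
function name extract_dr_lines and the sample record are hypothetical,
and the sketch assumes DR lines of the form "DR   database; primary id;
secondary id." as EMBL writes them.

```python
# Hypothetical helper, not part of any library: collect the DR
# (database cross-reference) lines from EMBL flat-file text.
def extract_dr_lines(embl_text):
    refs = []
    for line in embl_text.splitlines():
        # Assumption: DR lines begin with the two-letter code "DR".
        if not line.startswith("DR "):
            continue
        # Drop the line code, surrounding blanks and the final period.
        body = line[2:].strip().rstrip(".")
        fields = [f.strip() for f in body.split(";")]
        # First field is the database name, the rest are identifiers.
        refs.append((fields[0], fields[1:]))
    return refs


# Made-up record fragment for illustration only; the entry name is invented.
sample = """ID   EXAMPLE01  standard; DNA; PRO; 350 BP.
AC   X00001;
DR   SWISS-PROT; P57170; EXAMPLE_PROT.
FT   source          1..350
//"""

print(extract_dr_lines(sample))
```

This only tells you which protein accessions are linked to the record.
It says nothing about where on the DNA they are encoded, which is the
point made above.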

<_former_ genbank flat file expert hat off>

'nuf said ...

cheers,

f.


--
| B.F. Francis Ouellette                      Tel: (604) 875-3815 | 
| Director, Bioinformatics Core Facility      Fax: (425) 740-6978 | 
| CMMT, UBC, Canada                        http://www.cmmt.ubc.ca | 
| francis@cmmt.ubc.ca                http://www.bioinformatics.ca |