[Biojava-l] parsing EMBL files

mark.schreiber at group.novartis.com
Thu Oct 21 21:02:59 EDT 2004


I guess the logic was that if two sub-locations overlap then they should be 
merged. Apparently that is not always the case :(

I think that the Location that is formed will be an instance of 
MergeLocation, in which case it should be possible to recover the 
sub-locations.
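
To illustrate the idea only: the sketch below is plain Java with made-up 
Range and Merged classes (they are assumptions for this example, not the 
BioJava Location or MergeLocation API). It shows how a merged location can 
report the overall span while still keeping the pieces it was built from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in types for illustration; these are NOT BioJava classes.
final class MergedRangeSketch {

    /** A simple inclusive range such as 535378..536418. */
    static final class Range {
        final int min;
        final int max;

        Range(int min, int max) {
            if (min > max) {
                throw new IllegalArgumentException("min > max");
            }
            this.min = min;
            this.max = max;
        }

        boolean overlaps(Range other) {
            return min <= other.max && other.min <= max;
        }

        @Override
        public String toString() {
            return min + ".." + max;
        }
    }

    /** A merged range that remembers the two ranges it was built from. */
    static final class Merged {
        final int min;
        final int max;
        private final List<Range> components;

        Merged(Range a, Range b) {
            if (!a.overlaps(b)) {
                throw new IllegalArgumentException("ranges do not overlap");
            }
            this.min = Math.min(a.min, b.min);
            this.max = Math.max(a.max, b.max);
            List<Range> parts = new ArrayList<>();
            parts.add(a);
            parts.add(b);
            this.components = Collections.unmodifiableList(parts);
        }

        /** The original ranges, in the order they were supplied. */
        List<Range> components() {
            return components;
        }

        @Override
        public String toString() {
            return min + ".." + max;
        }
    }

    public static void main(String[] args) {
        Merged merged = new Merged(new Range(536417, 536485),
                                   new Range(535378, 536418));
        System.out.println("merged span: " + merged);              // 535378..536485
        System.out.println("components:  " + merged.components()); // [536417..536485, 535378..536418]
    }
}
```

If the real merged Location exposes its components in a similar way, a 
writer could emit the original join from them instead of the merged span.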





Lorna Morris <lmorris at ebi.ac.uk>
Sent by: biojava-l-bounces at portal.open-bio.org
10/21/2004 06:35 PM

 
        To:     biojava-l at biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] parsing EMBL files


Hi

I'm using BioJava (SeqIOTools.readEmbl()) to parse EMBL flat files. 
However, I've noticed that after reading EMBL flat files and writing them 
out again using BioJava, the location data of features can change.

e.g.:

In the original flat file (accession number: AE001273) this location 
data is present:

FT   CDS             join(complement(536417..536485),complement(535378..536418))

This gets changed by Biojava to:

FT   CDS             complement(535378..536485)


I looked at the BioJava source code and found that this occurs because
BioJava merges overlapping sub-locations in join statements.

If I comment out this code in LocationTools._union():

if (canMerge(last, cur)) {
    try {
        last = MergeLocation.mergeLocations(last, cur);
    } catch (BioException ex) {
        throw new BioError("Cannot make MergeLocation", ex);
    }
}


Then the location appears as it does in the original EMBL file, with 
the join describing the overlap. Overlaps between sub-locations in 
joins are allowed in EMBL format, but are very rare. They may be used to 
describe frameshifts occurring through sequence errors.
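
To make the difference concrete, here is a small standalone sketch (plain 
Java; the class JoinUnionSketch and its methods are made up for this 
example and do not reproduce LocationTools). It assumes every block is on 
the complement strand and only compares each block with the one before it, 
which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch; it does not use or reproduce BioJava code.
final class JoinUnionSketch {

    /** Collects blocks ({min, max}, inclusive), optionally merging overlapping neighbours. */
    static List<int[]> union(List<int[]> input, boolean mergeOverlaps) {
        List<int[]> out = new ArrayList<>();
        int[] last = null;
        for (int[] cur : input) {
            if (last == null) {
                last = cur.clone();
            } else if (mergeOverlaps && last[0] <= cur[1] && cur[0] <= last[1]) {
                // Overlap: collapse both blocks into one covering span.
                last = new int[] { Math.min(last[0], cur[0]), Math.max(last[1], cur[1]) };
            } else {
                out.add(last);
                last = cur.clone();
            }
        }
        if (last != null) {
            out.add(last);
        }
        return out;
    }

    /** Formats complement-strand blocks as an EMBL-style location string. */
    static String format(List<int[]> blocks) {
        if (blocks.isEmpty()) {
            throw new IllegalArgumentException("no blocks");
        }
        if (blocks.size() == 1) {
            return "complement(" + blocks.get(0)[0] + ".." + blocks.get(0)[1] + ")";
        }
        StringBuilder sb = new StringBuilder("join(");
        for (int i = 0; i < blocks.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("complement(").append(blocks.get(i)[0])
              .append("..").append(blocks.get(i)[1]).append(')');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        List<int[]> cds = Arrays.asList(new int[] { 536417, 536485 },
                                        new int[] { 535378, 536418 });
        // With merging: complement(535378..536485)
        System.out.println(format(union(cds, true)));
        // Without merging: join(complement(536417..536485),complement(535378..536418))
        System.out.println(format(union(cds, false)));
    }
}
```

The two blocks share positions 536417..536418, so the merging pass treats 
them as one span and the join is lost on output.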

Is there another reason for including this code which perhaps I've 
missed? I've commented it out and diffed the files produced, and there 
aren't any other differences, for the EMBL file (AE001273.embl) at least.

I just wanted to check whether removing this code might have 
other side effects.

Many thanks,

Lorna Morris


_______________________________________________
Biojava-l mailing list  -  Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l




