[Biojava-l] parsing EMBL files

Lorna Morris lmorris at ebi.ac.uk
Mon Oct 25 06:14:37 EDT 2004


Hi Mark

Thanks for your answer. I can see the logic, although using an overlap 
in a join statement is an odd way to represent a frameshift. It seems 
easier for me not to call mergeLocations in the first place than to 
recover the sub-locations from the MergeLocation instance.
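
To illustrate the difference between merging and not merging, here is a 
minimal sketch in plain Java. It is not Biojava code: Range, 
unionMerging and unionKeeping are hypothetical names made up for this 
example, and strand (complement) information is ignored. The 
coordinates are the two overlapping blocks from the AE001273 join 
quoted below, listed in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinMergeSketch {
    /** Hypothetical stand-in for a simple location: inclusive min..max. */
    static final class Range {
        final int min;
        final int max;
        Range(int min, int max) { this.min = min; this.max = max; }
        @Override public String toString() { return min + ".." + max; }
    }

    /** Union that merges overlapping neighbours; input is assumed sorted by min. */
    static List<Range> unionMerging(List<Range> sorted) {
        List<Range> out = new ArrayList<>();
        for (Range cur : sorted) {
            if (!out.isEmpty() && out.get(out.size() - 1).max >= cur.min) {
                // Overlap: replace the last block with one spanning both.
                Range last = out.remove(out.size() - 1);
                out.add(new Range(last.min, Math.max(last.max, cur.max)));
            } else {
                out.add(cur);
            }
        }
        return out;
    }

    /** Union that keeps every block exactly as given, overlapping or not. */
    static List<Range> unionKeeping(List<Range> blocks) {
        return new ArrayList<>(blocks);
    }

    public static void main(String[] args) {
        List<Range> blocks = Arrays.asList(
            new Range(535378, 536418), new Range(536417, 536485));
        System.out.println(unionMerging(blocks)); // one block, the overlap is lost
        System.out.println(unionKeeping(blocks)); // two blocks, the overlap survives
    }
}
```

With merging, the two blocks collapse into a single span, which is the 
change seen in the written-out file; without it, both blocks are kept.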

I spotted something else whilst parsing EMBL files. Sorting 
sub-locations by their natural order (min, max values) doesn't always 
work if the CompoundLocation passes through the origin.

For example the following join statement occurs in the EMBL file 
(AE001273.embl):

join(1041920..1042519,1..1176)

From this Biojava creates the following CompoundLocation:

1176, 1041920,{([1,1176]), [1041920,1042519])}

A call to Collections.sort(locations, Location.naturalOrder) in the 
constructor of CompoundLocation flips the two sub-locations, so the one 
with the lower coordinates appears first and the order given in the 
join statement is lost.
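
The effect can be reproduced without Biojava. The sketch below is only 
an illustration: Range and BY_MIN_THEN_MAX are hypothetical stand-ins, 
and the comparator is assumed to order by min and then max, as the 
natural order is described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OriginSortSketch {
    /** Hypothetical stand-in for a simple location: inclusive min..max. */
    static final class Range {
        final int min;
        final int max;
        Range(int min, int max) { this.min = min; this.max = max; }
        @Override public String toString() { return min + ".." + max; }
    }

    /** Assumed ordering: by min, then by max. */
    static final Comparator<Range> BY_MIN_THEN_MAX = (a, b) ->
        a.min != b.min ? Integer.compare(a.min, b.min) : Integer.compare(a.max, b.max);

    /** Returns a sorted copy, leaving the parsed list untouched. */
    static List<Range> sortedCopy(List<Range> parsed) {
        List<Range> copy = new ArrayList<>(parsed);
        copy.sort(BY_MIN_THEN_MAX);
        return copy;
    }

    public static void main(String[] args) {
        // join(1041920..1042519,1..1176): the feature runs through the origin.
        List<Range> parsed = Arrays.asList(
            new Range(1041920, 1042519), new Range(1, 1176));
        System.out.println("parsed: " + parsed);
        System.out.println("sorted: " + sortedCopy(parsed));
    }
}
```

After sorting, the block 1..1176 comes first, so writing the location 
back out would no longer give the join in its original order.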

The quickest solution for me is to retain the original order of the 
Location objects as they are parsed, and not do any sorting. Perhaps 
there is a better solution: allow CompoundLocation objects to be 
CircularLocations and, if they pass through the origin, don't sort 
them according to the naturalOrder. Do you think this solution would be 
preferable, or do you think it is sufficient to retain the original 
order of sub-locations in all CompoundLocations? I will stick with the 
quick fix for the time being, but may work on a better one in the 
future if that is more appropriate.
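
As a rough sketch of the second idea (sort only when the join does not 
pass through the origin), something along these lines might work. This 
is plain Java, not Biojava code; Range, wrapsOrigin and order are 
hypothetical names. The wrap test is a heuristic that assumes all 
blocks are on the forward strand: a join written in descending order 
on the complement strand would need separate handling, and without the 
sequence length a wrap cannot be told apart from a deliberately 
reordered join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OriginAwareOrderSketch {
    /** Hypothetical stand-in for a simple location: inclusive min..max. */
    static final class Range {
        final int min;
        final int max;
        Range(int min, int max) { this.min = min; this.max = max; }
        @Override public String toString() { return min + ".." + max; }
    }

    /** Assumed ordering: by min, then by max. */
    static final Comparator<Range> BY_MIN_THEN_MAX = (a, b) ->
        a.min != b.min ? Integer.compare(a.min, b.min) : Integer.compare(a.max, b.max);

    /**
     * Heuristic: the parsed order drops to a lower min exactly once, and
     * the last block ends before the first one starts.
     */
    static boolean wrapsOrigin(List<Range> parsed) {
        if (parsed.size() < 2) {
            return false;
        }
        int descents = 0;
        for (int i = 1; i < parsed.size(); i++) {
            if (parsed.get(i).min < parsed.get(i - 1).min) {
                descents++;
            }
        }
        Range first = parsed.get(0);
        Range last = parsed.get(parsed.size() - 1);
        return descents == 1 && last.max < first.min;
    }

    /** Keeps the parsed order for origin-spanning joins, sorts otherwise. */
    static List<Range> order(List<Range> parsed) {
        List<Range> copy = new ArrayList<>(parsed);
        if (!wrapsOrigin(parsed)) {
            copy.sort(BY_MIN_THEN_MAX);
        }
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(order(Arrays.asList(
            new Range(1041920, 1042519), new Range(1, 1176))));
        System.out.println(order(Arrays.asList(
            new Range(30, 40), new Range(10, 20), new Range(1, 5))));
    }
}
```

The quick fix of never sorting is simpler; the sketch only shows where 
an origin check could sit if sorting is kept for ordinary joins.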

Thanks for your help,

Lorna


mark.schreiber at group.novartis.com wrote:

>I guess the logic was, if two sublocations overlap then they should be 
>merged. Apparently that is not always the case : (
>
>I think that the Location that is formed will be an instance of a 
>MergeLocation, in which case it should be possible to recover the 
>sub-locations.
>
>
>
>
>
>Lorna Morris <lmorris at ebi.ac.uk>
>Sent by: biojava-l-bounces at portal.open-bio.org
>10/21/2004 06:35 PM
>
> 
>        To:     biojava-l at biojava.org
>        cc:     (bcc: Mark Schreiber/GP/Novartis)
>        Subject:        [Biojava-l] parsing EMBL files
>
>
>Hi
>
>I'm using Biojava (SeqIOTools.readEmbl()) to parse EMBL flat files. 
>However, I've noticed that after reading EMBL flat files and writing 
>them out again using Biojava, the location data of features can change.
>
>e.g:
>
>In the original flat file (accession number: AE001273) this location 
>data is present:
>
>FT   CDS             join(complement(536417..536485),complement(535378..536418))
>
>This gets changed by Biojava to:
>
>FT   CDS             complement(535378..536485)
>
>
>I looked at the Biojava source code and found out that this occurs 
>because Biojava merges overlapping sub-locations in join statements.
>
>If I comment out this code in LocationTools._union():
>
>if (canMerge(last, cur)) {
>    try {
>        last = MergeLocation.mergeLocations(last, cur);
>    } catch (BioException ex) {
>        throw new BioError("Cannot make MergeLocation", ex);
>    }
>}
>
>
>Then the location appears as it does in the original EMBL file, with 
>the join describing the overlap. Overlaps between sub-locations in 
>joins are allowed in EMBL format, but are very rare. They may be used 
>to describe frameshifts occurring through sequence errors.
>
>Is there another reason, for including this code which perhaps I've 
>missed? I've commented it out and diffed the files produced, and there 
>aren't any other differences with the EMBL file (AE001273.embl) at least.
>
>I just wanted to check whether removing this code might have any 
>other side effects.
>
>Many thanks,
>
>Lorna Morris
>
>



