[Biojava-l] FASTA header, loc attribute question

Richard Holland holland at eaglegenomics.com
Mon May 11 10:27:23 UTC 2009


Short answer - no, not directly.

Longer answer - if you write some code to snip the loc string out of
the FASTA description line, there is existing code which can convert
the snipped string into a RichLocation, which you can then apply to the
parsed FASTA sequence in order to extract the required region. The loc
string parser is GenbankLocationParser, part of the biojavax packages.
This assumes that the loc string conforms to the GenBank location
format.
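
As a rough illustration of the snipping step, here is a minimal Java
sketch using only java.util.regex. The class LocSnipper and its methods
are hypothetical names, not BioJava API. It assumes the whole
description sits on one line in the real file (it is only wrapped in
this mail) and that loc has the form seqid:location. Because the loc
coordinates are on the 2L chromosome arm while the gene record in the
DNA FASTA file starts at 378116, the sketch also shifts the ranges so
they are relative to the gene record. That shift assumes a
forward-strand gene whose loc is a plain start..end; a complement(...)
gene would need the coordinates flipped as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of BioJava.
class LocSnipper {
    // Matches "loc=<seqid>:<location>;" in a FlyBase-style description line.
    private static final Pattern LOC =
        Pattern.compile("\\bloc=([^:;\\s]+):([^;]+);");

    // Matches a single "start..end" range inside a location string.
    private static final Pattern RANGE =
        Pattern.compile("(\\d+)\\.\\.(\\d+)");

    /** Returns {seqId, locationString}, or null if there is no loc attribute. */
    static String[] snipLoc(String header) {
        Matcher m = LOC.matcher(header);
        if (!m.find()) {
            return null;
        }
        // Drop any whitespace left over from line wrapping.
        return new String[] { m.group(1), m.group(2).replaceAll("\\s+", "") };
    }

    /**
     * Rewrites every start..end range so that position geneStart becomes 1.
     * Assumption: the gene record is on the forward strand. Single-position
     * elements (no "..") are left untouched by this sketch.
     */
    static String shift(String loc, int geneStart) {
        Matcher m = RANGE.matcher(loc);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int start = Integer.parseInt(m.group(1)) - geneStart + 1;
            int end = Integer.parseInt(m.group(2)) - geneStart + 1;
            m.appendReplacement(sb, start + ".." + end);
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Headers abbreviated from the example records in this thread.
        String protein = ">FBpp0077713 type=protein; "
            + "loc=2L:join(384551..384894,385701..385746,"
            + "386308..386576,386703..387270); ID=FBpp0077713;";
        String gene = ">FBgn0000061 type=gene; loc=2L:378116..387439; "
            + "ID=FBgn0000061;";

        String[] p = snipLoc(protein);
        String[] g = snipLoc(gene);
        int geneStart = Integer.parseInt(g[1].substring(0, g[1].indexOf("..")));

        System.out.println(shift(p[1], geneStart));
        // prints join(6436..6779,7586..7631,8193..8461,8588..9155)
    }
}
```

The shifted string (or the original one, if your sequence is the whole
chromosome arm) is what you would then hand to GenbankLocationParser. I
believe the entry point is a static parseLocation method, but check the
biojavax Javadoc for the exact arguments.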

cheers,
Richard

On Mon, 2009-05-11 at 11:05 +0100, JP wrote:
> Hi there at Biojava,
> 
> I have two FASTA files - one containing amino acid sequences and the other
> containing dna sequences.
> 
> In the AA FASTA file I have something like :
> 
> >FBpp0077713 type=protein;
> loc=2L:join(384551..384894,385701..385746,386308..386576,386703..387270);
> ID=FBpp0077713; name=al-PA; parent=FBgn0000061,FBtr0078053;
> dbxref=FlyBase:FBpp0077713,GB_protein:AAF51505.1,GB_protein:AAF51505,FlyBase_Annotation_IDs:CG3935-PA,REFSEQ:NP_722629;
> MD5=64a866db3e2913b97a2158c2de9d02f6; length=408; release=r5.9;
> species=Dmel;
> MGISEEIKLEELPQEAKLAHPDAVVLVDRAPGSSAASAGAALTVSMSVSG
> GAPSGASGASGGTNSPVSDGNSDCEADEYAPKRKQRRYRTTFTSFQLEEL...
> etc etc etc
> 
> I would like to parse this header line in particular the loc attribute and
> extract it from the entry in the DNA FASTA file (so I get the genomic data
> for the protein)
> 
> >FBgn0000061 type=gene; loc=2L:378116..387439; ID=FBgn0000061; name=al;
> dbxref=FlyBase:FBgn0000061,FlyBase:FBan0003935,FlyBase_Annotation_IDs:CG3935,GB:AE003589,GB_protein:AAF51505,GB:AY121696,GB_protein:AAM52023,GB:BI485174,GB:CZ486795,GB:L08401,GB_protein:AAA28840,UniProt/Swiss-Prot:Q06453,INTERPRO:IPR000047,INTERPRO:IPR001356,INTERPRO:IPR003654,INTERPRO:IPR009057,INTERPRO:IPR012287,bdgpinsituexpr:al,dedb:5830,drsc:FBgn0000061,flight:FBgn0000061,flyatlas:FBgn0000061,flyexpress:FBgn0000061,flygrid:59464,flymine:FBgn0000061,geo:FBgn0000061,hdri:FBgn0000061,if:/gene/aristal.htm,orthologs:ensANOGA:ENSANGP00000011877,orthologs:ensBOSTA:ENSBTAP00000015907,orthologs:ensCANFA:ENSCAFP00000009888,orthologs:ensGALGA:ENSGALP00000005255,orthologs:ensHOMSA:ENSP00000298420,orthologs:ensMACMU:ENSMMUP00000007349,orthologs:ensMONDO:ENSMODP00000008388,orthologs:ensPANTR:ENSPTRP00000004281,orthologs:ensRATNO:ENSRNOP00000027186,orthologs:ensTETNI:GSTENP00015517001,orthologs:graORYSA:Q6YYB8,orthologs:graORYSA:Q8W0T5,orthologs:modCAEEL:WBGene00044330,orthologs:modDANRE:ZDB-GENE-990415-15,orthologs:modMUSMU:MGI:1097716,panther:FBgn0000061;
> cyto_range=21C1-21C1; gbunit=AE014134; MD5=0f5568cf13aeb2c7076f11b1ce3d6b2f;
> length=9324; release=r5.9; species=Dmel;
> GTAGTTTGCTGCCGGCTCTGGAACAGCCCGGTCATCTCGTCGCGTTCGGT
> TCCGATTCCGATTCGAATAGTCGAGCTGGGGATACATTGTTGTTTCCGGG
> etc etc etc
> 
> I understand this is not exactly conventional, but does biojava support the
> parsing of the loc attribute ? (join, complement etc.)
> 
> Many Thanks
> JP
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/




