[Biopython] parsing a fasta with multiple entries

Mon Apr 26 15:59:02 UTC 2010

On Mon, Apr 26, 2010 at 4:36 PM, Nick Leake <nick_leake77 at hotmail.com> wrote:
>
> Hello,
>
> I'm having trouble parsing a fasta file with multiple sequences - it is a fasta
> that has most of the transposable elements in fruit flies found at
> http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down.

Hi Nick,

You mean this file?
http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.fasta

> I want to be able to access the DNA sequences for manipulation and later
> removal from a chromosomal region.  I originally thought that I could follow
> the same fasta format example shown in the biopython tutorial.  However,
> that failed to work.  I think it might be because there are multiple entries.

The Bio.SeqIO.read() function is for when there is a single record. The
Bio.SeqIO.parse() function is for when you have multiple records. Could
you clarify which bit of the tutorial was confusing? We'd like to make it
better.
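
To make the difference concrete, here is a minimal sketch. The two
records are made up for illustration and are read from an in-memory
string rather than your file; it assumes a reasonably recent Biopython
on Python 3.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up records standing in for a multi-entry FASTA file.
fasta_text = (
    ">elem1 first example\n"
    "ACGTACGT\n"
    ">elem2 second example\n"
    "GGCCTTAA\n"
)

# parse() returns an iterator of SeqRecord objects, one per ">" entry.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
names = [rec.id for rec in records]

# read() insists on exactly one record and raises ValueError otherwise.
try:
    SeqIO.read(StringIO(fasta_text), "fasta")
    read_failed = False
except ValueError:
    read_failed = True
```

With a real file you would pass the filename (or an open handle) in
place of the StringIO object.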

> Basically, I just want to parse the information and have dictionaries hold
> the transposon elements' names and sequences for later use.  Can I do that
> with biopython or should I make my own parser? Any help would be greatly
> appreciated.  I'm still very much a Python novice and get frustrated by not
> knowing how to ask my questions appropriately.

You should be able to use the Bio.SeqIO.index() function for this. It
gives you a read-only, dictionary-like object keyed by record identifier:

>>> from Bio import SeqIO
>>> data = SeqIO.index("D_mel_transposon_sequence_set.fasta", "fasta")
>>> data.keys()[:10]
['gb|U14101|TART-B', 'gb|AF162798|Dbuz\\BuT1',
'gb|U26847|Dvir\\Helena', 'gb|X67681|Bari1', 'gb|M69216|hobo',
'gb|U29466|Dkoe\\Gandalf', 'gb|Z27119|flea',
'gb|AB022762|aurora-element', 'gb|nnnnnnnn|Stalker3T',
'gb|AF518730|Dwil\\Vege']
>>> data["gb|nnnnnnnn|Stalker3T"]
SeqRecord(seq=Seq('TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAAT...ACA',
SingleLetterAlphabet()), id='gb|nnnnnnnn|Stalker3T',
name='gb|nnnnnnnn|Stalker3T', description='gb|nnnnnnnn|Stalker3T
STALKER3 372bp', dbxrefs=[])
>>> print data["gb|nnnnnnnn|Stalker3T"].seq
TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAATATGTAAAGTAGAGTTAATATGTAAGTAAGCAAAAGACCACCAACACTTACATGAACACTCCAGCTCTTGAAATACGATCGAGCGCTTAAACATAAGCCGATCGCGGAGCGTGAGAGTGCCGAGCATACACCTAGCAGCTCAAGTGATTAAGATAAGATAAGATAAGATAACAAACACGTAGTCTTAAGCGCGTCATGTGCGGGTGGCTGTACCCAAGAACAGCAAAGTGAATTCATTCGAATAAACCGCTTCAAGCAGAGCAGAGCCAAGTCTATTATATCAACTTCAAAAATACCGTATAACCTTGAACCTATTACA
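
The session above is Python 2. On Python 3, keys() cannot be sliced
and print is a function, so the same idea might look like the sketch
below. It also copies everything into a plain dictionary of name to
sequence string, which seems to be what you are after. The file
contents and the names fasta_text, first_keys and sequences are made
up for illustration; point SeqIO.index() at your own file instead.

```python
import os
import tempfile

from Bio import SeqIO

# Made-up stand-in for the transposon FASTA file (hypothetical records).
fasta_text = (
    ">gb|X00001|demoA made-up record\n"
    "ACGTACGTAC\n"
    ">gb|X00002|demoB made-up record\n"
    "TTGGCCAA\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta")
    with open(path, "w") as handle:
        handle.write(fasta_text)

    # index() scans the file once and loads records on demand by id.
    data = SeqIO.index(path, "fasta")

    # Iterating over the index yields the record identifiers.
    first_keys = sorted(data)[:10]

    # Plain dict: record id -> sequence as an ordinary string.
    sequences = {name: str(data[name].seq) for name in data}

    # Release the file handle before the temporary directory is removed.
    data.close()

print(first_keys)
print(sequences["gb|X00002|demoB"])
```

For a file this small, SeqIO.to_dict(SeqIO.parse(filename, "fasta"))
would also work and keeps all the records in memory.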

Peter