[BioPython] trying to make NBRF dictionary
ashleigh smythe
absmythe at ucdavis.edu
Mon Mar 15 16:57:48 EST 2004
Hello. As there seems to be no existing Bio.Fasta-style dictionary code
for alignments (Clustalw or NBRF), I thought I'd try to write a simple
script using the NBRF iterator to make a dictionary of sequence
name:sequence key:value pairs. My ultimate goal is to be able to
combine different aligned datasets where the sequence names (taxa) are
the same but they are in a different order (otherwise I could just
append one to the other). It seemed like a good use of a dictionary,
only I'm still pretty lame at Python. I thought I'd start with just
trying to get one file into a dictionary, and I'm stuck already. My
code seems to make a dictionary of sorts, but it behaves like it only
has 1 key:value pair rather than 4 (len(mydict) returns 1) and the keys
are just my variable name (cur_record.sequence_name), not what I think
the keys should be - the actual data I put into the dictionary. I'm
guessing that means I have some scope problem. Can anybody please give
me some tips on where to go, at least for this first chunk?
Here is my script:
import Bio
from Bio import NBRF

mydict={}

def makedict(file1):
    parser=NBRF.RecordParser()
    first_file=open(file1, 'r')
    iterator=NBRF.Iterator(first_file, parser)
    while 1:
        cur_record=iterator.next()
        if cur_record is None:
            break
        name=cur_record.sequence_name
        sequence=cur_record.sequence.data
        mydict[name] = sequence
    return mydict
And here is what I get:
>>> seqcombine2.makedict('test.pir')
{'9.1Otostrongylus_sp._U81589.1':
'----------------------------------------------------------------------------------------------------T-GTC-GA--GTTC-A--CC------TT--C--A---AG-T-GA--AA-C-TGCGAACGGCTCATTAG-AGCAGATG-T-CATT---TATT-CG--G--AA-A--A-T--C--C--A-TTT-GGA--TAACTGCG--GTAAT-TCTGGAGCTAATACATGCG-ATTA-A-AC-CCTG-AC---T--T-T--T---GAAA--GGGTGCAAT-TA-TTAGAG---C---AA-A-TCAAT-CAT-------------T-T---TC----------G-GA------TG----TAGTT----------T---GCT---G-A-C-TC-TGAATA-A---CG--CAG--CATA-TCGG-CGGC-T-T-GT---TCGCCGATAAT-CCGAAAA----AG---TGT-C-TGCCC-TATCA--AC---CT---GA-TGGTAGTCTATTAGTCTA-CCATGGTTATTACGGGTAACGGAGAATAAGGGTT-CGACTCCGGAGAGGGAGCCTTAGAAACGGCTACCACATCCAAGGAAGGCAGCAG-GCGCGAAACTTATCCAA-T-CTTG-----A-ATAGATGA-GATAGTGACT-----------------------AAAAATAAAAA--GACCA---TTCC-T-AT-G--GAACG-GTTATTTCAATGAGT--TGATCATAAACCTTTTTT--C-G-AGGA--TCAAGTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTC--CACTAGTGTA-AATCGTCATTGCTGCGGTTAAAAAGC-TCGTAGTTGGAT-C-TGAGTCGC---AT--GCA-AT-G-GTTCG--C-CT----T--TG--G--CGT----TAAT------C---AT-TG-TTGTG---ACTA---T------T-T---G--CTG--G-T-T--TTCT-AT--TG-A--AA-----TTTC-----G-A-TT-----TCTTTA-GTG-GC-TA--GCGA-GTT-TA-CTTTGA-AT-AAATTAGAGTGCT-CAGAACAAG---CGTT-----T--GC-TT-G--AAT-G-GTCGAT-CATGGAATAA-----TAAAAGAGGAC--TTCG---GT-T------CTATT-T----ATTGGTTC-AG---G-AA------CTG------AAAT-AATGGTTAAGAGGGACA--ATTC-GGGGGCATTCGTATCCCTGCGCGAGAGGTGAAATTCGTG-GACCG-CAGGGGGACGCCCTAAAGCGAAAG-CATTTGCC-AAGAAT--GTCTTCATTAATCA-AGAACGAAAGTCAGAGGTTCGAAGGCGATTAGATA--CCGCCC-TAGTTCTGACCGTAAACTATGCCATCTAGC-GA--TCC-GAT--GG-GG--TA--T--TG--T-T----GCCTT--GTCGAGG-AGCTT-CCCGGAAACGA--AA-GTCTTTCGGT-TCCTGGGGTAGTATGGTTGC-AAAGCT-G-AAACTTAAAGA-AATTGACGGAATGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGA--AAACT-CACCC-GGCCCGGACACCGTAA-GGATTGAC-----AGATTGA--A---AGCTCTTTCTC-GATTTGGTGGTTGGTGGTGCATGGCCGTTCTTAGTTG-GTGGAG-CGATTTGTCTGGTTTATTCC-GAT-AACGAGCGAGACTCT-AG-C-C--TG-CTAAA-TA-G--TGA--CAA---------------GA----TT-----------TT------T----ATGTC-------TA-G----T--C-------TA-------------C-TT-----CTT-AG---AGGGATAAG-CGG---TGTT-T-----A-G-C--CGCA--CG-AGATTGAGCGATAACA
GGTCTGTGATGCCCTTAGATGTCCGGGG-CTG-CACGCGCGCTACAATGGAAG-AAT-CAGT--TGGC---CTA--T----CCAT-TGC-CG-A-AAGGT-AT----T----GGTAAACCG-TTGAAACT--CTTCC-GTG-ACCGGGATAGGGAATTGT--A-ATT---------ATT---TCCC-TTGAACG-AGGAATTCCTAGTAAGTGTG-AGTCATCAGCTCACGCTGATTACGTCCC-TGCCATTTGTACACACCGCCCGTCGCTGTC-CGGG-ACTG--AGC-TGTC--TCGAGAGGACT-GCGG-A-CTG----CT--GTA----TTGA-GG---CCT-------T---CGGG------TCG-----TGGTA----TAGCG---GG-AAA-CAG-TTC-AATC-G-CAATG-G--CTTGAACCGGGTAAAAGTCGT-AACAAGGTATCTG---------------------------------------------------------------------', '813Otostrongylus_circumlitus_A': '-----------------------------------------------------------------------------------GATT-AAGCCATG-CA-T-GTC-GA--GTTC-A--GC------TT--C--A---AG-T-GA--AA-C-TGCGAACGGCTCATTAG-AGCAGATG-T-CATT---TATT-CG--G--AA-A--A-T--C--C--A-TTT-GGA--TAACTGCG--GTAAT-TCTGGAGCTAATACATGCG-ATTA-A-AC-CCTG-AC---T--T-T--T---GAAA--GGGTGCAAT-TA-TTAGAG---C---AA-A-TCAAT-CAT-------------T-T---TC----------G-GA------TG----TAGTT----------T---GCT---G-A-C-TC-TGAATA-A---CG--CAG--CATA-TCGG-CGGC-T-T-GT---TCGCCGATAAT-CCGAAAA----AG---TGT-C-TGCCC-TATCA--AC---CT---GA-TGGTAGTCTATTAGTCTA-CCATGGTTATTACGGGTAACGGAGAATAAGGGTT-CGACTCCGGAGAGGGAGCCTTAGAAACGGCTACCACATCCAAGGAAGGCAGCAG-GCGCGAAACTTATCCAA-T-CTTG-----A-ATAGATGA-GATAGTGACT-----------------------AAAAATAAAAA--GACCA---TTCC-T-AT-G--GAACG-GTTATTTCAATGAGT--TGATCATAAACCTTTTTT--C-G-AGGA--TCAAGTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTC--CACTAGTGTA-AATCGTCATTGCTGCGGTTAAAAAGC-TCGTAGTTGGAT-C-TGAGTCGC---AT--GCA-AT-G-GTTCG--C-CT----T--TG--G--CGT----TAAT------C---AT-TG-TTGTG---ACTA---T------T-T---G--CTG--G-T-T--TTCT-AT--TG-A--AA-----TTTC-----G-A-TT-----TCTTTA-GTG-GC-TA--GCGA-GTT-TA-CTTTGA-AT-AAATTAGAGTGCT-CAGAACAAG---CGTT-----T--GC-TT-G--AAT-G-GTCGAT-CATGGAATAA-----TAAAAGAGGAC--TTCG---GT-T------CTATT-T----ATTGGTTC-AG---G-AA------CTG------AAAT-AATGGTTAAGAGGGACA--ATTC-GGGGGCATTCGTATCCCTGCGCGAGAGGTGAAATTCGTG-GACCG-CAGGGGGACGCCCTAAAGCGAAAG-CATTTGCC-AAGAAT--GTCTTCATTAATCA-AGAACGAAAGTCAGAGGTTCGAAGGCGATTAGATA--CCGCCC-TAGTTCTGAC
CGTAAACTATGCCATCTAGC-GA--TCC-GAT--GG-GG--TA--T--TG--T-T----GCCTT--GTCGAGG-AGCTT-CCCGGAAACGA--AA-GTCTTTCGGT-TCCTGGGGTAGTATGGTTGC-AAAGCT-G-AAACTTAAAGA-AATTGACGGAATGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGA--AAACT-CACCC-GGCCCGGACACCGTAA-GGATTGAC-----AGATTGA--A---AGCTCTTTCTC-GATTTGGTGGTTGGTGGTGCATGGCCGTTCTTAGTTG-GTGGAG-CGATTTGTCTGGTTTATTCC-GAT-AACGAGCGAGACTCT-AG-C-C--TG-CTAAA-TA-G--TGA--CAA---------------GA----TT-----------TT------T----ATGTC-------TA-G----T--C-------TA-------------C-TT-----CTT-AG---AGGGATAAG-CGG---TGTT-T-----A-G-C--CGCA--CG-AGATTGAGCGATAACAGGTCTGTGATGCCCTTAGATGTCCGGGG-CTG-CACGCGCGCTACAATGGAAG-AAT-CAGT--TGGC---CTA--T----CCAT-TGC-CG-A-AAGGT-AT----T----GGTAAACCG-TTGAAACT--CTTCC-GTG-ACCGGGATAGGGAATTGT--A-ATT---------ATT---TCCC-TTGAACG-AGGAATTCCTAGTAAGTGTG-AGTCATCAGCTCACGCTGATTACGTCCC-TGCCATTTGTACACACCGCCCGTCGCTGTC-CGGG-ACTG--AGC-TGTC--TCGAGAGGACT-GCGG-A-CTG----CT--GTA----TTGA-GG---CCT-------T---CGGG------TCG-----TGGTA----TAGCG---GG-AAA-CAG-TTC-AATC-G-CAATG-G--CTTGAACCGGGTAAAAGTCGT-AACAAGGTATCTGTAGGTGAACCTGG--------------------------------------------------------', '815Parelaphostrongylus_odocoil': 
'------------------------------------------------------------------------------------ATT-AAGCCATG-CA-T-GTG-GA--GTTC-A--AC------TT--CA-A---AG-T-GA--AA-C-TGCGAACGGCTCATTAG-AGCAGATG-T-CATT---TATT-CG--G--AA-A--A-T--CC-T--T-AAT-GGA--TAACTGCG--GTAAT-TCTGGAGCTAATACATATGCAT-A-A-AC-CCTG-AC---T--C-TG-T---GAAA--GGGTGCAAT-TA-TTAGAG---C---AA-A-TCAAT-CAT-------------T-T---TC----------G-GA------TG----TAGTT----------T---GCT---G-A-C-TC-TGAATA-A---CG--CAG--CATA-TCGG-CGGC-T-T-GT---TCGCCGATATT-CCGAAAA----AG---TGT-C-TGCCC-TATCA--AC---CT---GA-TGGTAGTCTATTAGTCTA-CCATGGTTATTACGGGTAACGGAGAATAAGGGTT-CGACTCCGGAGAGGGAGCCTTAGAAACGGCTACCACATCCAAGGAAGGCAGCAG-GCGCGAAACTTATCCAA-T-CTTG-----A-ATAGATGA-GATAGTGACT-----------------------AAAAATAAAAA--GACCA---TTCC-T-AT-G--GAACG-GTCATTTCAATGAGT--TGATCATAAACCTTTTTT--C-G-AGTA--TCAAGTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTC--CACTAGTGTA-AATCGTCATTGCTGCGGTTAAAAAGC-TCGTAGTTGGAT-C-TGAGTCGC---AT--GCA-AT-G-ATTCG--C-CT----T--TG--G--CGT----TAAT------C---AT-TG-TTGTG---ACTA---T------T-T---G--CTG--G-T-T--TTCT-AT--TG-A--AA-----TTTC-----G-A-TT-----TCTATA-GTG-GC-TA--GCGA-GTT-TA-CTTTGA-AT-AAATTAAAGTGCT-CAGAACAAG---CGTT-----T--GC-TT-G--AAT-G-GTCGAT-CATGGAATAA-----TAAAAGAGGAC--TTCG---GT-T------CTATT-T----ATTGGTTC-AG---G-AA------CTG------AAAT-AATGGTTAAGAGGGACA--ATTC-GGGGGCATTCGTATCCCTGCGCGAGAGGTGAAATTCGTG-GACCG-CAGGGGGACGCCCTAAAGCGAAAG-CATTTGCC-AAGAAT--GTCTTCATTAATCA-AGAACGAAAGTCAGAGGTTCGAAGGCGATTAGATA--CCGCCC-TAGTTCTGACCGTAAACTATGCCATCTAGC-GA--TCC-GAT--GG-GG--TA--T--TG--T-T----GCCTT--GTCGAGG-AGCTT-CCCGGAAACGA--AA-GTCTTTCGGT-TCCTGGGGTAGTATGGTTGC-AAAGCT-G-AAACTTAAAGA-AATTGACGGAATGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGA--AAACT-CACCC-GGCCCGGACACCGTAA-GGATTGAC-----AGATTGA--A---AGCTCTTTCTC-GATTTGGTGGTTGGTGGTGCATGGCCGTTCTTAGTTG-GTGGAG-CGATTTGTCTGGTTTATTCC-GAT-AACGAGCGAGACTCT-AG-C-C--TG-CTAAA-TA-G--TGA--CTA---------------GA----T------------ACG-----T----ATGTC-------TA-G----T--C-------TA-------------C-TT-----CTT-AG---AGGGATAAG-CGG---TGTT-T-----A-G-C--CGCA--CG-AGATTGAGCGATAACA
GGTCTGTGATGCCCTTAGATGTTCGGGG-CTG-CACGCGCGCTACAATGGAAG-AAT-CAGC--TGGC---CTA--T----CCAT-TAC-CG-A-AAGGT-AT----T----GGTAAACCG-TTGAAACT--CTTCC-GTG-ACCGGGATAGGGAATTGT--A-ATT---------ATT---TCCC-TTGAACG-AGGAATTCCTAGTAAGTGTG-AGTCATCAGCTCACGCTGATTACGTCCC-TGCCATTTGTACACACCGCCCGTCGCTGTC-CGGG-ACTG--AGC-TGTC--TCGAGAGGACT-GCGG-A-CTA----CT--GTA----TTGA-GG---CCT-------T---CGGG------TCG-----CGATA----TGGCG---GG-AAA-CAG-TTC-AATC-G-CAATG-G--CTTGAACCGGGTAAAAGTCGT-A---------------------------------------------------------------------------------', '804Angiostrongylus_cantonensis': '------------------------------------------------------------------------------------ATT-AAGCCATG-CA-T-GAG-GA--GTTC-A--GC------TT--TA-A----G-T-GA--AA-C-TGCGAACGGCTCATTAG-AGCAGATG-T-GATT---TATT-CG--G--AA-A--A-T--CC-T----ATT-GGA--TAACTGCG--GTAAT-TCTGGAGCTAATACATGCGTAT-A-A-AC-CCTG-AC---T--T-T--C---GAAA--GGGTGCAAT-TA-TTAGAG---C---AA-A-TCAAT-CAT-------------T-T---TC----------G-GA------TG----TAGTT----------T---GCT---G-A-C-TC-TGAATA-A---CG--CAG--CATA-TCGG-CGGC-T-T-GT---TCGCCGATAAT-CCGAAAA----AG---TGT-C-TGCCC-TATCA--AC---CT---GA-TGGTAGTCTATTAGTCTA-CCATGGTTATTACGGGTAACGGAGAATAAGGGTT-CGACTCCGGAGAGGGAGCCTTAGAAACGGCTACCACATCCAAGGAAGGCAGCAG-GCGCGAAACTTATCCAA-T-CTTG-----A-ATAGATGA-GATAGTGACT-----------------------AAAAATAAAAA--GACCA---TTCC-T-AT-G--GAACG-GTTATTTCAATGAGT--TGATCATAAACCTTTTTT--C-G-AGTA--TCCAGTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTC--CACTAGTGTA-AATCGTCATTGCTGCGGTTAAAAAGC-TCGTAGTTGGAT-C-TGAGTTGC---AT--GCA-AT-G-ATTCG--C-CT----T--TG--G--CGT----TAAT------C---AT-TG-TTGTG---ACTA---T------T-T---G--CTG--G-T-T--TTCT-AT--TG-A--AA-----TTTC-----G-A-TT-----TCTTTA-GTG-GC-TA--GCGA-GTT-TA-CTTTGA-AT-AAATTAAAGTGCT-CAGAACAAG---CGTT-----T--GC-TT-G--AAT-G-GTCGAT-CATGGAATAA-----TAAAAGAGGAC--TTCG---GT-T------CTATT-T----ATTGGTTC-AG---G-AA------CTG------AAGT-AATGATTAAGAGGGACA--ATTC-GGGGGCATTCGTATCCCTGCGCGAGAGGTGAAATTCGTG-GACCG-CAGGGGGACGCCCTAAAGCGAAAG-CATTTGCC-AAGAAT--GTCTTCATTAATCA-AGAACGAAAGTCAGAGGTTCGAAGGCGATTAGATA--CCGCCC-TAGTTCTGAC
CGTAAACTATGCCATCTAGC-GA--TCC-GAT--GG-GG--TA--T--TG--T-T----GCCTT--GTCGAGG-AGCTT-CCCGGAAACGA--AA-GTCTTTCGGT-TCCTGGGGTAGTATGGTTGC-AAAGCT-G-AAACTTAAAGA-AATTGACGGAATGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGA--AAACT-CACCC-GGCCCGGACACCGTAA-GGATTGAC-----AGATTGA--A---AGCTCTTTCTC-GATTTGGTGGTTGGTGGTGCATGGCCGTTCTTAGTTG-GTGGAG-CGATTTGTCTGGTTTATTCC-GAT-AACGAGCGAGACTCT-AG-C-C--TG-CTAAA-TA-G--TGA--CTA---------------GA----TT-----------AT------T----GAGTC-------TA-G----T--C-------TA-------------C-TT-----CTT-AG---AGGGATAAG-CGG---TGTT-T-----A-G-C--CGCA--CG-AGATTGAGCGATAACAGGTCTGTGATGCCCTTAGATGTCCGGGG-CTG-CACGCGCGCTACAATGGAAG-AAT-CAGC--TGGC---CTA--T----CCAT-TGC-CG-A-AAGGT-AT----T----GGTAAACCG-TTGAAACT--CTTCC-GTG-ACCGGGATAGGGAATTGT--A-ATT---------ATT---TCCC-TTGAACG-AGGAATTCCTAGTAAGTGTG-AGTCATCAGCTCACGCTGATTACGTCCC-TGCCATTTGTACACACCGCCCGTCG
CTGTC-CGGG-ACTG--AGC-TGTC--TCGAGAGGACT-GCGG-A-CTA----CT--GTA----TTGA-GG---CCT-------T---CGGG------TCG-----CGATA----TGGCG---GG-AAA-CAG-TTC-AATC-G-CAATG-G--CTTGAACCGGGTAAAAGTCGT-AACAAGGTATCTG---------------------------------------------------------------------'}
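For the eventual goal of combining two aligned datasets whose taxa match but appear in a different order, a minimal sketch in plain Python might look like the following. It is only an illustration under assumptions: build_dict and combine_alignments are hypothetical helper names, the taxon names and sequences are made-up toy data, and the (name, sequence) pairs are assumed to come from something like cur_record.sequence_name and cur_record.sequence.data in the script above. The dictionary is created inside the function and the caller binds the returned value to a name, so nothing depends on a module-level variable.

```python
def build_dict(records):
    """Return a new {name: sequence} dictionary from (name, sequence) pairs.

    'records' is any iterable of pairs; in the script above the pairs would
    be read from each parsed record (an assumption, not tested here).
    """
    result = {}
    for name, sequence in records:
        if name in result:
            # Duplicate names would silently overwrite data, so fail loudly.
            raise ValueError("duplicate sequence name: %r" % name)
        result[name] = sequence
    return result


def combine_alignments(first, second):
    """Concatenate sequences taxon by taxon, regardless of record order."""
    if set(first) != set(second):
        # Names found in only one of the two datasets.
        unmatched = sorted(set(first) ^ set(second))
        raise ValueError("taxa not in both datasets: %s" % ", ".join(unmatched))
    return {name: first[name] + second[name] for name in first}


# Toy data (hypothetical): same taxa, different order in each dataset.
gene_a = build_dict([("taxon1", "AC-GT"), ("taxon2", "ACCGT")])
gene_b = build_dict([("taxon2", "TT-A"), ("taxon1", "TTGA")])

combined = combine_alignments(gene_a, gene_b)
print(len(combined))
print(combined["taxon1"])
```

Because the lookup is by name, the order of records in either file does not matter; the check on the two sets of names is one possible way to catch taxa that are missing from one dataset before anything is concatenated.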
Thanks for any input.
Ashleigh