[Bioperl-l] fasta header
Brian Desany
bdesany@bcm.tmc.edu
Mon, 5 Aug 2002 13:05:09 -0500
Alex,
Like you, I find it most convenient when the first word after the '>' is
unique and the rest of the line is free-form text.
But I've run into cases where the convention is the opposite: the last word
on the line is the unique identifier, which I find kind of irritating.
But the fact is the fasta format doesn't address this at all; the format
just says that everything on the '>' line is a comment. So even if certain
tools may be smart enough to, e.g. pull out an accession number from the
comment line, that's a bonus and doesn't imply that all comment lines have
to follow that convention.
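
As a rough illustration of that kind of "bonus" parsing, here is a minimal Python sketch. It assumes NCBI-style pipe-delimited headers of the form gi|number|db|accession| (like the first example in the quoted message); the regex and the function name parse_ncbi_header are made up for illustration and are not part of any existing tool. Anything that does not match simply returns None, since the format itself promises nothing.

```python
import re

# Assumption: header looks like ">gi|<number>|<db>|<accession>| free text".
# Headers that do not follow this convention are NOT an error in FASTA terms.
NCBI_HEADER = re.compile(r"^>?gi\|(\d+)\|(\w+)\|([^|\s]+)\|")


def parse_ncbi_header(line):
    """Return (gi, db, accession), or None if the line does not match."""
    match = NCBI_HEADER.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)
```

A caller would still need a fallback (for example the whole comment line) for every header where this returns None.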
So as far as I can tell, you're going to have to either explicitly check
each of your sequences for what the "short unique" part is, or use the
whole comment line in your database.
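
One way to do the explicit check is to test, per file, which part of the comment line is actually unique. The following is only a sketch under the assumption that the candidates worth trying are the first word, the last word, and the whole line; the function names are hypothetical.

```python
from collections import Counter


def read_headers(lines):
    """Yield each FASTA comment line without the leading '>'."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def candidate_ids(header):
    """Candidate identifiers: first word, last word, and the whole line."""
    words = header.split()
    return {
        "first_word": words[0] if words else "",
        "last_word": words[-1] if words else "",
        "whole_line": header,
    }


def unique_fields(headers):
    """Return the names of the candidates that are unique across all headers."""
    headers = list(headers)
    counts = {name: Counter() for name in ("first_word", "last_word", "whole_line")}
    for header in headers:
        for name, value in candidate_ids(header).items():
            counts[name][value] += 1
    # A candidate qualifies only if no value was seen more than once.
    return [
        name
        for name, counter in counts.items()
        if headers and all(n == 1 for n in counter.values())
    ]
```

For the three EMBL Arabidopsis headers in the quoted message, unique_fields(read_headers(...)) reports the last word and the whole line as unique, but not the first word.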
-Brian.
>-----Original Message-----
>From: bioperl-l-admin@bioperl.org [mailto:bioperl-l-admin@bioperl.org] On Behalf Of Alexander Kozik
>Sent: Monday, August 05, 2002 12:52 PM
>To: bioperl-l@bioperl.org; support@ebi.ac.uk; info@ncbi.nlm.nih.gov
>Subject: [Bioperl-l] fasta header
>
>
>Hi,
>
>I am confused by some types of fasta headers at NCBI and EMBL.
>For example, everything is fine when the fasta header contains a unique
>ID next to the ">" symbol:
>>gi|21728383|ref|NM_133695.1| Mus musculus RIKEN cDNA 1300007K12 gene (1300007K12Rik), mRNA
>ATTGAGTGTTGTTCATTGGCCTAGGTGAAGCCTGGGAAGCAGTGGGGCAGCCATGGAGCTGCTGACTGGG
>ACTGGCCTGTGGCCTGTGGCCATATTCACAGTCATCTTCATATTACTGGTGGACCTGATGCACCGGCGCC
>.....
>In this case my Blast/Fasta parser treats the "GI" number as a unique
>ID and can generate matrix (hits) tables like:
>
>Query ID Hit ID Expect Identity Overlap etc...
>
>In this table I need a >short< unique ID to load the data into a database.
>"GI" or accession number works very well in this case.
>
>However everything is ruined if I try to use some fasta files
>from EMBL:
>
>(Arabidopsis genome)
>ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Eukaryota/athaliana/I/ath1.prot
>
>>AC007323.7 AAF26460.1
>MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRD
>..............
>>AC007323.7 AAF26477.1
>MAASEHRCVGCGFRVKSLFIQYSPGNIRLMKCGNCKEVADEYIECERMVCFNHFLSLFGP
>..............
>>AC007323.7 AAF26476.1
>MDLSLAPTTTTSSDQEQDRDQELTSNIGASSSSGPSGNNNNLPMMMIPPPEKEHMFDKVV
>..............
>or from NCBI:
>ftp://ftp.ncbi.nih.gov/genomes/A_thaliana/CHR_I/NC_003070.ffn
>>(gi|18426880:3760-3913, 3996-4276, 4486-4605, 4706-5095, 5174-5326, 5439-5630), At1g01010
>ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACA
>..............
>>(gi|18426880:c8666-8571, c8464-8417, c8325-8236, c7987-7942, c7835-7729), At1g01020
>ATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGTCATTGTTCATTCAATACTCTC
>..............
>>gi|18426880:c12940-11864, At1g01030
>ATGGATCTATCCCTGGCTCCGACAACAACAACAAGTTCCGACCAAGAACAAGACAGAGACCAAGAATTAA
>..............
>There is no >short< unique number next to ">", and some programs, for
>example hmmsearch, are confused too.
>Output of hmmsearch:
>
>Scores for complete sequences (score includes all domains):
>Sequence   Description  Score  E-value   N
>--------   -----------  -----  -------   ---
>.........
>AC015448.7 AAF99864.1   837.7  5e-249    1
>AC015448.7 AAF99853.1   828.9  2.3e-246  1
>AC015448.7 AAF99858.1   806.1  1.6e-239  1
>AC015448.7 AAF99852.1   801.1  5.2e-238  1
>.........
>
>How do you handle this problem, and what is the unique identifier
>for this type of sequence? Are these examples a deviation from
>the classical definition of a fasta header?
>Sorry if I misunderstand the subject,
>hope for your help.
>
>Thanks a lot in advance,
>
>Alex.
>
>--
>Alexander Kozik
>Department of Vegetable Crops
>Asmundson Hall
>University California at Davis
>tel: (530) 752-1742
>email: akozik@atgc.org
>http://www.atgc.org