[Bioperl-l] fasta header

Alexander Kozik akozik@atgc.org
Mon, 05 Aug 2002 10:52:11 -0700


Hi,

I am confused by some of the FASTA headers used at NCBI and EMBL.
Everything is fine when the FASTA header has a unique ID right after
the ">" symbol, for example:
>gi|21728383|ref|NM_133695.1| Mus musculus RIKEN cDNA 1300007K12 gene (1300007K12Rik), mRNA
ATTGAGTGTTGTTCATTGGCCTAGGTGAAGCCTGGGAAGCAGTGGGGCAGCCATGGAGCTGCTGACTGGG
ACTGGCCTGTGGCCTGTGGCCATATTCACAGTCATCTTCATATTACTGGTGGACCTGATGCACCGGCGCC
.....
In this case my BLAST/FASTA parser treats the "GI" number as the unique
ID and can generate hit tables (matrices) like:

Query ID    Hit ID    Expect    Identity    Overlap etc...

For this table I need a >short< unique ID to load the data into a database.
The "GI" or accession number works very well for that.

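A minimal Python sketch of this kind of ID extraction, assuming the header
follows the gi|number|db|accession| layout shown above. The function name
parse_ncbi_header is made up for illustration; this is not the parser
mentioned in this message.

```python
import re

# Assumed layout: >gi|<number>|<db>|<accession>| free-text description
_NCBI_HEADER = re.compile(r"^>?gi\|(\d+)\|[^|]+\|([^|\s]+)\|")


def parse_ncbi_header(header):
    """Return (gi, accession) for a gi-style header, or None if it does not match."""
    match = _NCBI_HEADER.match(header.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_ncbi_header(
    ">gi|21728383|ref|NM_133695.1| Mus musculus RIKEN cDNA 1300007K12 gene"))
# prints ('21728383', 'NM_133695.1')
print(parse_ncbi_header(">AC007323.7 AAF26460.1"))
# prints None
```

Returning None for a header that does not fit the layout makes the
problematic files easy to detect before anything is loaded into a database.
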
However, everything breaks when I try to use some FASTA files from EMBL:

(Arabidopsis genome)
ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Eukaryota/athaliana/I/ath1.prot

>AC007323.7 AAF26460.1
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRD
..............
>AC007323.7 AAF26477.1
MAASEHRCVGCGFRVKSLFIQYSPGNIRLMKCGNCKEVADEYIECERMVCFNHFLSLFGP
..............
>AC007323.7 AAF26476.1
MDLSLAPTTTTSSDQEQDRDQELTSNIGASSSSGPSGNNNNLPMMMIPPPEKEHMFDKVV
..............
or from NCBI:
ftp://ftp.ncbi.nih.gov/genomes/A_thaliana/CHR_I/NC_003070.ffn
>(gi|18426880:3760-3913, 3996-4276, 4486-4605, 4706-5095, 5174-5326, 5439-5630), At1g01010
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACA
..............
>(gi|18426880:c8666-8571, c8464-8417, c8325-8236, c7987-7942, c7835-7729), At1g01020
ATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGTCATTGTTCATTCAATACTCTC
..............
>gi|18426880:c12940-11864, At1g01030
ATGGATCTATCCCTGGCTCCGACAACAACAACAAGTTCCGACCAAGAACAAGACAGAGACCAAGAATTAA
..............
There is no >short< unique ID right after ">", and some programs,
for example hmmsearch, are confused too.
Output of hmmsearch:

Scores for complete sequences (score includes all domains):
Sequence    Description                                 Score    E-value  N
--------    -----------                                 -----    ------- ---
.........
AC015448.7  AAF99864.1                                  837.7     5e-249   1
AC015448.7  AAF99853.1                                  828.9   2.3e-246   1
AC015448.7  AAF99858.1                                  806.1   1.6e-239   1
AC015448.7  AAF99852.1                                  801.1   5.2e-238   1
.........
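
One possible workaround, sketched in Python under explicit assumptions:
hmmsearch takes the first word of the header as the sequence name, so
putting a unique token first would separate the hits. The sketch assumes
that in the EMBL file the second word is a protein ID of the form three
letters, digits, dot, version (as in AAF26460.1), that in the NCBI .ffn
file the text after the last comma is the locus name (as in At1g01010),
and that each header has been joined back onto a single line. Whether
these tokens are guaranteed unique is not confirmed here. The names
candidate_id and relabel are hypothetical.

```python
import re


def candidate_id(header):
    """Guess a short per-sequence ID from one FASTA header line (heuristic)."""
    line = header.lstrip(">").strip()
    # NCBI .ffn style: "gi|<number>:<coordinates>, <locus>" -> take the locus.
    if re.match(r"\(?gi\|\d+:", line):
        return line.rsplit(",", 1)[-1].strip()
    # Classic NCBI style: "gi|<number>|..." -> take the GI number.
    match = re.match(r"gi\|(\d+)\|", line)
    if match:
        return match.group(1)
    fields = line.split()
    # EMBL style (assumed): "<contig accession> <protein ID>" -> take the protein ID.
    if len(fields) >= 2 and re.fullmatch(r"[A-Z]{3}\d{5,7}\.\d+", fields[1]):
        return fields[1]
    # Fallback: first word, which may not be unique.
    return fields[0] if fields else ""


def relabel(lines):
    """Yield FASTA lines with the guessed ID placed first in every header."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + candidate_id(line) + " " + line[1:].strip()
        else:
            yield line.rstrip("\n")


for out in relabel([">AC007323.7 AAF26460.1\n", "MEDQVGFGFRPNDEELVGHY\n"]):
    print(out)
# prints:
# >AAF26460.1 AC007323.7 AAF26460.1
# MEDQVGFGFRPNDEELVGHY
```

The original header text is kept after the new ID, so no information is
lost, and a duplicate check on the returned IDs would show whether the
assumptions hold for a given file.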

How do you handle this problem, and what is the unique identifier
for this type of sequence? Do these examples deviate from the
classical definition of a FASTA header?
Sorry if I have misunderstood the subject;
I hope you can help.

Thanks a lot in advance,

Alex.

--
Alexander Kozik
Department of Vegetable Crops
Asmundson Hall
University of California at Davis
tel: (530) 752-1742
email: akozik@atgc.org
http://www.atgc.org