[Bioperl-l] fasta header
Alexander Kozik
akozik@atgc.org
Mon, 05 Aug 2002 10:52:11 -0700
Hi,
I am confused by some of the FASTA header formats used at NCBI and EMBL.
Everything is fine when the FASTA header starts with a unique
ID right after the ">" symbol, for example:
>gi|21728383|ref|NM_133695.1| Mus musculus RIKEN cDNA 1300007K12 gene
(1300007K12Rik), mRNA
ATTGAGTGTTGTTCATTGGCCTAGGTGAAGCCTGGGAAGCAGTGGGGCAGCCATGGAGCTGCTGACTGGG
ACTGGCCTGTGGCCTGTGGCCATATTCACAGTCATCTTCATATTACTGGTGGACCTGATGCACCGGCGCC
.....
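A header in this form is easy to reduce to a short ID. A rough sketch of that step (Python is used purely for illustration; `short_id` is a hypothetical name, and the pattern assumes the `gi|number|db|accession|` layout of the header above):

```python
import re

# Assumed layout: ">gi|<number>|<db>|<accession>| free-text description"
GI_HEADER = re.compile(r"^>?gi\|(\d+)\|[^|]*\|([^|\s]+)\|?")

def short_id(header):
    """Return (gi, accession) for an NCBI-style header, or None if it does not match."""
    m = GI_HEADER.match(header.strip())
    if m is None:
        return None
    return m.group(1), m.group(2)

print(short_id(">gi|21728383|ref|NM_133695.1| Mus musculus RIKEN cDNA 1300007K12 gene"))
# ('21728383', 'NM_133695.1')
```

Headers that do not follow this layout simply return None, so they can be detected and handled separately.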
In this case my Blast/Fasta parser treats the "GI" number as the unique
ID and can generate hit matrix tables with columns like:
Query ID, Hit ID, Expect, Identity, Overlap, etc.
For these tables I need a >short< unique ID to load the data into a database.
The "GI" or accession number works very well for this.
However, this breaks down when I try to use some FASTA files from EMBL
(Arabidopsis genome):
ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Eukaryota/athaliana/I/ath1.prot
>AC007323.7 AAF26460.1
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRD
..............
>AC007323.7 AAF26477.1
MAASEHRCVGCGFRVKSLFIQYSPGNIRLMKCGNCKEVADEYIECERMVCFNHFLSLFGP
..............
>AC007323.7 AAF26476.1
MDLSLAPTTTTSSDQEQDRDQELTSNIGASSSSGPSGNNNNLPMMMIPPPEKEHMFDKVV
..............
or from NCBI:
ftp://ftp.ncbi.nih.gov/genomes/A_thaliana/CHR_I/NC_003070.ffn
>(gi|18426880:3760-3913, 3996-4276, 4486-4605, 4706-5095, 5174-5326,
5439-5630), At1g01010
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACA
..............
>(gi|18426880:c8666-8571, c8464-8417, c8325-8236, c7987-7942,
c7835-7729), At1g01020
ATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGTCATTGTTCATTCAATACTCTC
..............
>gi|18426880:c12940-11864, At1g01030
ATGGATCTATCCCTGGCTCCGACAACAACAACAAGTTCCGACCAAGAACAAGACAGAGACCAAGAATTAA
..............
There is no >short< unique ID right after the ">", and some programs, for
example hmmsearch, are confused too.
Output of hmmsearch:
Scores for complete sequences (score includes all domains):
Sequence    Description   Score    E-value   N
--------    -----------   -----    -------  ---
.........
AC015448.7  AAF99864.1    837.7     5e-249   1
AC015448.7  AAF99853.1    828.9   2.3e-246   1
AC015448.7  AAF99858.1    806.1   1.6e-239   1
AC015448.7  AAF99852.1    801.1   5.2e-238   1
.........
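To make the problem concrete: if a tool takes the first whitespace-delimited word after ">" as the ID (which is what the hmmsearch output above appears to do), the EMBL records collapse onto the same ID. A small sketch, again in Python for illustration only (the helper names are hypothetical, and treating the second word as the unique one is just my assumption based on the three EMBL records shown earlier):

```python
from collections import Counter

headers = [
    ">AC007323.7 AAF26460.1",
    ">AC007323.7 AAF26477.1",
    ">AC007323.7 AAF26476.1",
]

def first_token(header):
    """ID as many tools see it: the first whitespace-delimited word after '>'."""
    return header.lstrip(">").split()[0]

def second_token(header):
    """Assumed workaround: use the second word instead of the first."""
    return header.lstrip(">").split()[1]

def duplicated(ids):
    """Return the IDs that occur more than once."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)

print(duplicated([first_token(h) for h in headers]))   # ['AC007323.7']
print(duplicated([second_token(h) for h in headers]))  # []
```

A check like this at least shows, before loading anything into a database, whether the IDs a parser will see are actually unique.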
How do you handle this problem, and what is the unique identifier
for this type of sequence? Do these examples deviate from the
classical definition of a FASTA header?
Sorry if I have misunderstood the subject;
I hope you can help.
Thanks a lot in advance,
Alex.
--
Alexander Kozik
Department of Vegetable Crops
Asmundson Hall
University California at Davis
tel: (530) 752-1742
email: akozik@atgc.org
http://www.atgc.org