[Bioperl-l] Bio:Seq $seq_obj->accession_number not returning accession number?

Jason Stajich jason.stajich at duke.edu
Sun Dec 4 16:49:40 EST 2005


Sam -
Yeah what Barry said.

The accession number doesn't get set when reading fasta files (see
Hilmar's link below for more info). All the info is in the display id,
available in $seq->display_id, so you can split it out yourself:

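# For an id like gi|17066572|gb|AF410462.1|AF410462 the fields are:
# gi, <gi number>, <db>, <accession.version>, <locus>
my (undef, $gi, undef, $acc, $locus) = split(/\|/, $seq->display_id);
$seq->accession_number($acc);   # $acc keeps the version suffix, e.g. AF410462.1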
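
If the ids in your file are not all in the NCBI gi|...|db|...|... layout,
the bare split above will quietly hand back the wrong fields. Here is a
slightly more defensive sketch. Note that parse_ncbi_id is a made-up
helper name, not something that ships with BioPerl, and the field layout
is only what this one example header shows. Unlike the snippet above, it
also strips the ".1" version off the accession.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split an NCBI-style FASTA
# identifier into named fields. Returns a hash ref on success and
# undef (empty list in list context) when the id does not look like
# gi|<number>|<db>|<accession[.version]>|<locus>.
sub parse_ncbi_id {
    my ($id) = @_;
    return unless defined $id;

    my ($tag, $gi, $db, $acc_ver, $locus) = split /\|/, $id;
    return unless defined $tag && $tag eq 'gi';
    return unless defined $acc_ver && length $acc_ver;

    # Separate the optional version suffix from the accession.
    my ($acc, $version) = $acc_ver =~ /^([^.]+)(?:\.(\d+))?$/;
    return unless defined $acc;

    return {
        gi        => $gi,
        db        => $db,
        accession => $acc,
        version   => $version,   # undef when the id carries no version
        locus     => $locus,
    };
}

# Stand-alone demonstration using the id from this thread.
my $fields = parse_ncbi_id('gi|17066572|gb|AF410462.1|AF410462');
print "accession: $fields->{accession}\n" if $fields;

# With a sequence object read through Bio::SeqIO it would be used
# roughly like this (left commented out so the sketch runs without
# BioPerl installed):
#
#   my $f = parse_ncbi_id($seq->display_id);
#   $seq->accession_number($f->{accession}) if $f;
```

Ids that do not start with "gi" are left alone, so accession_number stays
at its default for those sequences instead of being set to garbage.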

I thought there was a function already to do this for you, but I
guess not.  There is something in the Search::Hit objects to parse the
accession number, so maybe we can consolidate this if someone volunteers
to do it.

See also Hilmar's response about this:
http://bioperl.org/pipermail/bioperl-l/2005-August/019579.html

I've added it as a Q&A to the new wiki FAQ which we'll roll out soon.

-jason

On Dec 4, 2005, at 4:23 PM, Barry Moore wrote:

> Sam-
>
> The fasta parser makes no attempt to parse the fasta header since there
> is no standard format for what should be in a fasta header.  Parse the
> accession out of the primary_id field with a regular expression in your
> script or use GenBank or ENSEMBL format sequences to get all the goodies
> parsed for you.  Google on "accession fasta parse site:bioperl.org" to
> read other posts on this topic.
>
> Barry
>
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Sam
> Al-Droubi
> Sent: Sunday, December 04, 2005 1:18 PM
> To: BioPerl list BioPerl list
> Subject: [Bioperl-l] Bio:Seq $seq_obj->accession_number not
> returning accession number?
>
> The fasta format for this sequence AF410462 from NCBI looks like this
>
>
>> gi|17066572|gb|AF410462.1|AF410462 Mus musculus PEM homeobox (Pem)
> gene, promoter region and partial cds
> ATGCGTGTGGGCATGCGCTCATGCCCACTTGCTTGAGCACATGTGTGCTCACATGGACGTTAGAGGCAAC
> TTTCAGGAGTTATTTTTTTCCCTTCTAACTTGAGTTCCTGGACCTCAGACTTGTATAATAGGTACTTTCC
> CAACTTAAGTCTTACTGGCTCCAGGGTATCTGGTATACTCTTCTAGCCTCCAAGGGCAGCCACTCATGCT
> TCTTCAGGTGTGAAGAGGTGAGCCAGATACAACGGTGGGAGGCAGTGTGCCCTCAGTGTGTAGACTCTTT
> ATGCCCTTGGGGATTAGCGCCTCTAGCTGCCAGTCGGGTCTCTGGGTCCCTCCTGCTAAGGCCACTCTCG
> TCATGGTTCCTCTTGTCCTGGTGAGCCATTACGACCCTCTCACTTCCTTGTGTTCTCTTCCCTGTGTTCT
> CTCTCTGCTGCTGTGGCCATTCTAGCTCCCTGCACAGTCCTTCAAGCTCACCTCCTGCCTTCCGTGGACA
> AGAGGAAGCACAAAGAATCATCCAGTATGTATGCTCATGGCATAAGGGGATCCTGGGGAAGGGCTGAAGC
> CTGAGCCGGGCTGGTCAACAGAATCTCCCTCTCCCTAACTCCATCTCCCTCTCCTTCCCTCTTCCTCTCT
> CTATCCCTCCCCCCTCTCTCCCCCCACCACCGCATGTTTTGGGTCAGCTGACTGCTCTAGCCTTGATGAG
> ATATCTTCCCAGGAAGAGTTGGTGCTGACTGTACAGATTGAGTTAGAGGGAGGGAAGAAAGCTCCTGTTT
> GATCACTGGAGATCTTTATGCCTAGCTACATGTCTTACCAAAGCCAGGGGAGTCAGCTGAGCTGTAACTG
> GGCACCCTAAGTTCTGCACACCCACATGCCCATGAACTGTGTCCATCTTGCAAGCACATCGTGCTCATTA
> CATCCCCAAACTGCTATCACTTGTGTACCCCAAAGGCTCGGCCCACAGGAACGTCCTGTGAGCAAATCAC
> AAAGACCAGCTTAGGGCTGGAAACATTGTAACCTGAAGTAGGCCAGAGGAGATCCCTGCCAGGTTGAGCA
> TCACAGATCTCATTCTGTTCCCGGGGACACCAGGGGCCCAAGCTCAGAATCTGCCGAAGCATAACTTCAT
> CATTGATCCTATTCAGGGTATGGAAGCTGAGGGTTCCAGCCGCAAGGTCACCAGGCTACTCCGCCTGGGA
> GTCAAGGAAG
>
>  When I read this from a file as a sequence object using Bio::Seq, I
> get "unknown" for accession_number, even though the accession number
> is in the header of the fasta file.  Does anyone know why this happens?
>
>  My code looks like this:
>
>  print "primary id is: ",$seq_obj->primary_id."\n";
>  print "Description is ",$seq_obj->desc."\n";
>  print "Accession Number is ",$seq_obj->accession_number."\n";
>
>  Output looks like this:
>
>  primary id is: gi|17066572|gb|AF410462.1|AF410462
>  Description is Mus musculus PEM homeobox (Pem) gene, promoter region
> and partial cds
>  Accession Number is unknown
>
>
>  Thank you.
>
>
>
>
>
> Sincerely,
> Sam Al-Droubi, M.S.
> saldroubi at yahoo.com
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12



