[Bioperl-l] FASTA 2 GenBank

Mon Oct 17 16:39:52 EDT 2005

Well, asking for 'EMBL' format instead of 'genbank' is your first
problem; you won't get GenBank output without specifying the right format.

As for the rest of the information, you need to add it to the
sequence object yourself, as annotations and features. This is
described in the Feature-Annotation HOWTO:
http://bioperl.org/HOWTOs/html/Feature-Annotation.html#annotation
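
A minimal sketch of the two points above, assuming a reasonably recent BioPerl install: request 'genbank' as the output format, and attach the extra information to the sequence object before writing it. The sequence, the gene name 'hypoA', the qualifier values, the comment text, and the output file name contig.gb are placeholders invented for illustration; in the original script the sequence object would come from $seqio->next_seq() instead.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::SeqFeature::Generic;
use Bio::Annotation::Comment;

# Placeholder contig (hypothetical data); normally this object is
# returned by $seqio->next_seq() on the FASTA input.
my $seq = Bio::Seq->new(
    -display_id => 'ABC2002',
    -desc       => 'Hypothetical EST contig',
    -alphabet   => 'dna',
    -seq        => 'ccatggctagcaaaggagaataatt',
);

# Record-level information goes into the annotation collection;
# 'comment' annotations are written to the COMMENT block.
$seq->annotation->add_Annotation(
    'comment',
    Bio::Annotation::Comment->new(-text => 'Manually annotated ORF.'),
);

# Feature-level information goes into tags, which are written as
# /qualifier="value" lines under the feature.
my $cds = Bio::SeqFeature::Generic->new(
    -start       => 3,
    -end         => 23,
    -strand      => 1,
    -primary_tag => 'CDS',
    -source_tag  => 'manual',
    -tag         => {
        gene        => 'hypoA',                  # hypothetical name
        codon_start => 1,
        product     => 'hypothetical protein',
    },
);
$seq->add_SeqFeature($cds);

# Ask for 'genbank', not 'EMBL', to get LOCUS/FEATURES/ORIGIN output.
my $out = Bio::SeqIO->new(-file => '>contig.gb', -format => 'genbank');
$out->write_seq($seq);
$out->close;
```

Fields such as ORGANISM, division, and molecule type are not covered here; they need additional objects (for example a species object and a rich sequence), which the HOWTO discusses.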

-jason

On Oct 17, 2005, at 3:27 PM, Peter.Robinson at t-online.de wrote:

> Dear bioperlers,
>
> forgive what may be a simple question, but consulting the howtos  
> and Google did not reveal an answer to me.
> I am in the process of analyzing ESTs from a nonmodel organism and  
> would like to build GenBank style files for the contig sequences by  
> adding in information about sequence features. I would like to  
> start by adding info about the presumed ORF as follows:
>
>
> ## 1) This is the 'new' sequence
> my $seqio = new Bio::SeqIO('-file' => $inname, '-format' => 'fasta');
> my $seq    = $seqio->next_seq();
>
> ## 2) This is the feature I would like to add, with $startpos
> ## and $endpos being the start/end of the ORF based on translations
> ## and alignments
> my $feat = new Bio::SeqFeature::Generic ( -start => $startpos,
>                       -end => $endpos,
>                        -strand => 1,
>                        -primary => 'CDS',
>                        -source => 'Manual annotation of CDS',
>                        );
> $seq->add_SeqFeature($feat);
> ## 3) Here I would like to output the sequence in GenBank format
> my $out =  Bio::SeqIO->new(-file => ">$outputfilename",
>                          -format => 'EMBL');
> $out->write_seq($seq);
>
>
> ### However, I get this:
>
> ID   ABC2002.1   standard; DNA; UNK; 5914 BP.
> XX
> AC   unknown;
> XX
> DE   /early=858 /middle=1093 /late=436
> XX
> FH   Key             Location/Qualifiers
> FH
> FT   CDS             104..4501
> XX
> SQ   Sequence 5914 BP; 1088 A; 1893 C; 1748 G; 1174 T; 11 other;
> acgt....
>
> But I would like to get something like this:
>
> LOCUS       XM_213440               5804 bp    mRNA    linear   ROD 15-APR-2005
> DEFINITION  PREDICTED: Rattus norvegicus collagen, type 1, alpha 1 (Col1a1),
>             mRNA.
> ACCESSION   XM_213440
> VERSION     XM_213440.3  GI:62656859
> KEYWORDS    .
> SOURCE      Rattus norvegicus (Norway rat)
>   ORGANISM  Rattus norvegicus
>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>             Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
>             Sciurognathi; Muroidea; Muridae; Murinae; Rattus.
> COMMENT     MODEL REFSEQ:  This record is predicted by automated computational
>             analysis. This record is derived from an annotated genomic sequence
>             (NW_047337) using gene prediction method: GNOMON, supported by mRNA
>             and EST evidence.
>             Also see:
>                 Documentation of NCBI's Annotation Process
>
>             On Apr 15, 2005 this sequence version replaced gi:34873454.
> FEATURES             Location/Qualifiers
>      source          1..5804
>                      /organism="Rattus norvegicus"
>                      /mol_type="mRNA"
>                      /strain="BN/SsNHsdMCW"
>                      /db_xref="taxon:10116"
>                      /chromosome="10"
>      gene            1..5804
>                      /gene="Col1a1"
>                      /note="Derived by automated computational analysis using
>                      gene prediction method: GNOMON. Supporting evidence
>                      includes similarity to: 2 mRNAs, 48 ESTs, 1 Protein"
>                      /db_xref="GeneID:29393"
>                      /db_xref="RGD:61817"
>      CDS             95..4456
>                      /gene="Col1a1"
>                      /codon_start=1
>                      /product="similar to Collagen alpha1"
>                      /protein_id="XP_213440.1"
>                      /db_xref="GI:27688933"
>                      /db_xref="GeneID:29393"
>                      /db_xref="RGD:61817"
>                      /translation="MFSFVDLRLLLLLGATALLTHGQEDIPEVSCIHNGLRVPNGETW
>                      KPDVCLICICHNGTAVCDGVLCKEDLDCPNPQKREGECCPFCPEEYVSPDAEVIGVEG
>                      etc "
> ORIGIN
>         1 gacggagcag gaggcacacg gagtgaggcc acgcatgagc cgaagctaac cccccacccc
>        61 agccgcaaag agtctacatg tctagggtct agacatgttc a
>
>
> I would be happy if I could get the CDS bit right and very happy if  
> I could add some further information in the above style. At the  
> moment some downstream applications are not working because the  
> GenBank format is incorrect.
>
> Thanks,
>
> Peter

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12