[Bioperl-l] FASTA 2 GenBank
Jason Stajich
jason.stajich at duke.edu
Mon Oct 17 16:39:52 EDT 2005
Well, asking for 'EMBL' format instead of 'genbank' is your first
problem: SeqIO writes whatever format you name, so you won't get
GenBank output until you pass -format => 'genbank'.
As for the rest of the information, you need to add it to the
sequence object yourself, as annotations and features. See the
Feature-Annotation HOWTO.
This is described here:
http://bioperl.org/HOWTOs/html/Feature-Annotation.html#annotation
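Roughly, the two points above could look like the sketch below. It is only a minimal illustration, assuming the FASTA reader hands back a plain Bio::Seq object. The file names, the toy sequence, the ORF coordinates, the /product value and the comment text are all made up so that the script runs on its own; substitute your real contig and coordinates.

```perl
use strict;
use warnings;
use Bio::SeqIO;
use Bio::SeqFeature::Generic;
use Bio::Annotation::Comment;

# Hypothetical file names and ORF coordinates, for illustration only.
my $inname         = 'contig.fasta';
my $outputfilename = 'contig.gbk';
my ($startpos, $endpos) = (4, 30);

# Write a toy FASTA record so the sketch is self-contained.
open my $fh, '>', $inname or die "Cannot write $inname: $!";
print {$fh} ">ABC2002.1 toy contig\n",
            "acgatggctgctgctgctgctgctgcttaaccc\n";
close $fh;

# 1) Read the 'new' sequence.
my $seqio = Bio::SeqIO->new(-file => $inname, -format => 'fasta');
my $seq   = $seqio->next_seq;

# FASTA carries no accession, so set one by hand (placeholder value).
$seq->accession_number('ABC2002');

# 2) Build the CDS feature; qualifiers such as /product go in -tag.
my $feat = Bio::SeqFeature::Generic->new(
    -start       => $startpos,
    -end         => $endpos,
    -strand      => 1,
    -primary_tag => 'CDS',
    -source_tag  => 'manual',
    -tag         => {
        codon_start => 1,
        product     => 'hypothetical protein',   # made-up value
    },
);
$seq->add_SeqFeature($feat);

# A COMMENT block is an annotation on the sequence, not a feature.
$seq->annotation->add_Annotation(
    'comment',
    Bio::Annotation::Comment->new(-text => 'CDS annotated manually.'),
);

# 3) Ask for 'genbank', not 'EMBL', when writing.
my $out = Bio::SeqIO->new(-file   => ">$outputfilename",
                          -format => 'genbank');
$out->write_seq($seq);
$out->close;
```

The SOURCE/ORGANISM lines are not covered here; they come from a species object attached to the sequence, which the HOWTO also discusses.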
-jason
On Oct 17, 2005, at 3:27 PM, Peter.Robinson at t-online.de wrote:
> Dear bioperlers,
>
> forgive what may be a simple question, but consulting the howtos
> and Google did not reveal an answer to me.
> I am in the process of analyzing ESTs from a nonmodel organism and
> would like to build GenBank style files for the contig sequences by
> adding in information about sequence features. I would like to
> start by adding info about the presumed ORF as follows:
>
>
> ## 1) This is the 'new' sequence
> my $seqio = new Bio::SeqIO('-file' => $inname , '-format' =>
> 'fasta');
> my $seq = $seqio->next_seq();
>
> ## 2) This is the feature I would like to add, with $startpos
> ## and $endpos being the start/end of the ORF based on translations
> ## and alignments
> my $feat = new Bio::SeqFeature::Generic ( -start => $startpos,
> -end => $endpos,
> -strand => 1,
> -primary => 'CDS',
> -source => 'Manual annotation of CDS',
> );
> $seq->add_SeqFeature($feat);
> ## 3) Here I would like to output the sequence in GenBank format
> my $out = Bio::SeqIO->new(-file => ">$outputfilename",
> -format => 'EMBL');
> $out->write_seq($seq);
>
>
> ### However, I get this:
>
> ID ABC2002.1 standard; DNA; UNK; 5914 BP.
> XX
> AC unknown;
> XX
> DE /early=858 /middle=1093 /late=436
> XX
> FH Key Location/Qualifiers
> FH
> FT CDS 104..4501
> XX
> SQ Sequence 5914 BP; 1088 A; 1893 C; 1748 G; 1174 T; 11 other;
> acgt....
>
> But I would like to get something like this:
>
> LOCUS XM_213440 5804 bp mRNA linear ROD
> 15-APR-2005
> DEFINITION PREDICTED: Rattus norvegicus collagen, type 1, alpha 1
> (Col1a1),
> mRNA.
> ACCESSION XM_213440
> VERSION XM_213440.3 GI:62656859
> KEYWORDS .
> SOURCE Rattus norvegicus (Norway rat)
> ORGANISM Rattus norvegicus
> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
> Euteleostomi;
> Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
> Sciurognathi; Muroidea; Muridae; Murinae; Rattus.
> COMMENT MODEL REFSEQ: This record is predicted by automated
> computational
> analysis. This record is derived from an annotated
> genomic sequence
> (NW_047337) using gene prediction method: GNOMON,
> supported by mRNA
> and EST evidence.
> Also see:
> Documentation of NCBI's Annotation Process
>
> On Apr 15, 2005 this sequence version replaced gi:
> 34873454.
> FEATURES Location/Qualifiers
> source 1..5804
> /organism="Rattus norvegicus"
> /mol_type="mRNA"
> /strain="BN/SsNHsdMCW"
> /db_xref="taxon:10116"
> /chromosome="10"
> gene 1..5804
> /gene="Col1a1"
> /note="Derived by automated computational
> analysis using
> gene prediction method: GNOMON. Supporting
> evidence
> includes similarity to: 2 mRNAs, 48 ESTs, 1
> Protein"
> /db_xref="GeneID:29393"
> /db_xref="RGD:61817"
> CDS 95..4456
> /gene="Col1a1"
> /codon_start=1
> /product="similar to Collagen alpha1"
> /protein_id="XP_213440.1"
> /db_xref="GI:27688933"
> /db_xref="GeneID:29393"
> /db_xref="RGD:61817"
> /
> translation="MFSFVDLRLLLLLGATALLTHGQEDIPEVSCIHNGLRVPNGETW
>
> KPDVCLICICHNGTAVCDGVLCKEDLDCPNPQKREGECCPFCPEEYVSPDAEVIGVEG
> etc "
> ORIGIN
> 1 gacggagcag gaggcacacg gagtgaggcc acgcatgagc cgaagctaac
> cccccacccc
> 61 agccgcaaag agtctacatg tctagggtct agacatgttc a
>
>
> I would be happy if I could get the CDS bit right and very happy if
> I could add some further information in the above style. At the
> moment some downstream applications are not working because the
> GenBank format is incorrect.
>
> Thanks ,
>
> Peter
--
Jason Stajich
Duke University
http://www.duke.edu/~jes12