[Bioperl-l] Genbank file : bad features (tag) order with /translation

Chris Fields cjfields at illinois.edu
Wed Aug 3 16:52:02 UTC 2011


On Aug 3, 2011, at 11:00 AM, Peter Cock wrote:

> 2011/8/3 Maxime Déraspe <maximilien1er at gmail.com>:
>>> 
>>> Why do you care about the order?
>>> 
>> 
>> Hi Peter,
>> 
>> I care about the order for the submission to ncbi.
> 
> Do the NCBI have some guidelines which ask for a particular order?

No; beyond the feature table, I am not aware of any specification that requires a particular order.  Submitted data is tabular; Sequin is a nicer GUI for getting data into a useful format for submission to NCBI, where I believe the data is converted to ASN.1.

>> But I guess they
>> will reformat the file before getting it in their database.
> 
> They seem to generate the official GenBank files from their
> database - so I doubt the input order matters.

Yep, that's correct.  If NCBI ruled the world, everyone would be using ASN.1 (because that's what they use internally).

>> It's also
>> visually better when the translation of the protein comes in the end
>> of the annotation for the CDS and not before /product, /note ....
> 
> I do see your point, but if that were the only motivation I wouldn't
> want to make generating GenBank output any more complicated
> than it already is.
...
>> Anyway maybe I'll reformat the file in sequin table for a direct
>> submission to ncbi with sequin.
>> 
>> Thank you.
>> 
>> Max
> 
> Peter


Maxime, I find most users try to avoid using GenBank format except when absolutely needed.  There is a very good reason NCBI uses Sequin and tbl2asn for submissions: they end up generating simple tabular data that is easier to feed into NCBI's internal ASN.1 format.  GenBank is a nice human-readable format, but structure-wise I find it a pain to deal with, not to mention the variant third-party 'genbank' data that users want us to handle.
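
To illustrate the tabular layout mentioned above, here is a minimal sketch in Python (not Perl, and not part of BioPerl) that writes one CDS in the tab-delimited, five-column feature table style read by Sequin and tbl2asn.  The function name, sequence ID, coordinates, and qualifier values are all made up for illustration; check NCBI's feature table documentation before relying on the exact layout.

```python
# Hypothetical helper: render features in a five-column, tab-delimited
# feature table layout. All names and sample values are illustrative only.

def feature_table(seq_id, features):
    """Return feature-table text for (start, stop, key, qualifiers) tuples."""
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        # Columns 1-3: start, stop, and feature key.
        lines.append(f"{start}\t{stop}\t{key}")
        for name, value in qualifiers:
            # Columns 4-5: qualifier name and value, after three empty columns.
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


example = feature_table(
    "seq1",
    [(1, 300, "CDS", [("product", "hypothetical protein"), ("note", "example")])],
)
print(example, end="")
```

Each qualifier sits on its own line here, so the order in which tags are written matters much less than it does in a flat GenBank record.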

We try to support generation of output within reason, but that's never been our primary goal.  As long as the generated output can be re-read by our parsers with the data intact and yields sane data, we're pretty happy.

That said, any additions to deal with this are perfectly welcome (I pointed out one mechanism that could be used), but they would have to address the concerns Peter and I alluded to previously, and it would be nice to evaluate how any changes affect performance.  You are more than welcome to submit this as a feature request using our Redmine server (including patches if you do this yourself):

https://redmine.open-bio.org/
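
As a rough idea of what a tag-ordering step could look like, here is a minimal sketch in Python.  It is not BioPerl code and not necessarily the mechanism referred to above; the function name `order_tags` and its `last` parameter are hypothetical.  It keeps tags in their original order and moves only the listed ones (by default `translation`) to the end.

```python
# Hypothetical sketch: order qualifier tags so that chosen tags (here
# "translation") are written last. This is not BioPerl's implementation.

def order_tags(tags, last=("translation",)):
    """Return tags in original order, with any tag named in `last` moved to the end."""
    last_rank = {name: i for i, name in enumerate(last)}
    # sorted() is stable, so tags not named in `last` keep their relative order.
    return sorted(tags, key=lambda t: (t in last_rank, last_rank.get(t, 0)))


tags = ["codon_start", "translation", "product", "note"]
ordered = order_tags(tags)
print(ordered)  # ['codon_start', 'product', 'note', 'translation']
```

A step like this sorts the tags of every feature it writes, so its cost on large records is the kind of thing worth measuring before submitting a patch.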

chris


