[Biopython] Increase line length when writing EMBL format

Fri Sep 18 13:07:45 UTC 2020

Thanks Peter!

The accession is CAADRP010000001 and the EMBL file for the genome annotation can be downloaded directly from: ftp://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/ca/CAADRP01.dat.gz

All the best,
Pedro

> On 18 Sep 2020, at 13:54, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
> Thanks Pedro,
> 
> Could you share the accession / URL for the problem record(s) then?
> 
> And to clarify why your experiment didn't work, the Bio.GenBank.Record
> objects are irrelevant to Bio.SeqIO which uses SeqRecord objects. The
> GenBank parser can either produce records using Bio.GenBank.Record
> (mimics a GenBank record very closely, see the Bio.GenBank.parse
> function), or SeqRecords (as used in SeqIO).
> 
> The output from SeqIO is via the EmblWriter object here, where MAX_WIDTH = 80:
> 
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/InsdcIO.py#L1105
> 
> Peter
> 
> On Fri, Sep 18, 2020 at 1:52 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>> 
>> Hi Peter,
>> 
>> thank you so much for the prompt reply. Yes, it was downloaded directly from EMBL. It's from a recent submission early this year, so maybe there were some modifications related to these cases as you pointed out.
>> 
>> All the best,
>> Pedro
>> 
>> 
>>> On 18 Sep 2020, at 13:47, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> 
>>> Hello Pedro,
>>> 
>>> Sadly this annotation value is one of those awkward cases of a long
>>> value with no spaces, so there is no good place to break it for line
>>> wrapping.
>>> 
>>> Where is your original file from? The input line was 81 characters
>>> long, which I believe is too long.  Is it from EMBL themselves? If so,
>>> perhaps we need to more closely match how they now handle this corner
>>> case - which may have changed since I last looked at this code.
>>> 
>>> Peter
>>> 
>>> On Fri, Sep 18, 2020 at 1:36 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>>>> 
>>>> Dear BioPython Developers and enthusiasts,
>>>> 
>>>> I’m working on a script to modify an EMBL-format file I have at hand. Everything seems to be working OK, except for some features where `SeqIO.write(record, fh, 'embl')` seems to write the closing quote (`"`) on a new line, as a feature line of its own.
>>>> 
>>>> Here’s how the original feature looks:
>>>> 
>>>> ```
>>>> FT                   /standard_name="species:rnd-4_family-1331|genus:Unspecified"
>>>> ```
>>>> 
>>>> but with `SeqIO.write` it gets printed on two lines as:
>>>> 
>>>> ```
>>>> FT                   /standard_name="species:rnd-4_family-1331|genus:Unspecified
>>>> FT                   "
>>>> ```
>>>> 
>>>> I remember seeing (can’t remember where though) that the ‘embl’ format mostly reuses the GenBank structure, so I thought that increasing `record.GB_LINE_LENGTH`, say to 100 (`record.GB_LINE_LENGTH = 100`), could work, but it doesn’t…
>>>> 
>>>> I actually think that `record.GB_LINE_LENGTH` is not taken into account with ‘embl’ writing format because the default value seems to be [79](https://biopython.org/docs/1.75/api/Bio.GenBank.Record.html#Bio.GenBank.Record.Record.GB_LINE_LENGTH) but by default it prints the line above with a width of 81.
>>>> 
>>>> Any ideas/suggestions on how to work around this? I could probably write another parser to correct for this, but it would be easier/better if this could be handled within Biopython.
>>>> 
>>>> Many thanks,
>>>> Pedro
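
For anyone hitting the same problem, here is a minimal sketch of the post-processing workaround mentioned in the original message: rejoining a closing quote that ended up alone on its own `FT` line. It is plain Python applied to text that has already been written, not part of Biopython; the names `rejoin_dangling_quotes` and `DANGLING_QUOTE` are made up for illustration, and it assumes that an `FT` line holding nothing but `"` is always a wrapped closing quote.

```python
import re

# An FT line that holds nothing but a double quote, e.g. 'FT                   "'.
# Assumption: such a line is always the wrapped closing quote of the previous
# qualifier value.
DANGLING_QUOTE = re.compile(r'^FT\s+"\s*$')


def rejoin_dangling_quotes(lines):
    """Yield lines, appending a lone closing-quote FT line to the line before it."""
    previous = None
    for line in lines:
        stripped = line.rstrip("\n")
        if (
            previous is not None
            and previous.startswith("FT")
            and DANGLING_QUOTE.match(stripped)
        ):
            # Put the quote back on the qualifier line and drop the extra line.
            previous += '"'
            continue
        if previous is not None:
            yield previous + "\n"
        previous = stripped
    if previous is not None:
        yield previous + "\n"


# Example using the two lines from the original report.
indent = "FT" + " " * 19
wrapped = [
    indent + '/standard_name="species:rnd-4_family-1331|genus:Unspecified\n',
    indent + '"\n',
]
print("".join(rejoin_dangling_quotes(wrapped)), end="")
```

To use it on real output, one could write the record to an `io.StringIO` handle with `SeqIO.write`, then pass `handle.getvalue().splitlines(keepends=True)` through the function before saving. Note that the rejoined line is again longer than the writer's 80-column limit, just as in the original file.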