[Biopython] Increase line length when writing EMBL format

Peter Cock p.j.a.cock at googlemail.com
Fri Sep 18 12:54:38 UTC 2020


Thanks Pedro,

Could you share the accession / URL for the problem record(s) then?

And to clarify why your experiment didn't work, the Bio.GenBank.Record
objects are irrelevant to Bio.SeqIO which uses SeqRecord objects. The
GenBank parser can either produce records using Bio.GenBank.Record
(mimics a GenBank record very closely, see the Bio.GenBank.parse
function), or SeqRecords (as used in SeqIO).

The output from SeqIO is via the EmblWriter object here, where MAX_WIDTH = 80:

https://github.com/biopython/biopython/blob/master/Bio/SeqIO/InsdcIO.py#L1105
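
To make the arithmetic concrete, here is a minimal standalone sketch of a hard wrap at 80 columns. It is not Biopython's actual code: the `wrap_qualifier` function and its break-at-last-space logic are assumptions for illustration only. The 21-column `FT` prefix and the qualifier value are copied from the record quoted below.

```python
# Simplified stand-in for wrapping an EMBL feature-table line.
# NOT Biopython's implementation; wrap_qualifier is a hypothetical helper.

MAX_WIDTH = 80            # limit quoted in this thread for EmblWriter
INDENT = "FT" + " " * 19  # 21-column prefix seen in the quoted record


def wrap_qualifier(line, max_width=MAX_WIDTH, indent=INDENT):
    """Split line into pieces of at most max_width characters."""
    pieces = []
    while len(line) > max_width:
        # Prefer the last space after the indent that still fits.
        cut = line.rfind(" ", len(indent), max_width + 1)
        if cut == -1:
            cut = max_width  # no space to break on: hard break
        pieces.append(line[:cut].rstrip())
        line = indent + line[cut:].lstrip()
    pieces.append(line)
    return pieces


original = INDENT + '/standard_name="species:rnd-4_family-1331|genus:Unspecified"'
wrapped = wrap_qualifier(original)

for piece in wrapped:
    print(piece)
```

The input line is 81 characters and contains no space after the prefix, so this sketch breaks it at column 80 and leaves the closing quote alone on a continuation line, which is the symptom reported. Calling `wrap_qualifier(original, max_width=81)` returns the line unchanged.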

Peter

On Fri, Sep 18, 2020 at 1:52 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>
> Hi Peter,
>
> thank you so much for the prompt reply. Yes, it was downloaded directly from EMBL. It's from a recent submission early this year, so maybe there were some modifications related to these cases as you pointed out.
>
> All the best,
> Pedro
>
>
> > On 18 Sep 2020, at 13:47, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >
> > Hello Pedro,
> >
> > Sadly this annotation value is one of those awkward cases of a long
> > value with no spaces, so there is no good place to break it for line
> > wrapping.
> >
> > Where is your original file from? The input line was 81 characters
> > long, which I believe is too long.  Is it from EMBL themselves? If so,
> > perhaps we need to more closely match how they now handle this corner
> > case - which may have changed since I last looked at this code.
> >
> > Peter
> >
> > On Fri, Sep 18, 2020 at 1:36 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
> >>
> >> Dear BioPython Developers and enthusiasts,
> >>
> >> I’m working on a script to make some modifications to an EMBL format file I have at hand. Everything seems to be working OK, except for some features where `SeqIO.write(record, fh, 'embl')` seems to write the closing quote (`"`) on a new line, as a feature line of its own.
> >>
> >> Here’s what the original feature looks like:
> >>
> >> ```
> >> FT                   /standard_name="species:rnd-4_family-1331|genus:Unspecified"
> >> ```
> >>
> >> but with `SeqIO.write` it gets printed on 2 lines as:
> >>
> >> ```
> >> FT                   /standard_name="species:rnd-4_family-1331|genus:Unspecified
> >> FT                   "
> >> ```
> >>
> >> I remember seeing (can’t remember where though) that the ‘embl’ format uses the GenBank structure for the most part, so I thought that increasing the value of `record.GB_LINE_LENGTH`, say to 100 (`record.GB_LINE_LENGTH = 100`), could work, but it doesn’t…
> >>
> >> I actually think that `record.GB_LINE_LENGTH` is not taken into account when writing the ‘embl’ format, because its default value seems to be [79](https://biopython.org/docs/1.75/api/Bio.GenBank.Record.html#Bio.GenBank.Record.Record.GB_LINE_LENGTH) but by default the line above is printed with a width of 81.
> >>
> >> Any ideas/suggestions on how to work around this? I could probably write another parser to correct for this, but it would be easier/better if this could be handled within Biopython.
> >>
> >> Many thanks,
> >> Pedro
>


