[Biopython] zero-length feature

Mon Mar 22 14:52:53 UTC 2010

Brilliant! Thanks.

Regards,
Anne.

On 22 Mar 2010, at 12:07, Peter wrote:

> On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon <ap12 at sanger.ac.uk>  
> wrote:
>> My genome has a single N character at this point.
>>
>
> OK - then the feature should be length one, describing this single
> base region. i.e. Using python counting, start+1 == end
>
>>
>> Here is the code I use to insert these gaps:
>>
>>    # Add FT gap
>>    seq = record.seq
>>    in_N = False
>>    gap_features = []
>>    for i in range(len(seq)):
>>        if seq[i] == 'N' and not in_N:
>>            start_N = i
>>            in_N = True
>>        if in_N and not seq[i+1] == 'N':
>>            end_N = i
>>            if start_N == end_N:
>>                log.warning("gap of size 1 %s..%s" % (start_N, end_N))
>>            length = (end_N - start_N) + 1
>>            gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
>> strand=1, type="gap")
>>            gap_feature.qualifiers['estimated_length'] = [length]
>>            gap_features.append(gap_feature)
>>            in_N = False
>>
>> What should I do to make it works with (unmodified) Biopython EMBL  
>> output?
>> Thanks in advance for your help.
>>
>> Regards,
>> Anne.
>
> I think you have some out by one counting there (resulting in features
> of length one shorted than they should have been). How does this self
> contained example look?
>
> from Bio.Alphabet import generic_dna
> from Bio.Seq import Seq
> from Bio.SeqRecord import SeqRecord
> from Bio.SeqFeature import SeqFeature, FeatureLocation
> seq = Seq("ANANNANNNANNNNNA", generic_dna)
> record = SeqRecord(seq, id="Test")
> print "Finding stretches of N in this:"
> print seq
> # TODO - Cope with a sequence which ends with N
> assert seq[-1] != "N", "FIXME - seq ends with N"
> in_N = False
> for i in range(len(seq)):
>    if seq[i] == 'N' and not in_N:
>        start_N = i
>        in_N = True
>    if in_N and not seq[i+1] == 'N':
>        end_N = i+1
>        length = end_N - start_N
>        assert length > 0
>        assert str(seq[start_N:end_N]) == "N"*length
>        print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N),
>        print "gap of size %i, Python slicing  %s:%s" % (length,  
> start_N, end_N)
>        gap_feature = SeqFeature(FeatureLocation(start_N,end_N),
> strand=1, type="gap")
>        gap_feature.qualifiers['estimated_length'] = [length]
>        record.features.append(gap_feature)
>        in_N = False
> print
> print record.format("embl")
>
>
> And the output, which looks fine to me (this is more readable if your
> email client uses a fixed width font):
>
>
> Finding stretches of N in this:
> ANANNANNNANNNNNA
> .N.............. gap of size 1, Python slicing  1:2
> ...NN........... gap of size 2, Python slicing  3:5
> ......NNN....... gap of size 3, Python slicing  6:9
> ..........NNNNN. gap of size 5, Python slicing  10:15
>
> ID   Test; ; ; DNA; ; UNC; 16 BP.
> XX
> AC   Test;
> XX
> DE   .
> XX
> OS   .
> OC   .
> XX
> FH   Key             Location/Qualifiers
> FT   gap             2..2
> FT                   /estimated_length=1
> FT   gap             4..5
> FT                   /estimated_length=2
> FT   gap             7..9
> FT                   /estimated_length=3
> FT   gap             11..15
> FT                   /estimated_length=5
> SQ   Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
>     ANANNANNNA  
> NNNNNA                                                        16
> //
>
> Regards,
>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.