[Biopython] zero-length feature
Anne Pajon
ap12 at sanger.ac.uk
Mon Mar 22 14:52:53 UTC 2010
Brilliant! Thanks.
Regards,
Anne.
On 22 Mar 2010, at 12:07, Peter wrote:
> On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon <ap12 at sanger.ac.uk>
> wrote:
>> My genome has a single N character at this point.
>>
>
> OK - then the feature should be length one, describing this single
> base region. i.e. Using python counting, start+1 == end
>
>>
>> Here is the code I use to insert these gaps:
>>
>> # Add FT gap
>> seq = record.seq
>> in_N = False
>> gap_features = []
>> for i in range(len(seq)):
>> if seq[i] == 'N' and not in_N:
>> start_N = i
>> in_N = True
>> if in_N and not seq[i+1] == 'N':
>> end_N = i
>> if start_N == end_N:
>> log.warning("gap of size 1 %s..%s" % (start_N, end_N))
>> length = (end_N - start_N) + 1
>> gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
>> strand=1, type="gap")
>> gap_feature.qualifiers['estimated_length'] = [length]
>> gap_features.append(gap_feature)
>> in_N = False
>>
>> What should I do to make it works with (unmodified) Biopython EMBL
>> output?
>> Thanks in advance for your help.
>>
>> Regards,
>> Anne.
>
> I think you have some out by one counting there (resulting in features
> of length one shorted than they should have been). How does this self
> contained example look?
>
> from Bio.Alphabet import generic_dna
> from Bio.Seq import Seq
> from Bio.SeqRecord import SeqRecord
> from Bio.SeqFeature import SeqFeature, FeatureLocation
> seq = Seq("ANANNANNNANNNNNA", generic_dna)
> record = SeqRecord(seq, id="Test")
> print "Finding stretches of N in this:"
> print seq
> # TODO - Cope with a sequence which ends with N
> assert seq[-1] != "N", "FIXME - seq ends with N"
> in_N = False
> for i in range(len(seq)):
> if seq[i] == 'N' and not in_N:
> start_N = i
> in_N = True
> if in_N and not seq[i+1] == 'N':
> end_N = i+1
> length = end_N - start_N
> assert length > 0
> assert str(seq[start_N:end_N]) == "N"*length
> print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N),
> print "gap of size %i, Python slicing %s:%s" % (length,
> start_N, end_N)
> gap_feature = SeqFeature(FeatureLocation(start_N,end_N),
> strand=1, type="gap")
> gap_feature.qualifiers['estimated_length'] = [length]
> record.features.append(gap_feature)
> in_N = False
> print
> print record.format("embl")
>
>
> And the output, which looks fine to me (this is more readable if your
> email client uses a fixed width font):
>
>
> Finding stretches of N in this:
> ANANNANNNANNNNNA
> .N.............. gap of size 1, Python slicing 1:2
> ...NN........... gap of size 2, Python slicing 3:5
> ......NNN....... gap of size 3, Python slicing 6:9
> ..........NNNNN. gap of size 5, Python slicing 10:15
>
> ID Test; ; ; DNA; ; UNC; 16 BP.
> XX
> AC Test;
> XX
> DE .
> XX
> OS .
> OC .
> XX
> FH Key Location/Qualifiers
> FT gap 2..2
> FT /estimated_length=1
> FT gap 4..5
> FT /estimated_length=2
> FT gap 7..9
> FT /estimated_length=3
> FT gap 11..15
> FT /estimated_length=5
> SQ Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
> ANANNANNNA
> NNNNNA 16
> //
>
> Regards,
>
> Peter
--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
More information about the Biopython
mailing list