[Biopython] zero-length feature
Chris Fields
cjfields at illinois.edu
Mon Mar 22 15:01:48 UTC 2010
All,
Just to make sure I'm sanity-checked here, the GenBank/EMBL/DDBJ location specs indicate a location of one nucleotide (inclusive) in length is to be characterized as one number, not a range at all:
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.5.3
2..2
should just be:
2
Or, did I miss something in the discussion?
chris
On Mar 22, 2010, at 9:52 AM, Anne Pajon wrote:
> Brilliant! Thanks.
>
> Regards,
> Anne.
>
> On 22 Mar 2010, at 12:07, Peter wrote:
>
>> On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon <ap12 at sanger.ac.uk> wrote:
>>> My genome has a single N character at this point.
>>>
>>
>> OK - then the feature should be length one, describing this single
>> base region. i.e. Using python counting, start+1 == end
>>
>>>
>>> Here is the code I use to insert these gaps:
>>>
>>> # Add FT gap
>>> seq = record.seq
>>> in_N = False
>>> gap_features = []
>>> for i in range(len(seq)):
>>> if seq[i] == 'N' and not in_N:
>>> start_N = i
>>> in_N = True
>>> if in_N and not seq[i+1] == 'N':
>>> end_N = i
>>> if start_N == end_N:
>>> log.warning("gap of size 1 %s..%s" % (start_N, end_N))
>>> length = (end_N - start_N) + 1
>>> gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
>>> strand=1, type="gap")
>>> gap_feature.qualifiers['estimated_length'] = [length]
>>> gap_features.append(gap_feature)
>>> in_N = False
>>>
>>> What should I do to make it works with (unmodified) Biopython EMBL output?
>>> Thanks in advance for your help.
>>>
>>> Regards,
>>> Anne.
>>
>> I think you have some out by one counting there (resulting in features
>> of length one shorted than they should have been). How does this self
>> contained example look?
>>
>> from Bio.Alphabet import generic_dna
>> from Bio.Seq import Seq
>> from Bio.SeqRecord import SeqRecord
>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>> seq = Seq("ANANNANNNANNNNNA", generic_dna)
>> record = SeqRecord(seq, id="Test")
>> print "Finding stretches of N in this:"
>> print seq
>> # TODO - Cope with a sequence which ends with N
>> assert seq[-1] != "N", "FIXME - seq ends with N"
>> in_N = False
>> for i in range(len(seq)):
>> if seq[i] == 'N' and not in_N:
>> start_N = i
>> in_N = True
>> if in_N and not seq[i+1] == 'N':
>> end_N = i+1
>> length = end_N - start_N
>> assert length > 0
>> assert str(seq[start_N:end_N]) == "N"*length
>> print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N),
>> print "gap of size %i, Python slicing %s:%s" % (length, start_N, end_N)
>> gap_feature = SeqFeature(FeatureLocation(start_N,end_N),
>> strand=1, type="gap")
>> gap_feature.qualifiers['estimated_length'] = [length]
>> record.features.append(gap_feature)
>> in_N = False
>> print
>> print record.format("embl")
>>
>>
>> And the output, which looks fine to me (this is more readable if your
>> email client uses a fixed width font):
>>
>>
>> Finding stretches of N in this:
>> ANANNANNNANNNNNA
>> .N.............. gap of size 1, Python slicing 1:2
>> ...NN........... gap of size 2, Python slicing 3:5
>> ......NNN....... gap of size 3, Python slicing 6:9
>> ..........NNNNN. gap of size 5, Python slicing 10:15
>>
>> ID Test; ; ; DNA; ; UNC; 16 BP.
>> XX
>> AC Test;
>> XX
>> DE .
>> XX
>> OS .
>> OC .
>> XX
>> FH Key Location/Qualifiers
>> FT gap 2..2
>> FT /estimated_length=1
>> FT gap 4..5
>> FT /estimated_length=2
>> FT gap 7..9
>> FT /estimated_length=3
>> FT gap 11..15
>> FT /estimated_length=5
>> SQ Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
>> ANANNANNNA NNNNNA 16
>> //
>>
>> Regards,
>>
>> Peter
>
> --
> Dr Anne Pajon - Pathogen Genomics, Team 81
> Sanger Institute, Wellcome Trust Genome Campus, Hinxton
> Cambridge CB10 1SA, United Kingdom
> +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)
>
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome ResearchLimited, a charity registered in England with number 1021457 and acompany registered in England with number 2742969, whose registeredoffice is 215 Euston Road, London, NW1 2BE._______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
More information about the Biopython
mailing list