[BioRuby] removing primers and corresponding quality data from sequences

Wed Feb 17 14:37:49 UTC 2010

Hi,

On Fri, 12 Feb 2010 11:57:54 +0300
George Githinji <georgkam at gmail.com> wrote:

> Hi
> 
> I would like to remove both the primer and the portion before the 5'
> end and one after the 3' end
> 
> def primers
>   ['G*CACG[A|C]AGTTT[C|T]GC', 'GC[G|A]AAACT[T|G]CGTGC',
>    'G*CCCATTC[G|C]TCGAACCA', 'TGGTTCGA[C|G]GAATGGGC']
>   # primers.collect! { |primer| create_regexp(primer) }
> end

The above regular expressions might be different from what
you really want. For example, /G*C/ matches "C", "GC",
"GGC", "GGGC", "GGGGC", ..., and /[C|T]/ matches "C", "|",
or "T", because "|" inside a character class is a literal
character. Please check the syntax of regular expressions in Ruby.
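
To make the two points above concrete, here is a small sketch in
plain Ruby. The "corrected" primer at the end is only my guess at
the intent (an optional leading G, and "A or C" / "C or T" at the
ambiguous positions); the constant names are made up for this
example and are not part of BioRuby.

```ruby
# "|" inside [...] is a literal character, not alternation.
PIPE_IN_CLASS = /\A[C|T]\z/   # matches "C", "|" or "T"
PLAIN_CLASS   = /\A[CT]\z/    # matches "C" or "T" only

# "G*" means zero or more G; "G?" means zero or one G.
ANY_NUMBER_OF_G = /\AG*C\z/
OPTIONAL_G      = /\AG?C\z/

# Hypothetical corrected form of the first primer, assuming the
# leading G is optional and [A|C], [C|T] were meant as "A or C", "C or T".
PRIMER1 = /G?CACG[AC]AGTTT[CT]GC/

puts PIPE_IN_CLASS.match?("|")        # the literal "|" is accepted
puts PLAIN_CLASS.match?("|")          # the literal "|" is rejected
puts ANY_NUMBER_OF_G.match?("GGGC")   # any run of G is accepted
puts OPTIONAL_G.match?("GGGC")        # more than one G is rejected
puts PRIMER1.match?("GCACGAAGTTTCGC")
```

If the leading G is in fact required, drop the "?" after it.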

> 
> def bioentries(reads_file)
>   Bio::FlatFile.auto(reads_file) { |f| f.map { |entry| entry } }
> end
> 
> def remove_primers(file_name)
>   reg1 = Regexp.new(primers[0])
>   bioentries(file_name).map do |entry|
>     # puts ">#{entry.definition}"
>     # puts entry.seq
> 
>     puts entry.seq.gsub(reg1, '')
> 
>   end
> end
> 
> This would remove the primers but not the portion before the 5' end.
> 
> Secondly, it does not give me the corresponding coordinates so that I
> can remove the associated quality data for the removed portion.
> 
> Third, the approach seems 'dirty'.

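One of the simplest approaches is to mask the primer sequences
with "X" (or any special character you want) without changing
the original sequence length. I suppose many programs for
trimming vector sequences do the same. Because the length is
unchanged, positions in the masked sequence still correspond
to positions in the quality data.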

      #puts  entry.seq.gsub(reg1,'')

      seq = Bio::Sequence::NA.new(entry.seq)

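      # regs contains regular expressions in an array,
      # for example: regs = [ /ACGTACGT/i, /ATATATAT/i ]
      # (Bio::Sequence::NA.new downcases the sequence, so
      # upper-case patterns need the /i flag to match.)
      # Note that primer sequences are expected to be
      # completely different from each other.
      #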
      regs.each do |reg|
        seq.gsub!(reg) { |x| "X" * x.length }
      end

      # After that, all 5' bases before "X" are replaced
      # with "X".

      seq.sub!(/\A[^X]+X/) { |x| "X" * x.length }

      # All 3' bases after "X" are also replaced with "X".

      seq.sub!(/X[^X]+\z/) { |x| "X" * x.length }

      # Then, start and end positions of the unmasked region
      # can be obtained.

      start_pos = seq.index(/[^X]/)
      end_pos = seq.rindex(/[^X]/)

Be careful: the code does no error checking.
If only one of the 5' and 3' primers is detected in a sequence,
the whole sequence will be filled with "X" (and start_pos and
end_pos will be nil). If neither primer is found, the sequence
is left unchanged.

In addition, the above code ignores partial primer sequences
at the 3' end (and sometimes at the 5' end). Sequencing errors
are also ignored.
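
As a rough sketch of how the masking steps and the missing checks
could be put together, the following works on a plain String and an
Array of quality values, so it does not need BioRuby. The method
name trim_between_primers and the returned Hash keys are made up
for this example; it also assumes the sequence itself contains no
"X" and that the quality array has one value per base.

```ruby
# Hypothetical helper, not part of BioRuby.
# Masks primers with "X", extends the mask to both ends, and cuts the
# sequence and its quality values with the same coordinates.
# Returns nil unless two separate primer hits are found.
def trim_between_primers(seq, quals, regs)
  masked = seq.dup
  regs.each { |reg| masked.gsub!(reg) { |m| "X" * m.length } }

  # Need at least two separate masked stretches (5' and 3' primer);
  # otherwise the end-masking below would wipe out the whole sequence.
  return nil if masked.scan(/X+/).size < 2

  # Mask everything before the first primer and after the last one.
  masked.sub!(/\A[^X]+X/) { |m| "X" * m.length }
  masked.sub!(/X[^X]+\z/) { |m| "X" * m.length }

  start_pos = masked.index(/[^X]/)
  end_pos   = masked.rindex(/[^X]/)
  return nil if start_pos.nil?   # nothing left between the primers

  { start: start_pos, stop: end_pos,
    seq: seq[start_pos..end_pos], quals: quals[start_pos..end_pos] }
end

# Toy data: 2 bases, primer, 8-base insert, primer, 2 bases.
regs  = [/ACGTACGT/, /ATATATAT/]
seq   = "TT" + "ACGTACGT" + "GGCCGGCC" + "ATATATAT" + "CC"
quals = (0...seq.length).to_a   # fake quality values, one per base

result = trim_between_primers(seq, quals, regs)
puts result.inspect
```

Like the code above, this still ignores partial primers and
sequencing errors, and a third primer hit in the middle of the
read would be kept inside the returned region.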

Sincerely,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> > On Fri, Feb 12, 2010 at 11:46 AM, Andrew Grimm <andrew.j.grimm at gmail.com> wrote:
> >> I can't really help, but is it primers that you want removed, or the
> >> portion of sequence that's before the 5' primer or after the 3'
> >> primer?
> >>
> >> Andrew
> >>
> >> On Fri, Feb 12, 2010 at 7:35 PM, George Githinji <georgkam at gmail.com> wrote:
> >>> Hi All,
> >>> I have a list of sequences and corresponding quality files for the
> >>> same data. I would like to remove the primers as well as the
> >>> corresponding quality information.
> >>> The approach that I am using is proving to be dirty and buggy.
> >>>
> >>> For example, given:
> >>> 1. A list of sequences in FASTA format
> >>> 2. A list of 4 possible primer patterns (no idea which sequence might
> >>>    contain which primer)
> >>> 3. A list of quality data in phred format for each sequence
> >>>
> >>> The task is to remove the possible primers from the sequences and
> >>> anything before or after the primer.
> >>> Each sequence has a combination of at least 2 primers, one at the
> >>> 5' end and the other at the 3' end.
> >>>
> >>> Return a list of sequences with primer ends removed and the
> >>> corresponding quality data for the primers removed.
> >>>
> >>> What would be a nice way to approach this problem?
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> ---------------
> >>> Sincerely
> >>> George
> >>> PhD Student
> >>> KEMRI/Wellcome-Trust Research Program
> >>> Skype: george_g2
> >>> Blog: http://biorelated.wordpress.com/
> >>> _______________________________________________
> >>> BioRuby Project - http://www.bioruby.org/
> >>> BioRuby mailing list
> >>> BioRuby at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioruby
> >>>
> >>