[Bioperl-l] problem to fit genomic coordinates

Kevin Brown Kevin.M.Brown at asu.edu
Wed Mar 25 17:30:12 UTC 2009


Please keep all replies on list.
 
Using SimpleAlign removes the question of when to increment each index and reduces the number of loop iterations you'll have to do. Based on your sample data, many IDs in the second file share the same location, and the first file contains overlapping features. So you'll still need to decide which feature you actually want reported (e.g. CDS vs. exon).
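
The two decisions above can be sketched in plain Perl. This is only an illustration, not something from the thread: the `%rank` table, the sub names `pick_feature` and `group_by_location`, and the choice to prefer CDS over exon are all assumptions. It needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Hypothetical ranking (an assumption): the lowest number wins when
# several feature types cover the same experimental range.
my %rank = (CDS => 1, five_prime_UTR => 2, exon => 3);

# Return the preferred feature type from a list of candidate types.
# Unranked types sort last; ties are broken alphabetically.
sub pick_feature {
    my @types = @_;
    my ($best) = sort {
        ($rank{$a} // 99) <=> ($rank{$b} // 99) or $a cmp $b
    } @types;
    return $best;
}

# Group accession IDs by "start-end-strand" so that each distinct
# location only has to be looked up once.
sub group_by_location {
    my @lines = @_;
    my %ids_at;
    for my $line (@lines) {
        my ($id, $start, $end, $strand) = split ' ', $line;
        push @{ $ids_at{"$start-$end-$strand"} }, $id;
    }
    return \%ids_at;
}
```

With the sample data, the ten IDs at 3222-3239 would collapse into a single lookup, and a range covered by both a CDS and an exon would be reported as CDS under this assumed ranking.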


________________________________

	From: Laurent MANCHON [mailto:lmanchon at univ-montp2.fr] 
	Sent: Wednesday, March 25, 2009 9:44 AM
	To: Kevin Brown
	Subject: Re: [Bioperl-l] problem to fit genomic coordinates
	
	
	Okay, but I don't think this method makes it easy.
	The files are already sorted by the numeric columns, so maybe there is another
	logical method that doesn't use the BioPerl libraries, for example a while loop,
	
	something like:
	
	$i = $j = 1;
	# $idx = number of lines in file1
	# $cpt = number of lines in file2
	while ($i <= $idx && $j <= $cpt) {
	 # compare the current elements
	 # increment either $i or $j, depending on which segment comes before the other
	}
	The difficulty is deciding when to increment $i or $j inside the loop.
	
	Laurent --
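
One common way to settle when to advance each index is a sweep over both lists sorted by start coordinate, keeping a small list of features that can still overlap. The following is a minimal sketch under stated assumptions, not the method proposed in the thread: the sub name `overlap_sweep` and the hash keys `name`, `start` and `end` are hypothetical, and coordinates are taken to be 1-based and inclusive. It sorts both inputs itself, because the sample annotation file is not strictly ordered by start (exon 3631 follows CDS 3760).

```perl
use strict;
use warnings;

# Return [query, feature] pairs whose ranges overlap.
# Each interval is a hash ref with keys name, start, end.
sub overlap_sweep {
    my ($features, $queries) = @_;
    my @f = sort { $a->{start} <=> $b->{start} } @$features;
    my @q = sort { $a->{start} <=> $b->{start} } @$queries;

    my @hits;
    my @active;   # features that may still overlap this or a later query
    my $i = 0;    # index of the next feature not yet considered
    for my $query (@q) {
        # Advance the feature index: take every feature that starts
        # at or before the end of this query.
        while ($i < @f && $f[$i]{start} <= $query->{end}) {
            push @active, $f[$i++];
        }
        # Drop features that end before this query starts; later
        # queries start no earlier, so these can never match again.
        @active = grep { $_->{end} >= $query->{start} } @active;
        # An earlier, longer query may have admitted features that
        # start after this query ends, so check the start as well.
        push @hits, map { [$query, $_] }
                    grep { $_->{start} <= $query->{end} } @active;
    }
    return \@hits;
}

# Usage with a few ranges taken from the sample files.
my $hits = overlap_sweep(
    [ { name => 'five_prime_UTR', start => 3631, end => 3759 },
      { name => 'CDS',            start => 3760, end => 3913 },
      { name => 'exon',           start => 3631, end => 3913 } ],
    [ { name => 'acc_362358',  start => 3640, end => 3656 },
      { name => 'acc_3279485', start => 3793, end => 3813 },
      { name => 'acc_1987802', start => 3922, end => 3938 } ],
);
print "$_->[0]{name}\t$_->[1]{name}\n" for @$hits;
```

Each feature enters and leaves the active list once, so after sorting the work grows roughly with the number of lines plus the number of overlaps reported, as long as few features are active at any one position.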
	
	Kevin Brown wrote:

		Read in the first file and create a Bio::SimpleAlign object.
		
		Then use the slice method to find the features that lie between the
		start/end values from your second file.
		
		=head2 slice
		
		 Title     : slice
		 Usage     : $aln2 = $aln->slice(20,30)
		 Function  : Creates a slice from the alignment inclusive of start and
		             end columns, and the first column in the alignment is
		             denoted 1.
		             Sequences with no residues in the slice are excluded from
		             the new alignment and a warning is printed. Slice beyond
		             the length of the sequence does not do padding.
		 Returns   : A Bio::SimpleAlign object
		 Args      : Positive integer for start column, positive integer for
		             end column, optional boolean which if true will keep
		             gap-only columns in the newly created slice. Example:
		
		             $aln2 = $aln->slice(20,30,1)
		
		=cut
		
		  

			-----Original Message-----
			From: bioperl-l-bounces at lists.open-bio.org 
			[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of 
			Laurent MANCHON
			Sent: Wednesday, March 25, 2009 7:57 AM
			To: bioperl-l at lists.open-bio.org
			Subject: [Bioperl-l] problem to fit genomic coordinates
			
			This is my problem:
			how can I match ranges of genomic coordinates stored in two
			distinct files?
			
			The first file (file1.txt) is my annotation file, in this format:
			
			regulatory_region 3455 3463
			regulatory_region 3535 3544
			regulatory_region 3601 3608
			transcriptional_cis_regulatory_region 3622 3630
			five_prime_UTR 3631 3759
			CDS 3760 3913
			exon 3631 3913
			CDS 3996 4276
			exon 3996 4276
			CDS 4486 4605
			exon 4486 4605
			CDS 4706 5095
			exon 4706 5095
			CDS 5174 5326
			exon 5174 5326
			....
			....
			
			The second file (file2.txt) is my experimental file, in this format:
			
			acc_2765773 3222 3239 -
			acc_2842543 3222 3239 -
			acc_2842544 3222 3239 -
			acc_442945 3222 3239 -
			acc_442946 3222 3239 -
			acc_4873 3222 3239 -
			acc_53956 3222 3239 -
			acc_562588 3222 3239 -
			acc_807114 3222 3239 -
			acc_84146 3222 3239 -
			acc_2419732 3268 3285 +
			acc_3041065 3565 3583 +
			acc_362358 3640 3656 -
			acc_3279485 3793 3813 +
			acc_3091017 3794 3811 -
			acc_2807380 3832 3848 +
			acc_3105138 3832 3848 +
			acc_3105139 3832 3848 +
			acc_3105140 3832 3848 +
			acc_3116450 3832 3848 +
			acc_86708 3832 3848 +
			acc_1987802 3922 3938 -
			acc_1679660 4113 4129 +
			acc_891489 4113 4129 +
			acc_2829973 4299 4318 +
			....
			....
			
			
			Number of lines in file1.txt: ~150000
			Number of lines in file2.txt: ~800000
			
			So how can I annotate file2 using the genomic coordinates stored in
			file1? Do I need to compare each range in file2 with each
			range in file1: 800000 x 150000 combinations (a quadratic
			analysis)?
			I'm looking for a fast method to do that, something closer to a
			linear pass through the data.
			
			Thank you very much if you have any ideas to help me.
			
			Laurent --
			_______________________________________________
			Bioperl-l mailing list
			Bioperl-l at lists.open-bio.org
			http://lists.open-bio.org/mailman/listinfo/bioperl-l
			
			    

		