[Bioperl-l] Very basic Perl/BioPerl Help

Terry Jones tcj25 at cam.ac.uk
Thu Apr 14 11:49:29 EDT 2005


| What I am trying to do is take either A) Two fasta files with
| refseq/genbank data OR B) Two text files with 1 accession# per line
| and compare them, outputting only those fasta seqs or accession #'s
| that are not present in both.
| 
| So is it easier to just use perl somehow to compare the two raw acc#
| text files?

If neither file contains repeated lines, you can do this from the
raw acc# text files in various ways. If you're using some form of
UNIX, you can do this on the command line:

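  $ sort file1 file2 | uniq -c | awk '$1 == 1 { print $2 }'

uniq -c prefixes each distinct line with its count, and the awk filter
keeps the names that were seen exactly once. Matching on the count field
avoids depending on whether your uniq separates the count from the line
with a space or a TAB. Another way is to use comm: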

  $ sort file1 > file1.sorted
  $ sort file2 > file2.sorted
  $ comm -3 file1.sorted file2.sorted
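
With -3, comm prints names found only in the first file at the left
margin and names found only in the second file indented by a TAB. If you
would rather have the two groups in separate files, something like the
following rough sketch may help. The demo file names and accession
numbers are made up for illustration.

```shell
# Hypothetical sample data; these accession numbers are invented.
printf '%s\n' NM_000003 NM_000001 NM_000002 > demo1.txt
printf '%s\n' NM_000004 NM_000002 NM_000003 > demo2.txt

# comm needs both inputs sorted the same way.
sort demo1.txt > demo1.sorted
sort demo2.txt > demo2.sorted

# -23 suppresses columns 2 and 3, leaving names only in the first file.
comm -23 demo1.sorted demo2.sorted > only_in_demo1.txt

# -13 suppresses columns 1 and 3, leaving names only in the second file.
comm -13 demo1.sorted demo2.sorted > only_in_demo2.txt

cat only_in_demo1.txt only_in_demo2.txt
```

Here NM_000002 and NM_000003 are in both files, so only NM_000001 and
NM_000004 should be reported.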

You can guarantee that the sorted files contain no duplicates by adding
-u to sort (the -i makes it ignore non-printing characters as well):

  $ sort -u -i file1 > file1.sorted
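
To see why duplicates matter: a name listed twice in one file and not at
all in the other gets a count of two, so a count-based comparison treats
it as if it were in both files. A small sketch of the difference, again
with invented demo files and accession numbers:

```shell
# Hypothetical sample data: NM_000001 appears twice, in the first file only.
printf '%s\n' NM_000001 NM_000001 NM_000002 > dup1.txt
printf '%s\n' NM_000002 NM_000003 > dup2.txt

# Naive count: NM_000001 occurs twice overall, so it is wrongly dropped.
sort dup1.txt dup2.txt | uniq -u > naive.txt

# Deduplicate each file on its own first, then compare with comm.
sort -u dup1.txt > dup1.sorted
sort -u dup2.txt > dup2.sorted

# tr strips the TAB that comm puts in front of second-file names.
comm -3 dup1.sorted dup2.sorted | tr -d '\t' > not_in_both.txt

cat not_in_both.txt
```

The naive result contains only NM_000003, while the deduplicated
comparison reports both NM_000001 and NM_000003.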


This is all outside Perl. In Perl you could do something like:

use strict;
use warnings;

# Count how many times each name occurs across the two files.
my %names;

for my $file ("file1", "file2") {
    open(my $fh, '<', $file) or die "could not open $file ($!)";
    while (my $line = <$fh>) {
        chomp $line;
        $names{$line}++;
    }
    close($fh) or die "could not close $file ($!)";
}

# A name seen exactly once appeared in only one of the files.
my @not_in_both = grep { $names{$_} == 1 } keys %names;

print "$_\n" for sort @not_in_both;


Again, this relies on each name being present at most once in each
file.  You could code around this requirement in your Perl if you
wanted, for example by recording which file each name came from instead
of just counting occurrences.

Terry


