[Bioperl-l] Very basic Perl/BioPerl Help
Terry Jones
tcj25 at cam.ac.uk
Thu Apr 14 11:49:29 EDT 2005
| What I am trying to do is take either A) Two fasta files with
| refseq/genbank data OR B) Two text files with 1 accession# per line
| and compare them, outputting only those fasta seqs or accession #'s
| that are not present in both.
|
| So is it easier to just use perl somehow to compare the two raw acc#
| text files?
If your files do not contain repeat lines, you can do this from the
raw acc# text files in various ways. If you're using some form of
UNIX, you can do this on the command line:
$ sort file1 file2 | uniq -u
Here uniq -u prints only the lines that occur exactly once in the
combined, sorted input, i.e. the names found in just one of the files.
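Another way is to use comm, which expects both inputs to be sorted:
$ sort file1 > file1.sorted
$ sort file2 > file2.sorted
$ comm -3 file1.sorted file2.sorted
The -3 suppresses the lines common to both files. Names found only in
file1 come out in the first column, and names found only in file2 come
out in the second column, indented by a TAB.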
You can guarantee that your input files do not have duplicates via
$ sort -u -i file1 > file1.sorted
All of the above is outside Perl. In Perl you could do something like:
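use strict;
use warnings;

open(my $f1, '<', "file1") || die "could not open file1 ($!)";
open(my $f2, '<', "file2") || die "could not open file2 ($!)";

# Count how many times each name is seen across both files.
my %names;
while (<$f1>) {
    chomp;
    $names{$_}++;
}
while (<$f2>) {
    chomp;
    $names{$_}++;
}

close($f1) || die "could not close file1 ($!)";
close($f2) || die "could not close file2 ($!)";

# A name counted exactly once appears in only one of the two files.
my @not_in_both = grep { $names{$_} == 1 } keys %names;
print "$_\n" for sort @not_in_both;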
Again, this relies on names only being present once in each file. You
could code around this requirement in your perl if you wanted, by
doing more checking of the input.
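
One way to code around the duplicates requirement is to record which
file each name came from rather than counting occurrences. The
following is only a rough sketch: the helper names not_in_both and
read_names are made up for illustration (they are not part of BioPerl),
and it assumes one accession per line with blank lines ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names that occur in exactly one of the
# two lists. Repeats within a single list are collapsed by the hashes.
sub not_in_both {
    my ($list1, $list2) = @_;
    my (%in1, %in2);
    $in1{$_} = 1 for @$list1;
    $in2{$_} = 1 for @$list2;
    my @only1  = grep { !$in2{$_} } keys %in1;
    my @only2  = grep { !$in1{$_} } keys %in2;
    my @result = sort(@only1, @only2);
    return @result;
}

# Hypothetical helper: read one name per line, skipping blank lines.
sub read_names {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "could not open $path ($!)";
    my @names;
    while (my $line = <$fh>) {
        chomp $line;
        push @names, $line if length $line;
    }
    close($fh) or die "could not close $path ($!)";
    return @names;
}

# Only runs when two file names are given on the command line.
if (@ARGV == 2) {
    my @first  = read_names($ARGV[0]);
    my @second = read_names($ARGV[1]);
    print "$_\n" for not_in_both(\@first, \@second);
}
```

Saved under some name of your choosing, it would be run with the two
accession files as arguments. A name repeated inside one file is still
reported if it never appears in the other file.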
Terry