[Bioperl-l] Homologene again...
Jason Stajich
jason@cgt.mc.duke.edu
Thu, 14 Feb 2002 21:54:56 -0500 (EST)
No, there aren't objects or parsers for this data in bioperl because
homologene is just a cluster of LocusLink IDs and accessions. I
tried to write a basic parser for my own needs last month and sort of gave
up on the data - been happier with the InParanoid orthologs for what I
needed in the end.
Happy to share what I started writing (note - I was only interested in
Drosophila orthologs of human genes, so the script is specific to that).
Hope this helps - I realize it is not at all sophisticated, and
there are a couple of records that it fails to parse because for some reason
the file doesn't follow the format all the way through - go figure...
-jason
#!/usr/bin/perl -w
use strict;
# This is from the Homologene readme
# The field delimiter is "|".
# -The first two fields indicate the organisms from which the sequences
# originate.
# -The third field indicates the type of similarity.
# -The fourth (LocusLink ID), fifth (UniGene ID), and sixth (Accession
# number) fields correspond to the first organism. One or both of UG ID
# and LL ID may be present. Locus Link and UniGene are in one-to-one
# correspondence in the latter case, so no ambiguity arises through the
# choice of set identifier.
# -The seventh(LL), eighth(UG), and ninth(Accession) fields correspond
# to the second organism.
# -The tenth field is the percent identity of the alignment, or a URL to
# the source of a curated ortholog.
# A similarity between organisms may be a best match of several
# different types, with the type of match indicated by the sixth
# character of the record.
# t indicates best match from the second field to the first. (when
# using the second sequence as query, the first sequence is the best
# match, with percent identity of alignments over 100 nt the score)
# f indicates best match to the second field from the first.
# (when using the first sequence as query, the second sequence is the
# best match)
# b indicates reciprocal best match (cluster pairs identified by f and t
# coincide).
# B indicates reinforced reciprocal best match (reciprocal best matches
# between at least three organisms agree).
# c indicates a curated homology (i.e., one that
# comes from outside NCBI or from a syntenic association,
# rather than one that is produced by an automatic process run at NCBI).
# Nota bene: many curated homologies are between genes rather than
# between accession numbers; consequently, we've chosen not to display
# accessions for all curated homologies, since the gene identifier-
# accession mapping is not always accurately resolvable.
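open(HGENE, "hmlg.trip.ftp") or die("cannot open hmlg.trip.ftp: $!");
# Each cluster starts on a line beginning with ">", so read one
# ">"-delimited record per iteration.
$/ = "\n>";
while( my $l = <HGENE> ) {
    my @data = split(/\n/,$l);
    my ($title,$gene);
    foreach my $line ( @data ) {
        last if( $gene && $title );
        next if( $line =~ /^>/ );
        if( $line =~ /^TITLE/ && $line =~ /Hs\./ ) {
            # keep the second whitespace-separated token of the TITLE line
            (undef,$title) = split(/\s+/,$line);
        } else {
            next unless ( $line =~ /Dm/ );
            # NB: the undef after $acc_a skips one field that the readme
            # above does not list; presumably this matches the actual file.
            my ($speciesa,$speciesb, $matchtype,
                $lla,$uga,$acc_a,undef,
                $llb,$ugb,$acc_b, $pid) = split(/\|/,$line);
            # take the LocusLink ID of whichever side is Drosophila,
            # trimming surrounding whitespace
            if( lc($speciesa) eq 'dm' ) {
                $lla =~ s/^\s+(\S+)/$1/;
                $lla =~ s/(\S+)\s+$/$1/;
                $gene = $lla;
            } elsif( lc($speciesb) eq 'dm' ) {
                $llb =~ s/^\s+(\S+)/$1/;
                $llb =~ s/(\S+)\s+$/$1/;
                $gene = $llb;
            }
        }
    }
    if( $title && $gene ) {
        print "Title: $title Gene:$gene\n";
    }
}
close(HGENE);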
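For comparison, here is a rough sketch that follows the ten-field layout exactly as the readme comments above describe it, splitting one pairwise line into named fields and attaching a description of the match type. The helper name parse_pair_line, the hash keys, and the sample line (with placeholder IDs) are all made up for illustration and are not part of bioperl. Since the real file apparently does not follow the documented format all the way through, the sketch simply returns nothing for any line that does not have exactly ten fields:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Match-type codes as described in the Homologene readme.
my %MATCH_TYPE = (
    t => 'best match from second organism to first',
    f => 'best match from first organism to second',
    b => 'reciprocal best match',
    B => 'reinforced reciprocal best match',
    c => 'curated homology',
);

# Hypothetical helper: split one "|"-delimited pair line into a hash ref.
# Assumes exactly the ten fields listed in the readme; returns undef
# (in scalar context) for anything else.
sub parse_pair_line {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields so the count stays honest.
    my @f = split /\|/, $line, -1;
    return unless @f == 10;
    # Trim whitespace around every field.
    for (@f) { s/^\s+//; s/\s+$//; }
    my %rec;
    @rec{qw(species_a species_b match_type
            ll_a ug_a acc_a
            ll_b ug_b acc_b
            score)} = @f;
    $rec{match_desc} = exists $MATCH_TYPE{ $rec{match_type} }
                     ? $MATCH_TYPE{ $rec{match_type} }
                     : 'unknown match type';
    return \%rec;
}

# Invented sample line with placeholder identifiers (not real data).
my $sample = "Hs|Dm|b| 111 |Hs.111|ACC_A|222|Dm.222|ACC_B|75.0\n";
my $rec = parse_pair_line($sample);
print "$rec->{species_a} LL $rec->{ll_a} <-> ",
      "$rec->{species_b} LL $rec->{ll_b}: $rec->{match_desc}\n" if $rec;
```

Keeping the match type in the record makes it easy to filter on, say, only b and B pairs before trusting a line as an ortholog.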
On Fri, 15 Feb 2002, Andrew Macgregor wrote:
> Hello,
>
> I haven't had any feedback on whether bioperl can parse homologene
> files so I'm guessing maybe it can't. Is this the type of thing that
> you want bioperl to do or is it out of scope?
>
> Can anybody point me to perl scripts that do this? If not, I'll be
> writing something to do the job. Is this something that could/should
> get put in bioperl somewhere, or in scripts central or is there just
> not too much interest in doing this?
>
> Cheers, Andrew.
--
Jason Stajich
Duke University
jason@cgt.mc.duke.edu