[Bioperl-l] Reading sequences without parsing them

16 Jul 2001 09:40:20 -0700

I've been following the conversation and thought I'd add my $.02 from
tackling this problem 3 years ago.  We were doing regular imports of
the GenBank databases, mostly tracking NR, NT and dbEST.  The problem
was not just tracking which new sequences were coming in: some old
sequences would be updated and given a new accession number, and the
best case was when the annotation changed slightly and a new accession
was added with no change to the sequence.  Relying on the downloadable
diffs of the databases would eventually push us further and further
away from doing a fresh download, and we would be tracking duplicates
for the sake of keeping the accession history clean.

So we decided to do something similar to what Amir is trying to do (I
think) and create a database of checksums that represented each
sequence in the database.  We had a multi-part checksum, though, to
deal with each of the possible changes:

1)  A change in the sequence with no change in the description fields,
    just adding a new accession number.

2)  A change in one of the description/annotation fields, new accession
    but same sequence.

3)  A change in both plus a new accession.
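
A minimal sketch of how a two-part checksum could separate these cases,
not the original scheme (that code is lost, as noted below).  It assumes
MD5 for both parts, and the names classify_entry, %by_seq and %by_annot
are made up for illustration.  With checksums alone, case 3 cannot be
told apart from a brand-new entry, so the two are reported together.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical in-memory indexes: digest => accession last seen with it.
# A tied DBM hash could stand in for either one.
my (%by_seq, %by_annot);

# Classify an incoming entry by which part of its checksum is already known.
sub classify_entry {
    my ($accession, $annotation, $sequence) = @_;
    my $seq_sum   = md5_hex(uc $sequence);   # case-insensitive sequence digest
    my $annot_sum = md5_hex($annotation);

    my $seq_hit   = exists $by_seq{$seq_sum};
    my $annot_hit = exists $by_annot{$annot_sum};

    my $status =
        ( $seq_hit && $annot_hit ) ? 'unchanged'
      : $annot_hit                 ? 'sequence changed'      # case 1
      : $seq_hit                   ? 'annotation changed'    # case 2
      :                              'both changed or new';  # case 3, or new

    # Remember both parts so later imports can match against them.
    $by_seq{$seq_sum}     = $accession;
    $by_annot{$annot_sum} = $accession;
    return $status;
}
```

Matching on the annotation digest alone is only as reliable as the
annotation is distinctive; entries sharing identical description text
would collide in this sketch.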

Depending on which case it was, we did different things in managing our
internal databases.  As Ewan noted, DBM files suit the checksum
database; we used them and they worked well.  We coded our own checksum
scheme, which worked well for handling sequence information.  We could
have used something that already existed (MD5, CRC32, etc.), but we had
a couple of mathematicians in the group and the problem kept them happy
for a bit.  I wish I had the code to paste in here, but that was 2
companies ago...
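
For illustration only, a DBM-backed checksum store might look like the
sketch below.  It assumes SDBM_File (bundled with Perl) and MD5 rather
than the custom scheme described above, and update_checksum is a
hypothetical helper name.

```perl
use strict;
use warnings;
use Fcntl;                      # exports O_RDWR and O_CREAT
use SDBM_File;
use Digest::MD5 qw(md5_hex);

# Store the checksum of $text under $key in a DBM file.
# Returns true if the key is new or its stored checksum differs.
sub update_checksum {
    my ($dbm_path, $key, $text) = @_;
    tie my %sums, 'SDBM_File', $dbm_path, O_RDWR | O_CREAT, 0666
        or die "Cannot open $dbm_path: $!";
    my $new     = md5_hex($text);
    my $changed = !exists $sums{$key} || $sums{$key} ne $new;
    $sums{$key} = $new;
    untie %sums;
    return $changed;
}
```

Opening and closing the DBM file on every call keeps the sketch short;
a real import run would tie the hash once and reuse it.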

So, to try and summarize this mess a bit -- I think Ewan's outline is a
good one based on experience, but since the bioperl packages already
parse out the sequence and annotation information separately, I'd run a
composite checksum of the two so they could be addressed individually
with finer-grained control (or at least have the option to).
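
A rough sketch of such a composite checksum, working directly on the
flat file rather than through bioperl.  It assumes records end with a
"//" line and that the sequence block starts at an "ORIGIN" line
(GenBank) or an "SQ" line (EMBL/Swiss-Prot); split_record and
composite_checksums are hypothetical names.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Split one flat-file record into (annotation, sequence) at the line that
# opens the sequence block.
sub split_record {
    my ($record) = @_;
    my ($annotation, $sequence) =
        $record =~ /\A(.*?)^(?:ORIGIN|SQ)\b[^\n]*\n(.*)\z/ms;
    return ($record, '') unless defined $sequence;   # no sequence block found
    $sequence =~ s/[^A-Za-z]//g;    # drop position numbers, spaces and "//"
    return ($annotation, uc $sequence);
}

# Return one [annotation_md5, sequence_md5] pair per record on $fh.
sub composite_checksums {
    my ($fh) = @_;
    local $/ = "\n//\n";            # read one record at a time
    my @sums;
    while (my $record = <$fh>) {
        next unless $record =~ /\S/;
        my ($annotation, $sequence) = split_record($record);
        push @sums, [ md5_hex($annotation), md5_hex($sequence) ];
    }
    return \@sums;
}
```

Keeping the two digests as a pair lets a caller compare either half on
its own, which is the finer-grained control described above.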

-lee

On 16 Jul 2001 15:20:14 +0100, Ewan Birney wrote:
> On Mon, 16 Jul 2001, Karger, Amir wrote:
> 
> >   
> > So, sorry about the lack of clarity. Do a s/sequence/entry/g on my original
> > email.
> > 
> 
> There is no built-in way to do this nicely inside Bioperl.
> 
> options
> 
>    (a) use IO::String, but that will depend on how bioperl's write_seq
>        formats its output, which is not what you want: when we change
>        write_seq for a format, you will think all your sequences have
>        been updated
> 
>    (b) trust the built-in accession.version system for sequences, but not
>        for annotations
> 
>    (c) trust the Date line for annotation updates (available in swissprot,
>        embl, genbank)
> 
> 
> If you are paranoid you will need to write your own Digest::MD5 system
> based around a string from // to // in the files. This could perhaps
> become quite a nice system integrated into the SeqIO system: for example,
> I could imagine a complex system like:
> 
>    # fictional class 
>    use Bio::DB::AutoUpdate;
> 
>    my $auto = Bio::DB::AutoUpdate->new( -file   => 'some/file',
>                                         -md5    => '/some/place/with/md5',
>                                         -record => '//',
>                                         -seqio  => 'swiss',
>                                         -update => 1 # means update md5 on reading
>                                       );
> 
>    # auto update complies with the implicit SeqIO interface of next_seq
>    # but only gives back new MD5 entries
> 
>    while( (my $updated_entry = $auto->next_seq()) ) {
>       # do something with updated
>    }
> 
> 
> the MD5 is probably best implemented as a DBM file.
> 
> 
> If you wrote something like this that would be great! If you wait 6 months
> or so I'll probably get bored on a train sometime and might do it
> assuming half a ton of other interesting things are not happening ;)
> 
> 
> any other thoughts from people?
>    
> 
> > -Amir
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> > 
> 
> -----------------------------------------------------------------
> Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> <birney@ebi.ac.uk>. 
> -----------------------------------------------------------------