[Biojava-l] differences between read in sequence and stored sequence in database

Richard Holland holland at eaglegenomics.com
Thu Oct 30 14:07:42 UTC 2008


Hello.

Sorry for the delayed reply - I've been away on business all week.

A similar BioRuby issue (and its solution) is discussed here:

http://portal.open-bio.org/pipermail/bioruby/2004-March.txt

How did you parse the files in the first place? Did you use the new
GenBank parsers (BJX), or the older ones? This will help indicate
where the problem lies - the data will have been truncated at the
point it was parsed from file, so the data in your database will
reflect this and you'll have to reload it once the appropriate parser
has been fixed.

If it was the newer BJX parser, then the problem most probably lies in
this regex from org.biojavax.bio.seq.io.GenbankFormat, which can
probably be fixed in a similar manner to the Ruby equivalent discussed
in the posting above:

    protected static final Pattern sectp = Pattern.compile(
        "^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
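
To narrow this down, it may help to run the pattern on its own against
sequence lines on either side of the truncation point. A truncation at
100,000,020 bp is where the last 60-base line with an eight-digit
position number (99,999,961) ends, assuming the usual GenBank layout of
60 bases per line with the position right-justified in nine columns.
The following is only a sketch: the class name SectpProbe, the helper
keyAndValue, and the two sample lines are made up for illustration, and
only the pattern string comes from GenbankFormat.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectpProbe {
    // Pattern string copied from org.biojavax.bio.seq.io.GenbankFormat (see above).
    static final Pattern SECTP = Pattern.compile(
        "^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");

    // Hypothetical helper: returns {key, value} captured by the first
    // alternative, or null if the line does not match at all or matched
    // one of the two feature-qualifier alternatives instead.
    static String[] keyAndValue(String line) {
        Matcher m = SECTP.matcher(line);
        if (!m.matches() || m.group(2) == null) {
            return null;
        }
        return new String[] { m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Invented sample lines; real ORIGIN lines carry 60 bases each.
        String[] samples = {
            " 99999961 gatcgatcga gatcgatcga",  // eight-digit position, one leading space
            "100000021 gatcgatcga gatcgatcga"   // nine-digit position, no leading space
        };
        for (String line : samples) {
            String[] kv = keyAndValue(line);
            System.out.println("[" + line + "] -> "
                + (kv == null ? "no match" : kv[0] + " | " + kv[1]));
        }
    }
}
```

If both lines match, the pattern itself is probably not what drops the
nine-digit lines, and the code that calls it would be the next place to
look.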

Could someone volunteer to develop and test a fix? If you come up with
something, please commit it to the SVN trunk.

cheers,
Richard


2008/10/28 Gabrielle Doan <gabrielle_doan at gmx.net>:
> Hi all,
> concerning the problem described below, I have found out that this problem
> also occurred in BioRuby and was fixed in 2004.
> See:
> http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby
> Unfortunately I'm clueless about BioRuby. Does anybody recognize this
> problem or understand how it was solved in BioRuby?
>
> I am grateful for any hints.
>
> Cheers,
>
> Gabrielle
>
>
> -------- Original Message --------
> Subject: [Biojava-l] differences between read in sequence and stored
> sequence in database
> Date: Mon, 27 Oct 2008 13:57:03 +0100
> From: Gabrielle Doan <gabrielle_doan at gmx.net>
> To: biojava-l at biojava.org
>
> Hi all,
>
> I have a BioSQL database which contains all human chromosomes. For my
> current project I have to query for part of a sequence.
> As far as I know I can get the whole sequence from the entry
> Biosequence.Seq in the BioSQL schema. So I ran this query:
>
> SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;
>
> But this query did not yield the desired string, because the stored
> length of this biosequence is only 100,000,020 bp. I am confused by
> this discrepancy. I added all chromosomes to the database with the
> built-in BioJava method addRichSequence(RichSequence seq).
> From my raw data I know that this sequence should have a length of
> 140,279,252 bp. So where is the remaining part of my sequence? I have
> observed this discrepancy on all chromosomes which are longer than
> 100,000,020 bp.
>
> Here is an extract from my database:
> bioentry_id  description                                                         length
> 2            Homo sapiens mitochondrion, complete genome.                        16571
> 3            Homo sapiens chromosome Y, reference assembly, complete sequence.   57772954
> 4            Homo sapiens chromosome X, reference assembly, complete sequence.   100000020
> 5            Homo sapiens chromosome 22, reference assembly, complete sequence.  49691432
> 6            Homo sapiens chromosome 21, reference assembly, complete sequence.  46944323
> 7            Homo sapiens chromosome 20, reference assembly, complete sequence.  25960004
> 8            Homo sapiens chromosome 9, reference assembly, complete sequence.   100000020
> 9            Homo sapiens chromosome 7, reference assembly, complete sequence.   100000020
>
> Sequences shorter than 100,000,020 bp are stored correctly under
> Biosequence.seq.
>
> I am grateful for any hints that explain the behaviour of my database.
>
> Cheers,
>
> Gabrielle
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>



-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


