[Biojava-l] differences between read in sequence and stored sequence in database

Mon Nov 3 14:48:45 UTC 2008

Hi all,
I've changed the regular expression in 
org.biojavax.bio.seq.io.GenbankFormat from

<code>
protected static final Pattern sectp =
Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
<\code>

to

<code>
protected static final Pattern sectp =
Pattern.compile("^(\\s{0,8}([A-Za-z]+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
<\code>

as in BioRuby
(http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb.diff?r1=0.24&r2=0.25&cvsroot=bioruby).
But then features like D-loop can't be detected, because the key contains a
hyphen. So this is not the solution to my problem.
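
To illustrate, here is a minimal sketch that runs both patterns against a few lines. The class name, the helper method and the sample lines are made up for this mail; only the two regular expressions are copied from above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of BioJava.
class SectpRegexDemo {
    // Pattern as currently found in GenbankFormat (copied from above).
    static final Pattern ORIGINAL = Pattern.compile(
        "^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
    // BioRuby-style variant that only accepts letters in the key.
    static final Pattern LETTERS_ONLY = Pattern.compile(
        "^(\\s{0,8}([A-Za-z]+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");

    static boolean matches(Pattern p, String line) {
        return p.matcher(line).matches();
    }

    public static void main(String[] args) {
        String feature = "     D-loop          1..576";      // sample feature line
        String seqLine = "100000021 gtcctcgggc tccggcttgg";  // shortened sequence line

        System.out.println(matches(ORIGINAL, feature));      // true
        System.out.println(matches(LETTERS_ONLY, feature));  // false: '-' is not a letter
        System.out.println(matches(ORIGINAL, seqLine));      // true: number is taken as a key
        System.out.println(matches(LETTERS_ONLY, seqLine));  // false
    }
}
```

So the letters-only key stops sequence lines from being taken as section keys, but it also rejects feature keys such as D-loop.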
The reason for the truncation is readSection(BufferedReader br) in 
org.biojavax.bio.seq.io.GenbankFormat.

<snip>
             if (line==null || line.length()==0 || (!line.startsWith(" ") && linecount++>0)) {
                     // dump out last part of section
                     section.add(new String[]{currKey,currVal.toString()});
                     br.reset();
                     done = true;
<\snip>

The condition in the if-clause closes the current section at the first line
that doesn't begin with whitespace (once linecount is above zero), so this
line will be read

<snip>
  99999961  cccgcccaca cccctcggcc ctgccctctg gccatacagg ttctcggtgg tgttgaagag
<\snip>

and this line won't be read:
<snip>
100000021 gtcctcgggc tccggcttgg tgctcacgca cacaggaaag tcagcttctc ctgggagggc
<\snip>
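
The line starting at 99999961 carries 60 bases, so it ends at position 100,000,020, which is exactly the truncated length in the database. As a rough sketch of the behaviour (not the real readSection, which also handles keys, values and br.reset()), the following made-up class applies the same boolean expression to a shortened ORIGIN section. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; only the boolean expression is taken from readSection.
class SectionEndDemo {
    // Mirrors linecount in the snippet above: counts non-indented lines seen.
    private int linecount = 0;

    // Same condition as the original if-clause.
    boolean endsSection(String line) {
        return line == null || line.length() == 0
                || (!line.startsWith(" ") && linecount++ > 0);
    }

    // Returns how many lines are consumed before the section is closed.
    static int linesRead(List<String> lines) {
        SectionEndDemo demo = new SectionEndDemo();
        int n = 0;
        for (String line : lines) {
            if (demo.endsSection(line)) {
                break;
            }
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "ORIGIN",
            "  99999961  cccgcccaca cccctcggcc",
            "100000021 gtcctcgggc tccggcttgg",   // no leading whitespace
            "//");
        // Only ORIGIN and the 99999961 line are consumed.
        System.out.println(linesRead(lines)); // 2
    }
}
```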

If you change the if-statement to this:

<snip>
String firstSecKey = section.size() == 0 ? "" : ((String[])section.get(0))[0];

if (line==null || line.length()==0 || (!line.startsWith(" ") && linecount++>0
        && (!firstSecKey.equals(START_SEQUENCE_TAG) || line.startsWith(END_SEQUENCE_TAG))))
<\snip>

With this change the whole sequence is added to the database without
truncation. I have attached GenbankFormat.java to this mail. Since I'm not a
BioJava specialist, could somebody check the method for me and commit it?
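
For anyone who wants to check the logic without the attachment, here is a small sketch of the modified condition on its own. It assumes START_SEQUENCE_TAG is "ORIGIN" and END_SEQUENCE_TAG is "//" (please verify against GenbankFormat), and it passes firstSecKey in as a fixed parameter instead of reading it from the section list. The class and method names are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; only the boolean expression follows the patch above.
class PatchedSectionEndDemo {
    static final String START_SEQUENCE_TAG = "ORIGIN"; // assumed value
    static final String END_SEQUENCE_TAG = "//";       // assumed value

    private int linecount = 0;

    // Modified condition: inside the sequence section, a non-indented line
    // only ends the section if it starts with the end-of-sequence tag.
    boolean endsSection(String line, String firstSecKey) {
        return line == null || line.length() == 0
                || (!line.startsWith(" ") && linecount++ > 0
                    && (!firstSecKey.equals(START_SEQUENCE_TAG)
                        || line.startsWith(END_SEQUENCE_TAG)));
    }

    // Returns how many lines are consumed before the section is closed.
    static int linesRead(List<String> lines, String firstSecKey) {
        PatchedSectionEndDemo demo = new PatchedSectionEndDemo();
        int n = 0;
        for (String line : lines) {
            if (demo.endsSection(line, firstSecKey)) {
                break;
            }
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "ORIGIN",
            "  99999961  cccgcccaca cccctcggcc",
            "100000021 gtcctcgggc tccggcttgg",
            "//");
        System.out.println(linesRead(lines, "ORIGIN"));   // 3: stops only at "//"
        System.out.println(linesRead(lines, "FEATURES")); // 2: old behaviour elsewhere
    }
}
```

Sections other than the sequence section keep the old behaviour, so only the ORIGIN block is affected.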

Cheers,
Gabrielle

Richard Holland wrote:
> Hello.
> 
> Sorry for the delayed reply - I've been away on business all week.
> 
> The similar Ruby issue (and solution) is discussed here:
> 
> http://portal.open-bio.org/pipermail/bioruby/2004-March.txt
> 
> How did you parse the files in the first place? Did you use the new
> GenBank parsers (BJX), or the older ones? This will help indicate
> where the problem lies - the data will have been truncated at the
> point it was parsed from file, so the data in your database will
> reflect this and you'll have to reload it once the appropriate parser
> has been fixed.
> 
> If it was the newer BJX parser, then the problem most probably lies in
> this regex from org.biojavax.bio.seq.io.GenbankFormat, which can
> probably be fixed in a similar manner to the Ruby equivalent discussed
> in the posting above:
> 
>     protected static final Pattern sectp =
> Pattern.compile("^(\\s{0,8}(\\S+)\\s{1,7}(.*)|\\s{21}(/\\S+?)=(.*)|\\s{21}(/\\S+))$");
> 
> Could someone volunteer to develop and test a fix? If you come up with
> something, please commit it to the SVN trunk.
> 
> cheers,
> Richard
> 
> 
> 2008/10/28 Gabrielle Doan <gabrielle_doan at gmx.net>:
>> Hi all,
>> concerning the problem described below, I have found out that this problem
>> also occurred in BioRuby and was fixed in 2004.
>> See:
>> http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/bioruby/lib/bio/db.rb?cvsroot=bioruby
>> Unfortunately I'm clueless about BioRuby. Does anybody recognize this
>> problem or understand how it was solved in BioRuby?
>>
>> I am grateful for any hints.
>>
>> Cheers,
>>
>> Gabrielle
>>
>>
>> -------- Original Message --------
>> Subject: [Biojava-l] differences between read in sequence and stored
>> sequence in database
>> Date: Mon, 27 Oct 2008 13:57:03 +0100
>> From: Gabrielle Doan <gabrielle_doan at gmx.net>
>> To: biojava-l at biojava.org
>>
>> Hi all,
>>
>> I have a BioSQL database which contains all human chromosomes. For my
>> current project I have to query for a part of a sequence.
>> As far as I know I can get the whole sequence from the entry
>> Biosequence.Seq in the BioSQL schema. So I've made this query:
>>
>> SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;
>>
>> But this query didn't yield the desired string, because the length of
>> this biosequence is only 100,000,020 bp. I am very confused about why I get
>> such a discrepancy. I added all chromosomes to the database with the
>> built-in BioJava method addRichSequence(RichSequence seq).
>> From my raw data I know that this sequence should have a length of
>> 140,279,252 bp. So where is the remaining part of my sequence? I have
>> observed these discrepancies on all chromosomes which are longer than
>> 100,000,020 bp.
>>
>> Here is an excerpt from my database:
>> bioentry_id  description                                                          length
>> 2            Homo sapiens mitochondrion, complete genome.                         16571
>> 3            Homo sapiens chromosome Y, reference assembly, complete sequence.    57772954
>> 4            Homo sapiens chromosome X, reference assembly, complete sequence.    100000020
>> 5            Homo sapiens chromosome 22, reference assembly, complete sequence.   49691432
>> 6            Homo sapiens chromosome 21, reference assembly, complete sequence.   46944323
>> 7            Homo sapiens chromosome 20, reference assembly, complete sequence.   25960004
>> 8            Homo sapiens chromosome 9, reference assembly, complete sequence.    100000020
>> 9            Homo sapiens chromosome 7, reference assembly, complete sequence.    100000020
>>
>> Sequences smaller than 100,000,020 bp are correctly stored under
>> Biosequence.seq.
>>
>> I am grateful for any hints, which explain the behaviour of my database.
>>
>> Cheers,
>>
>> Gabrielle
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l