[Bioperl-l] acquiring a local refseq + index

Tue Jan 2 19:04:40 UTC 2007

> That seems like a real improvement over parsing the name out 
> of the text-entry. I'll use taxid = $seq->species->ncbi_taxid 
> from now on.
> 
> Thanks for that elucidation. :)

No problem.

> That leaves the error-throwing problem in Bio::DB::Flat, 
> which I encountered while making a local RefSeq BerkeleyDB index.
> 
> I suppose it remains worthwhile to prevent the indexing from 
> breaking on Bio::SeqIO instantiation (at least for the RefSeq 
> entry set), so I have put a simple fix on bugzilla that 
> prevents one more problem entry
> (NC_004822) from breaking the indexing process.
> 
> 
> Thanks,
> 
> Erikjan

I'll look into the bug fix; that particular record has an unusual taxonomic
name which may change at some point (Candidatus something-or-other,
likely).  Best that we don't rely on that supposition, though.

The way I see it we can go down two roads:

1)  Continue working on Bio::Species-related parsing (which I do not
support)
2)  Work towards Bio::Taxon-related parsing (which I do support).

Note that both the classification issue (first bug, now resolved) and the
SOURCE line issue (second bug, unresolved) are related to the older way of
parsing that we are trying to shift away from, namely reliance on record
data alone for taxonomic analyses.  I think we need to shift more towards
simpler, cleaner parsing and away from the tendency to add fixes based on
one sequence record failing, which is due to the overly complex parsing
scheme currently present.  As past fixes attest, there will always be
another sequence record with a weird name down the road that will break
parsing again!

For instance, the first bug could be solved by splitting the complete
classification string on ';' alone, since that is the delimiter used
between classification entries; the current code also substitutes the '.',
which causes an extra split and hence the parsing error.
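
As a rough, language-neutral sketch of that idea (Python here, not the
actual BioPerl parser code): join the wrapped classification lines, drop
only the single period that ends the lineage, and split on ';'.  The
function name and the sample lineage are made up for illustration, and
dropping just the final period is my assumption about what a fix would do.

```python
def split_classification(text):
    """Split a GenBank-style lineage string on ';' only (hypothetical helper)."""
    # Join wrapped lines into a single string.
    joined = " ".join(line.strip() for line in text.splitlines()).strip()
    # Assumption: only the one period that terminates the lineage is removed;
    # periods inside a name (e.g. "sp.") are left untouched.
    if joined.endswith("."):
        joined = joined[:-1]
    # ';' is the only delimiter; empty pieces are discarded.
    return [part.strip() for part in joined.split(";") if part.strip()]


# Invented lineage, for illustration only.
lineage = "Bacteria; Example group sp. A;\n    Examplebacteria."
print(split_classification(lineage))
```

With this approach a name containing a '.' stays in one piece, because the
period is never treated as a delimiter.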

The second bug could be solved by simply assigning the SOURCE name to
scientific_name (or node_name), and any data in parentheses to
common_name(); organelles would be parsed out as well.  No more subparsing
fixes based on trying to work out genus/species/subsp/etc, which is where
this bug occurs.  
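
A minimal sketch of that simpler SOURCE handling, again in Python rather
than BioPerl and only under stated assumptions: parse_source() and the
ORGANELLES list are hypothetical names, the organelle list is deliberately
incomplete, and only a single trailing parenthesised group is treated as
the common name.  No attempt is made to work out genus/species/subspecies.

```python
import re

# Assumption: a short, incomplete list of organelle prefixes.
ORGANELLES = ("mitochondrion", "chloroplast", "plastid")


def parse_source(value):
    """Return (organelle, scientific_name, common_name) for a SOURCE value."""
    text = value.strip()
    organelle = None
    # Parse out a leading organelle word, if one is present.
    for prefix in ORGANELLES:
        if text.startswith(prefix + " "):
            organelle = prefix
            text = text[len(prefix):].strip()
            break
    # A trailing "(...)" group, if any, is taken as the common name.
    match = re.match(r"^(.*?)\s*\(([^()]*)\)\s*$", text)
    if match:
        return organelle, match.group(1), match.group(2)
    # Otherwise the whole remaining string is the scientific name.
    return organelle, text, None


print(parse_source("Homo sapiens (human)"))
print(parse_source("mitochondrion Homo sapiens (human)"))
```

Anything that does not end in a simple parenthesised group (including
nested parentheses) is kept whole as the scientific name, so an odd name
cannot break the parse; it just ends up with no common name.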

Maybe I'm alone in that.  Sendu?  Any thoughts?

chris