[Bioperl-l] arabidopsis + load_seqdatabase.pl
Hilmar Lapp
hlapp at gmx.net
Tue Dec 20 13:28:27 EST 2005
100% agreed.
Angshu, sometimes it goes a long way if you are precise in the way you
state things; otherwise either people have great difficulty understanding
what exactly your problem is, or you come across as clueless, or both.
So, do keep in mind that it is the Bioperl SeqIO parser that does the
parsing, not load_seqdatabase.pl. Also, load_seqdatabase.pl doesn't
manipulate the sequence object returned by the parser, nor does
bioperl-db. If accession, identifier, and name have the same values,
then that is what the SeqIO parser does - and probably it does so for a
reason. If you don't like the way SeqIO builds the object then write
your own SeqProcessor as mentioned before and you are free to entirely
rearrange the object. If you feel the SeqIO parser is in error then
file or post a bug report, in which you will need to state what exactly
it is that you find to be in error.
So, please stop asking for files to be parsed correctly - they *are*
parsed correctly. Instead, take a moment to step back and read Sean's
email again and then stick to the advice given there:
1) What exactly is it out of those or any other files that you want
represented in biosql? Not a single one of your answers indicates that
you know the answer to this question. In other words, not a single one
of your answers indicates that you know precisely what you want. How do
you expect others to help you achieve what you want if you don't even
know what you want, let alone are able to explain it to others?
2) What have you tried to get what you need? Why did the outcome of
those attempts fall short of what you want? When doing so, do not label
software you used in the process as yielding 'incorrect' results unless
you can back that up with a solid bug report, because almost always the
one who is 'incorrect' is you, expecting things that you shouldn't have
expected, or executing things wrongly.
For example, you could state that 1) what you want is all A.thaliana
transcripts in a biosql database, with each bioentry ideally being a
CDS, or at least a transcript, with as much annotation as contained in
the input file, and that 2) you used the NC_* records in GenBank format
with load_seqdatabase.pl but found yourself with only the contigs as
bioentries, not the transcripts or CDS records.
At this point, most people will have understood what your goal is, and
people more experienced in bioperl will also have understood that you
fell for a common misconception many people new to bioperl have, namely
to confuse features with the main entry (i.e., sequence).
It would then have been straightforward to point out that your desired
CDS annotation is surely present in your Biosql instance (if annotated
in the NC_* record), but as rows in seqfeature, because they were
features on the sequences as they came out of the parser.
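
To make the bioentry/seqfeature distinction concrete, here is a minimal
sketch in Python using sqlite3 and a heavily simplified subset of the
BioSQL tables. The table and column names are meant to mirror the BioSQL
schema, but the subset itself, the accession NC_EXAMPLE, the locus tag,
and the product string are all made up for illustration; this is not
output from load_seqdatabase.pl.

```python
import sqlite3

# Simplified, hypothetical subset of BioSQL (the real schema has more
# tables and columns, e.g. location and biodatabase).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    name TEXT, accession TEXT, identifier TEXT, version INTEGER
);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id INTEGER REFERENCES bioentry(bioentry_id),
    type_term_id INTEGER REFERENCES term(term_id),
    rank INTEGER
);
CREATE TABLE seqfeature_qualifier_value (
    seqfeature_id INTEGER REFERENCES seqfeature(seqfeature_id),
    term_id INTEGER REFERENCES term(term_id),
    rank INTEGER,
    value TEXT
);
""")

# One contig-level record: this is the only row in bioentry.
conn.execute(
    "INSERT INTO bioentry VALUES (1, 'NC_EXAMPLE', 'NC_EXAMPLE', '0000001', 1)"
)
conn.executemany(
    "INSERT INTO term VALUES (?, ?)",
    [(1, "CDS"), (2, "gene"), (3, "locus_tag"), (4, "product")],
)
# The gene and the CDS are features attached to that one bioentry.
conn.executemany(
    "INSERT INTO seqfeature VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 1), (2, 1, 1, 2)],
)
conn.executemany(
    "INSERT INTO seqfeature_qualifier_value VALUES (?, ?, ?, ?)",
    [
        (1, 3, 1, "EXAMPLE_g00010"),
        (2, 3, 1, "EXAMPLE_g00010"),
        (2, 4, 1, "hypothetical protein"),
    ],
)


def cds_qualifiers(conn):
    """Return (accession, feature type, qualifier, value) for CDS features."""
    return conn.execute("""
        SELECT e.accession, t.name, q.name, v.value
        FROM seqfeature f
        JOIN bioentry e ON e.bioentry_id = f.bioentry_id
        JOIN term t ON t.term_id = f.type_term_id
        JOIN seqfeature_qualifier_value v ON v.seqfeature_id = f.seqfeature_id
        JOIN term q ON q.term_id = v.term_id
        WHERE t.name = 'CDS'
        ORDER BY f.rank, q.name
    """).fetchall()


bioentry_count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
print("bioentries:", bioentry_count)
for row in cds_qualifiers(conn):
    print(row)
```

The point of the sketch is only where to look: the CDS names never show
up in bioentry, but they can be reached by joining seqfeature to its
qualifier values.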
It would also be straightforward to suggest a solution to this problem,
namely by either writing a SeqProcessor that converts the CDS features
of the contig sequences to first-class sequence objects (that's for
instance what I do for the EMBL formatted Ensembl dumps), or by using
an input file that has transcripts as primary records instead of as
feature annotation on a contig.
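
The first option can be sketched roughly as follows. This is plain
Python with hypothetical Feature and Record classes, not the BioPerl
SeqProcessor API; in BioPerl the same logic would live in a SeqProcessor
module. It assumes forward-strand, single-interval CDS locations (real
CDS features often have joined locations and strand information), and
the fallback accession naming is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a sequence feature."""
    primary_tag: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    qualifiers: dict = field(default_factory=dict)


@dataclass
class Record:
    """Hypothetical stand-in for a sequence entry as returned by a parser."""
    accession: str
    seq: str
    features: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def promote_cds(contig):
    """Return one new first-class Record per CDS feature on the contig."""
    promoted = []
    cds_features = [f for f in contig.features if f.primary_tag == "CDS"]
    for n, feat in enumerate(cds_features, start=1):
        # Prefer the protein_id qualifier; otherwise fall back to a
        # made-up accession derived from the contig (illustrative only).
        accession = feat.qualifiers.get(
            "protein_id", f"{contig.accession}.cds{n}"
        )
        promoted.append(
            Record(
                accession=accession,
                # Convert 1-based inclusive coordinates to a Python slice.
                seq=contig.seq[feat.start - 1:feat.end],
                # Carry the feature's qualifiers along as annotation.
                annotations=dict(feat.qualifiers),
            )
        )
    return promoted


# Invented toy contig: one gene and one CDS spanning bases 3..11.
contig = Record(
    accession="NC_EXAMPLE",
    seq="AAATGGCCTAAGG",
    features=[
        Feature("gene", 3, 11),
        Feature("CDS", 3, 11, {"protein_id": "EXAMPLE_P1"}),
    ],
)

records = promote_cds(contig)
for rec in records:
    print(rec.accession, rec.seq)
```

Each returned record would then be loaded as its own bioentry, instead
of the single contig entry carrying the CDS as a feature.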
Given that, you could then set out to locate GenBank or EMBL formatted
files containing A.thaliana transcripts as their records, instead of
asking others to search the internet for you.
I'm afraid that as far as I'm concerned I won't be able to lend you any
more of my time unless you are specific and precise in stating what
your goal is and what you got.
-hilmar
On Dec 20, 2005, at 9:35 AM, Brian Osborne wrote:
> Angshu,
>
>> I want them to be correctly parsed.
>
> They have been correctly parsed but you're looking in the wrong place. The
> names and identifiers associated with things like "CDS" or "gene" will not
> be found in the Bioentry table. The Bioentry is the entire NC_* record, the
> genes, mRNAs, and proteins are called features. Read the Feature-Annotation
> HOWTO and doc/schema-overview.txt in the biosql package.
>
> Brian O.
>
>
> On 12/19/05 5:39 PM, "Angshu Kar" <angshu96 at gmail.com> wrote:
>
>> Sean,
>>
>> I've tried .faa, .fna and .gbk files in the link mentioned below. After
>> running the script when I saw the loaded database, I saw that in the
>> bioentry table the 3 fields accession, identifier and name containing the
>> same data. Also, the version column was not populated. I want them to be
>> correctly parsed. So I want an arabidopsis data file that "goes well" with
>> the load_seqdatabase.pl script.
>>
>> Thanks,
>> Angshu
>>
>>
>> On 12/19/05, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>>
>>>
>>>
>>>
>>> On 12/19/05 3:20 PM, "Angshu Kar" <angshu96 at gmail.com> wrote:
>>>
>>>> Sean,
>>>>
>>>> I've used files from
>>>> ftp://ftp.ncbi.nih.gov/genomes/Arabidopsis_thaliana/CHR_V . But the script
>>>> cannot parse them according to biosql-schema.
>>>> So, I want some files that the script can parse correctly.
>>>> Else, I've to load each and every file onto the biodb and then check whether
>>>> it has been parsed correctly!
>>>
>>> Which file are you trying to load? What format is it in? What values are
>>> you expecting to be loaded that aren't? For the answer to the last
>>> question, it will likely help folks to see exactly what line of the input
>>> file isn't being loaded as you think it should be. For example, if there is
>>> a line in a file that contains
>>>
>>> foo /note="bar"
>>>
>>> Then you can point out that you would like to know where, if at all, the
>>> annotation associated with the foo tag is stored.
>>>
>>> Sean
>>>
>>>
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at portal.open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------