[Bioperl-l] arabidopsis + load_seqdatabase.pl

Tue Dec 20 13:34:38 EST 2005

Thanks Hilmar.

On 12/20/05, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> 100% agreed.
>
> Angshu, sometimes it goes a long way if you are precise in the way you
> state things. Otherwise people have great difficulty understanding what
> exactly your problem is, or you come across as clueless, or both.
>
> So, do keep in mind that it is the Bioperl SeqIO parser that does the
> parsing, not load_seqdatabase.pl. Also, load_seqdatabase.pl doesn't
> manipulate the sequence object returned by the parser, nor does
> bioperl-db. If accession, identifier, and name have the same values,
> then that is what the SeqIO parser does - and probably it does so for a
> reason. If you don't like the way SeqIO builds the object then write
> your own SeqProcessor as mentioned before and you are free to entirely
> rearrange the object. If you feel the SeqIO parser is in error then
> file or post a bug report, in which you will need to state exactly what
> it is that you find to be in error.
>
> So, please stop asking for files to be parsed correctly - they *are*
> parsed correctly. Instead, take a moment to step back and read Sean's
> email again and then stick to the advice given there:
>
>        1) What exactly is it out of those or any other files that you want
> represented in biosql? Not a single one of your answers indicates that
> you know the answer to this question. In other words, not a single one
> of your answers indicates that you know precisely what you want. How do
> you expect others to help you achieve what you want if you don't even
> know what you want, let alone know how to explain it to others?
>
>        2) What have you tried to get what you need? Why did the outcome of
> those attempts fall short of what you want? When doing so, do not label
> software you used in the process as yielding 'incorrect' results unless
> you can back that up with a solid bug report, because almost always the
> one who is 'incorrect' is you, expecting things that you shouldn't have
> expected, or executing things wrongly.
>
> For example, you could state that 1) what you want is all A.thaliana
> transcripts in a biosql database, with each bioentry ideally being a
> CDS, or at least a transcript, with as much annotation as contained in
> the input file, and that 2) you used the NC_* records in GenBank format
> with load_seqdatabase.pl but found yourself with only the contigs as
> bioentries, not the transcripts or CDS records.
>
> At this point, most people will have understood what your goal is, and
> people more experienced in bioperl will also have understood that you
> fell for a common misconception many people new to bioperl have, namely
> to confuse features with the main entry (i.e., sequence).
>
> It would then have been straightforward to point out that your desired
> CDS annotation is surely present in your Biosql instance (if annotated
> in the NC* record), but as rows in seqfeature because they were
> features on the sequences as they came out of the parser.
>
> It would also be straightforward to suggest a solution to this problem,
> namely by either writing a SeqProcessor that converts the CDS features
> of the contig sequences to first-class sequence objects (that's for
> instance what I do for the EMBL formatted Ensembl dumps), or by using
> an input file that has transcripts as primary records instead of as
> feature annotation on a contig.
>
> Given that, you could then set out to locate GenBank or EMBL formatted
> files containing A.thaliana transcripts as their records, instead of
> asking others to search the internet for you.
>
> I'm afraid that as far as I'm concerned I won't be able to lend you any
> more of my time unless you are specific and precise in stating what
> your goal is and what you got.
>
>        -hilmar
>
> On Dec 20, 2005, at 9:35 AM, Brian Osborne wrote:
>
> > Angshu,
> >
> >> I want them to be correctly parsed.
> >
> > They have been correctly parsed but you're looking in the wrong place. The
> > names and identifiers associated with things like "CDS" or "gene" will not
> > be found in the Bioentry table. The Bioentry is the entire NC_* record; the
> > genes, mRNAs, and proteins are called features. Read the Feature-Annotation
> > HOWTO and doc/schema-overview.txt in the biosql package.
> >
> > Brian O.
> >
> >
> > On 12/19/05 5:39 PM, "Angshu Kar" <angshu96 at gmail.com> wrote:
> >
> >> Sean,
> >>
> >> I've tried the .faa, .fna and .gbk files in the link mentioned below. After
> >> running the script, when I looked at the loaded database, I saw that in the
> >> bioentry table the 3 fields accession, identifier and name contain the
> >> same data. Also, the version column was not populated. I want them to be
> >> correctly parsed. So I want an arabidopsis data file that "goes well" with
> >> the load_seqdatabase.pl script.
> >>
> >> Thanks,
> >> Angshu
> >>
> >>
> >> On 12/19/05, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> >>>
> >>> On 12/19/05 3:20 PM, "Angshu Kar" <angshu96 at gmail.com> wrote:
> >>>
> >>>> Sean,
> >>>>
> >>>> I've used files from
> >>>> ftp://ftp.ncbi.nih.gov/genomes/Arabidopsis_thaliana/CHR_V . But the script
> >>>> cannot parse them according to biosql-schema.
> >>>> So, I want some files that the script can parse correctly.
> >>>> Else, I've to load each and every file onto the biodb and then check whether
> >>>> it has been parsed correctly!
> >>>
> >>> Which file are you trying to load?  What format is it in?  What values are
> >>> you expecting to be loaded that aren't?  For the answer to the last
> >>> question, it will likely help folks to see exactly what line of the input
> >>> file isn't being loaded as you think it should be.  For example, if there is
> >>> a line in a file that contains
> >>>
> >>>     foo         /note="bar"
> >>>
> >>> Then you can point out that you would like to know where, if at all, the
> >>> annotation associated with the foo tag is stored.
> >>>
> >>> Sean
> >>>
> >>>
> >>>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at portal.open-bio.org
> >> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
> --
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
>
>
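
The distinction drawn in the quoted message, between a top-level entry and
the features annotated on it, can be illustrated with a small sketch. This
is plain Python and not BioPerl or bioperl-db code: the Record and Feature
classes, the promote_cds function, the accession NC_TOY and the sequence are
all invented for illustration. It also assumes simple forward-strand,
single-span locations, which real CDS features often are not. It shows only
the idea behind the suggested SeqProcessor: loading the contig yields one
entry, and the CDS features become entries of their own only after an
explicit conversion step.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """A feature annotated on a parent sequence (hypothetical toy type)."""
    kind: str    # e.g. "gene", "mRNA", "CDS"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    qualifiers: dict = field(default_factory=dict)


@dataclass
class Record:
    """A top-level sequence entry, the analogue of one bioentry row."""
    accession: str
    seq: str
    features: list = field(default_factory=list)


def promote_cds(record):
    """Turn each CDS feature of a contig record into its own record."""
    cds_features = (f for f in record.features if f.kind == "CDS")
    promoted = []
    for n, feat in enumerate(cds_features, start=1):
        # Prefer an identifier from the qualifiers; otherwise derive one
        # from the parent accession (a made-up naming scheme).
        acc = feat.qualifiers.get("protein_id", f"{record.accession}.cds{n}")
        # Convert the 1-based inclusive location to a Python slice.
        sub = record.seq[feat.start - 1:feat.end]
        promoted.append(Record(accession=acc, seq=sub))
    return promoted


# A toy contig: two coding regions separated by spacer bases.
contig = Record(
    accession="NC_TOY",  # invented, not a real GenBank accession
    seq="AA" + "ATGGCCTAA" + "GGG" + "ATGTTTTGA" + "CC",
    features=[
        Feature("gene", 3, 11, {"gene": "toyA"}),
        Feature("CDS", 3, 11, {"protein_id": "TOY_0001"}),
        Feature("CDS", 15, 23),
    ],
)

entries = [contig]                  # what loading the contig as-is gives
transcripts = promote_cds(contig)   # what the conversion step gives

for rec in entries + transcripts:
    print(rec.accession, len(rec.seq), len(rec.features))
```

Without the conversion step there is a single entry carrying three features;
with it there are two additional entries, one per CDS. A real processor
would also need to handle strands and split locations and carry over the
annotation, all of which this sketch leaves out.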