<div dir="ltr">Hi John,<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11825">I
don't know if this will help, but I recently had a list of proteins for
which I wanted the mRNA or CDS for each one so that I could use the RNA.
(`mRNA` meaning someone entered a specific corresponding GenBank entry
described as the mRNA, and CDS meaning the sequence extracted using the `coded_by`
information.) I found some of the same issues you seem to be describing
and worked out ways around them, I think. The program tries more
aggressive and less efficient methods as it gets to the tougher and tougher
ones to extract. I tried to make it so it doesn't give up. It probably
isn't perfect yet, but at the time it would easily get several hundred
starting from the NCBI-sourced FASTA sequences for the proteins. (The
sequence itself isn't important, but the description line is: the program
extracts an ID from it.) It even validates them to make sure they
encode the original protein using the correct one of the 24 genetic
codes.<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11827"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11829">You
can check the code out at
<a href="https://github.com/fomightez/sequencework/blob/master/RetrieveSeq/GetmRNAorCDSforProtein.py">https://github.com/fomightez/sequencework/blob/master/RetrieveSeq/GetmRNAorCDSforProtein.py</a>.
The description is at
<a href="https://github.com/fomightez/sequencework/tree/master/RetrieveSeq">https://github.com/fomightez/sequencework/tree/master/RetrieveSeq</a>.<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11831"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11833">Feel
free to adapt it, or let me know if you'd like help testing it with
your data or adapting it to what you have
as starting material.<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11835"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11837">Wayne<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11839"><br><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11841">Date: Fri, 18 Sep 2015 14:30:05 +0000<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11843">From: "Athey, John *" <<a href="mailto:John.Athey@fda.hhs.gov">John.Athey@fda.hhs.gov</a>><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11845">To: "<a href="mailto:biopython@mailman.open-bio.org">biopython@mailman.open-bio.org</a>" <<a href="mailto:biopython@mailman.open-bio.org">biopython@mailman.open-bio.org</a>><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11847">Subject: [Biopython] Handling records referencing other records<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11849">Message-ID:<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11851"> <<a href="mailto:5D5BA0385615F148A9D2FD86BB656F700FEAF9F8@FDSWV09433.fda.gov">5D5BA0385615F148A9D2FD86BB656F700FEAF9F8@FDSWV09433.fda.gov</a>><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11853">Content-Type: text/plain; charset="us-ascii"<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11855"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11857">Hello all,<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11859"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11861">I'm
looking for advice on how to handle Genbank records that reference
other records as part of their location. My program iterates through
large Genbank-formatted files with SeqIO.parse and extracts the CDS for
subsequent analysis, using feat.extract(). However, upon hitting a
record where the feature location references another record, it
SOMETIMES fails. For example,
<a href="http://www.ncbi.nlm.nih.gov/nuccore/DQ100169">http://www.ncbi.nlm.nih.gov/nuccore/DQ100169</a> seems to be handled
correctly, while <a href="http://www.ncbi.nlm.nih.gov/nuccore/DQ100170">http://www.ncbi.nlm.nih.gov/nuccore/DQ100170</a> gives a
"ValueError: Feature references another sequence." Curiously, in both
cases the CDS feature itself doesn't specify another record, only the
parent gene does.<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11863"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11865">My questions about this are:<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11867"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11869">1) Why does the extraction fail on some records but not on all of them?<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11871"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11873">2) Is there a way to extract the data I'm looking for without causing this error?<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11875"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11877">3)
If the answer to (2) is no, is there some other way to check whether
the sequence will cause this error, skip extracting that sequence, and
exclude that record from the analysis?<br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11879"><br class="" id="yiv9797058960yui_3_16_0_1_1442583793358_11881">Thanks for any help you can provide!<br></div>
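On question (3) in the quoted message, here is a minimal sketch of one way to skip features whose extraction fails, assuming Biopython is installed. The names `extract_or_skip` and `iter_cds` are made up for illustration and are not part of Biopython or of the script linked above. Note that the `except` clause catches any `ValueError` raised by `extract()`, not only the "Feature references another sequence" case, so it may hide other problems.

```python
def extract_or_skip(feature, parent_seq):
    """Return the extracted feature sequence, or None if extract() raises ValueError.

    Hypothetical helper: `feature` is expected to behave like a Biopython
    SeqFeature, i.e. to have an extract(parent_sequence) method.
    """
    try:
        return feature.extract(parent_seq)
    except ValueError:
        # Assumption: a ValueError here means the location points at another
        # record, as in the error quoted in the message. Other ValueErrors
        # are skipped as well.
        return None


def iter_cds(path):
    """Yield (record id, CDS sequence) pairs from a GenBank file,
    leaving out CDS features that could not be extracted."""
    # Imported here so extract_or_skip() itself has no dependencies.
    from Bio import SeqIO

    for record in SeqIO.parse(path, "genbank"):
        for feature in record.features:
            if feature.type != "CDS":
                continue
            seq = extract_or_skip(feature, record.seq)
            if seq is not None:
                yield record.id, seq
```

You would call it as `for rec_id, cds in iter_cds("my_records.gb"): ...`, where the file name is a placeholder. This only answers the "skip it" part; it does not fetch the referenced record.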