<div dir="ltr"><div class="markdown-here-wrapper" style=""><p style="margin:0px 0px 1.2em!important">If I understand correctly, your files are lists of names, one name per line. Python has a built-in <code style="font-size:0.85em;font-family:Consolas,Inconsolata,Courier,monospace;margin:0px 0.15em;padding:0px 0.3em;white-space:pre-wrap;border:1px solid rgb(234,234,234);border-radius:3px;display:inline;background-color:rgb(248,248,248)">set</code> type, which supports set operations (intersection, union, difference). Here’s an example of how to use sets to solve your problem:</p>
<pre style="font-size:0.85em;font-family:Consolas,Inconsolata,Courier,monospace;font-size:1em;line-height:1.2em;margin:1.2em 0px"><code class="hljs language-python" style="font-size:0.85em;font-family:Consolas,Inconsolata,Courier,monospace;margin:0px 0.15em;padding:0px 0.3em;white-space:pre-wrap;border:1px solid rgb(234,234,234);border-radius:3px;display:inline;background-color:rgb(248,248,248);white-space:pre;overflow:auto;border-radius:3px;border:1px solid rgb(204,204,204);padding:0.5em 0.7em;display:block!important;display:block;overflow-x:auto;color:rgb(104,97,94);padding:0.5em;background:rgb(241,239,238)">database_proteins_file = <span class="hljs-string" style="color:rgb(90,183,56)">&quot;sorted.a&quot;</span>
organism_proteins_file = <span class="hljs-string" style="color:rgb(90,183,56)">&quot;sorted.b&quot;</span>

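<span class="hljs-comment" style="color:rgb(118,110,107)"># Read each file into a set of unique names. strip() drops the trailing newline,</span>
<span class="hljs-comment" style="color:rgb(118,110,107)"># so the last line still matches even if the file does not end with one</span>
<span class="hljs-keyword" style="color:rgb(102,102,234)">with</span> open(database_proteins_file) <span class="hljs-keyword" style="color:rgb(102,102,234)">as</span> db_file:
    database_proteins = set(line.strip() <span class="hljs-keyword" style="color:rgb(102,102,234)">for</span> line <span class="hljs-keyword" style="color:rgb(102,102,234)">in</span> db_file)
<span class="hljs-keyword" style="color:rgb(102,102,234)">with</span> open(organism_proteins_file) <span class="hljs-keyword" style="color:rgb(102,102,234)">as</span> org_file:
    organism_proteins = set(line.strip() <span class="hljs-keyword" style="color:rgb(102,102,234)">for</span> line <span class="hljs-keyword" style="color:rgb(102,102,234)">in</span> org_file)

<span class="hljs-keyword" style="color:rgb(102,102,234)">with</span> open(<span class="hljs-string" style="color:rgb(90,183,56)">&quot;common_proteins&quot;</span>, <span class="hljs-string" style="color:rgb(90,183,56)">&quot;w&quot;</span>) <span class="hljs-keyword" style="color:rgb(102,102,234)">as</span> common_file: <span class="hljs-comment" style="color:rgb(118,110,107)"># the with statement ensures the file is closed when the block ends</span>
    <span class="hljs-comment" style="color:rgb(118,110,107)"># the &amp; operator on set objects computes the intersection. Alternatively</span>
    <span class="hljs-comment" style="color:rgb(118,110,107)"># you could write database_proteins.intersection(organism_proteins)</span>
    <span class="hljs-keyword" style="color:rgb(102,102,234)">for</span> prot <span class="hljs-keyword" style="color:rgb(102,102,234)">in</span> sorted(database_proteins &amp; organism_proteins):
        common_file.write(prot + <span class="hljs-string" style="color:rgb(90,183,56)">&quot;\n&quot;</span>)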
</code></pre>
<p style="margin:0px 0px 1.2em!important">However, this only produces the list of common proteins. To build the 0/1 matrix you described, you can adjust the code above:</p>
<pre style="font-size:0.85em;font-family:Consolas,Inconsolata,Courier,monospace;font-size:1em;line-height:1.2em;margin:1.2em 0px"><code class="hljs language-python" style="font-size:0.85em;font-family:Consolas,Inconsolata,Courier,monospace;margin:0px 0.15em;padding:0px 0.3em;white-space:pre-wrap;border:1px solid rgb(234,234,234);border-radius:3px;display:inline;background-color:rgb(248,248,248);white-space:pre;overflow:auto;border-radius:3px;border:1px solid rgb(204,204,204);padding:0.5em 0.7em;display:block!important;display:block;overflow-x:auto;color:rgb(104,97,94);padding:0.5em;background:rgb(241,239,238)">database_proteins_file = <span class="hljs-string" style="color:rgb(90,183,56)">&quot;sorted.a&quot;</span>
organism_proteins_file = <span class="hljs-string" style="color:rgb(90,183,56)">&quot;sorted.b&quot;</span>

<span class="hljs-comment" style="color:rgb(118,110,107)"># Read each file into a set of unique names, stripping the trailing newline from each line</span>
<span class="hljs-keyword" style="color:rgb(102,102,234)">with</span> open(database_proteins_file) <span class="hljs-keyword" style="color:rgb(102,102,234)">as</span> db_file:
    database_proteins = set(line.strip() <span class="hljs-keyword" style="color:rgb(102,102,234)">for</span> line <span class="hljs-keyword" style="color:rgb(102,102,234)">in</span> db_file)
<span class="hljs-keyword" style="color:rgb(102,102,234)">with</span> open(organism_proteins_file) <span class="hljs-keyword" style="color:rgb(102,102,234)">as</span> org_file:
    organism_proteins = set(line.strip() <span class="hljs-keyword" style="color:rgb(102,102,234)">for</span> line <span class="hljs-keyword" style="color:rgb(102,102,234)">in</span> org_file)

<span class="hljs-keyword" style="color:rgb(102,102,234)">with</span> open(<span class="hljs-string" style="color:rgb(90,183,56)">&quot;output_matrix&quot;</span>, <span class="hljs-string" style="color:rgb(90,183,56)">&quot;w&quot;</span>) <span class="hljs-keyword" style="color:rgb(102,102,234)">as</span> matrix_file:
    <span class="hljs-keyword" style="color:rgb(102,102,234)">for</span> prot <span class="hljs-keyword" style="color:rgb(102,102,234)">in</span> sorted(database_proteins):
        present = prot <span class="hljs-keyword" style="color:rgb(102,102,234)">in</span> organism_proteins <span class="hljs-comment" style="color:rgb(118,110,107)"># the in operator tests set membership, as it does for dict keys</span>
        flag = <span class="hljs-string" style="color:rgb(90,183,56)">&quot;1&quot;</span> <span class="hljs-keyword" style="color:rgb(102,102,234)">if</span> present <span class="hljs-keyword" style="color:rgb(102,102,234)">else</span> <span class="hljs-string" style="color:rgb(90,183,56)">&quot;0&quot;</span> <span class="hljs-comment" style="color:rgb(118,110,107)"># conditional expression, sometimes called the ternary operator</span>
        matrix_file.write(prot + <span class="hljs-string" style="color:rgb(90,183,56)">&quot; &quot;</span> + flag + <span class="hljs-string" style="color:rgb(90,183,56)">&quot;\n&quot;</span>)
</code></pre>
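<p style="margin:0px 0px 1.2em!important">If you want to check the logic before running it on your real files, here is a small sketch that applies the same idea to the sample data from your message, entirely in memory. The function name <code>presence_matrix</code> and the variable names are just ones I picked for illustration, and I am assuming the names contain no spaces, so a single space can separate the name from the flag. It also shows the set difference operator, which lists the database entries missing from the organism.</p>

```python
def presence_matrix(database_names, organism_names):
    """Return one "name flag" row per database entry, sorted by name."""
    organism = set(organism_names)
    rows = []
    for name in sorted(set(database_names)):
        # Flag is "1" when the database entry also occurs in the organism
        flag = "1" if name in organism else "0"
        rows.append(name + " " + flag)
    return rows


# Sample data from the question (the contents of sorted.a and sorted.b)
database_names = ["A", "B", "C", "D"]
organism_names = ["A", "D"]

rows = presence_matrix(database_names, organism_names)
print("\n".join(rows))

# Set difference: database entries that are absent from the organism
missing = sorted(set(database_names) - set(organism_names))
print(missing)
```

<p style="margin:0px 0px 1.2em!important">Because membership is tested by name rather than by line position, it does not matter whether a protein appears on the same line in both files.</p>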
</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Sep 10, 2015 at 11:48 AM, Naiane Negri <span dir="ltr">&lt;<a href="mailto:naiannegri@gmail.com" target="_blank">naiannegri@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>

<p>I&#39;m new to Python, so I&#39;m really struggling to write a script.</p>

<p>What I need is to compare two files. One file 
contains all the proteins of a database; the other contains only some of 
the proteins present in the first file, because it belongs to an 
organism. So I need to know which proteins of this database are present 
in my organism. For that I want to build an output like a matrix, with 0 
and 1 for every protein in the database, indicating whether or not it is 
in my organism.</p>

<p>Does anybody have any idea how I could do that?
I&#39;m trying to use something like this:</p>
<pre>$ cat sorted.a
A
B
C
D
$ cat sorted.b
A
D
$ join  sorted.a sorted.b | sed &#39;s/^/1 /&#39; &amp;&amp; join  -v 1 sorted.a sorted.b | sed &#39;s/^/0 /&#39;
1 A
1 D
0 B
0 C</pre>

<p>But I can&#39;t use it, because sometimes a protein is present but it&#39;s not on the same line. 
Here is an example:</p><p>
1-cysPrx_C<br>14-3-3<br>2-Hacid_dh<br>2-Hacid_dh_C<br>2-oxoacid_dh<br>2H-phosphodiest<br>2OG-FeII_Oxy<br>2OG-FeII_Oxy_3<br>2OG-FeII_Oxy_4<br>2OG-FeII_Oxy_5<br>2OG-Fe_Oxy_2<br>2TM<br>2_5_RNA_ligase2</p>

<p>comparing with</p>

<p>1-cysPrx_C<br>120_Rick_ant<br>14-03-2003<br>2-Hacid_dh<br>2-Hacid_dh_C<br>2-oxoacid_dh<br>2-ph_phosp<br>2CSK_N<br>2C_adapt<br>2Fe-2S_Ferredox<br>2H-phosphodiest<br>2HCT<br>2OG-FeII_Oxy<br></p>

<p>Does anyone have an idea how I could do that?
Thanks so far.</p>
    </div></div>
<br>_______________________________________________<br>
Biopython-dev mailing list<br>
<a href="mailto:Biopython-dev@mailman.open-bio.org">Biopython-dev@mailman.open-bio.org</a><br>
<a href="http://mailman.open-bio.org/mailman/listinfo/biopython-dev" rel="noreferrer" target="_blank">http://mailman.open-bio.org/mailman/listinfo/biopython-dev</a><br></blockquote></div><br></div>