Using Arc to decode Venter's secret DNA watermark

Recently Craig Venter (who decoded the human genome) created a synthetic bacterium. The J. Craig Venter Institute (JCVI) took a bacterium's DNA sequence as a computer file, modified it, made physical DNA from this sequence, and stuck this DNA into a cell, which then reproduced under control of the new DNA to create a new bacterium. This is a really cool result, since it shows you can create the DNA of an organism entirely from scratch. (I wouldn't exactly call it synthetic life though, since it does require an existing cell to get going.) Although this project took 10 years to complete, I'm sure it's only a matter of time before you will be able to send a data file to some company and get the resulting cells sent back to you.

One interesting feature of this synthetic bacterium is it includes four "watermarks", special sequences of DNA that prove this bacterium was created from the data file, and is not natural. However, they didn't reveal how the watermarks were encoded. The DNA sequences were published (GTTCGAATATTT and so on), but how to get the meaning out of this was left a puzzle. For detailed background on the watermarks, see Singularity Hub. I broke the code (as I described earlier) and found the names of the authors, some quotations, and the following hidden web page. This seems like science fiction, but it's true. There's actually a new bacterium that has a web page encoded in its DNA:
DNA web page
I contacted the JCVI using the link and found out I was the 31st person to break the code, which is a bit disappointing. I haven't seen details elsewhere on how the code works, so I'll provide the details. (Spoilers ahead if you want to break the code yourself.) I originally used Python to break the code, but since this is nominally an Arc blog, I'll show how it can be done using the Arc language.

The oversimplified introduction to DNA

DNA double helix Perhaps I should give the crash course in DNA at this point. The genetic instructions for all organisms are contained in very, very long molecules called DNA. For our purposes, DNA can be described as a sequence of four different letters: C, G, A, and T. The human genome is 3 billion base pairs in 23 chromosome pairs, so your DNA can be described as a pair of sequences of 3 billion C's, G's, A's, and T's.

Watson and Crick won the Nobel Prize for figuring out the structure of DNA and how it encodes genes. Your body is made up of important things such as enzymes, which are made up of carefully-constructed proteins, which are made up of special sequences of 18 different amino acids. Your genes, which are parts of the DNA, specify these amino acid sequences. Each three DNA bases specify a particular amino acid in the sequence. For instance, the DNA sequence CTATGG specifies the amino acids leucine and tryptophan because CTA indicates leucine, and TGG indicates tryptophan. Crick and Watson and Rosalind Franklin discovered the genetic code that associates a particular amino acid with each of the 64 possible DNA triples. I'm omitting lots of interesting details here, but the key point is that each three DNA symbols specifies one amino acid.

Encoding text in DNA

Now, the puzzle is to figure out how JCVI encoded text in the synthetic bacterium's DNA. The obvious extension is instead of assigning an amino acid to each of the 64 DNA triples, assign a character to it. (I should admit that this was only obvious to me in retrospect, as I explained in my earlier blog posting.) Thus if the DNA sequence is GTTCGA, then GTT will be one letter and CGA will be another, and so on. Now we just need to figure out what letter goes with each DNA triple.

If we know some text and the DNA that goes with it, it is straightforward to crack the code that maps between the characters and the DNA triples. For instance, if we happen to know that the text "LIFE" corresponds to the DNA "AAC CTG GGC TAA", then we can conclude that AAA goes to L, CTG goes to I, GGC goes to F, and TAA goes to E. But how do we get started?

Conveniently, the Singularity Hub article gives the quotes that appear in the DNA, as reported by JCVI:

"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE."
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."
"WHAT I CANNOT BUILD, I CANNOT UNDERSTAND."
We can try matching the quotes against each position in the DNA and see if there is any position that works. A match can fail in two ways. First, if the same DNA triple corresponds to two different letters, something must be wrong. For instance, if we try to match "LIFE" against "AAC CTG GGC AAC", we would conclude that AAC means both L and E. Second, if the same letter corresponds to two DNA triples, we can reject it. For instance, if we try to match "ERR" against "TAA CTA GTC", then both CTA and GTC would mean R. (Interestingly, the real genetic code has this second type of ambiguity. Because there are only 18 amino acids and 64 DNA triples, many DNA triples indicate the same amino acid.)

Using Arc to break the code

To break the code, first download the DNA sequences of the four watermarks and assign them to variables w1, w2, w3, w4. (Download full code here.)
(= w1 "GTTCGAATATTTCTATAGCTGTACA...")
(= w2 "CAACTGGCAGCATAAAACATATAGA...")
(= w3 "TTTAACCATATTTAAATATCATCCT...")
(= w4 "TTTCATTGCTGATCACTGTAGATAT...")
Next, lets break the four watermarks into triples by first converting to a list and then taking triples of length 3, which we call t1 through t4:
(def strtolist (s) (accum accfn (each x s (accfn (string x)))))
(def tripleize (s) (map string (tuples (strtolist s) 3)))

(= t1 (tripleize w1))
(= t2 (tripleize w2))
(= t3 (tripleize w3))
(= t4 (tripleize w4))
The next function is the most important. It takes triples and text, matches the triples against the text, and generates a table that maps from each triple to the corresponding characters. If it runs into a problem (either two characters assigned to the same triple, or the same character assigned to two triples) then it fails. Otherwise it returns the code table. It uses tail recursion to scan through the triples and input text together.
; Match the characters in text against the triples.
; codetable is a table mapping triples to characters.
; offset is the offset into text
; Return codetable or nil if no match.
(def matchtext (triples text (o codetable (table)) (o offset 0))
  (if (>= offset (len text)) codetable ; success
      (with (trip (car triples) ch (text offset))
  ; if ch is assigned to something else in the table
  ; then not a match
  (if (and (mem ch (vals codetable))
           (isnt (codetable trip) ch))
        nil
             (empty (codetable trip))
        (do (= (codetable trip) ch) ; new char
           (matchtext (cdr triples) text codetable (+ 1 offset)))
             (isnt (codetable trip) ch)
        nil ; mismatch
      (matchtext (cdr triples) text codetable (+ 1 offset))))))
Finally, the function findtext finds a match anywhere in the triple list. In other words, the previous function matchtext assumes the start of the triples corresponds to the start of the text. But findtext tries to find a match at any point in the triples. It calls matchtext, and if there's no match then it drops the first triple (using cdr) and calls itself recursively, until it runs out of triples.
(def findtext (triples text)
  (if (< (len triples) (len text)) nil
    (let result (matchtext triples text)
      (if result result
        (findtext (cdr triples) text)))))
The Singularity Hub article said that the DNA contains the quote:
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE 
Let's start by just looking for "LIFE OUT OF LIFE", since we're unlikely to get LIFE to match twice just by chance:
arc> (findtext t1 "LIFE OUT OF LIFE")
nil
No match in the first watermark, so let's try the second.
arc> (findtext t2 "LIFE OUT OF LIFE")
#hash(("CGT" . #\O) ("TGA" . #\T) ("CTG" . #\I) ("ATA" . #\space) ("TAA" . #\E) ("AAC" . #\L) ("TCC" . #\U) ("GGC" . #\F))
How about that! Looks like a match in watermark 2, with DNA "AAC" corresponding to L, "CGT" corresponding to "I", "GGC" corresponding to "F", and so on. (The Arc format for hash tables is pretty ugly, but hopefully you can see that in the output.) Let's try matching the full quote and store the table in code2:
arc> (= code2 (findtext t2 "TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE."))
#hash(("TCC" . #\U) ("TCA" . #\H) ("TTG" . #\V) ("CTG" . #\I) ("ACA" . #\P) ("CTA" . #\R) ("CGA" . #\.) ("ATA" . #\space) ("CGT" . #\O) ("TTT" . #\C) ("TGA" . #\T) ("TAA" . #\E) ("AAC" . #\L) ("CAA" . #\M) ("GTG" . #\,) ("TAG" . #\A) ("GGC" . #\F))
Likewise, we can try matching the other quotes:
arc> (= code3 (findtext t3 "SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."))
#hash(("CTA" . #\R) ("TCA" . #\H) ("CTG" . #\I) ("GCT" . #\S) ("CGA" . #\.) ("ATA" . #\space) ("CAT" . #\Y) ("CGT" . #\O) ("TCC" . #\U) ("AGT" . #\B) ("TGA" . #\T) ("TAA" . #\E) ("TGC" . #\N) ("TAC" . #\G) ("CAA" . #\M) ("GTG" . #\,) ("TAG" . #\A))
arc> (= code4 (findtext t4 "WHAT I CANNOT BUILD, I CANNOT UNDERSTAND"))
#hash(("TCC" . #\U) ("TCA" . #\H) ("ATT" . #\D) ("CTG" . #\I) ("GCT" . #\S) ("TTT" . #\C) ("ATA" . #\space) ("TAA" . #\E) ("CGT" . #\O) ("CTA" . #\R) ("TGA" . #\T) ("AGT" . #\B) ("AAC" . #\L) ("TGC" . #\N) ("GTG" . #\,) ("TAG" . #\A) ("GTC" . #\W))
Happily, they all decode the same triples to the same letters, or else we would have a problem. We can merge these three decode tables into one:
arc> (= code (listtab (join (tablist code2) (tablist code3) (tablist code4))))#hash(("TCC" . #\U) ("TCA" . #\H) ("CTA" . #\R) ("CGT" . #\O) ("TGA" . #\T) ("CGA" . #\.) ("TAA" . #\E) ("AAC" . #\L) ("TAC" . #\G) ("TAG" . #\A) ("GGC" . #\F) ("ACA" . #\P) ("ATT" . #\D) ("TTG" . #\V) ("CTG" . #\I) ("GCT" . #\S) ("TTT" . #\C) ("CAT" . #\Y) ("ATA" . #\space) ("AGT" . #\B) ("GTG" . #\,) ("TGC" . #\N) ("CAA" . #\M) ("GTC" . #\W))
Now, let's make a decode function that will apply the string to the unknown DNA and see what we get. (We'll leave unknown triples unconverted.)
(def decode (decodemap triples)
  (string (accum accfn (each triple triples (accfn
      (if (decodemap triple)
        (decodemap triple)
        (string "(" triple ")" )))))))
arc> (decode code t1)
"(GTT). CRAIG VENTER INSTITUTE (ACT)(TCT)(TCT)(GTA)(GGG)ABCDEFGHI(GTT)(GCA)LMNOP(TTA)RSTUVW(GGT)Y(TGG)(GGG) (TCT)(CTT)(ACT)(AAT)(AGA)(GCG)(GCC)(TAT)(CGC)(GTA)(TTC)(TCG)(CCG)(GAC)(CCC)(CCT)(CTC)(CCA)(CAC)(CAG)(CGG)(TGT)(AGC)(ATC)(ACC)(AAG)(AAA)(ATG)(AGG)(GGA)(ACG)(GAT)(GAG)(GAA).,(GGG)SYNTHETIC GENOMICS, INC.(GGG)(CGG)(GAG)DOCTYPE HTML(AGC)(CGG)HTML(AGC)(CGG)HEAD(AGC)(CGG)TITLE(AGC)GENOME TEAM(CGG)(CAC)TITLE(AGC)(CGG)(CAC)HEAD(AGC)(CGG)BODY(AGC)(CGG)A HREF(CCA)(GGA)HTTP(CAG)(CAC)(CAC)WWW.(GTT)CVI.ORG(CAC)(GGA)(AGC)THE (GTT)CVI(CGG)(CAC)A(AGC)(CGG)P(AGC)PROVE YOU(GAA)VE DECODED THIS WATERMAR(GCA) BY EMAILING US (CGG)A HREF(CCA)(GGA)MAILTO(CAG)MRO(TTA)STI(TGG)(TCG)(GTT)CVI.ORG(GGA)(AGC)HERE(GAG)(CGG)(CAC)A(AGC)(CGG)(CAC)P(AGC)(CGG)(CAC)BODY(AGC)(CGG)(CAC)HTML(AGC)"
It looks like we're doing pretty well with the decoding, as there's a lot of recognizable text and some HTML in there. There's also conveniently the entire alphabet as a decoding aid. From this, we can fill in a lot of the gaps, e.g. GTT is J and GCA is K. From the HTML tags we can figure out angle brackets, quotes, and slash. We can guess that there are numbers in there and figure out that ACT TCT TCT GTA is 2009. These deductions can be added manually to the decode table:
arc> (= (code "GTT") #\J)
arc> (= (code "GCA") #\K)
arc> (= (code "ACT") #\2)
...
After a couple cycles of adding missing characters and looking at the decodings, we get the almost-complete decodings of the four watermarks:

The contents of the watermarks

J. CRAIG VENTER INSTITUTE 2009
ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789(TTC)@(CCG)(GAC)-(CCT)(CTC)=/:<(TGT)>(ATC)(ACC)(AAG)(AAA)(ATG)(AGG)"(ACG)(GAT)!'.,
SYNTHETIC GENOMICS, INC.
<!DOCTYPE HTML><HTML><HEAD><TITLE>GENOME TEAM</TITLE></HEAD><BODY><A HREF="HTTP://WWW.JCVI.ORG/">THE JCVI</A><P>PROVE YOU'VE DECODED THIS WATERMARK BY EMAILING US <A HREF="MAILTO:[email protected]">HERE!</A></P></BODY></HTML>

MIKKEL ALGIRE, MICHAEL MONTAGUE, SANJAY VASHEE, CAROLE LARTIGUE, CHUCK MERRYMAN, NINA ALPEROVICH, NACYRA ASSAD-GARCIA, GWYN BENDERS, RAY-YUAN CHUANG, EVGENIA DENISOVA, DANIEL GIBSON, JOHN GLASS, ZHI-QING QI.
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE

CLYDE HUTCHISON, ADRIANA JIGA, RADHA KRISHNAKUMAR, JAN MOY, MONZIA MOODIE, MARVIN FRAZIER, HOLLY BADEN-TILSON, JASON MITCHELL, DANA BUSAM, JUSTIN JOHNSON, LAKSHMI DEVI VISWANATHAN, JESSICA HOSTETLER, ROBERT FRIEDMAN, VLADIMIR NOSKOV, JAYSHREE ZAVERI.
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."

CYNTHIA ANDREWS-PFANNKOCH, QUANG PHAN, LI MA, HAMILTON SMITH, ADI RAMON, CHRISTIAN TAGWERKER, J CRAIG VENTER, EULA WILTURNER, LEI YOUNG, SHIBU YOOSEPH, PRABHA IYER, TIM STOCKWELL, DIANA RADUNE, BRIDGET SZCZYPINSKI, SCOTT DURKIN, NADIA FEDOROVA, JAVIER QUINONES, HANNA TEKLEAB.
"WHAT I CANNOT BUILD, I CANNOT UNDERSTAND." - RICHARD FEYNMAN

The first watermark consists of a copyright-like statement, a list of all the characters, and the hidden HTML page (which I showed above.) The second, third, and fourth watermarks consist of a list of the authors and three quotations.

Note that there are 14 undecoded triples that only appear once in the list of characters. They aren't in ASCII order, keyboard order, or any other order I can figure out, so I can't determine what they are, but I assume they are missing punctuation and special characters.

The DNA decoding table

The following table summarizes the 'secret' DNA to character code:
AAA ? AAC L AAG ? AAT 3
ACA P ACC ? ACG ? ACT 2
AGA 4 AGC > AGG ? AGT B
ATA space ATC ? ATG ? ATT D
CAA M CAC / CAG : CAT Y
CCA = CCC - CCG ? CCT ?
CGA . CGC 8 CGG < CGT O
CTA R CTC ? CTG I CTT 1
GAA ' GAC ? GAG ! GAT ?
GCA K GCC 6 GCG 5 GCT S
GGA " GGC F GGG newline GGT X
GTA 9 GTC W GTG , GTT J
TAA E TAC G TAG A TAT 7
TCA H TCC U TCG @ TCT 0
TGA T TGC N TGG Z TGT ?
TTA Q TTC ? TTG V TTT C
As far as I can tell, this table is in random order. I analyzed it a bunch of ways, from character frequency and DNA frequency to ASCII order and keyboard order, and couldn't figure out any rhyme or reason to it. I was hoping to find either some structure or another coded message, but I couldn't find anything.

More on breaking the code

It was convenient that the JCVI said in advance what quotations were in the DNA, making it easier to break the code. But could the code still be broken if they hadn't?

One way to break the code is to look at statistics of how often different triples appear. We count the triples, convert the table to a list, and then sort the list. In the sort we use a special comparison function that compares the counts, to sort on the counts, not on the triples.

arc> (sort
  (fn ((k1 v1) (k2 v2)) (> v1 v2))
  (tablist (counts t2)))
(("ATA" 41) ("TAG" 27) ("TAA" 25) ("CTG" 18) ("TGC" 16) ("GTG" 16) ("CTA" 15) ("CGT" 14) ("AAC" 13) ("TGA" 10) ("TTT" 10) ("GCT" 10) ("TAC" 10) ("TCA" 8) ("CAA" 7) ("CAT" 7) ("TCC" 7) ("TTG" 5) ("GGC" 4) ("GTT" 4) ("ATT" 4) ("CCC" 4) ("GCA" 3) ("AGT" 2) ("TTA" 2) ("ACA" 2) ("CGA" 2) ("GGA" 2) ("GTC" 1) ("TGG" 1) ("GGG" 1))
This tells us that ATA appears 41 times in the second watermark, TAG 27 times, and so on. If we assume that this encodes English text, then the most common characters will be space, E, T, A, O, and so on. Then it's a matter of trial-and-error trying the high-frequency letters for the high-frequency triples until we find a combination that yields real words. (It's just like solving a newspaper cryptogram.) You'll notice that the high-frequency triples do turn out to match high-frequency letters, but not in the exact order. (I've written before on simple cryptanalysis with Arc.)

Another possible method is to guess that a phrase such as "CRAIG VENTER" appears in the solution, and try to match that against the triples. This turns out to be the case. It doesn't give a lot of letters to work on, but it's a start.

Arc: still bad for exploratory programming

A couple years ago I wrote that Arc is bad for exploratory programming, which turned out to be hugely controversial. I did this DNA exploration using both Python and Arc, and found Python to be much better for most of the reasons I described in my earlier article.
  • Libraries: Matching DNA triples was trivial in Python because I could use regular expressions. Arc doesn't have regular expressions (unless you use a nonstandard variant such as Anarki). Another issue was when I wanted to do graphical display of watermark contents; Arc has no graphics support.
  • Performance: for the sorts of exploratory programming I do, performance is important. For instance, one thing I did when trying to figure out the code was match all substrings of one watermark against another, to see if there were commonalities. This is O(N^3) and was tolerably fast in Python, but Arc would be too painful.
  • Ease of and speed of programming. A lot of the programming speed benefit is that I have more familiarity with Python, but while programming in Arc I would constantly get derailed by cryptic error messages, or trying to figure out how to do simple things like breaking a string into triples. It doesn't help that most of the Arc documentation I wrote myself.
To summarize, I started off in Arc, switched to Python when I realized it would take me way too long to figure out the DNA code using Arc, and then went back to Arc for this writeup after I figured out what I wanted to do. In other words, Python was much better for the exploratory part.

Thoughts on the DNA coding technique

Part of the gene map.  Watermark 2b is shown in the DNA, with an arc gene below it I see some potential ways that the DNA coding technique could be improved and extended. Since the the full details of the DNA coding technique haven't been published by the JCVI yet, they may have already considered these ideas.

Being able to embed text in the DNA of a living organism is extremely cool. However, I'm not sure how practical it is. The data density in DNA is very high, maybe 500 times that of a hard drive (ref), but most of the DNA is devoted to keeping the bacterium alive and only about 1K of text is stored in the watermarks.

I received email from JCVI that said the coding mechanism also supports Java, math (LaTeX?), and Perl as well as HTML. Embedding Perl code in a living organism seems even more crazy than text or HTML.

If encoding text in DNA catches on, I'd expect that error correction would be necessary to handle mutations. An error-correction code like Reed-Solomon won't really work because DNA can suffer deletions and insertions as well as mutations, so you'd need some sort of generalized deletion/insertion correcting code.

I just know that a 6-bit code isn't going to be enough. Sooner or later people will want lower-case, accented character, Japanese, etc. So you might as well come up with a way to put Unicode into DNA, probably as UTF-8. And people will want arbitrary byte data. My approach would be to use four DNA base pairs instead of three, and encode bytes. Simply assign A=00, C=01, G=10, and T=11 (for instance), and encode your bytes.

If I were writing an RFC for data storage in DNA, I'd suggest that most of the time you'd want to store the actual data on the web, and just store a link in the DNA. (Similar to how a QR code or tinyurl can link to a website.) The encoded link could be surrounded by a special DNA sequence that indicates a tiny DNA link code. This sequence could also be used to find the link code by using PCR, so you don't need to do a genome sequence on the entire organism to find the watermark. (The existing watermarks are in fact surrounded by fixed sequences, which may be for that purpose.)

I think there should also be some way to tag and structure data that is stored DNA, to indicate what is identification, description, copyright, HTML, owners, and so forth. XML seems like an obvious choice, but embedding XML in living organisms makes me queasy.

Conclusions

Being able to fabricate a living organism from a DNA computer file is an amazing accomplishment of Craig Venter and the JCVI. Embedding the text of a web page in a living organism seems like science fiction, but it's actually happened. Of course the bacterium doesn't actually serve the web page... yet.

Sources

Decoding the secret DNA code in Venter's synthetic bacterium

Craig Venter recently created a synthetic bacterium with a secret message encoded in the DNA. This is described in more detail at singularityhub. (My article will make much more sense if you read the singularityhub article.)

I tried unsuccessfully to decode the secret message yesterday. I realized the problem was that I was thinking like a computer scientist, not a biologist. Once I started thinking like a biologist, the solution was obvious.

I won't spoil your fun by giving away the full answer (at least for now), but I'll mention a few things. [Update: I've written up the details here.] There are four watermarks. The first contains HTML. (That's right! They put actual HTML - complete with DOCTYPE - into the DNA of a living creature. That's just crazy!)
The HTML in the genome
I sent email to the link, and they told me I was the 31st person to break the code. I guess I'll need to be faster next time.

The second, third, and fourth watermarks each contain co-authors and an interesting scientific quote. The first watermark also contains the full character set in order, to help verify the decoding, although I was unable to decode a few of the special characters because they only appear once.

I found the following quotes (which match the quotes given by JCVI and in the singularityhub article):
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."
WHAT I CANNOT BUILD, I CANNOT UNDERSTAND." - RICHARD FEYNMAN
Note that the quotes are one per watermark, not all in the fourth watermark as the singularityhub article claims.

If you want to try decoding the code, the DNA of the watermarks is at Science Magazine. The complete genome is a NCBI, in case you want it.

My loyal readers will be disappointed that I didn't use the Arc language to decode this, unlike my previous cryptanalysis and genome adventures. I started out with Arc, but my existing Arc cryptanalysis tools didn't do the job. I switched to Python since I'm faster in Python and I wanted to do some heavy-duty graphical analysis.

I tried a bunch of things that didn't work for analysis. I tried autocorrelation to try to determine the length of each code element, but didn't get much of a signal. I tried looking for common substrings between the watermarks both visually and symbolically, and I found some substantial strings that appeared multiple times, but that didn't really help. I discovered that the sequence "CGAT" never appears in the watermarks, but that was entirely useless. I tried applying the standard genetic code (mapping DNA triples to amino acids), since apparently Craig Venter used that in the past, but didn't come up with anything. I tried to make various estimates of how many bits of data were in the watermarks and how many bits of data would be required using various encodings. I briefly considered the possibility that the DNA encoded vectors that would draw out the answers (I think a Pickover book suggested analyzing DNA in this way), or that the DNA drew out text as a raster, but decided the bit density would be too low and didn't actually try this.

Then I asked myself: "How would I, myself, encode text in DNA?" I figured Unicode would be overkill, so probably it would be best to use either ASCII or a 6-bit encoding (maybe base-64). Then each pair of bits could be encoded with one of C, G, A, and T. Based on the singularityhub article I assumed the word "TRIUMPH" appeared, so I took "riumph" (to avoid possible capitalization issues with the first letter), encoded it as 6-bits or 8-bits, with any possible starting value for 'a', and the 24 possibilities for mapping C/G/A/T to 2-bits, and then tried big-endian and little-endian, to yield 15360 possible DNA encodings for the word. I was certain that one of them would have to appear. However, a brute force search found that none of them were in the DNA. At that point, I started trying even more implausible encodings, with no success, and started to worry that maybe the data was zipped first, which would would make decoding almost impossible. I started to think about variable-length encodings (such as Huffman encodings), and then gave up for the night. I was wondering if it would be worth a blog post on things that don't work for decrypting the code or if I should quit in silence.

In the middle of the night it occurred to me that the designers of the code were biologists, so I should think like a biologist not a computer scientist to figure out the code. With this perspective, it was obvious how to extend the genetic code to handle a larger character set. The problem was there were 64! possibilities. Or maybe 64^26 or 26 choose 64 or something like that; the exact number doesn't matter, but it's way too many to brute force like I did earlier. However, based on the quotes in the singularityhub article, I assumed that the substring " OUT O" appeared in the string, made a regular expression, and boom, it actually matched the watermark, confirming my theory. Then it was a quick matter of extending the technique to decode the full watermarks. As far as I can tell, the encoding used was arbitrary, and does not have any fundamental meaning. As I mentioned earlier, I'm leaving out the details for now, since I don't want to ruin anyone else's fun in decoding the message.

In conclusion, decoding the secret message was a fun challenge.

Your relationship: mathematically doomed or not?

I came across an interesting paper that uses a mathematical model of relationships to show that relationships are likely to fall apart unless you put more energy into them than you'd expect. To prove this, the paper, A Mathematical Model of Sentimental Dynamics Accounting for Marital Dissolution by José-Manuel Rey, uses optimal control theory to find the amount of effort to put into a relationship to maximize lifetime happiness. Unfortunately, due to a mathematical error, the most interesting conclusions of the paper are faulty. (Yes, this is wildly off topic for my blog.)

To briefly summarize the paper, it considers the feeling level of the relationship to be a function of time: x(t). The success of the relationship depends on the effort you put into the relationship, which is also a function of time: c(t). In the absence of effort, the feelings in the relationship will decay, but effort will boost the relationship. This is described with the catchphrase "The second thermodynamic law for sentimental interaction", which claims relationships deteriorate unless energy is added.

Next, your overall happiness (i.e. utility) is based on two more functions. The first function U(x) indicates how much happiness you get out of the relationship. The better the relationship, the happier you are, but to a limit. The second function D(c) indicates how much it bothers you to work hard on the relationship. All else being equal, you want to put some amount of effort c* into the relationship. Putting more effort than that into the relationship drains you, and putting a lot more effort drains you a lot more.

To summarize the model so far, you need to put effort into the relationship or it will deteriorate. If you put more effort into the relationship, you'll be happier because the relationship is better, but unhappier because you're working so hard, so there's a tradeoff.

The heavy-duty mathematics comes into play now. You sum up your happiness over your entire life to obtain a single well-being number W. (A discount rate is used so what happens in the short term affects W more than what happens many years in the future.) Your total happiness is given by this equation:
Equation W
Your goal is to figure out how much effort to put into the relationship at every point in the future, to obtain the biggest value of W. (I.e. determine the function c(t).) By using optimal control theory, the "best" effort function will be a solution to the paper's Equation 2:
Equation 2
From this, you can compute how much effort to put into the relationship at every point in the future, and what the final destiny of the relationship will be. The paper determines that there is an optimal equilibrium point E for the relationship, and you should adjust the effort you put into the relationship in order to reach this point.

However the paper has one major flaw at this point. It assumes that all trajectories satisfy Equation 2, not just the optimal one. (As the paper puts it, "The stable manifold is the only curve supporting trajectories leading to equilibrium.") From this, the paper reaches the erroneous conclusion that relationship dynamics are unstable, so a small perturbation will send the relationship spiraling off in another direction. In fact, a small perturbation will cause a small change in the relationship. The paper's second erroneous conclusion is that "effort inattentions" (small decreases in the effort put into the relationship) will cause the relationship to diverge down to zero and relationship breakup. In fact, the paper's model predicts that small decreases in the effort will cause small decreases in the relationship.

To make this clearer, I have annotated Figure 3 from the paper. My annotations are the "yellow notes":
Annotated Figure 3
The trajectories that the model obtains according to Equation 2 are mostly nonsensical, and exclude sensible trajectories, as I will explain.

Black line A' shows what happens if you start off putting "too much" effort into the relationship. You end up putting more and more effort into the relationship (upper-right), which will yield worse and worse well-being (W), turning the relationship into an obsession. Obviously this is neither sensible or optimal.

The model claims that Black line A'' is explains relationship breakups, where the relationship effort drops to 0, followed by the collapse of the relationship below xmin. Again, this is a nonsensical result obtained by using Equation 2 where it does not apply. According to the model, effort c* is the easiest level of relationship effort to provide. Thus, a trajectory that drops below c* is mathematically forced to worsen well-being function W, and doesn't make sense according to the model of the paper.

The paper poses the "failure paradox", that relationships often fail even though people don't expect them to. This is explained as a consequence of "effort inattention", which drops your relationship from a good (equilibrium) trajectory to a deteriorating trajectory such as A''. I show above (in orange) two alternate trajectories that recover the relationship, rather than causing breakup. The horizontal orange line assumes that after some decline, you keep the relationship effort fixed, causing the relationship to reach the stable orange dot above Wu. The diagonal orange line shows that even if your relationship is on a downwards trajectory, you can reverse this by increasing the effort and reach the "best" point E. Note that both of these trajectories satisfy the basic model of the paper, and the trajectories achieve much higher wellbeing than trajectory A''. This proves that A'' is not an optimal trajectory. (I believe the optimal trajectory would actually be to jump back up to the yellow-green line as soon as possible and proceed to E.)

Another erroneous conclusion of the paper is that E is the unique equilibrium point and is an unstable equilibrium. In fact, any point along the upwards dotted diagonal line is a stable equilibrium point. Along this line, the effort c is the exact amount to preserve the relationship feeling level x at its current value. Any perturbation in the relationship feeling x will be exponentially damped out according to Equation 1. In other words, if the feeling level in the relationship gets shifted for some reason (and the effort is unchanged), it will move back to its original path. Alternatively, if the effort level changes by a small amount, it will cause a correspondingly small (but permanent) shift in the relationship feeling, moving along the diagonal line. The relationship will not suddenly shift to line A' or A'' and go crazy.

To summarize, the paper makes the faulty assumption that all relationship trajectories (and not just the optimal one) will follow Equation 2. As a result, the paper yields nonsensical conclusions: relationships can surge up to infinity (line A') or down to 0 (line A''). The paper also reaches the mistaken conclusions that relationships have a single equilibrium point, this equilibrium point is unstable, and temporary inattention can cause relationships to break up. These conclusions are all erroneous. According to the paper's model followed correctly, relationships are stable to a rather boring degree.

Overall, I found the mathematics much more convincing in Gottman's The Mathematics of Marriage: Dynamic Nonlinear Models which applies catastrophe theory (the hot math trend of the 70s) to marriage. To vastly oversimplify, once your relationship is in an unhappy state, it takes a great deal of effort to flip it back to a happy state. This is pretty much the exact opposite of the linear model that Rey uses.

Disclaimer: My original posting was rather hand-waving; I've significantly edited this article to fill in some of the mathematical details.

Credits: I came across this article via Andrew Sullivan. ("Hat tip" is just too precious a term for me to use.)

USB Panic Button with Linux and Python

This article describes how to use a USB Panic Button with Python. The panic button is a pushbutton that can be read over USB. Unfortunately, it only comes with drivers for Windows, so using it with Linux is a bit of a challenge. I found a Perl library that can read the Panic Button using low-level USB operations, so I ported that simple library to Python.

First, download and install PyUSB. (You may also need to install python-devel.)

Next, dowwnload my PanicButton library.

Finally, use the library. The API is very simple: create a button object with PanicButton(), and then call read() on the button to see if the button is pressed. For example, the following code will print "Pressed" when the button is pressed. (I know, a button like this should be more dramatic...)

import PanicButton
import time

button = PanicButton.PanicButton()

while 1:
  if button.read():
    print "Pressed"
  time.sleep(.5)
A couple caveats. First, you need to run the Python code as root in order to access the device. (Maybe you can get around this with udev magic, but I couldn't.) Second, the button actually triggers when it is released, not when it is pressed.

How the library works

By running lsusb, you can see that the Panic Button's USB id is 1130:0202. We use the PyUSB library to get a device object:
class PanicButton:
  def __init__(self):
    # Device is: ID 1130:0202 Tenx Technology, Inc. 
    self.dev = usb.core.find(idVendor=0x1130, idProduct=0x0202)
The Linux kernel grabs the device and makes it into a hidraw device, so we need to detach that kernel driver before using the device:
    try:
      self.dev.detach_kernel_driver(0)
    except Exception, e:
      pass # already unregistered
Reading the status of the device is done through a USB control transfer. All the magic numbers come from the PanicButton Perl library. See details (translated from German).
  def read(self):
    return self.dev.ctrl_transfer(bmRequestType=0xA1, bRequest=1, wValue=0x300, data_or_wLength=8, timeout=500)[0]
Hopefully this will be of help to anyone trying to interface the USB Panic Button through Python.

Spanish vocabulary from "Harry Potter y la piedra filosofal"

Recently, I've been attempting to improve my Spanish by reading Harry Potter y la piedra filosofal, the Spanish version of Harry Potter and the Sorcerer's Stone. It's definitely improving my vocabulary, as there are many words I've needed to look up. I've listed many of them below, in the hopes this list may be helpful to others as well.

To make this slightly more relevant to my usual topic of programming languages, I've added Javascript that will show or hide the English translations, which will also show up if you hover over a link. (See the end for explanation of the Javascript.)

How the Javascript works

I'm including some programming content for my regular blog readers, since this is a bit off topic. If you're interested in the Spanish, you can stop reading here :-)

I have two CSS classes, one for the enabled text, and one for the disabled text. I simply change the color to show and hide the text. The #f6f6f6 color is to match the background of my blog page, which isn't quite white.

<style TYPE="text/css"> 
       .enabled {color: black}
       .disabled {color: #f6f6f6}
</style>
Next, each word entry has a span tag around the English text that I want to show or hide:
<a href="http://..." title="to add">añadir</a>
<span class="enabled"> - to add</span>
Then, I have a Javscript function that will toggle the class for each span tag. It simply loops through all the span tags switching ones with the enabled class to the disabled class and vice versa. Other span tags are left alone.
<script language="JavaScript">
function toggle() {
 elts = document.getElementsByTagName("span");
 for (var i = 0; i < elts.length; i++) {
  if (elts[i].className == "enabled") {
   elts[i].className = "disabled";
  } else if (elts[i].className == "disabled") {
   elts[i].className = "enabled";
  }
 }
}
</script>
Finally, the button calls the Javascript toggle routine to toggle the visibility.
<button onclick="toggle();">Show/Hide English</button>

Viking Dishwasher Fault Codes

Here are Viking Dishwasher error codes for anyone who needs them. I was visting family over Easter and the Viking dishwasher quit working - it would run a bit and then stop, with two indicator lights flashing. Inconveniently, the service guy claimed he couldn't come to fix it for two weeks. I knew these flashing lights must be error codes, but I couldn't find the error codes listed anywhere on the internet. It turns out that the fault codes are in the instruction manual for the DFUD041 and DFUD141 dishwashers, but in case anyone is looking for them online, here's a summary. One or more of the mode lights to the right of "PROG" will flash, indicating the problem:
Viking dishwasher control panel
"Heavy" (the pot) flashes - too much water in dishwasher. Contact service.
"Light china" (the wine glass) flashes - A fault with the water inlet. Make sure the water valve is open.
"Quick" (the wine glass with arrows above) flashes - valve leakage. Contact service.
"Pots/Pans" (the pot with an underline) and "Heavy" (the pot) both flash - blocked drain. Remove and clean the coarse strainer. Unscrew and clean the fine strainer. Lift out and clean the fine filter. Make sure the drain hose is not kinked. Make sure the air gap is not clogged. This picture shows the strainers and filter:
Viking dishwasher strainers

The dishwasher's problem turned out to be a blocked drain - specifically a whole almond caught in the strainer. After removing that and cleaning the filters, the dishwasher worked properly and we didn't need to wash the dishes by hand.

Hopefully this information will be helpful if your Viking dishwasher isn't running properly.

Understanding Sony IR remote codes, LIRC files, and the Arduino library

This article describes how to understand Sony IR codes and how to get them from the LIRC files. The article is intended to describe codes for use with my Arduino infrared remote library, but the information should be generally applicable. This article is aimed at the hardware hacker and assumes that you are familiar with binary and hexadecimal, so be warned :-)

One key point of this article is that there are three different ways to interpret the same IR signal and turn it into a hex code. Understanding these three ways will allow you to get codes from different sources and understand them correctly.

The IR transmission of the code

When you press a button on a Sony remote control, an infrared signal is transmitted. This transmission consists of a 40kHz signal which is turned on and off in a particular pattern. Different buttons correspond to different codes, which cause the signal to be turned on and off in different patterns.

The following waveform shows the IR code transmitted for STOP on my Sony CD remote control (RM-DC335). When the signal is high, a 40 kHz IR signal is transmitted, and when the signal is low, nothing is transmitted. In other words, the signal is actually rapidly turning on and off when it appears to be on in the figure. (The IR receiver demodulated the signal, so you don't see the 40 kHz transitions.)
IR transmission for a Sony remote control
A Sony IR signal starts with 2400 microseconds on and 600 microseconds off; that's the first wide pulse. A "1" bit is transmitted with 1200 microseconds on and 600 microseconds off, while a "0" bit is transmitted with 600 microseconds on and 600 microseconds off. (This encoding is also called SIRC or SIRCS encoding.)

You may notice the "on" parts of the waveform appear wider than the "off" parts, even when both are supposed to be 600 microseconds. This is a result of the IR receiver, which switches on faster than switching off.

The above waveform represents one transmission of a 12-bit code. This transmission is normally repeated as long as the button is held down, and a minimum of three times. Each transmission starts 45ms after the previous one started. The Sony protocol also supports 15 and 20 bit codes, which are the same as above except with more bits.

For more information on the low-level transmission of Sony codes, see sbprojects.net.

Three ways to interpret the codes

The Sony encoding seems straightforward, but there are several different ways the signal can be interpreted. I will call these official decoding, bit decoding, and bit-1 decoding. Different sources use any of these three, which can cause confusion. I will explain these three decodings, using the previous waveform as an example.

Official decoding

Official Sony decoding
The "official" Sony protocol views the 12-bit code as 7 command bits and 5 address or device bits, transmitted least-significant-bit first (i.e. "backwards"). The device bits specify the type of device receiving the code, and the command bits specify a particular command for this device. In this example, the device bits (yellow) are 10001 when read right-to-left, which is 17 decimal. The command bits (blue) are 0111000 when read right-to-left, which is 56 decimal. The device value 17 corresponds to a CD player, and the command 56 corresponds to the STOP command (details).

Sony 15 bit codes are similar, with 7 command bits and 8 device bits. Sony 20 bit codes have 7 command bits, 5 device bits, and 8 extended device bits.

Bit decoding

Decoding Sony IR code as a sequence of bits
Many IR decoders just treat the signal as a sequence of bits, most-significant-bit first. I will call this bit decoding. Applying this interpretation to the above code, the code is interpreted as 0001 1101 0001 binary, or 1d1 hex, or 465 in decimal. Note that the last bit doesn't really consist of 1200 microseonds on and 600 microseconds off; it consists of 1200 microseconds on followed by a lot of time off. In other words, the transmission is off for 600 microseconds and then continues to be off until the next code is transmitted.

An alternative but equivalent interpretation is to view the code as a 2400 microsecond header, followed by 12 bits, where each bit is off then on (rather than on then off). A "1" bit is 600 microseconds off and 1200 microseconds on, while a "0" bit is 600 microseconds off and 600 microseconds on. This yields the same value as before (232 decimal), but avoids the special handling of the last bit.
Short one bit decoding

Bit-1 decoding

Short one bit decoding
Many IR decoders drop the last bit, which I will call bit-1 decoding. Because the last bit doesn't end nicely with 600 microseconds off, some IR decoding algorithms treat the signal as 11 bits of data, ending with 600 microseconds on as a trailer. In this interpretation, the above code is 000 1110 1000 binary, or 0e8 hex, or 232 decimal. (Note that doubling this value and adding 1 yields the previous decoding of 465.)

Discussion

Is the code really 17/44 or 465 or 232? The official decoding is "right" in the sense that it is what the manufacturer intends. In addition, it reveals the internal structure of the code and the codes make more sense. For instance, the buttons 1-9 have consecutive codes with the official decoding, but not with the others. The other decodings are fine to use as long as you're consistent; the main thing is to understand that different sources use different decodings. My Arduino library uses the second bit decoding interpretation. The different decodings can be converted from one form to another with binary arithmetic.

Getting codes from a remote

Probably the easiest way to get the codes for your device is to use your existing remote control and see what codes it transmits. You can use my Arduino IR library to do the decoding, with the IRrecvDump demo program. Take a 3-pin IR decoder, hook it up to an Arduino, and then you can read the values for each button press on the serial port.

Alternatively, you can look at the transmitted codes with an oscilloscope. For the diagrams above, I used an IR receiver module connected to two resistors (to drop the voltage), connected to the line input of my PC. I used the Zeitnitz Soundcard Oscilloscope program to display the signal. This lets you see exactly what is being transmitted, but you will need to stare at the screen, write down a bunch of 0's and 1's, and convert the binary value to get your codes.

Getting codes from LIRC files

The best source for IR codes that I've found is the Linux Infrared Remote Control project (lirc.org), which has a huge collection of config files for various remotes. (LIRC also includes a large collection of device drivers for many types of IR input/output hardware, and a software library.)

The LIRC config file format is documented at WinLIRC, but I will walk through some examples.

Bit decoding

The RM-S530 LIRC file treats the codes as 12 bits long, using what I call bit decoding:
begin remote

  name  Sony
  bits           12
  flags SPACE_ENC|CONST_LENGTH
  eps            30
  aeps          100

  header       2470   557
  one          1243   556
  zero          644   556
  gap          45076
  min_repeat      2
  toggle_bit      0


      begin codes
          sleep                    0x0000000000000061
          cd_stop                  0x00000000000001D1
...
This file indicates that each entry is 12 bits long. A header consists of on for 2470 microseconds and off for 557 microseconds. A one bit consists of on for 1243 microseconds and off for 556 microseconds. A zero bit consists of on for 644 microseconds and off for 556 microseconds. Codes are repeated a minimum of 2 more times, with a gap of 45076 microseconds from start to start as the codes are constant length (CONST_LENGTH).

You may be wondering why these time values don't match the official values of 2400, 1200, and 600 microseconds. First, the LIRC data is generally measured from actual remotes, so the real-world timings don't quite match the theory. Second, IR sensors have some lag in detecting "on" and "off", and they typically stretch out the "on" time by ~100 microseconds, shortening the "off" time equally.

The LIRC file then lists the hex code associated with each button. For example the CD STOP code is hex 1D1, which is the same value as described earlier.

Bit-1 decoding

The LIRC file for the RM-D302 remote treats the codes as 11 bits and a trailing pulse. This is what I call the bit-1 decoding:
begin remote

  name  RM-D302
  bits           11
  flags SPACE_ENC|CONST_LENGTH
  eps            30
  aeps          100

  header       2367   638
  one          1166   637
  zero          565   637
  ptrail       1166
  gap          45101
  min_repeat      2
  toggle_bit      0


      begin codes
          cd_stop                  0x00000000000000E8
The key differences with the previous file are the ptrail field, indicating a trailing pulse of 1166 microseconds; and the bit count of 11 instead of 12. Note that the code value is different from the first file, even thought the IR transmission is exactly the same. The hex code 0E8 is the same as described earlier under bit-1 decoding.

Pre_data and post_data bits

Some LIRC files break apart the codes into constant pre_data bits, the code itself, and constant post_data bits. For instance, the file for my RM-D335 remote:
begin remote

  name  SONY
  bits            7
  flags SPACE_ENC
  eps            20
  aeps            0

  header       2563   520
  one          1274   520
  zero          671   520
  ptrail       1274
  post_data_bits  4
  post_data      0x8
  gap          25040
  min_repeat      2

      begin codes
          CONTINUE                 0x000000000000005C
...
          STOP                     0x000000000000000E
...
This file indicates that each code entry is 7 bits long, but there are also 4 post data bits. This means that after transmitting the 7 bits for the code, 4 additional bits are transmitted with the "post data" hex value 0x8, i.e. 1000 binary.

Putting this together the STOP button has the hex value 0E, which corresponds to the seven bits 000 1110. This is followed by four post data bits 1000, so the total transmission is the eleven bits 000 1110 1000, which is 0E8 hex, the same as before for the bit-1 decoding.

One more thing to notice in this LIRC file is it doesn't have the CONST_LENGTH flag, and the gap is 25ms instead of 45ms. This indicates the gap is from the end of one code to the start of the next, rather than from the start of one code to the start of the next. Specifying a gap between codes isn't how Sony codes are actually defined, it's close enough.

LIRC summary

Note that these LIRC files all indicate exactly the same IR transmission; they just interpret it differently. The first file defines the STOP code as 1D1, the second as E8, and the third as 0E.

What does this mean to you? If you want to get a Sony code from a LIRC file and use it with the Arduino library, you need to have a 12 bit (or 15 or 20 bit) code to pass to the library (bit decoding). Look up the code in the file and extract the specified number of bits. If there are any pre_data or post_data bits, append them as appropriate. If the result is one bit short and the LIRC file has a ptrail value, append a 1 bit on the end to convert from bit-1 decoding to bit decoding. Convert the result to hex and you should have the proper code for your device, that can be used with the Arduino library.

Getting codes from hifi-remote.com

The most detailed site I've found on Sony codes specifically is hifi-remote.com/sony. An interesting thing about this site is it analyzes the structure of the codes. While the LIRC files just list the codes, the hifi-remote site tries to explain why the codes are set up the way they are.

Note that this site expresses codes in Sony format, i.e. most-significant-bit first, and separating the device part of the code from the command part of the code. Also for a 20-bit Sony code, the 13 bit device code is expressed as a 5 bit value and an 8 bit value separated by a period. As a result, you may need to do some binary conversion and reverse the bits to use these codes.

To work through an example, I can look up the data for a Sony CD. There are multiple device codes, but assume for now I know my device code is 17. The table gives the code 56 for STOP. Convert 56 to the 7 bit binary value 0111000 and reverse it to get 0001110. Convert 17 to the 5 bit binary value 10001 and reverse it to get 10001. Put these together to et 000111010001, which is 1D1 hex, as before.

To work through a 20-bit example, if I have a Sony VCR/DVD Combo with a device code of 26.83, the site tells me that the code for "power" is 21. Convert 21 to a 7 bit binary value, 26 to a 5 bit binary value, and 83 to a 8 bit binary value. Then reverse and concatenate the bits: binary 1010100 + 01011 + 11001010, which is a8bca in hex. To confirm, the RMT-V501A LIRC file lists power as A8B with post_data of CA.

To use this site, you pretty much have to know the device code for your device already. To find that, obtain a code for your device (e.g. from your existing remote or from LIRC), split out and reverse the appropriate bits, and look up the device code on the site. Alternatively, there are only a few different device codes for a particular type of device, so you can just try them all and see what works.

This site also has information on "discrete codes". To understand discrete codes, consider the power button on a remote that toggles between "on" and "off". This may be inconvenient for automated control, since without knowing the current state, you don't know if sending a code will turn the device on or off. The solution is the "discrete code", which provides separate "on" and "off" codes. Discrete codes may also be provided for operations such as selecting an input or mode. Since these codes aren't on the remote, they are difficult to obtain.

Other sites

Additional config files are available at irremote.psiloc.com. These are in LIRC format, but translated to XML. The site remotecentral.com has a ton of information on remotes, but mostly expressed in proprietary formats.

Conclusion

Hopefully this will clear up some of the confusion around Sony remote codes, without adding additional confusion :-)