Tuesday, January 18, 2011

Chemical Words

A problem I've been playing with recently is finding words that can be spelled with the symbols of the chemical elements: for example, you can make "ranch" from Radium (Ra), Nitrogen (N), Carbon (C), and Hydrogen (H), but there's no way to make "dressing" because no element has the symbol "D" or "Dr". When I discovered that Mathematica has some pretty extensive dictionary functionality built in, I decided to go after this more systematically and see what kinds of words, and how many, can be made from element symbols.
Specifically, I'm working from the list of all elements that have actually been observed. I even included the three-letter temporary symbols for completeness, although I doubt there are many words with "Uut" in them. Also, for now I'm not opening the Pandora's box of multiple-word phrases that have element symbols crossing word boundaries, such as "get money" from Germanium, Thulium, Oxygen, Neon, and Yttrium. That adds an enormous level of complexity, and I'm not ready to tackle that yet.

The procedure I wrote to test whether a word is a "chemical word" is a pretty simple bit of recursion. If you're interested, here's the code in Mathematica:

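(* Assumes elements is the list of lowercase element symbols and
   StartsWithStr[str, prefix] is a helper defined elsewhere in the notebook. *)
IsChemicalWord[word_] := Module[{i}, (
   (* Base case: the empty string is a chemical word *)
   If[word == "", Return[True]];
   (* Try every element whose symbol matches the start of the word *)
   For[i = 1, i <= Length[elements], i += 1, (
      If[StartsWithStr[word, elements[[i]]] &&
         IsChemicalWord[StringDrop[word, StringLength[elements[[i]]]]],
         Return[True]]
   )];
   (* No element led to a complete spelling *)
   Return[False];
)]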


It's a little verbose just because Mathematica is a little verbose when it comes to iteration. It checks each element in turn to see whether the word starts with that element's symbol. If no element matches, the word is obviously not a chemical word, so it returns false. If one does match, it moves on to the recursive subproblem of "well, is the rest of the word a chemical word itself?", following the same procedure, and if that fails it goes on to try the remaining elements. For a base case, the empty string is trivially considered a chemical word. Here's an example of how it would examine the word "aspirin":

"aspirin" starts with Arsenic, so is "pirin" a chemical word?
      "pirin" starts with Phosphorus, so is "irin" a chemical word?
            "irin" starts with Iodine, so is "rin" a chemical word?
                  "rin" doesn't start with any chemicals, so it's not a chemical word.
            "irin" also starts with Iridium, so is "in" a chemical word?
                  "in" starts with Indium, so is "" a chemical word?
                        Yes, "" is a chemical word (base case).
                  Yes, "in" is a chemical word.
            Yes, "irin" is a chemical word.
      Yes, "pirin" is a chemical word.
Yes, "aspirin" is a chemical word.
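
The same backtracking search can be written more compactly in newer versions of Mathematica. This is only a sketch under a few assumptions: symbols is a small hand-picked subset of the element list rather than the full one, chemicalWordQ is a made-up name, and StringStartsQ and AnyTrue are built-ins from later releases that stand in for the helper and the loop used above.

```mathematica
(* Assumption: a small hand-picked subset of lowercase element symbols *)
symbols = {"as", "p", "i", "ir", "in", "n", "ra", "c", "h"};

(* Base case: the empty string counts as a chemical word *)
chemicalWordQ[""] = True;

(* A word qualifies if some symbol is a prefix of it
   and the remainder of the word also qualifies *)
chemicalWordQ[word_String] := AnyTrue[
  Select[symbols, StringStartsQ[word, #] &],
  chemicalWordQ[StringDrop[word, StringLength[#]]] &]

chemicalWordQ /@ {"aspirin", "ranch", "dressing"}
(* evaluates to {True, True, False} *)
```

AnyTrue stops at the first symbol that leads to a full spelling, and returns False for an empty list of candidates, which covers the "doesn't start with any element" case.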

IsChemicalWord just tells me "yes" or "no", but I have a similarly written function, MakeChemicalWord, which would take "aspirin" and give me back the properly capitalized "AsPIrIn".
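
MakeChemicalWord itself isn't shown in this post, so the following is only a guess at how such a function could look, not the actual code. The names makeChemicalWordSketch and cap are hypothetical, symbols is again a small hand-picked subset of the element list, and StringStartsQ is a built-in from later Mathematica releases.

```mathematica
(* Assumption: just enough lowercase symbols to spell "aspirin" *)
symbols = {"as", "p", "i", "ir", "in"};

(* Capitalize only the first letter of a symbol: "as" -> "As" *)
cap[s_String] := ToUpperCase[StringTake[s, 1]] <> StringDrop[s, 1];

(* Returns the first spelling found, or $Failed when there is none *)
makeChemicalWordSketch[""] = "";
makeChemicalWordSketch[word_String] := Module[{rest, result = $Failed},
  Do[
    If[StringStartsQ[word, s],
      rest = makeChemicalWordSketch[StringDrop[word, StringLength[s]]];
      (* Keep this symbol only if the rest of the word could be spelled *)
      If[StringQ[rest], result = cap[s] <> rest; Break[]]],
    {s, symbols}];
  result]

makeChemicalWordSketch["aspirin"]
(* evaluates to "AsPIrIn" *)
```

A word can have more than one valid spelling; this sketch simply returns whichever one the order of symbols reaches first.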

So, the obvious first question to ask is: what is the longest chemical word? Using Mathematica's built-in dictionary, it's "nonrepresentational", or rather "NONRePReSeNTaTiONAl". We can also look at how many elements it takes to make up each word, and it turns out that the word that takes the most is "subconsciousness". It's 3 letters shorter than nonrepresentational, but when you spell it all out, it's almost entirely single-letter elements: "SUBCONSCIOUSNEsS".
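
Once a word has been spelled out with capitalized symbols, counting its elements is just a matter of counting capital letters, since every symbol contributes exactly one. A small sketch using the spellings quoted in this post (elementCount is a hypothetical helper name):

```mathematica
(* Spellings taken from the text above *)
spelled = {"AsPIrIn", "NONRePReSeNTaTiONAl", "SUBCONSCIOUSNEsS"};

(* Each element symbol contributes exactly one capital letter *)
elementCount[s_String] := Count[Characters[s], _?UpperCaseQ];

(* Word, number of letters, number of elements *)
{#, StringLength[#], elementCount[#]} & /@ spelled
(* evaluates to {{"AsPIrIn", 7, 4}, {"NONRePReSeNTaTiONAl", 19, 13},
   {"SUBCONSCIOUSNEsS", 16, 15}} *)

Last[SortBy[spelled, StringLength]]  (* "NONRePReSeNTaTiONAl" *)
Last[SortBy[spelled, elementCount]]  (* "SUBCONSCIOUSNEsS" *)
```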

Interestingly, the distribution of word lengths in the whole dictionary is very similar to the distribution for the chem words. The average length is about 8.4 for the dictionary and 7.4 for the chem words, but the standard deviations are just about equal: 2.5 and 2.4 respectively. That was pretty surprising to me; I didn't expect the lengths of the chemical words to be spread out in such a similar way.
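
For reference, figures like these can be computed along the following lines. This is a sketch, not the code behind the numbers above: lengthStats is a hypothetical helper, and the four-word list is only there to make the example self-contained.

```mathematica
(* Mean and sample standard deviation of the word lengths in a list *)
lengthStats[words_List] := With[{lens = StringLength /@ words},
  N[{Mean[lens], StandardDeviation[lens]}]]

lengthStats[{"ranch", "aspirin", "neck", "wine"}]
(* evaluates to {5., 1.41421} *)

(* lengthStats[DictionaryLookup[]] would give the same two figures
   for every word in the built-in dictionary *)
```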

Mathematica also has a huge amount of linguistic information on the words in its dictionary, so I started trying to generate some basic (randomized) chemical sentences. I started with the simplest possible sentence structure: "[noun] [verb]". Here are a few of the sentences it gives:

PAsSErBY BArRaGeS
NeCKLaCe HUSBaNd
NOVICEs PReSAgEs
ReTeNTiONS BLuNdEr

Unfortunately, the dictionary includes inflected forms as well, so subject-verb agreement isn't always there. Now, when you get word data from Mathematica, it returns the data for the canonical (i.e. non-inflected) form of the word. In other words, WordData["runs"] and WordData["ran"] would both be equal to WordData["run"]. So it was pretty simple to filter out the non-canonical words, but now I'm stuck with singular nouns and first-person verbs, so I'll get sentences like "dog run" and "cat jump". As a heuristic fix, I just tacked an S for Sulfur on the end of the noun, which gave me sentences like these:

WAsTeS NeCK
PIThS WINe
ReCIFeS BOUNCe
InVErTeBRaTeS MoB
AcHeS STaTe
AcUPReSSUReS HINdEr
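
The "noun plus Sulfur, then verb" heuristic can be sketched as follows. The noun and verb lists here are hand-copied from the examples above purely for illustration; in a real run they would presumably be built from the part-of-speech information in WordData, which is an assumption about the approach rather than something shown in this post. chemicalSentence is a hypothetical name.

```mathematica
(* Assumption: already-spelled canonical nouns and verbs from the examples *)
nouns = {"WAsTe", "PITh", "AcHe", "InVErTeBRaTe"};
verbs = {"NeCK", "WINe", "BOUNCe", "STaTe"};

(* Pick a random noun, pluralize it with an S for Sulfur, add a random verb *)
chemicalSentence[] := RandomChoice[nouns] <> "S " <> RandomChoice[verbs]

(* Five random sentences; the output differs from run to run *)
Table[chemicalSentence[], {5}]
```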

I didn't really realize until doing this just how many words do double duty as both nouns and verbs. The first sentence looked totally backwards to me, until I realized that "wastes" can be a noun, and "neck" can actually be a verb. I would go further with the sentence generation idea, but 1) I don't have detailed enough technical knowledge of syntax to make anything really interesting and 2) I'd probably start reinventing the wheel before long. It might be cool to find existing algorithms for computer-generated speech, and just tack on the chemical word list - Eliza the psychotherapist, now moonlighting as a chemist?

If you want to see my actual Mathematica code, it's all here. I wouldn't recommend running the code to generate chemwords more than once ever, because it takes an extraordinarily long time to run. I just ran it once, copied the huge output so it said "chemwords = {a, b, c, ......};", marked it as an initialization cell, and hid it so it didn't take up all that screen real estate.
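
An alternative to pasting the output into an initialization cell is to write the list to a file once and read it back in later sessions. A minimal sketch, assuming chemwords already holds the computed list (a tiny stand-in is used here) and using a hypothetical file location:

```mathematica
(* Assumption: chemwords is the already-computed list; tiny stand-in here *)
chemwords = {"aspirin", "ranch", "neck"};

(* Hypothetical cache location; any writable path would do *)
cacheFile = FileNameJoin[{$TemporaryDirectory, "chemwords.m"}];

(* Write the list once *)
Put[chemwords, cacheFile];

(* In later sessions, read it back instead of recomputing *)
chemwords = Get[cacheFile];
```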
