So apparently I really like working on lexical and phonetic analysis. By the time I was fully into the swing of things, it felt like a puzzle. And, since this was about words in the English language, it should theoretically be a puzzle whose pieces I already knew.
So, the background:
This was developed in the early stages of an academic research project at Hamilton College that would serve two purposes: first, to help students analyze a particular type of poetry, and second, to give the same students a tool that would help them create their own.
As with most poetry, this type comes with a particular set of rules regarding the form, rhyme, meter, etc., of these works.
To help get this started, my job was to start thinking of ways that we can essentially ask a web application to look at either a word or a whole string of words (a line, couplet, etc.), and then get some of the information back as stats. In our case, we would need to know the number of syllables, and the rhyming features of the final word (or phrase) of each line.
Well, lucky for me, these lexical features are available at least in part through the excellent CMU Pronouncing Dictionary, which can tell you (almost) any English word’s phonetic sounds and emphasis. And while that doesn’t tell us the number of syllables in the word, or provide us with rhymes, it does give us enough information to get a lot closer to our goals.
Setting Up
The first hurdle was making this dictionary available to a web application — for my sake, I would need this to be something that I could query as often as I needed (without API key limitations, etc.). So, while the CMU dictionary page does offer a searchable input box (which returns the sort of thing you’d hope for), there was no obvious way to query that set externally, let alone in rapid succession. With more time, I would normally reach out to them and see what kind of tool they use for their own searches, but, in a hurry, it seemed like it would be faster to come up with my own solution. (More on that in a minute.)
I should also give another positive mention here to Steve Hanov and his “A Rhyming Engine” (now turned into the even bigger/better RhymeBrain, and its API), which were also strong contenders for the tool of choice for the rhyming portion. These could easily be the right tool for similar tasks, although in the moment it seemed like any kind of API limit wouldn’t be a great fit for messy development work.
The CMU Dictionary
The CMU Pronouncing Dictionary (“CMUdict,” they call it) is essentially just a gigantic, tab-separated list of dictionary words followed by their ARPAbet phonemes and lexical stress markers, represented as numerals at the end of each vowel sound. Even by itself, that gives the bulk of the actual content I needed, although it’s not as accessible as we will need it to be.
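To make that format concrete, here is a minimal Python sketch (not the project's own code, which used PHP and MySQL). It splits one entry into its word and phonemes, then counts syllables by counting the phonemes that end in a stress digit, since only vowel sounds carry one. The function names `parse_entry` and `count_syllables` are hypothetical, and the sample entry is illustrative, written in the dictionary's style.

```python
def parse_entry(line):
    """Split one dictionary line into (word, list of phonemes)."""
    # The word comes first; everything after the first run of whitespace
    # is the pronunciation, one ARPAbet symbol per token.
    word, pronunciation = line.strip().split(None, 1)
    return word, pronunciation.split()


def count_syllables(phonemes):
    """Count vowel sounds: only vowels end in a stress digit (0, 1 or 2)."""
    return sum(1 for p in phonemes if p[-1].isdigit())


# Illustrative entry (assumed, in CMUdict style).
word, phonemes = parse_entry("HELLO  HH AH0 L OW1")
print(word, phonemes, count_syllables(phonemes))
```

Splitting on the first run of whitespace means the same function copes with a tab or with spaces between the word and its pronunciation.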
The next trick was to convert the whole dictionary into a (world’s simplest) MySQL table, so that I could just query it the old-fashioned way. (I would love suggestions of a better way to do this. I did burn a couple of unsatisfying hours trying other query/retrieval tools I found around the web, but didn’t find anything that seemed actually easier.)
Building the Database
I am the furthest thing from a database admin, and I usually feel like I’m behind the times with tools/tasks like these. That used to feel intimidating to me, but at this point I’m finding there’s also a lot of value in just getting something working with the tools I know, and then learning from other people, in review, what tool would make this easier or more powerful the next time around.
My process: load the entire dictionary text into a text editor, and do a dead-simple search/replace of the spaces with commas, creating a sort of quick-and-easy CSV (comma-separated values) file.
(Fun fact: since the word “NULL” is one of the dictionary words, MySQL hated this on import, and quit out of the import task with an error. It took a couple of tries to figure out what was going wrong. I thought it was funny. Thankfully, it was easily solved by manually substituting “fixme” for that one word “NULL”, and then switching it back once it was in the database.)
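For anyone who would rather script the conversion than do it in a text editor, a rough Python sketch might look like the following. It is an alternative to the search/replace above, not what I actually ran. It assumes each line is a word followed by whitespace and then the pronunciation, and that comment lines start with `;;;`. The function name `dictionary_to_csv` is hypothetical. Quoting every field may also sidestep the NULL problem, depending on how the import tool treats quoted values, but that is an assumption I have not tested against phpMyAdmin.

```python
import csv
import io


def dictionary_to_csv(lines):
    """Turn dictionary lines into two-column CSV text: word, pronunciation."""
    out = io.StringIO()
    # QUOTE_ALL wraps every field in double quotes, so a word such as NULL
    # is written as a quoted string rather than a bare keyword.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blank and comment lines
            continue
        # Split only once, so the pronunciation stays together in one column.
        word, pronunciation = line.split(None, 1)
        writer.writerow([word, pronunciation])
    return out.getvalue()


# Illustrative input lines (assumed, in CMUdict style).
sample = [";;; a comment line", "NULL  N AH1 L", "HELLO  HH AH0 L OW1"]
print(dictionary_to_csv(sample))
```

Keeping the pronunciation in a single field matches the two-column table layout: one column for the word, one for its phonemes.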
On the database side of things, I set up a simple two-column table called `words`: one column for the word itself, and another for the phonetic/lexical stress value. That gives us our basic structure, and maps it easily to the simple CSV format. From there, phpMyAdmin’s import tool is happy enough (after that whole ‘fixme’ thing) to load the file into the database.
This isn’t quite enough by itself. To make it properly editable, the table still needs a unique-key ID column, which is easy enough to add on after the fact. (I do this after importing the CSV, so that I can skip the process of manually adding IDs to each row in my text file.) MySQL is happy enough to add that in one query:
So, with that finished, we now have an easy, searchable database that’s happy to let you find either exact matches, or partial matches, with queries such as:
SELECT * FROM `words` WHERE `w_word` = 'searchterm'
Or, it can return partial matches (with the query syntax):
SELECT * FROM `words` WHERE `w_word` LIKE '%searchterm%'
(And so forth.)
This also lets us use the ‘%’ wildcard at either the beginning or the end (instead of both), to find words that just begin or end with our search terms. (That ends up becoming important to searches for rhyme. More on that later.)
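The three wildcard placements can be tried out with a small self-contained sketch. This one uses Python with an in-memory SQLite table standing in for the MySQL `words` table, so it is an approximation of the setup rather than the real thing. The column `w_word` comes from the queries above; `w_id`, `w_phonemes`, the helper `find_words`, and the sample rows are all assumptions made for illustration. The search pattern is passed as a bound parameter instead of being pasted into the SQL string, which is the safer habit once search terms come from users.

```python
import sqlite3

# In-memory SQLite table standing in for the MySQL `words` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (w_id INTEGER PRIMARY KEY, w_word TEXT, w_phonemes TEXT)"
)
conn.executemany(
    "INSERT INTO words (w_word, w_phonemes) VALUES (?, ?)",
    [  # illustrative rows (assumed)
        ("NATION", "N EY1 SH AH0 N"),
        ("NATIONAL", "N AE1 SH AH0 N AH0 L"),
        ("DONATION", "D OW0 N EY1 SH AH0 N"),
    ],
)


def find_words(pattern):
    """Return words matching a LIKE pattern, passed as a bound parameter."""
    rows = conn.execute(
        "SELECT w_word FROM words WHERE w_word LIKE ? ORDER BY w_word",
        (pattern,),
    )
    return [row[0] for row in rows]


term = "NATION"
print(find_words(term + "%"))        # words that begin with the term
print(find_words("%" + term))        # words that end with the term
print(find_words("%" + term + "%"))  # words that contain the term anywhere
```

The same `LIKE` patterns carry over to MySQL unchanged; only the connection code and the placeholder style would differ.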
By itself, this database adequately tackles the single largest item on the project’s to-do list in a few steps. (And, in my mind, that seemed like a huge dragon to slay, so I was already happy at this point.)
From there, it’s easy enough to store a few generic queries that will get us most of the search/retrieval functionality we’ll need. Those queries can then be stuffed into a PHP script or three, which we can then add into a web form, feeding it with input search terms via $_GET or $_POST variables:
Quick and ugly, but this was already enough functionality to let us access this from a web form and see the results. And it’s almost enough to then turn into an Ajax version, which we can then query in real time, as often as we need, as users type their search terms.
More on that part next.