So apparently I really like working on lexical and phonetic analysis. Who knew? And apparently I like it to the point that when I finally had a weekend with some spare time to (at long last) play The Witcher 3, I instead found myself sitting at my desk working on an algorithm to split words up by their syllables and vowel sounds. For hours. Having fun.
And I guess there’s no reason it shouldn’t be fun. By the time I was fully into the swing of things (which happened surprisingly quickly), it felt like a puzzle. And, since this was about words in the English language, it was even a puzzle where I was already pretty familiar with all of the pieces.
So, the background:
Without saying too much about the project itself (I’m leaving that to be the researcher’s privilege to announce and document as she likes), we’re brainstorming the early stages of a project at Hamilton that would both analyze a particular type of poetry and give its readers the chance to create some of their own.
As with most poetry, this means there’s a particular set of rules (which are also a fun puzzle to sort out, programmatically) regarding the form, rhyme, meter, etc., of these works.
My job, to get things started, was therefore to start thinking of ways that we can essentially ask a web application, in real time, to look at either a word or a whole string of words (a line, couplet, etc.), and get some of these bits of information back.
Well, lucky for me, these lexical features are available at least in part through the excellent CMU Pronouncing Dictionary, which can tell you (almost) any English word’s phonetic sounds and emphasis. And while that doesn’t tell us the number of syllables in the word, or provide rhymes, having the rest of that information actually gets us a lot closer than it might seem.
The first hurdle was making this available to a webpage as something that I could query with reckless abandon. So, while their page shows a searchable input box (which returns the sort of thing you’d hope for), there was no obvious way to set that kind of searchable system up for yourself. (And, me being my impatient self, I didn’t ask them for their solution.)
Before I go on, I should also give another positive mention here to Steve Hanov and his “A Rhyming Engine” (now turned into the mightier RhymeBrain, and its API), which were also strong contenders for the tool of choice, regarding the rhyming portion. (I did reach out to Steve, who kindly responded with the suggestion of trying out that API for my purposes. I didn’t end up going that route, but that’s just the control freak in me: part of me wanted to figure some of this stuff out for myself, and part of me wanted a tool that I could hammer away at, without API call limitations.)
The CMU Dictionary
The CMU Pronouncing Dictionary (“CMUdict”) is essentially just a gigantic (tab-separated) text list of dictionary words followed by their ARPAbet phonemes and lexical stress markers (represented as numerals at the end of the vowel sounds). So, while that right there is the bulk of the content I think this task needs, it’s not exactly as accessible as we will need it to be.
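To make that format concrete, here is a minimal Python sketch of pulling one dictionary line apart. The sample line and the helper names (`parse_cmudict_line`, `stress_pattern`, `syllable_count`) are my own illustrations, not part of CMUdict or of this project's code. It assumes each line is a word followed by whitespace-separated phonemes, and that each vowel phoneme carries exactly one trailing stress digit, so counting those digits gives a rough syllable count.

```python
def parse_cmudict_line(line):
    """Split one dictionary line into (word, list of phonemes)."""
    # Split once on any run of whitespace, so tabs or spaces both work.
    word, pronunciation = line.strip().split(None, 1)
    return word, pronunciation.split()


def stress_pattern(phonemes):
    """Collect the trailing stress digit of every vowel phoneme."""
    return [int(p[-1]) for p in phonemes if p[-1].isdigit()]


def syllable_count(phonemes):
    """One stress-marked vowel per syllable (an assumption about the data)."""
    return len(stress_pattern(phonemes))


# Illustrative line in the dictionary's word-plus-phonemes layout.
word, phonemes = parse_cmudict_line("POETRY  P OW1 AH0 T R IY0")
pattern = stress_pattern(phonemes)
syllables = syllable_count(phonemes)
```

A real file may also contain comment lines or alternate pronunciations, which this sketch does not handle.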
So, for my next trick, I simply converted this whole dictionary into the world’s simplest MySQL table, so that I could just query it the old-fashioned way. (I’d love suggestions of a better way to do this. I did burn a couple of unsatisfying hours trying other tools I found around the web, to equally unsatisfying ends.)
Disclaimer: I am the furthest thing from a database admin, and am usually quite far behind the times on the easiest or sexiest tools for jobs like these. I used to be pretty intimidated by that, but at this point I’m finding the value in it, which is using approaches like these, describing them to people such as yourselves, and hearing what tool would make this a thousand times easier, or more powerful, the next time around. (So, let’s hear them, this time!) In the meantime, it’s nice to know that at least I can accomplish the task, and probably appreciate the power of better tools all the better for knowing how clunky approaches like these really are.
My process: load this entire dictionary text into a text editor (I’ve been using the surprisingly excellent Visual Studio Code for this project, and for all projects on Mac recently), and literally just search/replace the spaces with commas, creating a sort of quick-and-easy CSV (comma-separated values) file.
(Fun fact: since the word “NULL” is one of the dictionary words, MySQL hates this on its import, and quits out of the import with an error. I thought it was funny. Thus, the manual substitution of “NULL” with “fixme”, which I later, of course, fix.)
On the database side of things, I set up a dead-simple two-column table called `words` that had a column for the word itself, and another for the phonetic/lexical stress value. That gives us our basic structure that maps to this simple CSV, and from there MySQL is happy enough (after that “fixme” substitution) to let you load it in via phpMyAdmin’s “import” tool.
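For anyone who would rather script the load than hand-edit a CSV, here is a rough sketch of the same two-column idea. It uses Python's built-in `sqlite3` as a stand-in for MySQL, so treat it as an analogy rather than what I actually ran. The `w_phonemes` column name, the `load_words` helper, and the sample lines are all made up for illustration. Because the values are passed as bound parameters, the word “NULL” goes in as an ordinary string here.

```python
import sqlite3

# Illustrative lines in the dictionary's word-plus-phonemes layout.
SAMPLE_LINES = [
    "NULL  N AH1 L",
    "POETRY  P OW1 AH0 T R IY0",
    "RHYME  R AY1 M",
]


def load_words(conn, lines):
    """Create the two-column table and insert one row per dictionary line."""
    conn.execute("CREATE TABLE words (w_word TEXT, w_phonemes TEXT)")
    # Split each line once: first token is the word, the rest is the phonemes.
    rows = [tuple(line.strip().split(None, 1)) for line in lines]
    conn.executemany(
        "INSERT INTO words (w_word, w_phonemes) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_words(conn, SAMPLE_LINES)
null_row = conn.execute(
    "SELECT w_phonemes FROM words WHERE w_word = ?", ("NULL",)
).fetchone()
```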
This isn’t quite enough by itself. To make it properly editable, the table still needs a unique-key ID column, which is easy enough to add on after the fact. (I do this after importing the CSV, so that I don’t have to dream up some annoying solution to manually adding IDs to each row in my text file.) MySQL is happy enough to add that in one query.
That query being:
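In MySQL this kind of change is normally a single `ALTER TABLE ... ADD ... AUTO_INCREMENT PRIMARY KEY` statement. As a runnable illustration of the same idea, here is a sketch using Python's `sqlite3` instead. SQLite cannot bolt a primary key onto an existing table, so the sketch copies the rows into a new table that has one. The `w_id` and `w_phonemes` column names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w_word TEXT, w_phonemes TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [("NULL", "N AH1 L"), ("POETRY", "P OW1 AH0 T R IY0"), ("RHYME", "R AY1 M")],
)

# Rebuild the table with an auto-incrementing key, keeping the import order.
conn.executescript(
    """
    CREATE TABLE words_with_id (
        w_id INTEGER PRIMARY KEY AUTOINCREMENT,
        w_word TEXT,
        w_phonemes TEXT
    );
    INSERT INTO words_with_id (w_word, w_phonemes)
        SELECT w_word, w_phonemes FROM words ORDER BY rowid;
    DROP TABLE words;
    ALTER TABLE words_with_id RENAME TO words;
    """
)

rows = conn.execute("SELECT w_id, w_word FROM words ORDER BY w_id").fetchall()
```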
So, with that finished, we now have a nice little searchable database that’s happy to let you find either exact matches, or partial matches, with queries such as:
SELECT * FROM `words` WHERE `w_word` = 'searchterm';
or, for partial matches (with the query syntax):
SELECT * FROM `words` WHERE `w_word` LIKE '%searchterm%';
(And so forth.) This also lets us use those ‘%’ wildcards at either only the beginning or only the end, to find words that just begin or end with our search terms. (That becomes big on searches for rhyme. More on that later.)
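Here is a small sketch of those wildcard placements, again using Python's `sqlite3` as a stand-in for MySQL. The `find_words` helper, the `w_phonemes` column name, and the sample rows are illustrative assumptions. The last query guesses at how the rhyme search might go (matching on the tail of the phoneme string), which is not something I have described yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w_word TEXT, w_phonemes TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [
        ("RHYME", "R AY1 M"),
        ("RHYMING", "R AY1 M IH0 NG"),
        ("PRIME", "P R AY1 M"),
        ("TIMER", "T AY1 M ER0"),
    ],
)


def find_words(conn, column, pattern):
    """Return words whose given column matches a LIKE pattern."""
    # Column names cannot be bound as parameters, so whitelist them.
    if column not in ("w_word", "w_phonemes"):
        raise ValueError("unknown column: " + column)
    sql = f"SELECT w_word FROM words WHERE {column} LIKE ? ORDER BY w_word"
    return [row[0] for row in conn.execute(sql, (pattern,))]


starts_with = find_words(conn, "w_word", "RHYM%")    # wildcard at the end
ends_with = find_words(conn, "w_word", "%IME")       # wildcard at the start
same_tail = find_words(conn, "w_phonemes", "%AY1 M") # shared final sounds
```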
Mercifully, this is probably the biggest single line on the project’s to-do list, sorted out (well enough) in a few steps. (And, in my mind, I had made that part into quite the dragon to slay, so I was smiling at this point already, which is always nice after only an hour or two.)
From here, it’s easy enough to jot down a few generic queries that will get us most of the search/retrieval functionality we’ll need, and then start stuffing those into a PHP script or three, which we’ll feed words into via $_GET or $_POST variables:
Quick and ugly, but it’s already enough functionality to let us access this from a webpage and see the results. (And almost enough to soon turn into an Ajax version that we can query in real time, as often as we need, to look up words as the user types them.)
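For comparison, here is roughly the same lookup idea sketched in Python rather than PHP, with `sqlite3` standing in for MySQL. The `lookup` function, the `word` parameter name, the JSON shape, and the `w_phonemes` column are all assumptions for illustration, not the project's actual script. It also assumes the dictionary words are stored in uppercase, as in the sample row.

```python
import json
import sqlite3
from urllib.parse import parse_qs


def lookup(conn, query_string):
    """Read a 'word' parameter from a query string and return JSON matches."""
    params = parse_qs(query_string)
    word = params.get("word", [""])[0].strip().upper()
    if not word:
        return json.dumps({"error": "no word given"})
    # Bound parameter keeps user input out of the SQL text.
    rows = conn.execute(
        "SELECT w_word, w_phonemes FROM words WHERE w_word = ?", (word,)
    ).fetchall()
    matches = [{"word": w, "phonemes": p} for w, p in rows]
    return json.dumps({"word": word, "matches": matches})


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w_word TEXT, w_phonemes TEXT)")
conn.execute("INSERT INTO words VALUES (?, ?)", ("POETRY", "P OW1 AH0 T R IY0"))

found = json.loads(lookup(conn, "word=poetry"))
missing = json.loads(lookup(conn, "word=w00t"))
empty = json.loads(lookup(conn, ""))
```

Returning JSON keeps the door open for the real-time version, since a page script can parse the response directly.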
w00t. (Which, by the way, is a word that is strangely not in the dictionary. Weird.)