To start with a quick review of what we’re ultimately building here: the goal of this project was essentially two tools:
- A web tool that we could feed a poem, and have it analyze its rhyming and meter features, and, more importantly,
- A web tool that lets users start writing a poem, line by line, into a web form, and have the features of this poetic form suggested and/or enforced by the tool. (For example, this particular type of poem requires that the last phrase of each couplet be repeated. So, this tool will automatically add these ending phrases to the subsequent couplets. Better examples of this ahead.)
So, last time we defined some basic rhyming patterns, which covers (well enough) one of the two major tools for this project. The next step is to enforce the structure of the poems our users create, with the equally important component: meter.
This means we need to start figuring out how to count syllables.
Fair warning that for this version of the tool, we were able to set the bar at an easier level, because the particular variety of poem we’re analyzing (and writing) doesn’t actually enforce stress patterns. This would certainly be a factor in many other poetic forms, though, and would make a great next step for a tool like this, if we ever wanted to generalize it to new types. (To quickly give that thought some context, stress would mean something like the alternating pattern of iambic pentameter: “da DA da DA…,” etc.)
For the most part, CMUdict gives us what we need here once again, in that we can rely on the data to derive syllable counts for every word in its dictionary. (It isn’t quite as simple as that from the data they provide, but it was a straightforward enough process to get what we need from it. More on that next.)
Much like the previous task, we’re able to derive the number of syllables in a CMUdict word easily enough, just by counting the vowel sounds. The brief workflow overview would be:
CMUdict stores its vowel sounds with numerals attached, where each numeral indicates that syllable’s stress relative to the others. Thanks to those numerals, a regular expression over the phonetic column lets us count how many numerals appear. Since these are the only numerals in that data, if we’ve found a number, we’ve found a vowel, and, basically, vowels mean syllables.
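As a minimal sketch of that counting step (in PHP, which the “echo” later on suggests this project used; the function name and exact phonemes here are mine, not the project’s):

```php
<?php
// Count syllables in a CMUdict pronunciation by counting its stress digits.
// Digits (0, 1, 2) only ever appear attached to vowel phonemes, so the number
// of digit matches is the number of vowel sounds, i.e. syllables.
function countSyllables(string $pronunciation): int
{
    return preg_match_all('/[0-9]/', $pronunciation);
}

// "abracadabra" (phonemes approximate): five digits, so five syllables.
echo countSyllables('AE2 B R AH0 K AH0 D AE1 B R AH0'); // 5
```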
But one obvious performance problem is that you don’t exactly want to be splitting and regex-ing every word every time you want to count its syllables. And, worse, you definitely can’t do that to every entry in the database every time you want to find a word with a particular syllable count. In other words, it’s not easily query-able information. And we want it to be.
A simple enough solution was to just run a process once over the whole data set, deriving that information for every word in the database, and store it there permanently as another column. (Basically, brute force.)
This is far from glamorous code, but it got the job done easily enough. Briefly summarizing that code:
We select every line from that dictionary (all 133,803 of them!), split apart each result’s “pronunciation” field on spaces, and pattern-match each of that word’s phonetic segments for numerals. Every time we find a numeral, we increment that word’s “syllable count” variable. Finally, we print (echo) each word along with its new syllable count, formatted as CSV with the count appended as a fourth column. That gives us something we can easily save out to a CSV file and import as a database table, wiping out our old table for this new and improved one.
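A hypothetical version of that one-off script might look like the following; the table and column names (cmudict, word, variant, pronunciation) are assumptions, since the original schema isn’t shown here:

```php
<?php
// One-off, brute-force pass: re-emit the whole dictionary as CSV,
// with a derived syllable count appended as a fourth column.
$db = new PDO('mysql:host=localhost;dbname=poetry', 'user', 'password');

foreach ($db->query('SELECT word, variant, pronunciation FROM cmudict') as $row) {
    // Each stress digit marks one vowel phoneme, i.e. one syllable.
    $syllables = preg_match_all('/[0-9]/', $row['pronunciation']);
    echo "{$row['word']},{$row['variant']},{$row['pronunciation']},{$syllables}\n";
}
```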
Again, this is pretty straightforward brute force, and I’m sure that it’s something an SQL expert could do with a single query or two. (I am not that SQL expert.) But, this works.
The result is a big ugly screen of output, but the page source (which is all we were really generating) is what we need:
The fourth “column” (the number after the third comma) is our new syllable count, which we can then SELECT in MySQL. (Also, it seemed fitting that this screenshot contains both “abracadabra” and “abrasions.”)
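With the count stored as a real column, the lookup becomes a trivial query (again using the hypothetical schema from above):

```php
<?php
// Find every three-syllable word, e.g. to suggest words that fit a line's meter.
$db = new PDO('mysql:host=localhost;dbname=poetry', 'user', 'password');
$stmt = $db->prepare('SELECT word FROM cmudict WHERE syllables = :count');
$stmt->execute([':count' => 3]);
$words = $stmt->fetchAll(PDO::FETCH_COLUMN);
```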
All of that was only about a half-hour’s worth of work, and we now have an easily query-able syllable count for every word in this dictionary. I think it would be reasonable to stop here, and call that a wide enough set of countable words to help the user count the syllables in their lines.
But, the lurking question here: what if our word isn’t in the database?
This is a different problem from the rhyme-suggestion portion of the tool. Where rhyming is understood to have loose rules and words that don’t rhyme (like “orange,” as they say), meter is much stricter and harder to cheat. So, where failing to find a rhyme doesn’t really hurt anything, not being able to count the syllables in a line theoretically does.
(The user could count their lines’ syllables by hand, of course, but the goal is to make this information visible and easy, to help people write a poem — and to be able to analyze existing poems, as well.)
So, we need a plan B on syllables. We can find a way to give at least a “best guess” at the number of syllables in a line, essentially by writing an algorithm that parses any word for its syllable count.
(I feel like I should add that, looking back on this project as I edit all of this down from a blog ago, I’m really not sure this kind of guess is actually better than just leaving a user to count their own, or perhaps write their own count into a line, replacing the one the tool came up with. But it was definitely a fun rabbit hole to fall into, and I think that’s the most honest answer as to the motivation for writing all of this.)
Probably unsurprisingly, this turns out to be a big job.
I’ll warn ahead of time that everything that follows is basically a ball of yarn that only became more tangled the more I tried to work with it, but it did actually yield a tool that did what I set out to do — as in, make a worthy “best guess.” So, this is where my next few hours went, revising and reworking things, finally calling it “good enough for now.”
The general approach was to start thinking of examples, and to keep adding rules as exceptions to the last rule presented themselves. So, to start with, the easy parts:
1. Borrowing from our approach to parsing CMUdict’s entries for syllables, the biggest and easiest single step toward a reasonable syllable guess would be: “count the vowels.” This works great for a lot of words (“alphabet,” for instance, with its three vowels and three syllables). But obviously it falls apart pretty quickly (as in the words “obviously,” or “quickly”).
2. That first rule obviously forgets about compound vowels (or, the always fun-to-say “diphthong”). These, of course, are everywhere; worse, they’re super inconsistent. So, of course, we need rules that cover these pairs. A next-step-but-certainly-not-the-last-step might be to add the rule:
“ignore (as in, don’t count as a syllable) a vowel that follows a vowel.” Finished! (Not really.)
But then you get those idiosyncratic vowel pairs (like in the word “idiosyncratic”) that break that rule. (And the letter “i” quickly starts to look like a prime offender for this kind of exception.)
With that in mind, the next step was to add on to the previous rule, (too) simply:
“…unless the previous vowel was an i.”
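Sketching the rules as they stood at this point (a reconstruction, not the original code; treating “y” as a vowel is my own assumption here):

```php
<?php
// Rules so far: count the vowels, but fold a vowel that follows another
// vowel into the same syllable, unless the previous vowel was an "i."
function isVowel(string $c): bool
{
    return $c !== '' && strpos('aeiouy', $c) !== false;
}

function guessSyllables(string $word): int
{
    $count = 0;
    $prev = '';
    foreach (str_split(strtolower($word)) as $letter) {
        if (isVowel($letter) && (!isVowel($prev) || $prev === 'i')) {
            $count++; // a new vowel group, or an "i" starting its own syllable
        }
        $prev = $letter;
    }
    return max(1, $count); // every word gets at least one syllable
}

echo guessSyllables('idiosyncratic'); // 6
echo guessSyllables('quickly');       // 2
```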
At this point, it was already doing a much better job. (I wasn’t collecting performance statistics yet, but the improvement was noticeable on the small list of non-CMUdict words I was using as tests.)
And then you get into classics like “except after c,” which turns out to apply here too, in its own way. In special cases (like in the word “special”), that “c” just removed what the last rule would have called a syllable. It’s at this point that it becomes obvious that a tool like this needs some ongoing awareness of at least a few of the letters before and after the one we’re looking at. That was easy enough to add, basically by keeping a separate string that tracked the letters two in each direction as we iterated through the word.
The inevitable tangled mess of these rules is already starting to show itself at this point, when the rule starts to become:
“count any two vowels in a row as one syllable…
unless the first vowel is an i…
and unless this vowel is an a that follows an i, when that i follows a c”
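The same reconstruction, grown to match that rule, and using the two-letters-in-each-direction awareness described above:

```php
<?php
// Vowel pairs fold into one syllable, unless the first vowel is an "i,"
// unless this vowel is an "a" following an "i" that itself follows a "c."
function isVowel(string $c): bool
{
    return $c !== '' && strpos('aeiouy', $c) !== false;
}

function guessSyllables(string $word): int
{
    $letters = str_split(strtolower($word));
    $count = 0;
    foreach ($letters as $i => $letter) {
        if (!isVowel($letter)) {
            continue; // consonants never add syllables on their own
        }
        $prev  = $letters[$i - 1] ?? ''; // one letter back
        $prev2 = $letters[$i - 2] ?? ''; // two letters back
        if (!isVowel($prev)) {
            $count++; // a fresh vowel group: a new syllable
        } elseif ($prev === 'i' && !($letter === 'a' && $prev2 === 'c')) {
            $count++; // an "i" usually starts its own syllable ("idiosyncratic")...
        }
        // ...except before an "a" when that "i" follows a "c" ("special").
    }
    return max(1, $count);
}

echo guessSyllables('special'); // 2
```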
I’d love to say that I ended up finding some elegant way to simplify all of this, but the truth is that I didn’t. (And I doubt that there is one.) It really drove home for me the extent to which English is a language derived from a great many other languages, inheriting a great many rules along with them. This is something we all know, but this was a great context in which to really appreciate it.
I spent a long while just adding to and untangling these rules, settling for the best-performing set when there were unsolvable inconsistencies. And while it was fun to find these special rules (“ia except after c,” like “special”), each eventually has its own exceptions (like in the word “associate,” where that same c-i-a spans two syllables).
And, making matters worse, we can’t get too comfortable even with that approach, since there are plenty of words that offer multiple acceptable pronunciations and syllable counts (like the word “comfortable.”)
From all of this, it seems like there are really only three ways forward:
The first, and hardest: we could start flagging certain words (or even vowels/strings) as “multiple possibility” syllable counts, and giving the end user some way to choose between them.
The second, and most lenient: to just not enforce these things. We could far more easily just let this algorithm be what it really is: our best reasonable guess. If it’s wrong, it’s hopefully at least not that wrong.
The third, and most reassuring: to let this be plenty wrong, remembering that this tool is actually just a last-resort supplement to a dictionary that has already solved this question the proper way (as in, with great time and human effort) — by cataloguing these words by hand, accounting for the inconsistencies and multiple possibilities. For any of the 133,000-odd words from that dictionary, a user will get a plenty-confident count. (And even that can be wrong, as we know from multiple-pronunciation words.)
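In practice, that means the guess only ever runs as a fallback; the glue might look something like this (hypothetical names again, with guessSyllables() standing in for the heuristic sketched earlier):

```php
<?php
// Trust the dictionary's hand-cataloged count first; only guess for unknowns.
function syllableCount(PDO $db, string $word): int
{
    $stmt = $db->prepare('SELECT syllables FROM cmudict WHERE word = :word');
    $stmt->execute([':word' => strtoupper($word)]); // raw CMUdict entries are uppercase
    $syllables = $stmt->fetchColumn(); // false if the word isn't in the table

    return $syllables !== false
        ? (int) $syllables        // one of the ~133,000 cataloged words
        : guessSyllables($word);  // last resort: the algorithmic best guess
}
```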
The rules I gave as examples were only the first few of what ended up being 30+, before I eventually stopped and conceded that this is a task with no perfect answer. And, obviously, my approach was more brute force, where a linguist could probably have navigated half of it without much issue. (Or, maybe, would have known better than to even try?)
But: after the first 10 or 20 of those little refinements, I was actually starting to get quite happy with how well (generally speaking) it was doing. And the rules also became more interesting, and sometimes surprising to me, putting new context on words I’d never have thought much about. Some examples:
- an e at the end of a word, when it follows an l which follows a consonant, is probably a syllable (like “stable” or “bottle”)…
- but if that l follows a vowel, it probably isn’t (like “tale” or “joule”)
- if a word ends in -sm, it probably requires an extra syllable even without an extra vowel (“altruism,” “chasm”)
- if an a follows an i, which follows a t, and the three are followed by an n (-tian-), ignore that second vowel (“Christian,” “martian,” though with plenty of exceptions like “Faustian,” etc.)
- an i before an e is likely two syllables, except as the last letters of a word (“die,” “sortie”)
- …and many more! (Punctuation, pluralization, tense, etc.).
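Two of those ending rules, reconstructed the same way as before, as an adjustment layered on top of the basic vowel count:

```php
<?php
// Word-ending adjustments to add to a basic vowel-group count.
function endingAdjustment(string $word): int
{
    $word = strtolower($word);
    $adjustment = 0;

    // A final "e" is usually silent ("tale"), so subtract it, unless it
    // follows consonant + "l" ("stable," "bottle"), where it marks a syllable.
    if (preg_match('/e$/', $word) && !preg_match('/[^aeiou]le$/', $word)) {
        $adjustment--;
    }
    // A final "-sm" hides an extra syllable with no written vowel ("chasm").
    if (preg_match('/sm$/', $word)) {
        $adjustment++;
    }
    return $adjustment;
}
```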
The source code will let you stumble through the gnarled branches of its many if/else trees, with my apologies in advance.
The number of exceptions definitely made the whole task seem questionable. But, again, the point of this project and tool is to assist with poem creation, not to enforce it.
And, to my surprise again, keeping on with this for a few hours did at least lead to tests that got these words right far more often than wrong.
More on that testing process in the next post.