So, last time we got some rhymes working, which was lovely. While rhyming is a huge part of enforcing the structure of the poems we’re going to be creating, an equally important part is meter.
Which means we’ve got some syllables to count.
Now, fair warning that for this version of this (eventual) tool, we’re setting the bar fairly low and only counting syllables. This is partially because (mercifully) the structure of poem we’re analyzing actually doesn’t enforce stress patterns (like, say, the alternating stress “da DA da DA…” of an iambic pentameter sonnet, etc.). But it’s also partially because we have to start somewhere, and syllables are a big enough question in themselves for now.
For the most part, CMUdict saves our day again, in that we can rely on that (great) work to derive syllable counts for every word in its dictionary. It isn’t quite as simple as that (as I’ll cover below), but it’s a straightforward enough process to get what we need from it.
As I mentioned in the last section, we’re able to derive the number of syllables in a CMUdict word easily enough, simply by counting the vowel sounds. (Briefly: a simple regular expression run against the phonetic column lets us count how many times we find numerals, which only occur as markers for the lexical stress of a vowel sound. Find a number, you’ve found a vowel and, thus, a syllable.)
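To make that concrete, here is a minimal Python sketch of the digit-counting idea. It is not the code used for this project; the function name is made up for illustration, and the sample pronunciation strings are written in CMUdict's style and should be treated as illustrative.

```python
import re


def count_syllables(pronunciation: str) -> int:
    """Count syllables in a CMUdict-style pronunciation string.

    Digits (0, 1, 2) only appear as stress markers on vowel phonemes,
    so the number of digits equals the number of vowel sounds.
    """
    return len(re.findall(r"\d", pronunciation))


# Illustrative pronunciation strings in CMUdict's format
print(count_syllables("AE2 B R AH0 K AH0 D AE1 B R AH0"))  # "abracadabra"
print(count_syllables("AH0 B R EY1 S IH0 V"))              # "abrasive"
```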
But the problem is that you don’t exactly want to be splitting and reg-ex-ing every word every time you want to count its syllables and, worse, you can’t do that to every entry in the database every time you want to find a word of a particular syllable count. In other words, it’s not easily query-able information. And we want it to be.
My solution, simply enough, was to just derive that information once, for every word in the database, and store it there permanently as another column.
Time for some brute force.
This is far from glamorous code, I know, but this actually gets the job done quite nicely. Simply put, we select every line from that dictionary (as in, all 133,803 of them!), split each result’s pronunciation field into an array on spaces, and then pattern match each of the word’s phonetic segments for numerals. Every time we find a numeral, we increment a “syllable count” for that word, and then echo the whole database result again along with that new syllable count, setting it up as a new fourth column in what will become a new CSV file we can import, wiping out our old table for this new and improved one.
This is obviously as “brute force” as it gets, and probably something that people more familiar with SQL could do with a query. I am not those people. So, for now, for me, this works!
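For anyone following along without the original script, the same brute-force pass can be sketched in a few lines of Python. This is a rough stand-in rather than the script described above: the three-column row layout (id, word, pronunciation), the function name, and the sample rows are all assumptions made for the example.

```python
import csv
import io
import re


def add_syllable_column(rows):
    """Yield each (id, word, pronunciation) row with a syllable count appended."""
    for row_id, word, pronunciation in rows:
        syllables = 0
        # Split the pronunciation into phonetic segments on spaces
        for segment in pronunciation.split(" "):
            # A numeral marks the stress of a vowel sound, i.e. one syllable
            if re.search(r"\d", segment):
                syllables += 1
        yield (row_id, word, pronunciation, syllables)


# Hypothetical sample rows standing in for the full dictionary table
sample_rows = [
    (1, "abracadabra", "AE2 B R AH0 K AH0 D AE1 B R AH0"),
    (2, "abrasive", "AH0 B R EY1 S IH0 V"),
]

# Write the widened rows out as CSV, ready to import as a new table
output = io.StringIO()
csv.writer(output).writerows(add_syllable_column(sample_rows))
print(output.getvalue())
```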
The result is a big ugly screen of output whose source code, thankfully, is at least less ugly:
The fourth “column” (the number after the third comma) is our new syllable count, ready to SELECT on in MySQL. (And, fitting that this screenshot contains both “abracadabra” and “abrasive,” right?)
Ugly and brute force or not, in about a half-hour’s worth of work, we now have an effortlessly query-able syllable count for every word in this dictionary. So, that’s our “syllable counting” needs settled, then, right?
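Here is roughly what “query-able” buys us, sketched in Python with an in-memory SQLite database standing in for MySQL. The table name, column names, and sample rows are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for the real MySQL table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (word TEXT, pronunciation TEXT, syllables INTEGER)"
)
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [
        ("abracadabra", "AE2 B R AH0 K AH0 D AE1 B R AH0", 5),
        ("abrasive", "AH0 B R EY1 S IH0 V", 3),
    ],
)


def words_with_syllables(connection, count):
    """Return every word with exactly `count` syllables, alphabetically."""
    cursor = connection.execute(
        "SELECT word FROM words WHERE syllables = ? ORDER BY word", (count,)
    )
    return [row[0] for row in cursor]


print(words_with_syllables(conn, 3))
```

With the count stored as its own column, finding words of a given length is a plain WHERE clause instead of a regex pass over every pronunciation.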
Well, no. Not exactly.
What if our word’s not in the database?
So, in the case of rhyming, it’s probably understandable enough that not every word you look for is going to be able to give you a good rhyme.
(“Orange,” for instance. Heh.)
But, while failing to find a rhyme doesn’t necessarily “break” the application, not being able to get a syllable count for a line theoretically would. The user could count their lines’ syllables by hand, sure, but they could have done that with a pen and paper, too. This is an application meant to help them structure their own poems!
(And, off the top of my head, I’d say rhyming is at least more subjective than meter, even if syllable counts are a bit slippery in and of themselves. Maybe?)
So, we need a plan B on syllables, if we’re going to give the user at least a good “best guess” at the number of syllables in their lines. My not-nearly-as-easy-as-it-seemed-when-I-thought-of-it answer: write an algorithm that parses “any” word for its syllable count.
This, it turns out, is a big job. (Who’d have guessed?!)
What I’ll describe below is what I came up with after about 3-4 hours of revising, finally calling it “good enough for now,” with the absolute certainty of needing to come back and finesse it many times over in the future.
One thing I’m trying to keep in mind, as well: there’s also a definite “diminishing returns” problem past a certain point, given that even the duct-taped madness I’ve built so far is actually doing a pretty admirable job right now (about 90+-ish% accuracy, according to some also-questionably-accurate testing).
Probably the easiest way to present this is to document the process that I went through, thinking and adding rules to this. So, to start with, the easy parts:
- Borrowing from our approach to parsing CMUdict’s entries for syllables, the biggest and easiest single step toward getting a reasonable syllable guess would be: “count the vowels”
  From Wikipedia, “a vowel is a speech sound made by the vocal cords,” which, of course, lines up pretty well with the definition of syllable, where “a syllable is typically made up of a syllable nucleus (most often a vowel) with optional initial and final margins (typically, consonants).”
  So, if we were to take, say, “alphabet”: well, there we go. a • lph • a • b • e • t. Three vowels makes three syllables. Cool!
  Well, that was easy. We done here?
  What do you mean, “no?”
- So, “a • b • o • u • t” that rule…
  It’s easy to forget those pesky diphthongs (even for as fun of a name as they have). And, of course, they’re everywhere and, worse, they’re crazy inconsistent.
  Disclaimer: The people that actually study these things are probably trying to reach through the screen and strangle me right now for approaching this so amateurishly. But the truth is, my entire English program (probably unsurprisingly) didn’t involve more than the occasional toe dipped into the waters of actual linguistics. So, for as much as I love language (and syntax, and morphology, and etymology, and I’m going to stop listing things now) and all of that fun stuff (and I actually really do), mine is unfortunately the starry-eyed fascination of the complete amateur. But we’re going to make it work anyway!
  I tend to be this sort of unreasonable “renaissance” person, notoriously bad at consulting the subject experts first. So, consider this whole info-dump to be my “baited web” here, bringing more knowledgeable people to me by frustrating them on the internet until they speak up to correct me.
  Back to the point, we need rules that cover these little pairs. The easiest rule, to start with (and again, a thoroughly broken rule to use by itself), is: “ignore (as in, don’t count as a syllable) a vowel that follows a vowel”
  There. Finished!
- Except that, applied too broadly, that last rule becomes kind of an “i • d • io • t • i • c” rule, of course. (And I’m picking on “i” there as the prime offender of this rule.) With that in mind, my next step was to add on to the previous rule, as if in apology, simply: “…unless that last vowel was an i.”
  At this point, it was already doing a much better job. (I hadn’t thought to record my tests at each step yet, and started only MUCH later into this process. I might still retroactively do this, actually, by taking rules apart from my algorithm; I think the test data would be interesting.)
- What’s that thing about “except after C?” Well, it works here too.
  Take “s • o • c • i • a • l” for instance. That “c” just cost us a syllable, and opened this algorithm up to something it inevitably needed all along, of course: awareness of a couple-or-few letters in each direction of the letter we’re looking at.
  Now we can say something like: “two vowels in a row is one syllable, unless the first vowel is an i… unless this vowel is an a, following an i, following a c”
  This is obviously starting to get harder, and while I’d love to add something reassuring here about eventually finding some elegant underlying simplicity, the truth is, I didn’t. (And I don’t think there is one.)
  English is, of course, a language derived from and constantly incorporating a great many other languages, with a great many rules, and obviously many of these differ wildly on a case-by-case basis.
  So, you might be reading that last rule and (quite correctly) shaking your head saying “what about words like ‘associate’ vs ‘associative’” and even, between those words, the different acceptable syllable counts for them?
Well, to that, I’d say there are really three ways to look at this:
The first, and hardest: we could start flagging certain tricky vowels/strings as “multiple possibility” syllables, and recording multiple values for these, giving the end user some way to choose between them, to appease some strict syllabic enforcement in our final application… or…
The second, and most lenient: to just not enforce these things. We could far more easily just let this algorithm be what it really is: our best reasonable guess. If it’s wrong, it’s not going to be crazy wrong, it’s just going to… I dunno, wiggle a bit?, on little rules and oddities like “is ‘comfortable’ three syllables or four?”
The third, and most reassuring: to remember that this tool is actually just a last-resort supplement to a dictionary that has already solved this question the proper way (at great time and human effort): by cataloguing these words by hand, accounting for the inconsistencies and multiple possibilities. We can and will still look up each word a user types in that dictionary, and will get a far more definitive answer for any and all of the 133,000-odd words.
(And, we will simply ask for the user’s patience when they want to write a poem that includes the allegedly single-syllable word “pwnage,” and live with a “syllable warning,” or something, for that line.)
(Hey, I’m satisfied if you are.)
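Putting the pieces so far together, here is a rough Python sketch of the first few rules from the list above (count the vowels, skip a vowel that follows a vowel, unless the previous vowel was an i, unless that makes “cia”), plus the dictionary-first lookup with a warning flag. This is an illustration under assumptions, not the actual algorithm: the function names are invented, a plain dict stands in for the database, and “y” is not treated as a vowel.

```python
VOWELS = set("aeiou")


def guess_syllables(word: str) -> int:
    """Best-guess syllable count using only the first few letter rules."""
    word = word.lower()
    count = 0
    for i, letter in enumerate(word):
        if letter not in VOWELS:
            continue
        previous = word[i - 1] if i > 0 else ""
        if previous not in VOWELS:
            # Rule 1: a vowel after a consonant (or at the start) counts
            count += 1
        elif previous == "i":
            # Rule 3: a vowel after an "i" still counts ("idiotic")...
            # Rule 4: ...unless it is the "a" in "cia" ("social")
            if not (letter == "a" and i >= 2 and word[i - 2] == "c"):
                count += 1
        # Rule 2: any other vowel that follows a vowel is skipped ("about")
    return count


def count_syllables(word: str, dictionary: dict) -> tuple:
    """Return (count, is_guess): the dictionary wins, the guess is a fallback."""
    known = dictionary.get(word.lower())
    if known is not None:
        return known, False
    return guess_syllables(word), True


# Hypothetical stand-in for the syllable column in the database
dictionary = {"abracadabra": 5, "abrasive": 3}

for word in ("abrasive", "alphabet", "about", "idiotic", "social"):
    print(word, count_syllables(word, dictionary))
```

The second value in the returned pair is what a “syllable warning” could hang off of: it is only true when the word was not found and the count is a guess.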
Covering all bases
Those rules above are only the first few rules that I added. I think I’ve gotten up to about 30 or so such rules by now, with many more left to add. But those show a good overview of how that thinking process worked.
And after about 10 or 20 of those little refinements, I was actually starting to get quite happy with how well (generally speaking) it was doing.
For context, a few of the other rules that ended up coming in were things like:
- “an e at the end of the word, following an L, following a consonant, is probably a syllable” (like “stable” or “bottle”) … but if that L follows a vowel, it probably isn’t (like “tale” or “joule”)
- “if a word ends in -sm, it requires an extra syllable even without an extra vowel” (altruism, chasm)
- “if an a follows an i, which follows a t, and the three are followed by an n, (-tian-), ignore that second syllable” (martian, venetian, although with snags like faustian, etc.)
- “an i before an e is one syllable, except at the end of the word” (die, sortie)
And many more… (punctuation, pluralization, tense, etc.). The source code will let you stumble through the gnarled branches of its many if/else trees.
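To give a flavor of what those branches look like, the first three rules in the list above could be written as small checks like these. This is a sketch, not the project's source: the function names are invented for the example, only a, e, i, o and u are treated as vowels, and the “i before e” rule is left out because its exact behavior isn't spelled out here.

```python
VOWELS = set("aeiou")


def final_le_is_syllable(word: str) -> bool:
    """True when a final "e" follows an "l" that follows a consonant."""
    w = word.lower()
    return len(w) >= 3 and w.endswith("le") and w[-3] not in VOWELS


def sm_adds_syllable(word: str) -> bool:
    """True when the word ends in -sm, which needs an extra syllable."""
    return word.lower().endswith("sm")


def tian_is_one_syllable(word: str) -> bool:
    """True when the word contains -tian-, whose "ia" is not split in two."""
    return "tian" in word.lower()


for word in ("stable", "bottle", "tale", "joule"):
    print(word, final_le_is_syllable(word))
for word in ("altruism", "chasm"):
    print(word, sm_adds_syllable(word))
for word in ("martian", "venetian", "faustian"):
    print(word, tian_is_one_syllable(word))
```

Note that “faustian” matches the -tian- check too, which is exactly the kind of snag mentioned in that rule.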
In a way, these generalizations started to seem so problematic that I almost wanted to cancel the whole effort, wondering if it was borderline irresponsible to allow so many rules to go in when each of them had obvious exceptions and contradictions. But, again, the goal of this project and tool is to assist with poem creation, not to enforce it.
And, to my surprise again, keeping on with this for a few hours led to an extremely gratifying series of tests that got these words right far more often than not.
I’ll cover that testing process in the next post.