Automating Poetry, Pt. 3

So, last time we got some rhymes working, which was lovely.  While rhyming is a huge part of enforcing the structure of the poems we’re going to be creating, an equally important part is meter.

Which means we’ve got some syllables to count.

Now, fair warning that for this version of this (eventual) tool, we’re setting the bar fairly low and only counting syllables.  This is partially because (mercifully) the poem structure we’re analyzing doesn’t actually enforce stress patterns (like, say, the alternating “da DA da DA…” stress of an iambic pentameter sonnet).  But it’s also partially because we have to start somewhere, and syllables are a big enough question in themselves for now.

For the most part, CMUdict saves our day again, in that we can rely on that (great) work to derive syllable counts for every word in its dictionary.  It isn’t quite as simple as that (as I’ll cover below), but it’s a straightforward enough process to get what we need from it.

As I mentioned in the last section, we’re able to derive the number of syllables in a CMUdict word easily enough, simply by counting the vowel sounds.  (Briefly: a simple regular expression run against the phonetic column lets us count how many times we find numerals, which only occur as markers for the lexical stress of a vowel sound.  Find a number and you’ve found a vowel, and thus a syllable.)
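
To make that concrete, here is a minimal sketch of the digit-counting idea in Python.  The function name and the sample pronunciation strings are my own illustration (written in CMUdict’s ARPAbet style), not code from the actual tool.

```python
import re

# In CMUdict-style pronunciations, stress digits (0, 1, 2) are attached
# only to vowel phonemes, so each digit marks exactly one syllable.
STRESS_DIGIT = re.compile(r"\d")


def count_syllables(pronunciation: str) -> int:
    """Count syllables by counting stress digits in a pronunciation string."""
    return len(STRESS_DIGIT.findall(pronunciation))


# Illustrative pronunciations in CMUdict's notation.
abracadabra = count_syllables("AE2 B R AH0 K AH0 D AE1 B R AH0")
abrasive = count_syllables("AH0 B R EY1 S IH0 V")
print(abracadabra, abrasive)
```

Consonant phonemes like “B” or “R” carry no digit, so they simply never match.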

But the problem is that you don’t exactly want to be splitting and reg-ex-ing every word every time you want to count its syllables — and, worse, you can’t do that to every entry in the database every time you want to find a word of a particular syllable count.  In other words, it’s not easily query-able information.  And we want it to be.

My solution, simply enough, was to just derive that information once, for every word in the database, and store it there permanently as another column.

Time for some brute force.

[Image: screenshot of the brute-force script]

This is far from glamorous code, I know, but it actually gets the job done quite nicely.  Simply put, we select every row from that dictionary (as in, all 133,803 of them!), split each result’s pronunciation field on spaces into an array, and then pattern match each of the word’s phonetic segments for numerals.  Every time we find a numeral, we increment a per-word “syllable count.”  Then we echo the whole database row again along with that new count, which becomes a fourth column in a new CSV file.  We can then import that file, wiping out our old table for this new and improved one.
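
For anyone who wants the shape of that loop without the screenshot, here is a rough Python sketch of the same steps.  It is not the original script: the three-column row layout (id, word, pronunciation), the function name, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import re

HAS_DIGIT = re.compile(r"\d")


def add_syllable_column(rows):
    """Yield each (id, word, pronunciation) row with a syllable count appended."""
    for row_id, word, pronunciation in rows:
        count = 0
        # Split the pronunciation on spaces and check each phonetic segment.
        for segment in pronunciation.split(" "):
            if HAS_DIGIT.search(segment):
                count += 1
        yield (row_id, word, pronunciation, count)


# Hypothetical stand-in for "every row in the dictionary table".
sample_rows = [
    (1, "abracadabra", "AE2 B R AH0 K AH0 D AE1 B R AH0"),
    (2, "abrasive", "AH0 B R EY1 S IH0 V"),
]

# Write the enlarged rows out as CSV text, ready to import as a new table.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(add_syllable_column(sample_rows))
csv_text = out.getvalue()
print(csv_text, end="")
```

The count lands after the third comma of each line, which is the “fourth column” described above.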

This is obviously as “brute force” as it gets, and probably something that people more familiar with SQL could do with a query.  I am not those people.  So, for now, for me, this works!

The result is a big ugly screen of output whose source code, thankfully, is at least less ugly:

[Image: screenshot of the output’s page source]

The fourth “column” (the number after the third comma) is our new syllable count, ready to SELECT on in MySQL. (And, fitting that this screenshot contains both “abracadabra” and “abrasive,” right?)
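
Here is a small sketch of what “ready to SELECT on” buys us.  The post uses MySQL; to keep the example self-contained I’m swapping in Python’s built-in sqlite3 module with an in-memory database, and the table name, column names, and rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words "
    "(id INTEGER, word TEXT, pronunciation TEXT, syllables INTEGER)"
)
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?, ?)",
    [
        (1, "abracadabra", "AE2 B R AH0 K AH0 D AE1 B R AH0", 5),
        (2, "abrasive", "AH0 B R EY1 S IH0 V", 3),
        (3, "about", "AH0 B AW1 T", 2),
    ],
)


def words_with_syllables(connection, count):
    """Return every word whose stored syllable count equals `count`."""
    cursor = connection.execute(
        "SELECT word FROM words WHERE syllables = ? ORDER BY word", (count,)
    )
    return [row[0] for row in cursor.fetchall()]


print(words_with_syllables(conn, 3))
```

Because the count is stored once, finding words of a given length is a plain filter on a column, with no splitting or pattern matching at query time.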

Ugly and brute force or not, in about a half-hour’s worth of work, we now have an effortlessly query-able syllable count for every word in this dictionary.  So, that’s our “syllable counting” needs settled, then, right?

…right?

Well, no.  Not exactly.

What if our word’s not in the database?

So, in the case of rhyming, it’s probably understandable enough that not every word you look for is going to be able to give you a good rhyme.

[Image: screenshot of a rhyme search]

(“Orange,” for instance. Heh.)

But, while failing to find a rhyme doesn’t necessarily “break” the application, not being able to get a syllable count for a line theoretically would.  The user could count their lines’ syllables by hand, sure, but they could have done that with a pen and paper, too.  This is an application meant to help them structure their own poems!

(And, off the top of my head, I’d say rhyming is at least more subjective than meter, even if syllable counts are a bit slippery in and of themselves.  Maybe?)

So, we need a plan B on syllables, if we’re going to give the user at least a good “best guess” at the number of syllables in their lines.  My not-nearly-as-easy-as-it-seemed-when-I-thought-of-it answer: write an algorithm that parses “any” word for its syllable count.

This, it turns out, is a big job.  (Who’d have guessed?!)

What I’ll describe below is what I came up with after about 3-4 hours of revising, finally calling it “good enough for now,” with the absolute certainty of needing to come back and finesse it many times over in the future.

One thing I’m trying to keep in mind, as well: there’s also a definite “diminishing returns” problem past a certain point, given that even the duct-taped madness I’ve built so far is actually doing a pretty admirable job right now (about 90+-ish% accuracy, according to some also-questionably-accurate testing).

Probably the easiest way to present this is to document the process that I went through, thinking and adding rules to this.  So, to start with, the easy parts:

  1. Borrowing from our approach to parsing CMUdict’s entries for syllables, the biggest and easiest single step toward getting a reasonable syllable guess would be:

    “count the vowels”

    From Wikipedia, “a vowel is a speech sound made by the vocal cords,” which, of course, lines up pretty well with the definition of syllable, where “a syllable is typically made up of a syllable nucleus (most often a vowel) with optional initial and final margins (typically, consonants).”

    So, if we were to take, say, “alphabet”… well, there we go.  a • lph • a • b • e • t.  Three vowels makes three syllables.  Cool!

    Well, that was easy.  We done here?

    What do you mean, “no?”

  2. So, “a • b • o • u • t” that rule…

    It’s easy to forget those pesky diphthongs (even for as fun of a name as they have).  And, of course, they’re everywhere, and, worse, they’re crazy inconsistent.

    Disclaimer: The people that actually study these things are probably trying to reach through the screen and strangle me right now for approaching this so amateurishly.  But the truth is, my entire English program (probably unsurprisingly) didn’t involve more than the occasional toe dipped into the waters of actual linguistics.  So, for as much as I love language (and syntax, and morphology, and etymology, and I’m going to stop listing things now) and all of that fun stuff (and I actually really do), mine is unfortunately the starry-eyed fascination of the complete amateur.  But we’re going to make it work anyway!

    I tend to be this sort of unreasonable “renaissance” person, notoriously bad at consulting the subject experts first.  So, consider this whole info-dump to be my “baited web” here, bringing more knowledgeable people to me by frustrating them on the internet until they speak up to correct me.

    Back to the point, we need rules that cover these little pairs.  The easiest rule, to start with (and again, a thoroughly broken rule to use by itself), is:

    “ignore (as in, don’t count as a syllable) a vowel that follows a vowel”

    There.  Finished!

    …<cough.>

  3. Except that, applied too broadly, that last rule becomes kind of an “i • d • io • t • i • c” rule, of course.  (And I’m picking on “i” there as the prime offender of this rule.)

    With that in mind, my next step was to add on to the previous rule, as if in apology, simply:

    “…unless that last vowel was an i.”

    At this point, it was already doing a much better job.  (I hadn’t thought to record my tests at each step yet, and started only MUCH later into this process.  I might still retroactively do this, actually, by taking rules apart from my algorithm; I think the test data would be interesting.)

  4. What’s that thing about “except after C?”  Well, it works here too.

    Take “s • o • c • i • a • l” for instance.  That “c” just cost us a syllable, and opened this algorithm up to something it inevitably needed all along, of course: awareness of a couple-or-few letters in each direction of the letter we’re looking at.

    Now we can say something like:

    “two vowels in a row is one syllable, unless the first vowel is an i… unless this vowel is an a, following an i, following a c”

    This is obviously starting to get harder, and while I’d love to add something reassuring here about eventually finding some elegant underlying simplicity, the truth is, I didn’t.  (And I don’t think there is one.)

    English is, of course, a language derived from and constantly incorporating a great many other languages, with a great many rules, and obviously many of these differ wildly on a case-by-case basis.

    So, you might be reading that last rule and (quite correctly) shaking your head saying “what about words like ‘associate’ vs ‘associative’” (and even between those words, the different acceptable syllable counts for them)?

    Well, to that, I’d say there are really three ways to look at this:

    The first, and hardest: we could start flagging certain tricky vowels/strings as “multiple possibility” syllables, and recording multiple values for these, giving the end user some way to choose between them, to appease some strict syllabic enforcement in our final application… or…

    The second, and most lenient: to just not enforce these things.  We could far more easily just let this algorithm be what it really is: our best reasonable guess.  If it’s wrong, it’s not going to be crazy wrong, it’s just going to… I dunno, wiggle a bit?, on little rules and oddities like “is ‘comfortable’ three syllables or four?”

    The third, and most reassuring: to remember that this tool is actually just a last-resort supplement to a dictionary that has already solved this question the proper way (at great time and human effort) — by cataloguing these words by hand, accounting for the inconsistencies and multiple possibilities.  We can and will still look up each word a user types in that dictionary, and will get a far more definitive answer for any and all of the 133,000-odd words.

    (And, we will simply ask for the user’s patience when they want to write a poem that includes the allegedly single-syllable word “pwnage,” and live with a “syllable warning,” or something, for that line.)

    (Hey, I’m satisfied if you are.)
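
The four rules above, plus the “dictionary first, guess second” arrangement, can be sketched roughly as follows.  This is a minimal Python reading of the rules as stated, not the tool’s real source: the function names, the choice to treat only a, e, i, o, u as vowels, and the tiny `known_counts` dictionary standing in for the CMUdict table are all assumptions.

```python
# Assumption: only these five letters count as vowels ("y" is ignored).
VOWELS = frozenset("aeiou")


def guess_syllables(word: str) -> int:
    """Best-guess syllable count using only rules 1 through 4."""
    letters = word.lower()
    count = 0
    for i, ch in enumerate(letters):
        if ch not in VOWELS:
            continue
        prev = letters[i - 1] if i >= 1 else ""
        prev2 = letters[i - 2] if i >= 2 else ""
        if prev not in VOWELS:
            # Rule 1: a vowel that starts a vowel run is a syllable.
            count += 1
        elif prev == "i":
            # Rule 3: a vowel after "i" counts as its own syllable...
            # Rule 4: ...unless it is an "a" after "i" after "c" ("social").
            if not (ch == "a" and prev2 == "c"):
                count += 1
        # Rule 2: any other vowel directly after a vowel is not counted.
    return count


def syllables_for(word: str, known_counts: dict) -> tuple:
    """Return (count, from_dictionary): look the word up, else guess."""
    key = word.lower()
    if key in known_counts:
        return known_counts[key], True
    return guess_syllables(key), False


# Hypothetical stand-in for counts already stored in the database.
known_counts = {"comfortable": 4}

for sample in ("alphabet", "about", "idiotic", "social"):
    print(sample, guess_syllables(sample))
print(syllables_for("Comfortable", known_counts))
print(syllables_for("idiotic", known_counts))
```

The second value returned by `syllables_for` is there so the application can tell a looked-up count from a guess, which is where something like a “syllable warning” could hang.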

Covering all bases

Those rules above are only the first few rules that I added — I think I’ve gotten up to about 30? such rules by now, with many more left to add.  But those show a good overview of how that thinking process worked.

And after about 10 or 20 of those little refinements, I was actually starting to get quite happy with how well (generally speaking) it was doing.

For context, a few of the other rules that ended up coming in were things like:

    • “an e at the end of the word, following an L, following a consonant, is probably a syllable” (like “stable” or “bottle”) … but if that L follows a vowel, it probably isn’t (like “tale” or “joule”)
    • “if a word ends in -sm, it requires an extra syllable even without an extra vowel” (altruism, chasm)
    • “if an a follows an i, which follows a t, and the three are followed by an n, (-tian-), ignore that second syllable” (martian, venetian, although with snags like faustian, etc.)
    • “an i before an e is one syllable, except at the end of the word” (die, sortie)

And many more…  (punctuation, pluralization, tense, etc.).  The source code will let you stumble through the gnarled branches of its many if/else trees.
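
To give a flavor of how such refinements might bolt on, here is a hedged Python sketch of three of the rules listed above (the final “-le”, the “-sm” ending, and “-tian-”).  It is not the actual if/else tree from the source code.  The function name is hypothetical, and it assumes an earlier letter-by-letter pass has already produced a rough guess in which every final “e” and every vowel after an “i” was counted as a syllable.

```python
VOWELS = frozenset("aeiou")


def adjust_guess(word: str, guess: int) -> int:
    """Apply three pattern-based corrections to a rough syllable guess."""
    w = word.lower()

    # Final "e" after "l": keep the syllable when the "l" follows a
    # consonant ("stable"), drop it when the "l" follows a vowel ("tale").
    if len(w) >= 3 and w.endswith("le") and w[-3] in VOWELS:
        guess -= 1

    # Words ending in "-sm" get an extra syllable with no extra vowel.
    if w.endswith("sm"):
        guess += 1

    # "-tian-": the "a" after "ti" does not form a syllable of its own.
    # (Known snag from the rule list: this is wrong for "faustian".)
    if "tian" in w:
        guess -= 1

    return guess


# Rough guesses below are supplied by hand for illustration.
print(adjust_guess("tale", 2))
print(adjust_guess("stable", 2))
print(adjust_guess("chasm", 1))
print(adjust_guess("martian", 3))
```

Keeping each correction as its own small check makes it easy to pull a rule out later and see how much it was actually helping.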

In a way, these generalizations started to seem so problematic that it almost made me want to cancel the whole effort, wondering if it was borderline irresponsible to allow so many rules to go in, when each of them had obvious exceptions and contradictions.  But, again, the goal of this project and tool is to assist with poem creation, not to enforce it.

And, to my surprise again, keeping on with this for a few hours led to an extremely gratifying series of tests that got these words right far more often than not.

I’ll cover that testing process in the next post.
