We left off last time with writing an algorithm (well, that’s a fancy word for it) for deriving a best-guess syllable count for “any” word.  Which involved a lot of rules…

And, obviously, these many (many) rules need testing.

While there are certainly better ways to test this (as in, automatically), I chose a very hands-on, “manual” approach as I worked.  And I’d actually recommend it, for a few reasons.

My quite-simple process was:

  1. Use an online “random word generator” (there are many, and I imagine most are about the same) to generate a list of 20 or so words at a time.
  2. Copy/paste those words into a text editor, and find/replace the “newline” breaks with quotes and commas, to make it into an array notation that we can feed into PHP.
    [screenshot]

    [screenshot]

    This adds a comma and close-quotes to the end of each line, and open-quotes to the next.

    (This requires adding the open quotes to line 1, and close quotes to line 20, by hand.  You might have fancier ways to do that last step automagically, which I’d be happy to learn.)

    [screenshot]

  3. Copy/paste that list of words into a PHP array:
    [screenshot]

And then iterate through them, running my syllable-counting function on each individual word, and echoing out each letter, and any rules-based decisions that accompany that letter (or string of letters).
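The harness itself is as plain as it sounds.  A minimal sketch (with `count_syllables()` standing in for my actual rules function, whatever yours ends up being called):

```php
<?php
// A minimal sketch of the test harness. The words are whatever the
// random-word generator produced; count_syllables() stands in for the
// actual rules-based function (the name here is just for illustration).
$test_words = ["alphabet", "social", "about", "bottle"];

foreach ($test_words as $word) {
    // The function itself echoes each letter and any rule decisions
    // as it works, so the output doubles as a trace of its reasoning.
    echo $word . ": " . count_syllables($word) . " syllable(s)\n";
}
```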

The output is very verbose, but hugely helpful to see why it made the decisions it did (and, of course, whether any of those needed adjusting):

[screenshot]

I kept commented lines as a tally of how many words, out of the 20 per test (and 100 on the last big test of that day), had needed fixing, and which rules I was adding.  (I should have been more careful to add more of those notes.)

I also kept a far-more-important array of failed words, to come back to as another test pass after adding/changing rules.  (And once those issues were resolved, those words graduated up into the “fixed words” array, as a kind of log of my progress.)

These failed and fixed words didn’t have to be PHP arrays, of course, but it helped to be able to quickly feed those arrays into the algorithm, in place of the test words array, for review.  (That PHP loop up above doesn’t care how many words are in the array you give it, so it’s happy to go through your entire list of test words even if it gets quite long.)

[screenshot]

The “failed words” array, whose entries graduate up to the “fixed words” array once the necessary rules are added. Useful for keeping a list of things to come back to.

There were a lot of other fixes and notes, too, but that’s a decent glimpse into the general approach.

Some of those are still on the to-do list, now a day or two later, but it’s great to know which ones represent which rules to add/change.

A couple of the successes (yay!):

[screenshot]

Sample verbose output from my testing script — with the “-sm” ending rule at work in this one.

 

[screenshot]

Properly ignoring two “e”s that might have otherwise made this incorrectly read as a 4-syllable word.

The above couple of examples show a few places where I was pleased (relieved?) to see the algorithm correctly making adjustments for a couple of the trickier rules in English.  And while it’s definitely not the most scientific testing method, of the thousand-or-so words I fed this, 93.3% were guessed correctly, which is honestly better than the threshold I’d have set for calling the experiment a success (for now).  (Obviously, luckier or unluckier choices of random words would have swung that percentage either way, but hey, that’s the joy of sample sizes.)

And, with all of the inconsistencies in English, it’s worth bearing in mind that nothing is going to ever hit 100%.

Which brings me to:

A couple of the failures (boo!) to be fixed:

[screenshot]

“de-” as a prefix breaks what it otherwise assumes is the single-syllable diphthong “ea”.  Prefixes (and suffixes) in general represent what could be a whole extra pass through a word, and possibly even a separate dictionary lookup (for, say, anything after the “de-” in a word that begins with it).

[screenshot]

Come on, a prefix and a suffix?  And how often does “-ing” follow a vowel?! Now you’re just making stuff up, English! (Clearly I need to be splitting words on common suffixes, but in this case it wouldn’t be enough to evaluate this word even three times, removing its prefix and suffix both — “valu”, after all, is not a word!)

There definitely need to be prefix and suffix checks, for “de-” and “re-” and “-ness”, etc., and then checks against the word that’s left after splitting those off.  (So that, say, “preamble” checks for the word “amble” after removing the “pre”, and thereby learns that that “ea” vowel pair is not actually a single long-e sound, like “seam”, but a prefixed word, and therefore another syllable.  And, conversely, so that it either skips the check, or chooses not to add another syllable, when the prefix precedes a consonant, as in “premium”, since prefix or suffix status only matters to us if it affects the vowel/syllable rules.)
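A rough sketch of how that first check might look (with `word_exists()` standing in for a hypothetical helper that looks the remainder up in the `words` table, and a prefix list that’s only a starting point):

```php
<?php
// Sketch of the prefix check. word_exists() is a hypothetical helper
// that queries the dictionary table for an exact match.
function strip_known_prefix($word) {
    $prefixes = ["de", "re", "pre"];
    foreach ($prefixes as $prefix) {
        if (strpos($word, $prefix) === 0) {
            $remainder = substr($word, strlen($prefix));
            // The split only matters to the vowel rules when the
            // remainder starts with a vowel ("preamble" -> "amble";
            // "premium" skips the check entirely), and it only counts
            // as a true prefix if the remainder is itself a real word.
            if (preg_match('/^[aeiou]/', $remainder) && word_exists($remainder)) {
                return $remainder;
            }
        }
    }
    return false; // no usable prefix found
}
```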

And so on, and so on…

So, that’s the script so far, with plenty more words in that “failed words” array ready to have new rules written for them and smoothed over.

Some of the last big remaining steps, as I can imagine them, will be to try to find a way to split up prefixes and suffixes (not always obvious, considering “reanimated” vs “rear”), compound words (“elsewhere”), and better rules about tense.  To some extent these might also need lookups from the dictionary (where “elsewhere” was in it, but the compound word “salesgirl” was not).

The downside there is that it starts to require more queries than might be reasonable for the task, given that it already queries our dictionary database once per word (either on a button click from the user or, more taxing still, automatically each time the user stops typing for more than a second or so within a line).

In any case, since this entire algorithm component is purely a supplement to a professional database (which will ideally serve the huge majority of a user’s words), the accuracy of this humble tool already seems reasonable enough that I don’t mind incorporating it into the project as is, albeit with a couple of warnings and the promise of better results in days ahead.

These scripts all need a bit of cleanup (and security measures, since a database that isn’t just on my own hard drive will be involved), but I’ll make sure the revised script (along with some basic PHP functions to call it and retrieve results from it) is available in the source code along with the others shortly, and in a GitHub repo, for anyone who wants to join the fun.

More soon.

So, last time we got some rhymes working, which was lovely.  While rhyming is a huge part of enforcing the structure of the poems we’re going to be creating, an equally important part is meter.

Which means we’ve got some syllables to count.

Now, fair warning that for this version of this (eventual) tool, we’re setting the bar fairly low and only counting syllables.  This is partially because (mercifully) the structure of the poems we’re analyzing doesn’t actually enforce stress patterns (like, say, the alternating stress “da DA da DA…” of an iambic pentameter sonnet, etc.).  But it’s also partially because we have to start somewhere, and syllables are a big enough question in themselves for now.

For the most part, CMUdict saves our day again, in that we can rely on that (great) work to derive syllable counts for every word in its dictionary.  It isn’t quite as simple as that (as I’ll cover below), but it’s a straightforward enough process to get what we need from it.

As I mentioned in the last section, we’re able to derive the number of syllables in a CMUdict word easily enough, simply by counting the vowel sounds.  (Briefly: a simple regular expression of the phonetic column lets us count how many times we find numerals, which only occur as markers for the lexical stress of a vowel sound.  Find a number, you’ve found a vowel — and, thus, a syllable.)
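In PHP terms, that check is nearly a one-liner.  A quick sketch (the pronunciation string here is the actual CMUdict entry for “alive”):

```php
<?php
// Count the syllables in a CMUdict pronunciation by counting its
// stress-marker numerals, which only ever appear on vowel sounds.
$pronunciation = "AH0 L AY1 V"; // CMUdict's entry for "alive"
echo preg_match_all('/[0-9]/', $pronunciation); // 2
```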

But the problem is that you don’t exactly want to be splitting and reg-ex-ing every word every time you want to count its syllables — and, worse, you can’t do that to every entry in the database every time you want to find a word of a particular syllable count.  In other words, it’s not easily query-able information.  And we want it to be.

My solution, simply enough, was to just derive that information once, for every word in the database, and store it there permanently as another column.

Time for some brute force.

[screenshot]

This is far from glamorous code, I know, but it actually gets the job done quite nicely.  Simply put: we select every line from that dictionary (as in, all 133,803 of them!), split each result’s pronunciation field into an array on its spaces, and then pattern-match each of the word’s phonetic segments for numerals.  Every time we find one, we increment a per-word syllable count, and then echo the whole database row back out along with that new number, setting it up as a new fourth column in what will become a new CSV file we can import, wiping out our old table for this new and improved one.
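In rough outline, it looks something like this (a sketch only: the connection details are stand-ins, and the column names assume the `w_` convention used elsewhere here):

```php
<?php
// Sketch of the brute-force pass: select every row, count the stress
// numerals in its pronunciation, and echo it all back out as CSV with
// the syllable count as a new fourth column.
$db = new mysqli("localhost", "user", "password", "dictionary");

$result = $db->query("SELECT * FROM `words`");

while ($row = $result->fetch_assoc()) {
    // Split the pronunciation field into its phonetic segments...
    $phonemes = explode(" ", $row["w_pronunciation"]);

    // ...and count each segment carrying a numeral (a lexical stress
    // marker, which only ever appears on a vowel sound).
    $syllable_count = 0;
    foreach ($phonemes as $phoneme) {
        if (preg_match('/[0-9]/', $phoneme)) {
            $syllable_count++;
        }
    }

    // Echo the whole row back out, plus the new column.
    echo $row["w_id"] . "," . $row["w_word"] . ","
        . $row["w_pronunciation"] . "," . $syllable_count . "\n";
}
```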

This is obviously as “brute force” as it gets, and probably something that people more familiar with SQL could do with a query.  I am not those people.  So, for now, for me, this works!

The result is a big ugly screen of output whose source code, thankfully, is at least less ugly:

[screenshot]

The fourth “column” (the number after the third comma) is our new syllable count, ready to SELECT on in MySQL. (And, fitting that this screenshot contains both “abracadabra” and “abrasive,” right?)

Ugly and brute force or not, in about a half-hour’s worth of work, we now have an effortlessly query-able syllable count for every word in this dictionary.  So, that’s our “syllable counting” needs settled, then, right?

…right?

Well, no.  Not exactly.

What if our word’s not in the database?

So, in the case of rhyming, it’s probably understandable enough that not every word you look for is going to be able to give you a good rhyme.

[screenshot]

(“Orange,” for instance. Heh.)

But while failing to find a rhyme doesn’t necessarily “break” the application, not being able to get a syllable count for a line theoretically would.  The user could count their lines’ syllables by hand, sure, but they could have done that with a pen and paper, too.  This is an application meant to help them structure their own poems!

(And, off the top of my head, I’d say rhyming is at least more subjective than meter, even if syllable counts are a bit slippery in and of themselves.  Maybe?)

So, we need a plan B on syllables, if we’re going to give the user at least a good “best guess” at the number of syllables in their lines.  My not-nearly-as-easy-as-it-seemed-when-I-thought-of-it answer: write an algorithm that parses “any” word for its syllable count.

This, it turns out, is a big job.  (Who’d have guessed?!)

What I’ll describe below is what I came up with after about 3-4 hours of revising, finally calling it “good enough for now,” with the absolute certainty of needing to come back and finesse it many times over in the future.

One thing I’m trying to keep in mind, as well: there’s a definite “diminishing returns” problem past a certain point, given that even the duct-taped madness I’ve built so far is doing a pretty admirable job right now (a bit over 90% accuracy, according to some also-questionably-accurate testing).

Probably the easiest way to present this is to document the process that I went through, thinking and adding rules to this.  So, to start with, the easy parts:

  1. Borrowing from our approach to parsing CMUdict’s entries for syllables, the biggest and easiest single step toward getting a reasonable syllable guess would be:

    “count the vowels”

    From Wikipedia, “a vowel is a speech sound made by the vocal cords,” which, of course, lines up pretty well with the definition of syllable, where “a syllable is typically made up of a syllable nucleus (most often a vowel) with optional initial and final margins (typically, consonants).”

    So, if we were to take, say, “alphabet” — well, there we go.  a • lph • a • b • e • t.  Three vowels makes three syllables.  Cool!

    Well, that was easy.  We done here?

    What do you mean, “no?”
  2. So, “a • b • o • u • t” that rule…

    It’s easy to forget those pesky diphthongs (even with as fun a name as they have).  And, of course, they’re everywhere — and, worse, they’re crazy inconsistent.

    Disclaimer: The people who actually study these things are probably trying to reach through the screen and strangle me right now for approaching this so amateurishly.  But the truth is, my entire English program (probably unsurprisingly) didn’t involve more than the occasional toe dipped into the waters of actual linguistics.  So, for as much as I love language (and syntax, and morphology, and etymology, and I’m going to stop listing things now) and all of that fun stuff (and I actually really do), mine is unfortunately the starry-eyed fascination of the complete amateur.  But we’re going to make it work anyway!

    I tend to be this sort of unreasonable “renaissance” person, notoriously bad at consulting the subject experts first.  So, consider this whole info-dump to be my “baited web” here, bringing more knowledgeable people to me by frustrating them on the internet until they speak up to correct me.

    Back to the point: we need rules that cover these little pairs.  The easiest rule, to start with (and again, a thoroughly broken rule to use by itself), is:

    “ignore (as in, don’t count as a syllable) a vowel that follows a vowel”

    There.  Finished!

    …<cough.>

  3. Except that, applied too broadly, that last rule becomes kind of an “i • d • io • t • i • c” rule, of course.

    (And I’m picking on “i” there as the prime offender of this rule.)  With that in mind, my next step was to add on to the previous rule, as if in apology, simply:

    “…unless that last vowel was an i.”

    At this point, it was already doing a much better job.  (I hadn’t thought to record my tests at each step, and started doing so only MUCH later into this process — I might still do this retroactively, actually, by taking rules back out of my algorithm.  I think the test data would be interesting.)
  4. What’s that thing about “except after C”?  Well, it works here too.

    Take “s • o • c • i • a • l” for instance.  That “c” just cost us a syllable — and opened this algorithm up to something it inevitably needed all along, of course: awareness of a couple-or-few letters in each direction of the letter we’re looking at.

    Now we can say something like:

    “two vowels in a row is one syllable, unless the first vowel is an i… unless this vowel is an a, following an i, following a c”

    (I’ll sketch these first few rules in code just after this list.)

    This is obviously starting to get harder — and while I’d love to add something reassuring here about eventually finding some elegant underlying simplicity, the truth is, I didn’t.  (And I don’t think there is one.)  English is, of course, a language derived from and constantly incorporating a great many other languages, with a great many rules, and obviously many of these differ wildly on a case-by-case basis.

    So, you might be reading that last rule and (quite correctly) shaking your head, saying “what about words like ‘associate’ vs ‘associative’” — and, even between those words, the different acceptable syllable counts for them?

    Well, to that, I’d say there are really three ways to look at this:

    The first, and hardest: we could start flagging certain tricky vowels/strings as “multiple possibility” syllables, and recording multiple values for these, giving the end user some way to choose between them, to appease some strict syllabic enforcement in our final application… or…

    The second, and most lenient: to just not enforce these things.  We could far more easily just let this algorithm be what it really is: our best reasonable guess.  If it’s wrong, it’s not going to be crazy wrong; it’s just going to… I dunno, wiggle a bit? on little rules and oddities like “is ‘comfortable’ three syllables or four?”

    The third, and most reassuring: to remember that this tool is actually just a last-resort supplement to a dictionary that has already solved this question the proper way (at great time and human effort) — by cataloguing these words by hand, accounting for the inconsistencies and multiple possibilities.  We can and will still look up each word a user types in that dictionary, and will get a far more definitive answer for any and all of the 133,000-odd words.

    (And, we will simply ask for the user’s patience when they want to write a poem that includes the allegedly single-syllable word “pwnage,” and live with a “syllable warning,” or something, for that line.)

    (Hey, I’m satisfied if you are.)
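And, as promised above, here’s a boiled-down sketch of those first few rules in code (far simpler than the real tangle of branches, and with plenty of failure cases of its own):

```php
<?php
// A simplified sketch of the first few rules: count vowels, but treat
// a vowel that follows a vowel as part of the same syllable -- unless
// the first of the pair was an "i" ("idiotic"), unless that "i" itself
// followed a "c" ("social").
function guess_syllables($word) {
    $word = strtolower($word);
    $count = 0;
    $previous_was_vowel = false;

    for ($i = 0; $i < strlen($word); $i++) {
        $is_vowel = strpos("aeiou", $word[$i]) !== false;

        if ($is_vowel) {
            if (!$previous_was_vowel) {
                // Rule 1: a new vowel (group) starts a new syllable.
                $count++;
            } elseif ($word[$i - 1] === "i" && ($i < 2 || $word[$i - 2] !== "c")) {
                // Rule 3: a vowel after an "i" still counts ("idiotic")...
                // Rule 4: ...except when that "i" follows a "c" ("social").
                $count++;
            }
        }
        $previous_was_vowel = $is_vowel;
    }
    return max(1, $count); // every word has at least one syllable
}

// guess_syllables("alphabet"); // 3
// guess_syllables("about");    // 2
// guess_syllables("idiotic");  // 4
// guess_syllables("social");   // 2
```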

Covering all bases

Those rules above are only the first few rules that I added — I think I’ve gotten up to about 30? such rules by now, with many more left to add.  But those show a good overview of how that thinking process worked.

And after about 10 or 20 of those little refinements, I was actually starting to get quite happy with how well (generally speaking) it was doing.

For context, a few of the other rules that ended up coming in were things like:

    • “an e at the end of the word, following an L, following a consonant, is probably a syllable” (like “stable” or “bottle”) … but if that L follows a vowel, it probably isn’t (like “tale” or “joule”)
    • “if a word ends in -sm, it requires an extra syllable even without an extra vowel” (altruism, chasm)
    • “if an a follows an i, which follows a t, and the three are followed by an n (-tian-), ignore that second syllable” (martian, venetian, although with snags like faustian, etc.)
    • “an i before an e is one syllable, except at the end of the word” (die, sortie)

And many more…  (punctuation, pluralization, tense, etc.).  The source code will let you stumble through the gnarled branches of its many if/else trees.
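As a taste of the shape these take in code, a couple of the ending rules might look something like this (a simplified sketch, not the real branches verbatim, where $count arrives as the vowel-based guess from the earlier pass):

```php
<?php
// Simplified sketches of two of the ending rules described above.
function apply_ending_rules($word, $count) {
    // A final "e" is usually silent ("tale", "joule"), so drop it from
    // the count -- unless it follows an "l" that follows a consonant
    // ("stable", "bottle"), where it really does carry a syllable.
    if (substr($word, -1) === "e" && $count > 1) {
        $consonant_le = substr($word, -2) === "le"
            && strlen($word) >= 3
            && strpos("aeiou", $word[strlen($word) - 3]) === false;
        if (!$consonant_le) {
            $count--;
        }
    }

    // A final "-sm" carries an extra syllable with no extra vowel
    // ("chasm", "altruism").
    if (substr($word, -2) === "sm") {
        $count++;
    }

    return $count;
}
```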

In a way, these generalizations started to seem so problematic that I almost wanted to cancel the whole effort, wondering if it was borderline irresponsible to let so many rules go in when each of them had such obvious exceptions and contradictions.  But, again, the goal of this project and tool is to assist with poem creation, not to enforce it.

And, to my surprise again, keeping on with this for a few hours led to an extremely gratifying series of tests that got these words right far more often than not.

I’ll cover that testing process in the next post.

auto-poetry

So apparently I really like working on lexical and phonetic analysis.  Who knew?  And apparently I like it to the point that when I finally had a weekend with some spare time to (at long last) play The Witcher 3, I instead found myself sitting at my desk working on an algorithm to split words up by their syllables and vowel sounds.  For hours.  Having fun.

And I guess there’s no reason it shouldn’t be fun.  By the time I was fully into the swing of things (which was surprisingly quickly), it felt like a puzzle.  And, since this was about words in the English language, it was even a puzzle where I was already pretty familiar with all of the pieces.

So, the background:

Without saying too much about the project itself (I’m leaving that to be the researcher’s privilege to announce and document as she likes), we’re brainstorming the early stages of a project at Hamilton that would both analyze a particular type of poetry, and give its readers the chance to create some of their own.

Like most poetry, this means there’s a particular set of rules (which are also a fun puzzle to sort out, programmatically) regarding the form, rhyme, meter, etc., of these works.

My job, to get things started, was therefore to start thinking of ways that we can essentially ask a web application, in real time, to look at either a word or a whole string of words (a line, couplet, etc.), and get some of these bits of information back.

Well, lucky for me, these lexical features are available at least in part through the excellent CMU Pronouncing Dictionary, which can tell you (almost) any English word’s phonetic sounds and emphasis.  And while that doesn’t tell us the number of syllables in the word, or provide rhymes, having the rest of that information actually gets us a lot closer than it might seem.

Setting Up

The first hurdle was making this available to a webpage as something that I could query with reckless abandon.  So, while their page shows a searchable input box (which returns the sort of thing you’d hope for), there was no obvious way to set that kind of searchable system up for yourself.  (And, me being my impatient self, I didn’t ask them for their solution.)

Before I go on, I should also give another positive mention here to Steve Hanov and his “A Rhyming Engine” (now turned into the mightier RhymeBrain, and its API), which were also strong contenders for the tool of choice, regarding the rhyming portion.  (I did reach out to Steve, who kindly responded with the suggestion of trying out that API for my purposes.  I didn’t end up going that route, but that’s just the control freak in me — part of me wanted to figure some of this stuff out for myself, and part of me wanted a tool that I could hammer away at, without API call limitations.)

The CMU Dictionary

The CMU Pronouncing Dictionary (“CMUdict”) is essentially just a gigantic (space-separated) text list of dictionary words followed by their ARPAbet phonemes and lexical stress markers (represented as numerals at the end of the vowel sounds).  So, while that right there is the bulk of the content I think this task needs, it’s not exactly as accessible as we will need it to be.
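For a concrete example, the entry for “hello” reads:

HELLO  HH AH0 L OW1

(That’s one numeral per vowel sound: the 0 marking an unstressed vowel and the 1 a primary stress.)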

So, for my next trick, I simply converted this whole dictionary into the world’s simplest MySQL table, so that I could just query it the old-fashioned way.  (I’d love suggestions of a better way to do this.  I did burn a couple of unsatisfying hours trying other tools I found around the web, to equally unsatisfying ends.)

Disclaimer: I am the furthest thing from a database admin, and am usually quite far behind the times on the easiest or sexiest tools for jobs like these.  I used to be pretty intimidated by that, but at this point I’m finding the value in it: using approaches like these, describing them to people such as yourselves, and hearing what tool would make this a thousand times easier, or more powerful, the next time around.  (So, let’s hear them, this time!)  In the meantime, it’s nice to know that at least I can accomplish the task, and can probably appreciate the power of better tools all the better for knowing how clunky approaches like these really are.

My process: load this entire dictionary text into a text editor (I’ve been using the surprisingly excellent Visual Studio Code for this project — and all projects on Mac recently), and literally just search/replace the spaces with commas, creating a sort of quick-and-easy CSV (comma separated values) file.

(Fun fact, since the word “NULL” is one of the dictionary words, MySQL hates this on its import, and quits out of the import with an error.  I thought it was funny.  Thus, the manual substitution of “NULL” with “fixme”, which I later, of course, fix.)

On the database side of things, I set up a dead-simple two-column table called `words` that had a column for the word itself, and another for the phonetic/lexical stress value.  That gives us our basic structure that maps to this simple CSV, and from there it’s happy enough (after that “fixme” substitution) to let you load it in via phpMyAdmin’s “import” tool.
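The table itself is about as simple as MySQL gets — something like this, give or take the types (and with `w_pronunciation` being my guess at a sensible name for that second column):

CREATE TABLE `words` (
  `w_word` VARCHAR(64) NOT NULL,
  `w_pronunciation` VARCHAR(255) NOT NULL
);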

This isn’t quite enough by itself.  To make it properly editable, the database still needs a unique-key ID column, which is easy enough to add on after the fact.  (I do this after importing the CSV, so that I don’t have to dream up some annoying solution to manually adding IDs to each field in my text file.)  MySQL is happy enough to add that in one query.

That query being (give or take the name of the new column, which I’m calling `w_id` to match the others):
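ALTER TABLE `words` ADD `w_id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;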

So, with that finished, we now have a nice little searchable database that’s happy to let you find either exact matches, or partial matches, with queries such as:

WHERE `w_word` = 'searchterm'

or, for partial matches (with MySQL’s LIKE syntax):

WHERE `w_word` LIKE '%searchterm%'

(And so forth.)  This also lets us use those ‘%’ wildcards at either only the beginning or only the end, to find words that just begin or end with our search terms.  (That becomes big on searches for rhyme.  More on that later.)
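For instance:

WHERE `w_word` LIKE 'pre%'

finds every word beginning with “pre”, while:

WHERE `w_word` LIKE '%ing'

finds every word ending in “ing”.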

Mercifully, this is probably the biggest single line on the project’s to-do list, sorted out (well enough) in a few steps.  (And, in my mind, I had made that part into quite the dragon to slay, so I was smiling at this point already — which is always nice after only an hour or two.)

From here, it’s easy enough to jot down a few generic queries that will get us most of the search/retrieval functionality we’ll need, and then start stuffing those into a PHP script or three, which we’ll feed words into via $_GET or $_POST variables:
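Something along these lines (a sketch only: the connection details are stand-ins, and it assumes the words went into the table uppercase, as they appear in the dictionary file):

```php
<?php
// lookup.php -- a minimal sketch of a word-lookup endpoint.
// Call it as lookup.php?word=alive to get the matching row back.
$db = new mysqli("localhost", "user", "password", "dictionary");

// Escape the user's input before it touches the query (and uppercase
// it, assuming the words were imported as-is from the dictionary).
$word = $db->real_escape_string(strtoupper($_GET["word"] ?? ""));

$result = $db->query("SELECT * FROM `words` WHERE `w_word` = '$word'");

while ($row = $result->fetch_assoc()) {
    echo $row["w_word"] . ": " . $row["w_pronunciation"] . "<br>";
}
```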

Quick and ugly, but it’s already enough functionality to let us access this from a webpage and see the results.  (And almost enough to soon turn into an Ajax version that we can query in real-time, as often as we need, to look up words as the user types them.)

w00t.  (Which, by the way, is a word that is strangely not in the dictionary.  Weird.)