Automating Poetry, Pt. 4

We left off last time having written an algorithm (well, that’s a fancy word for it) for deriving a best-guess syllable count for “any” word, which involved a lot of rules…

And, obviously, these many (many) rules need testing.

While there are certainly better ways to test this (as in, automatically), I chose a very hands-on, “manual” approach as I was working.  And I’d actually recommend it, for a few reasons.

My quite-simple process was:

  1. Use an online “random word generator” (there are many, and I imagine most are about the same) to generate a list of 20 or so words at a time.
  2. Copy/paste those words into a text editor, and find/replace the “newline” breaks with quotes and commas, to make it into an array notation that we can feed into PHP.

    This adds a comma and close-quotes to the end of each line, and open-quotes to the next.

    (This requires adding the open quotes to line 1, and close quotes to line 20, by hand.  You might have fancier ways to do that last step automagically, which I’d be happy to learn; one PHP-side possibility is sketched under step 3 below.)


  3. Copy/paste that list of words into a PHP array:
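Something like this (with placeholder words, and including one way to let PHP handle the quoting itself, per the note in step 2):

```php
<?php
// Paste the generator's raw output (one word per line) into a heredoc,
// and let PHP build the array itself -- no manual find/replace needed.
$raw = <<<WORDS
hammer
elsewhere
preamble
WORDS;

// Split on newlines, trim stray whitespace, and drop any empty lines.
$test_words = array_values(array_filter(array_map('trim', explode("\n", $raw))));
```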

And then iterate through them, running my syllable-counting function on each word and echoing out each letter, along with any rules-based decisions that accompany that letter (or string of letters).
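A minimal sketch of that loop, assuming the counting function is named count_syllables() (a placeholder name here, not necessarily what’s in the actual script):

```php
<?php
// Run the syllable counter on every test word.  count_syllables() is a
// stand-in for the rules-based function from last time; it's assumed to
// echo each letter and rule decision as it works through the word.
foreach ($test_words as $word) {
    echo "--- {$word} ---\n";
    $syllables = count_syllables($word);
    echo "Best guess: {$syllables} syllable(s)\n\n";
}
```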

The output is very verbose, but hugely helpful to see why it made the decisions it did (and, of course, whether any of those needed adjusting).


I kept commented lines as a tally of how many words, out of 20 per test (and 100 on the last big test of that day), had needed fixing, and what rules I was adding.  (I should have been more careful to add more of those there.)
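In other words, something like this (the numbers and rules here are invented for illustration):

```php
// Test 4: 17/20 correct -- added a silent-"e" check
// Test 5: 19/20 correct -- added the "-sm" ending rule
```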

I also kept a far-more-important array of failed words, to come back to as another test pass after adding/changing rules.  (And once those issues were resolved, those words graduated up into the “fixed words” array, as a kind of log of my progress.)

These failed and fixed words didn’t have to be PHP arrays, of course, but it helped to be able to quickly feed those arrays into the algorithm, in place of the test words array, for review.  (That PHP loop up above doesn’t care how many words are in the array you give it, so it’s happy to go through your entire list of test words even if it gets quite long.)
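In practice, a retest just meant pointing that same loop at a different array, something like this (with hypothetical words standing in for my real lists):

```php
<?php
// Words the algorithm miscounted, queued for retesting after rule changes.
$failed_words = ["revaluing", "preamble"];

// Words that used to fail and now pass -- a running log of progress.
$fixed_words = ["chasm", "sarcasm"];

// Re-run the exact same loop against the failures instead of fresh words.
$test_words = $failed_words;
```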


The “failed words” array, whose entries graduate up to “fixed words” once the necessary rules are added. Useful for keeping a list of things to come back to.

There were a lot of other fixes and notes, too, but that’s a decent glimpse into the general approach.

Some of those are still on the to-do list, now a day or two later, but it’s great to know which ones represent which rules to add/change.

A couple of the successes (yay!):


Sample verbose output from my testing script — with the “-sm” ending rule at work in this one.



Properly ignoring two “e”s that might have otherwise made this incorrectly read as a 4-syllable word.

The above examples show a few places where I was pleased (relieved?) to see the algorithm correctly making adjustments for some of the trickier rules in English.  And while this is definitely not the most scientific testing method, of the thousand-or-so words I fed it, 93.3% were guessed correctly, which is honestly a higher percentage than I would have required to call the experiment a success (for now).  (Obviously luckier or unluckier draws of random words would have swung that percentage in either direction, but hey, that’s the joy of sample sizes.)

And, with all of the inconsistencies in English, it’s worth bearing in mind that nothing is ever going to hit 100%.

Which brings me to:

A couple of the failures (boo!) to be fixed:


“de-” as a prefix breaks what it otherwise assumes is the single-syllable diphthong “ea”.  Prefixes (and suffixes) in general represent what could be a whole extra pass through a word, and possibly even a separate dictionary lookup (for, say, anything after the “de-” in a word that begins with it).


Come on, a prefix and a suffix?  And how often does “-ing” follow a vowel?! Now you’re just making stuff up, English! (Clearly I need to be splitting words on common suffixes, but in this case it wouldn’t be enough to evaluate this word even three times, removing its prefix and suffix both — “valu”, after all, is not a word!)

There definitely need to be prefix and suffix checks, for “de-” and “re-” and “-ness”, etc.  And then checks against the word that’s left after splitting those off.  (So that, say, “preamble” checks for the word “amble” after removing the “pre”, and thereby learns that that “ea” vowel pair is not actually a single long-e sound, like “seam”, but instead a prefixed word — and therefore another syllable.  And, conversely, so that it either skips the check, or chooses not to add another syllable, when the prefix precedes a consonant, as in “premium”, since prefix or suffix status only matters to us if it affects the vowel/syllable rules.)
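Here’s a rough sketch of how such a prefix pass might work; in_dictionary() is a stand-in for the project’s real dictionary lookup, and the word list is a toy one:

```php
<?php
// Stand-in for the real dictionary lookup (a database query in practice).
function in_dictionary(string $word): bool {
    static $dict = ["amble", "value", "animated"]; // toy examples
    return in_array($word, $dict, true);
}

// If stripping a known prefix leaves a real word, the prefix is likely its
// own syllable -- which matters most when the prefix meets a vowel and
// breaks an apparent diphthong, as in "pre" + "amble".
function split_prefix(string $word): ?array {
    foreach (["de", "re", "pre"] as $prefix) {
        if (str_starts_with($word, $prefix)) {
            $root = substr($word, strlen($prefix));
            if (in_dictionary($root)) {
                return [$prefix, $root];
            }
        }
    }
    return null; // no recognizable prefix: evaluate the word as-is
}

var_dump(split_prefix("preamble")); // ["pre", "amble"]
var_dump(split_prefix("premium"));  // NULL -- "mium" isn't a word
```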

And so on, and so on…

So, that’s the script so far, with plenty more words in that “failed words” array to write new rules for and try to smooth over.

Some of the last big remaining steps, as I can imagine them, will be to find a way to split up prefixes and suffixes (not always obvious, considering “reanimated” vs. “rear”) and compound words (“elsewhere”), and to add better rules about tense.  To some extent these might also need dictionary lookups (“elsewhere” happened to be in ours, but the compound “salesgirl” was not).

The downside there is that it starts to require more queries than might be reasonable for the task, given that the tool already queries our dictionary database once per word, either on a button click from the user or (more taxing still) automatically each time the user stops typing for more than a second or so within a line.

In any case, since this entire algorithm component is purely a supplement to a professional database (which will ideally serve the huge majority of a user’s words), the accuracy of this humble tool already seems reasonable enough that I don’t mind incorporating it into the project as is, albeit with a couple of warnings and the promise of better results in days ahead.

These scripts all need a bit of cleanup (and security measures, since a database that isn’t just on my own hard drive will be involved), but I’ll make sure that the revised script (along with some basic PHP functions to call it and retrieve results from it) will be available in the source code along with the others shortly — and in a GitHub repo, for anyone who wants to join the fun.

More soon.
