We left off last time with writing an algorithm (well, that’s a fancy word for it) for deriving a best-guess syllable count for “any” word. Which involved a lot of rules…
And, obviously, these many (many) rules need testing.
While there are certainly better ways to test this (as in, automatically), I chose a very hands-on, “manual” way to test as I was working. And I’d actually recommend this approach, for a few reasons.
My quite-simple process was:
- Use an online “random word generator” (there are many available, and I imagine most are about the same) to generate a list of 20 or so words at a time.
- Copy/paste those words into a text editor, and find/replace the “newline” breaks with quotes and commas, to make it into an array notation that we can feed into PHP.
(This requires adding the open quotes to line 1, and close quotes to line 20, by hand. You might have fancier ways to do that last step automagically, which I’d be happy to learn.)
- Copy/paste that list of words into a PHP array:
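As a sketch (these words are placeholders, not my actual batch), the pasted list becomes something like this. I’ve also included one possible “automagic” alternative to the manual find/replace step: paste the raw newline-separated list into a heredoc and let PHP split it.

```php
<?php

// A hypothetical batch of random test words, pasted in as a PHP array:
$testWords = [
    "balloon", "rhythm", "preamble", "elsewhere", "created",
    "queue", "banana", "strength", "premium", "seam",
];

// Alternative: skip the find/replace entirely by pasting the raw
// newline-separated list into a heredoc and splitting it.
$raw = <<<WORDS
balloon
rhythm
preamble
WORDS;
$fromRaw = array_values(array_filter(array_map('trim', explode("\n", $raw))));
```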
And then iterate through them, running my syllable-counting function on each word and echoing out each letter, along with any rules-based decisions that accompany that letter (or string of letters).
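In sketch form, that loop might look like the following. The function name and signature here are hypothetical, and the body is just a trivial vowel-group counter standing in for the real rules-based algorithm, so the example is runnable:

```php
<?php

// Stand-in syllable counter (NOT the real rules-based algorithm):
// it only counts vowel groups, so the test loop below has something
// runnable to call.
function count_syllables(string $word, bool $verbose = false): int
{
    preg_match_all('/[aeiouy]+/i', $word, $matches);
    if ($verbose) {
        // The real function echoes each letter and each rule decision;
        // here we just show the vowel groups it found.
        echo "  vowel groups in '{$word}': " . implode(', ', $matches[0]) . "\n";
    }
    return max(1, count($matches[0]));
}

$testWords = ["balloon", "banana", "strength"];

foreach ($testWords as $word) {
    echo "{$word}: " . count_syllables($word, true) . " syllable(s)\n";
}
```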
The output is very verbose, but hugely helpful to see why it made the decisions it did (and, of course, whether any of those needed adjusting):
I also kept a far-more-important array of failed words, to come back to as another test pass after adding/changing rules. (And once those issues were resolved, those words graduated up into the “fixed words” array, as a kind of log of my progress.)
These failed and fixed words didn’t have to be PHP arrays, of course, but it helped to be able to quickly feed those arrays into the algorithm, in place of the test words array, for review. (That PHP loop up above doesn’t care how many words are in the array you give it, so it’s happy to go through your entire list of test words even if it gets quite long.)
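For illustration (the entries here are hypothetical, since the real contents shifted constantly as rules changed), the bookkeeping was just more arrays of the same shape, which can be swapped in for the test words:

```php
<?php

// Hypothetical running log of problem words and resolved words.
$failedWords = ["preamble", "created", "rhythm"]; // still guessed wrong
$fixedWords  = ["queue", "elsewhere"];            // once wrong, now passing

// Re-running a test pass is just a matter of swapping the array in:
$testWords = $failedWords;
```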
There were a lot of other fixes and notes, too, but that’s a decent glimpse into the general approach.
Some of those are still on the to-do list, now a day or two later, but it’s great to know which ones represent which rules to add/change.
A couple of the successes (yay!):
The above couple of examples show a few places where I was pleased (relieved?) to see the algorithm correctly making adjustments for some of the trickier rules in English. And while it’s definitely not the most scientific testing method, of the thousand-or-so words I fed this, 93.3% were guessed correctly, which is honestly better than the threshold I’d have set for calling the experiment successful (for now). (Obviously luckier or unluckier choices for those random words would have swung that percentage quite a bit, but hey, that’s the joy of sample sizes.)
And, with all of the inconsistencies in English, it’s worth bearing in mind that nothing is going to ever hit 100%.
Which brings me to:
A couple of the failures (boo!) to be fixed:
There definitely need to be prefix and suffix checks, for “de-” and “re-” and “-ness”, etc. And then checks against the word that’s left after splitting those off. (So that, say, “preamble” checks for the word “amble” after removing the “pre”, and thereby learns that that “ea” vowel pair is not actually a single long-e sound, as in “seam”, but a prefix boundary, and therefore another syllable. And, conversely, so that it either skips the check, or chooses not to add another syllable, when the prefix precedes a consonant, as in “premium”, since a prefix or suffix status only matters to us if it affects the vowel/syllable rules.)
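A rough sketch of that prefix check might look like this. All names here are hypothetical, and is_known_word() is a tiny hard-coded stand-in for the real dictionary lookup:

```php
<?php

// Stand-in for a dictionary lookup (the real check would query the
// dictionary database).
function is_known_word(string $word): bool
{
    return in_array($word, ["amble", "seam", "animated"], true);
}

// Returns 1 extra syllable when stripping a known prefix leaves a
// dictionary word that starts with a vowel (so a pair like the "ea"
// in "preamble" is really a prefix boundary, not one long vowel),
// and 0 otherwise.
function prefix_syllable_adjustment(string $word): int
{
    foreach (["pre", "de", "re"] as $prefix) {
        if (strncasecmp($word, $prefix, strlen($prefix)) !== 0) {
            continue;
        }
        $rest = substr($word, strlen($prefix));
        if (is_known_word($rest) && preg_match('/^[aeiou]/i', $rest)) {
            return 1; // e.g. "preamble" -> "pre" + "amble"
        }
    }
    return 0; // e.g. "premium": the prefix precedes a consonant
}
```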
And so on, and so on…
So, that’s the script so far, with plenty more words in that “failed words” array waiting for new rules to be written and smoothed over.
Some of the last big remaining steps, as I can imagine them, will be to try to find a way to split up prefixes and suffixes (not always obvious, considering “reanimated” vs. “rear”), compound words (“elsewhere”), and better rules about tense. To some extent these might also need lookups from the dictionary (which happened to include “elsewhere”, but not the compound word “salesgirl”).
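One way that compound-word split might be sketched (again with a tiny hard-coded array standing in for the dictionary query, and hypothetical names throughout):

```php
<?php

// Tiny stand-in for the dictionary (the real check is a DB query).
function in_dictionary(string $word): bool
{
    return in_array($word, ["else", "where", "sales", "girl", "rear"], true);
}

// Try every split point; treat the word as a compound only when BOTH
// halves are dictionary words, so "rear" never splits into "re" + "ar".
function split_compound(string $word): ?array
{
    $len = strlen($word);
    for ($i = 2; $i <= $len - 2; $i++) {
        $left  = substr($word, 0, $i);
        $right = substr($word, $i);
        if (in_dictionary($left) && in_dictionary($right)) {
            return [$left, $right];
        }
    }
    return null; // not a recognizable compound
}
```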
The downside there is that it starts to require more queries than might be reasonable for the task, given that it already queries our dictionary database once per word (either by a button click on the user’s part or, more taxing still, automatically, each time the user stops typing for more than a second or so within a line).
In any case, since this entire algorithm component is purely a supplement to a professional database (which will ideally serve the huge majority of a user’s words), the accuracy of this humble tool already seems reasonable enough that I don’t mind incorporating it into the project as is, albeit with a couple of warnings and the promise of better results in days ahead.
These scripts all need a bit of cleanup (and security measures, since a database that isn’t just on my own hard drive will be involved), but I’ll make sure that the revised script (along with some basic PHP functions to call it and retrieve results from it) is available in the source code along with the others, shortly, and in a GitHub repo, for anyone who wants to join the fun.