Kangaroo words
Recently I saw two great art projects by @whisbe. Each one is a sign with a phrase written on it, and the phrase’s meaning changes when certain letters or words are added or removed.
I loved these projects and I wanted to make something similar of my own. I searched for other examples of it, without too much success. Maybe my google-fu is lacking today.
In my search, I discovered a list of different types of word play, and Kangaroo words.
A Kangaroo word is a word that contains all the letters of another word, in order (without transposing any letters). The letters do not have to be next to each other. For example: encourage contains courage, cog, cur, urge, core, cure, nag, rag, age, nor, rage and enrage.
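That definition boils down to a subsequence check. Here is a minimal sketch of it in Python; the function name `is_sub_word` is made up for this example and is just one way to write the test.

```python
def is_sub_word(word, candidate):
    """Return True if candidate's letters appear in word, in the same order."""
    letters = iter(word)
    # `ch in letters` consumes the iterator up to the first match,
    # so every later letter has to be found after the previous one.
    return all(ch in letters for ch in candidate)


print(is_sub_word("encourage", "courage"))  # True
print(is_sub_word("encourage", "rage"))     # True
print(is_sub_word("encourage", "gear"))     # False, the letters are out of order
```

Note that this check also returns True when the candidate is the word itself, so a caller has to skip that case.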
Kangaroo words weren’t exactly what I was looking for, but they were on the right path. I started searching for a list of all the common Kangaroo words, and for the largest Kangaroo words that exist. I was able to find a few small lists but nothing that was complete.
The Word Circus: A Letter-Perfect Book (Lighter Side of Language Series) was a good book with lots of Kangaroo words and phrases.
Because there was no exhaustive list of Kangaroo words, and I needed to practice my Python, I decided to create my own list. The script goes through all ~250,000 words and finds all the sub words that appear in each word. The problem with this version is that a lot of the sub words being found were not common words.
For example: Districts has 44 sub words that include: ric, iris, srs, ist, tit, disc, dcs, discs, itc, tis, irc, sic, sti, ics, dst, dir, str, tits, src, ict, sri, irs, its, sits, dss, dist, iss, sit, tic, district, tri, sis, rcs, rts, dit, sts, dts, dsc, isis, cts, iis, dis, strict, dirt
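A rough sketch of what such a script could look like is below. It is not the actual source code: the function names and the tiny stand-in vocabulary are assumptions for illustration, and a real run would load the full word list from a file.

```python
def is_sub_word(word, candidate):
    """Return True if candidate's letters appear in word, in the same order."""
    letters = iter(word)
    return all(ch in letters for ch in candidate)


def find_sub_words(word, vocabulary):
    """Return every shorter vocabulary entry hidden inside word, sorted."""
    return sorted(
        other for other in vocabulary
        if len(other) < len(word) and is_sub_word(word, other)
    )


# Stand-in vocabulary (assumption); the real list had ~250,000 entries.
vocabulary = {"districts", "district", "strict", "dirt", "disc", "sits", "tri", "cat"}

for word in sorted(vocabulary):
    subs = find_sub_words(word, vocabulary)
    if subs:
        print(f"{word} has {len(subs)} sub words that include: {', '.join(subs)}")
```

Comparing every word against every other word is the simplest approach, but it gets slow on a large list, so treat this as a starting point.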
I would have preferred that it only showed the most common words: disc, discs, strict, dirt, tits, district, sits
Source code, [Output]
The next version only used the top 20,000 most commonly used words, taken from an n-gram frequency analysis of Google’s Trillion Word Corpus. This subset of words also included slang, swear words, and names of companies. I limited this script to only find sub words that are longer than 3 letters, to reduce the noise. This produces a much better result.
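The two filters can be sketched like this. The helper names and the short `ranked` list are assumptions for the example; the real input would be the 20,000-word frequency list, most common word first.

```python
def is_sub_word(word, candidate):
    """Return True if candidate's letters appear in word, in the same order."""
    letters = iter(word)
    return all(ch in letters for ch in candidate)


def common_sub_words(word, common_words, min_length=4):
    """Keep only sub words that are in the common list and at least min_length long."""
    return [
        other for other in common_words
        if min_length <= len(other) < len(word) and is_sub_word(word, other)
    ]


# Stand-in for the frequency-ranked list (assumption).
ranked = ["the", "fact", "face", "file", "facilities", "ties", "its", "cat"]

print(common_sub_words("facilities", ranked))
# ['fact', 'face', 'file', 'ties']
```

Three-letter entries such as "its" are dropped by the length filter even though their letters do appear in order.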
For example:
- Facilities has 21 sub words that include: fats, clit, cite, lies, fail, facts, acts, files, cities, fits, ties, aces, flies, fact, face, cites, lite, file, fails, fate, faces
- Generation has 17 sub words that include: nato, neat, ratio, raton, erin, eaton, gene, rain, neon, gran, tion, grin, nero, enron, gain, grain, nation
- Servants has 17 sub words that include: evans, vans, eats, seats, sean, seas, sent, rants, servant, serv, evan, sans, rant, rats, sets, ants, seat
Now that I have a giant list of words and their sub words, what is the largest Kangaroo word (the one with the most sub words) in the top 20,000 most commonly used words?
Telecommunications has 120 sub words that include: lemma, communion, comm, tion, cocos, cont, coco, elena, unions, lena, louis, loan, coats, counts, onion, union, lion, ciao, coca, ions, cain, conn, cons, icon, mains, cats, tons, econ, toons, locations, latin, comma, elec, onto, lent, lemon, telecommunication, eaton, coins, conan, comics, commits, elect, tele, mins, unit, communication, toni, luton, comic, unto, mans, laos, teas, location, como, outs, cuts, count, tits, tuna, toon, commit, nato, units, lots, oman, main, cunt, commons, econo, elem, loans, tout, elton, lions, icons, cans, lotion, lean, common, coma, omni, lucas, eats, tous, tomato, emma, coat, onions, nation, mais, lets, telecom, coin, leica, leon, tees, tent, teen, teens, tomcat, lens, mats, elections, cation, omit, luna, tuition, tents, cmos, lois, communications, tions, luis, otis, election, tens, telecoms, nations
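Finding the winner is a matter of building a table of sub words per word and taking the entry with the longest list. A small sketch, using made-up helper names and a handful of words in place of the full 20,000:

```python
def is_sub_word(word, candidate):
    """Return True if candidate's letters appear in word, in the same order."""
    letters = iter(word)
    return all(ch in letters for ch in candidate)


def build_table(words, min_length=4):
    """Map each word to the list of shorter words hidden inside it."""
    return {
        word: [
            other for other in words
            if min_length <= len(other) < len(word) and is_sub_word(word, other)
        ]
        for word in words
    }


# Stand-in word list (assumption).
words = ["servants", "servant", "rants", "rant", "ants", "seat", "vans"]
table = build_table(words)

# The "largest" Kangaroo word is the one with the most sub words.
largest = max(table, key=lambda word: len(table[word]))
print(largest, len(table[largest]))  # servants 6
```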
The next step is to create a phrase using the words with the most sub words, then test different arrangements of sub words to see if they produce a phrase that also makes sense. Testing whether a string of words forms a proper English phrase is harder than it sounds. I am going to try it manually first, and if I fail then I can let the robots at it.
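If it does come to the robots, generating the candidates is the easy half. Here is a sketch, assuming each position in the phrase has a list of options (the original word plus some of its sub words); the function name and the sample options are made up for illustration. Deciding which candidates make sense is still left to a human.

```python
from itertools import product


def candidate_phrases(options_per_position):
    """Yield every phrase formed by picking one option for each position."""
    for choice in product(*options_per_position):
        yield " ".join(choice)


# Each inner list is one word of the phrase followed by some of its sub words.
options = [
    ["servants", "rants", "seat"],
    ["facilities", "fail", "lies"],
]

for phrase in candidate_phrases(options):
    print(phrase)
```

With three options in each of two positions this prints nine candidates, and the count multiplies quickly as the phrase gets longer.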