Monday, April 25, 2011

You could fry an omelette...

...not on our brains (human dignity prevents this), but on the CPU of our developers' computers, as they work on the foundation for our new vocabulary module.

Our brains are running red hot as well, and I don't think I've slept without dreaming of words and numbers for weeks.

The shape this project takes is amazing, and there's that great "rocket blast-off" tingling in the tip of my nose; something great is about to happen.

Over the next few blog entries, I'll share what we're up to as I find a bit of time. Feedback is very welcome!

Normally, when I explain to my friends what we're up to, their eyes glaze over, because honestly, what our bright LTW engineers are cooking right now is very powerful stuff, and a fairly complex undertaking.

To start with something less abstract than "language," let me give you a brief "stellar" explanation of why I'm so excited about the new LearnThatWord module.

Once upon a time, people would look up at the sky and see a random sprinkling of stars. And air was just invisible nothingness.
Over many thousands of years, and through careful observation and analysis, humankind slowly determined that there was an order to the stars, a "cosmos," a system and harmony.

Certain stars could be seen moving in groups, others had a certain quality that distinguished them.

Later on, we started to understand that we are looking at different systems and spheres, five in total, the troposphere being the one closest to us, nested inside each other like Russian dolls.



I love this picture, and although I don't know the context it was created for, I see a learner and seeker who managed to break through the core sphere, and who is about to move on to the next. It's one of the most amazing illustrations of the process of "learning" in my mind.

However, once we go above the spheres, we're actually looking at an infinite collection of large units called solar systems. Most of us learned that unless you like to flirt with madness, it is quite enough to concern yourself with our local, hometown universe, since the size and complexity of this one alone will make you nauseated if you try to completely comprehend it.

Over time, humankind learned that what we call "the universe" is simply a word we use to represent something that nobody is actually able to visualize or comprehend. We soothe ourselves by using a singular term for the infinite vastness, which makes our limitation less obvious. Language is similar in that it gives the impression we're looking at one "thing," where in reality there's only infinite, morphing and evolving grandness.

So, this is how the old astronomers would sketch their astronomical knowledge. Keep this in mind as I make a leap from the stars to the English language, because you will better understand what the new quiz will bring if you visualize it with this structure.
Ok... how this relates to our new module:

Words are not created equal. It's fairly old knowledge that we use some words a lot, and others much less frequently, hence it is more important to know the very common words than the more exotic and obscure ones.

Already in the early part of the last century, people sat down and -- at the time, manually -- looked through large amounts of texts, counting words one by one.

These old frequency lists are still quite relevant today because they included only a few hundred of the top words, and high-frequency words change very little over time. They're words like "the" (number 1), "be" (including its relatives: am, is, are, was, been, etc.), "I," "you," etc.
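What those early researchers did by hand can be sketched in a few lines today. This is only a minimal illustration, not how any published frequency list was actually compiled; the `frequency_list` function and the sample sentence are made up for the example.

```python
import re
from collections import Counter


def frequency_list(text, top_n=5):
    """Count every word in the text and return the top_n most frequent."""
    # Lowercase the text so "The" and "the" are counted as one word.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample text; a real list would be built from millions of words.
sample = "the cat and the dog saw the bird and the fish"
top_words = frequency_list(sample, top_n=2)
print(top_words)  # [('the', 4), ('and', 2)]
```

Even in this tiny sample, "the" comes out on top, just as it does in the real lists.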

This is an excerpt from Wikipedia:
So, owning the core words brings an instant advantage, a quantum leap towards unlocking a language. Unfortunately, it seems as though progress is made rather slowly after the first 1,000 words.

However, to be fluent in a language, you need to know more than 95% of the words you encounter. If you are presented with a text of 100 words, not knowing 5 of them is still a high number, and you will need a lot of energy and concentration to make it through a text or conversation at this level. It's kind of like riding a bicycle with a flat tire. You can do it, but it's bumpy and a pain and you won't find it very fun.
Here's another word estimate:
To reach mastery, you actually need about 15,000-20,000 words, and by "words" most researchers mean the "word family." So dance, dances, dancing, danced would count as one word. If words were counted more strictly, without combining them into a "root word" or "word family" or "lemma," the number of words you'd need to know would be much, much larger.
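To make the difference between counting words and counting word families concrete, here is a small sketch. The `WORD_FAMILIES` table is a hand-made, hypothetical stand-in for a real lemma list, and the function name is invented for this example.

```python
# Hypothetical, hand-made mapping from each word form to its family head word.
WORD_FAMILIES = {
    "dance": "dance", "dances": "dance", "dancing": "dance", "danced": "dance",
    "word": "word", "words": "word",
}


def count_forms_and_families(tokens):
    """Return (number of distinct word forms, number of distinct families)."""
    forms = {token.lower() for token in tokens}
    # Forms missing from the table are treated as their own family.
    families = {WORD_FAMILIES.get(form, form) for form in forms}
    return len(forms), len(families)


tokens = ["dance", "dances", "dancing", "danced", "word", "words"]
form_count, family_count = count_forms_and_families(tokens)
print(form_count, family_count)  # 6 2
```

Six word forms collapse into two families here, which is why a strict count of word forms comes out so much larger.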

There are countless ways to learn the 1,000 core words, because that's what many, many publishers focus on. It's a waste of energy, because these "core words" are words you'll learn nearly automatically anyway, and quite effortlessly. You'll encounter them everywhere, so your brain can easily build automaticity around them.

Going beyond these core words, effective support quickly dissipates and it becomes exponentially more difficult to learn.

Only LTW can provide tutoring along the full frequency strand, being the only program designed around a comprehensive vocabulary data set, now at 180,000 words (and continuously growing).

The meaning of 80/20 to language
What various language programs suggest is that if you learn the top 1,000 or 2,000 words you're close to mastery. Doesn't that sound great? Learn the 1,000 words that make up 80% of texts and you're almost done!

Once you look closer, though, you'll find that these 1,000 core words are words that you will naturally pick up rather quickly; they are really very basic. However, to master living language, you need to be able to fill in the more advanced words in synergy with these core words to actually get something out of them. Meaning is most commonly communicated through the more advanced vocabulary, the more specific words.

Here are some randomly picked lines. Blanked out are the words with frequency rank larger than 1,000:

The world is very xxxxxxx.
Do you like your xxxxxx?
What do you think about xxxxxx?
I can't believe it's xxxxxx!

What all of these sentences have in common is that they use core words for roughly 80% of the text volume. Despite this big text volume that's covered by the high frequency words, not knowing the other 20% makes the sentences useless for communication!

Try it for yourself:
Take an average, casual text and blank out all the slightly more specific or advanced words. You'll see the 80/20 proportion (or something very similar). You'll also see that the text has become very hard to understand. If the text is slightly more specific, your primary core vocabulary, while essential, takes you nowhere at all.
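If you'd rather let a computer do the blanking, the experiment can be sketched roughly like this. The `CORE_WORDS` set is a tiny hypothetical stand-in for a real top-1,000 list, and the sample sentence is made up; neither comes from our actual data.

```python
import re

# Hypothetical stand-in for a real list of the 1,000 most frequent words.
CORE_WORDS = {"the", "world", "is", "very", "do", "you", "like", "your",
              "what", "think", "about", "i", "can't", "believe", "it's"}

WORD_PATTERN = re.compile(r"[A-Za-z']+")


def blank_non_core(text, core_words=CORE_WORDS):
    """Blank out every non-core word and return (blanked text, coverage)."""
    words = WORD_PATTERN.findall(text)
    core_count = sum(1 for word in words if word.lower() in core_words)

    def mask(match):
        word = match.group(0)
        # Keep core words; replace anything else with x's of the same length.
        return word if word.lower() in core_words else "x" * len(word)

    blanked = WORD_PATTERN.sub(mask, text)
    coverage = core_count / len(words) if words else 0.0
    return blanked, coverage


blanked, coverage = blank_non_core("The world is very complicated.")
print(blanked)            # The world is very xxxxxxxxxxx.
print(f"{coverage:.0%}")  # 80%
```

Four of the five words survive, so coverage is 80%, yet the one word that carries the point of the sentence is gone.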

It's the 80/20 thing all over again. If you've got the 1,000 core words down, you cover 80% of the text, but only 20% of the meaning. On easy-to-read texts, 20% of words, roughly, will be made up of non-core or advanced words. Unfortunately for the learner, often these 20% carry the bulk of the meaning in a text.

The good news is that researchers (including our own team at LTW) have been setting the big data monsters on the trail of the English language all over the world, investigating its structure from all different angles, and a "language cosmos" is starting to reveal itself.

The data monster has been digesting incredible amounts of words and has produced a lot of very valuable data sets, so that we now not only know the top 20,000 word families, but far beyond.

An exciting time to be in linguistics! Or language tutoring... ;-)

Vocabulary spheres

Using this data and a few important aspects I'll explain in future entries, it is possible to divide the language cosmos into spheres (remember the image above?). It allows us to offer a scientifically and statistically sound approach to learning English. Once you reach general proficiency, you may choose to expand further into more specialized vocabulary areas (it's like launching into a new solar system).

Our new vocabulary assessment tool will allow users to tell us what their unique focus is:
Maybe you want to
-   focus on spoken language only,
-   prepare for medical school,
-   ... or business communications,
-   ... or explore humanities or social sciences,
-   ... or be on equal verbal turf with lawyers?

Tell our program what you're looking to accomplish and we will prep you accordingly. We have an incredible general frequency list. But, in addition to that, we have twelve (12) more specific frequency strings, each for a different learning focus and each extensive and comprehensive.

So with this frequency data, it is possible to break up the language learning progress into a cosmos of different spheres, and determine with incredible accuracy how much space you already cover, in terms of vocabulary. Knowing what you know allows you to optimize which words you might want to learn next, so they're not too easy or too advanced.
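To illustrate the basic idea only (this is not our actual algorithm, which I'll describe in later entries), here is a simple sketch that walks down a frequency-ranked list and suggests the most common words a learner doesn't know yet. The function name, the word list and the learner's known words are all invented for the example.

```python
def next_words_to_learn(ranked_words, known_words, batch_size=3):
    """Return the most frequent words the learner does not know yet."""
    # ranked_words is assumed to be sorted from most to least frequent.
    unknown = [word for word in ranked_words if word not in known_words]
    return unknown[:batch_size]


# Made-up frequency ranking and learner profile, for illustration only.
ranked = ["the", "be", "you", "house", "travel", "obscure", "quixotic"]
known = {"the", "be", "you", "travel"}

suggestions = next_words_to_learn(ranked, known, batch_size=2)
print(suggestions)  # ['house', 'obscure']
```

Words the learner already owns are skipped, and the rarest words stay out of the batch until the more common gaps are filled.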

We're excited to build an easy and effective vocabulary assessment tool right after launching our new vocabulary module. It will be online and interactive, and will allow users to determine their location in the English Word-iverse in a few minutes.

If you share our passion for learning and would love to wear sponsor laurels, please get in touch.

Frequency data is one of the core pillars of this project, but only one of them. I will post some more of the logic of the new algorithm as we go along, so please consider subscribing to this blog or joining us on Facebook...