Expert-Chatbot Aileen – Part 3 – Tackling Synonyms and Misspelling

After teaching Aileen in the last post how to react to the most important words in a sentence, we now want to improve her ability to detect the correct word. It happens quite often that a person writing accidentally swaps two characters in a word, like “light” vs. “lihgt”. We as humans see immediately what the real word should have looked like. Aileen, however, currently only reacts to words that are written 100% correctly.

Another challenge for Aileen is that she does not know any synonyms. Whereas handling synonyms simply means putting more words into our database, adding misspelling resistance is a more advanced task. That is why we dive deeper into this topic.

The first idea for tackling the misspelling challenge is to add some kind of spell-checking dictionary. However, we are not really interested in correcting every possible word. Instead, we only care about the few descriptive words we want to find in a sentence in order to know what to answer.

We achieve this by no longer searching for an exact match as we did before. Instead, we calculate the similarity for each word, and if the similarity exceeds a given value, we consider the word “found”.

In pseudo-code we may express it this way:
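A minimal Python sketch of this idea could look as follows. The keyword list, synonym table and the 0.6 threshold are example values, and Python's standard-library `SequenceMatcher` stands in as a placeholder for the similarity calculation for now:

```python
from difflib import SequenceMatcher

# Example keyword and synonym tables -- the real ones would come from
# Aileen's database built up in the earlier posts.
KEYWORDS = ["light", "temperature"]
SYNONYMS = {"lamp": "light", "heat": "temperature"}
SIMILARITY_THRESHOLD = 0.6  # assumed cut-off

def calculateSimilarity(word, reference):
    # Placeholder similarity from the standard library, returning 0..1.
    return SequenceMatcher(None, word, reference).ratio()

def correct_words(sentence):
    corrected = []
    for word in sentence.lower().split():
        # Replace the word if it is similar enough to a known keyword.
        for keyword in KEYWORDS:
            if calculateSimilarity(word, keyword) >= SIMILARITY_THRESHOLD:
                word = keyword
                break
        # Map synonyms onto our predefined keywords.
        corrected.append(SYNONYMS.get(word, word))
    return corrected
```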

First we iterate over each word and check its spelling. If the word exceeds a given similarity value, we replace the old word with the new one. This way we are able to reduce errors. We could improve the algorithm even further by replacing the misspelled word with the best-matching word in our dictionary; we leave this improvement open to you. Afterwards we map every synonym to a given word, so that in the end we can simply search for the predefined words from the last post.

The code so far should be clear now. One big question mark is still left: “What does the function calculateSimilarity look like?”

This brings us to the most interesting part: how do we determine the similarity of two words? The most accurate approach is the Levenshtein distance, which Wikipedia defines as:

In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

If the principle is not clear, do not worry; we will not dive into it any deeper anyway, because calculating the Levenshtein distance is computationally quite heavy. That is why we calculate the word similarity using n-grams instead.

The basic principle is to divide the word into bigrams. A bigram here is two characters combined into “one bin”. For example, dividing “Hello” into bigrams results in “He”, “ll”, “o” (a leftover odd character forms its own bin). In the next step we count how often every bigram occurs in our original (possibly misspelled) word and in our reference word. The more the bigrams differ, the less similar the two words are. We will go through a small example using the two words “favorite” and “faforite”.

Bigrams “favorite”:

  • fa: 1
  • vo: 1
  • ri: 1
  • te: 1

Bigrams “faforite”:

  • fa: 1
  • fo: 1
  • ri: 1
  • te: 1

Now, after calculating the bigrams and how often each occurs in the word, we subtract the counts from each other and sum up the absolute differences:

  • fa: 1-1 = 0
  • vo: 1-0 = 1
  • fo: 1-0 = 1
  • ri: 1-1 = 0
  • te: 1-1 = 0

This results in a difference of 2 bigrams. Out of 8 bigrams in total (4 per word), this gives a similarity of:

1-2/8 = 75%

This means the two words are 75% similar. The pseudo-code to calculate the similarity:
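A Python sketch of this bigram comparison, following the scheme from the worked example above (non-overlapping two-character bins, absolute count differences summed and divided by the total number of bigrams in both words):

```python
from collections import Counter

def calculateSimilarity(word, reference):
    """Bigram-based similarity between 0 and 1 (1 = identical bigram counts)."""
    def bigrams(w):
        # Split the word into non-overlapping two-character bins,
        # e.g. "favorite" -> ["fa", "vo", "ri", "te"].
        return Counter(w[i:i + 2] for i in range(0, len(w), 2))

    a, b = bigrams(word), bigrams(reference)
    # Sum of absolute count differences over the bigrams of both words.
    difference = sum(abs(a[g] - b[g]) for g in set(a) | set(b))
    total = sum(a.values()) + sum(b.values())
    return 1 - difference / total if total else 1.0
```

For “favorite” vs. “faforite” this reproduces the 1 − 2/8 = 75% from the example above.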

If we now merge everything, we get the following chatbot, which corrects some simple spelling mistakes:
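A self-contained sketch of the merged pieces might look like this; the answer table, synonym table and the 0.6 threshold are invented examples standing in for Aileen's real data:

```python
from collections import Counter

# Example tables -- Aileen's real keywords and answers from the
# earlier posts would go here.
ANSWERS = {"light": "I will switch the light for you.",
           "temperature": "The temperature is 21 degrees."}
SYNONYMS = {"lamp": "light", "heat": "temperature"}
THRESHOLD = 0.6  # assumed cut-off

def calculateSimilarity(word, reference):
    # Non-overlapping bigram comparison as derived above.
    bigrams = lambda w: Counter(w[i:i + 2] for i in range(0, len(w), 2))
    a, b = bigrams(word), bigrams(reference)
    difference = sum(abs(a[g] - b[g]) for g in set(a) | set(b))
    total = sum(a.values()) + sum(b.values())
    return 1 - difference / total if total else 1.0

def answer(sentence):
    for word in sentence.lower().split():
        # Map synonyms onto our predefined keywords first.
        word = SYNONYMS.get(word, word)
        # Then accept the keyword if the (possibly misspelled)
        # word is similar enough.
        for keyword in ANSWERS:
            if calculateSimilarity(word, keyword) >= THRESHOLD:
                return ANSWERS[keyword]
    return "Sorry, I did not understand that."
```

With these example tables, both “please dim the lamp” (a synonym) and “turn on the lihgt” (a character swap) lead to the answer for “light”.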

