Word Predictor - Tries

30 June, 2020


You can think of this as an intro to tries. It is a tree-like data structure whose nodes store alphabets. Once inserted, words can be searched in O(n) where n is the length of the string.
I'll go over the how to make a node, insert strings, find them and find next closest ones.
Node

I'll be working only on lowercase letters. Hence, total alphabets that we have are 26. The value each node stores is decided by the path taken from its parent node. So if we take the 0th path, the node stores 'a'. So each node has 26 children, some might be null. We also use two other values : end and size. End gives the number of words ending at this node. Size gives the size of the sub-graph rooted at this node, so it is basically the number of words which use this node and and don't end here.

Insertion & Search

Instead of working with alphabets, we can reference them using numbers. We also add a root node which is the starting point. While inserting a string to the trie, we start with the first letter. We check if that child is not null, and move to it if it isn't. If it is null, then we create a new node there. We keep repeating this process till we reach the end of the string. At the last node, we increase the value of its end. So this tells us a word ended here.

Searching is very similar to insertion. We keep travelling till either we reach the end of the string or a null pointer. If we encounter no null pointers and the node that we reach has end>0, then we have found this word. Otherwise, this word was not pushed into the trie.

Next Closest Words

So, now we know how to search for words. We now have to find the next closest words. First,we push in all the words that we could possibly have. So basically the whole dictionary. Then we take in a word, which we have to complete or predict the end of. So, once added, we start searching. For the next closest word, we will have to use some traversal technique. DFS or BFS ? But DFS will first traverse along one path and then move on the next child. This is different than what we want. We use BFS, as it traverse the layer completely and then moves on to the next one. We just check if a word ends there and add it to our predictions.

Improvements

Right now, it just predicts the closest words. But in reality we would want it according to usage. If one word is being used more, you would want to predict that instead of the closest. We could add another parameter called frequency. While traversing preference can be given to the one with higher frequency instead of plain BFS. A lot more needs to be figured out in this case. Coming soon.

If you have any queries or suggestions, feel free to hit me up at : dornumofficial@gmail.com. For the complete code , you can click here