
Deep Dive – What is the difference between Supervised and Unsupervised Learning?

Written by Hamza Zakir · 3 min read

At this point in time, artificial intelligence needs little to no introduction. Over here in Pakistan, investment and interest in AI have only increased in recent years. Be it the number of Pakistani startups focused on AI, or the PAF’s decision to include AI in its warfare strategies, the average Pakistani has every reason to believe that AI will be a significant part of their life moving forward.

With all the buzz and intrigue around AI, it is natural to come up against all kinds of terms. Even if you aren’t actively studying the subject, the vast amount of AI-centric content you find online will use these terms to describe recent developments in the field. Considering the significant status that AI holds in our lives and our discourse now, it can’t hurt to educate ourselves on the main concepts behind AI.

Therefore, for this Deep Dive, we shall unpack supervised and unsupervised learning. As the terms suggest, they are merely different modes of learning for a “smart” program, which is any program capable of learning. In fact, creating such programs that can learn from experience is the domain of machine learning, which is yet another term you may have seen quite a lot. To sum up, machine learning is a subset of artificial intelligence, and supervised and unsupervised learning are two popular means of achieving machine learning.

Right, so we know that there are two learning models that enable a program to learn something. In a supervised learning model, the algorithm learns from a labelled dataset, which provides an answer key the algorithm can use to evaluate its accuracy on the training data. Meanwhile, an unsupervised model works with unlabelled data that the algorithm tries to make sense of by extracting features and patterns on its own.

If these definitions sound confusing, that’s alright. Let’s break down both of these definitions separately now.

Supervised Learning

We know what supervision means. It’s what happens when you’re doing your classwork and your teacher is watching over your shoulder, making sure you’ve got everything right and correcting you when you make a mistake. In other words, a supervisor is judging your performance and letting you know how well you’re doing.

Similarly, in supervised learning, a program has a full set of labelled data for training purposes. It’s called a labelled dataset because for each sample, there is a label that lets the program know what the right answer is. This is kind of like the answer section you have at the back of your Math textbook that you can refer to whenever you want to make sure that your answer is correct.

For instance, a labelled dataset of animal images would tell the program which images are of dogs, cats, and rabbits. If you can connect the dots so far, you’ll realize that “dog”, “cat”, and “rabbit” are the labels in this case! If a sample with a particular set of features is labelled as “cat”, the program can learn from it by forming a connection between the features and the label. When it comes across a similar set of features after training, it is likely to classify it as a “cat” based on what it learned during the training phase.
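To make the cat-and-dog example concrete, here is a minimal sketch of supervised classification in Python. The features (weight in kg, ear length in cm) and the numbers are invented for illustration, and the method shown is 1-nearest-neighbour, one of the simplest supervised techniques: predict the label of whichever training sample is closest.

```python
import math

# Toy labelled dataset: each sample is (features, label).
# Features are (weight in kg, ear length in cm) - made-up values for illustration.
training_data = [
    ((4.0, 6.5), "cat"),
    ((5.2, 7.0), "cat"),
    ((20.0, 10.0), "dog"),
    ((25.0, 11.5), "dog"),
    ((1.8, 9.0), "rabbit"),
    ((2.2, 10.5), "rabbit"),
]

def classify(features):
    """Predict a label by finding the closest labelled training sample
    (1-nearest-neighbour, one of the simplest supervised methods)."""
    nearest = min(training_data,
                  key=lambda sample: math.dist(sample[0], features))
    return nearest[1]

print(classify((4.5, 6.8)))  # falls closest to the "cat" samples
```

The labels in `training_data` are exactly the “answer key” described above: the program never has to guess what a cat looks like during training, because every sample tells it.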

Thus, supervised learning is an ideal learning model for problems where there is a set of available reference points or data with which you can train a machine learning program. However, such reference points aren’t always available. In such cases, you rely on unsupervised learning.

Unsupervised Learning

Unsurprisingly, it’s not always easy to come across clean, perfectly labelled datasets. For some applications, there simply isn’t enough labelled data. Other times, machine learning researchers want to ask their algorithm questions that they don’t really have answers to. The absence of labelled data makes it impossible to carry out supervised learning, and unsupervised learning has to swoop in.

In unsupervised learning, a machine learning program is handed a dataset without any instructions on what to do with it. There are no associated answers or labels that it can refer to. The training dataset is simply a collection of samples without any desired outcome. The machine learning algorithm, therefore, has to find structure itself by extracting useful features and finding patterns. This is like trying to solve a problem from your Math textbook without having the correct answer to refer to. There will be a lot of trial and error, and if you keep at it long enough, you might just arrive at the right answer.

Unsupervised learning programs organize the given data in different ways, based on the nature of the problem. A few of the more popular ones are as follows:

  • Clustering

This is the process of looking for training data that are similar to each other and grouping them together. For instance, even if a program knows nothing about birds, it could look at a collection of bird images and cluster them based on feather color, size, or the shape of their beaks. Clustering happens to be the most common means of organizing data for unsupervised learning.
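The bird example above can be sketched with k-means, one of the best-known clustering algorithms. The data below are made-up one-dimensional measurements (say, wing spans in cm), and note that no labels appear anywhere: the grouping emerges from the data alone.

```python
# Minimal k-means clustering sketch in pure Python - no labels anywhere.
# Made-up 1-D measurements, e.g. bird wing spans in cm.
data = [9.0, 10.0, 11.0, 30.0, 31.0, 33.0]

def k_means(points, centroids, iterations=10):
    """Alternate between assigning each point to its nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = k_means(data, centroids=[9.0, 33.0])
print(clusters)  # the small birds and the large birds end up in separate groups
```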

  • Anomaly Detection

This technique looks for unusual patterns in data, and flags any outliers. As one would suspect, this organization technique is suitable for fraud detection systems.
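A very simple statistical version of this idea flags any value that sits far from the mean. The transaction amounts below are invented, and the two-standard-deviation threshold is just a common rule of thumb; real fraud-detection systems use far more sophisticated models.

```python
import statistics

# Made-up daily transaction amounts, with one suspicious spike.
amounts = [120.0, 95.0, 130.0, 110.0, 105.0, 2500.0, 115.0]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Treat anything more than 2 standard deviations from the mean as an outlier.
outliers = [a for a in amounts if abs(a - mean) > 2 * stdev]
print(outliers)  # only the 2500.0 spike gets flagged
```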

  • Association

This involves finding correlations between the features of a data item. Using this technique, an unsupervised learning algorithm can look at a few key attributes of a sample and predict what other attributes it could be associated with. This is how recommendation systems work. If you’ve been watching a lot of crime thrillers on Netflix, the algorithm can form an association and recommend more shows of a similar genre to you.
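One of the simplest ways to surface associations is to count how often items appear together across many “baskets”. The viewing histories and show titles below are invented for illustration; a pair that co-occurs frequently is the kind of signal a recommender can exploit.

```python
from collections import Counter
from itertools import combinations

# Invented viewing histories - each set is one user's watched shows.
histories = [
    {"Mindhunter", "Narcos", "Ozark"},
    {"Mindhunter", "Narcos"},
    {"Narcos", "Ozark"},
    {"Mindhunter", "Ozark", "Narcos"},
]

# Count every pair of shows that appears together in the same history.
pair_counts = Counter()
for basket in histories:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Frequently co-occurring pairs suggest associations worth recommending.
print(pair_counts.most_common(2))
```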

  • Autoencoders

Autoencoders take input data, compress it into a code, and then attempt to recreate the input data from the summarized code. This is like starting with a story, summarizing it into basic notes, and then rewriting the original story using only those notes. Autoencoders are generally used to remove noise from visual data like images and video.
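The compress-then-reconstruct shape can be sketched with a hand-written encoder/decoder pair. To be clear, a real autoencoder *learns* both mappings with a neural network; this fixed, hand-coded version (averaging neighbouring values) only demonstrates the idea that a smaller code can carry enough information to rebuild the input when the data is redundant.

```python
# Hand-coded encoder/decoder pair illustrating the shape of an autoencoder:
# compress the input to a smaller code, then reconstruct from the code alone.
# (A real autoencoder learns both mappings; this fixed version is just a demo.)

def encode(sample):
    """Compress a 4-value sample into a 2-value code (pairwise averages)."""
    return [(sample[0] + sample[1]) / 2, (sample[2] + sample[3]) / 2]

def decode(code):
    """Reconstruct a 4-value sample from the 2-value code."""
    return [code[0], code[0], code[1], code[1]]

sample = [10.1, 9.9, 4.0, 4.2]    # neighbouring values are similar (redundant)
code = encode(sample)             # roughly [10.0, 4.1] - half the size
reconstruction = decode(code)     # close to the original input
print(code, reconstruction)
```

Because the input is redundant, the reconstruction lands close to the original; when real autoencoders are trained this way, the noise in an image tends to be exactly the part that does not survive the round trip, which is why they are used for denoising.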

Conclusion

So, there you have it. Two different learning models for machine learning algorithms, each with its own applications. As a student, practitioner, or enthusiast, it’s really only a matter of picking the best approach based on your requirements and the nature of the task at hand.
