Machine learning is one of the fastest-growing and most in-demand skills on the IT job market these days. The technology has captured the imaginations of the general public and budding computer scientists alike with its seemingly endless applications. Every few days, a new research paper or tech product makes headlines, promising to solve the most complex problems using machine learning.
Naturally, many people now want to learn machine learning but don’t know where to start. Luckily, Santiago (@svpino), a computer scientist from Florida, has compiled a list of machine learning projects to help budding machine learning engineers begin their journeys. All of the datasets needed for these projects can be found on kaggle.com. We’ve chosen 5 projects from Santiago’s list which we believe would work best for beginners:
1. Housing Prices
The first project on the list is a housing prices calculator. Machine learning aficionados will be familiar with this project, as it was used extensively as an example in Andrew Ng’s introductory Machine Learning course on Coursera. This is a classic entry-level machine learning project which uses linear regression as its machine learning algorithm. The calculator uses several parameters, such as square footage, number of bedrooms, and area, to predict the price of a house.
The data set for this project can be found at https://www.kaggle.com/c/house-prices-advanced-regression-techniques
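To make the idea of linear regression concrete, here is a minimal sketch that fits a straight line to a single feature using ordinary least squares. The numbers are made-up toy values, not rows from the Kaggle dataset, and the function name `fit_line` is our own. A real solution would use many features and, most likely, a machine learning library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; their ratio is the least-squares slope.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data (assumption): square footage and sale price.
square_feet = [1000, 1500, 2000, 2500]
prices = [150000, 200000, 250000, 300000]

slope, intercept = fit_line(square_feet, prices)

# Predict the price of a hypothetical 1,800 sq ft house.
predicted_price = slope * 1800 + intercept
print(round(predicted_price))
```

The model here is simply price = slope × square footage + intercept. Adding more parameters such as the number of bedrooms turns this into multiple linear regression, which follows the same principle.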
2. Titanic: Machine Learning from Disaster
Now that we’ve covered regression, let’s move on to the other main type of supervised learning problem: classification. This project uses machine learning to predict which passengers of the Titanic survived after the ill-fated ship collided with an iceberg and sank to the bottom of the Atlantic. The machine learning algorithm used here is a decision tree. The model takes in passenger data such as name, age, gender, and socio-economic class, and learns from the training examples how likely each passenger was to survive. In this project you will learn about decision trees. Activation functions such as tanh, ReLU, and sigmoid belong to neural networks rather than decision trees, but this dataset is also a convenient place to meet them if you later try a neural network on the same problem. Once you’ve completed these two projects, you will be ready to tackle more difficult projects such as the ones given below.
The data set for this project can be found at https://www.kaggle.com/c/titanic
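A decision tree is built by repeatedly choosing the question that best separates the classes. The sketch below shows just one such step: it picks the single yes/no feature that gives the purest split, measured by Gini impurity. The passenger rows are invented for illustration and the helper names (`gini`, `best_split`) are our own. For the full project you would normally rely on a library implementation such as scikit-learn’s `DecisionTreeClassifier`.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return the binary feature whose split has the lowest weighted impurity."""
    best_feature, best_score = None, float("inf")
    for feature in rows[0]:
        left = [y for r, y in zip(rows, labels) if r[feature] == 0]
        right = [y for r, y in zip(rows, labels) if r[feature] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_feature, best_score = feature, score
    return best_feature, best_score


# Made-up toy passengers (assumption), NOT the real Kaggle data.
passengers = [
    {"is_female": 1, "first_class": 1},
    {"is_female": 1, "first_class": 0},
    {"is_female": 0, "first_class": 1},
    {"is_female": 0, "first_class": 0},
    {"is_female": 1, "first_class": 0},
    {"is_female": 0, "first_class": 0},
]
survived = [1, 1, 0, 0, 1, 0]

feature, impurity = best_split(passengers, survived)
print(feature, impurity)
```

A full tree repeats this step on each resulting group until the groups are pure enough or a depth limit is reached.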
3. Bag of Words Meets Bags of Popcorn
This project focuses on natural language processing, which is one of the most popular topics in machine learning. Natural language processing is increasingly becoming a part of our everyday lives, with applications such as virtual assistants, chatbots, speech-to-text and text-to-speech conversion, and much more. This particular project involves performing sentiment analysis on IMDB movie reviews. It makes use of Google’s Word2Vec, a machine learning tool that learns vector representations of words so that words used in similar contexts end up close together, capturing the semantic relationships among words. Word2Vec is itself a shallow neural network. It serves a similar purpose to earlier approaches based on deep and recurrent neural networks (don’t worry about these for now), but trains far more efficiently. This project aims to identify the real meaning behind movie reviews by navigating through sarcasm, ambiguity, and wordplay, all of which are quite difficult for computers to understand. The project is broken down into 3 parts. The first part is geared towards beginners and uses basic natural language processing techniques. We advise that you stick to this part of the project until you’ve built up confidence in your machine learning and natural language processing skills.
The dataset for this project can be found at https://www.kaggle.com/c/word2vec-nlp-tutorial
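The “bag of words” in the project’s title refers to one of the simplest ways of turning text into numbers: count how often each word appears and ignore word order. The sketch below does this with the Python standard library only. The two reviews are invented examples and the function names are our own, so treat it as an illustration of the idea rather than the tutorial’s actual code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(texts):
    """Return a sorted vocabulary and one word-count vector per text."""
    vocab = sorted({word for text in texts for word in tokenize(text)})
    vectors = []
    for text in texts:
        counts = Counter(tokenize(text))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Made-up example reviews (assumption), not taken from the IMDB data.
reviews = ["A great movie, great cast", "A dull movie"]

vocab, vectors = bag_of_words(reviews)
print(vocab)
print(vectors)
```

Each review becomes a list of counts aligned with the vocabulary, and a classifier can then learn which words tend to appear in positive or negative reviews.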
4. The Walmart Challenge
The Walmart Challenge is a time series analysis problem, which is another major topic of machine learning these days. A time series is a series of data points organized in chronological order, and time series analysis is the act of extracting meaning from such a series. The usual goal is to make predictions about what will happen in the future. Time series analysis is used in finance (such as predicting stock prices and consumer trends), statistics, engineering, weather forecasting, earthquake prediction, and much more. This makes it one of the most useful machine learning skills to have, especially if you want to pursue a career in data science or business intelligence. In the Walmart Challenge, you will be predicting weekly sales figures based on previous data. It will help you get familiar with time series analysis and allow you to build up some of the skills needed to tackle more complex problems in this domain of machine learning.
The dataset for this project can be found at https://www.kaggle.com/bletchley/course-material-walmart-challenge
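Before reaching for a sophisticated model, it helps to have a simple baseline to beat. The sketch below forecasts next week’s sales as the average of the last few weeks and then checks how that rule would have performed on past weeks. The sales figures are made up, and the function names and the window of 3 weeks are arbitrary choices for illustration.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def backtest_mae(series, window=3):
    """Mean absolute error of the forecast when replayed over past weeks."""
    errors = []
    for i in range(window, len(series)):
        forecast = moving_average_forecast(series[:i], window)
        errors.append(abs(series[i] - forecast))
    return sum(errors) / len(errors)


# Made-up weekly sales figures (assumption), not Walmart data.
weekly_sales = [100, 120, 110, 130, 125, 135]

next_week = moving_average_forecast(weekly_sales)
mae = backtest_mae(weekly_sales)
print(next_week, round(mae, 2))
```

Note that the backtest only ever uses weeks that came before the one being predicted. Respecting chronological order in this way is what distinguishes time series evaluation from the random train/test splits used in the earlier projects.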
5. Open Images 2019 – Object Detection
For our final project recommendation, we have object detection using computer vision with the Open Images dataset. Computer vision is perhaps one of the most widely known and celebrated domains in machine learning due to its use in self-driving cars, face recognition, and object detection. This project helps beginners get started in computer vision and object detection by providing a curated dataset of 9 million images ‘annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships’. This huge dataset with a diverse collection of images is designed to be easy to work with, making it a good starting place for beginners on their journey into the world of computer vision. The project has 3 tracks ranging in difficulty from easy to difficult, similar to the Bag of Words Meets Bags of Popcorn project. We recommend the first track for getting started, which involves detecting bounding boxes around object instances in the pictures.
The dataset for this project can be found at https://www.kaggle.com/c/open-images-2019-object-detection
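When a model predicts a bounding box, a common way to judge it is to measure how much it overlaps with the hand-labelled box, using a score called intersection over union (IoU). The sketch below computes IoU for two boxes. We assume each box is given as (x_min, y_min, x_max, y_max), and the example coordinates are made up, so check the dataset’s own documentation for its exact box format.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes are apart).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Made-up boxes (assumption): a labelled box and a model's prediction.
ground_truth = (0, 0, 2, 2)
prediction = (1, 1, 3, 3)

score = iou(ground_truth, prediction)
print(round(score, 3))
```

An IoU of 1.0 means the boxes match exactly and 0.0 means they do not overlap at all. Detection benchmarks typically count a prediction as correct when its IoU exceeds a chosen threshold.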
For more projects and datasets similar to the ones given above, you can visit www.kaggle.com. For more information about Santiago, you can follow him on Twitter at @svpino or visit his blog on learning to program at https://www.svpino.com/.