Machine learning is one of the best-known branches of data science. The term was coined in 1959 by IBM researcher Arthur Samuel.
Machine learning refers to a set of computer algorithms that improve and adapt by gathering information during operation. These algorithms rely on data to uncover patterns. Initially, the algorithm uses “training data” to develop an understanding of solving specific problems by identifying patterns. Once it completes the learning phase, it can apply this knowledge to tackle similar problems with different datasets. Understanding these concepts is a crucial part of the learning path to becoming a data scientist.
Machine learning algorithms fall into four categories:
- Supervised algorithms: Algorithms that involve some guidance from the developer during the learning phase. To do that, the developer labels the training data and sets strict rules and boundaries for the algorithm to follow.
- Unsupervised algorithms: Algorithms that do not involve direct control from the developer. In this case, the algorithms’ desired results are unknown and will be discovered after the learning phase is completed.
- Semi-supervised algorithms: Algorithms that combine aspects of both supervised and unsupervised algorithms. For example, not all training data will be labeled, and not all rules will be provided when initializing the algorithm.
- Reinforcement algorithms: These algorithms balance exploration and exploitation. The gist is simple: the machine takes an action, observes the outcome, and factors that outcome into its next action, and so on.
Each of these categories is designed for a purpose; for example, supervised learning is designed to generalize from the training data and predict outcomes for new or unseen data. On the other hand, unsupervised algorithms organize and filter data to make sense of it.
Under each category lie various algorithms designed to perform specific tasks. This article covers four basic algorithms every data scientist should know to cover the machine learning basics.
№1: REGRESSION
Regression algorithms are supervised algorithms used to find possible relationships among different variables to understand how much the independent variables affect the dependent ones.
You can think of regression analysis as an equation. For example, suppose I have the equation y = 2x + z. In that case, y is my dependent variable, and x and z are the independent ones. Regression analysis finds how much x and z affect the value of y.
The same logic applies to more advanced and complex problems. To handle these varied problems, there are many types of regression algorithms; perhaps the top five are:
- Linear Regression: The most straightforward regression technique, which models a linear relationship between the dependent variable (the prediction) and the independent variables (the predictors).
- Logistic Regression: This type of regression is used on binary dependent variables. It is widely used to analyze categorical data.
- Ridge Regression: When the regression model becomes too complex, ridge regression shrinks the size of the model’s coefficients to reduce overfitting.
- Lasso Regression: Lasso (Least Absolute Shrinkage and Selection Operator) Regression is used to select and regularize variables.
- Polynomial Regression: This type of algorithm is used to fit non-linear data. Using it, the best prediction is not a straight line; it is a curve that tries to fit all data points.
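The last point can be illustrated with a quadratic dataset: a straight line cannot fit it, but a degree-2 polynomial fit captures it. A minimal sketch with NumPy (the quadratic y = x² − x + 1 is an invented example):

```python
import numpy as np

x = np.linspace(-3, 3, 30)
y = x**2 - x + 1  # curved data: no straight line can fit this

# Fit a degree-2 polynomial; polyfit returns coefficients highest-degree first
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)  # approximately [1, -1, 1]
```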
№2: CLASSIFICATION
Classification in machine learning is the process of grouping items into categories based on a pre-categorized training dataset. Classification is considered a supervised learning algorithm.
These algorithms use the training data’s categorization to calculate the likelihood that a new item will fall into one of the defined categories. A well-known example of classification is filtering incoming emails as spam or not spam.
There are different types of classification algorithms; the top 4 ones are:
- K-nearest neighbor: KNN is an algorithm that classifies a new data point based on the labels of the k closest points in the training dataset.
- Decision trees: You can think of it as a flow chart, classifying each data point into two categories at a time and then each into two more, and so on.
- Naive Bayes: This algorithm calculates the probability that an item falls under a specific category using the conditional probability rule.
- Support Vector Machine (SVM): This algorithm classifies data by finding the boundary that separates the classes with the widest possible margin; with kernels, it can also handle boundaries that are not linear.
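The KNN idea from the list above can be sketched in a few lines: a new point receives the majority label among its k nearest training points. A minimal illustration with NumPy (the `knn_predict` helper and the two toy clusters are invented for this sketch):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy labeled data: two well-separated clusters
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # "a"
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # "b"
```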
№3: ENSEMBLING
Ensembling algorithms are supervised algorithms that combine the predictions of two or more machine learning models to produce more accurate results. The results can be combined either by voting or by averaging; voting is often used for classification and averaging for regression.
Ensembling algorithms have three basic types: Bagging, Boosting, and Stacking.
- Bagging: In bagging, the algorithms are run in parallel on different training sets, all equal in size. All algorithms are tested using the same dataset, and voting is used to determine the overall results.
- Boosting: In the case of boosting, the algorithms are run sequentially. Then the overall results are chosen using weighted voting.
- Stacking: As the name suggests, stacking has two levels stacked on top of each other: the base level is a combination of algorithms, and the top level is a meta-algorithm trained on the base level’s results.
№4: CLUSTERING
Clustering algorithms are a group of unsupervised algorithms used to group data points. Points within the same cluster are more similar to each other than to points in different clusters.
There are 4 types of clustering algorithms:
- Centroid-based Clustering: This clustering algorithm organizes the data into clusters based on initial conditions and outliers. K-means is the best-known and most widely used centroid-based clustering algorithm.
- Density-based Clustering: In this clustering type, the algorithm connects areas of high density into clusters, allowing arbitrarily shaped distributions.
- Distribution-based Clustering: This clustering algorithm assumes the data is composed of probability distributions and then clusters the data into various versions of that distribution.
- Hierarchical Clustering: This algorithm creates a tree of hierarchical data clusters. The number of clusters can be varied by cutting the tree at the correct level.
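Centroid-based clustering from the list above can be sketched with a minimal k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A toy sketch with NumPy (the `kmeans` helper and the two-cluster data are made up for illustration):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and
    centroid updates for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Distance of every point to every centroid, then closest one wins
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy clusters
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels)  # first three points share one label, last three the other
```

Production implementations add a convergence check and smarter initialization (e.g. k-means++); this sketch only shows the core assign/update loop.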
Machine learning is one of the most famous and well-researched sub-fields of data science. New machine learning algorithms are constantly being developed to achieve better accuracy and faster execution.
Regardless of the algorithm, it can generally be placed in one of four categories: supervised, unsupervised, semi-supervised, and reinforcement algorithms. Each of these categories holds many algorithms used for different purposes.
The most popular Python machine learning libraries, such as Scikit-learn, ship ready-to-use implementations of most, if not all, of these algorithms.