# CatBoost – ML

**Gradient Boosting** is an ensemble machine learning algorithm that is typically used for solving classification and regression problems. It is easy to use and works well with heterogeneous data, even on relatively small datasets. It essentially creates a strong learner from an ensemble of many weak learners.

**CatBoost** or Categorical Boosting is an open-source boosting library developed by Yandex. In addition to regression and classification, CatBoost can be used in ranking, recommendation systems, forecasting and even personal assistants.

Gradient Boosting takes an additive form: given a loss function $L(y_i, F^{t})$, it iteratively builds a sequence of approximations $F^{t}$ in a greedy manner. Note that the loss function has two input values, the $i^{th}$ expected output value $y_i$ and the $t^{th}$ function $F^{t}$ that estimates $y_i$. Assuming we have constructed the function $F^{t}$, we can improve our estimates of $y_i$ by finding another function $F^{t+1} = F^{t} + \alpha h^{t+1}$, where $\alpha$ is a step size and $h^{t+1}$ is a base predictor chosen from a family of functions $H$ in order to minimize the expected loss. That is, $h^{t+1} = \arg\min_{h \in H} \mathbb{E}\, L(y, F^{t} + h)$. The minimization is approached by using a Taylor approximation or the negative gradients, such that $h^{t+1} = \arg\min_{h \in H} \mathbb{E}\, \big(-g^{t}(x, y) - h(x)\big)^{2}$, where $g^{t}(x, y) = \partial L(y, s) / \partial s \,|_{s = F^{t}(x)}$. CatBoost makes refinements to this gradient boosting technique.
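The additive update described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not CatBoost's implementation: it assumes a squared loss $L = \frac{1}{2}(y - F)^2$ (so the negative gradient is simply the residual $y - F$), one-dimensional inputs, and decision stumps as the family $H$. The names `fit_stump`, `gradient_boost` and `predict` are hypothetical helpers introduced for this example.

```python
def fit_stump(xs, residuals):
    """Least-squares decision stump on 1-D inputs.

    Returns (threshold, left_value, right_value).
    """
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right) if right else 0.0
        error = sum(
            (r - (left_value if x <= threshold else right_value)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_rounds=50, alpha=0.1):
    """Build F = F0 + alpha * (h_1 + ... + h_T) greedily, one stump per round."""
    f0 = sum(ys) / len(ys)          # initial constant approximation
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is y - F.
        neg_grad = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, neg_grad)   # h^{t+1}: least-squares fit to -g^t
        stumps.append(stump)
        preds = [p + alpha * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return f0, alpha, stumps


def predict(model, x):
    f0, alpha, stumps = model
    return f0 + alpha * sum(predict_stump(s, x) for s in stumps)
```

Calling `gradient_boost(xs, ys)` and then `predict(model, x)` shows the predictions moving towards the targets a fraction `alpha` at a time, which is the role of the step size in the update.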

Let there be a dataset $D$ with $n$ samples. Each sample has a vector $x$ of $m$ features and a real-valued target $y$.

#### Handling categorical features:

Datasets often contain categorical features, and there are various techniques for handling them in boosted trees. Unlike many other gradient boosting algorithms, which require numeric data, CatBoost **automatically handles categorical features**. One of the most common techniques for handling categorical data is one-hot encoding, but it becomes infeasible when a feature has many distinct categories. To tackle this, categories are grouped by target statistics (an estimate of the expected target value for each category). Target statistics can be calculated in different ways: greedy, holdout, leave-one-out and ordered. CatBoost uses **ordered target statistics**.

The greedy approach takes the average of the target over each category. But it suffers from target leakage, because the target value is used to compute the representation of the categorical variable, and that representation is then used as a feature for prediction ($\hat{x}^i_k$ is calculated using the target $y_k$). The holdout method tries to reduce this by partitioning the training dataset, using one part to compute the statistics and the other to train. But this significantly reduces the effective use of training data. Leave-one-out excludes the target of the sample being encoded, but is not very effective at preventing leakage. Ordered target statistics are inspired by online learning algorithms, which receive the training examples sequentially in time. The method introduces an artificial time, that is, a random permutation of the training examples. The encoding of each example relies only on the training examples encountered in its past (the samples that occur before that particular sample in the artificial time), thereby avoiding target leakage.
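To make the leakage concrete, here is a small sketch of the greedy and leave-one-out statistics on toy data. The helper names and the `prior` fallback for categories that occur only once are assumptions made for this example; they are not taken from CatBoost.

```python
from collections import defaultdict


def _sums_and_counts(categories, targets):
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return sums, counts


def greedy_target_statistic(categories, targets):
    """Encode each sample with the mean target of its category (own target included)."""
    sums, counts = _sums_and_counts(categories, targets)
    return [sums[c] / counts[c] for c in categories]


def leave_one_out_target_statistic(categories, targets, prior):
    """Same mean, but with the sample's own target removed.

    `prior` is used when the category has no other sample (an assumption
    of this sketch).
    """
    sums, counts = _sums_and_counts(categories, targets)
    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            encoded.append(prior)
    return encoded


targets = [1, 0, 1, 0]

# A feature with a unique value per row: the greedy encoding is the target itself.
print(greedy_target_statistic(["u1", "u2", "u3", "u4"], targets))

# A constant feature: the leave-one-out encoding still separates the two classes.
print(leave_one_out_target_statistic(["A", "A", "A", "A"], targets, prior=0.5))
```

In the first call every encoded value equals its own target. In the second, samples with target 1 get a lower encoding than samples with target 0, so a single split on the encoded feature recovers the training labels even though the feature itself carries no information.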

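Mathematically, the target estimate of the $i^{th}$ categorical variable of the $k^{th}$ element of $D$ can be represented as

$$\hat{x}^i_k = \frac{\sum_{x_j \in D_k} \mathbb{1}_{\{x^i_j = x^i_k\}} \cdot y_j + a\,p}{\sum_{x_j \in D_k} \mathbb{1}_{\{x^i_j = x^i_k\}} + a}, \quad \text{if } D_k = \{x_j : \sigma(j) < \sigma(k)\},$$

where $a > 0$.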

The indicator function $\mathbb{1}_{\{x^i_j = x^i_k\}}$ takes the value 1 when the $i^{th}$ component of CatBoost's input vector $x_j$ is equal to the $i^{th}$ component of the input vector $x_k$, and 0 otherwise. Here $k$ denotes the $k^{th}$ element according to the order we put on $D$ with the random permutation $\sigma$, and $j$ ranges over the elements that precede it in that order. The parameters $a$ and $p$ (the prior) smooth the estimate and keep it well defined when no earlier example shares the category, in which case the estimate reduces to $p$. The if condition ensures that the value of $y_k$ is excluded from the computation when encoding the value $\hat{x}^i_k$. This technique also ensures that all the available past of each example is used to compute its target statistics and thereby to encode the categorical variables.
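The ordered target statistic can be sketched as a single pass over the samples in permutation order, keeping running sums per category. This is a minimal illustration for one categorical feature, not CatBoost's actual code. The function name, the `order` argument (an explicit permutation, useful for reproducing a result by hand) and the default of using the mean target as the prior $p$ are assumptions of this sketch.

```python
import random
from collections import defaultdict


def ordered_target_statistic(categories, targets, a=1.0, prior=None,
                             order=None, seed=0):
    """Encode one categorical feature using only each sample's 'past'.

    order: sample indices listed in artificial-time order (the permutation).
           If None, a random permutation is drawn from `seed`.
    prior: the value p; defaults to the mean target (an assumption here).
    """
    n = len(categories)
    if prior is None:
        prior = sum(targets) / n
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)

    sums = defaultdict(float)   # sum of y_j over past samples, per category
    counts = defaultdict(int)   # number of past samples, per category
    encoded = [0.0] * n
    for k in order:
        c = categories[k]
        # Only samples that came earlier in the permutation contribute,
        # so y_k never enters its own encoding.
        encoded[k] = (sums[c] + a * prior) / (counts[c] + a)
        sums[c] += targets[k]
        counts[c] += 1
    return encoded


print(ordered_target_statistic(["A", "B", "A", "A"], [1, 0, 0, 1],
                               a=1.0, prior=0.5, order=[0, 1, 2, 3]))
```

With the identity order, the first "A" and the only "B" have no history and receive the prior 0.5. The second "A" sees one earlier "A" with target 1 and gets (1 + 0.5) / (1 + 1) = 0.75. The third "A" sees targets 1 and 0 and gets (1 + 0.5) / (2 + 1) = 0.5.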

#### Ordered boosting

Gradient boosting algorithms often have a tendency to overfit. Because each new base learner is fitted on the same dataset that was used to build the previous ones, the generalization capability of the model suffers.

When we use ordered target statistics to encode categorical variables, the partial derivatives (gradients) of the loss function $L$ with respect to the function $F^{t-1}$ are also random variables, because the random permutation determines which elements of $D_k$ are used to encode the categorical variables, and these encodings influence the value of $F^{t-1}$. Therefore, the distribution of the gradients, conditioned on the particular encoding computed for $x_k$, can be shifted. This conditional shift biases the estimate we make for $h^{t}$, which in turn hurts the metrics we obtain when evaluating the model on data we did not use at training time, that is, its generalization capability. This problem is called **prediction shift**.

CatBoost introduces ordered boosting to avoid this problem. In ordered boosting, a random permutation of the training examples is drawn and $n$ different supporting models are maintained, where the $i^{th}$ model is trained using only the first $i$ samples in the permutation. At each step, the residual (error) of a sample is obtained from the supporting model trained on the samples that precede it, so the sample's own target never influences its residual. But this is not feasible in practice: the data is finite, and the time and memory required to train and maintain $n$ different models would be too high.
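The naive scheme can be sketched for a single boosting step as follows. It is a deliberately simplified illustration: the supporting "model" is just the mean of the targets seen so far, and the helper names (`ordered_residuals`, `fit_mean`, `predict_mean`) and the choice of a zero prediction for the first sample are assumptions of this sketch rather than anything CatBoost prescribes.

```python
def ordered_residuals(xs, ys, order, fit, predict):
    """Residual of each sample from a model fitted only on the samples
    that precede it in `order` (the random permutation)."""
    residuals = [0.0] * len(ys)
    for position, k in enumerate(order):
        past = order[:position]
        if past:
            # One supporting model per prefix of the permutation.
            model = fit([xs[j] for j in past], [ys[j] for j in past])
            prediction = predict(model, xs[k])
        else:
            prediction = 0.0  # assumption: no history yet, predict zero
        residuals[k] = ys[k] - prediction
    return residuals


def fit_mean(xs, ys):
    """Toy base learner: a constant model equal to the mean target."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


xs = [0.0, 1.0, 2.0]
ys = [2.0, 4.0, 6.0]
print(ordered_residuals(xs, ys, [0, 1, 2], fit_mean, predict_mean))
print(ordered_residuals(xs, ys, [2, 0, 1], fit_mean, predict_mean))
```

The loop fits a separate model for almost every sample, which is where the cost of the naive scheme comes from: with $n$ samples, each boosting step needs on the order of $n$ supporting models.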

So CatBoost uses a variant for practical purposes. In this variant, one tree structure (sequence of splitting features) is shared by all the models. That is, CatBoost uses the same $D_k$ that determined the ordered target statistics as the data for determining the structure of the decision tree $h^{t}$, and uses the complete dataset $D$ as the data for evaluating whether $h^{t}$ is the decision tree that minimizes the expected loss. It uses multiple permutations to compute a number of sets of residual values that it can use to find $h^{t}$ and update $F^{t-1}$, while maintaining the guarantee that none of the values of $y_k$ is used to compute the values of the gradients. This reduces the variance in the estimates of the gradients (the rate of change of the loss function) and avoids prediction shift.

**CatBoost advantages**

- CatBoost implements oblivious decision trees (binary trees in which the same feature is used for the left and right splits at each level of the tree), thereby restricting the feature splits per level to one, which helps decrease prediction time.
- It handles categorical features effectively by ordered target statistics.
- It is easy to use with packages in R and Python.
- It works well with default parameters, thereby reducing the time needed for parameter tuning.
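To see why oblivious trees are fast at prediction time, note that a tree of depth $d$ with one split per level can be evaluated as $d$ comparisons that form a $d$-bit leaf index. The sketch below illustrates that idea only. The function name, the `(feature_index, threshold)` representation and the bit ordering are assumptions of this example, and CatBoost's internal layout may differ.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate an oblivious tree.

    splits:      one (feature_index, threshold) pair per level; every node
                 on a level applies the same test.
    leaf_values: 2 ** len(splits) values, indexed by the bits of the
                 comparison results (first level = most significant bit).
    """
    index = 0
    for feature_index, threshold in splits:
        bit = 1 if x[feature_index] > threshold else 0
        index = (index << 1) | bit
    return leaf_values[index]


splits = [(0, 0.5), (1, 10.0)]       # level 1 tests feature 0, level 2 tests feature 1
leaf_values = [1.0, 2.0, 3.0, 4.0]   # depth 2, so 4 leaves

print(oblivious_tree_predict([0.2, 5.0], splits, leaf_values))   # both tests false
print(oblivious_tree_predict([0.9, 20.0], splits, leaf_values))  # both tests true
```

Because no comparison depends on the outcome of an earlier one, the lookup needs no pointer chasing through tree nodes, which is one way to understand the prediction-time benefit mentioned in the first bullet.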
