

LightGBM



This blog post will provide you with a comprehensive overview of LightGBM, exploring the theory behind this gradient boosting algorithm and demonstrating its implementation using Python libraries. Dive in to uncover the advantages and disadvantages of LightGBM, as well as its real-world applications across various domains. With that, enjoy your journey in QDO!

What is LightGBM

LightGBM, short for Light Gradient Boosting Machine, is a variation of gradient boosting that is designed to be a lighter and faster version. It can be compared to a group of friends who are excellent at solving puzzles, where each friend specializes in a different type of puzzle but works together to find the best solution. This analogy reflects how LightGBM builds models by using multiple decision trees, each focusing on different aspects of the data to improve accuracy. Unlike traditional gradient boosting methods, LightGBM is optimized for speed and efficiency, making it a powerful choice for handling complex machine-learning problems.

Concepts of LightGBM

Smart split optimization 

One key reason LightGBM is faster than other boosting algorithms is its smart split optimization, which involves using binning to categorize numerical features into smaller groups. This reduces the number of comparisons needed when splitting a tree, allowing for more efficient processing. 

For example, consider a series of age values for a group of people.


Instead of evaluating a continuous variable like age at every possible value, LightGBM groups similar values into bins, such as "under 20," "20-40," and "40+." This method significantly speeds up computation while maintaining accuracy.
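
To make the idea concrete, here is a minimal sketch of histogram binning using NumPy. The ages and bin edges are made up for illustration; this is not LightGBM's internal histogram code, which runs automatically during training.

import numpy as np

# Hypothetical ages for nine people (illustrative values only)
ages = np.array([5, 18, 23, 31, 38, 45, 52, 67, 74])

# Bucket the continuous feature into three bins: under 20, 20-40, 40+
bin_edges = [20, 40]
bins = np.digitize(ages, bin_edges)  # 0 = under 20, 1 = 20-40, 2 = 40+
print(bins)  # [0 0 1 1 1 2 2 2 2]

With the feature binned, LightGBM only has to consider split points at the bin boundaries (two candidates here) instead of at every distinct age value.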

Exclusive Feature Bundling

Exclusive Feature Bundling (EFB) is an optimization technique in LightGBM that helps speed up training by reducing the number of features processed. The main idea is simple: if two or more features are mutually exclusive (meaning they are never active at the same time for a single data point), they can be combined into a single feature without losing any information.

In high-dimensional datasets, many features are sparse, meaning most of their values are zero. Instead of treating them separately, EFB bundles these sparse features together, reducing memory usage and speeding up computation.

For example, suppose we have two binary columns, Male and Female. If a person is male, the Male column holds 1 and the Female column holds 0, and vice versa.


Since "Male" and "Female" features are mutually exclusive (only one can be 1 at a time), LightGBM combines them into a single bundled feature (10 or 01) to reduce memory usage and improve computational efficiency. 


Gradient-Based One-Side Sampling (GOSS)

When a LightGBM model runs on a dataset with 500 records, it generates 500 gradients, one per data point. These gradients indicate how much each record contributes to the model's overall error: a larger gradient means the model currently predicts that record poorly, while a smaller gradient means it is already handled well.

Step-by-Step GOSS Process:

  1. Sorting the Gradients:

    • The 500 gradients are sorted in descending order (from highest to lowest).

  2. Selecting the Most Important Data Points (Top 20%)

    • Based on a 20/80 split, the top 20% of gradients (100 records) are always kept, since they correspond to the hardest-to-predict cases that need improvement.

  3. Random Sampling from the Lower 80%

    • The remaining 80% of data (400 records) mostly consists of well-performing instances (low gradients).

    • Instead of keeping all 400, only 10% of these records (40 records) are randomly selected to maintain overall distribution while reducing computational cost.

  4. Merging the Two Groups

    • The 100 high-gradient records + 40 randomly selected low-gradient records create a new training subset (140 records) for LightGBM to train on.

    • This ensures that the model prioritizes hard-to-learn cases while still maintaining some information from well-performing samples.

Why Is GOSS Efficient?

  • The focus remains on improving the 20% worst-performing data points while reducing the number of easy cases in training.

  • Sampling only occurs within the well-performing (low gradient) group, leading to the name "Gradient-Based One-Side Sampling (GOSS)."

  • This technique reduces training time without sacrificing accuracy, making it ideal for large-scale datasets.

By applying GOSS, LightGBM enhances model efficiency by focusing computational resources on the most critical data points, leading to better performance in less time. 
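
The selection logic described above can be sketched in a few lines of NumPy. The gradients here are random placeholder values; real GOSS runs inside LightGBM's training loop, and the original paper additionally up-weights the sampled low-gradient records so the overall gradient estimate stays approximately unbiased.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder gradients for 500 records (illustrative values only)
gradients = rng.random(500)
a, b = 0.2, 0.1  # keep the top 20%, sample 10% of the rest

# 1. Sort record indices by absolute gradient, largest first
order = np.argsort(-np.abs(gradients))

# 2. Always keep the top 20% (100 hardest records)
top_k = int(a * len(gradients))
top_idx = order[:top_k]

# 3. Randomly sample 10% of the remaining low-gradient records (40 of 400)
rest_idx = order[top_k:]
sampled_idx = rng.choice(rest_idx, size=int(b * len(rest_idx)), replace=False)

# 4. Merge the two groups into the training subset (140 records)
subset = np.concatenate([top_idx, sampled_idx])
print(len(subset))  # 140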


Implementation of LightGBM in Python

Importing libraries 

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
import seaborn as sns

Import dataset

df = pd.read_csv('wdbc.data', sep=',')

Rename column names

df = df.rename(columns={'Diagnosis': 'Label'})

Determine dependent and independent attributes

Y = df['Label']
X = df.drop(labels = ['Label','ID'], axis=1)

Storing feature names in array

feature_names = np.array(X.columns)

Apply LabelEncoder on dependent attribute

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
Y = labelencoder.fit_transform(Y)

Scaling the data

from sklearn.preprocessing import StandardScaler
scales = StandardScaler()
X = scales.fit_transform(X)

Splitting the data for training and testing

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

Importing lightgbm

import lightgbm as lgb
d_train = lgb.Dataset(X_train, label=Y_train)

Tuning the parameters for LightGBM

lgbm_params = {
    'boosting_type': 'gbdt',              # gradient-boosted decision trees
    'objective': 'binary',                # binary classification output
    'metric': ['auc', 'binary_logloss'],  # AUC plus the binary log-loss function
    'num_leaves': 100,                    # maximum number of leaves in one tree
    'max_depth': 50,                      # maximum depth of a tree
}

Training the model

clf = lgb.train(lgbm_params, d_train, num_boost_round=100)

Get prediction results

y_pred = clf.predict(X_test)  # predicted probability of the positive class

array([9.99977496e-01, 8.96680177e-03, 5.54788033e-05, 9.87736858e-04, 4.68710232e-04, 3.41087242e-05, 6.26370201e-05, 4.44510896e-05, 9.78436239e-05, 2.84531780e-05, 2.21010462e-02, 5.43004883e-04, 2.11233388e-05, 4.38349233e-01, 3.30038629e-02, 9.97813273e-01, 1.09896675e-03, 9.99103515e-01, 9.99952126e-01, 9.99962842e-01, ...])

Convert label to binary values

y_pred = (y_pred >= 0.5).astype(int)  # setting threshold to .5

Check model accuracy

accuracy = metrics.accuracy_score(Y_test, y_pred)
print("Accuracy score:",accuracy)

Accuracy score: 0.9736842105263158

Parameters that you can tune in LightGBM

objective : Defines the task ('binary', 'multiclass', 'regression', 'lambdarank', etc.)
boosting_type : Type of boosting algorithm ('gbdt', 'dart', 'rf', 'goss')
num_iterations (or num_boost_round) : Number of boosting rounds
learning_rate : Shrinkage factor applied to each tree's contribution
num_leaves : Maximum number of leaves per tree
max_depth : Maximum depth of trees
min_data_in_leaf : Minimum number of samples per leaf
min_sum_hessian_in_leaf : Minimum sum of instance weight (hessian) in a leaf
feature_fraction : Fraction of features used per iteration (colsample_bytree)
bagging_fraction : Fraction of data used per iteration (subsample)
bagging_freq : Frequency of bagging (used with bagging_fraction)
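
To see how several of these parameters fit together, here is a minimal sketch that cross-validates one example configuration on the d_train dataset built earlier. The specific values are illustrative starting points, not tuned results.

import lightgbm as lgb

# Example parameter configuration (values are illustrative, not tuned)
example_params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'max_depth': -1,           # -1 means no depth limit
    'min_data_in_leaf': 20,
    'feature_fraction': 0.8,   # use 80% of features per iteration
    'bagging_fraction': 0.8,   # use 80% of rows per iteration
    'bagging_freq': 5,         # re-sample the rows every 5 iterations
    'metric': 'auc',
}

# 5-fold cross-validation to compare parameter settings
cv_results = lgb.cv(example_params, d_train, num_boost_round=200, nfold=5)
print({k: v[-1] for k, v in cv_results.items()})  # final-round AUC mean/std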

Advantages and disadvantages of LightGBM

Advantages

1) Faster Training Speed 

Uses optimizations like histogram-based learning, Exclusive Feature Bundling (EFB), and Gradient-Based One-Side Sampling (GOSS) to speed up training, especially on large datasets.

2) Efficient Memory Usage 

Consumes less memory than other boosting algorithms by bundling mutually exclusive features and using histogram-based feature binning.

3) Handles Large Datasets Well 

Can efficiently process millions of records and high-dimensional data, making it suitable for big data applications.

Disadvantages

1) Sensitive to Hyperparameters 

Requires careful tuning (like learning rate, max depth, num leaves) to prevent overfitting.

2) Prone to Overfitting on Small Datasets 

Because LightGBM grows trees leaf-wise and therefore tends to build deep trees, it can overfit on small datasets if regularization techniques (e.g., limiting tree depth, raising min_data_in_leaf) are not applied.

3) Not Ideal for Small Datasets 

Performs best with large datasets; for smaller ones, simpler models like Random Forest or Logistic Regression might work better.


Implementation of LightGBM in real life

1. Fraud Detection & Risk Management


PayPal uses LightGBM to analyze millions of transactions in real time. GOSS helps prioritize suspicious transactions while efficiently handling the massive volume of legitimate ones, resulting in faster fraud detection with fewer false positives than traditional models.

2. Search Ranking & Ads Optimization



Microsoft, which developed LightGBM, uses it for ranking web search results in Bing. It also helps in personalized ad recommendations by learning user preferences efficiently. The leaf-wise splitting of LightGBM improves ranking precision compared to other gradient boosting methods.

3. E-Commerce Recommendation System



Alibaba integrates LightGBM into its recommendation engine to provide personalized product suggestions based on user behavior. The model processes massive user interaction data while maintaining fast inference times. EFB (Exclusive Feature Bundling) helps handle high-dimensional sparse data in product listings.
