LightGBM
WHAT IS LightGBM
LightGBM, short for Light Gradient Boosting Machine, is a gradient boosting framework designed to be lighter and faster than traditional implementations. It can be compared to a group of friends who are excellent at solving puzzles: each friend specializes in a different type of puzzle, but they work together to find the best solution. The analogy reflects how LightGBM builds models from many decision trees, each focusing on different aspects of the data to improve accuracy. Thanks to its optimizations for speed and efficiency, LightGBM is a powerful choice for complex machine-learning problems.
Concepts of LightGBM
Smart split optimization
One key reason LightGBM is faster than other boosting algorithms is its smart split optimization: numerical features are binned into a small number of discrete groups (histograms). This reduces the number of split points that must be compared when growing a tree, allowing for more efficient processing.
For example, consider a feature containing people's ages. Instead of evaluating a continuous variable like age at every possible value, LightGBM groups similar values into bins, such as "under 20," "20-40," and "40+." This significantly speeds up computation while maintaining accuracy, as the sketch below shows.
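As an illustration, the binning idea can be sketched with NumPy (the ages and bin edges here are invented for the example; this is not LightGBM's internal code):

import numpy as np

ages = np.array([12, 18, 25, 31, 38, 44, 52, 67, 73, 80])
bin_edges = np.array([20, 40])        # boundaries for "under 20", "20-40", "40+"

# Map each raw value to a bin index: 0 = under 20, 1 = 20-40, 2 = 40+
bins = np.digitize(ages, bin_edges)
print(bins)                            # [0 0 1 1 1 2 2 2 2 2]

# A split now only needs to be evaluated at the 2 bin boundaries
# instead of at up to len(ages) - 1 = 9 distinct thresholds.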
Exclusive Feature Bundling (EFB)
Exclusive Feature Bundling speeds training up from the feature side. Sparse features that are rarely nonzero at the same time, such as the columns of a one-hot encoding, are mutually exclusive, so LightGBM bundles them into a single feature. Fewer features means fewer histograms to build, which reduces training cost without losing information.
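A toy sketch of the bundling idea (the columns and value offsets are invented for this example; LightGBM performs the bundling internally and automatically):

import numpy as np

is_red  = np.array([1, 0, 0, 1, 0])   # one-hot column: color == red
is_blue = np.array([0, 1, 0, 0, 1])   # one-hot column: color == blue

# The two columns are never nonzero together, so they can share a single
# bundled feature by assigning each its own value range.
bundle = is_red * 1 + is_blue * 2      # 0 = neither, 1 = red, 2 = blue
print(bundle)                          # [1 2 0 1 2]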
Gradient-Based One-Side Sampling (GOSS)
When a LightGBM model runs on a dataset with 500 records, it computes 500 gradients, one per data point. Each gradient indicates how much that record contributes to the model's current error: a high gradient means the record is predicted poorly, while a low gradient means it is already well fit.
Step-by-Step GOSS Process (sketched in code after the list):
1. Sorting the gradients: The 500 gradients are sorted in descending order (from highest to lowest).
2. Selecting the most important data points (top 20%): Based on a 20/80 splitting criterion, the top 20% of gradients (100 records) are always kept, since they correspond to the hardest-to-predict cases that need improvement.
3. Random sampling from the lower 80%: The remaining 80% of the data (400 records) mostly consists of well-performing instances with low gradients. Instead of keeping all 400, only 10% of these records (40 records) are randomly selected, preserving the overall distribution while reducing computational cost.
4. Merging the two groups: The 100 high-gradient records plus the 40 randomly selected low-gradient records form a new training subset of 140 records for LightGBM to train on. This ensures the model prioritizes hard-to-learn cases while still retaining some information from well-performing samples.
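These steps can be sketched with NumPy as follows (a minimal illustration: the gradient values, variable names, and sampling rates are stand-ins, not LightGBM's internal code):

import numpy as np

# A minimal sketch of GOSS on 500 records, following the steps above.
rng = np.random.default_rng(0)
gradients = rng.normal(size=500)                 # stand-in gradients, one per record

top_rate, other_rate = 0.20, 0.10                # keep top 20%, sample 10% of the rest
n_top = int(500 * top_rate)                      # 100 hardest cases
n_rest = int(500 * (1 - top_rate) * other_rate)  # 40 sampled easy cases

order = np.argsort(-np.abs(gradients))           # step 1: sort by gradient magnitude, descending
top_idx = order[:n_top]                          # step 2: always keep the top 20%
rest_idx = rng.choice(order[n_top:], size=n_rest, replace=False)  # step 3: sample the rest

subset = np.concatenate([top_idx, rest_idx])     # step 4: 140-record training subset
print(len(subset))                               # 140

The full GOSS algorithm additionally multiplies the gradients of the sampled low-gradient records by (1 - top_rate) / other_rate (here 0.8 / 0.1 = 8) so that the overall gradient distribution remains approximately unbiased.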
Why is GOSS efficient?
- Training effort concentrates on the 20% worst-performing data points while the number of easy cases is reduced.
- Sampling happens only within the well-performing (low-gradient) group, which is what gives the technique its name: Gradient-Based One-Side Sampling.
- Training time drops without sacrificing accuracy, making GOSS ideal for large-scale datasets.
By applying GOSS, LightGBM enhances model efficiency by focusing computational resources on the most critical data points, leading to better performance in less time.
Implementation of LightGBM in Python
Importing libraries
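A minimal set of imports covering the steps below (the original imports are not shown, so this list is an assumption):

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler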
Import dataset
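Assuming the data lives in a CSV file with a 'Diagnosis' column (the filename here is hypothetical):

df = pd.read_csv('breast_cancer.csv')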
Rename column names
df = df.rename(columns={'Diagnosis': 'Label'})
Determine dependent and independent attributes
Storing feature names in array
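A sketch of the two steps above, assuming 'Label' is the only non-feature column:

Y = df['Label']                    # dependent attribute
X = df.drop(columns=['Label'])     # independent attributes
features = np.array(X.columns)     # feature names stored in an array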
Apply LabelEncoder on dependent attribute
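Encoding the text labels as integers (the original class values, e.g. 'B'/'M', are assumed):

le = LabelEncoder()
Y = le.fit_transform(Y)            # e.g., 'B' -> 0, 'M' -> 1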
Scaling the data
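One common choice is StandardScaler; whether the original used this exact scaler is an assumption:

scaler = StandardScaler()
X = scaler.fit_transform(X)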
Splitting the data for training and testing
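A conventional 80/20 split; the exact proportions and random state used in the original are assumptions:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)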
Importing lightgbm
Tuning the parameters for lightgbm
Implementing the model
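A sketch of the three steps above; the parameter values are illustrative starting points, not the original tuning:

import lightgbm as lgb

d_train = lgb.Dataset(X_train, label=Y_train)
params = {
    'objective': 'binary',        # binary classification
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'min_data_in_leaf': 20,
}
clf = lgb.train(params, d_train, num_boost_round=100)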
Get prediction results
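For a binary objective, the trained booster's predict returns probabilities of the positive class, which is what the truncated output below shows:

y_pred_prob = clf.predict(X_test)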
array([9.99977496e-01, 8.96680177e-03, 5.54788033e-05, 9.87736858e-04, 4.68710232e-04, 3.41087242e-05, 6.26370201e-05, 4.44510896e-05, 9.78436239e-05, 2.84531780e-05, 2.21010462e-02, 5.43004883e-04, 2.11233388e-05, 4.38349233e-01, 3.30038629e-02, 9.97813273e-01, 1.09896675e-03, 9.99103515e-01, 9.99952126e-01, 9.99962842e-01,
Convert label to binary values
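Thresholding the probabilities, assuming the standard 0.5 cutoff:

y_pred = (y_pred_prob >= 0.5).astype(int)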
Check model accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print("Accuracy score:", accuracy)
Accuracy score: 0.9736842105263158
Parameters that you can tune in LightGBM
boosting_type : Type of boosting algorithm ('gbdt', 'dart', 'rf', 'goss')
num_iterations (or num_boost_round) : Number of boosting rounds
learning_rate : Step size for updating weights
num_leaves : Maximum number of leaves per tree
max_depth : Maximum depth of trees
min_data_in_leaf : Minimum number of samples per leaf
min_sum_hessian_in_leaf : Minimum sum of instance weight (hessian) in a leaf
feature_fraction : Fraction of features used per iteration (alias: colsample_bytree)
bagging_fraction : Fraction of data used per iteration (alias: subsample)
bagging_freq : Frequency of bagging (used with bagging_fraction)
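As an illustration, several of these can be combined in a single params dictionary (the values here are arbitrary starting points, not recommendations):

params = {
    'boosting_type': 'gbdt',          # 'goss' enables the sampling described earlier (and disables bagging)
    'num_iterations': 200,
    'learning_rate': 0.05,
    'num_leaves': 31,
    'max_depth': -1,                  # -1 means no depth limit
    'min_data_in_leaf': 20,
    'min_sum_hessian_in_leaf': 1e-3,
    'feature_fraction': 0.8,          # use 80% of features per iteration
    'bagging_fraction': 0.8,          # use 80% of data per iteration
    'bagging_freq': 5,                # re-sample the bag every 5 iterations
}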


