
TIME SERIES ANALYSIS


This blog post will provide you with a comprehensive overview of time series analysis, exploring the theory behind this family of forecasting techniques and demonstrating its implementation using Python libraries. Dive in to uncover the advantages and disadvantages of time series analysis, as well as its real-world applications across various domains. With that, enjoy your journey in QDO!

WHAT IS TIME SERIES ANALYSIS


Time Series Analysis (TSA) is a powerful statistical and machine learning approach used to analyze time-ordered data, primarily to understand underlying patterns, trends, and seasonality or to make predictions about future data points. TSA models focus on capturing temporal dependencies by considering the order and spacing of data points, making it especially useful for data collected at regular intervals, such as daily temperatures, stock prices, or sales figures.


Concept of time series analysis

For example, suppose we want to predict monthly sales of recreational goods. Using a series that runs from January 2000 to August 2001, we can break the data down into the following components.



1. Trend

The trend is the long-term movement or direction in the data over an extended period.

     From January 2000 to August 2001, we can look for any overall upward or downward movement. For instance, there is a visible upward trend from September to December 2000, culminating in a significant peak at 232.2 in December. However, this rise isn’t consistent throughout the series; some months show declines, as in April 2000 and June 2000. Nonetheless, observing the long-term trend can help identify general growth or decline in recreational goods.

2. Seasonality

Seasonality refers to recurring patterns at regular intervals, usually linked to specific times of the year, like monthly or quarterly cycles due to seasonal or behavioral factors.

     If we had several years of data, we could see if certain months or quarters consistently show higher or lower values. For instance, if recreational goods typically spike in December due to holiday purchases, that could indicate seasonality. In this data, December 2000 shows a significant spike, suggesting potential seasonality due to holiday demand.

3. Cycles

Cycles refer to repeating up-and-down patterns that are not tied to seasonality. Cycles can span varying lengths and often relate to economic or business cycles.

        Cyclic patterns are harder to detect in this dataset since we only have a little over a year of data. However, with multiple years, we might see a repeating multi-year cycle influenced by broader economic factors affecting recreational goods demand. Cyclic behavior would be different from the predictable peaks and troughs seen in seasonality, as it doesn't follow a set interval.

4. Irregularities

Irregularities or noise are random fluctuations that don't follow any pattern or trend.

       In this dataset, certain months show unexpected spikes or drops, such as the large jump from 158.4 in November 2000 to 232.2 in December 2000. These irregularities could be due to unique, non-recurring events affecting demand or reporting anomalies. Identifying these helps analysts differentiate between natural patterns and one-time anomalies.

5. Variation

Variation can be regular (small, predictable changes) or irregular (sudden and unpredictable changes).

        The variation in this data set includes regular monthly fluctuations as well as irregular spikes, like the sudden increase in December 2000. Understanding variation allows analysts to predict normal levels of fluctuation and to investigate unusual changes further.
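The components above can also be estimated numerically. Below is a minimal, self-contained sketch (using a synthetic monthly series with a December spike, not the recreational-goods data discussed above) of one common approach: estimate the trend with a centered rolling mean, then estimate seasonality as the average detrended value per calendar month.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: upward trend + a December spike + noise
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=36, freq="MS")
trend = np.linspace(100, 160, 36)
seasonal = np.where(idx.month == 12, 40.0, 0.0)
s = pd.Series(trend + seasonal + rng.normal(0, 3, 36), index=idx)

# Trend: a centered 12-month rolling mean smooths out seasonality and noise
est_trend = s.rolling(window=12, center=True).mean()

# Seasonality: average detrended value for each calendar month
detrended = s - est_trend
seasonal_means = detrended.groupby(detrended.index.month).mean()

# Irregular component: what remains after removing trend and seasonality
residual = detrended - seasonal_means.reindex(detrended.index.month).to_numpy()
```

With this decomposition, December stands out as the month with the largest seasonal effect, mirroring the holiday spike described above.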


Implementation of time series analysis in Python

Importing libraries 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import xgboost as xgb
from sklearn.metrics import mean_squared_error
color_pal = sns.color_palette()
plt.style.use('fivethirtyeight')

Loading dataset

df = pd.read_csv("PJME_hourly.csv")
df = df.set_index('Datetime')
df.index = pd.to_datetime(df.index)

Overview of the current dataset

df.plot(style='.',
        figsize=(15, 5),
        color=color_pal[0],
        title='PJME Energy Use in MW')
plt.show()



Splitting the dataset for training and testing

train = df.loc[df.index < '01-01-2015']
test = df.loc[df.index >= '01-01-2015']

fig, ax = plt.subplots(figsize=(15, 5))
train.plot(ax=ax, label='Training Set', title='Data Train/Test Split')
test.plot(ax=ax, label='Test Set')
ax.axvline('01-01-2015', color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()


Feature Engineering

def create_features(df):
    """
    Create time series features based on time series index.
    """
    df = df.copy()
    df['hour'] = df.index.hour
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    df['weekofyear'] = df.index.isocalendar().week
    return df

df = create_features(df)

Defining the dependent and independent variables for both the training and testing datasets

train = create_features(train)
test = create_features(test)

FEATURES = ['dayofyear', 'hour', 'dayofweek', 'quarter', 'month', 'year']
TARGET = 'PJME_MW'

X_train = train[FEATURES]
y_train = train[TARGET]

X_test = test[FEATURES]
y_test = test[TARGET]

Applying the model

reg = xgb.XGBRegressor(n_estimators=1000,
                       early_stopping_rounds=50,
                       learning_rate=0.01)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=100)

Displaying the importance of each feature

fi = pd.DataFrame(data=reg.feature_importances_,
                  index=reg.feature_names_in_,
                  columns=['importance'])

fi.sort_values('importance').plot(kind='barh', title='Feature Importance')
plt.show()



Displaying the prediction result

test['prediction'] = reg.predict(X_test)
df = df.merge(test[['prediction']], how='left', left_index=True, right_index=True)
ax = df[['PJME_MW']].plot(figsize=(15, 5))
df['prediction'].plot(ax=ax, style='.')
plt.legend(['Truth Data', 'Predictions'])
ax.set_title('Raw Data and Prediction')
plt.show()
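The code above imports mean_squared_error but never scores the model. A common way to evaluate the forecast is the root mean squared error (RMSE). Here is a minimal sketch with hypothetical stand-in arrays; in the actual pipeline you would pass y_test and test['prediction'] instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical truth/prediction values standing in for y_test and
# test['prediction'] from the pipeline above
y_true = np.array([30000.0, 32000.0, 31000.0, 29500.0])
y_pred = np.array([30500.0, 31500.0, 31200.0, 29000.0])

# RMSE is in the same units as the target (MW here), making it easy to interpret
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE on the test set: {rmse:.2f} MW")
```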


Parameters that you can tune in time series analysis

1. ARIMA Model
  • p: How many past values (lags) to use.
  • d: How many times to difference the series to make it stationary.
  • q: How many past forecast errors to include.
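To illustrate the p parameter in isolation, here is a sketch that fits an AR(p) model by ordinary least squares using plain NumPy. It deliberately omits the d (differencing) and q (moving-average) terms; a real analysis would use a library such as statsmodels.

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model by ordinary least squares (sketch of 'p' only)."""
    y = series[p:]
    # Each row holds the p previous values for one target observation
    lags = np.column_stack(
        [series[p - k - 1 : len(series) - k - 1] for k in range(p)]
    )
    X = np.column_stack([np.ones(len(y)), lags])  # intercept column
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs  # [intercept, lag-1 coefficient, ..., lag-p coefficient]

# Synthetic AR(1) process: x_t = 0.8 * x_{t-1} + noise
rng = np.random.default_rng(1)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.8 * x[t - 1] + rng.normal()

coefs = fit_ar(x, p=1)  # the lag-1 coefficient should recover roughly 0.8
```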
2. Exponential Smoothing (ETS)
  • Trend Type: Controls if there's a trend (linear, exponential).
  • Seasonal Type: Controls if seasonality is additive or multiplicative.
  • Alpha: Controls how much weight recent values get.
  • Beta: Controls how much trend is smoothed.
  • Gamma: Controls how much seasonal patterns are smoothed.
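The alpha parameter can be illustrated with a minimal NumPy sketch of simple exponential smoothing; beta and gamma, which smooth the trend and seasonal components, are omitted here for brevity.

```python
import numpy as np

def simple_exp_smooth(series, alpha):
    """Simple exponential smoothing: level = alpha*y + (1-alpha)*previous level."""
    level = series[0]
    smoothed = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
        smoothed.append(level)
    return np.array(smoothed)

data = np.array([10.0, 12.0, 11.0, 13.0, 30.0, 12.0])

# High alpha tracks the spike at index 4 closely; low alpha damps it
fast = simple_exp_smooth(data, alpha=0.9)
slow = simple_exp_smooth(data, alpha=0.1)
```

The contrast between the two runs shows the trade-off alpha controls: responsiveness to recent values versus robustness to noise.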
3. Prophet
  • Growth: Type of growth in data (linear or logistic).
  • Changepoint Flexibility: How easily trend shifts are allowed.
  • Seasonality: Can toggle yearly, weekly, or daily patterns.
  • Holiday Effects: Adds known holiday impacts if relevant.
4. STL (Seasonal-Trend Decomposition)
  • Seasonal Window: How smoothly the seasonal component is allowed to change over time.
  • Trend Window: How smooth or rough the estimated trend line is.
5. RNN Models (like LSTM)
  • Layers: Number of neural network layers.
  • Hidden Units: Number of units in each layer, controlling the model's capacity to learn patterns.
  • Learning Rate: Speed of learning during training.
  • Sequence Length: Number of time steps used in each input sequence.
6. Tree-based Models (e.g., XGBoost for Time Series)
  • Number of Trees: How many decision trees to use.
  • Learning Rate: Controls update size in training.
  • Max Depth: Tree depth, affecting model complexity.
  • Lag Features: Number of past values (lags) used as inputs.
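Lag features deserve a concrete example, since they are what turn a time series into the tabular input a tree model expects. A minimal sketch with pandas shift, using a hypothetical hourly series (the lag_1 style column names are illustrative):

```python
import pandas as pd

# Hypothetical hourly series; lag features turn forecasting into a
# supervised regression problem that tree models can consume
s = pd.Series(
    [28.0, 30.0, 29.0, 31.0, 33.0, 32.0],
    index=pd.date_range("2018-01-01", periods=6, freq="h"),
    name="load",
)

df_lags = s.to_frame()
for lag in (1, 2, 3):                 # the number of lags is the tunable knob
    df_lags[f"lag_{lag}"] = s.shift(lag)

df_lags = df_lags.dropna()            # the first rows lack a full lag history
```

Each surviving row now pairs a target value with its three preceding observations, ready to be fed to a regressor like the XGBoost model used earlier.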


Advantages and disadvantages of time series analysis

Advantages

  • Trend and Seasonality Detection: TSA can identify trends (long-term direction) and seasonality (repeating patterns) in data, helping to forecast future values more accurately.
  • Data-Driven Decision Making: TSA provides insights into historical data patterns, enabling better planning and decision-making based on expected future trends.
  • Anomaly Detection: TSA can identify unusual events or anomalies by detecting deviations from expected patterns, which is valuable for monitoring and maintenance.

Disadvantages

  • Dependency on Quality and Quantity of Data: TSA requires a large amount of historical data, and poor-quality or limited data can reduce the model’s accuracy and reliability.
  • Assumption of Stationarity: Many TSA methods assume that statistical properties (e.g., mean, variance) remain constant over time, which may not hold true in real-world scenarios, reducing model effectiveness.
  • Limited to Short-Term Forecasting: TSA often performs better for short-term forecasts, as long-term forecasts become increasingly uncertain, especially with complex or highly variable data.
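The stationarity assumption mentioned above can often be addressed by differencing, which is exactly what the d parameter in ARIMA controls. A small sketch with a synthetic trending series:

```python
import numpy as np
import pandas as pd

# A trending (non-stationary) series: its mean drifts upward over time
idx = pd.date_range("2010-01-01", periods=100, freq="D")
trending = pd.Series(np.arange(100, dtype=float) + np.sin(np.arange(100)),
                     index=idx)

# First-order differencing removes the linear trend, leaving a series
# whose mean is roughly constant across time
differenced = trending.diff().dropna()

first_half_mean = differenced.iloc[:50].mean()
second_half_mean = differenced.iloc[50:].mean()
```

Before differencing, the two halves of the series have very different means; after differencing, they are nearly identical, which is the behavior many TSA methods require.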

Implementation of time series analysis in real life

1. Demand Forecasting

  • Retailers like Walmart and Amazon use TSA to predict future demand for products, especially during peak seasons like Black Friday or holiday sales. This helps in inventory management, reducing stockouts or excess inventory. 

2. Demand Forecasting for Power Supply

  • Power companies like National Grid use TSA to predict energy demand across different times and seasons, allowing for efficient generation and distribution planning.

3. Healthcare and Pharmaceuticals


  • Pharmaceutical companies predict the demand for drugs based on TSA, especially during seasonal illnesses, to manage inventory and avoid shortages.

Comments

Popular posts from this blog

PRINCIPAL COMPONENT ANALYSIS (PCA)

PRINCIPAL COMPONENT ANALYSIS (PCA) Figure 1: PCA This blogpost will bring to you the concept of principal component analysis which is one of the commonly used descriptive analysis that emphasizes of dimensionality reduction. You will learn how to implement this machine learning model in python, its advantages and disadvantages as well as how companies benefits from this machine learning model. What is PCA PCA is a statistical dimensionality-reducing technique. It takes a large set of variables and transforms them into a smaller set, retaining most of the information in the large set. This can be done by identifying the directions along which the data varies the most. These components are orthogonal to one another, capture the maximum possible variance within the data, and hence form a powerful tool for the simplification of datasets without loss of essential patterns and relationships. Concept of  PCA One of the key concepts behind PCA concerns diminishing the complexity of high-di...

LINEAR REGRESSION

 LINEAR REGRESSION Figure 1: Linear regression figure This blogpost will walk you through the concept of linear regression which is another machine learning model under the regression category of supervised learning. Introducing the parameters that you can turn while applying the logistic regression as well as the factors that play a significant impact upon the performance of the linear regression. What is linear regression Linear regression is a machine learning algorithm that could be used in predictive analysis. From predicting prices of houses to sales forecasting, linear regression is undoubtedly the first choice to many data scientists to implement within the dataset. In short, linear regression involves plotting your data on the graph base on the x and y coordinate and proceed to draw the best fit line upon the graph. The best fit line will be used as a reference to predict the independent variable in the future. However, do you have the skill to conduct a excellent analysis...

DECISION TREE

 DECISION TREE Figure 1: Decision Tree      This blogpost aims to introduce to you regarding to a machine learning model called decision trees. After reading this blogpost, you are able to deepen your knowledge on the concepts of decision trees model, its terminology, pros and cons as well as its application in real life scenarios that lends a hand in solving complex problems thus boosting the living quality of many.  What is decision tree      Imagine you’re wondering through a forest, each path branching off into multiple directions, and you need to make a series of decisions to escape the forest. Now, picture having a map that not only shows you all possible routes but also guides you on the specific conditions you encounter. Decision trees model which applies various splitting criteria's within the branches assists the user in decision making purposes. Compared to regression models which applies complex mathematical formulas like logistic regr...