The complete Data Science pipeline on a simple problem
The Problem:
Sales Prediction for Big Mart Outlets
Big Mart’s data scientists have gathered 2013 sales information for 1559 products from 10 stores located in various cities. Also defined are the characteristics of each product and retailer. The goal is to create a predictive model that can foretell the sales of each item at a specific retailer.
With the aid of this model, Big Mart will attempt to comprehend the characteristics of the merchandise and retail locations that are essential to boosting sales.
It is a regression problem where we need to predict Item_Outlet_Sales, which is the target variable in our dataset.
Our steps to Modelling:
1. Univariate Analysis
2. Bivariate or Multivariate Analysis
3. Missing Values Treatment
4. Outlier Identification
5. Feature Engineering
6. Standardization — the last step of EDA, popularly known as the data pre-processing step.
7. Applying Machine Learning Models
The data consists of the following columns: Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type, Outlet_Type and the target Item_Outlet_Sales.
Exploratory data analysis:
We’ll be using seaborn for visualization and pandas for data manipulation. You can download the dataset from here:
We’ll import the necessary libraries and load the data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize']=[10,6]
train=pd.read_csv('train_v9rqX0R.csv')
train.head()
test=pd.read_csv('test_AbJTz2l.csv')
test.head()
Some basic information about the dataset:
train.shape,test.shape
train.info()
Since this is a hackathon, we will first create a baseline submission that predicts the mean of Item_Outlet_Sales for every row; the leaderboard score of this baseline is the benchmark our models should beat.
# Submission file:
submission = pd.DataFrame({'Item_Identifier':test.Item_Identifier,
'Outlet_Identifier':test.Outlet_Identifier,
'Item_Outlet_Sales':train.Item_Outlet_Sales.mean()})
submission.to_csv('Basemodel.csv',index=False)
#Your score for this submission is: 1773.8251377790564.
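For intuition, a model that always predicts the training mean has an in-sample RMSE equal to the standard deviation of the target, which gives a rough local counterpart to the leaderboard number above (a quick check, assuming the metric is RMSE):
# RMSE of the constant-mean prediction on the training data
baseline_rmse = np.sqrt(((train.Item_Outlet_Sales - train.Item_Outlet_Sales.mean())**2).mean())
print(baseline_rmse)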
After creating the base model, let's combine our train and test data and start working on the Exploratory Data Analysis.
combined=pd.concat([train,test],ignore_index=True)
combined.head()
UNIVARIATE ANALYSIS
It would be interesting to study the distribution of the numerical variables and categorical variables separately so we divide the data accordingly.
# List of numerical columns
combined.select_dtypes(include=[np.number]).columns
num_cols=['Item_Weight', 'Item_Visibility', 'Item_MRP','Item_Outlet_Sales']
nrows=2
ncols=2
iterator=1
for i in num_cols:
    plt.subplot(nrows,ncols,iterator)
    sns.distplot(combined.loc[:,i],color='red')
    plt.title(i)
    iterator=iterator+1
plt.tight_layout()
plt.show()
Since Outlet Establishment Year is a year column rather than a continuous measurement, we did not include it in this distribution analysis.
From the plots we can interpret that (a numeric skewness check follows this list):
1) Item_Weight shows no particular skewness; its distribution looks roughly uniform.
2) Item_Visibility is right-skewed.
3) Item_MRP is multimodal (four modes), and Item_Outlet_Sales is skewed.
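We can back these visual impressions up with a quick numeric check of the sample skewness:
# Sample skewness of the numeric columns (NaN sales in the test rows are skipped)
print(combined[num_cols].skew())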
# List of categorical columns
train.select_dtypes(include=['object']).columns  # np.object is removed in recent NumPy; use 'object'
cat_cols=['Item_Fat_Content','Outlet_Identifier','Outlet_Size','Outlet_Location_Type','Outlet_Type']
nrows=3
ncols=2
iterator=1
for i in cat_cols:
    plt.subplot(nrows,ncols,iterator)
    sns.countplot(x=combined.loc[:,i])
    plt.title(i)
    iterator=iterator+1
plt.tight_layout()
plt.show()
Since Item Identifier and Item Type have too many distinct values to plot comfortably here, we did not include them in the count plots (Item Type is examined separately below).
From the count plots we can infer that:
1) Most of the outlets are medium-sized.
2) Item Fat Content has essentially two categories (Low Fat and Regular), although the labels are spelled inconsistently and will be cleaned up shortly.
3) Most of the outlets are established in Tier 3 cities, and Supermarket Type 1 is by far the most common outlet type.
Let’s see what other basic information we can get from the data.
combined.Item_Type.value_counts().plot(kind='bar',color='darkblue')
#Top 5 most frequent item types: Fruits and Vegetables, Dairy, Snacks, Household and Frozen Foods
combined.Outlet_Identifier.value_counts()
# OUT027 has the most records, i.e. it is the most represented outlet in the business.
Here is what we have gathered so far:
- The most represented outlet in the business is ‘OUT027’.
- Most of the outlets are in Tier 3 cities.
- The top 5 most frequent item types are Fruits and Vegetables, Dairy, Snacks, Frozen Foods and Household.
- Most of the outlets are of the Supermarket Type 1 kind.
With the above information, let's try to generate some business insights that Big Mart can use:
- Pairing the top 5 item types with the least selling items in combo offers and discounts could help clear out slow-moving stock.
- If Big Mart wants to open a new outlet, it could replicate the characteristics of OUT027 (its size, location type and outlet type), since OUT027 appears to have the most successful business model; this would save money on analysing the size and location for the new outlet.
Let’s do some basic data cleaning:
# Fix the item fat content
combined.Item_Fat_Content.unique()
# Replace
combined.Item_Fat_Content=combined.Item_Fat_Content.replace(to_replace=['low fat','LF','reg'],value=['Low Fat','Low Fat','Regular'])
combined.Item_Fat_Content.unique()
BIVARIATE ANALYSIS
#Numerical vs Numerical
nrows=2
ncols=2
iterator=1
for i in num_cols:
    plt.subplot(nrows,ncols,iterator)
    sns.scatterplot(x=combined.loc[:,i],y=combined.Item_Outlet_Sales,color='red')
    plt.title(i)
    iterator+=1
plt.tight_layout()
plt.show()
Summary:
1) Item_Visibility has a lot of zero values that need to be fixed (see the quick check after this list).
2) Item_Visibility has a negative relationship with sales: as visibility increases, sales tend to go down.
3) Item_Weight shows no clear relationship with sales, consistent with its roughly uniform distribution.
4) In the Sales vs Item_MRP scatterplot, four clusters of sales emerge, reflecting the four modes we saw in the Item_MRP distribution.
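A quick check of point 1, using the combined DataFrame built above:
# Count rows that report a physically implausible zero visibility
print((combined['Item_Visibility'] == 0).sum())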
# Categorical vs Numerical
nrows=3
ncols=2
iterator=1
for i in cat_cols:
    plt.subplot(nrows,ncols,iterator)
    sns.boxplot(x=combined.loc[:,i],y=combined.Item_Outlet_Sales)
    plt.title(i)
    iterator+=1
plt.tight_layout()
plt.show()
Summary:
1) Low Fat items have the highest sales.
2) Tier 3 cities and Supermarket Type 3 outlets show the highest sales.
3) The highest revenue generating outlets are OUT027 and OUT013.
4) The worst performing outlets are OUT010 and OUT019, both of which are Grocery Store type outlets (see the groupby check after this list).
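These outlet-level observations can be verified directly with a groupby over the training rows (an optional check; the test rows have NaN sales and are ignored by the mean):
# Mean sales per outlet, from lowest to highest
print(combined.groupby('Outlet_Identifier')['Item_Outlet_Sales'].mean().sort_values())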
MISSING VALUES
Let’s check if there are any missing values in the dataset.
combined.isnull().sum()
MISSING VALUE IMPUTATION
The models cannot handle missing values directly, so let's impute them first, one column at a time.
# Item Weight
combined.loc[combined.Item_Weight.isnull()].head()
Every unit of the same item should have the same weight. Using that logic, we can impute the nulls in Item_Weight by grouping on Item_Identifier and filling each group's missing weights with that group's mean.
For example, a packet of strawberries weighs the same as the other packets of the same strawberry product, so for a given Item_Identifier we can fill the missing weight with the average of the recorded weights.
# Method:
combined['Item_Weight']=combined.groupby('Item_Identifier')['Item_Weight'].apply(lambda x:x.fillna(x.mean()))
For the Item_Visibility column we apply the same logic, but here we replace the zero values with the mean visibility of that item:
combined['Item_Visibility']=combined.groupby('Item_Identifier')['Item_Visibility'].apply(lambda x:x.replace(to_replace=0,value=x.mean()))
# Missing value in outlet_size
combined.loc[combined.Outlet_Size.isnull()].head()
# Outlet size will depend largely on outlet type
combined.groupby('Outlet_Type')['Outlet_Size'].value_counts()
Based on the counts above, the dominant Outlet_Size for each Outlet_Type is:
# Grocery Store — Small
# Supermarket Type2 — Medium
# Supermarket Type3 — Medium
# Supermarket Type1 — Small (mode)
Only Grocery Store and Supermarket Type1 outlets actually have missing Outlet_Size values, so those are the two cases we impute.
# Code for missing values in Outlet_Size
combined.loc[(combined.Outlet_Type=='Grocery Store')&(combined.Outlet_Size.isnull()),'Outlet_Size']='Small'
combined.loc[(combined.Outlet_Type=='Supermarket Type1')&(combined.Outlet_Size.isnull()),'Outlet_Size']='Small'
combined.isnull().sum()
With that, we have imputed all the missing values in the dataset. The 5681 nulls remaining in Item_Outlet_Sales are simply the test rows, whose sales are what we need to predict with the machine learning models.
FEATURE ENGINEERING
Let's try to create new variables from the existing columns using our domain knowledge.
combined.Item_Type.unique()
So using the above list of food items, we can categorize it into perishable and non-perishable food items.
perishables=['Dairy','Meat','Fruits and Vegetables','Breakfast','Breads','Starchy Foods','Seafood']
def perish(x):
    if x in perishables:
        return 'Perishables'
    else:
        return 'Non_Perishables'
combined['Item_Type_cat']=combined.Item_Type.apply(perish)
Let’s create a new column called ‘Item_Ids’ by extracting the first two letters of the Item Identifier column; in this data they encode the broad item category (FD = food, DR = drinks, NC = non-consumables).
ids=[]
for i in combined.Item_Identifier:
    ids.append(i[:2])
combined['Item_Ids']=pd.Series(ids)
combined.head()
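The same column can also be built more concisely with the pandas string accessor, equivalent to the loop above:
# Equivalent one-liner using the .str accessor
combined['Item_Ids'] = combined['Item_Identifier'].str[:2]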
pd.crosstab(combined.Item_Ids,combined.Item_Fat_Content)
Since non-consumable items (NC) cannot have a fat content, it is better to create a new category ‘Non_Edible’ in Item Fat Content for them.
# Apply Non Edible in item Fat Content for NC
combined.loc[combined.Item_Ids=='NC','Item_Fat_Content']='Non_Edible'
pd.crosstab(combined.Item_Ids,combined.Item_Fat_Content)
The data was collected in 2013, so let's create a variable called ‘Vintage’ for the age of each outlet, derived from the ‘Outlet Establishment Year’ column.
# Vintage of the outlets.
combined['Vintage']=2013-combined.Outlet_Establishment_Year
Let us create a variable called ‘Price per unit’ using Item MRP and Item Weight.
# Price_per_unit = Item_MRP/Item_Weight
combined["Price_Per_Unit"] = combined["Item_MRP"]/combined["Item_Weight"]
combined.head()
# Check with target variable
sns.scatterplot(x=combined["Price_Per_Unit"],y=combined['Item_Outlet_Sales'])
Replace "outlet identifier" with the mean of item outlet sales for that specific outlet identifier to find the average sales in that specific outlet.
# Sales Summary basis Outlets
outlet_sales=combined.groupby('Outlet_Identifier')['Item_Outlet_Sales'].mean().to_dict()
# Mapping using dictionary
combined['Outlet_Identifier']=combined.Outlet_Identifier.map(outlet_sales)
combined.head()
Let's drop the columns from which we extracted the new features:
# Drop the columns
newdata=combined.drop(['Item_Identifier','Outlet_Establishment_Year','Item_Type'],axis=1)
# Split the data into train and test
train.shape,test.shape
newtrain=newdata.loc[0:train.shape[0]-1,:]
newtest=newdata.loc[train.shape[0]:,:]
newtest=newtest.drop('Item_Outlet_Sales',axis=1)
print(newtrain.shape,newtest.shape)
newtrain.columns
newtest.columns
# Apply Statistical test on the data.
cols=newdata.select_dtypes(include=[np.number]).columns
cols
import scipy.stats as stats
for i in cols:
teststats,pvalue=stats.ttest_ind(newtrain.loc[:,i],newtrain.Item_Outlet_Sales)
print(i,':',pvalue)
sns.heatmap(newtrain.loc[:,cols].corr(),annot=True,cmap='YlGnBu')
Outlier Treatment
Using the IQR method, let's remove the outliers from the training data (applying the rule to the numeric columns only):
# IQR rule on the numeric columns
num_train=newtrain.select_dtypes(include=[np.number])
q1=num_train.quantile(0.25)
q3=num_train.quantile(0.75)
iqr=q3-q1
ul=q3+1.5*iqr
ll=q1-1.5*iqr
wt_outliers=newtrain.loc[~((num_train<ll)|(num_train>ul)).any(axis=1)].copy()
wt_outliers.shape
newtrain.shape
Encoding the categorical variables: an ordinal mapping for Outlet_Size, and dummy (one-hot) encoding for the remaining categorical columns.
# Label Encoding
newtrain.Outlet_Size.value_counts()
mapped={'Small':3,'Medium':2,'High':1}
newtrain['Outlet_Size']=newtrain.Outlet_Size.map(mapped)
wt_outliers['Outlet_Size']=wt_outliers.Outlet_Size.map(mapped)
# Dummy Encoding
train_encoded=pd.get_dummies(newtrain,drop_first=True)
wt_encoded=pd.get_dummies(wt_outliers,drop_first=True)
print(train_encoded.shape,wt_encoded.shape)
#newtest encoding
newtest['Outlet_Size']=newtest.Outlet_Size.map(mapped)
newtest_encoded=pd.get_dummies(newtest,drop_first=True)
newtest_encoded.shape
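One thing to watch for when dummy-encoding train and test separately is that the resulting columns can differ or appear in a different order. A defensive, optional alignment step is to reindex the test frame to the train feature columns (the target exists only in train); any dummy column missing from test is filled with 0:
# Align the encoded test columns to the encoded train feature columns
feature_cols = train_encoded.drop('Item_Outlet_Sales', axis=1).columns
newtest_encoded = newtest_encoded.reindex(columns=feature_cols, fill_value=0)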
MODELLING
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,AdaBoostRegressor
from xgboost import XGBRegressor
lr=LinearRegression()
dtree=DecisionTreeRegressor()
rf=RandomForestRegressor()
gbm=GradientBoostingRegressor()
xgb=XGBRegressor()
ada=AdaBoostRegressor()
# Linear Regression
pred=[]
from sklearn.model_selection import KFold
kfold=KFold(n_splits=5,shuffle=True,random_state=0)
X=train_encoded.drop('Item_Outlet_Sales',axis=1)
y=train_encoded.Item_Outlet_Sales
for train_index,test_index in kfold.split(X,y):
    # Fit on each fold's training portion and predict the hackathon test set;
    # the five fold predictions are averaged into the final submission below
    xtrain=X.loc[train_index]
    ytrain=y.loc[train_index]
    pred.append(lr.fit(xtrain,ytrain).predict(newtest_encoded))
# Prediction file: average the 5 fold predictions; abs() guards against negative sales predictions
finalpred=np.abs(pd.DataFrame(pred).T.mean(axis=1))
submission=pd.DataFrame({'Item_Identifier':test.Item_Identifier,
'Outlet_Identifier':test.Outlet_Identifier,
'Item_Outlet_Sales':finalpred})
submission.to_csv('LRModel_.csv',index=False)
Score:
Your score for this LR submission is : 1193.1653903044592
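Before submitting, we can also estimate the score locally. Assuming the leaderboard metric is RMSE, a quick out-of-fold check with scikit-learn's cross_val_score looks like this (not part of the original submissions):
from sklearn.model_selection import cross_val_score
# 5-fold out-of-fold RMSE estimate for the linear model
neg_mse = cross_val_score(lr, X, y, cv=kfold, scoring='neg_mean_squared_error')
print('Local CV RMSE:', np.sqrt(-neg_mse).mean())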
# Random forest
pred=[]
from sklearn.model_selection import KFold
kfold=KFold(n_splits=5,shuffle=True,random_state=0)
X=train_encoded.drop('Item_Outlet_Sales',axis=1)
y=train_encoded.Item_Outlet_Sales
for train_index,test_index in kfold.split(X,y):
    xtrain=X.loc[train_index]
    ytrain=y.loc[train_index]
    pred.append(rf.fit(xtrain,ytrain).predict(newtest_encoded))
# Prediction file
finalpred=np.abs(pd.DataFrame(pred).T.mean(axis=1))
submission=pd.DataFrame({'Item_Identifier':test.Item_Identifier,
'Outlet_Identifier':test.Outlet_Identifier,
'Item_Outlet_Sales':finalpred})
submission.to_csv('RFModel.csv',index=False)
Your score for this Random forest submission is : 1188.096011200597.
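Tree-based models also expose feature importances, which hint at how much the engineered features contribute. An optional peek at the random forest fitted on the last fold above:
# Feature importances from the last fitted random forest, largest first
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))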
# Gradient Boosting
pred=[]
from sklearn.model_selection import KFold
kfold=KFold(n_splits=5,shuffle=True,random_state=0)
X=train_encoded.drop('Item_Outlet_Sales',axis=1)
y=train_encoded.Item_Outlet_Sales
for train_index,test_index in kfold.split(X,y):
    xtrain=X.loc[train_index]
    ytrain=y.loc[train_index]
    pred.append(gbm.fit(xtrain,ytrain).predict(newtest_encoded))
# Prediction file
finalpred=np.abs(pd.DataFrame(pred).T.mean(axis=1))
submission=pd.DataFrame({'Item_Identifier':test.Item_Identifier,
'Outlet_Identifier':test.Outlet_Identifier,
'Item_Outlet_Sales':finalpred})
submission.to_csv('GBMModel.csv',index=False)
Your score for this Gradient Boosting submission is : 1153.8916809413179.
# XGBoosting
pred=[]
from sklearn.model_selection import KFold
kfold=KFold(n_splits=5,shuffle=True,random_state=0)
X=train_encoded.drop('Item_Outlet_Sales',axis=1)
y=train_encoded.Item_Outlet_Sales
for train_index,test_index in kfold.split(X,y):
    xtrain=X.loc[train_index]
    ytrain=y.loc[train_index]
    pred.append(xgb.fit(xtrain,ytrain).predict(newtest_encoded))
# Prediction file
finalpred=np.abs(pd.DataFrame(pred).T.mean(axis=1))
submission=pd.DataFrame({'Item_Identifier':test.Item_Identifier,
'Outlet_Identifier':test.Outlet_Identifier,
'Item_Outlet_Sales':finalpred})
submission.to_csv('XGBModel.csv',index=False)
Your score for this XGBOOST submission is :1178.3383516159906.
Therefore, the best model so far is the Gradient Boosting Regressor.