SpaceTitanic Pipeline - Model Impute - 81 score
The objective of this notebook is to use a Pipeline to streamline code development. I will not be focusing on data analysis and charts.
Part 1: Simple pipeline
Benefit Demonstrated: Improves Code Readability and Maintainability
Building a simple pipeline with two components: 1) a simple imputer to fill null values, in this case with the most frequent value, and 2) one-hot encoding of the column.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn import set_config
set_config(display="diagram")
fileNameTrain = '/kaggle/input/spaceship-titanic/train.csv'
fileNameTest = '/kaggle/input/spaceship-titanic/test.csv'
fileSubmission = '/kaggle/working/submission.csv'
train = pd.read_csv(fileNameTrain).set_index('PassengerId')
test = pd.read_csv(fileNameTest).set_index('PassengerId')
trainX = train.drop(['Transported'], axis=1)
trainy = train['Transported']
testX = test.copy()
Simple Imputer:
Providing some extra print statements for explanation and validation.
print(f">>>> null values before: {trainX['Destination'].isnull().sum()}")
imputer = SimpleImputer(strategy='most_frequent')
asArray = imputer.fit_transform(train[['Destination']])
print(f">>>> The simple imputer return result as n array \n{asArray}")
print(f">>>> shape of array {asArray.shape}")
asDF = pd.DataFrame(asArray, columns =['Destination'])
asDF
print(f">>>> converting back to dataframe just for explaination: \n{asDF.head(3)}")
print(f">>>> null values after: {asDF['Destination'].isnull().sum()}")
>>>> null values before: 182
>>>> The simple imputer returns the result as an array
[['TRAPPIST-1e']
['TRAPPIST-1e']
['TRAPPIST-1e']
...
['TRAPPIST-1e']
['55 Cancri e']
['TRAPPIST-1e']]
>>>> shape of array (8693, 1)
>>>> converting back to dataframe just for explanation:
Destination
0 TRAPPIST-1e
1 TRAPPIST-1e
2 TRAPPIST-1e
>>>> null values after: 0
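Side note: on recent scikit-learn versions (1.2+), the manual array-to-DataFrame conversion above is avoidable; transformers can be asked to return DataFrames directly. A small sketch, assuming sklearn >= 1.2 is available:
imputer = SimpleImputer(strategy='most_frequent').set_output(transform='pandas')
asDF = imputer.fit_transform(train[['Destination']])  # a DataFrame, column names preserved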
Combining a SimpleImputer and a OneHotEncoder without using a pipeline.
imputer = SimpleImputer(strategy='most_frequent')
asArray = imputer.fit_transform(train[['Destination']])
one = OneHotEncoder(sparse_output=False,handle_unknown='ignore')
one.fit_transform(asArray)
array([[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
...,
[0., 0., 1.],
[1., 0., 0.],
[0., 0., 1.]])
Combining a SimpleImputer and a OneHotEncoder using a pipeline.
Clear structure for development and maintainability: even with just two processing steps, we see that the code gets organized into a structure that focuses on the most important aspects of the pipeline and avoids boilerplate code.
imputation_pipeline_cat= Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])
imputation_pipeline_cat.fit_transform(train[['Destination']])
array([[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
...,
[0., 0., 1.],
[1., 0., 0.],
[0., 0., 1.]])
Reusing the pipeline for other columns with similar processing requirements.
Benefit Demonstrated: Simplifies the Workflow
Ease of use: pipelines combine multiple steps into a single object, and the same processing steps can be reused for multiple columns, simplifying the code. Here we reuse the same two steps across four columns.
imputation_pipeline_cat= Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])
imputation_pipeline_cat.fit_transform(train[['Destination','HomePlanet','VIP','CryoSleep']])
array([[0., 0., 1., ..., 0., 1., 0.],
[0., 0., 1., ..., 0., 1., 0.],
[0., 0., 1., ..., 1., 1., 0.],
...,
[0., 0., 1., ..., 0., 1., 0.],
[1., 0., 0., ..., 0., 1., 0.],
[0., 0., 1., ..., 0., 1., 0.]])
Part 2: Pipeline combined with classification.
Building a pipeline with the simple feature engineering and fit/predict with a classification model.
space_titanic_pipeline = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
('model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
gamma=0.001, subsample=0.9, colsample_bytree=0.9,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])
space_titanic_pipeline.fit(train[['Destination','HomePlanet','VIP','CryoSleep']], trainy)
space_titanic_pipeline.predict(test[['Destination','HomePlanet','VIP','CryoSleep']])
array([1, 0, 1, ..., 1, 0, 1])
Cross-validating the pipeline as we keep adding features provides a robust framework for measuring progress.
from sklearn.model_selection import cross_validate, StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
ct_cv_res = cross_validate(estimator = space_titanic_pipeline,
X = train[['Destination','HomePlanet','VIP','CryoSleep']],
y = trainy,
cv = skf,
scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")
Average cross-validated accuracy from
column transformer pipeline: 0.717
Part 3: Function (Simple) Transformer
Until now we have used two out-of-the-box transformations on the data. But very often we need to apply functions we have developed ourselves. A FunctionTransformer lets us use custom stateless transformations for data processing within a pipeline.
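As a minimal illustration (not part of this notebook's feature engineering), a FunctionTransformer can wrap any stateless function, for example a log transform of an expense column:
log_transformer = FunctionTransformer(np.log1p, validate=False)
log_transformer.fit_transform(train[['RoomService']].fillna(0))  # applies np.log1p element-wise; fit learns nothing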
We develop the function and then call it via a FunctionTransformer within a pipeline.
def binAge(dfbinIn: pd.DataFrame) -> pd.DataFrame:
    dfbin = dfbinIn.copy()
    # Bin Age into decade buckets (e.g. age 39 falls in the (30, 40] bin, labelled 40)
    labels = [0, 10, 20, 30, 40, 50, 60, 70]
    bins = [-np.inf, 0, 10, 20, 30, 40, 50, 60, np.inf]
    binnedArray = pd.cut(dfbin["Age"], bins=bins, labels=labels)
    dfbin["Age"] = binnedArray.tolist()
    # Fill any ages that were missing (and hence unbinned) with the most frequent bin
    mode_value = dfbin['Age'].mode().iloc[0] if not dfbin['Age'].mode().empty else False
    dfbin['Age'] = dfbin['Age'].fillna(mode_value)
    return dfbin
pipeline_Age = Pipeline([('binAge', FunctionTransformer(binAge, validate=False))])
pipeline_Age.fit_transform(trainX)
| PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 40.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy |
| 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 30.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines |
| 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 60.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent |
| 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 40.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent |
| 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 20.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9276_01 | Europa | False | A/98/P | 55 Cancri e | 50.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | Gravior Noxnuther |
| 9278_01 | Earth | True | G/1499/S | PSO J318.5-22 | 20.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Kurta Mondalley |
| 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | 30.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | Fayey Connon |
| 9280_01 | Europa | False | E/608/S | 55 Cancri e | 40.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | Celeon Hontichre |
| 9280_02 | Europa | False | E/608/S | TRAPPIST-1e | 50.0 | False | 126.0 | 4688.0 | 0.0 | 0.0 | 12.0 | Propsh Hontichre |

8693 rows × 12 columns
Part 4: Column Transformer
As we have noticed, we want two separate types of data processing: one for the categorical variables (['Destination','HomePlanet','VIP','CryoSleep']) and a separate one for Age. A ColumnTransformer allows applying different preprocessing transformations to different columns of the dataset.
Column Transformer for 2 pipelines
The output below also details the processing steps diagrammatically.
multicolumn_prep = ColumnTransformer([
('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
('pipeline_Age', pipeline_Age, ['Age'])
],
remainder='passthrough')
multicolumn_prep
ColumnTransformer(remainder='passthrough', transformers=[('pipeline_cat', Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')), ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]), ['Destination', 'HomePlanet', 'VIP', 'CryoSleep']), ('pipeline_Age', Pipeline(steps=[('binAge', FunctionTransformer(func=<function binAge at 0x79aaf72604c0>))]), ['Age'])])
Pipeline with the ColumnTransformer and the classification model to fit and predict.
space_titanic_pipeline = Pipeline([('preprocessing', multicolumn_prep),
('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
gamma=0.001, subsample=0.9, colsample_bytree=0.9,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])
ct_cv_res = cross_validate(estimator = space_titanic_pipeline,
X = train[['Destination','HomePlanet','VIP','CryoSleep','Age']],
y = trainy,
cv = skf,
scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")
Average cross-validated accuracy from
column transformer pipeline: 0.731
Part 4 (continued): Function (a bit more) Transformer
The previous FunctionTransformer binned the age. The example below demonstrates that we can handle more complex requirements, like splitting a column and creating additional columns. Again, we create a specific sub-pipeline and combine it into a larger pipeline within a ColumnTransformer.
Splitting a column into new columns
def split_Cabin_column(dfin):
df = dfin.copy()
# Track which rows were originally NaN
is_null = df['Cabin'].isnull()
# Handle null values before splitting by filling with empty string
df['Cabin'] = df['Cabin'].fillna('//')
# Split the column and create new columns
split_data = df['Cabin'].str.split('/', expand=True)
split_data.columns = ['Cabin_Deck','Cabin_Num','Cabin_Side']
# Reset split columns back to np.nan where original column was np.nan
split_data[is_null] = np.nan
# Join the new columns with the original dataframe
df = df.join(split_data)
return df.drop(columns=['Cabin','Cabin_Num'])
def split_column_cabin_wrapper(df):
return split_Cabin_column(df)
pipeline_splitColumns_Cabin = Pipeline([
('column_splitter', FunctionTransformer(split_column_cabin_wrapper, validate=False)),
('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])
trainXCabin = pipeline_splitColumns_Cabin.fit_transform(trainX[['Cabin']])
trainXCabin[0]
array([0., 1., 0., 0., 0., 0., 0., 0., 1., 0.])
multicolumn_prep = ColumnTransformer([
('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
('pipeline_Age', pipeline_Age, ['Age']),
('pipeline_splitColumns_Cabin', pipeline_splitColumns_Cabin, ['Cabin'])
],
remainder='passthrough')
space_titanic_pipeline = Pipeline([
('preprocessing', multicolumn_prep),
('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
gamma=0.001, subsample=0.9, colsample_bytree=0.9,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])
ct_cv_res = cross_validate(estimator = space_titanic_pipeline,
X = train[['Destination','HomePlanet','VIP','CryoSleep','Age','Cabin']],
y = trainy,
cv = skf,
scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")
space_titanic_pipeline
Average cross-validated accuracy from
column transformer pipeline: 0.735
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder='passthrough', transformers=[('pipeline_cat', Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')), ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]), ['Destination', 'HomePlanet', 'VIP', 'CryoSleep']), ('pipeline_Age', Pipeline(steps=[('binAge', FunctionTransfor... feature_types=None, gamma=0.001, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.5, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=3, max_leaves=None, min_child_weight=2, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=300, n_jobs=None, nthread=4, num_parallel_tree=None, ...))])
Benefit Demonstrated: Enhanced Model Management
Combining both the data preprocessing and the model into a single pipeline object, which can be easily saved, loaded, and used for predictions, brings clarity and reduces chaos in development. At the same time there is modularity: each step in the pipeline is a separate component, allowing easy modification and experimentation with different preprocessing techniques and models.
We are able to quickly put things together and check whether there is an improvement in the model or not. (Note: this needs to be coupled with EDA; the pipeline then makes each idea easy to put to the test.)
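For instance, the whole fitted pipeline can be persisted and restored in one step; a sketch using joblib (the file path is illustrative, and the pipeline is assumed to have been fitted first):
import joblib
joblib.dump(space_titanic_pipeline, '/kaggle/working/space_titanic_pipeline.joblib')  # preprocessing + model together
restored = joblib.load('/kaggle/working/space_titanic_pipeline.joblib')
# restored.predict(...) now applies the identical preprocessing and model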
We notice there is a dip (though very slight) in performance.
def split_for_groupCount(df):
    # Derive the size of each passenger group from the '<group>_<number>'
    # PassengerId format and attach it as a new 'PGroupPPCounts' column.
    df_reset = df.copy()
    df_reset.reset_index(inplace=True)
    split_data = df_reset['PassengerId'].str.split('_', expand=True)
    split_data.columns = ['PGroup', 'PGroupPP']
    split_data['PassengerId'] = df_reset['PassengerId'].copy()
    df_reset = df_reset.set_index('PassengerId')
    # Check to see if the group size has an impact on the target
    groupbyDF = split_data.groupby(["PGroup"], as_index=False)['PGroupPP'].count()
    groupbyDF = groupbyDF.rename(columns={"PGroupPP": "PGroupPPCounts"})
    split_data = pd.merge(split_data, groupbyDF, on="PGroup")
    split_data["PGroupPP"] = split_data["PGroupPP"].astype(int)
    split_data = split_data.set_index('PassengerId')
    # Join the new columns with the original dataframe, keeping only the counts
    df_reset = df_reset.join(split_data)
    return df_reset.drop(['PGroup', 'PGroupPP'], axis=1)
def split_for_groupCount_wrapper(df):
return split_for_groupCount(df)
pipeline_pass_splitColumnsGroup = Pipeline([
    ('passenger_splitGroup', FunctionTransformer(split_for_groupCount_wrapper, validate=False))])
multicolumn_prep = ColumnTransformer([
('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
('pipeline_Age', pipeline_Age, ['Age']),
('pipeline_splitColumns_Cabin', pipeline_splitColumns_Cabin, ['Cabin'])
],
remainder='passthrough')
ct_pipeline = Pipeline([ ('pipeline-splitIndexColumns-Group', pipeline_pass_splitColumnsGroup),
('preprocessing', multicolumn_prep),
('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
gamma=0.001, subsample=0.9, colsample_bytree=0.9,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])
ct_cv_res = cross_validate(estimator = ct_pipeline,
X = train[['Destination','HomePlanet','VIP','CryoSleep','Age','Cabin']],
y = trainy,
cv = skf,
scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")
ct_pipeline
Average cross-validated accuracy from
column transformer pipeline: 0.733
Pipeline(steps=[('pipeline-splitIndexColumns-Group', Pipeline(steps=[('passenger_splitGroup', FunctionTransformer(func=<function split_for_groupCount_wrapper at 0x79aaf7262560>))])), ('preprocessing', ColumnTransformer(remainder='passthrough', transformers=[('pipeline_cat', Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')), ('encode', OneHotE... feature_types=None, gamma=0.001, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.5, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=3, max_leaves=None, min_child_weight=2, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=300, n_jobs=None, nthread=4, num_parallel_tree=None, ...))])
Adding multiple columns
from sklearn.preprocessing import KBinsDiscretizer
def sum_expenses(X):
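    # Total the five expense columns, then bin the total into three buckets:
    # <= 0, (0, 1000], and > 1000 (encoded with the labels 0, 1000, 2000)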
df = X.copy()
df['SumOfExpenses']=df['RoomService']+df['FoodCourt']+df['ShoppingMall']+df['Spa']+df['VRDeck']
labels=[0,1000,2000]
bins=[-np.inf,0,1000,np.inf]
binnedArray = pd.cut( df['SumOfExpenses'], bins = bins, labels=labels)
df['SumOfExpenses']=binnedArray.tolist()
return df.drop(['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck'], axis = 1)
def sum_expenses_wrapper(df):
return sum_expenses(df)
imputation_pipeline_sumExp= Pipeline(steps=[('sumExp', FunctionTransformer(sum_expenses_wrapper, validate=False)),
('impute', SimpleImputer(strategy='most_frequent'))
])
multicolumn_prep = ColumnTransformer([
('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
('imputation_pipeline_sumExp', imputation_pipeline_sumExp,['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Age']),
('pipeline_splitColumns_Cabin',pipeline_splitColumns_Cabin,['Cabin'])
],
remainder='passthrough')
ct_pipeline = Pipeline([ ('preprocessing', multicolumn_prep),
('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
gamma=0.001, subsample=0.9, colsample_bytree=0.9,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])
ct_cv_res = cross_validate(estimator = ct_pipeline,
X = trainX[['Destination','HomePlanet','VIP','CryoSleep',
'Cabin','Age',
'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],
y = trainy,
cv = skf,
scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")
ct_pipeline
Average cross-validated accuracy from
column transformer pipeline: 0.732
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder='passthrough', transformers=[('pipeline_cat', Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')), ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]), ['Destination', 'HomePlanet', 'VIP', 'CryoSleep']), ('imputation_pipeline_sumExp', Pipeline(steps=[('sumExp', Fu... feature_types=None, gamma=0.001, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.5, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=3, max_leaves=None, min_child_weight=2, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=300, n_jobs=None, nthread=4, num_parallel_tree=None, ...))])
Part 5: Custom Transformer: To Handle Model-Based Imputation
Benefit: Prevents Data Leakage
Encapsulating all preprocessing steps within the pipeline means each transformation is fitted only on the training folds, avoiding transformations that might inadvertently expose information from the held-out data to the training process, thus preventing data leakage.
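To make the contrast concrete, here is a sketch of the leaky pattern versus the safe one (illustrative, not a cell from this notebook):
# Leaky: the imputer's statistics are computed once over ALL rows, so every
# cross-validation fold has already seen information from its validation rows.
X_leaky = SimpleImputer(strategy='most_frequent').fit_transform(train[['Destination']])
# Safe: the imputer sits inside the pipeline, so cross_validate refits it
# on each training fold only.
safe_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
                      ('model', XGBClassifier())])
cross_validate(safe_pipe, train[['Destination']], trainy, cv=skf, scoring='accuracy')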
We are going to use a DecisionTreeClassifier to impute missing expense data from the remaining available expenses. We do see a significant improvement in the model.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.tree import DecisionTreeClassifier
def bin_expenses(X):
df = X.copy()
expenses=['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
for exp in expenses:
labels=[0,1000,2000]
bins=[-np.inf,0,1000,np.inf]
binnedArray = pd.cut( df[exp], bins = bins, labels=labels)
df[exp]=binnedArray.tolist()
return df
# Custom Binning Transformer
class CustomModelImputer(BaseEstimator, TransformerMixin):
def __init__(self):
self.expenses=['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Age']
def fit(self,X, y=None):
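        # Train one DecisionTreeClassifier per expense column, using only rows
        # where every expense column (and Age) is non-null; each tree learns to
        # predict its column from the remaining ones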
modelExpenses = self.expenses.copy();
modelExpenses.remove('Age')
models = {};
for exp in modelExpenses:
X_train = X[self.expenses].copy()
X_train = X_train.dropna(subset=self.expenses).copy() # we will build models with only the non-null values
tree_model = DecisionTreeClassifier()
y_train = X_train.pop(exp)
tree_model.fit(X_train, y_train);
models[exp] = tree_model;
self.models = models;
return self
def transform(self, Xin):
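        # For each expense column: take the rows where the other expense columns
        # are populated but this one is null, predict it with the fitted tree,
        # and merge the predictions back into the frame by index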
X = Xin.copy()
#X.reset_index(inplace = True)
imputedDF = X.copy();
modelExpenses = self.expenses.copy();
modelExpenses.remove('Age')
for exp in modelExpenses:
dfbase = X[self.expenses].copy()
expenses_copy = self.expenses.copy()
expenses_copy.remove(exp)
othernonnullExpDFToPred = dfbase.dropna(subset=expenses_copy).copy()
missingValuesExpense = othernonnullExpDFToPred[pd.isnull(othernonnullExpDFToPred[exp])].copy()
#nonnullExpDFToPred_Pid = missingValuesExpense[['PassengerId']].copy()
X_missingValuesAllExpense = missingValuesExpense[self.expenses].copy();
y_nonnullExp = X_missingValuesAllExpense.pop(exp);
ypred = self.models.get(exp).predict(X_missingValuesAllExpense)
missing_mask = missingValuesExpense[exp].isnull()
missingValuesExpense.loc[missing_mask, exp]= ypred;
missingValuesExpense = missingValuesExpense[[exp]].copy()
#missingValuesExpense.insert(1, "PassengerId", nonnullExpDFToPred_Pid["PassengerId"].values, True)
missingValuesExpense.rename(columns = {exp : exp+"_temp"}, inplace = True)
# Merge DataFrames on 'key' with a left join
df_merged = pd.merge(imputedDF, missingValuesExpense, left_index=True, right_index=True, how='left')
# Update 'value1' with 'value2' where available
df_merged[exp] = df_merged[exp+"_temp"].combine_first(df_merged[exp])
# Drop the 'value2' column as it is no longer needed
df_merged = df_merged.drop(columns=[exp+"_temp"])
imputedDF[exp] = df_merged[exp].copy()
return imputedDF;
imputation_pipeline_Exp= Pipeline(steps=[('binAge', FunctionTransformer(binAge, validate=False)),
('bin_expenses', FunctionTransformer(bin_expenses, validate=False)),
('CustomModelImputer', CustomModelImputer()),
])
multicolumn_prep = ColumnTransformer([
('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
('imputation_pipeline_Exp', imputation_pipeline_Exp,['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Age']),
('pipeline_splitColumns_Cabin',pipeline_splitColumns_Cabin,['Cabin'])
],
remainder='passthrough')
ct_pipeline = Pipeline([ ('preprocessing', multicolumn_prep),
('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
gamma=0.001, subsample=0.9, colsample_bytree=0.9,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])
ct_cv_res = cross_validate(estimator = ct_pipeline,
X = trainX[['Destination','HomePlanet','VIP','CryoSleep',
'Cabin','Age',
'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],
y = trainy,
cv = skf,
scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")
ct_pipeline
Average cross-validated accuracy from
column transformer pipeline: 0.777
Pipeline(steps=[('preprocessing', ColumnTransformer(remainder='passthrough', transformers=[('pipeline_cat', Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')), ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]), ['Destination', 'HomePlanet', 'VIP', 'CryoSleep']), ('imputation_pipeline_Exp', Pipeline(steps=[('binAge', Funct... feature_types=None, gamma=0.001, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.5, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=3, max_leaves=None, min_child_weight=2, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=300, n_jobs=None, nthread=4, num_parallel_tree=None, ...))])
Part 6: Testing and Submitting the Model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
X_train, X_test, y_train, y_test = train_test_split(trainX, trainy, test_size=0.10, random_state=10)
ct_pipeline.fit(X_train[['Destination','HomePlanet','VIP','CryoSleep',
'Cabin','Age',
'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],y_train)
y_pred = ct_pipeline.predict(X_test[['Destination','HomePlanet','VIP','CryoSleep',
'Cabin','Age',
'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']])
f1_score_all = f1_score(y_test, y_pred, average="weighted")
print(f"performed with value: {f1_score_all}")
performed with value: 0.7942487959983674
ct_pipeline.fit(trainX[['Destination','HomePlanet','VIP','CryoSleep',
'Cabin','Age',
'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],trainy)
y_pred = ct_pipeline.predict(testX[['Destination','HomePlanet','VIP','CryoSleep',
'Cabin','Age',
'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']])
sub = testX.copy()
sub = sub.reset_index()
submission = pd.DataFrame()
submission['PassengerId'] = sub['PassengerId'].copy()
submission['Transported'] = y_pred
submission['Transported'] = submission['Transported'].astype(bool)
submission.to_csv(fileSubmission, index=False)
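A quick sanity check of the file before uploading (illustrative):
print(submission.head(3))  # expect a PassengerId column plus a boolean Transported column
print(submission.shape)    # one row per test passenger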
Extra: GridSearchCV with a Pipeline: it's simple
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
multicolumn_prep = ColumnTransformer([
('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
('imputation_pipeline_Exp', imputation_pipeline_Exp,['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Age']),
('pipeline_splitColumns_Cabin',pipeline_splitColumns_Cabin,['Cabin'])
],
remainder='passthrough')
ct_pipeline = Pipeline([ ('preprocessing', multicolumn_prep),
('XG_model', XGBClassifier(objective= 'binary:logistic', seed=27))])
# defining parameter range
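# Note: the '<step_name>__<param_name>' double-underscore convention routes each
# entry to the matching step inside the pipeline (here the 'XG_model' step)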
param_grid = {'XG_model__max_depth': [ 3],
'XG_model__min_child_weight': [2],
'XG_model__nthread': [2,4],
'XG_model__scale_pos_weight': [1],
'XG_model__n_estimators': [300],
'XG_model__learning_rate': [ .05],
'XG_model__subsample': [.8],
'XG_model__colsample_bytree': [.8],
'XG_model__gamma': [ 0.001],
}
grid = GridSearchCV(ct_pipeline, param_grid, refit=True, verbose=3, n_jobs=-1)
grid.fit(X_train[['Destination','HomePlanet','VIP','CryoSleep',
'Cabin','Age',
'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],y_train)
# print best parameter after tuning
print(grid.best_params_)
grid_predictions = grid.predict(X_test[['Destination','HomePlanet','VIP','CryoSleep',
                                        'Cabin','Age',
                                        'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']])
# print classification report
print(classification_report(y_test, grid_predictions))
Fitting 5 folds for each of 2 candidates, totalling 10 fits
{'XG_model__colsample_bytree': 0.8, 'XG_model__gamma': 0.001, 'XG_model__learning_rate': 0.05, 'XG_model__max_depth': 3, 'XG_model__min_child_weight': 2, 'XG_model__n_estimators': 300, 'XG_model__nthread': 2, 'XG_model__scale_pos_weight': 1, 'XG_model__subsample': 0.8}
              precision    recall  f1-score   support

       False       0.82      0.80      0.81       434
        True       0.80      0.82      0.81       436

    accuracy                           0.81       870
   macro avg       0.81      0.81      0.81       870
weighted avg       0.81      0.81      0.81       870