Spaceship Titanic Pipeline - Model Imputation - 0.81 Score

18 minute read

Published:

The objective in this notebook is to use a Pipeline to streamline code development. I will not be focusing on data analysis and charts.

Part 1: Simple pipeline

Benefit Demonstrated: Improves Code Readability and Maintenance

Building a simple pipeline with two components: 1) a SimpleImputer to fill null values, in this case with the most frequent value, and 2) one-hot encoding of the column.

import numpy as np 
import pandas as pd 

from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier


from sklearn import set_config
set_config(display="diagram")
fileNameTrain = '/kaggle/input/spaceship-titanic/train.csv'
fileNameTest = '/kaggle/input/spaceship-titanic/test.csv'
fileSubmission = '/kaggle/working/submission.csv'

train = pd.read_csv(fileNameTrain).set_index('PassengerId')
test = pd.read_csv(fileNameTest).set_index('PassengerId')
trainX = train.drop(['Transported'], axis=1)
trainy = train['Transported']
testX = test.copy()

Simple Imputer:

Providing some extra print statements for explanation and validation.

print(f">>>> null values before: {trainX['Destination'].isnull().sum()}")
imputer = SimpleImputer(strategy='most_frequent')
asArray = imputer.fit_transform(train[['Destination']])
print(f">>>> The simple imputer return result as n array \n{asArray}")
print(f">>>> shape of array {asArray.shape}")
asDF = pd.DataFrame(asArray, columns =['Destination'])
asDF
print(f">>>> converting back to dataframe just for explaination: \n{asDF.head(3)}")
print(f">>>> null values after: {asDF['Destination'].isnull().sum()}")

>>>> null values before: 182
>>>> The simple imputer returns the result as an array 
[['TRAPPIST-1e']
 ['TRAPPIST-1e']
 ['TRAPPIST-1e']
 ...
 ['TRAPPIST-1e']
 ['55 Cancri e']
 ['TRAPPIST-1e']]
>>>> shape of array (8693, 1)
>>>> converting back to a dataframe just for explanation: 
   Destination
0  TRAPPIST-1e
1  TRAPPIST-1e
2  TRAPPIST-1e
>>>> null values after: 0

Combining a SimpleImputer and a OneHotEncoder without using a pipeline.

imputer = SimpleImputer(strategy='most_frequent')
asArray = imputer.fit_transform(train[['Destination']])
one = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
one.fit_transform(asArray)
array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])

Combining a SimpleImputer and a OneHotEncoder using a pipeline.

Clear structure for development and maintainability: even with just two processing steps, we see that the code gets organized into a structure that focuses on the most important aspects of the pipeline and avoids boilerplate code.

imputation_pipeline_cat= Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                                         ('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
                                      ])
imputation_pipeline_cat.fit_transform(train[['Destination']])
array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])
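
A side benefit: each named step stays individually addressable after fitting, which helps when debugging. A minimal sketch, assuming the fitted imputation_pipeline_cat from above (the attributes are standard scikit-learn; exact output feature names depend on the sklearn version):

fitted_imputer = imputation_pipeline_cat.named_steps['impute']
fitted_encoder = imputation_pipeline_cat.named_steps['encode']
print(fitted_imputer.statistics_)              # most frequent value learned during fit
print(fitted_encoder.get_feature_names_out())  # one output column per category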

Reusing the pipeline for other columns with similar processing requirements.

Benefit Demonstrated: Simplifies the Workflow

Ease of use: a pipeline combines multiple steps into a single object, and the same processing steps can be reused for multiple columns, simplifying the code. Here we reuse the same two steps across four columns.


imputation_pipeline_cat= Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                                         ('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
                                      ])
imputation_pipeline_cat.fit_transform(train[['Destination','HomePlanet','VIP','CryoSleep']])

array([[0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 1., 1., 0.],
       ...,
       [0., 0., 1., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.]])

Part 2: Pipeline combined with classification.

Building a pipeline with the simple feature engineering and fit/predict with a classification model.


space_titanic_pipeline = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                                      ('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
                                      ('model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
        gamma=0.001, subsample=0.9, colsample_bytree=0.9,
        objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])
space_titanic_pipeline.fit(train[['Destination','HomePlanet','VIP','CryoSleep']], trainy)
space_titanic_pipeline.predict(test[['Destination','HomePlanet','VIP','CryoSleep']])

array([1, 0, 1, ..., 1, 0, 1])

Cross-validating the pipeline as we keep adding features provides a robust framework for measuring progress.

from sklearn.model_selection import cross_validate, StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
ct_cv_res = cross_validate(estimator = space_titanic_pipeline, 
                           X = train[['Destination','HomePlanet','VIP','CryoSleep']],
                           y = trainy,
                           cv = skf,
                           scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")
Average cross-validated accuracy from
column transformer pipeline: 0.717
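
We will repeat this evaluation for every feature set below, so a small convenience helper (an illustrative addition, not part of the original notebook) keeps it to one call:

def cv_accuracy(pipeline, columns):
    # Mean cross-validated accuracy for a pipeline on a subset of columns
    return cross_validate(estimator=pipeline,
                          X=train[columns],
                          y=trainy,
                          cv=skf,
                          scoring='accuracy')['test_score'].mean()

# e.g. cv_accuracy(space_titanic_pipeline, ['Destination','HomePlanet','VIP','CryoSleep'])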

Part 3: Function (Simple) Transformer

Until now we have used two out-of-the-box transformations on the data. Very often, though, we need to plug in specific functions that we have developed ourselves. FunctionTransformer lets us apply custom stateless transformations to the data within a pipeline.

Develop the function, then call it through a FunctionTransformer within a pipeline.

def binAge(dfbinIn: pd.DataFrame) -> pd.DataFrame:
    dfbin = dfbinIn.copy()
    # Bin Age into decades; each label is the upper bound of its bin
    labels = [0, 10, 20, 30, 40, 50, 60, 70]
    bins = [-np.inf, 0, 10, 20, 30, 40, 50, 60, np.inf]
    dfbin["Age"] = pd.cut(dfbin["Age"], bins=bins, labels=labels).tolist()
    # Fill the remaining nulls with the most frequent bin
    mode_value = dfbin['Age'].mode().iloc[0] if not dfbin['Age'].mode().empty else False
    dfbin['Age'] = dfbin['Age'].fillna(mode_value)
    return dfbin

pipeline_Age = Pipeline([('binAge', FunctionTransformer(binAge, validate=False))])

pipeline_Age.fit_transform(trainX)
            HomePlanet CryoSleep     Cabin    Destination   Age    VIP  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck               Name
PassengerId
0001_01         Europa     False     B/0/P    TRAPPIST-1e  40.0  False          0.0        0.0           0.0     0.0     0.0    Maham Ofracculy
0002_01          Earth     False     F/0/S    TRAPPIST-1e  30.0  False        109.0        9.0          25.0   549.0    44.0       Juanna Vines
0003_01         Europa     False     A/0/S    TRAPPIST-1e  60.0   True         43.0     3576.0           0.0  6715.0    49.0      Altark Susent
0003_02         Europa     False     A/0/S    TRAPPIST-1e  40.0  False          0.0     1283.0         371.0  3329.0   193.0       Solam Susent
0004_01          Earth     False     F/1/S    TRAPPIST-1e  20.0  False        303.0       70.0         151.0   565.0     2.0  Willy Santantines
...                ...       ...       ...            ...   ...    ...          ...        ...           ...     ...     ...                ...
9276_01         Europa     False    A/98/P    55 Cancri e  50.0   True          0.0     6819.0           0.0  1643.0    74.0  Gravior Noxnuther
9278_01          Earth      True  G/1499/S  PSO J318.5-22  20.0  False          0.0        0.0           0.0     0.0     0.0    Kurta Mondalley
9279_01          Earth     False  G/1500/S    TRAPPIST-1e  30.0  False          0.0        0.0        1872.0     1.0     0.0       Fayey Connon
9280_01         Europa     False   E/608/S    55 Cancri e  40.0  False          0.0     1049.0           0.0   353.0  3235.0   Celeon Hontichre
9280_02         Europa     False   E/608/S    TRAPPIST-1e  50.0  False        126.0     4688.0           0.0     0.0    12.0   Propsh Hontichre

8693 rows × 12 columns

Part 4: Column Transformer

As we have noticed, we want two separate kinds of data processing: one for the categorical variables (['Destination','HomePlanet','VIP','CryoSleep']) and a separate one for Age. ColumnTransformer lets us apply different preprocessing transformations to different columns of the dataset.

Column Transformer for 2 pipelines

The chart below also diagrammatically details the processing steps.

multicolumn_prep = ColumnTransformer([
                                    ('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
                                    ('pipeline_Age', pipeline_Age, ['Age'])
                                     ],
                                     remainder='passthrough')
multicolumn_prep
ColumnTransformer(remainder='passthrough',
                  transformers=[('pipeline_cat',
                                 Pipeline(steps=[('impute',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('encode',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 ['Destination', 'HomePlanet', 'VIP',
                                  'CryoSleep']),
                                ('pipeline_Age',
                                 Pipeline(steps=[('binAge',
                                                  FunctionTransformer(func=<function binAge at 0x79aaf72604c0>))]),
                                 ['Age'])])

Pipeline with the ColumnTransformer and the classification model to fit and predict.

space_titanic_pipeline = Pipeline([('preprocessing', multicolumn_prep),
                        ('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
        gamma=0.001, subsample=0.9, colsample_bytree=0.9,
        objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])
ct_cv_res = cross_validate(estimator = space_titanic_pipeline, 
                           X = train[['Destination','HomePlanet','VIP','CryoSleep','Age']],
                           y = trainy,
                           cv = skf,
                           scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")
Average cross-validated accuracy from
column transformer pipeline: 0.731

Part 4 (continued): Function (a bit more) Transformer

The previous FunctionTransformer binned the age. The example below demonstrates that we can also handle more complex requirements, such as splitting one column into several new ones. Again, we can create a specific sub-pipeline and combine it into a larger pipeline within a ColumnTransformer.

Splitting a column into new columns

def split_Cabin_column(dfin):
    df = dfin.copy()
    
    # Track which rows were originally NaN
    is_null = df['Cabin'].isnull()
    
    # Handle null values before splitting by filling with empty string
    df['Cabin'] = df['Cabin'].fillna('//')
    
    # Split the column and create new columns
    split_data = df['Cabin'].str.split('/', expand=True)
    split_data.columns = ['Cabin_Deck','Cabin_Num','Cabin_Side']
    
    # Reset split columns back to np.nan where original column was np.nan
    split_data[is_null] = np.nan
    
    # Join the new columns with the original dataframe
    df = df.join(split_data)
    
    return df.drop(columns=['Cabin','Cabin_Num'])

def split_column_cabin_wrapper(df):
    return split_Cabin_column(df)


pipeline_splitColumns_Cabin = Pipeline([
                                        ('column_splitter', FunctionTransformer(split_column_cabin_wrapper, validate=False)),
                                        ('impute', SimpleImputer(strategy='most_frequent')),
                                        ('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])
trainXCabin = pipeline_splitColumns_Cabin.fit_transform(trainX[['Cabin']])
trainXCabin[0]
array([0., 1., 0., 0., 0., 0., 0., 0., 1., 0.])
multicolumn_prep = ColumnTransformer([
                                    ('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
                                    ('pipeline_Age', pipeline_Age, ['Age']),
                                    ('pipeline_splitColumns_Cabin', pipeline_splitColumns_Cabin, ['Cabin'])
                                     ],
                                     remainder='passthrough')
space_titanic_pipeline = Pipeline([    
                            ('preprocessing', multicolumn_prep),
                            ('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
        gamma=0.001, subsample=0.9, colsample_bytree=0.9,
        objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])



ct_cv_res = cross_validate(estimator = space_titanic_pipeline, 
                           X = train[['Destination','HomePlanet','VIP','CryoSleep','Age','Cabin']],
                           y = trainy,
                           cv = skf,
                           scoring = 'accuracy')['test_score'].mean()
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")

space_titanic_pipeline
Average cross-validated accuracy from
column transformer pipeline: 0.735
Pipeline(steps=[('preprocessing',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline_cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encode',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['Destination', 'HomePlanet',
                                                   'VIP', 'CryoSleep']),
                                                 ('pipeline_Age',
                                                  Pipeline(steps=[('binAge',
                                                                   FunctionTransfor...
                               feature_types=None, gamma=0.001,
                               grow_policy=None, importance_type=None,
                               interaction_constraints=None, learning_rate=0.5,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=3, max_leaves=None, min_child_weight=2,
                               missing=nan, monotone_constraints=None,
                               multi_strategy=None, n_estimators=300,
                               n_jobs=None, nthread=4, num_parallel_tree=None, ...))])

Benefit Demonstrated: Enhances Model Management

Combining both data preprocessing and the model into a single pipeline object, which can easily be saved, loaded, and used for predictions, brings clarity and reduces chaos in development. At the same time there is modularity: each step in the pipeline is a separate component, allowing easy modification and experimentation with different preprocessing techniques and models.

We can quickly put things together and check whether the model improves or not. (Note: this needs to be coupled with EDA; the pipeline then makes putting an idea to the test easy.)
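
For example, persisting the whole fitted pipeline is a one-liner. A minimal sketch using joblib (the file path is illustrative; custom functions such as binAge must be defined or importable when the file is loaded back):

import joblib

# Fit once, then persist preprocessing + model as a single artifact
cols = ['Destination','HomePlanet','VIP','CryoSleep','Age','Cabin']
space_titanic_pipeline.fit(train[cols], trainy)
joblib.dump(space_titanic_pipeline, '/kaggle/working/space_titanic_pipeline.joblib')

# Later: reload and predict directly on the raw columns
loaded = joblib.load('/kaggle/working/space_titanic_pipeline.joblib')
preds = loaded.predict(test[cols])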

We notice there is a dip (though very slight) in performance.

def split_for_groupCount(df):
    df_reset = df.copy()
    df_reset.reset_index(inplace=True)

    # PassengerId has the form gggg_pp: group id and position within the group
    split_data = df_reset['PassengerId'].str.split('_', expand=True)
    split_data.columns = ['PGroup', 'PGroupPP']
    split_data['PassengerId'] = df_reset['PassengerId'].copy()

    df_reset = df_reset.set_index('PassengerId')

    # Check whether the group size has an impact on the target
    groupbyDF = split_data.groupby(["PGroup"], as_index=False)['PGroupPP'].count()
    groupbyDF = groupbyDF.rename(columns={"PGroupPP": "PGroupPPCounts"})

    split_data = pd.merge(split_data, groupbyDF, on="PGroup")
    split_data["PGroupPP"] = split_data["PGroupPP"].astype(int)
    split_data = split_data.set_index('PassengerId')

    # Join the new column with the original dataframe and drop the intermediates
    df_reset = df_reset.join(split_data)
    return df_reset.drop(['PGroup', 'PGroupPP'], axis=1)

def split_for_groupCount_wrapper(df):
    return split_for_groupCount(df)

pipeline_pass_splitColumnsGroup = Pipeline([
    ('passenger_splitGroup', FunctionTransformer(split_for_groupCount_wrapper, validate=False))])


multicolumn_prep = ColumnTransformer([
                                    ('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
                                    ('pipeline_Age', pipeline_Age, ['Age']),
                                    ('pipeline_splitColumns_Cabin', pipeline_splitColumns_Cabin, ['Cabin'])
                                     ],
                                     remainder='passthrough')
ct_pipeline = Pipeline([    ('pipeline-splitIndexColumns-Group', pipeline_pass_splitColumnsGroup),
                            ('preprocessing', multicolumn_prep),
                            ('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
        gamma=0.001, subsample=0.9, colsample_bytree=0.9,
        objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])

ct_cv_res = cross_validate(estimator = ct_pipeline, 
                           X = train[['Destination','HomePlanet','VIP','CryoSleep','Age','Cabin']],
                           y = trainy,
                           cv = skf,
                           scoring = 'accuracy')['test_score'].mean()


                            
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")

ct_pipeline 
Average cross-validated accuracy from
column transformer pipeline: 0.733
Pipeline(steps=[('pipeline-splitIndexColumns-Group',
                 Pipeline(steps=[('passenger_splitGroup',
                                  FunctionTransformer(func=<function split_for_groupCount_wrapper at 0x79aaf7262560>))])),
                ('preprocessing',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline_cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encode',
                                                                   OneHotE...
                               feature_types=None, gamma=0.001,
                               grow_policy=None, importance_type=None,
                               interaction_constraints=None, learning_rate=0.5,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=3, max_leaves=None, min_child_weight=2,
                               missing=nan, monotone_constraints=None,
                               multi_strategy=None, n_estimators=300,
                               n_jobs=None, nthread=4, num_parallel_tree=None, ...))])

Combining multiple columns into a new feature


def sum_expenses(X):
    df = X.copy()
    # Sum the five expense columns into a single feature, then bin it
    df['SumOfExpenses'] = (df['RoomService'] + df['FoodCourt'] + df['ShoppingMall']
                           + df['Spa'] + df['VRDeck'])
    labels = [0, 1000, 2000]
    bins = [-np.inf, 0, 1000, np.inf]
    df['SumOfExpenses'] = pd.cut(df['SumOfExpenses'], bins=bins, labels=labels).tolist()
    return df.drop(['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], axis=1)

def sum_expenses_wrapper(df):
    return sum_expenses(df)

imputation_pipeline_sumExp= Pipeline(steps=[('sumExp', FunctionTransformer(sum_expenses_wrapper, validate=False)),
                                            ('impute', SimpleImputer(strategy='most_frequent'))
                                      ])


multicolumn_prep = ColumnTransformer([
                                    ('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
                                    ('imputation_pipeline_sumExp', imputation_pipeline_sumExp,['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Age']),
                                    ('pipeline_splitColumns_Cabin',pipeline_splitColumns_Cabin,['Cabin'])
                                     ],
                                     remainder='passthrough')

ct_pipeline = Pipeline([   ('preprocessing', multicolumn_prep),
                           ('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
        gamma=0.001, subsample=0.9, colsample_bytree=0.9,
        objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])

ct_cv_res = cross_validate(estimator = ct_pipeline, 
                           X = trainX[['Destination','HomePlanet','VIP','CryoSleep',
                                                        'Cabin','Age',
                                                        'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],
                           y = trainy,
                           cv = skf,
                           scoring = 'accuracy')['test_score'].mean()


                            
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")

ct_pipeline 
Average cross-validated accuracy from
column transformer pipeline: 0.732
Pipeline(steps=[('preprocessing',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline_cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encode',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['Destination', 'HomePlanet',
                                                   'VIP', 'CryoSleep']),
                                                 ('imputation_pipeline_sumExp',
                                                  Pipeline(steps=[('sumExp',
                                                                   Fu...
                               feature_types=None, gamma=0.001,
                               grow_policy=None, importance_type=None,
                               interaction_constraints=None, learning_rate=0.5,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=3, max_leaves=None, min_child_weight=2,
                               missing=nan, monotone_constraints=None,
                               multi_strategy=None, n_estimators=300,
                               n_jobs=None, nthread=4, num_parallel_tree=None, ...))])

Part 5: Custom Transformer: Model-Based Imputation

Benefit Demonstrated: Prevents Data Leakage

By encapsulating all preprocessing steps within a pipeline, we avoid transformations that might inadvertently expose information from the validation or test data to the training step: during cross-validation every transformer is refit on the training split only, which prevents data leakage.
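
To make the contrast concrete, here is a minimal sketch (illustrative only; for most_frequent imputation of a single column the leak is small, but the principle holds for any fitted transformer):

# Leaky: the imputer is fit on ALL rows, including future validation folds
leaky_X = SimpleImputer(strategy='most_frequent').fit_transform(train[['Destination']])

# Safe: cross_validate clones and refits the whole pipeline on each training
# split, so validation rows never influence the imputer or encoder statistics
safe_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
                      ('model', XGBClassifier())])
cross_validate(safe_pipe, train[['Destination']], trainy, cv=skf, scoring='accuracy')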

We are going to use a DecisionTreeClassifier to impute each missing expense from the remaining available expenses. We do see a significant improvement in the model.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.tree import DecisionTreeClassifier

def bin_expenses(X):
    df = X.copy()
    expenses = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
    labels = [0, 1000, 2000]
    bins = [-np.inf, 0, 1000, np.inf]
    for exp in expenses:
        # NaNs survive pd.cut and are imputed later by CustomModelImputer
        df[exp] = pd.cut(df[exp], bins=bins, labels=labels).tolist()
    return df

# Custom model-based imputer: for each expense column, a DecisionTreeClassifier
# predicts its (binned) value from the other expense columns and Age
class CustomModelImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.expenses = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age']

    def fit(self, X, y=None):
        modelExpenses = self.expenses.copy()
        modelExpenses.remove('Age')  # Age is a predictor only, never imputed here
        models = {}
        for exp in modelExpenses:
            # Build each model on fully observed rows only
            X_train = X[self.expenses].dropna(subset=self.expenses).copy()
            y_train = X_train.pop(exp)
            tree_model = DecisionTreeClassifier()
            tree_model.fit(X_train, y_train)
            models[exp] = tree_model
        self.models = models
        return self

    def transform(self, Xin):
        imputedDF = Xin.copy()
        modelExpenses = self.expenses.copy()
        modelExpenses.remove('Age')
        for exp in modelExpenses:
            dfbase = Xin[self.expenses].copy()
            other_expenses = self.expenses.copy()
            other_expenses.remove(exp)

            # Rows where this expense is missing but all other predictors are present
            candidates = dfbase.dropna(subset=other_expenses)
            missingValuesExpense = candidates[pd.isnull(candidates[exp])].copy()

            # Predict the missing values from the other columns
            X_missing = missingValuesExpense[self.expenses].drop(columns=[exp])
            missingValuesExpense[exp] = self.models[exp].predict(X_missing)

            # Merge the predictions back, filling only the originally missing cells
            missingValuesExpense = missingValuesExpense[[exp]].rename(columns={exp: exp + "_temp"})
            df_merged = pd.merge(imputedDF, missingValuesExpense,
                                 left_index=True, right_index=True, how='left')
            imputedDF[exp] = df_merged[exp + "_temp"].combine_first(df_merged[exp])
        return imputedDF

imputation_pipeline_Exp= Pipeline(steps=[('binAge', FunctionTransformer(binAge, validate=False)),
                                        ('bin_expenses', FunctionTransformer(bin_expenses, validate=False)),
                                         ('CustomModelImputer', CustomModelImputer()), 
                                      ])
multicolumn_prep = ColumnTransformer([
                                    ('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
                                    ('imputation_pipeline_Exp', imputation_pipeline_Exp,['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Age']),
                                    ('pipeline_splitColumns_Cabin',pipeline_splitColumns_Cabin,['Cabin'])
                                     ],
                                     remainder='passthrough')

ct_pipeline = Pipeline([   ('preprocessing', multicolumn_prep),
                           ('XG_model', XGBClassifier(max_depth=3, min_child_weight=2, learning_rate =0.5, n_estimators=300,
        gamma=0.001, subsample=0.9, colsample_bytree=0.9,
        objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27))])

ct_cv_res = cross_validate(estimator = ct_pipeline, 
                           X = trainX[['Destination','HomePlanet','VIP','CryoSleep',
                                                        'Cabin','Age',
                                                        'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],
                           y = trainy,
                           cv = skf,
                           scoring = 'accuracy')['test_score'].mean()
                            
print(f"Average cross-validated accuracy from\ncolumn transformer pipeline: {ct_cv_res:.3f}")

ct_pipeline 
Average cross-validated accuracy from
column transformer pipeline: 0.777
Pipeline(steps=[('preprocessing',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline_cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encode',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['Destination', 'HomePlanet',
                                                   'VIP', 'CryoSleep']),
                                                 ('imputation_pipeline_Exp',
                                                  Pipeline(steps=[('binAge',
                                                                   Funct...
                               feature_types=None, gamma=0.001,
                               grow_policy=None, importance_type=None,
                               interaction_constraints=None, learning_rate=0.5,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=3, max_leaves=None, min_child_weight=2,
                               missing=nan, monotone_constraints=None,
                               multi_strategy=None, n_estimators=300,
                               n_jobs=None, nthread=4, num_parallel_tree=None, ...))])

Part 6: Testing and Submitting the Model

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
X_train, X_test, y_train, y_test = train_test_split(trainX, trainy, test_size=0.10, random_state=10)
ct_pipeline.fit(X_train[['Destination','HomePlanet','VIP','CryoSleep',
                                                    'Cabin','Age',
                                                    'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],y_train)
y_pred = ct_pipeline.predict(X_test[['Destination','HomePlanet','VIP','CryoSleep',
                                                    'Cabin','Age',
                                                    'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']])

f1_score_all = f1_score(y_test, y_pred, average="weighted")
print(f"Weighted F1 score on the hold-out set: {f1_score_all}")
Weighted F1 score on the hold-out set: 0.7942487959983674
ct_pipeline.fit(trainX[['Destination','HomePlanet','VIP','CryoSleep',
                        'Cabin','Age',
                        'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']], trainy)
y_pred = ct_pipeline.predict(testX[['Destination','HomePlanet','VIP','CryoSleep',
                                    'Cabin','Age',
                                    'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']])
sub = testX.reset_index()
submission = pd.DataFrame()
submission['PassengerId'] = sub['PassengerId'].copy()
# Kaggle expects Transported as True/False rather than 1/0
submission['Transported'] = y_pred.astype(bool)
submission.to_csv(fileSubmission, index=False)

Extra: GridSearchCV with Pipeline: It's Simple

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

multicolumn_prep = ColumnTransformer([
                                    ('pipeline_cat',imputation_pipeline_cat,['Destination','HomePlanet','VIP','CryoSleep']),
                                    ('imputation_pipeline_Exp', imputation_pipeline_Exp,['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','Age']),
                                    ('pipeline_splitColumns_Cabin',pipeline_splitColumns_Cabin,['Cabin'])
                                     ],
                                     remainder='passthrough')

ct_pipeline = Pipeline([   ('preprocessing', multicolumn_prep),
                           ('XG_model', XGBClassifier(objective= 'binary:logistic', seed=27))])

# defining parameter range 
param_grid = {'XG_model__max_depth': [ 3],
              'XG_model__min_child_weight': [2],
              'XG_model__nthread': [2,4],
              'XG_model__scale_pos_weight': [1],
              'XG_model__n_estimators': [300],
              'XG_model__learning_rate': [ .05],
              'XG_model__subsample': [.8],
              'XG_model__colsample_bytree': [.8],
              'XG_model__gamma': [ 0.001],  
              }

grid = GridSearchCV(ct_pipeline, param_grid, refit = True, verbose = 3,n_jobs=-1) 

grid.fit(X_train[['Destination','HomePlanet','VIP','CryoSleep',
                                                    'Cabin','Age',
                                                    'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']],y_train)
# print best parameter after tuning 
print(grid.best_params_) 
grid_predictions = grid.predict(X_test[['Destination','HomePlanet','VIP','CryoSleep',
                                        'Cabin','Age',
                                        'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']]) 
   
# print classification report 
print(classification_report(y_test, grid_predictions)) 
Fitting 5 folds for each of 2 candidates, totalling 10 fits
{'XG_model__colsample_bytree': 0.8, 'XG_model__gamma': 0.001, 'XG_model__learning_rate': 0.05, 'XG_model__max_depth': 3, 'XG_model__min_child_weight': 2, 'XG_model__n_estimators': 300, 'XG_model__nthread': 2, 'XG_model__scale_pos_weight': 1, 'XG_model__subsample': 0.8}
              precision    recall  f1-score   support

       False       0.82      0.80      0.81       434
        True       0.80      0.82      0.81       436

    accuracy                           0.81       870
   macro avg       0.81      0.81      0.81       870
weighted avg       0.81      0.81      0.81       870