Introduction

The aim of this project is to build a predictive model for the given data set. Two data sets are provided: a training set and a test set. Each set has 254 features, and the training set additionally contains a target value. Ultimately, the goal is to assess predictive accuracy on the test set using the mean squared error metric.
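
For clarity, mean squared error is simply the average of the squared residuals. A minimal sketch (the helper name mse is ours; it assumes y_true and y_pred are NumPy arrays of equal length):

import numpy as np

def mse(y_true, y_pred):
    # Average of squared residuals
    e = y_true - y_pred
    return np.dot(e, e) / float(len(y_true))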

Before starting to inspect the data, we need to import the required tools. We will use pandas to inspect the data, and numpy and sklearn to implement the predictor models.

import os 
import time

import pandas as pd 
import numpy as np 
import cPickle
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.cross_validation import KFold
from sklearn.linear_model import LinearRegression, ElasticNetCV, RidgeCV, LassoCV 
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVR, NuSVR, SVR
/home/spartonia/.venv/experiments/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

We also add the following to suppress warnings.

import warnings
warnings.filterwarnings('ignore') 

Data Inspection

We load the train and test data sets and separate X and y of the training set accordingly.

data_dir = '.'
df = pd.read_csv(os.path.join(data_dir, 'codetest_train.txt'), delimiter='\t')
X_test = pd.read_csv(os.path.join(data_dir, 'codetest_test.txt'), delimiter='\t')
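
As a quick sanity check on what was loaded (the expected column counts follow from the problem statement: 254 features, plus a target in the training set):

print df.shape      # expected (n_rows, 255): target + 254 features
print X_test.shape  # expected (n_rows, 254)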

To start off, we print the head of the data set:

df.head(3)
target f_0 f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8 ... f_244 f_245 f_246 f_247 f_248 f_249 f_250 f_251 f_252 f_253
0 3.066056 -0.653 0.255 -0.615 -1.833 -0.736 NaN 1.115 -0.171 -0.351 ... -1.607 -1.400 -0.920 -0.198 -0.945 -0.573 0.170 -0.418 -1.244 -0.503
1 -1.910473 1.179 -0.093 -0.556 0.811 -0.468 -0.005 -0.116 -1.243 1.985 ... 1.282 0.032 -0.061 NaN -0.061 -0.302 1.281 -0.850 0.821 -0.260
2 7.830711 0.181 -0.778 -0.919 0.113 0.887 -0.762 1.872 -1.709 0.135 ... -0.237 -0.660 1.073 -0.193 0.570 -0.267 1.435 1.332 -1.147 2.580

3 rows × 255 columns

It seems the data is mostly numeric, with some missing values as well. The target is the first column and the rest of the columns represent features, so we can separate X and y:

X = df.iloc[:, 1:]
y = df.target
y = y.values  # To np.array

Now we need to make sure that all columns of X are in numeric format.

print set(X.dtypes)
set([dtype('O'), dtype('float64')])

It seems we have dtype('O') along with float64, so we inspect those columns separately.

categorical_cols = []
for col in X.columns:
    if X[col].dtype != 'float64':
        categorical_cols.append(col)
        print 'column:', col 
        print X[col].head(3)
        print '-' * 15
column: f_61
0    b
1    a
2    b
Name: f_61, dtype: object
---------------
column: f_121
0    D
1    A
2    B
Name: f_121, dtype: object
---------------
column: f_215
0       red
1      blue
2    orange
Name: f_215, dtype: object
---------------
column: f_237
0    Canada
1    Canada
2    Canada
Name: f_237, dtype: object
---------------

As we can see, a few columns have non-numeric (categorical) data types, so we should convert them to numeric form. Columns to be encoded: [f_61, f_121, f_215, f_237].

We add dummy columns to encode categorical data.

X[categorical_cols].head()
f_61 f_121 f_215 f_237
0 b D red Canada
1 a A blue Canada
2 b B orange Canada
3 a C blue USA
4 b E orange Canada
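
To illustrate what the encoding does, pd.get_dummies expands each categorical column into one binary indicator column per level (e.g. f_61 becomes f_61_a, f_61_b, and so on):

# One binary column per category level
print pd.get_dummies(X[categorical_cols]).head(3)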

As we saw earlier, the data includes missing values too, so we need to deal with the rows containing null values. We could remove all rows with missing values, but that does not seem like a good idea, because we would lose the majority of the data:

print X.dropna().shape
(32, 254)

So we only have 32 samples without missing values. Another approach is to use interpolation to fill the missing values. We use interpolation with limit=2 and in both directions to fill null values. We also scale the data so that all values fall in the same interval, which speeds up training. The following function performs these three operations on the data.

def preprocess_data(data, categorical_cols):
    """Preprocess and clean the data:
        * add dummy columns for categorical values
        * fix missing values
        * scale
    """
    # Add dummy columns for categorical values
    X_categorical = pd.get_dummies(data[categorical_cols])
    X = pd.concat([data.drop(categorical_cols, axis=1), X_categorical], axis=1)

    # Use interpolation to deal with missing values
    X_filled = X.interpolate(limit=2, limit_direction='both').dropna()

    # Scale each column to zero mean and unit variance
    X_scaled = pd.DataFrame(data=scale(X_filled), columns=X.columns)

    return X_scaled

X_scaled = preprocess_data(X, categorical_cols)
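
A quick check that preprocessing behaved as intended; note that interpolation with limit=2 can still leave NaNs in rows with long runs of consecutive missing values (which dropna then removes), so it is worth verifying how many rows survive:

print X_scaled.shape                  # how many rows survived
print X_scaled.isnull().values.any()  # should be False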

The next thing we can try is to use PCA to see if we can reduce the dimension (number of features):

# Do some PCA to see if we can reduce the dimension..
pca = PCA()
# Scale and transform data to get principal components
X_reduced = pca.fit_transform(X_scaled)
# Variance (% cumulative) explained by the principal components
print np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4) * 100)[::-1]
[ 99.96  99.95  99.94  99.93  99.92  99.82  99.72  99.61  99.5   99.39
  99.16  98.93  98.7   98.46  98.22  97.98  97.74  97.5   97.26  97.02
  96.77  96.52  96.27  96.02  95.77  95.52  95.27  95.01  94.75  94.49
  94.23  93.97  93.71  93.45  93.18  92.91  92.64  92.37  92.1   91.83
  91.56  91.29  91.01  90.73  90.45  90.17  89.89  89.61  89.33  89.05
  88.77  88.49  88.21  87.92  87.63  87.34  87.05  86.76  86.47  86.18
  85.89  85.6   85.31  85.01  84.71  84.41  84.11  83.81  83.51  83.21
  82.91  82.61  82.3   81.99  81.68  81.37  81.06  80.75  80.44  80.13
  79.82  79.51  79.2   78.88  78.56  78.24  77.92  77.6   77.28  76.96
  76.64  76.32  76.    75.68  75.35  75.02  74.69  74.36  74.03  73.7
  73.37  73.04  72.71  72.37  72.03  71.69  71.35  71.01  70.67  70.33
  69.99  69.65  69.31  68.96  68.61  68.26  67.91  67.56  67.21  66.86
  66.51  66.16  65.81  65.45  65.09  64.73  64.37  64.01  63.65  63.29
  62.93  62.57  62.21  61.84  61.47  61.1   60.73  60.36  59.99  59.62
  59.25  58.88  58.51  58.13  57.75  57.37  56.99  56.61  56.23  55.85
  55.47  55.08  54.69  54.3   53.91  53.52  53.13  52.74  52.35  51.96
  51.57  51.17  50.77  50.37  49.97  49.57  49.17  48.77  48.37  47.97
  47.56  47.15  46.74  46.33  45.92  45.51  45.1   44.69  44.28  43.86
  43.44  43.02  42.6   42.18  41.76  41.34  40.91  40.48  40.05  39.62
  39.19  38.76  38.33  37.9   37.46  37.02  36.58  36.14  35.7   35.26
  34.82  34.38  33.93  33.48  33.03  32.58  32.13  31.68  31.23  30.78
  30.32  29.86  29.4   28.94  28.48  28.01  27.54  27.07  26.6   26.13
  25.66  25.19  24.71  24.23  23.75  23.27  22.79  22.31  21.82  21.33
  20.84  20.35  19.86  19.37  18.88  18.38  17.88  17.38  16.88  16.37
  15.86  15.35  14.84  14.33  13.82  13.3   12.78  12.26  11.74  11.21
  10.68  10.15   9.61   9.07   8.53   7.98   7.43   6.88   6.32   5.76
   5.19   4.62   4.01   3.39   2.74   2.08   1.4    0.7 ]

As we can see above, the explained variance is spread almost evenly across the components (the first component explains less than 1% of the variance), so essentially all components are required to explain the variance in the data. We will therefore use all components.
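
Had the variance instead been concentrated in a few components, we could have kept the smallest number of components reaching some threshold. A sketch with a hypothetical 95% cutoff:

cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = np.searchsorted(cum_var, 0.95) + 1  # first index where cumulative variance >= 95%
print n_keep, 'components needed for 95% of the variance'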

Once the data is cleaned, we are ready to test a few regressors. We create a dictionary called 'models', which maps names to regressors. This way, we can train all models together and compare the results to choose the best performing model. Feel free to uncomment any of the models to include them in the results.



models = {
    'LinearRegression': LinearRegression(fit_intercept=True),
    'ElasticNetCV': ElasticNetCV(fit_intercept=True,n_jobs=-1),
    'LassoCV': LassoCV(n_jobs=-1),
    'LinearSVR': LinearSVR(), 
    'Ridge': RidgeCV(),
#     'NuSVR': NuSVR(),
#     'SVR': SVR(),
#     'RFRegressor': RandomForestRegressor(n_estimators=50, n_jobs=-1)  # --> Dangerous! Very time consuming..
}

We also use cross validation to choose the best model.

n_folds = 20
kf = KFold(X_reduced.shape[0], n_folds=n_folds)
err = dict.fromkeys(models.keys(), 0)

For each model, we run a k-fold cross validation and save the accumulated error in a dict, one entry per model.

for model_name, model in models.iteritems():
    print 'training', model_name, '..'
    t0 = time.time()
    for train, test in kf:
        model.fit(X_scaled.values[train], y[train])

        # Predict the held-out fold and accumulate its squared error
        p = model.predict(X_scaled.values[test])
        e = p - y[test]
        err[model_name] += np.dot(e, e)

    print '\t..', time.time()-t0, 'seconds.'
print 'Done!'
training Ridge ..
	.. 8.10740995407 seconds.
training LinearSVR ..
	.. 26.9117820263 seconds.
training LinearRegression ..
	.. 3.52279186249 seconds.
training ElasticNetCV ..
	.. 9.98840403557 seconds.
training LassoCV ..
	.. 10.8082270622 seconds.
Done!

Once training is done, we can compute the RMSE for each model as the square root of the accumulated squared error divided by the number of samples. The results for the 20-fold cross validation are listed below.

print 'RMSE on {}-fold CV:'.format(n_folds)
for model, error in err.iteritems():
    rmse_cv = np.sqrt(error / X_scaled.shape[0])
    print model, rmse_cv
RMSE on 20-fold CV:
LinearRegression 3.50829586837
LinearSVR 3.47428941377
Ridge 3.50748745066
ElasticNetCV 3.43133680584
LassoCV 3.41826205147

For reference, RMSE on 20-fold CV from a separate run that also included SVR:

  • Ridge 3.50748745066
  • ElasticNetCV 3.49292650999
  • LinearSVR 3.47267712099
  • LinearRegression 3.50829586837
  • LassoCV 3.4933509662
  • SVR 3.76794728891

The best performing model was RFRegressor, followed by LassoCV, both trained with 20-fold cross validation. Considering the training time, which is huge for RFRegressor, we choose the second best performing model, i.e. LassoCV. Now we train a new LassoCV model on all the data to use for predicting the test data.

regressor = LassoCV(cv=20)
regressor.fit(X_scaled, y)
LassoCV(alphas=None, copy_X=True, cv=20, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
regressor.coef_[0:6]
array([ 0., -0., -0.,  0.,  0.,  0.])
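
Note that many coefficients are exactly zero: Lasso performs implicit feature selection by shrinking uninformative weights to zero. A quick count (the exact number depends on the alpha chosen by cross validation):

print np.sum(regressor.coef_ != 0), 'nonzero coefficients out of', len(regressor.coef_)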

Before using the predictor on the test data, we need to make sure the test data has the same format as the training data. So we call the preprocessing function again, with the same settings, to unify the data formats.

X_test_scaled = preprocess_data(X_test, categorical_cols)

We can now predict the test data and write the results to a txt file.

y_test = regressor.predict(X_test_scaled)
result_file = os.path.join(data_dir, 'prediction.txt')
np.savetxt(result_file, y_test, delimiter='\n', fmt='%1.6f')
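
As a small sanity check, we can read the file back and compare against the in-memory predictions:

print np.loadtxt(result_file)[:3]
print y_test[:3]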

We can also get the top n features and their weights:

coef_idx = range(len(regressor.coef_))
lst = zip(abs(regressor.coef_), coef_idx)
# Sort by absolute weight, largest first
lst_sort = sorted(lst, reverse=True)
top_n_feats = 10
print 'Top {} features'.format(top_n_feats)
print '=' * 20
for weight, idx in lst_sort[:top_n_feats]:
    print X_scaled.columns[idx], '-->', weight
Top 10 features
====================
f_175 --> 2.77409249023
f_205 --> 1.70538319659
f_61_e --> 1.24219269182
f_35 --> 0.808882249231
f_218 --> 0.759748432363
f_61_c --> 0.577622944907
f_237_Mexico --> 0.41012307336
f_94 --> 0.33767073516
f_237_Canada --> 0.309637288423
f_195 --> 0.12192988248

Save the model to disk:

save_model_name = 'Best_model.pkl'
with open(os.path.join(data_dir, save_model_name), 'wb') as f: 
    cPickle.dump(regressor, f)
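
To reuse the saved model later, it can be loaded back with cPickle. A usage sketch:

with open(os.path.join(data_dir, save_model_name), 'rb') as f:
    loaded_model = cPickle.load(f)
print loaded_model.alpha_  # regularization strength chosen by cross validation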


Saeed Partonia

Software Engineer with a passion for machine learning