Cameron Summers, Tinkerer

Oct 27, 2016

Lending Club Deep Learning

Overview

I've recently started investing with Lending Club, which is a peer-to-peer loan service that matches individuals wanting to borrow money with individuals looking to invest. As an investor, you are like your own little bank. But instead of your peer coming to you asking for money, which requires you to approve or disapprove and set the terms, s/he asks Lending Club. Lending Club assesses the risk of the loan and decides whether to approve and at what interest rate. Then you can choose from a collection of loans that have been approved, each with accompanying data such as the reason for the loan, the borrower's credit rating, and the interest rate set by Lending Club from their risk analysis. This allows you to easily invest given your personal preferences for risk and return.

As someone who regularly uses statistics and machine learning at my job (primarily in the audio domain), I naturally wondered whether they could give a regular investor like myself an edge on the Lending Club platform. Luckily, Lending Club publishes historical data sets of the loans they offer, and the remainder of this post is a walkthrough of that investigation, culminating in some nice results.

The Data

The idea is to apply this model in the near future, so we'll focus on the most recent data set provided, which as of this writing covers the loans from 2016 Q2. We'll also load the 2016 Q1 and 2015 data for additional evaluation later on.

In [1]:
import pandas as pd

data_2016q2 = '2016 Q2'
data_2016q1 = '2016 Q1'
data_2015 = '2015'

all_data = {
    data_2016q2: pd.read_csv('LoanStats3a_securev1.csv', skiprows=1, low_memory=False),
    data_2016q1: pd.read_csv('LoanStats_2016Q1.csv', skiprows=1, low_memory=False),
    data_2015: pd.read_csv('LoanStats3d.csv', skiprows=1, low_memory=False)
}

active_data_descr = data_2016q2
active_data = all_data[active_data_descr]

And to give an idea of what kind of information is available in the data set, here is a peek.

In [2]:
active_data.head()
Out[2]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit
0 1077501 1296599 5000 5000 4975 36 months 10.65% 162.87 B B2 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
1 1077430 1314167 2500 2500 2500 60 months 15.27% 59.83 C C4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
2 1077175 1313524 2400 2400 2400 36 months 15.96% 84.33 C C5 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
3 1076863 1277178 10000 10000 10000 36 months 13.49% 339.31 C C1 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
4 1075358 1311748 3000 3000 3000 60 months 12.69% 67.79 B B5 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN

5 rows × 115 columns

In [3]:
print 'Number of loans in {} dataset: {}'.format(active_data_descr, len(active_data))
Number of loans in 2016 Q2 dataset: 42542

Lending Club specifies if they have verified the information in the application or not:

In [4]:
active_data.verification_status.value_counts()
Out[4]:
Not Verified       18758
Verified           13471
Source Verified    10306
dtype: int64

The distribution of risk grades for the loans, where A is the least risky with the lowest interest rates and G is the most risky with the highest interest rates:

In [5]:
active_data.grade.value_counts()
Out[5]:
B    12389
A    10183
C     8740
D     6016
E     3394
F     1301
G      512
dtype: int64

The distribution of statuses for loans. Charged Off essentially means the loan has defaulted. We'll be focusing on modeling Fully Paid and Charged Off loans.

In [6]:
active_data.loan_status.value_counts()
Out[6]:
Fully Paid                                             33586
Charged Off                                             5653
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Current                                                  513
In Grace Period                                           16
Late (31-120 days)                                        12
Late (16-30 days)                                          5
Default                                                    1
dtype: int64

We can define a function that will do some light cleaning of the data since we'll need numeric representations for our modeling. Of note here is that we're filtering for loans where there is a definitive result, either Charged Off or Fully Paid.

In [7]:
def clean(data, grades):
    
    # Filter for grades and target variables
    grade_mask = data.grade.isin(grades)
    y_mask = data.loan_status.isin(['Charged Off', 'Fully Paid'])
    data_modified = data[grade_mask & y_mask].copy()
    
    # Some cleaning: convert the string columns we need into numeric form
    data_modified.replace('n/a', np.nan, inplace=True)
    # Target variable: 1 = fully paid, 0 = charged off
    data_modified.loan_status = \
        data_modified.loan_status.map({'Fully Paid': 1, 'Charged Off': 0})
    data_modified.verification_status = \
        data_modified.verification_status.map({'Source Verified': 0, 
                                               'Verified': 1, 'Not Verified': 2})
    data_modified.term = data_modified.term.map({'60 months': 1, '36 months': 0})
    # Strip the '%' signs so the rate columns can be treated as floats
    data_modified.int_rate = data_modified.int_rate.str.replace('%', '').astype(float)
    data_modified.revol_util = data_modified.revol_util.str.replace('%', '').astype(float)
    # Employment length: keep just the digits (e.g. '10+ years' -> 10)
    data_modified.emp_length.fillna(value=0, inplace=True)
    data_modified.emp_length.replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
    data_modified.emp_length = data_modified.emp_length.astype(int)
    # Fill any remaining missing values with 0
    data_modified.fillna(value=0, inplace=True)
    
    return data_modified
In [8]:
LOAN_GRADES = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
In [9]:
%matplotlib inline

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")

from keras.optimizers import RMSprop
from keras.models import Sequential
from keras.layers import Dense, Activation

from sklearn.preprocessing import StandardScaler, binarize

from numpy.random import RandomState
rstate = RandomState(1234)
Using Theano backend.

The Goal

It is important to clearly define a goal so we can properly evaluate the results. For this post here's the goal:

  • Use a machine learning model to create a portfolio of 100 loans with a higher return than a Lending Club baseline

Lending Club's promotional materials tout that portfolios with at least 100 loans will have low volatility and solid returns. A sample size of 100 should be large enough to be representative of the loan population, so we'll use that. As for a Lending Club baseline return, we can calculate the return for the whole data set by multiplying the probability of loan payback by the loan interest rate, E[Return] = Pr(payback) * InterestRate.

Note that this is not the real return on the loan because we are not taking into account the loss of principal when a loan defaults. But that loss varies (e.g. the borrower may default after a few payments or after many) and modeling it would increase our complexity substantially, so we'll use the formula above as our proxy. If we assume that the loss of principal is independent and identically distributed (a convenient assumption for a data scientist!), then our proxy will be proportional to the real return, which is all we need for comparisons.
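Just to make the proxy concrete before running it on the real data, here's a toy example with made-up numbers: four hypothetical loans, where loan_status is 1 for fully paid and 0 for charged off.

import pandas as pd

# Hypothetical mini-portfolio (made-up numbers, not Lending Club data)
toy = pd.DataFrame({
    'int_rate':    [7.0, 12.0, 18.0, 25.0],   # interest rates in percent
    'loan_status': [1,   1,    0,    1],      # 1 = fully paid, 0 = charged off
})

# Proxy return: average of int_rate * payback indicator over the portfolio
proxy_return = toy.int_rate.multiply(toy.loan_status).mean()
print proxy_return  # (7 + 12 + 0 + 25) / 4 = 11.0

The loan that charged off contributes nothing, so defaults drag the average down from the raw mean interest rate. The real calculation below is exactly this, just over the cleaned 2016 Q2 loans.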

In [29]:
active_data_clean = clean(active_data, LOAN_GRADES)
active_data_clean.int_rate.multiply(active_data_clean.loan_status).mean()
Out[29]:
9.989008129667404

Ok, so we have a roughly 10% proxy return for the total loan population. And just so we feel good about using 100 loans as a representative sample, let's plot a histogram of the return for 100 randomly selected loans, repeated 5000 times.

In [30]:
sample_returns = []
for i in range(5000):
    sample_indices = np.random.randint(0, len(active_data_clean), (100))
    sample_mask = np.zeros(shape=(len(active_data_clean)))
    sample_mask[sample_indices] = 1
    sample_mask = sample_mask.astype(bool)
    sample_data = active_data_clean[sample_mask]
    our_return = sample_data.int_rate.multiply(sample_data.loan_status).mean()
    sample_returns.append(our_return)
plt.hist(sample_returns)
plt.ylabel('Number of occurrences')
plt.xlabel('Effective Interest Rate')
plt.show()

Looks good to me. The mean of the effective interest rate in the histogram is right around the return for the whole loan population. And we have a pretty narrow distribution, so we won't need to worry much that we'd accidentally pick 100 loans that all default. With more loans, this distribution would get narrower, eventually converging to the value we calculated above.

In [12]:
MIN_LOANS = 100

Lending Club has already done some risk analysis prior to our selecting the loans, so we will look to layer our model on top of this to bolster our return. Because our return for a given loan is the probability of loan payback multiplied by the loan interest rate, E[Return] = Pr(payback) * InterestRate, we can increase this product by increasing either factor while avoiding decreasing the other. This can be difficult if Lending Club estimates the risk well, since a higher interest rate will tend to come with a lower probability of payback, but let's see what we can do. From a machine learning standpoint, the typical approach to this problem is to develop a model that can predict whether a loan is likely to default, such as here and here. This can be used to increase Pr(payback), so let's start there.
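To make that tension concrete with made-up numbers: a very safe loan at 7% interest with a 97% chance of payback has a proxy return of 0.97 * 7 = 6.79%, while a risky loan at 25% interest with only a 75% chance of payback yields 0.75 * 25 = 18.75%. Reducing risk only helps if it doesn't cost us too much interest rate, and that trade-off will show up again in the strategies below.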

Strategy #1: Reduce Risk

Our risk is derived from determining whether a loan will default, so we'll train our model in a supervised manner where the target, y, is a 0 if the loan defaulted and a 1 if the loan was fully paid back. Then for a new loan, our model will output a value between 0 and 1 that we can interpret as the probability that the loan will be fully paid back.

Next we'll need some input features, X, to train with. There has been other analysis online around feature selection for this data, so I'm going to cherry-pick some of those features rather than do my own feature selection analysis, which lets us focus on modeling strategies. Many of these are straightforward, such as annual income (annual_inc) and debt-to-income ratio (dti), but you can read more on them at the source.

In [13]:
FEATURES = [
            'int_rate', 
            'revol_util',
            'annual_inc', 
            'dti', 
            'delinq_2yrs', 
            'loan_amnt', 
            'revol_bal',
            'total_acc', 
            'verification_status',
            'open_acc',
            'pub_rec',
            'chargeoff_within_12_mths',
            'pub_rec_bankruptcies',
            'tax_liens',
            'emp_length',
            'term'
    ]
In [14]:
overall_best_return_cs = {}
overall_best_return_lc = {}

And then we can define functions for splitting our data into train and test sets, for training the model, and for predicting on the test set. Of note here is the simplicity of the model: a network with two hidden layers of 50 units each, roughly 3x the number of input features. I arrived at this architecture through some trial and error since it produced pretty good results. There is probably room to optimize the hyperparameters of the model, but that's outside the scope of this post.

In [15]:
def split(data):
    
    # Split into 80/20 train and test
    msk = rstate.rand(len(data)) < 0.8
    train = data[msk]
    test = data[~msk]
    
    return train, test
    
def train_model_reduce_risk(train_data, verbose=1):
    
    opt = RMSprop(lr=0.001)
    clf = Sequential()
    clf.add(Dense(50, input_dim=len(FEATURES)))
    clf.add(Activation('relu'))
    clf.add(Dense(50, input_dim=50))
    clf.add(Activation('relu'))
    clf.add(Dense(1, input_dim=50))
    clf.add(Activation('sigmoid'))
    clf.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'],
              )
    
    scl = StandardScaler()
    
    X_train = train_data[FEATURES]
    y_train = train_data.loan_status

    X_train = scl.fit_transform(X_train)  
    
    clf.fit(X_train, y_train, 
            nb_epoch=15,
            validation_split=0.1,
            verbose=verbose
           )
    
    return clf, scl

def predict_reduce_risk(clf, scl, test):
    
    X_test = test[FEATURES]
    X_test = scl.transform(X_test)
    y_test = test.loan_status
    y_pred = clf.predict(X_test)
    
    return y_test, y_pred

Next, we need a way to evaluate our model's effectiveness at solving our problem. It's useful to think of our model as a filter for the loans: we set a threshold on the model output and accept any loan whose score is above it. Then we can evaluate that population of accepted loans for its risk and return. For those familiar with binary classification, the precision of our model at any given threshold is equivalent to the payback probability of the accepted loans.
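Here's a minimal sketch of that idea at a single threshold (a simplified version of the full evaluation function below, assuming test and y_pred come from the split and predict functions defined above):

from sklearn.preprocessing import binarize

def payback_and_return_at_threshold(test, y_pred, th):
    # Accept only loans whose predicted score exceeds the threshold
    accepted = binarize(y_pred, th).astype(bool)[:, 0]
    # Precision = fraction of accepted loans that were fully paid = Pr(payback)
    payback_rate = test.loan_status[accepted].mean()
    # Proxy return on the accepted loans: mean of int_rate * payback indicator
    proxy_return = test.int_rate[accepted].multiply(test.loan_status[accepted]).mean()
    return payback_rate, proxy_return

The function below sweeps this calculation over a range of thresholds and plots the results, along with how many loans survive the filter at each threshold.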

In [16]:
def eval_max_avg_int_rate(test, y_test, y_pred, min_loans=MIN_LOANS, descr='', plot=True):

    return_cs = []  # our return at all thresholds
    num_loans = []  # number of loans at all thresholds
    charge_off_rate = [] # our risk at all thresholds
    avg_int_rate = []  # mean of interest rates for all recommended loans at all thresholds
    low_num_loan_risk_thresh = 1.0  # thresh where we have fewer than 100 loans
    low_num_loan_risk_reached = False
    best_return_cs = 0.0
    best_return_cs_thresh = 0.0
    zero_loans_thresh = 1.0  # thresh where we aren't recommending any loans
    zero_loans_reached = False
    thresh = np.arange(0.0, 1.0, 0.01)
    
    # Lending club effective return on all loans
    if len(test) > min_loans:
        return_lc = test.int_rate.multiply(test.loan_status).mean()
    else:
        return_lc = 0
    
    # sweep the thresholds
    for i, th in enumerate(thresh):
        y_pred_bin = binarize(y_pred, th)
                
        num_loans_for_thresh = np.sum(y_pred_bin)
        num_loans.append(num_loans_for_thresh)
            
        # Once we hit this, we'll have zero loans for the rest of the thresh sweep
        if num_loans_for_thresh == 0:
            avg_int_rate.append(None)
            return_cs.append(None)
            charge_off_rate.append(None)
            if not zero_loans_reached:
                zero_loans_thresh = th
                zero_loans_reached = True
            continue
        
        if num_loans_for_thresh < min_loans and not low_num_loan_risk_reached:
            low_num_loan_risk_thresh = th
            low_num_loan_risk_reached = True
                    
        payback_rate = float(test.loan_status[y_pred_bin.astype(bool)[:, 0]].sum()) / \
                        len(test.loan_status[y_pred_bin.astype(bool)[:, 0]])      
        return_at_thresh = test[y_pred_bin.astype(bool)].int_rate \
                           .multiply(test.loan_status[y_pred_bin.astype(bool)[:, 0]]).mean()
        
        return_cs.append(return_at_thresh)
        avg_int_rate.append(test[y_pred_bin.astype(bool)].int_rate.mean())
        charge_off_rate.append(1-payback_rate)
                
        if return_at_thresh > best_return_cs and num_loans_for_thresh > min_loans:
            best_return_cs_thresh = th
            best_return_cs = return_at_thresh
                  
    if plot:
        fig,(ax1,ax3) = plt.subplots(2, sharex=True, figsize=(8,8))

        plt.suptitle('Classifier Performance {}'.format(descr))
        ax1.plot(thresh, return_cs, color='g', label='Effective Return')
        ax1.plot(thresh, avg_int_rate, color='g', linestyle='--', label='Avg Int Rate')
        ax1.plot((low_num_loan_risk_thresh, low_num_loan_risk_thresh), (0, np.max(return_cs)),
                  color='purple', linestyle='--', label='{} loans'.format(min_loans))
        ax1.plot((zero_loans_thresh, zero_loans_thresh), (0, np.max(return_cs)),
                  color='black', linestyle='--', label='0 loans')
        ax1.plot((best_return_cs_thresh, best_return_cs_thresh), (0, np.max(return_cs)), 
                color='orange', linestyle='-.', label='Best Return')
        ax1.set_ylabel('Return on Recommended Loans')
        ax1.legend(loc='lower left')
        for tl in ax1.get_yticklabels():
            tl.set_color('g')
        ax2 = ax1.twinx()
        ax2.plot(thresh, charge_off_rate, color='r', linestyle='--', label='Risk')
        ax2.set_ylabel('Charge Off Rate')
        ax2.legend()
        for tl in ax2.get_yticklabels():
            tl.set_color('r')

        ax3.set_ylabel('Num Loans Recommended')
        ax3.set_xlabel('Classifier Threshold')
        ax3.plot(thresh, num_loans, color='b')

        plt.show()
        
    y_pred_bin_best = binarize(y_pred, best_return_cs_thresh)
    print 'Best Return Grade Distribution: \n{}' \
            .format(test[y_pred_bin_best.astype(bool)].grade.value_counts())
        
    return best_return_cs, return_lc 

And last, let's set up another function so we can compare our returns against those of Lending Club.

In [17]:
def compare_returns(return_cs, return_lc, descr='tmp', save=False):
    fig,ax = plt.subplots()
    ind = np.arange(len(return_cs))
    width = 0.35
    plt.title(descr)
    ax.bar(ind, [return_cs[k] for k in sorted(return_cs.keys())], width * 0.9, 
           label='CSummers', color='g')
    ax.bar(ind + width, [return_lc[k] for k in sorted(return_lc.keys())], width * 0.9, 
           label='Lending Club', color='grey')
    ax.set_xticks(ind + width)
    ax.set_ylabel('Return Rate')
    ax.set_xticklabels(sorted(return_cs.keys()))
    plt.legend()
    if save:
        plt.savefig(descr)
    plt.show()

So now we can finally train our risk reduction model and evaluate its effectiveness at boosting our return.

In [18]:
return_strat1_cs = {}
return_strat1_lc = {}

data_descr = '{}'.format(active_data_descr)
data_clean = clean(active_data, LOAN_GRADES)
train_data, test_data = split(data_clean)
model_strat1, scl_strat1 = train_model_reduce_risk(train_data, verbose=0)
In [19]:
y_test, y_pred = predict_reduce_risk(model_strat1, scl_strat1, test_data)
return_strat1_cs[data_descr], return_strat1_lc[data_descr] = \
    eval_max_avg_int_rate(test_data, y_test, y_pred, descr=data_descr)
overall_best_return_cs['Strat1/Baseline'] = return_strat1_cs[data_descr]
overall_best_return_lc['Strat1/Baseline'] = return_strat1_lc[data_descr]
Best Return Grade Distribution: 
B    2334
A    2074
C    1570
D    1017
E     550
F     194
G      60
dtype: int64

In the top graph above, we can see how the risk, average interest rate, and return change as we sweep the threshold on the output of our classifier, gradually recommending fewer loans (bottom graph) that, as a population, should have a lower likelihood of default if our model is learning properly. Our baseline Lending Club return is at threshold 0, which means we recommend all of the loans, and it shows the 10% we saw earlier. Note: Pr(payback) = 1 - Charge Off Rate.

Well, as we start filtering loans you can see that our risk begins falling (red dotted line). Good! This means the model is effectively figuring out which loans are likely to be fully paid back and recommending them. But the main takeaway here is that there is no threshold at which our model recommends loans where the return (solid green line) is higher than what's provided by the original loan population at threshold 0. What gives?

The problem is that our average interest rate drops as well (green dotted line), indicating that our model is succumbing to a bias toward safe loans, for which Lending Club has already set low interest rates. So in our bid to reduce risk without taking the associated interest rate into account, we end up with very safe loans and a lower return, as shown in the grade distribution of the recommended loans at the best threshold. Let's compare the best return from the model (yellow dotted line) to that of the Lending Club baseline at threshold 0.

In [20]:
compare_returns(return_strat1_cs, return_strat1_lc, 
                descr='{} Best Return Strategy 1'.format(active_data_descr))

So simply reducing the risk of loan default doesn't help us that much. Let's try another strategy.

Strategy #2: Reduce Risk on High-Interest Loans

Since we now have this nice model for identifying loan defaults, why not just apply it to subsets of loans with higher interest rates? Lending Club has assigned a grade to each loan as a measure of its riskiness, so we can try to filter each of these groups of loans the same way we filtered the whole loan population.

In [21]:
return_strat2_cs = {}
return_strat2_lc = {}
for gr in LOAN_GRADES:
    data_descr = '{} {}'.format(active_data_descr, gr)
    data_clean = clean(active_data, [gr])
    y_test, y_pred = predict_reduce_risk(model_strat1, scl_strat1, data_clean)
    return_strat2_cs[data_descr], return_strat2_lc[data_descr] = \
        eval_max_avg_int_rate(data_clean, y_test, y_pred, descr=data_descr)
overall_best_return_cs['Strat2'] = max(return_strat2_cs.values())
overall_best_return_lc['Strat2'] = max(return_strat2_lc.values())
Best Return Grade Distribution: 
A    9634
dtype: int64
Best Return Grade Distribution: 
B    4538
dtype: int64
Best Return Grade Distribution: 
C    310
dtype: int64
Best Return Grade Distribution: 
D    245
dtype: int64
Best Return Grade Distribution: 
E    371
dtype: int64
Best Return Grade Distribution: 
F    140
dtype: int64
Best Return Grade Distribution: 
G    111
dtype: int64

Cool, now we're getting somewhere! The graphs for loans with grades A and B look a lot like Strategy 1, but for the riskier loans our best return (yellow dotted line) increased by reducing their default rates! We can see this a little more clearly when we compare the model against the Lending Club returns for each subset of loans.

In [22]:
compare_returns(return_strat2_cs, return_strat2_lc, 
                descr='{} Best Return Per Grade Strategy 2'.format(active_data_descr))

So using this strategy you can see that as a Lending Club investor without a machine learning model, you probably want to invest in the riskier loans anyway. Their higher likelihood of default is outweighed by their higher interest rates, yielding a higher return - 14% for the Lending Club grade G loans. But our model recommends the loans with lower default rates within the F and G grades, giving a nice boost in returns.

Strategy #3: Return Rate Optimization

While it's natural to focus on reducing the default rate of our loans, perhaps we can do better by optimizing for the return directly. Rather than having our target, y, be a 0 for default and 1 for full repayment, we can structure things as a regression problem where each loan's target is its payback outcome multiplied by its interest rate (normalized to [0, 1]), so the model learns to predict E[Return] = Pr(payback) * InterestRate directly. The new train function below does just that. Note we're also changing our loss function to mean squared error, which is more appropriate for regression.

In [23]:
def train_model_mo_money(train_data, verbose=1):
    
    opt = RMSprop(lr=0.01)
    clf = Sequential()
    clf.add(Dense(50, input_dim=len(FEATURES)))
    clf.add(Activation('relu'))
    clf.add(Dense(50, input_dim=50))
    clf.add(Activation('relu'))
    clf.add(Dense(1, input_dim=50))
    clf.add(Activation('sigmoid'))
    clf.compile(optimizer=opt,
              loss='mse')
    
    scl = StandardScaler()
    
    X_train = train_data[FEATURES]
    y_train = train_data.loan_status.multiply(train_data.int_rate)
    y_train = y_train / y_train.max()

    X_train = scl.fit_transform(X_train)  
    
    clf.fit(X_train, y_train, 
            nb_epoch=15,
            validation_split=0.1,
           verbose=verbose)
    
    return clf, scl


def predict_mo_money(clf, scl, test):
    
    X_test = test[FEATURES]
    X_test = scl.transform(X_test)
    
    y_test = None
    if test.int_rate is not None:
        y_test = test.loan_status.multiply(test.int_rate)
        y_test = y_test / y_test.max()
        
    y_pred = clf.predict(X_test)
    
    return y_test, y_pred
    

Now let's train this new model using all the loans and see what we get.

In [24]:
return_strat3_cs = {}
return_strat3_lc = {}

data_descr = "{} Strategy 3".format(active_data_descr)
data_clean = clean(active_data, LOAN_GRADES)
train_data, test_data = split(data_clean)
model_strat3, scl_strat3 = train_model_mo_money(train_data, verbose=0)
y_test, y_pred = predict_mo_money(model_strat3, scl_strat3, test_data)
return_strat3_cs[data_descr], return_strat3_lc[data_descr] = \
    eval_max_avg_int_rate(test_data, y_test, y_pred, descr=data_descr)
    
overall_best_return_cs['Strat3'] = return_strat3_cs[data_descr]
overall_best_return_lc['Strat3'] = return_strat3_lc[data_descr]
Best Return Grade Distribution: 
F    36
E    27
D    19
G    16
C     7
B     1
A     1
dtype: int64

It works! Now as we use our model to filter the loans from the baseline, we are directly increasing our return rate. The model is simultaneously weighing the probability of default against each loan's interest rate, which is why the risk curve is more wobbly. You can see the grade distribution of the recommended loans is a nice varied collection weighted toward the loans Lending Club considers higher risk. And to compare against Lending Club:

In [25]:
compare_returns(return_strat3_cs, return_strat3_lc, 
                descr='{} Best Return, Strategy 3'.format(active_data_descr))

Conclusion

We've looked at three different strategies for applying machine learning to optimize return rates on loans through Lending Club. To recap: Strategy 1 used our model to filter out risky loans from the whole population, while the Lending Club baseline was the return of the natural distribution of loans. Strategy 2 conditioned the initial population to be groups of loans considered risky by Lending Club; the machine learning model then filtered those risky loans, while the Lending Club baseline was again the return of the conditioned distribution. And finally, Strategy 3 used our model to optimize the return directly, again compared against the return of the natural distribution. Let's look at them side by side:

In [26]:
compare_returns(overall_best_return_cs, overall_best_return_lc, 
                descr='{} Best Return, Strategy Comparison'.format(active_data_descr),
                save=True)

It is clear that Strategies 2 and 3 outperform Strategy 1 by a wide margin. For Strategies 2 and 3 the model outperforms the Lending Club baseline, though simply investing in the riskiest loans without machine learning already gives a significant boost. So if we were looking for the highest return, the ranking would be Strategy 2 with the model, then Strategy 3 with the model, and then Strategy 2 without the model, simply concentrating in the riskiest loans.

While Strategy 2 with the model offers the highest overall return, there is a potential problem in relying too heavily on Lending Club's analysis. As has happened before, the loan population could shift or Lending Club could suddenly change how they assess risk, and Strategy 2 without the model is particularly susceptible to this. Strategy 3 offers a slightly lower return, but spreads the loans over a variety of Lending Club grades and potentially does a better job of balancing the natural loan default rates against the Lending Club interest rates. Just for good measure, let's redo the entire analysis on the two other recent data sets, 2016 Q1 and 2015.

In [27]:
from IPython.display import Image
Image(filename='overall_2016q1.png')
Out[27]:
In [28]:
Image(filename='overall_2015.png')
Out[28]:

The story is fairly consistent for the other datasets in terms of the relative returns between strategies, which is nice to see when evaluating your models!

That's all for now. I hope this was interesting and useful to some folks who are curious about how machine learning might fit into a peer-to-peer loan investment strategy. Feel free to drop me a line with comments or questions.