Predicting Time to Merge of a Pull Request

One of the machine learning explorations within the OpenShift CI Analysis project is predicting the time to merge of a pull request (see this issue for more details). In a previous notebook we showed how to access the PR data for the openshift/origin repo, and then performed initial data analysis and feature engineering on it. We also split the time_to_merge values for the PRs into the following 10 discrete, equally populated bins, so that the task becomes a classification problem (a rough sketch of this binning follows the list):

Class 0: < 3 hrs
Class 1: < 6 hrs
Class 2: < 15 hrs
Class 3: < 24 hrs / 1 day
Class 4: < 36 hrs / 1.5 days
Class 5: < 60 hrs / 2.5 days
Class 6: < 112 hrs / ~4.5 days
Class 7: < 190 hrs / ~8 days
Class 8: < 462 hrs / ~19 days
Class 9: > 462 hrs
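As a rough illustration (not the exact code from the previous notebook), equal-frequency binning like this can be produced with pandas.qcut. The dataframe and column names below (`pr_df`, `time_to_merge`) are assumptions standing in for the engineered dataset:

```python
import pandas as pd

# Hypothetical sketch: split time_to_merge (in hours) into 10 equally
# populated bins and use the bin index (0-9) as the class label.
pr_df["ttm_class"], bin_edges = pd.qcut(
    pr_df["time_to_merge"], q=10, labels=False, retbins=True
)
print(bin_edges)  # approximate hour thresholds for the 10 classes listed above
```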

In this notebook, we will train a machine learning model to classify the time_to_merge values for PRs into one of these 10 bins (or "classes"), using the features engineered from the raw PR data.

Scale data
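A minimal sketch of this step, assuming the train/test split from the previous notebook is available as `X_train` and `X_test` (names are illustrative):

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to the test features to avoid data leakage.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```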

Define Training and Evaluation Pipeline

Here, we will define a function to train a given classifier on the training set and then evaluate it on the test set.
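The exact helper used in the notebook is not reproduced here; the following is a sketch of what such a function could look like, assuming scaled feature arrays and integer class labels:

```python
from sklearn.metrics import classification_report, confusion_matrix

def train_evaluate(clf, X_train, y_train, X_test, y_test):
    """Fit the classifier on the training set and report test-set metrics."""
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print(classification_report(y_test, preds, zero_division=0))
    print(confusion_matrix(y_test, preds))
    return clf
```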

Define Models and Parameters

Next, we will define and initialize the classifiers that we will be exploring for the time-to-merge prediction task.

Gaussian Naive Bayes
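A sketch with default settings (the project's actual configuration may differ):

```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
```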

SVM
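A sketch; the kernel and regularization strength shown are illustrative defaults, not the tuned values:

```python
from sklearn.svm import SVC

# RBF kernel with default regularization, as a placeholder configuration.
svc = SVC(kernel="rbf", C=1.0)
```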

Random Forest
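A sketch with illustrative hyperparameters:

```python
from sklearn.ensemble import RandomForestClassifier

# Number of trees and random seed are placeholders, not the project's values.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
```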

XGBoost
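A sketch of a multi-class XGBoost setup; the actual hyperparameters used in the notebook may differ:

```python
from xgboost import XGBClassifier

# Illustrative configuration for the 10-class time_to_merge problem.
xgb = XGBClassifier(objective="multi:softmax", eval_metric="mlogloss")
```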

Compare Model Results

Finally, we will train all of the classifiers defined above and evaluate their performance.

Train using all features

First, let's train the classifiers using all of the engineered features as input.
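Using the names assumed in the sketches above, the training loop could look roughly like this:

```python
# Train and evaluate each classifier on the scaled feature set.
classifiers = {
    "Gaussian Naive Bayes": gnb,
    "SVM": svc,
    "Random Forest": rf,
    "XGBoost": xgb,
}

for name, clf in classifiers.items():
    print(f"===== {name} =====")
    train_evaluate(clf, X_train_scaled, y_train, X_test_scaled, y_test)
```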

Based on the results above, all of the models outperform a random guess. The XGBoost classifier performs best, followed closely by random forest. The Naive Bayes and SVM models appear heavily biased towards a few classes. In contrast, the random forest and XGBoost models appear less biased, and their misclassifications tend to fall within the neighboring classes of the ordinal target.

Note that for model deployment (which is the eventual goal), we will also need to include any scaler or preprocessor objects, because the input to the inference service will be raw, unscaled data. We plan to address this by using an sklearn Pipeline object to package the preprocessor(s) and the model as one "combined" model. Since an XGBoost model baked into an sklearn Pipeline might be complicated to serve on a Seldon sklearn server, and since random forest performs almost as well as XGBoost, we will save the random forest as the "best" model here. In the step below, we create a copy of the model so that we can save it to S3 later on and use it for model deployment.

Train using pruned features

In the previous notebook we performed some feature engineering and pruned the number of features down to 96. However, further pruning the features based on the importances assigned to them by the models might yield more generalizable and accurate models. So in this section, we will explore using Recursive Feature Elimination (RFE) to rank the features in terms of their importance, and recursively select the best subsets to train our models with.
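A sketch of the RFE step, assuming the scaled feature arrays from above and keeping the top 20 features (the cutoff explored later in this notebook); the estimator and its hyperparameters are illustrative:

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Rank features by recursively eliminating the least important ones, using
# a random forest's feature_importances_ as the ranking criterion.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=200, random_state=42),
    n_features_to_select=20,
)
selector.fit(X_train_scaled, y_train)
X_train_rfe = selector.transform(X_train_scaled)
X_test_rfe = selector.transform(X_test_scaled)
```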

From the confusion matrices above, we can conclude that the models perform slightly better when trained using all the features, instead of using only the RFE-pruned subset.

Create sklearn Pipeline

Here, we will create an sklearn pipeline consisting of two steps: scaling of the input features, followed by the classifier itself. We will then save this pipeline as a model.joblib file on S3 for serving the model pipeline using the Seldon Sklearn Server.
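A sketch of this packaging step, using the same illustrative hyperparameters as above and fitting on the raw (unscaled) training data so the pipeline handles scaling at inference time:

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Bundle the scaler and the random forest classifier into one object,
# then serialize it in the format expected by the Seldon sklearn server.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=200, random_state=42)),
])
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model.joblib")
```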

Write Model to S3
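A hypothetical upload sketch using boto3; the endpoint, bucket, object key, and credential handling below are placeholders and will differ in the actual project setup:

```python
import os
import boto3

# Placeholder S3 configuration read from environment variables.
s3 = boto3.client(
    "s3",
    endpoint_url=os.getenv("S3_ENDPOINT_URL"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)
# Hypothetical bucket and key for the serialized pipeline.
s3.upload_file("model.joblib", os.getenv("S3_BUCKET"), "ttm-model/model.joblib")
```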

Conclusion

In this notebook, we explored various vanilla classifiers, namely Naive Bayes, SVM, Random Forest, and XGBoost. The XGBoost classifier was able to predict the classes with a weighted average f1 score of 0.21 and an accuracy of 22% when trained using all the available features. Additionally, all of the models perform better when trained using all available features than when trained using only the top 20 features determined using RFE.

Even though all models outperform the baseline (random guess), we believe there is still some room for improvement. Since the target variable of the github PR dataset is an ordinal variable, an ordinal classifier could perform better than the models trained in this notebook. We will explore this idea in a future notebook.

As the immediate next step, we will deploy the best model from this notebook as an inference service using Seldon.