Predicting Time to Merge of a Pull Request¶
One of the machine learning explorations within the OpenShift CI Analysis project is predicting time to merge of a pull request (see this issue for more details). In a previous notebook we showed how to access the PR data for the openshift/origin repo, and then performed initial data analysis as well as feature engineering on it. Furthermore, we also split the time_to_merge
values for the PRs into the following 10 discrete, equally populated bins, so that this task becomes a classification problem:
Class 0 : < 3 hrs
Class 1 : < 6 hrs
Class 2 : < 15 hrs
Class 3 : < 24 hrs / 1 day
Class 4 : < 36 hrs / 1.5 days
Class 5 : < 60 hrs / 2.5 days
Class 6 : < 112 hrs / ~4.5 days
Class 7 : < 190 hrs / ~8 days
Class 8 : < 462 hrs / ~19 days
Class 9: > 462 hrs
In this notebook, we will train a machine learning model to classify the time_to_merge
values for PRs into one of these 10 bins (or "classes"), using the features engineered from the raw PR data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | import os import sys import boto3 import warnings import tempfile from io import BytesIO from copy import deepcopy from joblib import dump, load import numpy as np import seaborn as sns import matplotlib.pyplot as plt from sklearn.feature_selection import RFE from sklearn.preprocessing import PowerTransformer from sklearn.model_selection import train_test_split from sklearn.svm import SVC from xgboost import XGBClassifier from sklearn.naive_bayes import GaussianNB from sklearn.ensemble import RandomForestClassifier from sklearn.pipeline import Pipeline from sklearn.metrics import confusion_matrix from sklearn.metrics import classification_report from dotenv import load_dotenv, find_dotenv metric_template_path = "../data-sources/TestGrid/metrics" if metric_template_path not in sys.path: sys.path.insert(1, metric_template_path) from ipynb.fs.defs.metric_template import ( # noqa: E402 CephCommunication, # noqa: E402 ) # noqa: E402 |
1 2 | warnings.filterwarnings("ignore") load_dotenv(find_dotenv()) |
True
1 2 3 4 5 6 7 8 9 | ## CEPH Bucket variables ## Create a .env file on your local with the correct configs, s3_endpoint_url = os.getenv("S3_ENDPOINT") s3_access_key = os.getenv("S3_ACCESS_KEY") s3_secret_key = os.getenv("S3_SECRET_KEY") s3_bucket = os.getenv("S3_BUCKET") s3_path = "github" REMOTE = os.getenv("REMOTE") INPUT_DATA_PATH = "../../../data/processed/github" |
1 2 3 4 5 6 7 8 9 10 | if REMOTE: cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket) ttm_dataset = cc.read_from_ceph(s3_path, "ttm_dataset.parquet") else: print( "The ttm_dataset.parquet file is not included in the ocp-ci-analysis github repo." ) print( "Please set REMOTE=1 in the .env file and read this data from the S3 bucket instead." ) |
1 2 3 | # extract X and y from dataset X = ttm_dataset.drop(columns=["ttm_class", "time_to_merge"]) y = ttm_dataset["ttm_class"] |
1 2 3 4 | # split into train and test sets X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 ) |
1 2 3 4 5 6 | # upload X_test and y_test to S3 bucket for testing / running sanity check on the model inference service ret = cc.upload_to_ceph(X_test, s3_path, "X_test.parquet") print(ret["ResponseMetadata"]["HTTPStatusCode"]) ret = cc.upload_to_ceph(y_test.to_frame(), s3_path, "y_test.parquet") print(ret["ResponseMetadata"]["HTTPStatusCode"]) |
200 200
1 2 3 | # convert from pandas series to lists to avoid warnings during training y_train = y_train.to_list() y_test = y_test.to_list() |
Scale data¶
1 2 3 4 5 | # lets apply a yeo johnson transform to try to make the data more gaussian scaler = PowerTransformer() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) |
Define Training and Evaluation Pipeline¶
Here, we will define a function to train a given classifier on the training set and then evaluate it on the test set.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | def train_evaluate_classifier(clf, xtrain, ytrain, xtest, ytest): # Train our classifier clf.fit(xtrain, ytrain) # Make predictions preds = clf.predict(xtest) # View classification report print(classification_report(ytest, preds)) # Plot confusion matrix heatmap plt.figure(figsize=(16, 12)) cf_matrix = confusion_matrix(ytest, preds) group_counts = ["{0:0.0f}\n".format(value) for value in cf_matrix.flatten()] group_percentages = [ "{0:.2%}".format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix) ] box_labels = [ f"{v1}{v2}".strip() for v1, v2 in zip(group_counts, group_percentages) ] box_labels = np.asarray(box_labels).reshape(cf_matrix.shape[0], cf_matrix.shape[1]) sns.heatmap(cf_matrix, cmap="OrRd", annot=box_labels, fmt="") plt.xlabel("Predicted TTM Label") plt.ylabel("True TTM Label") plt.title("Confusion Matrix Heatmap") |
Define Models and Parameters¶
Next, we will define and initialize the classifiers that we will be exploring for the time-to-merge prediction task.
Gaussian Naive Bayes¶
1 2 | # Initialize classifier gnb = GaussianNB() |
SVM¶
1 2 | # Initialize classifier svc = SVC(random_state=42) |
Random Forest¶
1 2 3 4 5 6 7 | # Initialize classifier rf = RandomForestClassifier( n_estimators=200, max_features=0.75, random_state=42, n_jobs=-1, ) |
XGBoost¶
1 2 3 4 5 6 7 8 | # Initialize classifier xgbc = XGBClassifier( n_estimators=125, learning_rate=0.1, random_state=42, verbosity=1, n_jobs=-1, ) |
Compare Model Results¶
Finally, we will run the train all of the classifiers defined above and evaluate their performance.
Train using all features¶
First, lets train the classifiers using all the engineered features as input.
1 | train_evaluate_classifier(gnb, X_train_scaled, y_train, X_test_scaled, y_test) |
precision recall f1-score support 0 0.18 0.05 0.08 249 1 0.08 0.95 0.15 217 2 0.00 0.00 0.00 364 3 0.12 0.01 0.02 240 4 0.00 0.00 0.00 275 5 0.00 0.00 0.00 236 6 0.00 0.00 0.00 333 7 0.25 0.01 0.01 270 8 0.22 0.01 0.01 260 9 0.22 0.07 0.10 262 accuracy 0.09 2706 macro avg 0.11 0.11 0.04 2706 weighted avg 0.10 0.09 0.03 2706
1 | train_evaluate_classifier(svc, X_train_scaled, y_train, X_test_scaled, y_test) |
precision recall f1-score support 0 0.26 0.48 0.33 249 1 0.67 0.01 0.02 217 2 0.20 0.37 0.26 364 3 0.12 0.10 0.11 240 4 0.18 0.08 0.11 275 5 0.11 0.03 0.05 236 6 0.20 0.26 0.22 333 7 0.14 0.05 0.07 270 8 0.18 0.22 0.20 260 9 0.22 0.27 0.24 262 accuracy 0.20 2706 macro avg 0.23 0.19 0.16 2706 weighted avg 0.22 0.20 0.17 2706
1 | train_evaluate_classifier(rf, X_train_scaled, y_train, X_test_scaled, y_test) |
precision recall f1-score support 0 0.31 0.42 0.36 249 1 0.14 0.10 0.12 217 2 0.23 0.27 0.25 364 3 0.15 0.17 0.16 240 4 0.13 0.10 0.11 275 5 0.14 0.10 0.12 236 6 0.23 0.23 0.23 333 7 0.16 0.14 0.15 270 8 0.18 0.17 0.17 260 9 0.23 0.28 0.25 262 accuracy 0.20 2706 macro avg 0.19 0.20 0.19 2706 weighted avg 0.19 0.20 0.20 2706
1 | train_evaluate_classifier(xgbc, X_train_scaled, y_train, X_test_scaled, y_test) |
[14:04:24] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. precision recall f1-score support 0 0.28 0.43 0.34 249 1 0.09 0.05 0.06 217 2 0.22 0.27 0.25 364 3 0.16 0.17 0.16 240 4 0.18 0.11 0.14 275 5 0.18 0.13 0.15 236 6 0.25 0.30 0.28 333 7 0.17 0.13 0.14 270 8 0.20 0.21 0.21 260 9 0.26 0.31 0.28 262 accuracy 0.22 2706 macro avg 0.20 0.21 0.20 2706 weighted avg 0.20 0.22 0.21 2706
Based on the results above, it seems like all the models outperform a random guess. The XGBoost classifier outperforms all others, followed closely by random forest. Furthermore, the Naive bayes and SVM models seem to be heavily biased towards a few classes. On the contrary, the random forest andthe XGBoost models seem to be less biased and have mis-classifications within the bordering classes amongst the ordinal classes.
Note that for model deployment (which is the eventual goal), we will also need to include any scaler or preprocessor objects. This is because the input to the inference service will be raw unscaled data. We plan to address this issue by using sklearn.Pipeline
object to package the preprocessor(s) and model as one "combined" model. Since an XGBoost model baked into an sklearn.Pipeline object might be complicated to serve on a seldon sklearn server, and since random forest performs almost as well as xgboost, we will save the random forest as the "best" model here. In the step below, we create a copy of the model so that we can save it to S3 later on and use it for model deployment.
1 2 | # create a clone (create a copy of the object with the learned weights) selected_model = deepcopy(rf) |
1 2 | # sanity check print(classification_report(y_test, selected_model.predict(X_test_scaled))) |
precision recall f1-score support 0 0.31 0.42 0.36 249 1 0.14 0.10 0.12 217 2 0.23 0.27 0.25 364 3 0.15 0.17 0.16 240 4 0.13 0.10 0.11 275 5 0.14 0.10 0.12 236 6 0.23 0.23 0.23 333 7 0.16 0.14 0.15 270 8 0.18 0.17 0.17 260 9 0.23 0.28 0.25 262 accuracy 0.20 2706 macro avg 0.19 0.20 0.19 2706 weighted avg 0.19 0.20 0.20 2706
Train using pruned features¶
In the previous notebook we performed some feature engineering and pruned the number of features down to 96. However, it might be possible that further pruning the features based on the importances given to them by the models yields more generalizable and accurate models. So in this section, we will explore using Recursive Feature Elimination (RFE) to rank the features in terms of their importance, and recursively select the best subsets to train our models with.
1 2 3 | # use the xgboost classifier as the base estimator since it had the highest f1 selector = RFE(xgbc, n_features_to_select=20, step=5) selector = selector.fit(X_train_scaled, y_train) |
[14:04:52] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:05:28] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:06:12] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:07:00] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:07:42] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:08:23] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:09:05] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:09:32] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:10:00] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:10:25] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:10:50] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:11:13] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:11:35] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:11:55] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:12:14] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:12:32] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [14:12:50] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
1 2 | # No of top features to select top = 20 |
1 2 | ranks = selector.ranking_ ranks |
array([ 1, 1, 1, 3, 3, 1, 1, 15, 1, 2, 1, 1, 4, 3, 1, 1, 1, 1, 6, 5, 6, 7, 5, 7, 7, 7, 1, 5, 6, 9, 5, 9, 4, 1, 11, 12, 1, 13, 8, 15, 16, 12, 10, 5, 1, 11, 9, 17, 17, 8, 8, 10, 15, 17, 10, 13, 10, 17, 10, 14, 9, 11, 14, 14, 17, 15, 9, 8, 16, 14, 13, 15, 12, 12, 16, 7, 13, 12, 16, 14, 16, 4, 1, 1, 11, 4, 3, 8, 3, 11, 1, 1, 6, 6, 4, 13])
1 2 | cols = X_train.columns.to_numpy() cols |
array(['size', 'is_reviewer', 'is_approver', 'created_at_day', 'created_at_month', 'created_at_weekday', 'created_at_hour', 'change_in_.github', 'change_in_docs', 'change_in_pkg', 'change_in_test', 'change_in_vendor', 'change_in_root', 'changed_files_number', 'body_size', 'num_prev_merged_prs', 'commits_number', 'filetype_.go', 'filetype_.json', 'filetype_.1', 'filetype_.sh', 'filetype_.md', 'filetype_.yaml', 'filetype_BUILD', 'filetype_.proto', 'filetype_.yml', 'filetype_.html', 'filetype_.adoc', 'filetype_Dockerfile', 'filetype_LICENSE', 'filetype_Makefile', 'filetype_.txt', 'filetype_.gitignore', 'filetype_oc', 'filetype_.s', 'filetype_.mod', 'filetype_openshift', 'filetype_.sum', 'filetype_.conf', 'filetype_.bats', 'filetype_.feature', 'filetype_.xml', 'filetype_.crt', 'filetype_.spec', 'filetype_.template', 'filetype_AUTHORS', 'filetype_.service', 'filetype_.key', 'filetype_run', 'filetype_.mk', 'filetype_oadm', 'filetype_.rhel', 'filetype_.cert', 'filetype_result', 'filetype_MAINTAINERS', 'filetype_README', 'filetype_NOTICE', 'filetype_.c', 'filetype_.bash', 'filetype_.pl', 'filetype_Vagrantfile', 'filetype_.centos7', 'filetype_CONTRIBUTORS', 'filetype_.gz', 'filetype_cert', 'filetype_key', 'filetype_.files_generated_oc', 'filetype_Readme', 'filetype_.empty', 'filetype_PATENTS', 'filetype_.files_generated_openshift', 'filetype_.sec', 'filetype_VERSION', 'filetype_.ini', 'filetype_.mailmap', 'filetype_.markdown', 'filetype_test', 'filetype_.sysconfig', 'filetype_.yaml-merge-patch', 'filetype_.gitattributes', 'filetype_.signature', 'title_wordcount_add', 'title_wordcount_bug', 'title_wordcount_bump', 'title_wordcount_diagnostics', 'title_wordcount_disable', 'title_wordcount_fix', 'title_wordcount_haproxy', 'title_wordcount_oc', 'title_wordcount_publishing', 'title_wordcount_revert', 'title_wordcount_router', 'title_wordcount_sh', 'title_wordcount_staging', 'title_wordcount_support', 'title_wordcount_travis'], dtype=object)
1 2 | indices_by_ranks = ranks.argsort() indices_by_ranks |
array([ 0, 26, 36, 91, 90, 44, 17, 16, 15, 33, 83, 14, 10, 1, 8, 82, 6, 2, 11, 5, 9, 3, 4, 88, 86, 13, 85, 94, 81, 32, 12, 30, 27, 19, 22, 43, 93, 28, 18, 92, 20, 21, 23, 24, 75, 25, 87, 49, 50, 67, 38, 31, 60, 66, 29, 46, 54, 56, 58, 51, 42, 89, 84, 61, 34, 45, 35, 41, 73, 77, 72, 76, 95, 70, 37, 55, 62, 63, 79, 59, 69, 65, 71, 52, 39, 7, 80, 68, 40, 74, 78, 64, 57, 53, 48, 47])
1 2 | sorted_ranks = ranks[indices_by_ranks] sorted_ranks |
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17])
1 2 | cols_by_rank = cols[indices_by_ranks] cols_by_rank[:top] |
array(['size', 'filetype_.html', 'filetype_openshift', 'title_wordcount_router', 'title_wordcount_revert', 'filetype_.template', 'filetype_.go', 'commits_number', 'num_prev_merged_prs', 'filetype_oc', 'title_wordcount_bump', 'body_size', 'change_in_test', 'is_reviewer', 'change_in_docs', 'title_wordcount_bug', 'created_at_hour', 'is_approver', 'change_in_vendor', 'created_at_weekday'], dtype=object)
1 2 3 | # prune the training set X_train_scaled_pruned = X_train_scaled[:, selector.support_] X_test_scaled_pruned = X_test_scaled[:, selector.support_] |
1 2 3 | train_evaluate_classifier( gnb, X_train_scaled_pruned, y_train, X_test_scaled_pruned, y_test ) |
precision recall f1-score support 0 0.17 0.54 0.26 249 1 0.10 0.35 0.16 217 2 0.26 0.06 0.10 364 3 0.12 0.12 0.12 240 4 0.28 0.02 0.03 275 5 0.14 0.03 0.05 236 6 0.19 0.14 0.16 333 7 0.00 0.00 0.00 270 8 0.20 0.15 0.17 260 9 0.22 0.26 0.24 262 accuracy 0.16 2706 macro avg 0.17 0.17 0.13 2706 weighted avg 0.17 0.16 0.13 2706
1 2 3 | train_evaluate_classifier( svc, X_train_scaled_pruned, y_train, X_test_scaled_pruned, y_test ) |
precision recall f1-score support 0 0.24 0.47 0.32 249 1 0.40 0.03 0.05 217 2 0.20 0.32 0.24 364 3 0.11 0.11 0.11 240 4 0.16 0.08 0.11 275 5 0.12 0.05 0.07 236 6 0.24 0.24 0.24 333 7 0.18 0.12 0.14 270 8 0.14 0.15 0.14 260 9 0.21 0.29 0.25 262 accuracy 0.19 2706 macro avg 0.20 0.19 0.17 2706 weighted avg 0.20 0.19 0.17 2706
1 2 3 | train_evaluate_classifier( rf, X_train_scaled_pruned, y_train, X_test_scaled_pruned, y_test ) |
precision recall f1-score support 0 0.25 0.31 0.28 249 1 0.16 0.14 0.15 217 2 0.21 0.22 0.21 364 3 0.13 0.14 0.14 240 4 0.13 0.11 0.12 275 5 0.16 0.14 0.15 236 6 0.20 0.20 0.20 333 7 0.18 0.16 0.17 270 8 0.17 0.17 0.17 260 9 0.22 0.26 0.24 262 accuracy 0.19 2706 macro avg 0.18 0.19 0.18 2706 weighted avg 0.18 0.19 0.18 2706
1 2 3 | train_evaluate_classifier( xgbc, X_train_scaled_pruned, y_train, X_test_scaled_pruned, y_test ) |
[14:13:31] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. precision recall f1-score support 0 0.23 0.39 0.29 249 1 0.12 0.06 0.07 217 2 0.20 0.24 0.22 364 3 0.17 0.17 0.17 240 4 0.10 0.07 0.08 275 5 0.17 0.12 0.14 236 6 0.26 0.32 0.28 333 7 0.16 0.13 0.14 270 8 0.14 0.13 0.14 260 9 0.24 0.26 0.25 262 accuracy 0.20 2706 macro avg 0.18 0.19 0.18 2706 weighted avg 0.18 0.20 0.18 2706
From the confusion matrices above, we can conclude that the models perform slightly better when trained using all the features, instead of using only the RFE-pruned subset.
Create sklearn Pipeline¶
Here, we will create an sklearn pipeline consisting of 2 steps, scaling of the input features and the classifier itself. We will then save this Pipeline as a model.joblib
file on S3 for serving the model pipeline using the Seldon Sklearn Server.
1 | pipe = Pipeline(steps=[("scale", scaler), ("rf", selected_model)]) |
Write Model to S3¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | key = "github/ttm-model/pipeline" filename = "model.joblib" s3_resource = boto3.resource( "s3", endpoint_url=s3_endpoint_url, aws_access_key_id=s3_access_key, aws_secret_access_key=s3_secret_key, ) with tempfile.TemporaryFile() as fp: dump(pipe, fp) fp.seek(0) s3_obj = s3_resource.Object(s3_bucket, f"{key}/{filename}") s3_obj.put(Body=fp.read()) |
1 2 3 4 5 6 | ## Sanity Check buffer = BytesIO() s3_object = s3_resource.Object(s3_bucket, f"{key}/{filename}") s3_object.download_fileobj(buffer) model = load(buffer) model |
Pipeline(steps=[('scale', PowerTransformer()), ('rf', RandomForestClassifier(max_features=0.75, n_estimators=200, n_jobs=-1, random_state=42))])
1 2 | preds = model.predict(X_test) print(classification_report(y_test, preds)) |
precision recall f1-score support 0 0.31 0.42 0.36 249 1 0.14 0.10 0.12 217 2 0.23 0.27 0.25 364 3 0.15 0.17 0.16 240 4 0.13 0.10 0.11 275 5 0.14 0.10 0.12 236 6 0.23 0.23 0.23 333 7 0.16 0.14 0.15 270 8 0.18 0.17 0.17 260 9 0.23 0.28 0.25 262 accuracy 0.20 2706 macro avg 0.19 0.20 0.19 2706 weighted avg 0.19 0.20 0.20 2706
Conclusion¶
In this notebook, we explored various vanilla classifiers, namely, Naive Bayes, SVM, Random Forests, and XGBoost. The XGBoost classifier was able to predict the classes with a weighted average f1 score of 0.21 an accuracy of 22% when trained using all the available features. Also, all of the models perform better when trained using all available features than when trained using a top 20 features determined using RFE.
Even though all models outperform the baseline (random guess), we believe there is still some room for improvement. Since the target variable of the github PR dataset is an ordinal variable, an ordinal classifier could perform better than the models trained in this notebook. We will explore this idea in a future notebook.
As the immediate next step, we will to deploy the best model from this notebook as an inference service using Seldon.