## Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro

Preamble

A core application of machine learning models is binary classification. It appears in a plethora of areas, from diagnostic tests in medicine to credit risk decision making for consumers. Techniques for building classifiers vary from simple decision trees to logistic regression and, lately, super cool deep learning models that leverage multilayered neural networks. However mathematically different they are in construction and training methodology, things get tricky when it comes to measuring their performance. In this post, we propose a simple and interpretable performance measure for a binary classifier in practice. Some background in classification is assumed.

Why is ROC-AUC not interpretable?

[Figure: Varying the threshold produces different confusion matrices (Wikipedia)]

The de facto standard for reporting classifier performance is the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) measure. It originates from the 1940s, during the development of radar by the US Navy, where it was used to measure detection performance. There are at least 5 different definitions of what ROC-AUC means, and even people with a PhD in machine learning have an excessively difficult time explaining what AUC means as a performance measure. Because AUC functionality is available in almost all libraries, reporting it as a classification performance measure has become almost a religious ritual in machine learning papers. However, its interpretation is not easy, quite apart from its absurd comparison issues (see hmeasure). AUC measures the area under the True Positive Rate (TPR) curve as a function of the False Positive Rate (FPR), where both rates are extracted from confusion matrices computed at different thresholds.

$$f(x)=y$$
$$\int_{0}^{1} f(x) dx = AUC$$

Here, y is TPR and x is FPR. Apart from the multitude of interpretations and the confusion they easily cause, there is no clear purpose in taking the integral over FPR. Obviously, we would like perfect classification with an FPR of zero, but the meaning of the area is not mathematically clear; that is, it is unclear what the area represents as a mathematical object.
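To make the construction concrete, here is a minimal sketch of how the (FPR, TPR) points arise from sweeping the threshold and how the integral above can be approximated with the trapezoidal rule. The function names `roc_points` and `auc_trapezoid` and the four-sample toy data are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by sweeping the decision threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # One threshold per distinct score, from high to low, plus +inf so the
    # curve starts at (0, 0), where nothing is predicted positive.
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t                   # one confusion matrix per threshold
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    """Approximate the integral of TPR over FPR with the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy data (assumed for illustration): labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print(fpr, tpr, auc)  # auc is 0.75 for this toy data
```

Each point on the curve comes from a different confusion matrix, so the single AUC number mixes operating points that would never be used together in practice.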

Probability of correct classification (PCC)

A simple and interpretable performance measure for a binary classifier would be great for both highly technical data scientists and non-technical stakeholders. The basic tenet in this direction is that the purpose of a classifier is to differentiate between two classes. This boils down to a probability value, the Probability of correct classification (PCC). An obvious choice is the so-called balanced accuracy (BA). It is usually recommended for unbalanced problems, even by SAS, although they multiply the probabilities. Here we refer to BA as PCC and use addition instead, because of statistical dependence:
$PCC = (TPR+TNR)/2$

$TPR=TP/(ConditionPositive)=TP/(TP+FN)$
$TNR=TN/(ConditionNegative)=TN/(TN+FP)$.

PCC tells us how good the classifier is at detecting either class, and it is a probability value in $[0,1]$. Note that using total accuracy over both positive and negative cases is misleading: even if our training data is balanced, the batches on which we measure performance in production may not be, so accuracy alone is not a good measure.
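A small sketch can illustrate why accuracy misleads on an unbalanced batch while PCC does not. The confusion-matrix counts below are hypothetical, chosen only to show the effect, and the helper names are illustrative.

```python
def pcc(tp, fn, tn, fp):
    """Probability of correct classification: (TPR + TNR) / 2."""
    tpr = tp / (tp + fn)   # TP / ConditionPositive
    tnr = tn / (tn + fp)   # TN / ConditionNegative
    return (tpr + tnr) / 2

def accuracy(tp, fn, tn, fp):
    """Total accuracy over both positive and negative cases."""
    return (tp + tn) / (tp + fn + tn + fp)

# Hypothetical production batch: 10 positives and 990 negatives.
# The classifier finds only 1 of the 10 positives and flags no negatives.
batch = dict(tp=1, fn=9, tn=990, fp=0)

acc = accuracy(**batch)   # 991 / 1000 = 0.991
p = pcc(**batch)          # (0.1 + 1.0) / 2 = 0.55
print(acc, p)
```

Accuracy looks excellent because the negatives dominate the batch, whereas PCC stays close to 0.5 and exposes that the positive class is barely detected.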

Production issues
An immediate question is how to choose the threshold used to generate the confusion matrix. One option is to choose the threshold that maximizes PCC on the test set and use it in production. To improve the estimate of PCC, resampling on the test set can be performed to obtain a good uncertainty estimate.
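The following is one possible sketch of this procedure, assuming a bootstrap as the resampling scheme (the post does not specify one). The synthetic scores, the function names, and the choice of a 95% percentile interval are all assumptions made for illustration.

```python
import numpy as np

def pcc_at_threshold(y_true, scores, threshold):
    """PCC = (TPR + TNR) / 2 for predictions scores >= threshold."""
    pred = scores >= threshold
    pos = y_true == 1
    neg = ~pos
    tpr = np.sum(pred & pos) / np.sum(pos)
    tnr = np.sum(~pred & neg) / np.sum(neg)
    return (tpr + tnr) / 2

def best_threshold(y_true, scores):
    """Scan every distinct score and keep the threshold with the highest PCC."""
    candidates = np.unique(scores)
    values = [pcc_at_threshold(y_true, scores, t) for t in candidates]
    i = int(np.argmax(values))
    return float(candidates[i]), float(values[i])

def bootstrap_pcc(y_true, scores, threshold, n_boot=1000, seed=0):
    """Resample the test set with replacement; return mean PCC and a 95% interval."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        yb, sb = y_true[idx], scores[idx]
        if yb.min() == yb.max():
            continue  # resample contains a single class; PCC is undefined
        samples.append(pcc_at_threshold(yb, sb, threshold))
    samples = np.array(samples)
    return float(samples.mean()), np.percentile(samples, [2.5, 97.5])

# Synthetic, unbalanced test set (assumed for illustration only).
rng = np.random.default_rng(42)
y_true = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])
scores = np.concatenate([rng.normal(0.3, 0.15, 200), rng.normal(0.7, 0.15, 50)])

threshold, pcc_hat = best_threshold(y_true, scores)
mean_pcc, (lo, hi) = bootstrap_pcc(y_true, scores, threshold)
print(f"threshold={threshold:.3f} PCC={pcc_hat:.3f} 95% interval=({lo:.3f}, {hi:.3f})")
```

Because the threshold is tuned on the same data used for the interval, the estimate is likely somewhat optimistic; choosing the threshold on a separate validation split would be one way to reduce that.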

Conclusion

We try to circumvent reporting AUCs by introducing PCC, or balanced accuracy, as a simple and interpretable performance measure for a binary classifier. This is easy to explain to a non-technical audience. An improved PCC that takes better estimation properties into account could be introduced, but the main interpretation remains the same: the probability of correct classification.