We want to walk through some common metrics in classification problems – such as accuracy, precision and recall – to get a feel for when to use which metric. Say we are looking for a needle in a haystack. There are very few needles in a large haystack full of straws. An automated machine is sifting through the objects in the haystack and predicting for each object whether it is a straw or a needle. A reasonable predictor will predict a small number of objects as needles and a large number as straws. A prediction has two attributes – positive/negative and accurate/inaccurate.
Positive Prediction: the object at hand is predicted to be the needle. A small number.
Negative Prediction: the object at hand is predicted not to be a needle. A large number.
True_Positive: of the total number of predictions, the number of predictions that were positive and correct. Correctly predicted Positives (needles). A small number.
True_Negative: of the total number of predictions, the number of predictions that were negative and correct. Correctly predicted Negatives (straws). A large number.
False_Positive: of the total number of predictions, the number of predictions that are positive but incorrect. Incorrectly predicted Positives (straw predicted as needle). In principle this could be large, since the number of straws is large, but it can never exceed the number of predicted needles, which we assumed to be small. Hence small.
False_Negative: of the total number of predictions, the number of predictions that are negative but incorrect. Incorrectly predicted Negatives (needle predicted as straw). Is this a large number? Not in absolute terms: it can never exceed the total number of needles, which is small, and it is not large just because the class of negatives is large. Relative to the number of needles, however, it can be large: it depends on the predictor, and a “reasonable” predictor which predicts most objects as straws could also predict many of the needles as straws.
Predicted_Positives = True_Positives + False_Positives = Total number of objects predicted as needles.
Actual_Positives = True_Positives + False_Negatives = Actual number of needles. This is a property of the haystack and is independent of how the predictions turn out; the predictor only decides how the needles are split between True Positives and False Negatives.
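The four counts defined above can be tallied directly from a list of true labels and a list of predictions. The following is a minimal sketch; the function name `confusion_counts` and the toy haystack are made up for illustration and are not from any library.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for boolean labels, where True means 'needle'."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1      # predicted needle, is a needle
        elif p and not a:
            fp += 1      # predicted needle, is a straw
        elif not p and a:
            fn += 1      # predicted straw, is a needle
        else:
            tn += 1      # predicted straw, is a straw
    return tp, tn, fp, fn


# Hypothetical haystack: 3 needles among 20 objects.
actual = [True] * 3 + [False] * 17
# Hypothetical predictor: finds 2 of the 3 needles and flags 1 straw by mistake.
predicted = [True, True, False] + [True] + [False] * 16

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 2 16 1 1
print(tp + fp)         # 3 = number of objects predicted as needles
print(tp + fn)         # 3 = actual number of needles
```

Note that Predicted Positives and Actual Positives happen to be equal in this toy data only because FP and FN are equal.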
Accuracy = nCorrect_Predictions / nTotal_Predictions = (nTrue_Positives + nTrue_Negatives) / (nPredicted_Positives + nPredicted_Negatives). # the reasonable predictor described above will have a high accuracy. Most predictions will be hay, and will be correct simply because of the skewed distribution. Accuracy on its own does not shed light on FP or FN.
Precision = nTrue_Positives / nPredicted_Positives = TP/(TP+FP) # correctly_identified_needles/predicted_needles; this sheds light on FP; Precision = 1 => FP = 0 => all predictions of needles are in fact needles; a precision less than 1 means we got some hay along with the needles, which leaves hope that the hay can be removed with further sifting. Precision is also called Positive Predictive Value and quantifies the absence of False Positives, or incorrect diagnoses, among the positive predictions. It should not be confused with Specificity, which is TN/(TN+FP).
Recall = nTrue_Positives / nActual_Positives = TP/(TP+FN) # correctly_identified_needles/all_needles; this sheds light on FN; Recall = 1 => FN = 0; a recall less than 1 is awful, as some needles are left behind in the sifting process. Recall is also called Sensitivity.
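A small worked example makes the difference between these three metrics concrete. The counts below are invented for illustration (10 needles among 10,000 objects), and the helper names are hypothetical. Returning `None` when a denominator is zero is a choice made for this sketch, not a standard convention.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Undefined (0/0) when nothing was predicted positive; None marks that case.
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp, fn):
    # Undefined (0/0) when there are no actual positives; None marks that case.
    return tp / (tp + fn) if (tp + fn) else None


# Hypothetical sifter: flags 8 objects as needles, 6 of them correctly.
tp, fp, fn, tn = 6, 2, 4, 9988
acc, prec, rec = accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn)
print(acc)   # 0.9994
print(prec)  # 0.75
print(rec)   # 0.6

# Hypothetical lazy sifter: calls everything a straw.
lazy_acc = accuracy(tp=0, tn=9990, fp=0, fn=10)
lazy_rec = recall(tp=0, fn=10)
print(lazy_acc)  # 0.999
print(lazy_rec)  # 0.0
```

The lazy sifter finds no needles at all and still scores 0.999 on accuracy, which is why accuracy says little on a skewed distribution while recall exposes the problem immediately.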
Precision > Recall => TP+FP < TP+FN => FN is higher than FP (both have TP as the numerator, so only the denominators differ)
Precision < Recall => TP+FP > TP+FN => FN is lower than FP
If at least one needle is correctly identified as a needle, both precision and recall will be positive; if zero needles are correctly identified, recall is zero, and precision is zero as well (or undefined, 0/0, if no object at all was predicted to be a needle).
F1 Score is the harmonic mean of Precision and Recall. 1/F1 = (1/2)(1/P + 1/R). F1 = 2PR/(P+R). F1 = 0 if P = 0 or R = 0. F1 = 1 if P = 1 and R = 1.
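The harmonic mean is pulled towards the smaller of the two values, which a plain average hides. The sketch below compares the two; `f1_score` here is a hand-written helper for illustration (not a library call), and the precision/recall values are made up.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r; 0.0 when both are zero."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


def arithmetic_mean(p, r):
    return (p + r) / 2


# Balanced case: the two means are close.
print(f"{f1_score(0.75, 0.6):.3f}")         # 0.667
print(f"{arithmetic_mean(0.75, 0.6):.3f}")  # 0.675

# Lopsided case: perfect precision, but only 1% of the needles were found.
print(f"{f1_score(1.0, 0.01):.3f}")         # 0.020
print(f"{arithmetic_mean(1.0, 0.01):.3f}")  # 0.505
```

In the lopsided case the arithmetic mean still looks respectable, while F1 stays close to the poor recall.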
ROC/AUC rely on Recall (= TP/(TP+FN)) and another metric, the False Positive Rate, defined as FPR = FP/(FP+TN) = hay_falsely_identified_as_needles/total_hay. As TN >> FP, this should be close to zero, so it does not appear to be a useful metric in the context of needles in a haystack; by extension, neither do ROC/AUC. Note that the denominators are different in Recall and FPR: total needles and total hay respectively.
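To see how small the False Positive Rate gets on a skewed haystack, here is a minimal sketch. The counts are invented for illustration (10 needles and 9,990 straws) and the helper name is hypothetical.

```python
def false_positive_rate(fp, tn):
    """FP / (FP + TN): the fraction of all hay wrongly flagged as needles."""
    return fp / (fp + tn)


# Hypothetical sifter: 6 needles found, 4 missed, 2 straws wrongly flagged.
tp, fp, fn, tn = 6, 2, 4, 9988

fpr = false_positive_rate(fp, tn)
rec = tp / (tp + fn)         # denominator: total needles
prec = tp / (tp + fp)        # denominator: predicted needles

print(f"FPR       = {fpr:.4f}")   # FPR       = 0.0002
print(f"Recall    = {rec:.4f}")   # Recall    = 0.6000
print(f"Precision = {prec:.4f}")  # Precision = 0.7500
```

Tripling the false positives to 6 would drop precision from 0.75 to 0.5, while FPR would only move from about 0.0002 to about 0.0006. This is the sense in which FPR is insensitive here.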
There’s a bit of semantic confusion when saying True Positive or False Positive. These shorthands can be read as: it was known that an instance was a Positive, and a label of True or False was then applied to that instance. But what we mean is that it was not known whether the instance was a Positive, a determination was made that it was a Positive, and this determination was later found to be correct (True) or incorrect (False). Mentally replace True/False with ‘Correctly/Incorrectly identified as’ to remove this confusion.
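Normalization: rescale values to the range 0-1 (min-max scaling), or rescale a vector to unit norm; the latter is useful for dot products when calculating similarity, since the dot product of two unit vectors is their cosine similarity.
Standardization: subtract the mean and divide by the standard deviation, giving zero mean and unit variance; useful for neural network/classifier inputs.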
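The rescalings described above can be sketched in a few lines of plain Python. The function names are illustrative only. The sketch assumes a non-constant, non-zero input (otherwise a division by zero occurs) and uses the population standard deviation.

```python
import math
import statistics


def min_max_scale(xs):
    """Normalization to the range 0-1. Assumes max(xs) > min(xs)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def unit_norm(xs):
    """Normalization to unit (Euclidean) length. Assumes xs is not all zeros."""
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]


def standardize(xs):
    """Standardization: zero mean, unit (population) standard deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


scaled = min_max_scale([2.0, 4.0, 6.0, 8.0])
unit = unit_norm([3.0, 4.0])
z = standardize([2.0, 4.0, 6.0, 8.0])

print(scaled[0], scaled[-1])  # 0.0 1.0
print(unit)                   # [0.6, 0.8]
print(statistics.mean(z), statistics.pstdev(z))  # approximately 0 and 1
```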
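Regularization: used to reduce sensitivity to certain features by adding a penalty on the size of the model weights to the loss. L1 penalty (sum of absolute weights): Lasso regression. L2 penalty (sum of squared weights): Ridge regression.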
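As a rough illustration of the two penalty terms, the sketch below computes them for a made-up weight vector. The function names, the weights, and the strength `lam` are assumptions for illustration. In a real model the penalty is added to the data loss before optimization.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0]  # hypothetical model weights
lam = 0.1                   # hypothetical regularization strength

p1 = l1_penalty(weights, lam)  # 0.1 * (0.5 + 2.0 + 0.0)
p2 = l2_penalty(weights, lam)  # 0.1 * (0.25 + 4.0 + 0.0)
print(p1, p2)
```

The large weight (-2.0) contributes 4.0 to the squared sum but only 2.0 to the absolute sum, so the L2 penalty punishes large weights more heavily than small ones.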
Confusion matrix: holds the counts of predicted categories versus the known true categories. Square matrix of size n x n, where n is the number of categories; for two categories its four cells are TN, FP, FN and TP.
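A confusion matrix can be built by counting (actual, predicted) pairs. The sketch below is a hand-rolled helper for illustration, not a library function. Putting the actual category on the rows and the predicted category on the columns is an assumed convention, and the toy data is made up.

```python
def confusion_matrix(actual, predicted, labels):
    """n x n matrix of counts; rows are actual labels, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


labels = ["straw", "needle"]
# Hypothetical haystack: 3 needles among 20 objects.
actual = ["needle"] * 3 + ["straw"] * 17
# Hypothetical predictor: finds 2 needles, misses 1, flags 1 straw by mistake.
predicted = ["needle", "needle", "straw", "needle"] + ["straw"] * 16

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
# straw [16, 1]    -> TN = 16, FP = 1
# needle [1, 2]    -> FN = 1,  TP = 2
```

The same function works unchanged for more than two categories; the diagonal always holds the correct predictions.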