Title: Algorithms for Class Distribution Estimation
Description: Quantification is a prominent machine learning task that has received increasing attention in recent years. The objective is to predict the class distribution of a data sample. This package is a collection of machine learning algorithms for class distribution estimation and includes algorithms from different quantification paradigms. These methods are described in: A. Maletzke, W. Hassan, D. dos Reis, and G. Batista. The importance of the test set size in quantification assessment. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), pages 2640-2646, 2020. <doi:10.24963/ijcai.2020/366>.
Authors: Andre Maletzke [aut, cre], Everton Cherman [ctb], Denis dos Reis [ctb], Gustavo Batista [ths]
Maintainer: Andre Maletzke <[email protected]>
License: GPL (>= 2.0)
Version: 0.2.0
Built: 2025-02-21 04:00:48 UTC
Source: https://github.com/andregustavom/mlquantify
It quantifies events based on testing scores using the Adjusted Classify and Count (ACC) method. ACC is an extension of CC that applies a correction based on the true and false positive rates (tpr and fpr).
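The correction at the heart of ACC can be sketched in a few lines. This is a minimal illustration of Forman's adjustment, not the package's implementation; the function name `acc_sketch` is hypothetical.

```r
# Minimal sketch of the ACC correction (illustrative, not the package code).
acc_sketch <- function(test, tpr, fpr, thr = 0.5) {
  cc_hat <- mean(test >= thr)            # the plain Classify and Count estimate
  # Forman's correction: p = (cc_hat - fpr) / (tpr - fpr), clipped to [0, 1]
  p <- (cc_hat - fpr) / (tpr - fpr)
  min(max(p, 0), 1)
}

# A classifier with tpr = 0.9 and fpr = 0.2 observing 62% positive predictions:
acc_sketch(test = c(rep(0.9, 62), rep(0.1, 38)), tpr = 0.9, fpr = 0.2)
# (0.62 - 0.2) / (0.9 - 0.2) = 0.6
```

The clipping step matters in practice: when the sample's CC estimate falls outside the [fpr, tpr] interval, the raw correction leaves [0, 1].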
ACC(test, TprFpr, thr=0.5)
test: a numeric vector containing the score predicted for the positive class of each test instance.
TprFpr: a data.frame of true positive (tpr) and false positive (fpr) rates by threshold, as returned by getTPRandFPRbyThreshold().
thr: the decision threshold at which tpr and fpr were estimated (default: 0.5).
A numeric vector containing the class distribution estimated from the test set.
Forman, G. (2006, August). Quantifying trends accurately despite classifier error and class imbalance. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 157-166). <doi.org/10.1145/1150402.1150423>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
TprFpr <- getTPRandFPRbyThreshold(scores)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
ACC(test = test.scores[,1], TprFpr = TprFpr)
Contains events generated by a laser sensor to capture the flight dynamics of insects. It is a binary dataset composed of events from female and male Aedes aegypti.
data(aeAegypti)
The data set aeAegypti is a data frame with 1800 observations of 9 variables. Each event is described by the wing-beat frequency (wbf) and the frequencies of the first six harmonics, obtained when a female or male Aedes aegypti mosquito crosses an optical sensor's line-of-sight. The class variable is a factor: female (class = 1) and male (class = 2).
The aeAegypti dataset is a subset of a wide data-collection effort involving more than one million instances from 20 different insect species. The data were collected under varying temperature and humidity; each observation is associated with a temperature between 23°C and 35°C.
Andre Maletzke <[email protected]>
Maletzke, A. G. (2019). Binary quantification in non-stationary scenarios. Doctoral Thesis, Instituto de Ciências Matemáticas e de Computação, University of São Paulo, São Carlos. Retrieved 2020-07-21, from www.teses.usp.br. <doi.org/10.11606/T.55.2020.tde-19032020-091709>
Moreira dos Reis, D., Maletzke, A., Silva, D. F., & Batista, G. E. (2018). Classifying and counting with recurrent contexts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1983-1992). <doi.org/10.1145/3219819.3220059>
It quantifies events based on testing scores, applying the Classify and Count (CC) method. CC is the simplest quantification method derived from classification (Forman, 2005).
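CC classifies each instance at the threshold and counts the fraction of positive decisions. A minimal sketch (the function name `cc_sketch` is hypothetical, not the package API):

```r
# Minimal sketch of Classify and Count (illustrative).
cc_sketch <- function(test, thr = 0.5) {
  p <- mean(test >= thr)              # fraction classified as positive
  c(positive = p, negative = 1 - p)
}

cc_sketch(c(0.9, 0.8, 0.3, 0.6))      # 3 of 4 scores above 0.5 -> 75% positive
```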
CC(test, thr=0.5)
test: a numeric vector containing the score predicted for the positive class of each test instance.
thr: a numeric value indicating the decision threshold, between 0 and 1 (default: 0.5).
A numeric vector containing the class distribution estimated from the test set.
Forman, G. (2005). Counting positives accurately despite inaccurate classification. In European Conference on Machine Learning. Springer, Berlin, Heidelberg. <doi.org/10.1007/11564096_55>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 2)
tr <- aeAegypti[cv$Fold1,]
ts <- aeAegypti[cv$Fold2,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
CC(test = test.scores[,1])
DyS is a mixture-model framework for quantification. It quantifies events based on testing scores, applying the DyS framework proposed by Maletzke et al. (2019), and works with several similarity functions.
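The mixture-model idea behind DyS can be sketched as: model the test score histogram as a mixture alpha * P + (1 - alpha) * N of the positive and negative score histograms, and search for the alpha that minimizes a distance between the mixture and the observed test histogram. The sketch below uses a grid search and a squared-error distance purely for illustration; the package itself uses a ternary search and similarity measures such as topsoe, and the function name `dys_sketch` is hypothetical.

```r
# Minimal sketch of the DyS mixture search (illustrative, not the package code).
dys_sketch <- function(p.score, n.score, test, bins = 10) {
  br <- seq(0, 1, length.out = bins + 1)
  h <- function(x) {                     # normalized histogram on [0, 1]
    counts <- table(cut(x, br, include.lowest = TRUE))
    as.numeric(counts) / length(x)
  }
  hp <- h(p.score); hn <- h(n.score); ht <- h(test)
  alphas <- seq(0, 1, by = 0.01)
  # Squared error stands in for the package's similarity measures (e.g. topsoe)
  dist <- sapply(alphas, function(a) sum((a * hp + (1 - a) * hn - ht)^2))
  alphas[which.min(dist)]                # estimated positive proportion
}
```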
DyS(p.score, n.score, test, measure="topsoe", bins=seq(2,20,2), err=1e-5)
p.score: a numeric vector of scores predicted for the positive instances of the validation set.
n.score: a numeric vector of scores predicted for the negative instances of the validation set.
test: a numeric vector of scores predicted for the test instances.
measure: the measure used to compare the mixture histogram against the histogram obtained from the test set; several functions can be used (default: "topsoe").
bins: a numeric vector of histogram bin sizes to evaluate (default: seq(2, 20, 2)).
err: a numeric value defining the accepted error for the ternary search (default: 1e-5).
A numeric vector containing the class distribution estimated from the test set.
Maletzke, A., Reis, D., Cherman, E., & Batista, G. (2019). DyS: a Framework for Mixture Models in Quantification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). <doi.org/10.1609/aaai.v33i01.33014552>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
DyS(p.score = scores[scores[,3]==1,1], n.score = scores[scores[,3]==2,1],
    test = test.scores[,1])
This method is an instance of the well-known Expectation-Maximization algorithm for finding maximum-likelihood estimates of a model's parameters. It quantifies events based on testing scores, applying the Expectation Maximization for Quantification (EMQ) method proposed by Saerens et al. (2002).
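The procedure of Saerens et al. alternates two steps: re-weight each posterior by the ratio between the current prior estimate and the training prior, then re-estimate the prior as the mean of the adjusted posteriors. A minimal binary sketch under those assumptions (the function name `emq_sketch` is hypothetical, not the package API):

```r
# Minimal sketch of EMQ for the binary case (illustrative).
emq_sketch <- function(train_prior, posteriors, it = 5, e = 1e-4) {
  # train_prior: training P(positive); posteriors: P(positive | x) per test point
  p <- train_prior
  for (i in seq_len(it)) {
    # E-step: re-weight each posterior by the prior ratio and renormalize
    num <- (p / train_prior) * posteriors
    den <- num + ((1 - p) / (1 - train_prior)) * (1 - posteriors)
    adj <- num / den
    # M-step: the mean adjusted posterior becomes the new prior estimate
    p_new <- mean(adj)
    if (abs(p_new - p) < e) { p <- p_new; break }
    p <- p_new
  }
  p
}
```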
EMQ(train, test, it=5, e=1e-4)
train: a data.frame containing the labeled training instances.
test: a numeric matrix of posterior probabilities predicted for each test instance.
it: the maximum number of iteration steps (default: 5).
e: a numeric value for the stopping threshold (default: 1e-4).
A numeric vector containing the class distribution estimated from the test set.
Saerens, M., Latinne, P., & Decaestecker, C. (2002). Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation. <doi.org/10.1162/089976602753284446>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 2)
tr <- aeAegypti[cv$Fold1,]
ts <- aeAegypti[cv$Fold2,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
EMQ(train = tr, test = test.scores)
This function provides the true and false positive rates (tpr and fpr) for a range of thresholds.
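The computation amounts to sweeping a threshold over the validation scores and, at each value, counting the fraction of actual positives and actual negatives scored above it. A minimal sketch (the function name `tprfpr_sketch` is hypothetical; the package function additionally handles its own score-matrix layout):

```r
# Minimal sketch of computing tpr and fpr over a threshold grid (illustrative).
tprfpr_sketch <- function(scores, labels, label_pos = 1,
                          thr_range = seq(0, 1, 0.01)) {
  pos <- labels == label_pos
  t(sapply(thr_range, function(thr) c(
    thr = thr,
    tpr = mean(scores[pos] >= thr),    # true positives / actual positives
    fpr = mean(scores[!pos] >= thr)    # false positives / actual negatives
  )))
}
```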
getTPRandFPRbyThreshold(validation_scores, label_pos = 1, thr_range = seq(0,1,0.01))
validation_scores: a matrix or data.frame containing the scores predicted for each class and the true label of each validation instance.
label_pos: a numeric value or factor indicating the positive class label (default: 1).
thr_range: a numeric vector of threshold values to evaluate (default: seq(0, 1, 0.01)).
A data.frame where each row contains the tpr and fpr rates for one threshold value. By default, the threshold varies from 0 to 1 in increments of 0.01.
Everton Cherman <[email protected]>
Andre Maletzke <[email protected]>
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 2)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
TprFpr <- getTPRandFPRbyThreshold(scores)
It computes the class distribution using the HDy algorithm proposed by González-Castro et al. (2013) with Laplace smoothing (Maletzke et al., 2019).
HDy_LP(p.score, n.score, test)
p.score: a numeric vector of scores predicted for the positive instances of the validation set.
n.score: a numeric vector of scores predicted for the negative instances of the validation set.
test: a numeric vector of scores predicted for the test instances.
A numeric vector containing the class distribution estimated from the test set.
Andre Maletzke <[email protected]>
González-Castro, V., Alaíz-Rodriguez, R., & Alegre, E. (2013). Class distribution estimation based on the Hellinger distance. Information Sciences. <doi.org/10.1016/j.ins.2012.05.028>.
Maletzke, A., Reis, D., Cherman, E., & Batista, G. (2019). DyS: a Framework for Mixture Models in Quantification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). <doi.org/10.1609/aaai.v33i01.33014552>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
HDy_LP(p.score = scores[scores[,3]==1,1], n.score = scores[scores[,3]==2,1],
       test = test.scores[,1])
It quantifies events based on testing scores, applying an adaptation of Kuiper's test to quantification problems.
KUIPER(p.score, n.score, test)
p.score: a numeric vector of scores predicted for the positive instances of the validation set.
n.score: a numeric vector of scores predicted for the negative instances of the validation set.
test: a numeric vector of scores predicted for the test instances.
A numeric vector containing the class distribution estimated from the test set.
Denis dos Reis <[email protected]>
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
KUIPER(p.score = scores[scores[,3]==1,1], n.score = scores[scores[,3]==2,1],
       test = test.scores[,1])
It quantifies events based on testing scores, applying the MAX method, according to Forman (2006). Same as T50, but it sets the threshold where tpr - fpr is maximized.
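The selection rule can be sketched directly: pick the row of the TprFpr table maximizing tpr - fpr, then apply the ACC-style correction at that threshold. This is an illustrative sketch, not the package's implementation; `max_sketch` and its `tprfpr` data.frame layout are assumptions.

```r
# Minimal sketch of the MAX method (illustrative).
max_sketch <- function(test, tprfpr) {
  # tprfpr: data.frame with columns thr, tpr, fpr
  i <- which.max(tprfpr$tpr - tprfpr$fpr)   # most separable threshold
  cc <- mean(test >= tprfpr$thr[i])         # CC estimate at that threshold
  p <- (cc - tprfpr$fpr[i]) / (tprfpr$tpr[i] - tprfpr$fpr[i])
  min(max(p, 0), 1)
}
```

Maximizing tpr - fpr keeps the denominator of the correction large, which makes the estimate less sensitive to noise in the estimated rates.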
MAX(test, TprFpr)
test: a numeric vector containing the score predicted for the positive class of each test instance.
TprFpr: a data.frame of true positive (tpr) and false positive (fpr) rates by threshold, as returned by getTPRandFPRbyThreshold().
A numeric vector containing the class distribution estimated from the test set.
Forman, G. (2006, August). Quantifying trends accurately despite classifier error and class imbalance. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 157-166). <doi.org/10.1145/1150402.1150423>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
TprFpr <- getTPRandFPRbyThreshold(scores)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
MAX(test = test.scores[,1], TprFpr = TprFpr)
It quantifies events based on testing scores, applying the Mixable Kolmogorov Smirnov (MKS) method proposed by Maletzke et al. (2019).
MKS(p.score, n.score, test)
p.score: a numeric vector of scores predicted for the positive instances of the validation set.
n.score: a numeric vector of scores predicted for the negative instances of the validation set.
test: a numeric vector of scores predicted for the test instances.
A numeric vector containing the class distribution estimated from the test set.
Maletzke, A., Reis, D., Cherman, E., & Batista, G. (2019). DyS: a Framework for Mixture Models in Quantification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). <doi.org/10.1609/aaai.v33i01.33014552>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
MKS(p.score = scores[scores[,3]==1,1], n.score = scores[scores[,3]==2,1],
    test = test.scores[,1])
It quantifies events based on testing scores, applying the Median Sweep (MS) method, according to Forman (2006).
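Median Sweep applies the ACC correction at every threshold in the TprFpr table and returns the median of the resulting estimates, which makes it robust to a few badly estimated rates. An illustrative sketch (the function name `ms_sketch` and the data.frame layout are assumptions, not the package API):

```r
# Minimal sketch of Median Sweep (illustrative).
ms_sketch <- function(test, tprfpr) {
  # ACC estimate at each threshold in the table
  est <- mapply(function(thr, tpr, fpr) {
    (mean(test >= thr) - fpr) / (tpr - fpr)
  }, tprfpr$thr, tprfpr$tpr, tprfpr$fpr)
  est <- est[is.finite(est)]            # drop thresholds where tpr == fpr
  min(max(median(est), 0), 1)           # median estimate, clipped to [0, 1]
}
```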
MS(test, TprFpr)
test: a numeric vector containing the score predicted for the positive class of each test instance.
TprFpr: a data.frame of true positive (tpr) and false positive (fpr) rates by threshold, as returned by getTPRandFPRbyThreshold().
A numeric vector containing the class distribution estimated from the test set.
Forman, G. (2006, August). Quantifying trends accurately despite classifier error and class imbalance. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 157-166). <doi.org/10.1145/1150402.1150423>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
TprFpr <- getTPRandFPRbyThreshold(scores)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
MS(test = test.scores[,1], TprFpr = TprFpr)
It quantifies events using a modified version of the MS method that considers only thresholds where the denominator (tpr - fpr) is greater than 0.25.
MS2(test, TprFpr)
test: a numeric vector containing the score predicted for the positive class of each test instance.
TprFpr: a data.frame of true positive (tpr) and false positive (fpr) rates by threshold, as returned by getTPRandFPRbyThreshold().
A numeric vector containing the class distribution estimated from the test set.
Forman, G. (2006, August). Quantifying trends accurately despite classifier error and class imbalance. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 157-166). <doi.org/10.1145/1150402.1150423>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
TprFpr <- getTPRandFPRbyThreshold(scores)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
MS2(test = test.scores[,1], TprFpr = TprFpr)
It quantifies events based on testing scores, applying the Probabilistic Adjusted Classify and Count (PACC) method. This method is also called Scaled Probability Average (SPA).
PACC(test, TprFpr, thr=0.5)
test: a numeric vector containing the calibrated probability predicted for the positive class of each test instance.
TprFpr: a data.frame of true positive (tpr) and false positive (fpr) rates by threshold, as returned by getTPRandFPRbyThreshold().
thr: the decision threshold at which tpr and fpr were estimated (default: 0.5).
A numeric vector containing the class distribution estimated from the test set.
Bella, A., Ferri, C., Hernández-Orallo, J., & Ramírez-Quintana, M. J. (2010). Quantification via probability estimators. In IEEE International Conference on Data Mining (pp. 737-742). Sydney. <doi.org/10.1109/ICDM.2010.75>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
TprFpr <- getTPRandFPRbyThreshold(scores)
test.scores <- predict(scorer, ts_sample, type = c("prob"))[,1]
# -- PACC requires calibrated scores. Be aware of doing this before using PACC --
# -- You can do it with the calibrate function from the CORElearn package --
# if(requireNamespace("CORElearn")){
#   cal_tr <- CORElearn::calibrate(as.factor(scores[,3]), scores[,1], class1=1,
#                                  method="isoReg", assumeProbabilities=TRUE)
#   test.scores <- CORElearn::applyCalibration(test.scores, cal_tr)
# }
PACC(test = test.scores, TprFpr = TprFpr)
It quantifies events based on testing scores, applying the Probabilistic Classify and Count (PCC) method.
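Instead of counting hard decisions at a threshold, PCC averages the calibrated posterior probabilities. A minimal sketch (the function name `pcc_sketch` is hypothetical, not the package API):

```r
# Minimal sketch of Probabilistic Classify and Count (illustrative).
pcc_sketch <- function(test) {
  p <- mean(test)                 # expected fraction of positives
  c(positive = p, negative = 1 - p)
}

pcc_sketch(c(0.9, 0.7, 0.4, 0.2))  # positive = 0.55
```

Because the estimate is the mean posterior, PCC only works well when the scores are well-calibrated probabilities, hence the calibration step in the example below.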
PCC(test)
test: a numeric vector containing the calibrated probability predicted for the positive class of each test instance.
A numeric vector containing the class distribution estimated from the test set.
Bella, A., Ferri, C., Hernández-Orallo, J., & Ramírez-Quintana, M. J. (2010). Quantification via probability estimators. In IEEE International Conference on Data Mining (pp. 737-742). Sydney. <doi.org/10.1109/ICDM.2010.75>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
test.scores <- predict(scorer, ts_sample, type = c("prob"))[,1]
# -- PCC requires calibrated scores. Be aware of doing this before using PCC --
# -- You can do it with the calibrate function from the CORElearn package --
# if(requireNamespace("CORElearn")){
#   cal_tr <- CORElearn::calibrate(as.factor(scores[,3]), scores[,1], class1=1,
#                                  method="isoReg", assumeProbabilities=TRUE)
#   test.scores <- CORElearn::applyCalibration(test.scores, cal_tr)
# }
PCC(test = test.scores)
It is a nearest-neighbor classifier adapted to quantification problems. The method applies a weighting scheme that reduces the weight of neighbors from the majority class.
PWK(train, y, test, alpha=1, n_neighbors=10)
train: a data.frame of the training instances' features (without the class label).
y: a vector of class labels for the training instances.
test: a data.frame of the test instances' features.
alpha: a numeric value defining the weighting of the proportion-weighted k-nearest neighbor algorithm proposed by Barranquero et al. (2013) (default: 1).
n_neighbors: an integer value defining the number of neighbors to use by default for nearest neighbor queries (default: 10).
A numeric vector containing the class distribution estimated from the test set.
Barranquero, J., González, P., Díez, J., & Del Coz, J. J. (2013). On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recognition, 46(2), 472-482.<doi.org/10.1016/j.patcog.2012.07.022>
library(caret)
library(FNN)
cv <- createFolds(aeAegypti$class, 2)
tr <- aeAegypti[cv$Fold1,]
ts <- aeAegypti[cv$Fold2,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
PWK(train = tr[,-which(names(tr)=="class")], y = tr[,"class"],
    test = ts_sample[,-which(names(ts_sample)=="class")])
SMM is a member of the DyS framework that uses the simple means of the scores to represent the positive, negative, and unlabelled score distributions. The class distribution is then given by a closed-form equation.
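Since the mean of a mixture is the mixture of the means, SMM solves mean(test) = alpha * mean(p.score) + (1 - alpha) * mean(n.score) for alpha. A minimal sketch of that closed form (the function name `smm_sketch` is hypothetical, not the package API):

```r
# Minimal sketch of SMM's closed-form estimate (illustrative).
smm_sketch <- function(p.score, n.score, test) {
  alpha <- (mean(test) - mean(n.score)) / (mean(p.score) - mean(n.score))
  min(max(alpha, 0), 1)   # clip to a valid proportion
}

# Three "positive-looking" scores and one "negative-looking" score -> 0.75
smm_sketch(p.score = c(0.8, 0.9), n.score = c(0.1, 0.2),
           test = c(rep(0.85, 3), 0.15))
```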
SMM(p.score, n.score, test)
p.score: a numeric vector of scores predicted for the positive instances of the validation set.
n.score: a numeric vector of scores predicted for the negative instances of the validation set.
test: a numeric vector of scores predicted for the test instances.
A numeric vector containing the class distribution estimated from the test set.
Hassan, W., Maletzke, A., Batista, G. (2020). Accurately Quantifying a Billion Instances per Second. In IEEE International Conference on Data Science and Advanced Analytics (DSAA).
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
SMM(p.score = scores[scores[,3]==1,1], n.score = scores[scores[,3]==2,1],
    test = test.scores[,1])
It quantifies events based on testing scores, applying the DyS framework with the Sample ORD Dissimilarity (SORD) proposed by Maletzke et al. (2019).
SORD(p.score, n.score, test)
p.score: a numeric vector of scores predicted for the positive instances of the validation set.
n.score: a numeric vector of scores predicted for the negative instances of the validation set.
test: a numeric vector of scores predicted for the test instances.
A numeric vector containing the class distribution estimated from the test set.
Maletzke, A., Reis, D., Cherman, E., & Batista, G. (2019). DyS: a Framework for Mixture Models in Quantification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). <doi.org/10.1609/aaai.v33i01.33014552>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
SORD(p.score = scores[scores[,3]==1,1], n.score = scores[scores[,3]==2,1],
     test = test.scores[,1])
It quantifies events based on testing scores, applying the T50 method proposed by Forman (2006). It sets the decision threshold of the binary classifier where tpr = 50%.
T50(test, TprFpr)
test: a numeric vector containing the score predicted for the positive class of each test instance.
TprFpr: a data.frame of true positive (tpr) and false positive (fpr) rates by threshold, as returned by getTPRandFPRbyThreshold().
A numeric vector containing the class distribution estimated from the test set.
Forman, G. (2006, August). Quantifying trends accurately despite classifier error and class imbalance. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 157-166). <doi.org/10.1145/1150402.1150423>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
TprFpr <- getTPRandFPRbyThreshold(scores)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
T50(test = test.scores[,1], TprFpr = TprFpr)
It quantifies events based on testing scores, applying the X method (Forman, 2006). Same as T50, but it sets the threshold where (1 - tpr) = fpr.
X(test, TprFpr)
test: a numeric vector containing the score predicted for the positive class of each test instance.
TprFpr: a data.frame of true positive (tpr) and false positive (fpr) rates by threshold, as returned by getTPRandFPRbyThreshold().
A numeric vector containing the class distribution estimated from the test set.
Forman, G. (2006, August). Quantifying trends accurately despite classifier error and class imbalance. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 157-166). <doi.org/10.1145/1150402.1150423>.
library(randomForest)
library(caret)
cv <- createFolds(aeAegypti$class, 3)
tr <- aeAegypti[cv$Fold1,]
validation <- aeAegypti[cv$Fold2,]
ts <- aeAegypti[cv$Fold3,]
# -- Getting a sample from ts with 80 positive and 20 negative instances --
ts_sample <- rbind(ts[sample(which(ts$class==1),80),],
                   ts[sample(which(ts$class==2),20),])
scorer <- randomForest(class~., data=tr, ntree=500)
scores <- cbind(predict(scorer, validation, type = c("prob")), validation$class)
TprFpr <- getTPRandFPRbyThreshold(scores)
test.scores <- predict(scorer, ts_sample, type = c("prob"))
X(test = test.scores[,1], TprFpr = TprFpr)