A METHOD TO ENRICH EXPERIMENTAL DATASETS BY MEANS OF NUMERICAL SIMULATIONS IN VIEW OF CLASSIFICATION TASKS

Classification tasks are frequent in many applications in science and engineering. A wide variety of statistical learning methods exist to deal with these problems. However, in many industrial applications, the number of samples available to train and construct a classifier is scarce, and this has an impact on the classification performance. In this work, we consider the case in which some a priori information on the system is available in the form of a mathematical model. In particular, a set of numerical simulations of the system can be integrated into the experimental dataset. The main question we address is how to integrate them systematically in order to improve the classification performance. The proposed method is based on Nearest Neighbours and on the notion of Hausdorff distance between sets. Some theoretical results and several numerical studies are proposed. Mathematics Subject Classification. 60B10, 68T05. Received March 26, 2021. Accepted September 23, 2021.


Introduction
Classification tasks are frequent in many applications in science and engineering. The statistical learning methods proposed to deal with them rely on the fact that many examples (where the required number of samples depends on the application under consideration) are available and can be exploited to uncover the underlying structure of the data and their separation into several classes. After the learning phase has been performed, a classifier is set up and can be used to infer the class to which a new observed sample belongs.
In many industrial applications the number of available samples is scarce, impacting the performance of the classification. A way to circumvent this limitation is to integrate some a priori information (coming from experimental insight or theoretical knowledge) into the available a posteriori information (provided by the data), as proposed for instance in [15,16,20,21].
The use of mathematical models and numerical simulations to construct the training set of machine learning methods has been recently investigated in [3,27,29]. In [29], a model order reduction framework is proposed in order to deal with classification problems: synthetic outputs obtained by numerical simulations are used to train the machine learning algorithms, and the influence of the model error on the classification performance is investigated. In [3], numerical simulations are used to set up a sparse Gaussian process, which is then used to solve an optimal design problem for structural anomaly detection. In [27], a convolutional neural network framework is proposed to efficiently deal with health monitoring, seen as a classification problem on multivariate time series. The training of the network is performed by using numerical simulations of a physics-based model of the system.
In this work we consider the case in which some a priori information is available in the form of a mathematical model. Numerical simulations of several instances of the model can be computed and integrated into an available dataset in order to improve the classification performance. The main questions to be answered are: how many numerical simulations should we include, and which ones? Which information is needed in order to devise a systematic strategy? This work is devoted to the investigation of possible answers to these questions, in the spirit of what has been proposed in [2], in which an adaptive sampling strategy is proposed to improve the performance of an SVM classifier. There, the selection of the samples aims at improving the position of the support vectors and the margin. These questions have also been raised in [17], where each training sample is weighted in order to solve SVM classification tasks.
This topic is also closely related to two research fields in machine learning: domain adaptation and instance (or prototype) selection. The main goal of domain adaptation is to account for the discrepancies between the training and test sets and propose ways to correct for them. An abundant literature on this subject is available [23,28,33,36]. The main difference with respect to the method proposed in the present work is that in domain adaptation one often tries to minimise a discrepancy between the datasets, whereas here we focus on improving a classification score. This is closer, in spirit, to the methods proposed in the field of instance selection. Different kinds of algorithms have been proposed in this field, and they can be divided into four classes (commented and compared in the recent work [5]): (1) Incremental methods, such as Condensed Nearest Neighbors [14] and its variants [26,32] or Instance-Based Learning [1], which build the training set by adding samples chosen according to different criteria.
(2) Decremental methods, such as the Decremental Reduction Optimization Procedure [34,35] or Hit Miss Networks [19], which define the training set by pruning samples from an available reservoir of potentially redundant (and corrupted) samples. (3) Batch methods, such as Edited Nearest Neighbors [31], which test whether each sample of the training set satisfies a removal criterion; all the samples verifying this criterion are removed at once. (4) Fixed-size methods, such as Learning Vector Quantization [22], which fix a priori the size of the training set and select the samples to be used.
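To fix ideas, the incremental family can be illustrated by a minimal sketch of Condensed Nearest Neighbors (an illustrative implementation of ours, assuming a 1-NN rule and the Euclidean distance):

```python
import numpy as np

def condensed_nn(X, y, rng=None):
    """Incremental instance selection (CNN): keep only the samples that a
    1-NN classifier built on the current store misclassifies."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(X))
    store = [order[0]]                       # start from one arbitrary sample
    changed = True
    while changed:                           # repeat until no sample is added
        changed = False
        for i in order:
            if i in store:
                continue
            d = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(d))]
            if y[nearest] != y[i]:           # misclassified -> add to the store
                store.append(i)
                changed = True
    return np.array(store)
```

On two well-separated clusters, the store retains only a few samples, typically near the class boundary, which is the behaviour the incremental family exploits.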
Recent studies have proposed in-between methods, such as [8]. These algorithms may have several drawbacks: in methods where samples are tested one at a time to decide whether they should be included in the training set, the result may be sensitive to the order in which the samples are tested. In some methods, the fitness function introduced to perform the selection is based on similarity criteria applied to the input features rather than on the classification success rate, which may be suboptimal in some cases, or it may depend on hyperparameters which need to be tuned.
The main contributions of the present investigation are the following: (1) A systematic strategy can be set up that enriches available training sets and improves the classification performance in a substantial way. The only information which is exploited is a representative validation set, given either in the form of samples or in the form of a set of data and parameters of a reliable mathematical model describing the phenomenon. (2) The proposed method can be decomposed into two phases: an incremental one, in which we add to the training set samples taken from a reservoir of numerical simulations, and a decremental one, in which we prune samples to reduce redundancy and noise oversensitivity. We tried to reduce as much as possible the number of hyperparameters. (3) The obtained approach is not a generative one: it is not strictly needed to have an exhaustive training set distributed as the validation set; it is sufficient to add the most informative samples, in a sense that will be made more precise in the following, and that will be encoded in the fitness functions used in the incremental and decremental phases.
The structure of the work is as follows. In Section 2 the method is proposed, and some properties are investigated from a theoretical standpoint. In Section 3 the discretisation is discussed, and in Section 4 some numerical test cases are presented to illustrate the approach.

The method
In this section, we detail the method proposed in the present work. The problem under investigation is a classification task and, for the sake of simplicity, we restrict ourselves to a binary classification. Four different sets of samples are introduced: (1) An augmented set, for which we know both the input (observations) and the output (labels). The augmented set is the main unknown of the problem. We wish to devise a way to construct it, starting from an available set of labeled instances that is scarce in the number of samples. The training set of the problem (which we will use to set up the classifier) is the augmented set at the end of the enrichment process. The elements of the augmented set will be denoted by the superscript "tr". (2) A validation set, for which we know both the input (observations) and the output (labels), whose elements will be denoted by the superscript "v". This is the only source of information used to construct the augmented set.
(3) A test set, for which we know just the observation, whose elements will be denoted by the superscript "te". (4) A reservoir of numerical simulations of the system, for which we know both the observation and the label, to be used in order to construct or enrich the augmented set.
Several possible cases are met in realistic applications. First, we can be in a case in which the available experimental dataset covers all the possible meaningful instances of the problem under scrutiny, but with few samples (or not enough to reach the desired performance on the test set). We will call this the complete validation case. Second, we could be in an incomplete validation case, meaning that the experimental dataset to be used as training and validation covers only a subset of the possible instances (occurring in the test set). In both situations, we would like to enrich the dataset by integrating elements of the reservoir into the augmented set. This is the simplest way to integrate some a priori information coming from mathematical modelling with the existing a posteriori information of the experimental data. We will consider here the case of a perfect model (useful to validate certain aspects of the method) and the more realistic case in which the model is biased.

Context and notations
Let S be a random variable representing the state of a system, for a population of individuals. A system configuration, identified by the realisation s, can belong to one of two classes, labelled y ∈ {0, 1}. In an application, the system is observed through a measurement process and, for a given observation x ∈ R^d (which in general results from the application of a non-linear function to s), we need to uncover whether the state belongs to the class y = 0 or y = 1.
The system observable for the population can be modelled by a random variable X defined on the probability space (Ω, F, P), with Ω ⊆ R^d, F the σ-algebra of all the possible observables and P the probability measure. We denote x^(i) ∈ Ω a realisation of X and we assume that its probability density, denoted ρ(x), is a mixture of two densities. Let π_0, π_1 ∈ (0, 1), such that π_0 + π_1 = 1. The probability density reads: ρ(x) = π_0 ρ_0(x) + π_1 ρ_1(x), where ρ_0(x), ρ_1(x) are the conditional probability densities for the classes 0 and 1 respectively, namely ρ_{0,1}(x) = ρ(x | y = (0, 1)).
In the following, the Lebesgue measure of a generic set A is denoted by μ(A). The classification success rate is based on a score function s, which is a measure, introduced and described in [18], and that we recall for the sake of completeness. The set of all the subsets of Ω is denoted by 2^Ω. Definition 2.1. We define the score function s: 2^Ω × 2^Ω → R_+ as follows: s(A_0, A_1) = π_0^s ∫_{A_0} ρ_0^s(x) dx + π_1^s ∫_{A_1} ρ_1^s(x) dx, with the given densities ρ_0^s, ρ_1^s, where the superscript "s" denotes either the validation or the test set.
This score can be evaluated for all pairs of subsets A_0, A_1. It is related to the classification outcome when we compute it for the pair of sets defining the augmented set, where the superscript "a" stands for the augmented set. As in [18], we assume that the set on which the two class regions overlap is a zero measure set; under this hypothesis, the score coincides with the classification success rate. Remark 2.2. The main goal is to enrich the augmented set aiming at improving the classification performance, which is quantified by the above introduced score. To this end, it is not necessary that the augmented set densities coincide with the validation ones. The proposed approach is not a generative one seeking to generate samples distributed as the validation set, but one that selects samples which help to improve the score. Hence, we can hopefully come up with a method which is less costly from a computational point of view.
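Assuming the score is the probability of correct classification induced by a candidate pair of sets (A_0, A_1), it can be estimated empirically as the fraction of labeled samples falling in the region assigned to their own class. A minimal sketch (the indicator-function interface is an assumption of ours):

```python
import numpy as np

def mc_score(samples, labels, in_A0, in_A1):
    """Monte Carlo estimate of the classification score: the fraction of
    labeled samples that fall in the region assigned to their own class.
    `in_A0`, `in_A1` are indicator functions of the candidate sets."""
    hits = [(in_A0(x) if y == 0 else in_A1(x)) for x, y in zip(samples, labels)]
    return float(np.mean(hits))
```

For instance, with A_0 = {x < 0} and A_1 = {x ≥ 0} on a one-dimensional observable, the score is simply the empirical accuracy of that partition.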

Augmented set enrichment based on the Hausdorff distance: ASE-HD
We assume that Ω (defined in Sect. 2.1) is a measurable non-empty compact subset of R^d, and that an observation of the system is x ∈ Ω ⊂ R^d.
At the beginning, the augmented set is given by the union of two known sets, A_0^(0) and A_1^(0). The goal is to progressively enrich the augmented set by making use of the samples in the reservoir of simulations. For the sake of simplicity, in this section, we make the hypothesis that the reservoir samples can cover Ω.
The information to be exploited comes from the knowledge of the validation set, either in the form of samples or as a set of data and parameters of a mathematical model. This can be translated into two sets A*_{0,1}, with A*_1 = Ω ∖ A*_0, such that A*_0 = {x ∈ Ω | y = 0}. These sets are optimal in the sense of the score function s. In the following, we denote s* the score corresponding to these sets.
Let k ∈ N denote the k-th step of the enrichment. We define A_j^(k) ⊆ Ω (for j = 0 or 1) as the class regions at step k, the samples of the augmented set being x^(i) ∈ A_0^(k) ∪ A_1^(k). The score of the classification corresponding to these sets reads s^(k) = s(A_0^(k), A_1^(k)), where ρ_j^(k) is the pdf of the augmented set of class j and ρ_j^v is the pdf of the validation set of class j.
Starting from the known sets A_j^(0), j = 0, 1, the goal is to transform them in order to converge to A_j^*, j = 0, 1, which maximise the classification success rate. We construct a sequence which aims at increasing the score s^(k), by observing that it is possible to make the sets A_j^(k) converge towards the optimal sets A_j^* by diminishing a suitable distance between these sets. Let ℬ(x, r) ⊂ Ω denote a ball of centre x and radius r ≥ 0. The enrichment is performed as follows: at step k, a centre x^(k+1) and a radius r are selected so as to reduce the Hausdorff distance between the current and the optimal sets; let ℬ* = ℬ(x^(k+1), r). The update of the union of the intersections then reads as in (2.7).
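A possible discrete sketch of one enrichment step, in the spirit of the construction above (the coverage-gap criterion used here is an illustrative proxy for the Hausdorff-distance reduction, not the exact rule of the paper):

```python
import numpy as np

def enrichment_step(aug_X, aug_y, val_X, val_y, res_X, res_y, r):
    """Sketch of an ASE-HD-style step: find the validation sample that is
    worst represented by the current augmented set (largest distance to an
    augmented sample of the same class), then add every reservoir sample of
    that class lying in the ball B(x*, r)."""
    gaps = []
    for x, y in zip(val_X, val_y):
        same = aug_X[aug_y == y]
        gaps.append(np.min(np.linalg.norm(same - x, axis=1))
                    if len(same) else np.inf)
    k = int(np.argmax(gaps))                 # worst-covered validation sample
    centre, cls = val_X[k], val_y[k]
    mask = (res_y == cls) & (np.linalg.norm(res_X - centre, axis=1) <= r)
    return (np.vstack([aug_X, res_X[mask]]),
            np.concatenate([aug_y, res_y[mask]]))
```

Iterating this step progressively fills the regions of the validation set that the augmented set does not yet cover, which is the role played by the ball ℬ* in the continuous formulation.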

Analysis of the ASE-HD algorithm
The convergence of the sets A_{0,1}^(k) to the sets A*_{0,1} is studied. First, a lemma is introduced, clarifying the meaning of the set ℬ*. Let A ∆ B denote the symmetric difference [10] between the sets A and B.
The result of this lemma makes it possible to prove the following result (the proofs are presented in the Supplementary material). Moreover, the gain in the score between two consecutive steps can easily be estimated; its expression is given in the following result.
It follows that the gain is proportional to the total variation between ρ_0 and ρ_1 restricted to ℬ*. The proposition simply states that, under the hypothesis that the system observable belongs to a compact set and that the sets A*_{0,1} are known, the proposed iteration enriches the augmented set in such a way that the optimal classification score is retrieved. This algorithm shares some properties with the algorithm detailed in [4]. In particular, the set sequence depends on the symmetric difference between the expected and the current set.

Reducing noise oversensitivity and bias induced errors: pruning.
At each stage of the ASE-HD algorithm, the samples of the reservoir contained in a selected ball ℬ* are added to the augmented set (either to A_0^(k+1) or to A_1^(k+1)). As remarked in [35], a large number of noisy samples could lead to noise oversensitivity. Moreover, as the augmented set is enriched through numerical simulations, a bias could potentially pollute the classification results in regions where the samples of the validation set are scarce. To avoid these phenomena and to make the classification less prone to overfitting, a pruning phase is introduced, which consists in removing the samples which are not useful in improving the score.
Once ASE-HD is performed, the obtained augmented set consists in the pair of sets A_0, A_1. Since, in practice, we have a finite number of samples, these sets consist in a finite union of balls centred around a finite number of samples.
A stochastic algorithm is introduced. At the q-th iteration, a sample x^(i) of the augmented set is randomly selected. It can be considered as the centre of a small ball ℬ_i(x^(i), r_i) whose radius is such that the other samples do not belong to ℬ_i. The score is computed with and without the selected sample, and the following action is taken: the sample is removed if the score without it is at least as large; it is kept otherwise (2.8). Remark that, by construction, at the end of the pruning step the score is at least as good as at the beginning of the pruning step, and in some cases an improvement is obtained.
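The pruning phase can be sketched as follows (illustrative names of ours; the score function is passed in as a black box evaluated on the validation set):

```python
import numpy as np

def prune(aug_X, aug_y, score_fn, n_iter=100, rng=None):
    """Stochastic pruning sketch: repeatedly pick a random augmented sample
    and remove it if the score does not decrease, in the spirit of (2.8)."""
    rng = np.random.default_rng(rng)
    for _ in range(n_iter):
        if len(aug_X) <= 1:
            break
        i = rng.integers(len(aug_X))
        trial_X = np.delete(aug_X, i, axis=0)
        trial_y = np.delete(aug_y, i)
        if score_fn(trial_X, trial_y) >= score_fn(aug_X, aug_y):
            aug_X, aug_y = trial_X, trial_y   # accept the pruned set
    return aug_X, aug_y
```

By construction, the accepted removals never decrease the score, which mirrors the guarantee stated above for the pruning step.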

On realistic scenarios
In many applications different concerns may arise, such as a possible bias in the mathematical model (and hence in the database) [12,30] and the incomplete validation case. We recall that, in the present work, we call incomplete a validation set which does not cover the whole observable space Ω. In this section, a set of results is proposed to deal with these two cases.

Biased database
In general, the database obtained through a collection of experiments and/or simulations may have a bias. Let A_j^te (j = 0 or 1) denote the test sets, which are supposed to cover Ω, i.e. A_0^te ∪ A_1^te = Ω. The samples from these sets are drawn from the true underlying densities. The sets identified by using the densities of the model are in general different, since the model densities differ from the true ones. This is due to the model bias: the difference in the model state is propagated to the model observable and hence to the density, which in turn affects the identified sets. We define the biased sets B_{0,1} as the regions in which the model-based and the true classifications disagree. The biased sets B_{0,1} quantify, in a sense which is pertinent for the binary classification, the effect of the model bias.
Lemma 2.7. Let the sets B_{0,1} be defined as in equations (2.9) and (2.10). The result of the lemma makes it possible to prove the following result on the classification score of the test set. Proposition 2.8. Let the hypotheses of Lemma 2.7 hold. Let s̃ be the score of the classification of the test set when the augmented set is defined by the model, and let s* denote the maximal score. It holds: 0 ≤ s̃ ≤ s*. Remark 2.9. It is straightforward to observe that, in the case where the biased sets are empty (no bias), we have equality. In practice, we do not know the biased sets. This means that, if we only train with the model (database), the score we compute is the one associated with the model-identified sets.

The Validation set partially covers the set of possible outcomes.
In several situations it is possible to assess whether the validation set covers all the possible scenarios that could occur in the test set (even prior to receiving the test set). This is possible in particular when there is an underlying parametrisation of the system at hand, namely when the scenarios of interest are associated with values of data and parameters that characterise the solution of the models describing the phenomenon. Here, we consider that the validation set partially covers Ω when it does not have enough instances, in the sense that there are meaningful scenarios of the real system which are not represented in the validation set. This translates into the following: if we trained a classifier by using the validation set, it would not be able to classify correctly some query samples of the test set.
When the validation set partially covers Ω (incomplete validation set), we can show that the score on the test set (which is supposed to cover Ω) is no higher than the score obtained with a validation set covering Ω (see Prop. 2.11).
Proposition 2.11. Let s_c (resp. s_i) denote the test set score obtained with a complete (resp. incomplete) validation set. By complete, we mean that the distributions of the validation and test sets are the same. Then, s_i ≤ s_c.
In this scenario, we cannot use generative adversarial networks (GANs) [11] to enrich the augmented set in regions which are not covered by the validation set. This is due to the fact that the discriminator has no information on the region where there are no validation samples.
To enrich the augmented set, we propose first to enrich the validation set by adding to it samples extracted from the reservoir such that the enriched validation set covers all the possible meaningful scenarios.
If some information on the model bias is available (statistics on the model bias), we proceed as follows. Let the bias in the observation be a random variable H, whose realisations are denoted by η ∈ R^d. A sample of the reservoir is randomly picked in the region which is not covered by the validation set, whose observation is an element x^(r) ∈ R^d. Then, the sample to be added to the validation set is x^(v) = x^(r) + η (2.13), and the associated label is y^(v) = y^(r).
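A hypothetical sketch of this bias-aware enrichment follows; the function `sample_bias`, standing for a draw from the available bias statistics, is an assumption of ours:

```python
import numpy as np

def enrich_validation(val_X, val_y, res_X, res_y, sample_bias, rng=None):
    """Sketch of Eq. (2.13): a reservoir sample (picked here at random, as a
    stand-in for a sample from the uncovered region) is shifted by a draw of
    the model-bias variable before being appended; the label is carried over."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(res_X))
    x_new = res_X[i] + sample_bias(rng)      # x^(v) = x^(r) + eta
    return (np.vstack([val_X, x_new]),
            np.concatenate([val_y, [res_y[i]]]))
```

In the absence of bias information, `sample_bias` can simply return zero, and the scheme reduces to copying reservoir samples into the validation set.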

Discretisation of the method.
When the enrichment method proposed in the previous section is applied to realistic cases, we need to account for the fact that the only available quantity is a set of labeled samples, which can be divided into training and validation sets. The method needs to be discretised in order to be practically implemented, and several elements need to be detailed. The first one is the estimation of the score function, whose computation requires a density estimation.

Density estimation in high-dimension.
To estimate the score by using a Monte Carlo method, we need to estimate a density at a sample, namely the value ρ(x^(i)) ∈ R_+. This task may be cumbersome due to the high dimensionality of the space. Several methods of non-parametric density estimation are proposed in the literature [6,9,25]. For the present work we consider as a starting point the k-nearest neighbors (KNN) estimation. In the KNN method, a tree-based algorithm subdivides the sample set into overlapping balls, each containing a fixed number of samples, say k ∈ N*, out of a total number of N ∈ N* samples. The density is usually estimated by making the assumption that it is roughly constant in a ball, leading to: ρ(x^(i)) ≈ k / (N vol(ℬ_k)), where ℬ_k = ℬ(x^(i), r_k) and vol(ℬ_k) is its volume, computed according to the metric chosen to select the neighbors. We denote d_p(x_1, x_2) the ℓ_p distance between two elements x_1, x_2. Remark 3.1. Following [13], if we want to classify a given sample x* by using the Bayes rule, assuming P(y = 0) = P(y = 1) and π_0 = π_1, we obtain the following result. Let r_0, r_1 be the radii of the balls containing the k nearest neighbors of x* in each class. The a posteriori probability is inversely proportional to the corresponding ball volume: the classification outcome only depends on the distance between the closest points in each class of the augmented set and their respective k-th nearest neighbor. Figure 1 shows an example in which, by making use of this approach, we wrongly classify a validation point: as the computed radius is lower for class 1, the validation point is labeled 1 instead of 0.
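The classical KNN density estimate described above can be sketched as follows (Euclidean metric; the volume of the ball uses the standard d-ball formula):

```python
import math
import numpy as np

def knn_density(x, samples, k=5):
    """Classical kNN density estimate: rho(x) ~ k / (N * vol(B_k)), where
    B_k is the smallest ball around x containing its k nearest samples."""
    N, d = samples.shape
    # distance to the k-th nearest neighbor of x
    r = np.sort(np.linalg.norm(samples - x, axis=1))[k - 1]
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # unit d-ball volume
    return k / (N * unit_ball * r ** d)
```

For samples drawn uniformly on the unit square, the estimate at an interior point fluctuates around the true density 1, with a relative error of order 1/√k.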
The issue shown in Figure 1 is mainly due to the assumption that the density is constant in the ball. We propose to replace this assumption with an approximation based on Gaussian radial basis functions (RBFs). Let us introduce weights w_i ∈ R, i = 1, . . . , k; moreover, let the elements in a ball be x^(i) ∈ R^d, i = 1, . . . , k, and let ε > 0 be the radius of the balls centred at the samples x^(i). The density in a ball is expressed as a weighted sum of Gaussian kernels centred at the samples. Let ρ_i denote the density at the sample x^(i) obtained by the classical KNN approximation. The weights are computed as the solution of the following optimisation problem, whose interpretation is simple: the weights stay close to the classical KNN estimated density (the Gaussian kernel being equal to one when evaluated at its own sample), and, when integrated over the ball, the approximation of the density retrieves the expected value of the mass in the ball. The solution can be written in closed form. The following example aims at illustrating the effect of the above introduced approximation on a classification task.
Let Ω = [−5, 5]² be the domain, and x = (x_0, x_1) ∈ Ω. We define the two classes as follows. The sample size for the training set is n_{0,1} = 18 per class. For each class the training set is uniformly distributed, but with a different density (the density is higher for class 1, as shown in Fig. 1). The validation set is generated using a regular square mesh of Ω (with steps ∆x_0 = ∆x_1 = 0.1), where each node is a sample (resulting in a validation sample size of 5000 for each class). Figure 2 shows the result when the density is estimated via the classical KNN method and with the proposed Gaussian kernel correction. In this test, the accuracy is significantly increased by the proposed technique (it passes from 0.86 to 0.96).

Computing the Hausdorff distance of sets.
One of the key steps of the proposed method is the approximation of the Hausdorff distance and of the largest ball contained in the set A^(k). Given the sets A_{0,1}^(k), we can identify the m ∈ N* samples of the validation set which are in A^(k); we denote I^(k) ⊂ N the set of their indices. The pairwise distance between every pair of elements of I^(k) is computed, and the pair maximising the distance is chosen, say (x^(i*), x^(j*)). We then consider the segment joining the samples x^(i*) and x^(j*). The elements of this segment are characterised by the following expression: let t ∈ [0, 1]; the points are c(t) = (1 − t) x^(i*) + t x^(j*). If the centre of the ball is chosen among the points of the segment, the problem reduces to finding t such that the radius of the ball inscribed in A^(k) is the largest. This problem is solved numerically by exhaustive search: the segment is discretised by considering a number of points on it, and the ball radius is evaluated at each of them.
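The exhaustive search over the discretised segment can be sketched as follows; here the inscribed-ball radius is approximated, as an assumption of ours, by the distance to the nearest sample of the other class:

```python
import numpy as np

def largest_ball_on_segment(x_a, x_b, other_X, n_pts=101):
    """Discretised search: along the segment joining the two farthest
    same-class samples, find the centre whose distance to the nearest sample
    of the other class (a proxy for the inscribed-ball radius) is largest."""
    ts = np.linspace(0.0, 1.0, n_pts)
    centres = (1 - ts)[:, None] * x_a + ts[:, None] * x_b   # c(t)
    radii = np.min(np.linalg.norm(other_X[None, :, :] - centres[:, None, :],
                                  axis=2), axis=1)
    best = int(np.argmax(radii))
    return centres[best], radii[best]
```

The cost is linear in the number of discretisation points and in the number of opposite-class samples, which keeps this step cheap compared to the density estimations.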
Remark 3.2. During the enrichment process, it might happen that no element of the reservoir belongs to the ball chosen to reduce the Hausdorff distance between the sets. In this case, we propose to add to the augmented set the centre of the ball, labeled according to the closest sample of the validation set.

Summary of the method.
The overall method is summarised hereafter. Two validation sets are given, namely A*_{0,1} ⊂ Ω, in the form of sets of validation samples. At the beginning of the procedure, we have two augmented sets A_{0,1}^(0). The pseudo-code of the method is given in Algorithm 1.

Numerical experiments.
In this section, several numerical experiments are proposed to illustrate the enrichment method.

Two dimensional cases
A two dimensional application is performed on three study cases for which we consider Ω = [0, 1]². For each study case, we randomly generated 2000 samples following a uniform law over Ω. The first half is gathered into the validation set, whereas the second half is gathered into the test set. Figure 4 shows the validation set for each study case. The output of the algorithm is the augmented set together with the classification scores on the validation and test sets.
In this study we assume that the reservoir is unbiased. The number of nearest neighbors is set to k = 5. Figure 5 shows the samples of the constructed training set (the augmented set once the algorithm has stopped) for each study case.
Two main points are highlighted by this figure: - The whole initial database is not needed; only a small fraction of it is actually useful for improving the classification score. - The samples selected to construct the augmented set lie mainly close to the class boundary. Figure 6 shows the scores for the validation and test sets for each study case. As the algorithm is performed on the validation set, the score on the validation set is higher than the one on the test set (and its standard deviation is smaller). Despite this slight overfitting, the constructed augmented set ensures a score higher than 0.96 on the test set for all three study cases.

A model in electro-physiology of cells.
This part is devoted to an example in electro-physiology. The observed model output, called the action potential (AP), is the potential difference across the cell membrane. This is influenced by the values of several parameters which represent the conductances of some of the ion channels of the cell. The model we consider, called the Minimal Ventricular (MV) model and presented in [7], is a system of parametric ordinary differential equations. We focus on three classification problems: given the model output, determine whether the conductances of sodium, calcium and potassium are above or below a certain threshold.
The dataset is synthetic and the numerical method used to approximate the model solution is a third order Backward Differentiation Formula (BDF3) with a time-step ∆t = 0.1 ms. A periodic source term in the equation is repeated every 1200 ms and its parametrisation is given in Table 1. Since, starting from the third stimulation, the system reaches periodicity (the ℓ2 norm of the difference between two consecutive periods varies by less than 10⁻³), we decided to store only the third period for this study.
A total of N = 2420 signals was generated with random conductance triplets (for sodium, calcium and potassium) following a uniform law over [0.6, 1]³. For a realization g = [g_sodium, g_calcium, g_potassium], the component g_c means that channel c is blocked at 100·(1 − g_c)%. We consider the control case (as a reference) for the realization g = [1, 1, 1], which leads to 100% of activity for each channel.
For each component g_c of a realization g, the labels are given by: y_c = 1 if g_c ≥ 0.8, and y_c = 0 otherwise. The value 0.8 corresponds to the conductance threshold for the classification tasks described at the beginning of this section.
As we have three parameters, we divided the problem into three classification tasks: sodium, calcium and potassium conductance classification. An example of AP signals in the control case (g = [1, 1, 1]) and in a random case is shown in Figure C.1.

Biased data
Different biased datasets were generated from these N = 2420 simulated APs. The biased signals were obtained by computing the Fourier transform of each signal and setting to zero the entries corresponding to the highest frequencies. We considered three different levels of bias (expressed in terms of energy), as presented in Table 2. An example of an AP signal with its different levels of bias is shown in Figure 7.
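The biasing procedure can be sketched with a simple low-pass filter in the Fourier domain (the `keep_fraction` parameter is ours; the paper specifies the bias levels in terms of energy):

```python
import numpy as np

def low_pass_bias(signal, keep_fraction=0.05):
    """Biased-signal sketch: zero the high-frequency Fourier entries,
    keeping only a fraction of the lowest frequencies (rfft layout)."""
    spec = np.fft.rfft(signal)
    cutoff = max(1, int(keep_fraction * len(spec)))
    spec[cutoff:] = 0.0                      # kill the high frequencies
    return np.fft.irfft(spec, n=len(signal))
```

By construction, a constant signal is untouched (its energy sits entirely in the DC bin), while a purely high-frequency component is removed entirely.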

Dictionary entry computation
For each sample (AP signal), we consider p = 24 observable quantities. These correspond to pairs of times and amplitudes in different phases of the AP signal. They are computed in the same way for each sample and are shown in Figure 8.
We denote f_j^(i) the j-th dictionary entry of the i-th AP signal. Considering the control case as a reference, we propose to use translated dictionary entries, obtained by subtracting the control-case value from each entry. It follows that, in the control case, all the translated entries vanish.
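The translation of the dictionary entries amounts to a simple subtraction of the control-case biomarkers (the array layout is our assumption: one row per signal, one column per entry):

```python
import numpy as np

def translated_entries(F, f_control):
    """Translated dictionary entries: each of the p biomarkers is expressed
    relative to its control-case value, so the control signal maps to zero."""
    return F - f_control[None, :]
```

This centring removes the common control-case offset, so the classifier only sees the deviation induced by the conductance changes.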

Datasets preprocessing
Two study cases are performed: in the first one, we assume that the validation set covers Ω, whereas in the second one we consider an incomplete validation set (covering only a subset of Ω). To do so, from the unbiased dataset, we randomly extract n = 89 of the N = 2420 signals in such a way that 84 of them have a sodium and calcium activity higher than 0.85. The 5 others are randomly chosen in such a way that at least one sample belongs to the other class (sodium and/or calcium conductance lower than the threshold). The dataset sizes are summarized in Table 3.
Test, validation and initial augmented sets are randomly extracted from the whole unbiased dataset (N = 2420). The database can be biased or unbiased depending on the study (the chosen samples are the same, but with different biases). The random process is performed in such a way that a selected sample belongs to only one set and cannot be selected more than once. Figure 9 shows the densities of the observable for the validation and test sets (for each class), in the sodium classification task.
As we can see, when the complete validation case is considered, the density is almost uniform over the whole domain (meaning that we have samples for almost all possible values of the observable). On the contrary, for the incomplete validation case, entire regions of the domain are not covered by validation samples.

Computational results
All the following results were obtained using k = 5 nearest neighbors. Figure 10 shows the scores obtained with a complete and an incomplete validation set.

Comparison between complete and incomplete validation set
(1) Complete validation set: (a) The validation score is higher than the test score because the optimization process is performed on the validation set. (b) The sodium conductance is easy to classify, whereas the calcium conductance is the most difficult to infer. The fact that the potassium and calcium conductances are more difficult to classify is due to the compensation effect between these two channels (see Fig. C.1), which is a known phenomenon in electrophysiology. (c) The scores are not significantly impacted by the bias, as the proposed method naturally rejects it.
(2) Incomplete validation set: (a) The validation score is higher than the test score because the optimization process is performed on the validation set. (b) The calcium conductance classification shows the lowest success rate, whereas the potassium conductance classification shows the highest score. The fact that the potassium has the highest score is expected, as no data were removed for this case. The scores obtained in the unbiased case are close to the expected ones: around 69% for the sodium, 75% for the potassium and 60% for the calcium (see Sect. D for more details). The bias does not strongly affect the score, except for the sodium in the highest bias case (see Figure 10, left panel, blue legend). (c) The bias is larger in the first part of the signal, as can be seen in Figure 7. This phase of the solution is known to be influenced by the sodium conductance, which explains why the score for the sodium classification is more impacted than the ones for calcium and potassium, which show a more stable trend.
(3) Complete vs. incomplete validation set: (a) The validation score is more stable and higher for the incomplete validation set. This is explained by the fact that we have fewer data in the validation set, aggregated in a smaller region, which eases the process. (b) The test score is lower in the incomplete validation set case because there are regions of Ω in which we have no samples; as we have no information in these empty regions, the score is lower. (c) For the same reasons, the variability of the test score is higher when the validation set is incomplete.
A comparison with the construction of a classifier using the full reservoir of data as the training set is given in Table 4. The same conditions were considered for the three methods (ASE-HD, SVM and KNN): for the "No Bias" scenario we put unbiased samples in the reservoir, for the "Low" scenario we considered only samples with a low-level bias in the reservoir, and we proceed analogously for the other scenarios. In all cases, the samples of the test set are unbiased, i.e. they are drawn from the "true" system. In the absence of bias, considering the whole reservoir as the training set is globally better. However, in the presence of bias, the proposed augmented set construction method is better. Moreover, it yields a similar classification success rate irrespective of the bias, since the method rejects biased data in an automated way.
Remark 4.1. In the KNN algorithm implemented in Scikit-Learn [24], the k closest samples (from the training set) of a query point are considered irrespective of the class they belong to, and the query point is classified by majority vote. This is quite different from the strategy proposed in this paper, in which we consider the k closest samples of the query point for each class in order to estimate a density per class (for a binary classification), and then classify the query point with a Bayesian approach. This could explain the difference in success rate between the ASE-HD and KNN methods in the No Bias case.
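The difference between the two strategies can be illustrated in a one-dimensional toy example (pure NumPy, with illustrative names; the per-class density is estimated here from the distance to the k-th same-class neighbour, a standard k-NN density estimate, assumed rather than taken from the paper):

```python
import numpy as np

def knn_density(x, class_samples, k):
    # k-NN density estimate: p(x) ~ k / (n * volume of the ball reaching
    # the k-th closest same-class sample). In 1-D the volume is 2 * r_k.
    r_k = np.sort(np.abs(class_samples - x))[k - 1]
    return k / (len(class_samples) * 2.0 * r_k)

def bayes_knn_classify(x, samples0, samples1, k=5, prior0=0.5, prior1=0.5):
    # Bayesian variant (as in the remark): one density per class from the
    # k closest samples of that class, then compare the posteriors.
    p0 = prior0 * knn_density(x, samples0, k)
    p1 = prior1 * knn_density(x, samples1, k)
    return 0 if p0 >= p1 else 1

def majority_vote_classify(x, samples0, samples1, k=5):
    # Standard k-NN (Scikit-Learn style): the k closest samples of the
    # pooled training set vote, irrespective of the class they belong to.
    dist = np.concatenate([np.abs(samples0 - x), np.abs(samples1 - x)])
    labels = np.concatenate([np.zeros(len(samples0)), np.ones(len(samples1))])
    nearest = labels[np.argsort(dist)[:k]]
    return int(nearest.sum() > k / 2)

rng = np.random.default_rng(1)
s0 = rng.normal(0.0, 1.0, 200)   # class 0 training samples (toy data)
s1 = rng.normal(3.0, 1.0, 200)   # class 1 training samples
pred_bayes = bayes_knn_classify(-0.5, s0, s1)
pred_vote = majority_vote_classify(3.5, s0, s1)
```

Both classifiers agree on these well-separated toy points; the behaviours diverge mainly near the class boundary and when the class priors are unbalanced, which is where the Bayesian variant's explicit priors matter.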
Remark 4.2. In this paper we considered a Bayesian approach to classify a query point. However, the augmented set construction method is not restricted to a particular classification method (nor density approximation).

Database and validation set enrichment
As described in Section 2.4.2, once the augmented set enrichment process has been performed on the incomplete validation set, we enrich the validation set with data from the database. In the case where we have a bias, we may exploit some statistical information on the bias to generate more pertinent labelled samples. We recall that we have 4 different study cases based on the database (see Sect. 4.2.1): without bias and with a low, medium and high level of bias. We assume that the a priori of the two classes is known: π0 = π1 = 1/2. Then, we enrich the validation set in such a way that the number of samples in each class is the same, up to a total of 400 samples (we added 311 samples). See Table 3.

Unbiased case
In the unbiased case, we compute the dictionary entry mean and standard deviation for each class of the incomplete validation set. We denote π̂ the estimated a priori. Then, we randomly browse the samples of the database (for each class). While the validation set contains fewer than 400 samples, if one of the entries of a candidate sample lies outside the corresponding (i.e. same class) mean plus/minus the standard deviation, we add it to the validation set (and remove it from the database), provided that min π̂(n+1) > min π̂(n), where π̂(n+1) denotes the a priori computed with the candidate sample included in the validation set and π̂(n) the a priori computed without it. In other words, this criterion enforces the assumption on the true a priori described above.
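A count-based toy version of this loop can be sketched as follows, assuming (for illustration only) that the a priori is estimated by the class fractions of the current validation set; the outlier test on the dictionary entries is omitted:

```python
import numpy as np

def enrich_validation(val_labels, db_labels, target=400, seed=0):
    # Toy enrichment loop: browse the database in random order and accept
    # a sample only if it strictly increases the smaller of the two
    # estimated a priori (here: the minority-class fraction).
    rng = np.random.default_rng(seed)
    labels = list(val_labels)
    for i in rng.permutation(len(db_labels)):
        if len(labels) >= target:
            break
        current = np.bincount(labels, minlength=2)
        candidate = np.bincount(labels + [db_labels[i]], minlength=2)
        if candidate.min() / candidate.sum() > current.min() / current.sum():
            labels.append(db_labels[i])
    return labels

val = [1] * 84 + [0] * 5        # incomplete validation set (class labels)
db = [0] * 2000 + [1] * 331     # database labels (illustrative)
enriched = enrich_validation(val, db)
```

Note that with count-based priors the strict-increase condition only ever accepts minority-class samples, so this toy loop stops as soon as the classes are balanced; reaching the target of 400 samples relies on the paper's full criterion on the dictionary entries.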

Biased case
For the biased case, we compute the difference of the mean and of the standard deviation (in the dictionary entry space) between the incomplete validation set and the simulated dataset obtained with the same parameter entries of the simulated model. Then, from these statistics, for each sample of the database, we generate 4 ghost samples following the approach described in Section 2.4.2. Here, we assume that the bias computed on the validation set is preserved on the empty region.
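One plausible reading of the ghost generation (the exact recipe is in Sect. 2.4.2, which is not reproduced here; all names and the Gaussian jitter are assumptions) is to shift each simulated entry by the estimated mean bias and jitter it with the estimated standard-deviation difference:

```python
import numpy as np

def ghost_samples(sim_entry, delta_mean, delta_std, n_ghosts=4, seed=0):
    # Hypothetical sketch: each ghost is the simulated dictionary entry
    # shifted by the estimated mean bias, jittered with the estimated
    # standard-deviation difference.
    rng = np.random.default_rng(seed)
    sim_entry = np.asarray(sim_entry, dtype=float)
    noise = rng.normal(0.0, np.abs(delta_std),
                       size=(n_ghosts,) + sim_entry.shape)
    return sim_entry + delta_mean + noise

# Bias statistics estimated on the incomplete validation set vs. the
# matching simulations (values are illustrative).
delta_mean = np.array([0.05, -0.02])
delta_std = np.array([0.01, 0.01])
ghosts = ghost_samples(np.array([1.0, 2.0]), delta_mean, delta_std)
```

Each database sample thus yields 4 ghosts that mimic the bias observed on the covered region, consistently with the assumption that the bias is preserved on the empty region.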

Results
The results are shown in Figure 11.
(1) The validation set (red and orange) vs. test set (blue and green): we always obtain a higher score on the validation set.

Conclusions and perspectives
In the present work, a method is proposed to enrich available experimental datasets by means of numerical simulations, with a view to improving classification performance. This is an example of the potential interaction between statistical learning and mathematical modelling. The method is based on a probabilistic description of the observations of a phenomenon and on a characterisation of the classification performance in terms of set distances. The main properties of the method have been investigated from a theoretical point of view and illustrated through numerical experiments. The systematic construction and enrichment of the augmented set can have a significant impact on the classification score. The proposed method performs a bias rejection to some extent and, if statistical information on a model bias is available, it can be naturally integrated in the algorithm.
Appendix A. Proofs in Section 2.2.1

Lemma 2.4. For the set ( ), ∀ ∈ N it holds: Proof of Lemma 2.4. By definition of the symmetric difference, we have: = Ω ∖ and * 0 respectively. It follows that: The proof for * 1 ∆ ( ) 1 is similar.
Proposition 2.5. Using the sequence of operations introduced in Section 2.2, almost surely, we have: Proof of Proposition 2.5. By definition of * and ( ) (see Eq. (2.6)), we have: Then, ( ) is a disjoint union of two sets. This implies that: Remark that, by definition of the Lebesgue measure on a set and due to the compactness of the sets, we have the following inequalities: It is straightforward to show that: Let us assume that ( ( ) ) > 0. It follows that at least one of the following inequalities is satisfied: Let ′ be the set such that: We then have ( ′ ) > 0. Therefore, there exist +1 ∈ ′ and > 0 such that the ball ℬ( +1, ) ⊆ ′. By definition of ( ) (see Sect. 2.2), we have: ( +1) = ( ) ∖ ℬ.
As ℬ ⊆ ′ ⊆ ( ) and (ℬ) > 0, we have: We thus obtain a sequence of measures which is strictly decreasing and bounded; this sequence therefore converges to its minimum. Let us assume that this minimum is > 0. Then, there exists a non-empty ball on which the measure would further decrease, which is impossible. It follows that: Then, at iteration + 1, we have: with: Let us consider the first scenario. Then, using the fact that the sets are disjoint, we have: which immediately yields: Here, we assumed that ℬ * ⊆ * 0 ∩ ( ) 1 . The inequality is given by the definition of * 0 . On this set, we have: 0 0 − 1 1 > 0. Equality is then obtained if and only if (ℬ * ) = 0. Considering the second scenario, we finally obtain:

Appendix B. Proofs in Section 2.4.1

Lemma 2.7. Let the sets , 0,1 be defined as in equations (2.9) and (2.10). The following equalities hold: Proof of Lemma 2.7. Let us focus on the first equality of the lemma (the proof for the second equality is similar). We have: As 1 ∩ 0 = ∅, we have: Since 0 ∪ 1 = Ω, we finally obtain: Proposition 2.8. Let be the score of the classification of the test set when the training set is defined by the model. The maximal score is represented by: It holds: 0 ≤ ≤ *, and moreover: Proof of Proposition 2.8. We have: Then, from Lemma 2.7 and the definitions of the sets, we have: By virtue of the definition of the sets 0 , 1 , it holds: This immediately leads to ≤ *. Moreover, concerning the left-hand side of the inequality, we have: In particular, the intersection of the two members of each equation is empty. Then, we can rewrite as follows: As each integrand is positive or null, we have ≥ 0.
Then, to ensure that the first two integrals are equal to 0, we necessarily have: Finally: In other words, the worst case for is obtained when the model is as bad as possible.

We recall that sodium channel blockade is mainly known to reduce the depolarization peak, calcium channel blockade is mainly known to reduce the plateau phase and the duration, whereas potassium channel blockade is mainly known to induce a signal prolongation.
Using some set theory properties, we finally obtain: Proposition 2.11. We denote = { | > } ( ̸= ), where = (test set) or (validation set). We denote (resp. ) the test set score obtained with a complete (resp. incomplete) validation set. By complete, we mean that the distributions of and are the same. Then, ≤ . In the incomplete validation case, we have either: Using Lemma 2.10, we have: Moreover, we know that 1 1 ≥ 0 0 over 1 . Hence, the second term of the previous equation is positive. Then, ≤ .

Appendix D. MV: scores in the incomplete validation set scenario
For this study we make the following assumptions:
- AP behavior under sodium blockade does not depend on potassium and calcium channel activities.
- AP behaviors under potassium and/or calcium channel blockade are mutually dependent.
The following study is coarse, but it is presented to justify the scores obtained in Section 4.2.4 of the manuscript.

D.1. Sodium channel blockade
In the incomplete validation case, the sodium activities of the validation set belong to (0.85, 1). We recall that each activity is an independent realization of a random variable following a uniform law over (0.6, 1). Let us assume that the test set (for which sodium activities belong to (0.6, 1)) has n elements. Then, we expect 0.625 n elements over (0.6, 0.85) and 0.375 n elements over (0.85, 1). As the set is complete over (0.85, 1), we assume that the augmented set enrichment is well performed, which leads to a perfectly classified test set over (0.85, 1). Conversely, as we have no information over (0.6, 0.85), we assume that half of the test set is well classified over this region. It follows that the averaged score is: 0.375 × 1 + 0.625 × 0.5 ≈ 0.69. Then, by simulation, we expect a score close to 0.69 for the sodium channel blockade study in the incomplete validation set case.
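The arithmetic behind this expected score can be checked directly: the uncovered interval (0.6, 0.85) carries 0.25/0.4 = 62.5% of the uniform mass and is classified at chance level, while the covered interval is classified perfectly.

```python
# Expected sodium score under the assumptions above: perfect classification
# on the covered region (0.85, 1), chance level (1/2) on (0.6, 0.85).
frac_uncovered = (0.85 - 0.6) / (1.0 - 0.6)   # 0.625 of the uniform mass
frac_covered = 1.0 - frac_uncovered           # 0.375
expected_score = frac_covered * 1.0 + frac_uncovered * 0.5  # 0.6875
```

This gives 0.6875, consistent with the score of roughly 0.69 quoted in the text.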

D.2. Potassium channel blockade
For this scenario, we use the same idea as described in the previous section. The upper panel of Figure D.1 shows the regions where we classify the test set correctly (green), wrongly (red) and partly correctly (orange). The lower panel shows the ratio between the potassium and the calcium activity.
Over the incomplete validation region, the lowest ratio for class 1 ( > 0.8) is 0.81 and the highest ratio for class 0 is 0.93. As the minimal ratio in the unknown region { < 0.85 ∪ > 0.8} is 0.98, all this region will be well classified. The red area is obtained with the same argument. The orange area corresponds to the region where the ratios can fall on either side of the class delimitation observed in the incomplete validation set.
Finally, summing the green area and half of the orange area, we obtain a score of approximately 0.75.

D.3. Calcium channel blockade
This scenario uses exactly the same arguments as those exposed in the previous section. The corresponding figure is shown in Figure D.2.
This strategy leads to a score approximately equal to 0.6.