COMPLEXITY ANALYSIS OF STOCHASTIC GRADIENT METHODS FOR PDE-CONSTRAINED OPTIMAL CONTROL PROBLEMS WITH UNCERTAIN PARAMETERS

Abstract. We consider the numerical approximation of an optimal control problem for an elliptic Partial Differential Equation (PDE) with random coefficients. Specifically, the control function is a deterministic, distributed forcing term that minimizes the expected squared L^2 misfit between the state (i.e., the solution to the PDE) and a target function, subject to a regularization term that guarantees well-posedness. For the numerical treatment of this risk-averse Optimal Control Problem (OCP) we consider a Finite Element discretization of the underlying PDE, a Monte Carlo sampling method, and gradient-type iterations to obtain the approximate optimal control. We provide full error and complexity analyses of the proposed numerical schemes. In particular, we investigate the complexity of a conjugate gradient method applied to the fully discretized OCP (the so-called Sample Average Approximation), in which the Finite Element discretization and the Monte Carlo sample are chosen in advance and kept fixed over the iterations. This is compared with a Stochastic Gradient method on a fixed or varying Finite Element discretization, in which the expectation in the computation of the steepest descent direction is approximated by Monte Carlo estimators, independent across iterations, with small sample sizes. We show in particular that the second strategy results in an improved computational complexity. The theoretical error estimates and complexity results are confirmed by numerical experiments.

Many PDE-constrained optimal control problems involve input data that are not known exactly and are therefore conveniently modeled as random variables or random fields. Such Optimal Control Problems (OCPs) are sometimes also referred to as problems of Optimization Under Uncertainty (OUU).
In this work we focus on the numerical approximation of the problem of controlling the solution of an elliptic PDE with random coefficients by a distributed unconstrained control. Specifically, the control acts as a deterministic volumetric forcing term, so that the controlled solution is as close as possible to a given target function.
While there is a vast literature on the numerical approximation of PDE-constrained optimal control problems in the deterministic case (see, e.g., [8,27] and the references therein), as well as on the numerical approximation of (uncontrolled) PDEs with random coefficients (e.g., [4,23,35]), the analysis of PDE-constrained control problems under uncertainty is much more recent and incomplete, although the topic has received increasing attention in the last few years.
The different frameworks of PDE-constrained OCPs under uncertainty considered in the literature can be roughly grouped into two categories.
In the first category, the control is random [1,6,10,32,43,48]. This situation arises when the randomness in the PDE is observable; hence an optimal control can be built for each realization of the random system. The corresponding optimality system might still be fully coupled in the random parameters, e.g., if the random objective function also involves some statistics of the state variables (e.g., deviation from the nominal response). The dependence on the random parameters is typically approximated either by polynomial chaos expansions or by sampling-based Monte Carlo (MC) techniques. This is, for example, considered in [32], where the authors prove analytic dependence of the control on the random parameters and study its best N-term polynomial chaos approximation for a linear parabolic PDE-constrained OCP. In [10] the authors combine a stochastic collocation method with a Finite Element (FE) based reduced basis method to alleviate the computational effort. The works [6,43,48] address the case of a fully coupled optimality system discretized by either Galerkin or collocation approaches and propose different methods, such as sequential quadratic programming or block diagonal preconditioning, to solve the coupled system efficiently. Sampling-based Monte Carlo and Multilevel Monte Carlo approaches are considered in [1] instead, where the case of random coefficients with limited spatial regularity is addressed.
In the second category, the control is deterministic [2,7,22,[28][29][30]49]. This situation arises when randomness in the system is not observable at the time of designing the control, so that the latter should be robust in the sense that it minimizes the risk of obtaining a solution which leads to high values of the objective function. The precise notion of risk is problem-dependent and thus has to be modeled appropriately. In this context, risk typically refers to a suitable statistical measure of the objective function to be minimized, such as the expectation, the expectation plus variance, a quantile, or a conditional expectation above a quantile (the so-called Conditional Value-at-Risk, CVaR [42]). The corresponding OCP often leads to a fully coupled optimality system of equations in the random parameters. It is noteworthy that the idea of minimizing a risk to obtain a solution with favorable properties goes back to the origins of robust optimization [47].
Numerical methods for OCPs of the second category typically depend on the choice of the risk measure. For example, in [2] the authors consider a risk measure based on the mean and variance of a scalar objective function and use second order Taylor expansions combined with randomized estimators to reduce the computational effort. The work [49] contains a study of a risk measure that involves the expected squared L^2 misfit between the state and a target function, with an additional penalty term involving the squared L^2 deviation of the state from its mean value. The authors propose a gradient-type method in which the expectation of the gradient is computed by a Multilevel Monte Carlo method. In [7], the authors also consider a similar risk measure and propose a reduced basis method on the space of controls to significantly reduce the computational effort. A more general class of risk measures (including the CVaR) for OCPs has been considered in [31], where the corresponding optimality systems of PDEs are also derived. The subsequent work [29] introduces a trust-region (Newton) conjugate gradient algorithm combined with adaptive sparse grid collocation as PDE discretization in the stochastic space for the numerical treatment of these more general OCPs. For the robust OCP with the CVaR as risk measure, the study [30] introduces derivative-based optimization methods that are built upon smooth approximations to the CVaR. Finally, in [22] the authors consider a boundary OCP where the deterministic control appears as a Neumann boundary condition.
In this work, we follow the second modeling category and consider a risk-averse OCP similar to the one in [7,49], which consists in minimizing, as objective function, the expected squared L^2 misfit between the state and a given target function, additionally equipped with an L^2 regularization on the (deterministic) control. For this setting we consider numerical gradient-based methods, either deterministic or stochastic, where adjoint calculus is used to represent the gradient of the objective function. Both the primal problem and the adjoint problem are discretized by finite elements, and Monte Carlo estimators are used to approximate the expectations defining the risk measure. The reason for studying sampling-based Monte Carlo approximations instead of polynomial chaos type methods is to develop methods that can potentially handle many random parameters and possibly rough random coefficients.
Our main contribution is to provide a full error and complexity analysis for the considered gradient-based methods, accounting for the three sources of error, namely, the Finite Element approximation, the statistical Monte Carlo error, and the error due to the finite number of gradient iterations.
We note that other error analyses have been presented before, such as [10] for the case of a random control with a discretization in the physical space by Finite Elements and in probability by stochastic collocation, as well as [22] for the case of a deterministic boundary control that minimizes a quadratic risk measure, using a Finite Element discretization both in physical space and in probability. Finally, we mention the recent work [24] where the authors consider as risk measure the same expected quadratic loss function as in this work and study a quasi-Monte Carlo approximation (i.e., a deterministic quadrature in the probability space) of the expected loss, which may offer a further complexity improvement, provided that the system's state equation is sufficiently regular with respect to the uncertain parameters. In contrast, here we focus on the subtle interplay of Finite Element discretization errors, Monte Carlo sampling errors and different numerical optimization techniques in the context of general-purpose methods.
The first method that we consider is a gradient-based method (e.g., conjugate gradient) on a fully discretized version of the OCP (the so-called Sample Average Approximation, SAA), in which the Finite Element discretization and the Monte Carlo sample are chosen initially and kept fixed over the iterations. If N is the sample size of the Monte Carlo estimator, this method entails the solution of N state and N adjoint problems at each iteration of the gradient method, which could be troublesome if a small tolerance is required, entailing a very large N and a small Finite Element mesh size.
We then turn to stochastic gradient methods in which the gradient is re-sampled independently at each iteration and the Finite Element mesh size can be refined along the iterations. At each iteration this corresponds to taking an independent Monte Carlo estimator with only one realization (N_j = 1) or a very small, fixed sample size (N_j = N̄), independently of the required tolerance, possibly on a finer Finite Element mesh. We follow, in particular, the Robbins-Monro strategy [39,41,44] of progressively reducing the step-size to achieve convergence of the Stochastic Gradient iterations. These Stochastic Gradient (SG) techniques have been extensively applied to machine learning problems [14,16,19,33], but have not yet been used much for PDE-constrained optimization under uncertainty. Here, we show that a Stochastic Gradient method improves the complexity of the conjugate gradient (or equivalent) method applied to the fully discretized OCP by a logarithmic factor. Although the computational gain is not dramatic, we see potential in this approach as only one state problem and one adjoint problem have to be solved at every iteration of the gradient method. Moreover, we believe that the whole construction is more amenable to an adaptive version, which, in combination with an appropriate error estimator, allows for a self-controlling algorithm. We leave this for future work, but mention the related recent work [20] on mesh refinement approaches in the context of a stochastic gradient method for PDE-constrained OCPs subject to uncertainties.
The rest of the paper is organized as follows. In Section 2 we introduce the optimal control problem under uncertainty and recall its well-posedness as well as the corresponding optimality conditions. In Sections 3-5 we then introduce the Finite Element discretization, the Monte Carlo approximation, and the gradient-based methods, respectively, including their full error analysis. In particular, Theorem 5.2 in Section 5 provides an error bound for the conjugate gradient (or equivalent) method applied to the fully discrete OCP, whereas Corollary 5.4 gives the corresponding computational complexity. In Section 6 we analyze the Stochastic Gradient method with fixed finite element discretization over the iterations (with error bound given in Thm. 6.1 and the corresponding complexity result in Cor. 6.3), whereas in Section 7 we analyze the Stochastic Gradient version in which the Finite Element mesh is refined over the iterations (the main result being stated in Thm. 7.1 and Cor. 7.3). In Section 8, we discuss a 2D test problem and confirm numerically the theoretical error bounds and complexities derived in the preceding Sections. Finally, in Section 9 we draw some conclusions.

Problem setting
We start by introducing the state problem that will be part of the OCP discussed in the following. Specifically, we consider the problem of finding the solution y : D × Γ → R of the elliptic random PDE
−div(a(x, ω) ∇y(x, ω)) = f(x, ω)  in D,    y(x, ω) = 0  on ∂D,    (2.1)
where D ⊂ R^d is open and bounded, denoting the physical domain, (Γ, ℱ, P) is a complete probability space, and ω ∈ Γ is an elementary random event. The diffusion coefficient a is an almost surely (a.s.) continuous and positive random field on D, and f is a possibly stochastic source term (which could include a deterministic control term).
Finally, as we will occasionally need H^2-regularity in the following Sections, we also introduce the following sufficient conditions on the domain D and on the gradient of a.
Then, using standard regularity arguments for elliptic PDEs, one can prove the following result [18,21].
Lemma 2.4. Let Assumptions 2.1 and 2.3 hold. If f ∈ L^2(Γ, L^2(D)), then problem (2.1) has a unique solution y ∈ L^2(Γ, H^2(D) ∩ H^1_0(D)). Moreover, there exists a constant C > 0, independent of f, such that ‖y‖_{L^2(Γ,H^2(D))} ≤ C ‖f‖_{L^2(Γ,L^2(D))}. We are now ready to introduce the optimal control problem linked with the PDE (2.1), which we will study in the rest of the paper.

Optimal control problem
We define the state problem for the OCP as the elliptic PDE (2.1), by specializing its right-hand side to f + u, with source term f ∈ L^2(D) and control u ∈ L^2(D); we refer to the resulting problem as (2.2). Hereafter, we use the notation U = L^2(D) to denote the set of all admissible (deterministic) control functions, and V = H^1_0(D) to denote the state space of solutions to (2.2). To emphasize the dependence of the solution of the PDE on the control function u and on a particular realization a(·, ω) of the random field, we will use the notation y_ω(u). When the particular realization of a is not relevant, or when no confusion arises, we will also simply write y(u) at times. In this work, we focus on the objective functional J defined in (2.3), namely the expected squared L^2(D) misfit between the state y(u) and a given target function z, which we would like the state to approach as closely as possible in a mean-square-error sense, plus the regularization term β‖u‖^2_{L^2(D)}. The coefficient β > 0 is a constant of the problem that models the price of energy, i.e., how expensive it is to add some energy to the control in order to decrease the first, distance, term. The ultimate goal then is the unconstrained OCP of determining the optimal control u^⋆ such that
u^⋆ ∈ arg min_{u ∈ U} J(u),   s.t.   y(u) ∈ V solves (2.2) a.s.   (2.4)
It is noteworthy that the random PDE setting (2.2) above could be generalized by using a formalism similar to the one presented in the recent work [37]. In particular, two of the authors of the present work consider there a general class of OCPs for which the randomness may appear in the source term or in the control-to-state operator. Furthermore, that formalism may allow for randomness to appear in the boundary conditions. As we aim at minimizing the objective functional J, we will use the theory of optimization and calculus of variations. Specifically, we introduce the optimality condition for the OCP (2.4), in the sense that the optimal control u^⋆ satisfies
⟨∇J(u^⋆), v⟩ = 0   ∀v ∈ U.   (2.5)
Here, ∇J(u) denotes the L^2(D) functional representation of the Gâteaux derivative of J at u ∈ U. Existence and uniqueness results for the OCP (2.4) can be obtained as a particular case of the more general results in, e.g., the work [31]. We state the result in the next Lemma, without proof. We can thus rewrite the OCP (2.4) equivalently as the minimization of J over U, with the state y_ω(u) ∈ V solving the weak form of (2.2) for a.e. ω ∈ Γ; we refer to this formulation as (2.10).

Finite Element approximation in physical space
In this section we introduce the semi-discrete OCP obtained by approximating the underlying PDE by a Finite Element (FE) method, and we recall an a-priori error bound on the optimal control. Let us denote by {T_h}_{h>0} a family of regular triangulations of D. Furthermore, let V_h be the space of continuous piecewise polynomial functions of degree k over T_h that vanish on ∂D, i.e. V_h = { v ∈ C^0(D̄) : v|_K ∈ P_k(K) ∀K ∈ T_h, v|_{∂D} = 0 } ⊂ V = H^1_0(D). Finally, we set U_h = V_h. We can then reformulate (2.10) as a finite dimensional OCP in the FE space, with discrete state y_h(u_h) ∈ V_h satisfying
(a(·, ω) ∇y_h(u_h), ∇v_h)_{L^2(D)} = ⟨u_h + f, v_h⟩   ∀v_h ∈ V_h,  for a.e. ω ∈ Γ.   (3.1)
Remark 3.1. The choice U_h = V_h is natural for this problem. In fact, one could consider the OCP in which the PDE is discretized in V_h, whereas the control u ∈ U is not discretized. It is not difficult to show that the optimal control for such an OCP is actually finite dimensional and belongs to V_h, thus leading to the equivalent formulation (3.1).
For the discrete OCP (3.1) we have well-posedness and optimality results analogous to those stated in Lemma 2.5 for the continuous problem.
Lemma 3.2. The discrete OCP (3.1) admits a unique solution u_h^⋆ ∈ U_h, and the discrete gradient ∇J_h can be characterized in terms of the solution p_h(u_h) of the FE adjoint problem, in complete analogy with the continuous case. Following similar arguments as in Theorems 3.4 and 3.5 of [27], and using the optimality conditions together with the weak forms of the state and adjoint problems, we can prove the following. Lemma 3.3. Let u^⋆ be the optimal control, solution of problem (2.10), and denote by u_h^⋆ the solution of the discrete OCP (3.1). Then the error u^⋆ − u_h^⋆ is controlled by the FE approximation errors of the state and adjoint problems; in particular, there exists a constant C > 0, independent of h, such that the bounds (3.3) and (3.4) hold. Proof. The result in the deterministic case is detailed in Theorems 3.4 and 3.5 of [27]. We can thus write the inequalities (3.3) and (3.4) for almost every realization ω ∈ Γ, and then take the expectation to conclude.
The FE error ‖u^⋆ − u_h^⋆‖ on the optimal control is thus completely determined by the FE approximation properties of the state and adjoint problems. The next result also follows by standard arguments (see, e.g., [27]) and shows that, for smooth data, the L^2 error ‖u^⋆ − u_h^⋆‖ on the optimal control converges at rate h^{k+1}.
Lemma 3.4. Let Assumptions 2.1-2.3 hold and suppose that y(u^⋆), p(u^⋆) ∈ L^2(Γ, H^{k+1}(D)). Then the error bound (3.6) holds, i.e. E[‖u^⋆ − u_h^⋆‖^2] ≲ h^{2(k+1)}. In view of the analysis that will be presented later, we also state a Lipschitz continuity and a strong convexity result for the functional J(u, ω), for a.e. ω ∈ Γ, as well as for its discrete version J_h(u_h, ω). Proofs of these results can be found in [36]. Lemma 3.5 (Lipschitz continuity). For the elliptic problem (2.2) and J(u, ω) as in (2.3), the gradient u ↦ ∇J(u, ω) is Lipschitz continuous with a Lipschitz constant L that involves the Poincaré constant C_P of the domain D. For the corresponding Finite Element approximation leading to (3.1) the same inequality holds, with the same constant, for any h and a.e. ω ∈ Γ. Lemma 3.6 (Strong convexity). For the elliptic problem (2.2) and J(u, ω) as in (2.3), the functional u ↦ J(u, ω) is strongly convex with a convexity constant c, for a.e. ω ∈ Γ. The same estimate holds for the corresponding Finite Element approximation that yields (3.1).

Approximation in probability
In this section we consider the semi-discrete (i.e., approximation in probability only) optimal control problem obtained by replacing the exact expectation E[·] in (2.3) by a suitable quadrature formula Ê[·]. Specifically, for a random variable X, we let Ê[X] = Σ_{i=1}^N ζ_i X(ω_i) be a quadrature formula, where ζ_i denote the quadrature weights and ω_i ∈ Γ the quadrature points. The semi-discrete problem then consists in minimizing the quadrature-based functional Ĵ over U, subject to: y_{ω_i}(u) ∈ V and
(a(·, ω_i) ∇y_{ω_i}(u), ∇v)_{L^2(D)} = ⟨u + f, v⟩   ∀v ∈ V,  i = 1, . . . , N.   (4.1)
The quadrature formula Ê[·] could either be based on deterministic quadrature points or on randomly distributed points, leading, in the latter case, to a Monte Carlo type approximation. In the following, we detail the case of a Monte Carlo type quadrature, whereas the case of a deterministic Gaussian-type quadrature is addressed in Appendix A. It is noteworthy that, although we present the results only for the semi-discrete problem (i.e., continuous in space, discrete in probability) for the sake of notation, they extend straightforwardly to the fully discrete problem in probability and physical space. Indeed, the fully discrete problem is obtained by replacing the (spatial) functions and the corresponding function spaces in (4.1) by their finite dimensional Finite Element approximations.
In the case of a Monte Carlo (MC) approximation, the quadrature formula reads Ê_MC[X] = (1/N) Σ_{i=1}^N X(ω_i), where {ω_i}_{i=1}^N is a collection of independent and identically distributed (i.i.d.) points drawn randomly on Γ according to the probability measure P. We recall that the use of MC type approximations might be advantageous over a quadrature/collocation approach in cases where the state and adjoint solutions are rough or highly oscillatory with respect to the random parameters, which is, for example, the case when a(·, ·) is a rough random field and/or has a short correlation length. Moreover, the Monte Carlo quadrature formula has only positive weights, which is an important feature to guarantee that the approximate functional Ĵ preserves the strong convexity property. We stress that, when using a Monte Carlo quadrature formula, the optimal control û^⋆ is a stochastic function since it depends on the i.i.d. random points {ω_i}_{i=1}^N. The next theorem gives an error bound on the approximate optimal control of the OCP (4.1).
Theorem 4.1. Let û^⋆ be the optimal control of problem (4.1) with Ê = Ê_MC, and let u^⋆ be the exact optimal control of the continuous problem (2.10). Then E[‖û^⋆ − u^⋆‖^2] ≲ 1/N. Proof. The two optimality conditions for the OCPs (2.10) and (4.1) read ⟨∇J(u^⋆), v⟩ = 0 and ⟨∇Ĵ(û^⋆), v⟩ = 0 for all v ∈ U. Choosing v = û^⋆ − u^⋆ in (4.2) and subtracting the two optimality conditions, we obtain (4.3). The first term on the right-hand side of (4.3) can be bounded by the statistical error of the Monte Carlo estimator of the gradient, which concludes the proof.
Theorem 4.1 shows that the semi-discrete optimal control û^⋆ converges to u^⋆ at the usual MC rate of 1/√N in the root mean squared sense.

Numerical solution of the fully discrete problem
We now focus on a class of optimization methods to approximate the fully discrete minimization problem obtained by using the Monte Carlo estimator to approximate the expectation in (3.1) and a FE approximation of the state and adjoint equations, as discussed in the previous two sections. That is, here we consider the fully discrete OCP (5.1). The constraints in (5.1) can be written in algebraic form as A(ω_i) y_i = M(f + u), i = 1, . . . , N, where u ∈ R^{n_h} is the vector of the n_h degrees of freedom corresponding to the control u_h ∈ U_h, y_i ∈ R^{n_h} is the vector of degrees of freedom corresponding to the finite element state solution y_h(u_h, ω_i) ∈ V_h, M ∈ R^{n_h×n_h} is the FE mass matrix (with a slight abuse of notation, f also denotes the vector of degrees of freedom of the source term), and A(ω_i) ∈ R^{n_h×n_h} is the FE stiffness matrix corresponding to the diffusion coefficient a(·, ω_i).
Defining suitable block matrices and vectors, and denoting by p_i the solutions of the adjoint systems A(ω_i)^T p_i = M(y_i − z), with z the vector of degrees of freedom of the target function, the optimality condition for (5.1) reads β u + (1/N) Σ_{i=1}^N p_i = 0, which leads to a coupled linear system (5.2) in the state, adjoint and control unknowns. By eliminating the state and adjoint unknowns, this can be recast into a linear system (5.3) in the control variable only, which can, for example, be solved by an iterative method such as gradient or conjugate gradient type methods, as sketched below. At each iteration, a matrix-vector multiplication with the (implicitly defined) system matrix involves the solution of N state and N adjoint equations, resulting in 2N PDE solves per iteration.
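The elimination just described lends itself to a matrix-free implementation. The following minimal sketch (in Python) illustrates this structure under several assumptions: the per-sample stiffness matrices A_i, the lumped mass matrix (taken here as the identity so that the reduced operator stays symmetric and plain CG applies), the target z, the source f and the regularization weight beta are all hypothetical toy quantities, not the actual problem of Section 8. Each application of the reduced operator performs the N state and N adjoint solves mentioned above.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def reduced_system(A_list, M, z, f, beta):
    # Matrix-free reduced operator u -> beta*u + (1/N) * sum_i p_i(u); each application
    # requires N state and N adjoint solves (A_i symmetric, so one factorization serves both).
    N, n = len(A_list), M.shape[0]
    solves = [spla.factorized(A.tocsc()) for A in A_list]   # factorizations reused over CG iterations

    def apply_op(u):
        out = beta * u
        for solve in solves:
            y = solve(M @ u)             # state solve (part of the rhs that depends on u)
            p = solve(M @ y)             # adjoint solve
            out = out + p / N
        return out

    b = np.zeros(n)                       # right-hand side, carries the data f and the target z
    for solve in solves:
        y0 = solve(M @ f)
        p0 = solve(M @ (y0 - z))
        b -= p0 / N
    return spla.LinearOperator((n, n), matvec=apply_op), b

# Hypothetical toy data: N small random SPD tridiagonal "stiffness" matrices.
rng = np.random.default_rng(0)
n, N, beta = 50, 8, 1e-2
M = sp.identity(n, format="csr")          # lumped mass matrix, identity for simplicity
A_list = [sp.diags([-1.0, 2.0 + rng.random(), -1.0], [-1, 0, 1], shape=(n, n), format="csr")
          for _ in range(N)]
z = np.sin(np.pi * np.linspace(0.0, 1.0, n))
f = np.zeros(n)
op, b = reduced_system(A_list, M, z, f, beta)
u_hat, info = spla.cg(op, b, maxiter=500)  # 2*N "PDE" solves per CG iteration
print("CG exit flag:", info, " ||u||:", np.linalg.norm(u_hat))

Pre-factorizing each A_i once makes the per-iteration cost dominated by the 2N triangular solves, mirroring the operation count used in the complexity analysis below.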
Gradient and conjugate gradient iterations are both examples of solution schemes that offer exponential convergence in the number of iterations. As an alternative to the reduced formulation (5.3), one could keep the coupled system (5.2) in its saddle-point form (5.4), which could be solved, e.g., by GMRES or MINRES iterative methods. In the next Theorem 5.2 we analyze the complexity of the fully discrete problem in terms of the computational work needed to achieve a given tolerance. Instead of particularizing the result to one specific iterative solver, we make the assumption that an exponentially convergent iterative solver is employed. More precisely: Assumption 5.1. Let û_h^⋆ be the exact solution to (5.1). Furthermore, let û_h^K denote the approximate solution to the OCP (5.1) obtained after K iterations of the iterative solver used to solve (5.1). We assume that the chosen iterative solver satisfies ‖û_h^K − û_h^⋆‖ ≤ C_1 e^{−rK} for some constants C_1, r > 0 that are independent of h and N.
This assumption is sound since the condition number of the matrix in (5.3) can be bounded uniformly in h and N and scales as β^{−1}. Similarly, the system (5.4) can be optimally preconditioned, so that the exponential convergence rate does not depend on the discretization parameters.
Based on Assumption 5.1 concerning the iterative solver, we now provide an error bound for the approximate solution û_h^K, as a function of all the discretization parameters N, h and K.
Theorem 5.2. Let u^⋆ be the solution of the optimal control problem (2.10). Moreover, let û_h^K be the K-th iterate of a linear solver applied to (5.3) and suppose that the solver satisfies Assumption 5.1. Then, under the assumptions of Lemma 3.4, there exist constants C_1, C_2, C_3 > 0, independent of h and N, such that the error bound (5.6) holds. Proof. The global error can be decomposed into the three contributions ‖û_h^K − û_h^⋆‖, ‖û_h^⋆ − u_h^⋆‖ and ‖u_h^⋆ − u^⋆‖. The first term quantifies the convergence of the finite dimensional optimization algorithm, which is exponential w.r.t. the iteration number K thanks to Assumption 5.1.
The second term E[‖û_h^⋆ − u_h^⋆‖^2] accounts for the standard MC error and can be controlled as in Theorem 4.1 (applied to the FE approximation), leading to a contribution of order 1/N. Finally, the term E[‖u_h^⋆ − u^⋆‖^2] can be controlled by the result in Lemma 3.4, namely by a contribution of order h^{2(k+1)}, so that the claim follows.
We conclude this Section by analyzing the computational complexity of solving the fully discrete OCP (5.1), or equivalently (5.3), using an exponentially convergent iterative solver. We assume that the state and adjoint problems, using a triangulation with mesh size h, can be solved in computational time C_h ≲ h^{−γd}. Here, γ ∈ [1, 3] is a parameter representing the efficiency of the linear solver used (e.g., γ = 3 for a direct solver and γ = 1 + ε, with ε > 0 arbitrarily small, for an optimal multigrid solver), while d is the dimension of the physical space. Hence the overall computational work of K gradient iterations with a sample of size N is proportional to W ≃ 2 K N h^{−γd}.
Remark 5.3. We briefly recall the asymptotic boundedness notation "≲" (also known as (a.k.a.) the Landau big O(·) notation) as well as "≃", a.k.a. "of the same order", which will be useful for stating the complexity results that follow. That is, we write A ≲ B if there exists a constant C > 0, independent of the discretization parameters, such that A ≤ C B, and A ≃ B if both A ≲ B and B ≲ A hold.
Corollary 5.4. To achieve a given tolerance tol in a root mean squared sense, i.e. to guarantee that E[‖û_h^{K_max} − u^⋆‖^2] ≲ tol^2, the total required computational work is bounded by W ≲ tol^{−2−γd/(k+1)} |log(tol)|.
Proof. To achieve a tolerance tol, we can equidistribute the precision tol^2 over the three terms in (5.6). This leads to the choices K_max ≃ |log(tol)|, N ≃ tol^{−2} and h ≃ tol^{1/(k+1)}. Hence the total cost for computing a solution û_h^{K_max} that achieves the required tolerance is W ≃ 2 K_max N h^{−γd} ≃ tol^{−2−γd/(k+1)} |log(tol)|, as claimed.
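To make the parameter selection of Corollary 5.4 concrete, the following short sketch (Python) equidistributes tol^2 over the three error contributions and evaluates the resulting work model W ≃ 2 K N h^{−γd}. The constants C1, C2, C3, r are placeholders, and the bound is assumed, for illustration only, to take the form C1·exp(−2rK) + C2/N + C3·h^{2(k+1)}; none of these values are those of the actual test problem.

import math

def saa_parameters(tol, C1=1.0, C2=1.0, C3=1.0, r=1.0, k=1, d=2, gamma=1.0):
    # Equidistribute tol^2 over the three terms of the (assumed) bound
    # C1*exp(-2*r*K) + C2/N + C3*h^(2*(k+1)) <= tol^2.
    target = tol ** 2 / 3.0
    K = max(1, math.ceil(math.log(C1 / target) / (2.0 * r)))   # solver iterations ~ |log tol|
    N = math.ceil(C2 / target)                                  # MC sample size ~ tol^-2
    h = (target / C3) ** (1.0 / (2 * (k + 1)))                  # mesh size ~ tol^(1/(k+1))
    work = 2 * K * N * h ** (-gamma * d)                        # 2 PDE solves per sample and iteration
    return K, N, h, work

for tol in (1e-1, 1e-2, 1e-3):
    K, N, h, W = saa_parameters(tol)
    print(f"tol={tol:.0e}: K={K:3d}  N={N:8d}  h={h:.2e}  W~{W:.2e}")

With the default values (gamma = 1, d = 2, k = 1) the printed work grows like tol^(-3)|log tol|, which is the scaling stated in Corollary 5.4.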
6. Stochastic gradient with fixed mesh size.
In the previous Section, we have considered an approach to approximately solving the OCP (2.10) in which the Monte Carlo sample size is fixed a-priori, based on accuracy requirements, the sample is generated once and for all, and the coupled system (5.2) is then solved by an iterative scheme, each iteration involving the solution of N primal and N adjoint problems. As an alternative, in this section we consider an approach based on stochastic optimization ideas. Specifically, we will use randomized methods known in the literature as Stochastic Approximation (SA) or Stochastic Gradient (SG) methods [16,40,41,46,47]. At the basis of these methods is a steepest descent algorithm to tackle the optimization problem. For example, the classic version of such a method, the so-called Robbins-Monro method, works as follows. Within the steepest descent algorithm the exact gradient ∇J = ∇E[J(·, ω)] = E[∇J(·, ω)] is replaced by a single evaluation ∇J(·, ω_j), where the random realization ω_j is re-sampled independently at each iteration of the steepest-descent method:
u_{j+1} = u_j − τ_j ∇J(u_j, ω_j).   (6.1)
Here, τ_j is the step-size of the algorithm (also called learning rate), which, in the usual approach, decreases as 1/j over the iterations.
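The qualitative behaviour of the Robbins-Monro recursion (6.1) can be illustrated on a toy strongly convex problem in place of the PDE-constrained one. The sketch below (Python) uses a hypothetical quadratic objective, an illustrative strong convexity constant mu and step-size scale tau0, none of which correspond to the paper's setting; averaging over repetitions, the mean squared error decays like 1/j, which is the rate established in this section.

import numpy as np

rng = np.random.default_rng(1)
dim, mu, J_max, n_rep = 20, 0.5, 2000, 100
u_star = rng.standard_normal(dim)            # known minimizer of the toy problem
tau0 = 1.5 / mu                              # step-size scale, taken large enough w.r.t. mu

def stoch_grad(u, omega):
    # Unbiased gradient of j(u, omega) = (mu/2)*||u - u_star||^2 + omega . u, with E[omega] = 0.
    return mu * (u - u_star) + omega

mse = np.zeros(J_max)
for _ in range(n_rep):                       # average over repetitions to estimate E||u_j - u*||^2
    u = np.zeros(dim)
    for j in range(1, J_max + 1):
        omega = rng.standard_normal(dim)     # fresh, independent sample at every iteration
        u = u - (tau0 / j) * stoch_grad(u, omega)
        mse[j - 1] += np.sum((u - u_star) ** 2) / n_rep

for j in (10, 100, 1000, 2000):              # mse[j-1] ~ C/j, i.e. RMS error ~ 1/sqrt(j)
    print(j, mse[j - 1])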
Alternatively, the single evaluation ∇J(·, ω_j) can be replaced by a sample average over N_j i.i.d. realizations (so-called mini-batches [12,15,17]) drawn at every iteration, independently of the previous iterations. More precisely, letting ω_j^{(1)}, . . . , ω_j^{(N_j)} be i.i.d. samples distributed according to P, we define the recursion (6.2), in which the descent direction is the usual Monte Carlo estimator of the gradient based on a sample of size N_j at iteration j. Notice that the Robbins-Monro method is a special case of this scheme, namely with N_j = 1 for all j. In what follows, we investigate optimal choices of the sequences {τ_j}_j and {N_j}_j, and the overall computational complexity of the corresponding algorithm. We first analyze the convergence of the Stochastic Gradient algorithm (6.2) when applied to the OCP (2.10) in the continuous setting, i.e. with no Finite Element discretization. The proof of the next theorem follows closely the general one in Section 5.9 of [47], although here we do not assume uniform boundedness of E[‖∇J(u, ·)‖^2] with respect to u, nor do we project the control onto a bounded set at each iteration, which leads to slight technical modifications. Theorem 6.1. Let u^⋆ be the solution of the continuous OCP (2.10) and denote by u_j the j-th iterate of (6.2). Then the recursive error bound (6.3) holds, relating E[‖u_{j+1} − u^⋆‖^2] to E[‖u_j − u^⋆‖^2].
Theorem 6.2. Let u_j^h denote the j-th iterate of the SG recursion (6.4) applied to the FE-discretized OCP (3.1), with step-size τ_j = τ_0/j, τ_0 > 1/c, and fixed sample size N_j = N̄. Then, under the assumptions of Lemma 3.4, there exist constants C_1, C_2 > 0, independent of h and j, such that
E[‖u_j^h − u^⋆‖^2] ≤ C_1/j + C_2 h^{2(k+1)}.   (6.5)
Proof. It follows from Theorem 6.1 that the factor in (6.3) can be computed explicitly for the particular choice (τ_j, N_j) = (τ_0/j, N̄). Next we use the recursive formula (6.3) and denote, as in Section 3, by u_h^⋆ the exact optimal control for the FE problem defined in (3.1); we emphasize that (3.1) has no approximation in the probability space. Setting e_j = E[‖u_j^h − u_h^⋆‖^2] and applying (6.3) to the sequence of Finite Element iterates, we obtain a bound for e_j consisting of two terms, A and B. For the first term A, computing its logarithm yields an explicit bound. For the second term B in (6.6) we can proceed in the same way, which shows that B decays at the rate 1/j provided τ_0 > 1/c. Eventually, we obtain an upper bound of order 1/j, with two constants C_3 > 0 and C_4 > 0 independent of h and j. From the condition τ_0 > 1/c, we conclude the bound on the first term in (6.5), with the constant C_1 possibly depending on ‖u_0^h − u_h^⋆‖. Finally, splitting the error ‖u_j^h − u^⋆‖ into the contributions ‖u_j^h − u_h^⋆‖ and ‖u_h^⋆ − u^⋆‖, and using (3.6) in Lemma 3.4 to bound the second term, the claim follows.
Algorithm 1 contains a detailed pseudo-code description of the SG method (6.4) with a fixed FE mesh size ℎ applied to the OCP (3.1).

Algorithm 1: Stochastic Gradient algorithm with fixed mesh size.
Data: Given a desired tolerance tol, choose τ_0 such that 1 < c τ_0, J_max ≃ tol^{−2}, and h ≃ tol^{1/(k+1)}.
We conclude this section by analyzing the complexity of the method described in Algorithm 1. Corollary 6.3. To achieve a given tolerance tol in a root mean squared sense, i.e. to guarantee that E[‖u_{J_max}^h − u^⋆‖^2] ≲ tol^2, the total required computational work is bounded by W ≲ tol^{−2−γd/(k+1)}. Here, we recall that the state and adjoint problems can be solved, using a triangulation with mesh size h, in computational time C_h ≲ h^{−γd}, γ ∈ [1,3], and k is the degree of the polynomial FE space.
Proof. To achieve a tolerance tol^2 for the error E[‖u_{J_max}^h − u^⋆‖^2], we can equidistribute the precision tol^2 over the two terms in (6.5). This leads to the choices J_max ≃ tol^{−2} and h ≃ tol^{1/(k+1)}. The cost of solving one deterministic PDE with the FE method is proportional to h^{−γd}. Hence the total cost for computing a solution u_{J_max}^h that achieves the required tolerance is W ≃ 2 N̄ J_max h^{−γd} ≃ tol^{−2−γd/(k+1)}, as claimed.
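The parameter selection behind Corollary 6.3 can be written down in the same spirit as the SAA sketch of Section 5 (again with illustrative placeholder constants C1, C2 for the two-term bound (6.5)); comparing the printed values with those of the earlier sketch makes the absence of the logarithmic factor visible.

import math

def sg_fixed_mesh_parameters(tol, C1=1.0, C2=1.0, k=1, d=2, gamma=1.0, Nbar=1):
    # Split tol^2 over the two terms of (6.5): C1/j + C2*h^(2*(k+1)) <= tol^2.
    target = tol ** 2 / 2.0
    J_max = math.ceil(C1 / target)                 # SG iterations ~ tol^-2
    h = (target / C2) ** (1.0 / (2 * (k + 1)))     # fixed mesh size ~ tol^(1/(k+1))
    work = 2 * Nbar * J_max * h ** (-gamma * d)    # ~ tol^(-2 - gamma*d/(k+1)), no |log tol| factor
    return J_max, h, work

for tol in (1e-1, 1e-2, 1e-3):
    J_max, h, W = sg_fixed_mesh_parameters(tol)
    print(f"tol={tol:.0e}: J_max={J_max:8d}  h={h:.2e}  W~{W:.2e}")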
Remark 6.4. We have also investigated other choices for (τ_j, N_j) in the SG method (6.4). However, among the cases considered, we have found that (τ_j, N_j) = (τ_0/j, N̄) leads to the best complexity. For example, we have studied the SG method with step-size τ_j = τ_0/j, c τ_0 > 1, and an MC sample size increasing as N_j ∼ N_0 j^{cτ_0−1} across the iterations. With this choice the estimate in (6.5) becomes (6.9), which leads to the choice J_max ≃ tol^{−2/(cτ_0)} |log(tol)|^{1/(cτ_0)} and a final complexity which is analogous to that of the full optimization algorithm discussed in Section 5. The proof of the bound (6.9) is detailed in Appendix B for completeness.
Remark 6.5. In general convex stochastic optimization problems, the constant c may be challenging to estimate in practice, which makes it difficult to fulfill the condition τ_0 > 1/c. To bypass this difficulty, one could consider the so-called Averaged Stochastic Gradient method [46] instead, in which the step size τ_j = τ_0/j^α, α ∈ (0, 1), is chosen with N_j = N̄, and the averaged control (1/J) Σ_{j=1}^J u_j is considered. The analysis of this alternative method is postponed to a future work. We remark, however, that in our setting the constant c is directly related to the regularization parameter, namely c = 2β, so the need for averaging the control is not so compelling. Table 1 summarizes the complexity results for the SG method in both the fixed sample size and increasing sample size regimes, as well as the complexity of a linear solver (e.g., CG) applied to the fully discretized OCP; in each case the total work W(tol) needed to achieve a given tolerance tol is reported. We see from the table that the SG version with fixed sample size (second column) improves the complexity only by a logarithmic factor compared to the MC method in conjunction with a linear solver for the optimization problem (first column). The advantage we see in this SG version compared to the MC method with a linear solver is that we do not have to fix the sample size in advance and can, instead, simply monitor the convergence of the SG iterations until a prescribed tolerance is reached. However, in Algorithm 1, we do have to choose the FE mesh size in advance. It is therefore natural to look at a further variation of the SG algorithm in which the FE mesh is refined during the iterations until a prescribed tolerance is reached. This is detailed in the next Section.

Stochastic gradient with variable mesh size
In this section, we discuss a variant of the stochastic gradient (SG) method that also refines the FE mesh size across the iterations of the steepest descent optimization routine. That is, the mesh size h_j now depends on the current iteration j. Here we study only sequences of nested meshes of size h_j = 2^{−ℓ(j)}, with ℓ : N → N a non-decreasing function. The complete SG procedure with decreasing FE mesh size then reads as in (7.1), with the batch (ω_j^{(1)}, . . . , ω_j^{(N_j)}) re-sampled at every iteration. Notice that if non-nested meshes are used across the iterations, a projection operator has to be added in (7.1) to transfer information from one mesh to the next. We first derive a recurrence formula for the error in the spirit of (6.3).
Theorem 7.1. Let u_{j+1}^{h_{j+1}} be the approximate control obtained from the SG method with variable mesh size (7.1), and let u^⋆ be the exact control of the continuous optimal control problem (2.10). Then, under the assumptions of Lemma 3.4, the recurrence formula (7.2) holds, where c and L are the convexity constant and the Lipschitz constant of J, respectively, and C > 0 is a constant that depends on the H^{k+1}-semi-norm of y(u^⋆) and p(u^⋆).
Proof. Subtracting the optimal continuous control u^⋆ from both sides of the recurrence formula (7.1), and introducing auxiliary terms as in the proof of Theorem 6.1, we can rewrite the resulting identity and compute the mean of its squared norm, which produces ten terms on the right-hand side. We then bound each of these ten terms. First, the term τ_j^2 E[‖e_1‖^2] can be bounded as in the proof of Theorem 6.1, leading to a bound involving L_h, the Lipschitz constant of the discrete functional J_h, which is itself bounded by L (see Lem. 3.5). The term τ_j^2 E[‖e_3‖^2] is bounded using the FE error estimates of Lemma 3.4, while for τ_j^2 E[‖e_2‖^2] we use the same steps as in Theorem 6.1; the second contribution on the right-hand side can be further bounded uniformly with respect to h by the same argument used for e_3. Finally, the cross terms are bounded using the strong convexity of J and proceeding as in Theorem 6.1. Putting everything together, we finally obtain (7.2), as claimed.
A natural choice to tune the parameters τ_j, N_j and h_j, guided by the usual Robbins-Monro theory, is to set τ_j = τ_0/j and N_j = N̄, and to choose the mesh level ℓ(j) by balancing all terms on the right-hand side of (7.2). This leads to the following. Theorem 7.2. Choosing τ_j = τ_0/j, N_j = N̄ and ℓ(j) = ⌈ log_2(j) / (2(k+1)) ⌉ (up to an additive constant, so that h_j^{2(k+1)} ≃ 1/j), and assuming τ_0 > 1/c, we have E[‖u_j^{h_j} − u^⋆‖^2] ≤ C_1/j, for a suitable constant C_1 independent of j.
Proof. With the choices of τ_j, N_j and ℓ(j) in the statement of the theorem, the last two terms in the inequality (7.2) have the same order j^{−2}. Then, we apply the same reasoning as in Theorem 6.2 to conclude the proof.
Algorithm 2 details the SG Robbins-Monro method with variable mesh size (7.1) applied to the OCP (2.10).
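The structure of such a variable-mesh iteration is sketched below on a hypothetical one-dimensional surrogate problem (Python): a 1D Poisson state equation with a random scalar coefficient, P1 elements on nested uniform grids, illustrative values of beta and tau0, and a mesh schedule mimicking h_j^{2(k+1)} ≃ 1/j from Theorem 7.2. It is only an illustration of the mesh-refinement and control-transfer mechanics, not the FreeFem++ implementation used in Section 8.

import math
import numpy as np

beta, tau0, k_fe = 0.5, 1.5, 1                      # illustrative values, not those of Section 8
rng = np.random.default_rng(0)

def grid(level):
    n = 2 ** level                                  # h = 2^-level, interior P1 nodes only
    return np.linspace(0.0, 1.0, n + 1)[1:-1], 1.0 / n

def solve_1d(rhs_nodal, a_omega, h):
    # P1 FE solve of -(a y')' = r, y(0) = y(1) = 0: stiffness (a/h)*tridiag(-1,2,-1), lumped mass h*I.
    n = rhs_nodal.size
    K = (a_omega / h) * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
    return np.linalg.solve(K, h * rhs_nodal)

def prolongate(u_coarse):
    # Piecewise-linear interpolation of interior nodal values onto the nested, once-refined grid.
    full = np.concatenate(([0.0], u_coarse, [0.0]))
    fine = np.empty(2 * full.size - 1)
    fine[0::2] = full
    fine[1::2] = 0.5 * (full[:-1] + full[1:])
    return fine[1:-1]

def level_of(j, l0=3):
    # Nondecreasing schedule with h_j^(2(k+1)) ~ 1/j, i.e. l(j) = l0 + ceil(log2(j)/(2(k+1))).
    return l0 + (0 if j == 1 else math.ceil(math.log2(j) / (2 * (k_fe + 1))))

lev = level_of(1)
x, h = grid(lev)
u, z = np.zeros(x.size), np.sin(np.pi * x)          # control iterate and target function
for j in range(1, 201):
    new_lev = level_of(j)
    while new_lev > lev:                            # refine the mesh and transfer the control
        u, lev = prolongate(u), lev + 1
        x, h = grid(lev)
        z = np.sin(np.pi * x)
    a_omega = 1.0 + 0.5 * rng.uniform(-1.0, 1.0)    # fresh coefficient sample (N_j = 1)
    y = solve_1d(u, a_omega, h)                     # state solve (source f = 0)
    p = solve_1d(y - z, a_omega, h)                 # adjoint solve
    u = u - (tau0 / j) * (beta * u + p)             # Robbins-Monro update with gradient estimate beta*u + p
print("final level:", lev, " discrete L2 norm of control:", math.sqrt(h * float(np.sum(u ** 2))))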
As J_max ≃ tol^{−2}, we finally bound the computational work by W ≲ Σ_{j=1}^{J_max} 2 N̄ h_j^{−γd} ≃ J_max^{1+γd/(2(k+1))} ≃ tol^{−2−γd/(k+1)}. Notice that the asymptotic complexity remains the same as for the Stochastic Gradient algorithm with fixed mesh size (6.4); cf. Corollary 6.3. However, as the SG method with variable mesh size performs its first iterations on coarser meshes, we nonetheless expect a practical improvement due to a reduction of the proportionality constant. We will assess such improvement by numerical experimentation in the next Section.
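The reduction of the proportionality constant can be quantified directly from the cost model: assuming the illustrative schedule ℓ(j) = ℓ_0 + ⌈log_2(j)/(2(k+1))⌉ used in the sketch above and a per-solve cost proportional to h^{−γd}, the short computation below (Python; a rough back-of-the-envelope estimate, not a statement about the actual runtimes of Section 8) compares the cumulative work of the variable-mesh iteration with that of running all iterations on the final, finest mesh.

import math

def level(j, k=1, l0=3):
    return l0 + (0 if j == 1 else math.ceil(math.log2(j) / (2 * (k + 1))))

def work_variable(J, k=1, d=2, gamma=1.0, l0=3):
    # sum over iterations of 2 * h_j^(-gamma*d) with h_j = 2^(-l(j))
    return sum(2.0 * (2.0 ** level(j, k, l0)) ** (gamma * d) for j in range(1, J + 1))

def work_fixed(J, k=1, d=2, gamma=1.0, l0=3):
    # all J iterations carried out on the finest mesh reached at the end
    return J * 2.0 * (2.0 ** level(J, k, l0)) ** (gamma * d)

for J in (100, 1000, 10000):
    print(J, work_variable(J) / work_fixed(J))

The printed ratio stays well below one, illustrating the constant-factor gain while the asymptotic rate is unchanged.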

Numerical results
In this section we verify the assertions of Theorems 5.2, 6.2 and 7.2, as well as the computational complexity derived in the corresponding Corollaries 5.4, 6.3 and 7.3. Specifically, we illustrate the order of convergence for the three versions of the iterative optimization method presented in Sections 5-7, respectively. For this purpose, we consider problem (2.2) in the domain D = (0, 1)^2, with source term f = 1 and a random diffusion coefficient parametrized by four independent uniform random variables; Figure 2 shows three typical realizations of the random field. The target function has been chosen as z(x_1, x_2) = sin(x_1) sin(x_2) (see Fig. 1b) and we have taken β = 10^{−4} in the objective function J(u) in (2.3). For the FE approximation, we have considered a structured triangular grid of size h (see Fig. 1a), where each side of the domain is divided into 1/h sub-intervals, and used piecewise linear finite elements (i.e. k = 1). All computations have been performed using the FE library FreeFem++ [26]. This relatively simple setting, with only 4 uniform random variables, has been chosen to be able to compute an accurate reference solution by a stochastic collocation method on a fine FE mesh.

Reference solution
To compute a reference solution of problem (2.2), we use a fully tensorized Gauss-Legendre (GL) quadrature formula with 9 points in each direction (i.e. with a total number of knots N = 9^4) and a fine triangulation with mesh size h = 2^{−9}; see, e.g., references [9,45] and Appendix A for estimates of the quadrature error. As this approximated problem with fixed Gauss nodes is now deterministic, we have used a stopping condition based on the norm of the gradient, and have chosen the conjugate gradient (CG) algorithm applied to the linear system (5.3) as the iterative optimization scheme. In Figure 3 we show the optimal control obtained after K = 16 iterations, when the stopping criterion ‖Ê_{GL(9,9,9,9)}[∇J(u_h^K)]‖ ≤ 10^{−8} was met, where u_h^K is the K-th CG iterate and Ê = Ê_{GL(9,9,9,9)} denotes the tensor Gaussian quadrature used in (4.1) to approximate the true expectation. The L^2-norm of the final control using this Gaussian quadrature is ‖û_{h=2^{−8}}^{K=16}‖ = 16.4128.
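For reference, a tensorized Gauss-Legendre rule of this kind can be assembled in a few lines. The sketch below (Python) uses a placeholder integrand and assumes the four random variables are uniform on [−1, 1]; the actual parametrization of the diffusion coefficient and the FE solves of Section 8 are not reproduced.

import numpy as np
from itertools import product

def tensor_gauss_legendre(n_1d=9, dim=4):
    # Nodes/weights of the n_1d-point Gauss-Legendre rule on [-1, 1], normalized so that the
    # weights define an expectation w.r.t. U([-1, 1])^dim; the tensor grid has n_1d^dim points.
    x, w = np.polynomial.legendre.leggauss(n_1d)
    w = w / 2.0                                           # weights now sum to 1
    nodes = np.array(list(product(x, repeat=dim)))
    weights = np.array([np.prod(c) for c in product(w, repeat=dim)])
    return nodes, weights

nodes, weights = tensor_gauss_legendre()                  # 9^4 = 6561 collocation points
g = lambda xi: np.exp(xi.sum())                           # placeholder for the per-sample quantity of interest
approx = float(np.dot(weights, [g(xi) for xi in nodes]))
exact = (0.5 * (np.e - 1.0 / np.e)) ** 4                  # E[exp(sum xi)] for xi ~ U(-1,1)^4
print(approx, exact, abs(approx - exact))

In a full computation, the placeholder g would be replaced by the per-node evaluation of the state/adjoint solves and of the gradient, exactly as in the MC case but with deterministic nodes and weights.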

Conjugate gradient on fully discretized OCP
We investigate here the convergence of the method described in Section 5, with the particular choice of the CG method applied to the linear system (5.3) as iterative solver. We recall that the error bound (5.6), in the case of piecewise linear FE (i.e. k = 1), consists of three contributions: a term decaying exponentially in the number K of CG iterations, with rate r and multiplicative constant C_1, a statistical term C_2/N, and a Finite Element term C_3 h^4; we refer to this instance of the bound as (8.2). For each tolerance tol, using formula (8.2), we compute the optimal mesh size h = h(tol), the optimal sample size in the MC approximation N_MC = N_MC(tol) and, finally, the minimum number of iterations required in the iterative optimization method, K = K(tol). To optimally balance the error contributions in (8.2), we further need to estimate the constants C_1, C_2, C_3 and r. This is detailed hereafter.
- In order to estimate the constant C_2, we used the same finest mesh as the one used to compute our reference solution, namely h = 2^{−9}, and ran the CG method for up to 20 iterations on the linear system (5.3) discretized by Monte Carlo with samples of increasing size N_MC = 2^0, 2^1, . . . , 2^{10}. For every sample size N_MC we repeated the simulation 10 times (with 10 independent MC samples) and averaged the final L^2(D) error on the control. We numerically found C_2 ≈ 0.430527. Figure 4 presents these results, where the mean and standard deviation (std) of the L^2 error have been approximated by sample averages using the 10 independent realizations.
- To estimate the constant C_3, we have discretized the OCP by a Gaussian quadrature with N = 2^4 knots (2 Gauss-Legendre points per random variable) and a sequence of decreasing mesh sizes h = 2^{−1}, . . . , 2^{−8}. The optimal control has been computed by sufficiently many CG iterations and the error estimated with respect to a reference solution with h = 2^{−9} and the same Gaussian quadrature with N = 2^4 knots. We found C_3 ≈ 436.516. Figure 5 shows the convergence of the error on the control (in the L^2-norm) versus the discretization parameter h. We observe a convergence rate of h^2, which is consistent with the theoretical result in (8.2).
- To estimate the constants C_1 and r, we have discretized the OCP by a Gaussian quadrature with N = 5^4 tensorized Gauss-Legendre knots and h = 2^{−9}. We have run the CG algorithm and recorded the L^2 error on the control, computed with respect to a converged solution, over the iterations. We found r = 4.941 and C_1 = 17.9140.
Figure 6 shows the computational complexity of the considered method, i.e. the fully discretized OCP solved by the CG algorithm, with optimally chosen parameters N_MC, h and K. Here we plot the mean L^2 error on the optimal control versus the computational cost model W = 2 K N_MC (1/h − 1)^2 (we assume here an ideal, optimal linear algebra solver, which achieves γ = 1 for the FE-discretized PDE). The mean error and its standard deviation have been estimated by repeating the whole procedure 20 times.
The observed slope is consistent with our theoretical result W ∼ tol^{−3}, up to logarithmic terms.
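The constants above were obtained by least-squares fits of observed errors against the corresponding discretization parameter. A generic recipe of this kind is sketched below (Python, with synthetic stand-in data rather than the measured values behind Figures 4-6, and under the illustrative assumption that the fitted quantities are squared errors): C_2 is fitted from a 1/N decay, C_3 from an h^{2(k+1)} decay, and the pair (C_1, r) from an exponential decay in the iteration counter.

import numpy as np

def fit_constant(params, sq_errors, exponent):
    # Least-squares fit of C in  err^2 ~ C * params**exponent  (linear fit in log space).
    return float(np.exp(np.mean(np.log(sq_errors) - exponent * np.log(params))))

def fit_exponential(iters, sq_errors):
    # Fit err^2 ~ C1 * exp(-2*r*K): regress log(err^2) on K.
    slope, intercept = np.polyfit(iters, np.log(sq_errors), 1)
    return float(np.exp(intercept)), float(-slope / 2.0)

rng = np.random.default_rng(0)
N = 2.0 ** np.arange(0, 11)
err2_mc = 0.43 / N * np.exp(0.05 * rng.standard_normal(N.size))        # synthetic MC errors
h = 2.0 ** -np.arange(1, 9)
err2_fe = 436.0 * h ** 4 * np.exp(0.05 * rng.standard_normal(h.size))  # synthetic FE errors (k = 1)
K = np.arange(1.0, 17.0)
err2_it = 17.9 * np.exp(-2.0 * 4.9 * K)                                # synthetic solver errors

print("C2 ~", fit_constant(N, err2_mc, -1.0))
print("C3 ~", fit_constant(h, err2_fe, 4.0))
print("(C1, r) ~", fit_exponential(K, err2_it))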

Stochastic gradient with fixed mesh size
We implement here the Stochastic Gradient method described in Section 6, using N̄ = 1 samples at each iteration (recall that the asymptotic complexity does not depend on N̄) and learning rate τ_j = τ_0/(j + 10), with τ_0 = 2. It is noteworthy that there may be applications for which a mini-batch approach with N̄ > 1 provides additional benefits. However, for the OCP considered in this work we did not observe any performance benefit from using N̄ > 1, so we used N̄ = 1 throughout. We have first assessed the convergence of the SG iterations on the FE-discretized OCP (3.1), using a mesh size h = 2^{−4}. The reference solution was computed using the same FE mesh size, a Gaussian quadrature with N = 5^4 knots to approximate the expectation, and CG iterations up to convergence. Figure 7 clearly shows the 1/√j convergence rate of the SG algorithm w.r.t. the iteration counter j, as predicted by Theorem 6.2. We have then studied the complexity of the SG algorithm (where the error is computed with respect to the solution of the continuous OCP (2.10)). For convenience, we recall the error bound (6.5): E[‖u_j^h − u^⋆‖^2] ≤ C_1/j + C_2 h^4. To optimally balance the two error contributions, we have estimated the constants C_1 and C_2 as described next.
- The constant C_1 can be inferred from the results in Figure 7. A least squares fit of the mean L^2 error versus the iteration counter gives C_1 ≈ 2.143.
- The constant C_2 is the same as the constant C_3 in (8.2), hence we kept the same estimate C_2 ≈ 436.5.
With these constants estimated, for a given required tolerance tol we can compute the required number of SG iterations j(tol) and the mesh size h(tol) that fulfill the accuracy requirement. Figure 8 shows the estimated mean L^2 error, using the SG Algorithm 1, as a function of the computational cost model W = 2 j (1/h − 1)^2. The slope is the one predicted in Corollary 6.3 (with γ = 1 and k = 1), namely W ∼ tol^{−3}.

Stochastic gradient with variable mesh size
We present here the results for the Stochastic Gradient method described in Section 7, with N̄ = 1 and learning rate τ_j = τ_0/(j + 15), with τ_0 = 2. The mesh refinement strategy over the iterations is the one described in Algorithm 2, with h_0 = 2^{−4}.
The rationale behind this choice of h_0 is to balance, in the early iterations, the Finite Element contribution in (7.2) with the Monte Carlo error, which behaves asymptotically, for j → ∞, like τ_j^2 times a constant. We have run the algorithm for J_max = 10 000 iterations and repeated the simulation 100 times to estimate the mean error and its standard deviation. Figure 9 shows the mean L^2 error, computed with respect to the same reference solution described at the beginning of this section, versus the iteration counter. Figure 10 shows, instead, the mean L^2 error at iteration j versus the computational cost model W_j = Σ_{i=1}^j 2 (1/h_i − 1)^2, as well as one particular realization of the algorithm. In both plots, the observed convergence rate of the mean error is consistent with the results in Theorem 7.2 and Corollary 7.3, respectively. In Figure 10 we have also added, for comparison purposes, the complexity results of the other two methods, namely CG on the fully discretized OCP and SG on the semi-discretized OCP, with optimal choices of the discretization parameters. It is clear from this plot that all methods perform very similarly for the problem at hand.

Conclusions
In this work, we have analyzed and compared the complexity of three gradient-based methods for the numerical solution of a risk-averse optimal control problem involving an elliptic PDE with random coefficients, where a Finite Element discretization is used to approximate the underlying PDEs and Monte Carlo sampling is used to approximate the expectation in the risk measure. The first version discretizes the OCP upfront, using a fixed finite element mesh and a fixed Monte Carlo sample, and then solves it by an efficient iterative method such as the Conjugate Gradient method. The second version is a Stochastic Gradient method in which the finite element discretization is still kept fixed over the iterations, but the expectation in the objective function is re-sampled independently at each iteration, with a small (fixed) sample size. Finally, the third version is again a Stochastic Gradient method, but now with successively refined FE meshes over the iterations. We have shown, in particular, that the Stochastic Gradient methods improve the computational complexity by logarithmic factors compared to applying the CG (or an equivalent) linear solver to the fully discretized OCP. Our complexity analysis is based on a-priori error estimates and a-priori choices of the FE mesh size, the Monte Carlo sample size, and the number of gradient iterations needed to obtain a prescribed tolerance. We stress that the complexity bound derived in Corollary 6.3 is tight: for a given control u, merely computing the value of the objective functional J(u) by MC and FE methods already has a complexity of tol^{−(MC-rate+FE-rate)}, where MC-rate = 2 and FE-rate = γd/(k+1), which is the asymptotic complexity stated in Corollary 6.3. One could potentially obtain a better complexity by considering higher order discretization schemes, either in space or in probability, by exploiting ideas from our recent work [37], or by considering a hierarchical/multigrid discretization in probability [38].
In addition to the improved complexity, another benefit of the stochastic gradient methods is that they are more amenable to adaptive versions, in which, e.g., the mesh size and possibly the Monte Carlo sample size are refined over the iterations based on suitable a-posteriori error indicators. The study of such adaptive versions is postponed to future work.
Another interesting direction for future work is the extension of stochastic gradient methods to more general risk measures. For example, we mention that Stochastic Gradient methods have already been used in combination with the CVaR risk measure [5], although not in the context of PDE-constrained optimal control problems. One way of introducing more general risk measures in our context could be to consider the objective functional obtained by replacing the expectation in (2.3) by a general risk measure R. In this work we have focused on the "mean-squared" risk measure R(·) = E[(·)^2], but this could be replaced by any other coherent and more risk-averse measure. Indeed, since every coherent risk measure satisfies the monotonicity and convexity assumptions, the objective functional u ↦ J(u) will be convex provided the solution map from u to y is convex. Therefore, the theory developed here could still be applied, thanks to the regularization term that ensures that the strong convexity assumption is satisfied.

Appendix A. Reference solution by Stochastic Collocation
In this appendix, we briefly describe the computation of the reference solution used in the numerical results of Section 8, which is based on a stochastic collocation method on a tensor grid of Gauss-Legendre points. Moreover, we provide an error estimate for such an approximation based on stochastic collocation. While the numerical example in Section 8 only depends on 4 random variables, here we show that the stochastic collocation approximation is exponentially convergent and that a highly accurate solution can be obtained with a moderate number of collocation points; recall that 9^4 points were used in the numerical experiments. We suppose that the quadrature used to approximate the expectation is not random, but uses deterministic points ω_i, for i = 1, . . . , N. The estimated optimal control û^⋆ is then deterministic as well. The following theorem then provides an error bound.
We consider the state and adjoint problems extended to the complex domain: for every z in a suitable complex region Σ_0, find y(·, z) ∈ H^1_0(D; C) such that the corresponding weak formulation holds, where a_min denotes the essential infimum over D of the (real part of the) diffusion coefficient. The next lemma states that both z ↦ y(·, z) and z ↦ p(·, z) are holomorphic functions in Σ_0 with uniform bounds on Σ. The result for z ↦ y(·, z) is well known and can be found in reference [11], for example, so that we only give the proof for z ↦ p(·, z). Proof. It is well known (see, e.g., [11]) that the function z ↦ y(·, z) is holomorphic on Σ_0 with the bound (A.7). This property translates to the adjoint function z ↦ p(·, z) ∈ H^1_0(D; C), which is holomorphic in Σ_0 as well, with an analogous bound. Based on this last regularity result and following [4], we can state the following error estimate for the quadrature error.

Appendix B. Proof for increasing Monte Carlo sampling in SG
Here we detail the proof of the bound (6.9) in Remark 6.4. In that case the factor in (6.3) can again be computed explicitly, for τ_j = τ_0/j and N_j ∼ N_0 j^{cτ_0−1} with cτ_0 − 1 > 0. We use the recursive formula (6.3) and denote, as before, by u_h^⋆ the exact optimal control for the FE problem defined in (3.1); we emphasize that (3.1) has no approximation in probability space. Setting e_j = E[‖u_j^h − u_h^⋆‖^2] and taking logarithms as in the proof of Theorem 6.2, one shows that e_j decays at the rate stated in (6.9), provided τ_0 > 1/c. Eventually, we obtain an upper bound with two constants C_3 > 0 and C_4 > 0; splitting the total error as before and using (3.6) to bound the second term, the claim follows.