Convergence bounds for empirical nonlinear least-squares

We consider best approximation problems in a nonlinear subset $\mathcal{M}$ of a Banach space of functions $(\mathcal{V},\|\bullet\|)$. The norm is assumed to be a generalization of the $L^2$-norm for which only a weighted Monte Carlo estimate $\|\bullet\|_n$ can be computed. The objective is to obtain an approximation $v\in\mathcal{M}$ of an unknown function $u \in \mathcal{V}$ by minimizing the empirical norm $\|u-v\|_n$. We consider this problem for general nonlinear subsets and establish error bounds for the empirical best approximation error. Our results are based on a restricted isometry property (RIP) which holds in probability and is independent of the nonlinear least squares setting. Several model classes are examined where analytical statements can be made about the RIP and the results are compared to existing sample complexity bounds from the literature. We find that for well-studied model classes our general bound is weaker but exhibits many of the same properties as these specialized bounds. Notably, we demonstrate the advantage of an optimal sampling density (as known for linear spaces) for sets of functions with sparse representations.


Introduction, Scope, Contributions
We consider the problem of estimating an unknown function $u$ from noiseless observations. For this problem to be well-posed, some prior information about $u$ has to be assumed, which often takes the form of regularity assumptions. To make this notion precise, we assume that $u$ is an element of some Banach space of functions $(\mathcal{V}, \|\bullet\|)$ that can be well approximated in a given nonlinear subset (or model class) $\mathcal{M} \subseteq \mathcal{V}$. The approximation error is measured in the norm
$$\|v\| := \Big(\int_Y |v|_y^2 \,\mathrm{d}\rho(y)\Big)^{1/2},$$
where $Y$ is some Borel subset of $\mathbb{R}^d$, $\rho$ is a probability measure on $Y$ and $|\bullet|_y$ is a $y$-dependent seminorm for which the integral above is finite for all $v \in \mathcal{V}$. This norm is a generalization of the $L^2(Y,\rho)$- and $H^1_0(Y,\rho)$-norms, which are induced by the seminorms $|v|_y^2 = |v(y)|^2$ and $|v|_y^2 = \|\nabla v(y)\|_2^2$, respectively. We characterize any best approximation $u_{\mathcal{M}}$ in $\mathcal{M}$ by
$$u_{\mathcal{M}} \in \operatorname*{arg\,min}_{v \in \mathcal{M}} \|u - v\|.$$
In general, this approximation is not computable. We propose to approximate $u_{\mathcal{M}}$ by an estimator $u_{\mathcal{M},n}$ that is based on the weighted least-squares method, which replaces the norm $\|v\|$ by the empirical seminorm
$$\|v\|_n := \Big(\frac{1}{n}\sum_{i=1}^n w(y_i)\,|v|_{y_i}^2\Big)^{1/2}$$
for a given weight function $w$ and a sample set $\{y_i\}_{i=1}^n \subseteq Y$ with $y_i \sim w^{-1}\rho$. The weight function is a non-negative function $w \ge 0$ such that $\int_Y w^{-1}\,\mathrm{d}\rho = 1$. Any corresponding empirical best approximation $u_{\mathcal{M},n}$ in $\mathcal{M}$ is characterized by
$$u_{\mathcal{M},n} \in \operatorname*{arg\,min}_{v \in \mathcal{M}} \|u - v\|_n. \tag{1}$$
Given this definition, we can choose $w$ such that the theoretical rate of the convergence $\|u - u_{\mathcal{M},n}\| \xrightarrow{n\to\infty} \|u - u_{\mathcal{M}}\|$ is maximized. Note that changing the sampling measure from $\rho$ to $w^{-1}\rho$ is a common strategy to reduce the variance in Monte Carlo methods, referred to as importance sampling.
Since $\|\bullet\|$ is not computable in general, the best approximation error serves as a baseline for any numerical method based on a finite set of samples. We prove in this paper that the empirical best approximation error $\|u - u_{\mathcal{M},n}\|$ is equivalent to this baseline error with high probability.
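As an illustration of the estimator defined above, consider the simplest case of a linear model class, where the empirical minimization reduces to an ordinary least-squares problem. The following sketch assumes $\rho = \mathcal{U}([-1,1])$, $w \equiv 1$ and a polynomial model space; the target $u = \exp$ and all names are illustrative choices, not part of the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_best_approximation(u, basis, n, rng):
    """Minimize the empirical norm ||u - v||_n over the span of `basis`.
    Here rho = U([-1,1]) and w == 1, so y_i ~ w^{-1} rho is plain uniform
    sampling and the minimizer solves an ordinary least-squares problem."""
    y = rng.uniform(-1.0, 1.0, size=n)
    A = np.column_stack([b(y) for b in basis])   # design matrix A_ij = B_j(y_i)
    coef, *_ = np.linalg.lstsq(A, u(y), rcond=None)
    return coef

# illustrative linear model class: polynomials of degree < 4
basis = [lambda y, k=k: y**k for k in range(4)]
coef = empirical_best_approximation(np.exp, basis, n=200, rng=rng)

# compare the empirical estimator with u = exp on a test grid
t = np.linspace(-1.0, 1.0, 101)
v = sum(c * b(t) for c, b in zip(coef, basis))
max_err = np.max(np.abs(v - np.exp(t)))
print(max_err)  # small; dominated by the best-approximation error of cubics
```

For nonlinear model classes $\mathcal{M}$, only the solver changes; the sampling and the empirical norm remain as above.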
1.1. Structure. The remainder of the paper is organized as follows.
In Section 1.2 we aim to provide a brief overview of previous work and introduce the notion of the restricted isometry property (RIP). Based on the RIP, Section 2 develops the central results of this work. These are applied to some common model classes in Section 3. We begin by considering linear spaces in Section 3.1. Section 3.2 considers sets of sparse functions and Section 3.3 examines sets of low-rank functions. Finally, we investigate the influence of the seminorm $|\bullet|_y$ on the convergence in Section 4. We conclude in Section 5 with a discussion of the derived results and an outlook on future work.

Related work.
When $|v|_y = |v(y)|$ is used, $u_{\mathcal{M},n}$ is known as the nonlinear least squares estimator of $u$. The extensive interest in machine learning in recent years has led to the investigation of this estimator for special model classes like sparse vectors [2, 6, 7], low-rank tensors [4, 8–11] and neural networks [12, 13]. However, to the knowledge of the authors, no investigation for general model classes has been published so far. This may be due to the fact that sparse vectors and low-rank tensors were the first model classes for which rigorous theories were developed and that most of these works focus on $\ell^1$- and nuclear norm minimization. Our work may be regarded as an extension of these works (in particular of infinite-dimensional compressed sensing [5, 14]) to the nonlinear least-squares setting. For a more in-depth discussion of statistical learning theory we refer to the articles [15, 16] and the monographs [17, 18]. For linear spaces, the first estimate in Theorem 2.12 has already appeared in [1] for weighted least squares and in [19–21] for standard least squares. A convergence bound for the nonlinear least squares approximation problem was recently analysed in [10]. However, the probability of that bound failing increases exponentially as the best approximation error $\|(1-P)u\|$ approaches zero and becomes one when $(1-P)u$ vanishes. Moreover, the bound only holds for model classes that are bounded in $L^\infty$, and it does not provide any insight into which property of the model set influences the convergence rate.
The empirical approximation problem (1) was thoroughly examined in [1] for linear model spaces. There the model class $\mathcal{M}$ is assumed to be the $m$-dimensional subspace $V_m$ spanned by an orthonormal basis $\{B_j\}_{j\in[m]}$, and the central condition is that the empirical Gram matrix $G_n$ with entries $(G_n)_{jk} := \frac{1}{n}\sum_{i=1}^n w(y_i)\,B_j(y_i)\,B_k(y_i)$, which is the Monte Carlo estimate of the exact Gram matrix $I_m$, is close to $I_m$. This condition is in fact equivalent to the norm equivalence
$$(1-\delta)\,\|v\|^2 \le \|v\|_n^2 \le (1+\delta)\,\|v\|^2 \qquad\text{for all } v \in V_m. \tag{2}$$
Cohen and Migliorati [1] prove that under suitable conditions the norm equivalence (2) is satisfied with high probability.
Equation (2) can be seen as a generalized restricted isometry property. The notion of a RIP was introduced in the context of compressed sensing [6]. It expresses the well-posedness of the problem by ensuring that $\|\bullet\|_n$ is indeed a norm on $\mathcal{M}$ and equivalent to $\|\bullet\|$. Minimizing the error w.r.t. $\|\bullet\|_n$ thus also minimizes the error w.r.t. $\|\bullet\|$. In compressed sensing of sparse vectors [6, 7] and low-rank tensors [4], discrete analogues of (2) are employed to derive bounds for the corresponding reconstruction errors. A recent work which generalizes the RIP from [1] to sparse grid spaces is [22].
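For a linear space the norm equivalence (2) can be checked empirically via the Monte Carlo Gram matrix: the smallest $\delta$ for which (2) holds is the spectral deviation $\|G_n - I_m\|_2$. A minimal sketch, assuming $\rho = \mathcal{U}([-1,1])$, $w \equiv 1$ and the orthonormalized Legendre basis (the concrete numbers are illustrative, not taken from [1]):

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(1)

def empirical_gram(n, m, rng):
    """Monte Carlo estimate G_n of the Gram matrix of the first m orthonormal
    Legendre polynomials w.r.t. rho = U([-1,1]) and w == 1.
    B_k = sqrt(2k+1) * P_k is orthonormal on L^2([-1,1], dx/2)."""
    y = rng.uniform(-1.0, 1.0, size=n)
    B = np.column_stack([np.sqrt(2 * k + 1) * legendre.legval(y, [0] * k + [1])
                         for k in range(m)])
    return (B.T @ B) / n

m = 5
delta = np.linalg.norm(empirical_gram(20000, m, rng) - np.eye(m), ord=2)
print(delta)  # a small spectral deviation means (2) holds for this sample set
```

As the text notes, this verification strategy is only available for linear spaces; for nonlinear model sets the RIP has to be established probabilistically.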
In this paper we extend the cited results to more general norms and nonlinear model sets by directly bounding the probability of the restricted isometry property
$$\mathrm{RIP}_A(\delta):\qquad (1-\delta)\,\|v\|^2 \le \|v\|_n^2 \le (1+\delta)\,\|v\|^2 \quad\text{for all } v \in A.$$
We prove that under some conditions on $n$ and $A$ this RIP holds with high probability and show that these conditions are satisfied for a variety of model classes. We then use the RIP to provide quasi-optimality guarantees for the empirical best approximation in Theorem 2.12.

In Remark 2.4 we note that it suffices to consider conic model sets. Optimizing over these sets is not straightforward. In [23], appropriate RIP constants for exact recovery of conic model sets using a suitable regularizer are derived.

Main Result
To measure the rate of convergence with which $\|v\|_n$ approaches $\|v\|$ as $n$ tends to $\infty$, we introduce the variation constant
$$K(A) := \sup_{v \in A} \|v\|_{w,\infty}^2, \qquad \|v\|_{w,\infty}^2 := \operatorname*{ess\,sup}_{y \in Y} w(y)\,|v|_y^2.$$
This constant constitutes a uniform upper bound of $\|v\|_n^2$ for all realizations of the empirical norm $\|\bullet\|_n$ and all $v \in A$. We usually omit the dependence on the choice of $w$, $|\bullet|_y$ and $Y$; when a distinction between different choices of these parameters is necessary, we add subscripts to $K$. The constant $K$ is a fundamental parameter in many concentration inequalities that are used to provide bounds for the rate of convergence of the quadrature error.

Definition 2.1 (Quadrature error). The quadrature error of the empirical norm $\|\bullet\|_n^2$ on the model set $A \subseteq \mathcal{V}$ is defined by
$$\sup_{v \in A} \big|\,\|v\|_n^2 - \|v\|^2\,\big|.$$
This error is closely related to the RIP through the normalization operator $U$, which maps $v \mapsto v/\|v\|$. This relation is developed rigorously in the subsequent lemma.

Lemma 2.3 (Equivalence of RIP and a bounded quadrature error). For any set $A$,
$$\mathrm{RIP}_A(\delta) \quad\Longleftrightarrow\quad \sup_{v \in U(A)} \big|\,\|v\|_n^2 - 1\,\big| \le \delta.$$
Proof. Note that $\|0\|_n = \|0\|$, $\|\alpha u\|_n = |\alpha|\,\|u\|_n$ for all $u \in A$, and $\|u\| = 1$ for all $u \in U(A)$. Therefore, the equivalence follows for $\delta > 0$. □
The proof of this lemma can be found in Appendix A. With the preceding preparations we can derive a central result (Theorem 2.7).

Proof. By Lemma 2.3 it suffices to bound the quadrature error on $U(A)$. Lemma 2.6 provides a bound for the probability of the complementary event. □

As an example, consider $V = L^2([-1,1], \frac{\mathrm{d}x}{2})$ with the weight function $w \equiv 1$ and let $P_k$ denote the $k$-th Legendre polynomial. Let moreover $n \in \mathbb{N}$ and $m \ge \frac{n^2}{2} + n - \frac{1}{2}$. Then the $1$-dimensional manifold $\operatorname{span}\{P_m\}$ has a larger variation constant than the $n$-dimensional manifold $\operatorname{span}\{P_k\}_{k\in[n]}$. We refer to Example 3.1 for the computation of these variation constants.

Proof. To obtain $\mathrm{RIP}_{U(A)}(\delta)$ with a probability of $1-p$ it suffices that $n$ satisfies the stated bound. □

Linear spaces, sparse vectors and low-rank tensors all satisfy the requirements of this corollary with $M$ depending linearly on the number of parameters of the model [4, 12, 24]. The corollary states that in these cases $n \in \mathcal{O}(MG)$, where the factor $G := \ln(K)K^2$ represents the variation of $\|\bullet\|_n$ on $\mathcal{M}$. If $K$ is independent of $M$, this means that $n$ depends only linearly on $M$.

Remark 2.11. An interpretation of Corollary 2.10 is that the variation constant $K$ is of greater importance than the covering number $\nu$, which enters the bound on the sample complexity only logarithmically.
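The Legendre comparison above can be verified numerically. A sketch, assuming that for a linear space with orthonormal basis $\{B_k\}$ and $w \equiv 1$ the variation constant equals $\sup_y \sum_k B_k(y)^2$ (here $B_k = \sqrt{2k+1}\,P_k$ on $L^2([-1,1], \frac{\mathrm{d}x}{2})$):

```python
import numpy as np
from numpy.polynomial import legendre

def variation_constant(ks, grid):
    """K(U(span{P_k : k in ks})) for rho = U([-1,1]) and w == 1,
    approximated as the grid maximum of sum_k (2k+1) * P_k(y)^2."""
    B2 = sum((2 * k + 1) * legendre.legval(grid, [0] * k + [1]) ** 2 for k in ks)
    return B2.max()

grid = np.linspace(-1.0, 1.0, 100001)
n = 4
K_low = variation_constant(range(1, n + 1), grid)  # span{P_1, ..., P_n}
m = n * n // 2 + n                                 # smallest m >= n^2/2 + n - 1/2
K_high = variation_constant([m], grid)             # span{P_m} alone
print(K_low, K_high)  # approx. 24 (= n^2 + 2n) and 25 (= 2m + 1)
```

Both suprema are attained at the endpoints $y = \pm 1$, where $P_k(\pm 1)^2 = 1$, which explains the closed-form values.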
Proof. Equation (3) holds since $\|v\|_n \le \|v\|_{w,\infty}$ is satisfied for all $v \in \mathcal{V}$ and in particular for $v = u - u_{\mathcal{M}}$. □

In the linear setting, the empirical Gramian could be used to verify this RIP for a given sample set. In the nonlinear setting this is not possible. To obtain a practical indicator for the convergence of our method we make the following considerations.

We can use this bound as an indicator for when $\mathrm{RIP}_A(\delta)$ holds. To do this we select a test set of $\tilde n$ samples and observe the test set error $\tilde e_n := \|u - u_{\mathcal{M},n}\|_{\tilde n}$ as the number of training samples $n$ is increased. When $\tilde e_n$ begins to decrease with a rate of $(1 + n^{-r})$ we take this as an indication that $\mathrm{RIP}_A(\delta)$ is satisfied and that additional sampling is unnecessary. This is illustrated in Figure 1.
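A minimal sketch of this stopping heuristic; the threshold form $c\,(1 + n^{-r})\,e$, the reference error `e_ref` and all names are illustrative assumptions on how the criterion could be implemented:

```python
import numpy as np

def rip_indicator(errors, e_ref, c=1.0, r=1.0):
    """Flag the sample sizes n for which the observed test error e_n has
    dropped to within a factor c * (1 + n**-r) of the reference error e_ref,
    taken as an indication that RIP_A(delta) holds and that additional
    sampling is unnecessary.  `errors` maps sample size n to test error e_n."""
    return [n for n, e in sorted(errors.items()) if e <= c * (1 + n**-r) * e_ref]

# synthetic error history: the error stagnates at the best-approximation
# error 0.1 once enough samples are available
history = {10: 1.0, 20: 0.5, 40: 0.13, 80: 0.101, 160: 0.1005}
print(rip_indicator(history, e_ref=0.1))  # [80, 160]
```

In practice `e_ref` is unknown; the heuristic in the text instead monitors the decay *rate* of $\tilde e_n$, which avoids the need for a reference value.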
Figure 1. Let $V = L^2([-1,1], \frac{\mathrm{d}x}{2})$, $w \equiv 1$ and $\mathcal{M}$ be the model space of polynomials of degree less than $m = 10$. Let moreover $A$, $e_n$ and $e$ be defined as in Remark 2.16. Depicted is the distribution of the random variable $e_n/e - 1$ for different values of $n$ and a synthetic (but fixed) function $u$. The hatched area on the left marks a range of $n$ where the approximation problem is underdetermined and any error can be reached. When $n \ge m$ the approximation problem has a unique solution in the least squares sense. From this point until the gray dashed line, an exponential decay of the error can be observed. This decay results from the exponentially fast convergence of the probability for $\mathrm{RIP}_A(\delta)$ w.r.t. $n$. From there on, $\mathrm{RIP}_A(\delta)$ holds with high probability and the error decays with a rate of $n^{-1}$. Remark 2.16 predicts a rate of $n^{-1/2}$, but the condition $e_n \le c(1 + n^{-r})$ is satisfied for $c = r = 1$. This faster decay can be explained by the fact that for the linear space $\mathcal{M}$ the bounds in the proof of Theorem 2.7 are suboptimal (see Example 3.1).

and the perturbed empirical best approximation

Examples and numerical illustrations
In this section, we examine some exemplary model spaces to which the developed theory can be applied. More specifically, we consider linear spaces, sparse vectors and tensors of fixed rank. The following theorem is central to the further considerations.

Theorem 3.1. Assume that $b(y) := \sup_{v \in U(\mathcal{M})} |v|_y^2$ is measurable with respect to $y \in Y$. Then for any weight function $w$,
$$K(U(\mathcal{M})) = \|w b\|_{L^\infty(Y,\rho)} \ge \|b\|_{L^1(Y,\rho)},$$
where the lower bound is attained by the weight function $w = \|b\|_{L^1(Y,\rho)}\, b^{-1}$.

Proof. See Appendix B. □
This theorem allows us to analyse the seminorm and the model class independently of the choice of weight function, which can then be chosen optimally once these first two parameters are fixed.
Here, the second equality follows by orthonormality and the third by the Cauchy-Schwarz inequality. From this, Theorem 3.1 implies $K(U(V_m)) \ge m$, where the optimal weight function is given by $w(y) := m\,\|B(y)\|_2^{-2}$ with $B(y) := (B_1(y),\dots,B_m(y))$. Note that this fact was already reported in [1]. Using Corollary 2.10 then bounds the sample complexity of this model class by $n \in \mathcal{O}(m^3\ln(m))$. Although our approach is more general, the resulting asymptotic bound differs only by a factor of $m^2$ from the bound $n \in \mathcal{O}(m\ln(m))$ provided in [1]. The near optimal bound in [1] is obtained by using tighter concentration inequalities (cf. [25]) when bounding the probability of $\mathrm{RIP}_{V_m}(\delta)$ in Theorem 1.1.
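The effect of the optimal weight function $w(y) = m\,\|B(y)\|_2^{-2}$ can be observed numerically by comparing empirical Gram matrices under standard and optimal sampling. A sketch for the orthonormalized Legendre basis; the rejection sampler and all constants are illustrative assumptions:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(2)
m = 8

def B(y):
    # orthonormal Legendre basis of polynomials of degree < m on L^2([-1,1], dx/2)
    return np.column_stack(
        [np.sqrt(2 * k + 1) * legendre.legval(y, [0] * k + [1]) for k in range(m)]
    )

def gram_deviation(y, w):
    # spectral deviation ||G_n - I|| of the weighted empirical Gram matrix
    By = B(y)
    G = (By * w[:, None]).T @ By / len(y)
    return np.linalg.norm(G - np.eye(m), ord=2)

n = 2000
# (a) standard sampling: y ~ rho = U([-1,1]) with w == 1
y_std = rng.uniform(-1, 1, n)
d_std = gram_deviation(y_std, np.ones(n))

# (b) optimal sampling: density w^{-1} drho proportional to ||B(y)||_2^2 <= m^2,
#     drawn by rejection sampling, with importance weight w(y) = m / ||B(y)||_2^2
cand = rng.uniform(-1, 1, 50 * n)
norms2 = (B(cand) ** 2).sum(axis=1)
y_opt = cand[rng.uniform(0, m * m, cand.size) < norms2][:n]
d_opt = gram_deviation(y_opt, m / (B(y_opt) ** 2).sum(axis=1))
print(d_std, d_opt)  # the optimal density typically yields the smaller deviation
```

The improvement reflects the drop of the variation constant from $m^2$ (uniform sampling, suprema at $y = \pm 1$) to $m$ (optimal sampling).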

Remark 3.2.
When the sampling density cannot be changed, the variation constant can also be used to guide the choice of a suitable model class. For linear spaces, this section shows that an optimal model space is spanned by an orthonormal basis whose basis functions are bounded by $1$. Such spaces are characterized in [26], and a prime example is the Fourier basis of $L^2([-1,1], \frac{\mathrm{d}x}{2}; \mathbb{C})$.

Sets of sparse functions.
In this section we follow the ideas of [2] and consider spaces with weighted sparsity constraints. For any sequence $\omega \in \mathbb{R}_{\ge 0}^{\mathbb{N}}$ and any subset $S \subseteq \mathbb{N}$, define the weighted cardinality and the weighted $\ell^0$-seminorm by
$$\omega(S) := \sum_{j \in S} \omega_j^2 \qquad\text{and}\qquad \|v\|_{\omega,0} := \omega(\operatorname{supp} v).$$
Observe that $\omega \le \tilde\omega$ (i.e. $\omega_j \le \tilde\omega_j$ for all $j$) implies $\omega(S) \le \tilde\omega(S)$ and that $\omega(S) = |S|$ for $\omega \equiv 1$.
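A minimal sketch of these two definitions; the squared weights in $\omega(S)$ are an assumption chosen to match the scaling $\mathcal{M}_{\omega,s} = \mathcal{M}_{c\omega,\,c^2 s}$ used later in this section:

```python
import numpy as np

def weighted_cardinality(omega, S):
    """omega(S) := sum_{j in S} omega_j^2 (the squared weights are an
    assumption matching the scaling M_{omega,s} = M_{c*omega, c^2*s})."""
    return sum(omega[j] ** 2 for j in S)

def weighted_l0(omega, v, tol=0.0):
    """||v||_{omega,0} := omega(supp v), the weighted l0-seminorm."""
    support = [j for j, vj in enumerate(v) if abs(vj) > tol]
    return weighted_cardinality(omega, support)

omega = np.array([1.0, 2.0, 3.0])
print(weighted_cardinality(np.ones(3), {0, 1, 2}))  # 3.0: omega == 1 recovers |S|
print(weighted_l0(omega, [0.5, 0.0, -1.0]))         # 10.0: supp = {0, 2}, 1 + 9
```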

Let in the following $\{B_j\}_{j\in\mathbb{N}}$ be a fixed orthonormal basis for $V := L^2(Y,\rho)$, fix a weight function $w$ and define the model set $\mathcal{M}_{\omega,s} := \{v \in V : \|v\|_{\omega,0} \le s\}$.

Proof. The first four assertions are trivial. To prove the last one, let $v \in \mathcal{M}_{\omega,s}$ and use the triangle inequality, the Cauchy-Schwarz inequality, $\|v\|_{\omega,0} \le s$ and the orthonormality of $\{B_j\}_{j\in\mathbb{N}}$. □

Proof. This follows directly from Lemma 3.3. □

Remark 3.5. This setting also incorporates the standard sparsity class $\mathcal{M}_{1,s}$, which is obtained for the constant weight sequence $\omega \equiv 1$.
When the chosen basis is a tensor product basis $B_j = B_{j_1} \otimes \cdots \otimes B_{j_M}$ and the weight function has a product structure $w = w_1 \otimes \cdots \otimes w_M$, this implies that $K(U(\mathcal{M}_{1,k}))$ grows exponentially with the order $M$. This is a limitation when using classical isotropic sparsity for high-dimensional problems.

Lemma 3.6. Let $\omega_j \ge \|B_j\|_{w,\infty}$ for all $j$ and let $V_m$ be an $m$-dimensional subspace spanned by a subset of $\{B_j\}_{j\in\mathbb{N}}$. Then there exists $C > 0$ such that

Proof. We proceed in two steps. For the first step, let $\{v_j\}$ be the centers of a $\|\bullet\|$-covering of $U(\mathcal{M}_{\omega,s} \cap V_m)$ with radius $r$. This implies that $\{v_j\}$ are also the centers of an $\|\bullet\|_{w,\infty}$-covering with a radius proportional to $r$.
For the second step, observe that $\mathcal{M}_{\omega,s} \subseteq \mathcal{M}_{1,s} = \mathcal{M}_{1,\lfloor s\rfloor}$. Since $(V_m, \|\bullet\|) \cong (\mathbb{R}^m, \|\bullet\|_2)$, it remains to compute the covering number of the unit sphere of $\lfloor s\rfloor$-sparse vectors in $\mathbb{R}^m$, for which a bound is given in [24]. □

Proof. The assertion follows directly from Theorem 2.7 together with Lemmas 3.4 and 3.6. □
According to Theorem 3.1, the sampling density and weight function can be chosen optimally for a given model set. For $\mathcal{M}_{\omega,s}$ this is not straightforward, because Lemma 3.7 bounds $K(U(\mathcal{M}_{\omega,s})) \le s$ independently of $w$ as long as $\omega_j \ge \|B_j\|_{w,\infty}$. Note however that this bound is not unique, since $\mathcal{M}_{\omega,s} = \mathcal{M}_{c\omega,\,c^2 s}$ for any $c > 0$. This means that $K(U(\mathcal{M}_{\omega,s})) \le c^2 s$ for any $c$ that satisfies $c\,\omega_j \ge \|B_j\|_{w,\infty}$, and the smallest possible $c$ is given by $c_{\min} := \sup_{j\in\mathbb{N}} \omega_j^{-1}\|B_j\|_{w,\infty}$. An optimal weight function for the model class $\mathcal{M}_{\omega,s}$ must thus minimize $c_{\min}$. If we assume that $\omega \equiv 1$, then $c_{\min}^2 = \|w\tilde b\|_{L^\infty(Y,\rho)}$ with $\tilde b(y) := \sup_{j\in\mathbb{N}} |B_j(y)|^2$. From Theorem 3.1 we know that the minimum $\|w\tilde b\|_{L^\infty(Y,\rho)} = \|\tilde b\|_{L^1(Y,\rho)}$ is attained for the weight function $\tilde w := \|\tilde b\|_{L^1(Y,\rho)}\,\tilde b^{-1}$. An upper bound for $\tilde b$, and thus for $c_{\min}$, is computed in the subsequent Lemma 3.10. The resulting sequence $\|B_j\|_{\tilde w,\infty}$ is contrasted with the sequence $\|B_j\|_{w,\infty}$ for $j = 1,\dots,100$ in Figure 2. We observe that the new weight function $\tilde w$ slightly increases $\|B_1\|_{\tilde w,\infty}$ but considerably decreases $\|B_j\|_{\tilde w,\infty}$ for all $j > 1$. A reconstruction using the new weight function is shown in Figure 3e. Since the new constraint $\omega_j \ge \|B_j\|_{\tilde w,\infty}$ is significantly weaker than the previous constraint $\omega_j \ge \|B_j\|_{w,\infty}$ for $j > 1$, one might ask what happens if the weight sequence $\omega$ is adapted as well. Figure 3f illustrates this for the smallest possible weight sequence $\omega_j := \|B_j\|_{\tilde w,\infty}$. Since this new weight sequence is almost constant (cf. Figure 2), the resulting model class approximates the larger model class $\mathcal{M}_{1,s}$. This means that we cannot expect the results in Figure 3f to be better than those in Figure 3e. We observe however that they are indeed better than those in Figure 3c, where the model class $\mathcal{M}_{1,s}$ was used.
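The construction of the optimal weight function $\tilde w = \|\tilde b\|_{L^1}\,\tilde b^{-1}$ can be reproduced numerically for the Legendre basis. A sketch on a grid; the grid-based suprema and the integral approximation are rough illustrative choices:

```python
import numpy as np
from numpy.polynomial import legendre

grid = np.linspace(-1.0, 1.0, 20001)
J = 20
# orthonormal Legendre basis B_j = sqrt(2j+1) * P_j on L^2([-1,1], dx/2)
Bj = np.stack([np.sqrt(2 * j + 1) * legendre.legval(grid, [0] * j + [1])
               for j in range(J)])

b = (Bj ** 2).max(axis=0)   # b~(y) = sup_j |B_j(y)|^2 (grid approximation)
b_l1 = b.mean()             # ||b~||_{L^1(rho)} for rho = dx/2 on the uniform grid
w_tilde = b_l1 / b          # optimal weight w~ = ||b~||_{L^1} * b~^{-1}

norm_unw = np.sqrt((Bj ** 2).max(axis=1))           # ||B_j||_{infty}  (w == 1)
norm_w = np.sqrt((w_tilde * Bj ** 2).max(axis=1))   # ||B_j||_{w~,infty}
print(norm_unw[-1], norm_w[-1])  # sqrt(2J-1) vs. a value capped by sqrt(||b~||_1)
```

By construction $\tilde w\,B_j^2 = \|\tilde b\|_{L^1}\,B_j^2/\tilde b \le \|\tilde b\|_{L^1}$ pointwise, so the weighted sup-norms are nearly constant in $j$, mirroring the behaviour reported for Figure 2.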
Proof. The assertion follows by the Cauchy-Schwarz inequality. □

Since $\tilde b$ for $\mathcal{M}_{\omega,s} \cap V_m$ is bounded by $\tilde b$ for the larger model set $\mathcal{M}$, we also obtain an estimate for the larger set.
The theory presented in this subsection can easily be generalized to dictionary learning (cf. [27, 28]). This is stated without proof in the following theorem.

Theorem 3.11. Assume that $\{B_j\}_{j\in\mathbb{N}}$ is a Riesz sequence satisfying
and that $\omega$ is chosen such that $\omega_j \ge \|B_j\|$ for all $j$.

Sets of low-rank functions.
Let $\rho_1,\dots,\rho_M$ be distributions on $\mathbb{R}^m$ and consider $\rho := \rho_1 \otimes \cdots \otimes \rho_M$. This problem occurs, for example, whenever one tries to approximate a low-rank function of $M$ variables using a tensor product basis. A special case of this setting is the problem of tensor completion, where a tensor has to be recovered from a few of its entries. In this problem all distributions $\rho_k$ have to be discrete measures on the standard basis vectors.
In both problems the task is to find a best approximation in a subset $\mathcal{T}_r \subseteq \mathcal{V}$ of tensors of bounded rank $r$. For tensors, however, there exist many different concepts of rank, for which we refer to [29–32] and the works cited below.

Figure 3. Subplot 3c displays the results of unweighted $\ell^1$-minimization ($\omega_j = 1$) and 3d displays the results of weighted $\ell^1$-minimization using the weight sequence $\omega_j = \|B_j\|_{L^\infty}$. In all aforementioned cases the sampling points are drawn according to the uniform measure on $[-1,1]$. The subplots 3e and 3f use samples that are drawn according to the optimal sampling density as given in Lemma 3.10. 3e uses the original weight sequence $\omega_j = \|B_j\|_{L^\infty}$, while 3f uses the minimal possible weight sequence $\omega_j = \|B_j\|_{\tilde w,\infty}$.

Recovery from Gaussian samples. In this section we consider a subset $\mathcal{T}_r \subseteq \mathcal{V}$ of tensors of bounded hierarchical Tucker (HT) rank $r$. For $w \equiv 1$, a bound for the sample complexity subject to $\delta$ is given in [4, Theorem 2]. To obtain a sample bound from our theory, we would have to bound the variation constant, which however is infinite: $K(U(\mathcal{T}_r)) = \infty$, since $|v|_y = |(v,y)_{\mathrm{Fro}}|$ is unbounded on the (unbounded) support of the Gaussian measure. This shows that a direct application of the presented formalism to this problem cannot provide a finite sample complexity.

Remark 3.12. As above, this exposes the lacking sharpness of the results used in the proof of Theorem 2.7. With more refined concentration inequalities as in [33], a different definition of the variation constant would emerge (replacing $\|\bullet\|_{w,\infty}$ by a sub-Gaussian norm), which would be finite for this problem.
The present theory can deal with this problem in two different ways. The first option is to choose the weight function $w(y) = m^M \|y\|_{\mathrm{Fro}}^{-2}$, which yields the variation constant $K(U(\mathcal{T}_r)) = m^M$, where the final equality holds since $\|\bullet\| = \|\bullet\|_{\mathrm{Fro}}$. The second option is to normalize the samples and thereby replace the Gaussian distribution by a uniform distribution on the unit sphere. In this case we obtain the new identity $\|\bullet\| = m^{-M/2}\|\bullet\|_{\mathrm{Fro}}$ and the same variation constant $m^M$. In both cases $K(U(\mathcal{T}_r)) = K(U(\mathcal{V}))$.
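The effect of the first option can be checked by simulation: for normalized $v$ the weighted integrand $w(y)\,|v|_y^2$ is capped at $m^M$ by the Cauchy-Schwarz inequality, while for $w \equiv 1$ it is unbounded. A sketch for small $m$ and $M$; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
m, M = 3, 2                  # tensor space V = (R^m)^{(x)M}; here 3x3 matrices
D = m ** M

# unit-norm element v of U(V) (Frobenius normalization, ||.|| = ||.||_Fro)
v = rng.standard_normal((m, m))
v /= np.linalg.norm(v)

y = rng.standard_normal((100000, m, m))      # Gaussian rank-M sample tensors
vals = np.einsum('nij,ij->n', y, v) ** 2     # |v|_y^2 = (v, y)_Fro^2

# unweighted (w == 1): the integrand is unbounded, so K(U(V)) = infinity
print(vals.max())                            # grows without bound as n -> inf
# weighted: w(y) = m^M * ||y||_Fro^{-2} caps the integrand at m^M
w = D / np.einsum('nij,nij->n', y, y)
print((w * vals).max())                      # <= m^M = 9 by Cauchy-Schwarz
```

The same cap $m^M$ appears for the second option, since on the unit sphere $\|v\|_{\mathrm{Fro}} = m^{M/2}$ for normalized $v$.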
Let $k := K(U(\mathcal{T}_{2r}))$. By using the bound $\|\bullet\|_{w,\infty} \le \sqrt{k}\,\|\bullet\|$ on $\mathcal{T}_{2r}$ we can utilize the bound for the covering number of tensors of HT-rank $r$ that is provided in [4]. This leads to an estimate for the covering number of $U(\mathcal{T}_{2r})$.

A subsequent application of Corollary 2.10 yields a bound on the sample complexity. For $k = 1$ this has the same asymptotic complexity as the bound in [4]. We conjecture that the transition from $k = m^{M/2}$ to $k = 1$ can be achieved by using a generic chaining argument (cf. [33]) rather than a simple Hoeffding bound in the proof of Theorem 2.7.
Recovery from rank-1 samples and completion. In this section we consider subsets $\mathcal{T}_r \subseteq \mathcal{V}$ of generic rank-$r$ tensors but assume that the rank concept satisfies $\mathcal{T}_1 \subseteq \mathcal{T}_r$. This is the case for all tree-shaped tensor formats, including the Tucker format, the tensor train (TT) format and general hierarchical tensor formats (HT), as well as the canonical polyadic (CP) decomposition. The variation constant for the set $\mathcal{T}_r$ is computed in the next theorem. Since every sample $y$ has rank $1$, it holds that $y \in \mathcal{T}_1 \subseteq \mathcal{T}_r$, and we deduce that $K(U(\mathcal{T}_1)) = K(U(\mathcal{V}))$ by Theorem 3.1. This proves the assertion, since $K(U(\mathcal{T}_1)) \le K(U(\mathcal{T}_r)) \le K(U(\mathcal{V}))$. The theorem states that tensor formats do not exhibit a smaller variation constant than the linear space in which they are embedded. This result is surprising at first, because tensor formats have a significantly smaller covering number than the full tensor space, cf. [4]. However, this is already indicated by the classical analysis of matrix completion, from which it is known that the notion of incoherence is required in addition to a low-rank property.
Despite this unfavourable result, it is noteworthy that the present theory can still be used in this setting. The bound $\|\bullet\|_{w,\infty} \le \sqrt{K(U(\mathcal{T}_{2r}))}\,\|\bullet\|$ and the isometry $\|\bullet\| = \|\bullet\|_{\mathrm{Fro}}$ imply a bound for the covering number. Assuming the weight function $w$ is chosen optimally, we know from Theorem 3.13 and Section 3.1 that $K(U(\mathcal{T}_{2r})) = m^M$. We can now apply the bound for the covering number of tensors of HT-rank $r$ from [4] to obtain the resulting estimate.
A final application of Corollary 2.10 yields a bound on the number of samples. To the knowledge of the authors, this is the first estimate of the number of samples that are necessary to satisfy $\mathrm{RIP}_{\mathcal{T}_r}(\delta)$ in this setting. Note that this is a worst-case estimate and that significantly fewer samples are needed in practice (cf. [10]). In the following examples we discuss the application to two common classes of problems.

Example 3.14. In this example we consider the problem of recovering the low-rank coefficient tensor of a function from samples. Let $\pi_m$ be a probability measure on $Z_m$ and let $W_m \subseteq L^2(Z_m, \pi_m)$ be spanned by the $d_m$ orthonormal basis functions $\{B_{m,j}\}_{j\in[d_m]}$. Now define the product space $\mathcal{W} := W_m^{\otimes M} \subseteq L^2(Z,\pi)$ with $Z := Z_m^M$ and $\pi := \pi_m^{\otimes M}$ and endow it with the seminorm $|||w|||_z := |w(z)|$. This is the space in which the sought functions live, and it shall be approximated in the norm $|||\bullet||| := \|\bullet\|_{L^2(Z,\pi)}$. As a model class, consider the set $\mathcal{T}_r^{\mathcal{W}} \subseteq \mathcal{W}$ of functions with a coefficient tensor of rank $r$ with respect to the tensor product basis $B_j(z) := \prod_{k=1}^M B_{m,j_k}(z_k)$, and denote this set of coefficient tensors by $\mathcal{T}_r^{\mathcal{V}} := \mathcal{T}_r$. For the sake of simplicity, assume that the weight function $w \equiv 1$ is constant.
To compute the variation constant of this model class, recall the definition of $\mathcal{V} = (Y, \|\bullet\|)$ with $Y = (\mathbb{R}^m)^{\otimes M}$ and $|v|_y = |(v,y)_{\mathrm{Fro}}|$ from above. Note that each function $w \in \mathcal{W}$ corresponds uniquely to a coefficient tensor $\boldsymbol{w} \in \mathcal{V}$ and that the mapping $B : Z \to Y$ given by $(B(z))_j := B_j(z)$ induces an isometry of seminorms $|||w|||_z = |w(z)| = |(\boldsymbol{w}, B(z))_{\mathrm{Fro}}| = |\boldsymbol{w}|_{B(z)}$.

This means that if we choose $\rho$ as the pushforward measure $\rho := B_*\pi$, the isometry of seminorms induces the isometry of the two norms
$$\|\boldsymbol{w}\| = \Big(\int_Y |\boldsymbol{w}|_y^2 \,\mathrm{d}\rho(y)\Big)^{1/2} = |||w|||.$$
Together with Theorem 3.13 and Theorem 3.1 it follows that the variation constants of $\mathcal{T}_r^{\mathcal{W}}$ and $\mathcal{T}_r$ coincide.

Example 3.15. The problem of tensor completion can be considered as a special case of Example 3.14. In this setting $Z = [m]^M$ is the set of all multi-indices, $\pi = \mathcal{U}(Z)$ is the uniform distribution on $Z$ and the basis functions are the (rescaled) Kronecker deltas $B_{m,j}(z) = \sqrt{m}\,\delta_{jz}$.
Since this is a special case of Example 3.14, the model class of rank-$r$ tensors $\mathcal{T}_r$ exhibits the same bound $K(U(\mathcal{T}_r)) = K(U(\mathcal{W})) \ge m^M$.
These two examples show that $K(U(\mathcal{T}_r)) \ge m^M$ in important applications. To reduce the variation constant in these cases, we can only intersect $\mathcal{T}_r$ with another model class $\mathcal{M}$ with a low variation constant. The intersection then inherits the low covering number of $\mathcal{T}_r$ and the low variation constant of $\mathcal{M}$.
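The lower bound $K(U(\mathcal{T}_r)) \ge m^M$ in the completion setting can be seen from a single rank-one "spike" tensor: a tensor with one nonzero entry is normalized in $L^2(\pi)$ (with $\pi$ the uniform measure on $[m]^M$, as in Example 3.15), yet a single entry evaluation sees a value of size $m^M$. A minimal sketch:

```python
import numpy as np

m, M = 4, 3
D = m ** M

# rank-1 "spike" tensor: a single nonzero entry, normalized in L^2(pi)
T = np.zeros((m,) * M)
T[0, 0, 0] = np.sqrt(D)       # then ||T||^2 = (1/m^M) * sum_z T_z^2 = 1

norm_sq = (T ** 2).sum() / D  # L^2(pi)-norm squared of the spike
K_lower = (T ** 2).max()      # sup_z |T_z|^2: its contribution to K(U(T_r))
print(norm_sq, K_lower)       # 1.0 and 64.0 = m^M
```

This is the tensor-completion analogue of incoherence in matrix completion: spiky low-rank tensors are normalized yet concentrated on a single entry, which forces the large variation constant.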

Dependence on the seminorm
Since the definition of the $\|\bullet\|$-norm is very general, our theory is not limited to the $L^2$-norm but extends to Sobolev and energy norms. It is therefore natural to ask how the choice of the seminorm $|\bullet|_y$ influences the variation constant. In this section we investigate this influence using Sobolev norms as an example.
We will need a generalization of reproducing kernel Hilbert spaces (GRKHS) as a tool for the analysis. It is proved in Appendix C that for $w \equiv 1$ the variation constant can be expressed in terms of the two quantities $\kappa_m$ and $\lambda_m$. Since $\kappa_m$ increases but $\lambda_m$ decreases with $m$, both effects should be equilibrated by a proper choice of $m$. This is illustrated for two different model spaces $\mathcal{M}$ in Figure 4. The small effect of $\kappa_m$ is due to the dimension $d = 1$, for which it can be bounded by Gautschi's inequality [34, Eq. 5.6.4]. We conclude that for linear model spaces an approximation with respect to the $H^m$-norm for larger $m$ requires fewer samples than an approximation with respect to the $L^2$-norm. For $m = 1$ this hypothesis is confirmed numerically in Figure 5. For an application in the setting of weighted sparsity we refer to the recent work [35]. Note that this does not have to be the case in general. If the model class contains only piecewise constant functions, then information about the gradients is irrelevant. Such phenomena may also arise due to intricate properties of the model class and may only be observable by looking at the variation constant.
Also note that the minimization with respect to the $H^m$-norm does not necessarily require more computational effort than the minimization with respect to the $L^2$-norm [36, 37]. In the following we consider the variation constant $K(U(\mathcal{F}\mathcal{M}_{1,2k}))$ for functions that are sparse in a wavelet representation.
To evaluate this, let $\psi$ be the mother wavelet and define the daughter wavelets $\psi_{a,b}(t) := \frac{1}{\sqrt a}\,\psi\big(\frac{t-b}{a}\big)$. Due to basic properties of the Fourier transform, $\hat\psi_{a,b}(\omega) = \sqrt a\,\hat\psi(a\omega)\exp(-\mathrm{i}b\omega)$, and since the daughter wavelets are normalized we obtain $\|\hat\psi_{a,b}\|_{L^\infty}^2 = a\,\|\hat\psi\|_{L^\infty}^2$. Note that $\hat\psi$ is the Fourier transform of the mother wavelet and therefore $\|\hat\psi\|_{L^\infty}^2$ is constant. It can be concluded that many samples are needed to recover large-scale coefficients but fewer samples for small scales. This suggests a multilevel approach where the small-scale coefficients are learned separately from the large-scale coefficients. This was already observed in the compressed sensing literature (cf. [5]). Typically, these schemes use the classical unweighted notion of sparsity. For a recent application of weighted sparsity in the context of residual minimization in a sparse wavelet representation we refer to [38].
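The scaling $\|\hat\psi_{a,b}\|_{L^\infty}^2 = a\,\|\hat\psi\|_{L^\infty}^2$ can be verified by direct quadrature of the Fourier integral. A sketch using the Mexican-hat wavelet as an assumed example (any admissible mother wavelet works; grids and ranges are illustrative):

```python
import numpy as np

t = np.linspace(-30.0, 30.0, 12001)
dt = t[1] - t[0]

def mother(t):
    # Mexican-hat mother wavelet (unnormalized; an assumed example)
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def sup_ft(a, b, omegas):
    """sup over `omegas` of |Fourier transform of psi_{a,b}|, with
    psi_{a,b}(t) = psi((t - b)/a) / sqrt(a), by direct quadrature."""
    psi = mother((t - b) / a) / np.sqrt(a)
    best = 0.0
    for w in omegas:
        best = max(best, abs(np.sum(psi * np.exp(-1j * w * t)) * dt))
    return best

omegas = np.linspace(0.05, 6.0, 240)
s1 = sup_ft(1.0, 0.0, omegas)
s4 = sup_ft(4.0, 2.0, omegas / 4.0)  # the peak of psi-hat(a*omega) moves to omega/a
print((s4 / s1) ** 2)                # approximately a = 4
```

The squared ratio of the sup-norms recovers the scale factor $a$, confirming that large-scale (large $a$) coefficients carry the larger variation constant.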
Due to the high variation constant of the large-scale coefficients, it is sensible to incorporate as much information as possible into this part of the model class. In the spirit of works like [39], this can for example be achieved by means of manifold constraints. These manifolds can either be estimated for a single patient (cf. [40]) or for multiple patients when it can be assumed that the large-scale structures remain similar for different patients. In this way the image $u$ is decomposed (approximately) as a sum of a background image modelling the healthy tissue and a foreground image modelling the pathological lesion.
Note that if the mother wavelet $\psi$ is differentiable, we can instead consider the seminorm $|v|_\omega := \sqrt{1+\omega^2}\,|\hat v(\omega)|$, which corresponds to the $H^1$-norm in the physical domain. Computing the variation constant for this seminorm is however out of the scope of our discussion.

Discussion
The nonlinear least squares method is arguably the simplest and currently the most commonly used setting in machine learning regression. In Section 2 we derive an error bound for the nonlinear least squares estimator (1) that holds for arbitrary model classes. This result is based on a restricted isometry property (RIP), which we prove to hold with high probability when the number of samples is sufficiently large.
To put our theory into perspective, we apply it to well-known model classes and compare the results to the near optimal bounds that often already exist in the literature. In the cases of linear spaces (Section 3.1), functions with sparse representation (Section 3.2) and low-rank tensors (Section 3.3), we obtain asymptotic bounds which differ from these near optimal ones by a polynomial factor. This means that our analysis does not provide optimal complexity bounds when the number of samples has to be determined a priori and when sampling is costly (i.e. when it is imperative to use as few samples as possible). We however assume that a more meticulous application of modern concentration arguments (like [33]) would close this gap. We also obtain first bounds for the sample complexity of low-rank tensors measured with rank-one functionals in Section 3.3. These bounds however improve the sample complexity of full-rank tensors only by a logarithmic term. An intuition for this result is provided by matrix recovery, where it is known that regularity in the form of incoherence is needed in addition to the low-rank property. As a first remedy, we suggest imposing additional regularity assumptions on the model class, as was done in [41]. We however believe that this problem can be handled by taking into account the regularity of the function $u$ that we want to approximate. Figure 6 illustrates this behaviour. The model class used for all three experiments is the same and only the regularity of the function varies. Even though the best approximation error in all three cases is bounded by $10^{-3}$, we can observe how the empirical approximations deteriorate with decreasing regularity. The

relative errors for the empirical approximation increase from $10^{-2}$ to $10^{1}$. This phenomenon will be investigated in future research.
Despite the mentioned limitation, we nevertheless obtain qualitatively similar results to what is reported with more specialized approaches. In particular this concerns the emergence of an optimal sampling measure in Section 3.1, the importance of weighted sparsity in Section 3.2 and the advantage of multilevel sampling in Example 4.3. The generality of our theory also allows us to combine these results and derive an optimal weight function for weighted sparsity in Lemma 3.10. Since these results rely only on an estimation of the RIP, they can be compared to results on weighted $\ell^1$-minimization. We observe an improvement over the unweighted case.
In a final section, the dependence of the sample complexity on the employed seminorm is investigated. We observe faster convergence when stronger norms are used and provide a theoretical explanation for this effect.
Despite several remaining problems, we hope that this work is a promising first step towards a general theory for the sample complexity of the nonlinear least squares problem. We also want to emphasise that although our discussion is limited to well-known model classes, the developed theory can be applied to arbitrary model classes which may even be constructed empirically by methods such as manifold learning.

Acknowledgements
We thank the anonymous referees for suggestions that helped to significantly improve the manuscript and to correct an error. We also thank Leon Sallandt, Mathias Oster and Michael Götte for fruitful discussions.
M. Eigel acknowledges support by the DFG SPP 1886. R. Schneider was supported by the Einstein Foundation Berlin. P. Trunschke acknowledges support by the Berlin International Graduate School in Model and Simulation based Research (BIMoS).

By a standard concentration inequality, the corresponding probability is bounded by $2\exp\!\big(-\tfrac{n\delta^2}{2K^2}\big)$ for each $1 \le j \le \nu$. Combining both inequalities yields the statement. □
In the following we are concerned with proving Lemmas A.3 and A.5, which both rely on properties of the function $\ell_y : u \mapsto w(y)\,|u|_y^2$.
Lemma A.1. The function $\ell_y : u \mapsto w(y)\,|u|_y^2$ satisfies $|\ell_y(u)| \le K$ and $|\ell_y(u) - \ell_y(v)| \le 2\sqrt{K}\,k_y(u - v)$ for all $u, v \in U(A)$, where $k_y := \sqrt{\ell_y}$.

Proof. Let $u, v \in U(A)$. The first statement follows immediately from the definition of $K$. To prove the second statement, we consider the seminorm $k_y := \sqrt{\ell_y}$ and use the reverse triangle inequality. Since $k_y$ is bounded by $\sqrt K$ on $U(A)$, we can use the Lipschitz continuity of the square on $[0, \sqrt K\,]$. □

As an intermediate step we first prove Lemma A.2, from which Lemma A.3 follows almost immediately.

The set of feasible $\alpha_1, \alpha_2$ satisfying $\alpha_1, \alpha_2 > 0$ and $\alpha_1 I_1 + \alpha_2 I_2 = 1$ is displayed in red. Contour lines of the function $(\alpha_1, \alpha_2) \mapsto \alpha_1^{-1} \vee \alpha_2^{-1}$ for $t_1 < t_2$ (left) and for the optimal value $t_{\mathrm{opt}} = \alpha_1^{-1} = \alpha_2^{-1}$ (right) are drawn in black.