Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Complex fractal trainability boundary can arise from trivial non-convexity

Yizhou Liu liuyz@mit.edu Physics of Living Systems, Department of Physics, MIT, Cambridge, MA 02139, USA. Department of Mechanical Engineering, MIT, Cambridge, MA 02139, USA.
(June 20, 2024)
Abstract

Training neural networks involves optimizing parameters to minimize a loss function, where the nature of the loss function and the optimization strategy are crucial for effective training. Hyperparameter choices, such as the learning rate in gradient descent (GD), significantly affect the success and speed of convergence. Recent studies indicate that the boundary between bounded and divergent hyperparameters can be fractal, complicating reliable hyperparameter selection. However, the nature of this fractal boundary and methods to avoid it remain unclear. In this study, we focus on GD to investigate the loss landscape properties that might lead to fractal trainability boundaries. We discovered that fractal boundaries can emerge from simple non-convex perturbations, i.e., adding or multiplying cosine type perturbations to quadratic functions. The observed fractal dimensions are influenced by factors like parameter dimension, type of non-convexity, perturbation wavelength, and perturbation amplitude. Our analysis identifies “roughness of perturbation”, which measures the gradient’s sensitivity to parameter changes, as the factor controlling fractal dimensions of trainability boundaries. We observed a clear transition from non-fractal to fractal trainability boundaries as roughness increases, with the critical roughness causing the perturbed loss function non-convex. Thus, we conclude that fractal trainability boundaries can arise from very simple non-convexity. We anticipate that our findings will enhance the understanding of complex behaviors during neural network training, leading to more consistent and predictable training strategies.

Introduction

Machine learning has become a cornerstone of modern technology. When training a machine learning model, we need to update the model parameters towards optimizing a loss function (usually based on the loss gradient) to attain desired performance. To better understand and achieve successful training, researchers have tried to capture shapes of the loss landscapes [1, 2, 3] and dynamics that arise from optimization algorithms [4, 5, 6, 7, 8]. In general, despite the considerable empirical success and broad application of these optimization techniques in training models, our theoretical understanding of the training processes remains limited.

Recent empirical evidence suggests that whether a model is trainable can be extremely sensitive to choices in optimization. A model is said to be not trainable here if the optimization applied leads to divergent loss function value. By convention, we call model parameters to be optimized as parameters and other parameters in the optimizer controlling the optimization process as hyperparameters. One of the most important hyperparameters is learning rate, which affects the size of the steps taken during optimization. On a simple two layer neural network, gradient descent (GD) was recently found to have a fractal boundary between learning rates that lead to bounded and divergent loss (fractal trainability boundary for short) [9]. Consequently, a slight change in hyperparameters can change the training result qualitatively with little hope to choose good hyperparameters in advance.

Here, we aim to explore the mechanisms underlying the fractal trainability boundary and the key factors influencing its fractal dimension. Guided by the intuition that non-convexity renders the gradient sensitive to parameter, which makes GD sensitive to varying learning rates, we tried to quantify the relation between non-convexity and fractal trainability. Given the difficulties in describing and controlling the loss functions of real neural networks, our approach involves constructing simple non-convex loss functions and testing GD on these to examine the boundary between learning rates that lead to bounded versus divergent losses. We discover that even a simple quadratic function, when perturbed by adding or multiplying a regular perturbation function (specifically a cosine function), exhibits a fractal trainability boundary. A parameter specific to the form of the perturbation, defined as roughness, appears to govern the fractal dimension of the trainability boundary, describing the gradient sensitivity to the parameter. A notable difference emerges between the fractal behaviors of trainability in quadratic functions perturbed by additive perturbation versus those altered by multiplicative perturbation: fractal behavior disappears at a finite roughness (when the perturbed loss becomes convex) for additive cases but persists for multiplicative perturbations (where the perturbed loss is always non-convex). We therefore offer a perspective to explain the fractal trainability boundaries observed in real neural networks as a result of non-convexity, emphasizing “roughness” as the factor controlling fractal dimensions.

Results

We first elaborate the key intuition of why certain non-convexity may lead to the fractal trainability boundary. Common ways to generate fractals involve iterating a function sensitive to hyperparameters in the function (e.g., Mandelbrot and quadratic Julia sets [10, 9]). We denote the loss function as f(x)𝑓𝑥f(x)italic_f ( italic_x ) where x𝑥xitalic_x is the parameter to optimize to minimize f(x)𝑓𝑥f(x)italic_f ( italic_x ). GD can be described as iterating the function

x(k+1)=x(k+1)(x(k);s)=x(k)sf(x(k)).superscript𝑥𝑘1superscript𝑥𝑘1superscript𝑥𝑘𝑠superscript𝑥𝑘𝑠𝑓superscript𝑥𝑘x^{(k+1)}=x^{(k+1)}(x^{(k)};s)=x^{(k)}-s\nabla f(x^{(k)}).italic_x start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT ( italic_x start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ; italic_s ) = italic_x start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT - italic_s ∇ italic_f ( italic_x start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) . (1)

Here, x(k)superscript𝑥𝑘x^{(k)}italic_x start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT is the parameter obtained at the k𝑘kitalic_kth step and s𝑠sitalic_s is the learning rate (hyperparameter). If non-convexity can make the gradient f(x)𝑓𝑥\nabla f(x)∇ italic_f ( italic_x ) sensitive to parameter x𝑥xitalic_x, after we shift learning rate s𝑠sitalic_s a little at the k𝑘kitalic_kth step (this is another training process to be compared with the original one), x(k+1)superscript𝑥𝑘1x^{(k+1)}italic_x start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT will be a little different from the one obtained with the unchanged learning rate. However, the gradient at the new x(k+1)superscript𝑥𝑘1x^{(k+1)}italic_x start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT will be very different, leading to very different subsequent iterations. The sensitivity of gradient’s dependence on parameter can therefore be transformed to the sensitive dependence of optimization process on the learning rate (hyperparameter), which is the key to generate fractals. Notably, this sensitive dependence on learning rate is sufficient to generate chaos while is not obvious to yield divergent training.

Refer to caption
Figure 1: On constructed loss functions, we conduct numerical experiments to study the trainability boundaries. (a) Illustration of loss landscapes with additive perturbation (f+subscript𝑓f_{+}italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT with ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2 and λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1). (b) An example of loss constructed having multiplicative perturbation (f×subscript𝑓f_{\times}italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT with ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2 and λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1). (c) On a fixed range of learning rate, we can put in N𝑁Nitalic_N small segments and evaluate whether training diverge or not at each end of the segments. We therefore can generate a set of boundary segments, BNsubscript𝐵𝑁B_{N}italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT and count the number of boundary segments. (d) An example when we have more segments (fine-grain), the number of boundary segments (black segments) is increasing (figure obtained based on multiplicative perturbation case f×subscript𝑓f_{\times}italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT with parameters ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2 and λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1). The colored bar at the bottom visualizes losses for bounded (blue) and divergent training (red).

We thus need to do experiments on specific functions to test if non-convexity with sensitive gradients can lead to fractal trainability boundaries. We started by looking at one dimensional parameters (xd𝑥superscript𝑑x\in\mathbb{R}^{d}italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, d=1𝑑1d=1italic_d = 1) and chose to construct our loss landscape by perturbing the convex quadratic function f0(x)=x2subscript𝑓0𝑥superscript𝑥2f_{0}(x)=x^{2}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_x ) = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. One way of perturbation is to add a non-convex function (Fig. 1a). We used the simplest regular perturbation ϵf1=ϵcos(2πx/λ)italic-ϵsubscript𝑓1italic-ϵ2𝜋𝑥𝜆\epsilon f_{1}=\epsilon\cos(2\pi x/\lambda)italic_ϵ italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ϵ roman_cos ( 2 italic_π italic_x / italic_λ ), where λ𝜆\lambdaitalic_λ is the wavelength of the perturbation, and ϵitalic-ϵ\epsilonitalic_ϵ the amplitude of the perturbation. We therefore defined the additive perturbation case in our context as

f+(x)=f0+ϵf1=x2+ϵcos(2πx/λ).subscript𝑓𝑥subscript𝑓0italic-ϵsubscript𝑓1superscript𝑥2italic-ϵ2𝜋𝑥𝜆f_{+}(x)=f_{0}+\epsilon f_{1}=x^{2}+\epsilon\cos(2\pi x/\lambda).italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_x ) = italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_ϵ italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_ϵ roman_cos ( 2 italic_π italic_x / italic_λ ) . (2)

An alternative way to introduce non-convexity is via multiplying a perturbation function (Fig. 1b), which will be referred to as multiplicative perturbation case:

f×(x)=f0(1+ϵf1)=x2(1+ϵcos(2πx/λ)).subscript𝑓𝑥subscript𝑓01italic-ϵsubscript𝑓1superscript𝑥21italic-ϵ2𝜋𝑥𝜆f_{\times}(x)=f_{0}(1+\epsilon f_{1})=x^{2}(1+\epsilon\cos(2\pi x/\lambda)).italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT ( italic_x ) = italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( 1 + italic_ϵ italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 1 + italic_ϵ roman_cos ( 2 italic_π italic_x / italic_λ ) ) . (3)

The two cases are qualitatively different as when x𝑥x\to\inftyitalic_x → ∞, the additive perturbation will become small comparing to f0=x2subscript𝑓0superscript𝑥2f_{0}=x^{2}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT while the multiplicative perturbation is always comparable to the unperturbed f0=x2subscript𝑓0superscript𝑥2f_{0}=x^{2}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Our test functions have simple analytic forms and represent different forms of non-convexity.

We next explain the idea of investigating fractal trainability boundaries numerically. The boundary points separating learning rates leading to finite and divergent loss are not accessible directly. So, we need to use finite but many grid points to locate the boundary learning rates. On a given range of learning rate (s[smin,smax]𝑠subscript𝑠minsubscript𝑠maxs\in[s_{\rm min},s_{\rm max}]italic_s ∈ [ italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ]), we can evenly put N+1𝑁1N+1italic_N + 1 grid points, with which GD can be tested to diverge or not. If two neighboring learning rate grid point values lead to the same divergent/bounded loss behavior, at this coarse-grain level (quantified by N𝑁Nitalic_N), we say there is no boundary between the two grid points. Otherwise, we say the segment between this two grid points is a boundary segment. We define a set BNsubscript𝐵𝑁B_{N}italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT as the set containing all boundary segments when we have N+1𝑁1N+1italic_N + 1 grid points (Fig. 1c). Heuristically, we expect each boundary segment to cover one boundary accurately when the segment length goes to zero (i.e., N𝑁N\to\inftyitalic_N → ∞), and thus BNsubscript𝐵𝑁B_{N}italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT becomes the set of boundary learning rates when N𝑁N\to\inftyitalic_N → ∞. If the number of boundary segments, denoted by |BN|subscript𝐵𝑁|B_{N}|| italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT |, increases with respect to N𝑁Nitalic_N as a scaling law asymptotically

|BN|Nα,(N),proportional-tosubscript𝐵𝑁superscript𝑁𝛼𝑁|B_{N}|\propto N^{\alpha},~{}(N\to\infty),| italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | ∝ italic_N start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT , ( italic_N → ∞ ) , (4)

we say the trainability boundary has a fractal dimension α𝛼\alphaitalic_α (by convention, the fractal dimension defined in this way is called box dimension [11]). We observed that the number of boundary segments (black segments in Fig. 1d) indeed increases when we do tests on the multiplicative case and put more and more testing grid points in a fixed learning rate range (Fig. 1d). The colored bar at the bottom visualizes loss values for bounded (blue) and divergent (red) training evaluated at 220superscript2202^{20}2 start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT grid points in [0,1.5]01.5[0,1.5][ 0 , 1.5 ] (Fig. 1d). And the color intensity is determined by ifisubscript𝑖subscript𝑓𝑖\sum_{i}f_{i}∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for bounded training and ifi1subscript𝑖superscriptsubscript𝑓𝑖1\sum_{i}f_{i}^{-1}∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT for divergent training, respectively [9], where fisubscript𝑓𝑖f_{i}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the loss value at i𝑖iitalic_ith step (totally 1000 steps). Once we zoom in, we see more boundaries between blue and red (the original vector image can be found online). We therefore are ready for further exploration with the numerical tool identifying fractals.

Refer to caption
Figure 2: Simple non-convexity can lead to fractal trainability boundaries, whose fractal dimensions depend on perturbation form, wavelength, and amplitude. (a) For additive perturbation case f+subscript𝑓f_{+}italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT with ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2 and λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1, we studied learning rate in [0,1.5]01.5[0,1.5][ 0 , 1.5 ], where the number of boundary segments increases as a scaling law with respect to the number of segments put, suggesting a fractal trainability boundary. The fractal dimension, i.e., the slope log|BN|subscript𝐵𝑁\log|B_{N}|roman_log | italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | against logN𝑁\log Nroman_log italic_N is fitted as 0.996±0.005plus-or-minus0.9960.0050.996\pm 0.0050.996 ± 0.005. (b) For additive perturbation case f×subscript𝑓f_{\times}italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT with ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2 and λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1, fractal trainability boundary is also observed with fractal dimension 0.837±0.004plus-or-minus0.8370.0040.837\pm 0.0040.837 ± 0.004. (c, d) The fractal dimension of trainability boundary vary with respect to perturbation amplitude ϵitalic-ϵ\epsilonitalic_ϵ and wavelength λ𝜆\lambdaitalic_λ. In particular, for the additive perturbation case (c), the fractal dimension increases with larger amplitude and smaller wavelength. For the multiplicative case (d), the fractal dimension has no clear dependence on the two function parameters.

We investigated the trainability boundaries using GD on specifically constructed loss functions. For our experiments, we applied GD to the loss function with additive perturbation, f+subscript𝑓f_{+}italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT, starting from x(0)=1.0superscript𝑥01.0x^{(0)}=1.0italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT = 1.0 with parameters ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2 and λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1. We conducted 1000 steps of GD and defined a training session as divergent (untrainable) if the sum of losses at the 1000 steps exceeded 1016superscript101610^{1}610 start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT 6. This upper loss threshold affects misclassifying slowly diverging training into bounded training; although varying it between 1012superscript101210^{1}210 start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT 2 and 1020superscript102010^{20}10 start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT did not alter the observed fractal dimension. We adjusted the learning rate within the range of [0,1.5]01.5[0,1.5][ 0 , 1.5 ] and increased the number of intervals, N𝑁Nitalic_N, up to 232superscript2322^{32}2 start_POSTSUPERSCRIPT 32 end_POSTSUPERSCRIPT (see Methods). Our findings reveal that the trainability boundary for this simple function displays fractal behaviors, meaning the number of boundary segments, |BN|subscript𝐵𝑁|B_{N}|| italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT |, increases following a scaling law with N𝑁Nitalic_N at large values (Fig. 2a). The fractal dimension, α𝛼\alphaitalic_α, calculated via least squares as the slope of log|BN|subscript𝐵𝑁\log|B_{N}|roman_log | italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | against logN𝑁\log Nroman_log italic_N (log\logroman_log base is 2222 throughout this paper), is approximately 0.996±0.005plus-or-minus0.9960.0050.996\pm 0.0050.996 ± 0.005 (error is standard deviation). A similar analysis on another loss function with multiplicative perturbation, f×subscript𝑓f_{\times}italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT, yielded a fractal dimension of α=0.837±0.004𝛼plus-or-minus0.8370.004\alpha=0.837\pm 0.004italic_α = 0.837 ± 0.004 (Fig. 2b). This suggests that fractal boundaries are more densely packed in a narrower range for the additive perturbation scenario, indicating a potentially less erratic behavior. Nonetheless, the emergence of fractal trainability boundaries in these trivially simple loss functions is remarkable.

We next sought to examine factors affecting the fractal dimension of trainability boundaries. We evaluated the fractal dimension of the trainability boundary with ten different amplitudes ϵitalic-ϵ\epsilonitalic_ϵ evenly picked from [0.01,0.2]0.010.2[0.01,0.2][ 0.01 , 0.2 ] and ten different wavelengths λ𝜆\lambdaitalic_λ evenly picked from [0.01,1.0]0.011.0[0.01,1.0][ 0.01 , 1.0 ]. Least square fitting is used to obtain the fractal dimensions. We found that for the additive perturbation case, the fractal dimension increases with decreasing wavelength and increasing amplitude (Fig. 2c), while for the multiplicative perturbation case, the fractal dimension has no clear dependence on amplitude or wavelength (Fig. 2d). The fractal dimension therefore depends on the type, wavelength, and amplitude of the non-convex perturbation in a complicated manner.

We next tried to analyze how the perturbation wavelength and amplitude change the fractal dimension of trainability boundary. We can rescale the parameter, x~=x/b~𝑥𝑥𝑏\tilde{x}=x/bover~ start_ARG italic_x end_ARG = italic_x / italic_b, and renormalize the loss, f~(x~)=f(x)/ζ~𝑓~𝑥𝑓𝑥𝜁\tilde{f}(\tilde{x})=f(x)/\zetaover~ start_ARG italic_f end_ARG ( over~ start_ARG italic_x end_ARG ) = italic_f ( italic_x ) / italic_ζ, such that we map a GD process starting at x(0)superscript𝑥0x^{(0)}italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT to another one starting at x~(0)=x(0)/bsuperscript~𝑥0superscript𝑥0𝑏\tilde{x}^{(0)}=x^{(0)}/bover~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT / italic_b and updating with respect to

x~(k+1)=x~(k)s~f~(x~(k)).superscript~𝑥𝑘1superscript~𝑥𝑘~𝑠~𝑓superscript~𝑥𝑘\tilde{x}^{(k+1)}=\tilde{x}^{(k)}-\tilde{s}\nabla\tilde{f}(\tilde{x}^{(k)}).over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT = over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT - over~ start_ARG italic_s end_ARG ∇ over~ start_ARG italic_f end_ARG ( over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) . (5)

By choosing ζ=b2𝜁superscript𝑏2\zeta=b^{2}italic_ζ = italic_b start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, we can show that s~=s~𝑠𝑠\tilde{s}=sover~ start_ARG italic_s end_ARG = italic_s and f~~𝑓\tilde{f}over~ start_ARG italic_f end_ARG has the same function form (i.e., with additive perturbation or multiplicative perturbation) as f𝑓fitalic_f while with different function parameters ϵ~~italic-ϵ\tilde{\epsilon}over~ start_ARG italic_ϵ end_ARG and λ~~𝜆\tilde{\lambda}over~ start_ARG italic_λ end_ARG (see Methods). This means a GD process on our constructed loss given ϵitalic-ϵ\epsilonitalic_ϵ, λ𝜆\lambdaitalic_λ, and x(0)superscript𝑥0x^{(0)}italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT will have the same divergence property with the same learning rate as another GD process on our constructed loss with ϵ~~italic-ϵ\tilde{\epsilon}over~ start_ARG italic_ϵ end_ARG, λ~~𝜆\tilde{\lambda}over~ start_ARG italic_λ end_ARG, and x~(0)superscript~𝑥0\tilde{x}^{(0)}over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT. Therefore, trainability boundaries are the same for the two sets of conditions, {ϵ,λ,x(0)}italic-ϵ𝜆superscript𝑥0\{\epsilon,\lambda,x^{(0)}\}{ italic_ϵ , italic_λ , italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT } and {ϵ~,λ~,x~(0)}~italic-ϵ~𝜆superscript~𝑥0\{\tilde{\epsilon},\tilde{\lambda},\tilde{x}^{(0)}\}{ over~ start_ARG italic_ϵ end_ARG , over~ start_ARG italic_λ end_ARG , over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT }. If we further assume the fractal dimension α𝛼\alphaitalic_α depends mainly on loss properties rather than initial conditions (seems to be true, see SI), we conclude that fractal dimensions of the trainability boundaries for {ϵ,λ}italic-ϵ𝜆\{\epsilon,\lambda\}{ italic_ϵ , italic_λ } and {ϵ~,λ~}~italic-ϵ~𝜆\{\tilde{\epsilon},\tilde{\lambda}\}{ over~ start_ARG italic_ϵ end_ARG , over~ start_ARG italic_λ end_ARG } are the same.

More specifically, for the additive perturbation case, the renormalization flow not changing the fractal dimension should be given by (see Methods)

ϵ~=ϵ/b2,λ~=λ/b.formulae-sequence~italic-ϵitalic-ϵsuperscript𝑏2~𝜆𝜆𝑏\tilde{\epsilon}=\epsilon/b^{2},~{}\tilde{\lambda}=\lambda/b.over~ start_ARG italic_ϵ end_ARG = italic_ϵ / italic_b start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , over~ start_ARG italic_λ end_ARG = italic_λ / italic_b . (6)

Note there is only one independent combination of ϵitalic-ϵ\epsilonitalic_ϵ and λ𝜆\lambdaitalic_λ, i.e.,

θ+=ϵ/λ2,subscript𝜃italic-ϵsuperscript𝜆2\theta_{+}=\epsilon/\lambda^{2},italic_θ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT = italic_ϵ / italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (7)

is invariant under the renormalization transformation, we claim the fractal dimension can only depends on θ+subscript𝜃\theta_{+}italic_θ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT since it only depends on {ϵ,λ}italic-ϵ𝜆\{\epsilon,\lambda\}{ italic_ϵ , italic_λ } and is invariant under the renormalization transformations. We would call the quantity θ+subscript𝜃\theta_{+}italic_θ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT “roughness” as it is in the pre-factor of the second derivative of the additive perturbation, measuring the sensitivity of gradient’s dependence on parameter. We plotted the fractal dimension as a function of roughness θ+subscript𝜃\theta_{+}italic_θ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT (the same set of data as Fig. 2c) and found the fractal dimension shows a clear and sharp transition from zero (i.e., no fractal behavior) to non-zero when increasing roughness θ+subscript𝜃\theta_{+}italic_θ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT (Fig. 3a). We found that the critical roughness θ+subscript𝜃\theta_{+}italic_θ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT is near 1/2π212superscript𝜋21/2\pi^{2}1 / 2 italic_π start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (dashed line in Fig. 3a), which corresponds to the critical situation f+subscript𝑓f_{+}italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT begin to be non-convex (x𝑥\exists x∃ italic_x such that 2f+(x)=0superscript2subscript𝑓𝑥0\nabla^{2}f_{+}(x)=0∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_x ) = 0). The simple renormalization analysis yields roughness of the additive perturbation, which determines the fractal dimension and shows that fractal behaviors show up when the perturbed loss is non-convex.

Refer to caption
Figure 3: Roughness determines fractal dimension of trainability boundaries and captures the transition to fractal trainability boundary when the landscape is non-convex. (a) For the additive perturbation case, the roughness θ+subscript𝜃\theta_{+}italic_θ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT found well organizes the fractal dimensions α𝛼\alphaitalic_α with different amplitude and wavelength (data in Fig. 2c), where we can see a clear transition to non-zero fractal dimensions near θ+=1/2π2subscript𝜃12superscript𝜋2\theta_{+}=1/2\pi^{2}italic_θ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT = 1 / 2 italic_π start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (dashed line, corresponding emergence of non-convexity). (b) For the multiplicative perturbation case, the roughness θ×subscript𝜃\theta_{\times}italic_θ start_POSTSUBSCRIPT × end_POSTSUBSCRIPT found determines the fractal dimensions α𝛼\alphaitalic_α (data from Fig. 2d). Error bars are standard deviations of fitting.

Following the same renormalization procedure, we found the roughness determining fractal dimension for the multiplicative perturbation case is

θ×=ϵ.subscript𝜃italic-ϵ\theta_{\times}=\epsilon.italic_θ start_POSTSUBSCRIPT × end_POSTSUBSCRIPT = italic_ϵ . (8)

This quantity θ×=ϵsubscript𝜃italic-ϵ\theta_{\times}=\epsilonitalic_θ start_POSTSUBSCRIPT × end_POSTSUBSCRIPT = italic_ϵ also shows up in the second derivative of f×subscript𝑓f_{\times}italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT and contributes to the sensitive dependence of gradient on parameter x𝑥xitalic_x, while it is not the only one (there are also ϵ/λitalic-ϵ𝜆\epsilon/\lambdaitalic_ϵ / italic_λ and ϵ/λ2italic-ϵsuperscript𝜆2\epsilon/\lambda^{2}italic_ϵ / italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT) and thus just looking at the second derivative does not suffice. For the multiplicative perturbation case, we found the fractal dimension decreases a little with increasing roughness (Fig. 3b; same data as Fig. 2d). The multiplicative case is always non-convex and always has fractal trainability boundary, which is consistent with the previous finding that fractal behaviors emerges when the perturbed loss becomes non-convex. Note that the non-convex part of the multiplicative case (where the second derivatives begin to be non-positive) has loss value around and above the order of magnitude 1/ϵ1italic-ϵ1/\epsilon1 / italic_ϵ, if we have too small ϵitalic-ϵ\epsilonitalic_ϵ while our numerical upper bound to classify bounded and divergent training is not large enough, the classification will solely depends on the convex part of the loss. In this case, if non-convexity is necessary for fractal behaviors, we would expect no fractal behaviors as a numerical artifact, which is proved to be true (see SI). We therefore conclude that roughness found through simple renormalization determines fractal dimension of trainability boundaries and the transition to non-zero fractal dimensions corresponds to the loss function becoming non-convex.

Beyond simple cases we can analyze, we next studied a slightly more complicated loss landscapes. As a first step towards perturbations with multiple length scales, we considered additive perturbations with two cosine functions,

f+(x)=x2+ϵ1cos(2πx/λ1)+ϵ2cos(2πx/λ2).subscript𝑓𝑥superscript𝑥2subscriptitalic-ϵ12𝜋𝑥subscript𝜆1subscriptitalic-ϵ22𝜋𝑥subscript𝜆2f_{+}(x)=x^{2}+\epsilon_{1}\cos(2\pi x/\lambda_{1})+\epsilon_{2}\cos(2\pi x/% \lambda_{2}).italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_x ) = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT roman_cos ( 2 italic_π italic_x / italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + italic_ϵ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT roman_cos ( 2 italic_π italic_x / italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) . (9)

The renormalization can only say ϵ1/λ12subscriptitalic-ϵ1superscriptsubscript𝜆12\epsilon_{1}/\lambda_{1}^{2}italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT / italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and ϵ2/λ22subscriptitalic-ϵ2superscriptsubscript𝜆22\epsilon_{2}/\lambda_{2}^{2}italic_ϵ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT / italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT determine the fractal dimension and cannot yield a single variable controlling the fractal behavior. In numerical tests, we fixed λ1=0.3subscript𝜆10.3\lambda_{1}=0.3italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.3 and λ1=0.5subscript𝜆10.5\lambda_{1}=0.5italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.5, while changed amplitudes ϵ1subscriptitalic-ϵ1\epsilon_{1}italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and ϵ2subscriptitalic-ϵ2\epsilon_{2}italic_ϵ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. We found the fractal dimension depends non-monotonically on each of the amplitudes (Fig. 4a). However, the claim that fractal behaviors arise when the loss becomes non-convex is still valid, as the boundary between convex and non-convex losses (red curve in Fig. 4a, solved numerically) also separates zero and non-zero fractal dimensions.

We next explored how the dimension of parameters x𝑥xitalic_x affect the trainability boundary. For the additive perturbation case, we generalize the function for xd𝑥superscript𝑑x\in\mathbb{R}^{d}italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT as

f+(x)=ixi2+ϵicos(2πxi/λ).subscript𝑓𝑥subscript𝑖superscriptsubscript𝑥𝑖2italic-ϵsubscript𝑖2𝜋subscript𝑥𝑖𝜆f_{+}(x)=\sum_{i}x_{i}^{2}+\epsilon\sum_{i}\cos(2\pi x_{i}/\lambda).italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_x ) = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_ϵ ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_cos ( 2 italic_π italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_λ ) . (10)

Following similar numerical methods as before (see detailed parameter setting in SI), we studied the fractal dimension of trainability boundary for f+subscript𝑓f_{+}italic_f start_POSTSUBSCRIPT + end_POSTSUBSCRIPT varying d𝑑ditalic_d from 1111 to 100100100100. The results suggest that the fractal dimension does not change much with respect to d𝑑ditalic_d in the additive perturbation scenario (Fig. 4b). For the multiplicative perturbation case, we can define a class of functions

f×(x)=(1+ϵicos(2πxi/λ))ixi2.subscript𝑓𝑥1italic-ϵsubscript𝑖2𝜋subscript𝑥𝑖𝜆subscript𝑖superscriptsubscript𝑥𝑖2f_{\times}(x)=\Big{(}1+\epsilon\sum_{i}\cos(2\pi x_{i}/\lambda)\Big{)}\sum_{i}% x_{i}^{2}.italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT ( italic_x ) = ( 1 + italic_ϵ ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_cos ( 2 italic_π italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_λ ) ) ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (11)

We found the fractal dimension α𝛼\alphaitalic_α slowly increases in this case with respect to increasing d𝑑ditalic_d (Fig. 4c), which makes sense as high-dimensional optimization should be more complicated. Our renormalization procedure cannot connect two functions with different dimensions d𝑑ditalic_d, and therefore roughness values for functions with different dimensions d𝑑ditalic_d are not comparable. Future works are needed to analyze the impact of parameter dimensions d𝑑ditalic_d, e.g., defining a generalized roughness that can determine fractal dimension of trainability boundaries across different d𝑑ditalic_d.

Refer to caption
Figure 4: Beyond simple cases we can analyze, fractal dimension of trainability boundary depends on many other parameters determine the loss function, while it seems to be general that non-convexity leads to fractal behaviors. (a) For additive case with two cosine perturbations, the fractal dimension depends complicatedly on the amplitudes, while it is true fractal behaviors show up after the loss is non-convex (red line is the boundary of convexity). (b and c) For high dimensional optimization, the fractal dimension can depend on parameter dimensions. The fractal dimension is robust to increasing the parameter dimension d𝑑ditalic_d for the additive perturbation case. (b) While the fractal dimension increases with the parameter dimension d𝑑ditalic_d for the multiplicative perturbation case. Error bars are standard deviations of fitting.

Discussion

In this study, we have demonstrated that fractal trainability boundaries can arise from relatively simple non-convex modifications to loss functions. Specifically, our results show that the sensitivity of the loss gradient to parameter changes—a consequence of non-convexity introduced either through additive or multiplicative cosine perturbation—plays a crucial role in the emergence of fractal trainability. The fractal dimensions we observed are influenced by several factors, including the parameter dimension, the type of non-convexity, perturbation wavelength, and amplitude. Notably, our use of renormalization techniques in one-dimensional optimization cases has linked various loss functions to corresponding fractal dimensions of their trainability boundaries. Therefore, we have identified “roughness of perturbation” as a key property that quantifies this sensitivity and dictates the fractal behavior. We observed a clear transition from non-fractal to fractal trainability boundaries as roughness increases, with the critical roughness causing the perturbed loss to be non-convex. These findings not only validate our hypothesis about the impact of non-convexity on trainability but also open up new avenues for understanding the dynamics of learning in complex models.

While our method effectively characterizes fractal behaviors, it may not fully capture the complexity inherent in the trainability boundary. We computed the box dimension of these boundaries as a more feasible alternative to direct, uniform sampling from the trainability boundary set, which remains impractical. However, the box dimension is not without its limitations; for example, it is known that all rational numbers between 0 and 1 technically have a box dimension of 1111 [11, 12]. Consequently, while the relative magnitudes of our computed box dimensions can be informative in assessing the degree of complexity, the absolute values themselves may not be entirely reliable.

Beyond mathematical limitations, constraints in our numerical implementation also impact the accuracy of the fractal dimensions we obtained. For instance, computational resources cap the largest feasible N𝑁Nitalic_N, limiting the number of data points available for accurately fitting the fractal dimension. If a fractal boundary is densely packed within a very narrow range, a significantly large N𝑁Nitalic_N is required to discern its fractal nature, potentially causing us to overlook certain fractal behaviors when N𝑁Nitalic_N is limited. Interestingly, the practical significance of these fractal boundaries also comes into question; narrowly distributed boundaries are unlikely to be encountered in most applications, thus posing minimal risk. This observation led to a new insight: fractal dimension alone may not suffice to assess the risk posed by fractal boundaries. It also becomes essential to understand the distribution breadth of these boundary points. In our experiments, the maximum N𝑁Nitalic_N tested did not vary widely, suggesting that we may have consistently overlooked very narrow fractal boundaries. However, this might not be detrimental, as such boundaries are less likely to impact practical applications.

Our renormalization analysis, while effective in identifying roughness as a key parameter, exhibits limited generalizability. This analysis is restricted to simple functions with explicitly defined parameters, making our conclusions highly specific to the cases studied. Additionally, we cannot answer with this analysis why non-convexity of the loss function leads to the emergence of fractal behaviors. My initial concept was to establish a mapping that links different loss landscapes and learning rates, thereby preserving unchanged trainability. This mapping would ideally define an updating flow for function parameters and the learning rate. If successful, we could potentially transform the question of trainability into an investigation of where this updating flow stabilizes, using familiar functions such as the quadratic function as endpoints. However, the renormalization flow falls short in achieving this, as it cannot eliminate the perturbations. The first time I viewed the figures in [9], they reminded me of images of Jupiter, whose fractal-like surface arises from some fluid dynamics. This analogy suggests a future possibility where we might develop an updating flow for hyperparameters that mirrors principles from fluid dynamics.

In conclusion, substantial future research is necessary to more accurately capture the fractal behaviors of trainability boundaries. As we have discussed, developing a theory that can predict both the critical emergence of fractal trainability boundaries and their fractal dimensions is essential. Moreover, establishing connections to realistic loss functions from contemporary machine learning models is needed, particularly finding methods to characterize roughness in general loss functions lacking simple explicit formulas. Further exploration into the mechanisms that contribute to rough non-convexity is also required. With a deeper understanding of these phenomena, we could potentially develop strategies associated to model construction, dataset management, and optimizers that mitigate the risks associated with dangerous fractal trainability boundaries. By continuing to build on this foundation, we pave the way for more robust and predictable machine learning methodologies.

Methods

We conducted large scale numerical experiments with Julia 1.8.4 on CPUs or with Python 3.9 on GPUs of MIT Supercloud [13]. The results have no notable difference. We ran small scale tests and analyze data with Python 3.10.9 on a laptop. All codes are available online. In practice, we set number of segments N=2n𝑁superscript2𝑛N=2^{n}italic_N = 2 start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT for integer n𝑛nitalic_n, and ran tests on Nmax+1=2nmax+1subscript𝑁max1superscript2subscript𝑛max1N_{\rm max}+1=2^{n_{\rm max}}+1italic_N start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT + 1 = 2 start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + 1 learning rates evenly distributed in [0,1.5]01.5[0,1.5][ 0 , 1.5 ]. Given hyperparameters and the loss function, we ran GD for 1000 steps, and classify bounded or divergent training based on the sum of loss values of the 1000 steps. We also classified bounded or divergent training based on whether GD cannot or can hit an upper bound. The latter classification may mistake some cases where GD first diverge but then converge. However, in the tests reported in the main text, the two classification methods do not have notable difference. When analyzing data, we can choose 2n+1superscript2𝑛12^{n}+12 start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT + 1 (nnmax𝑛subscript𝑛maxn\leq n_{\rm max}italic_n ≤ italic_n start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT) evenly spaced points from the 2nmax+1superscript2subscript𝑛max12^{n_{\rm max}}+12 start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + 1 points to analyze boundary segments at a coarse-grained level, which can give |BN|subscript𝐵𝑁|B_{N}|| italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | with N=2n𝑁superscript2𝑛N=2^{n}italic_N = 2 start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. The largest nmaxsubscript𝑛maxn_{\rm max}italic_n start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT we tested is 32323232. Most times, nmax=20subscript𝑛max20n_{\rm max}=20italic_n start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = 20 is sufficient to yield a good fitting of fractal dimensions. We ran all numerical tests with data type float64, which is accurate enough for our choices of nmaxsubscript𝑛maxn_{\rm max}italic_n start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT.

The choice of learning rate range [0,1.5]01.5[0,1.5][ 0 , 1.5 ] tested relies on the facts f0=x2subscript𝑓0superscript𝑥2f_{0}=x^{2}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT has one trainability boundary at s=1.0𝑠1.0s=1.0italic_s = 1.0 and we observe a lot of trainability boundaries when s<1.0𝑠1.0s<1.0italic_s < 1.0 in practice (Fig. 1d). We prove f0=x2subscript𝑓0superscript𝑥2f_{0}=x^{2}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT has one trainability boundary as follows. By the definition of GD, on f0subscript𝑓0f_{0}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, we have

x(k+1)=x(k)2sx(k)=(12s)x(k).superscript𝑥𝑘1superscript𝑥𝑘2𝑠superscript𝑥𝑘12𝑠superscript𝑥𝑘x^{(k+1)}=x^{(k)}-2sx^{(k)}=(1-2s)x^{(k)}.italic_x start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT = italic_x start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT - 2 italic_s italic_x start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = ( 1 - 2 italic_s ) italic_x start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT . (12)

Convergence requires |12s|<112𝑠1|1-2s|<1| 1 - 2 italic_s | < 1, which gives 0<s<10𝑠10<s<10 < italic_s < 1 and thus completes the proof. We applied our numerical method to f0subscript𝑓0f_{0}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT for a rational check and found indeed there is no fractal trainability boundary for f0subscript𝑓0f_{0}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT (SI).

Details of the renormalization procedure are given as follows. For both additive and multiplicative perturbation cases, we can write the loss function in a form

f(x)=x2+ϕ.𝑓𝑥superscript𝑥2italic-ϕf(x)=x^{2}+\phi.italic_f ( italic_x ) = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_ϕ . (13)

By substituting x~=x/b~𝑥𝑥𝑏\tilde{x}=x/bover~ start_ARG italic_x end_ARG = italic_x / italic_b and f~(x~)=f(x)/ζ~𝑓~𝑥𝑓𝑥𝜁\tilde{f}(\tilde{x})=f(x)/\zetaover~ start_ARG italic_f end_ARG ( over~ start_ARG italic_x end_ARG ) = italic_f ( italic_x ) / italic_ζ into the original GD, we have

x~(k+1)=x~(k)s~f~(x~(k)),superscript~𝑥𝑘1superscript~𝑥𝑘~𝑠~𝑓superscript~𝑥𝑘\tilde{x}^{(k+1)}=\tilde{x}^{(k)}-\tilde{s}\nabla\tilde{f}(\tilde{x}^{(k)}),over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT = over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT - over~ start_ARG italic_s end_ARG ∇ over~ start_ARG italic_f end_ARG ( over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) , (14)

with s~=sζ/b2~𝑠𝑠𝜁superscript𝑏2\tilde{s}=s\zeta/b^{2}over~ start_ARG italic_s end_ARG = italic_s italic_ζ / italic_b start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and

f~(x~)=b2ζx~2+1ζϕ(bx~).~𝑓~𝑥superscript𝑏2𝜁superscript~𝑥21𝜁italic-ϕ𝑏~𝑥\tilde{f}(\tilde{x})=\frac{b^{2}}{\zeta}\tilde{x}^{2}+\frac{1}{\zeta}\phi(b% \tilde{x}).over~ start_ARG italic_f end_ARG ( over~ start_ARG italic_x end_ARG ) = divide start_ARG italic_b start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_ζ end_ARG over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_ζ end_ARG italic_ϕ ( italic_b over~ start_ARG italic_x end_ARG ) . (15)

Since we want the new function f~~𝑓\tilde{f}over~ start_ARG italic_f end_ARG to have the same function form as f𝑓fitalic_f, we need the pre-factor of x~2superscript~𝑥2\tilde{x}^{2}over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, i.e., b2/ζsuperscript𝑏2𝜁b^{2}/\zetaitalic_b start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_ζ, to be one. Consequently, we have ζ=b2𝜁superscript𝑏2\zeta=b^{2}italic_ζ = italic_b start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and the learning rate s~=s~𝑠𝑠\tilde{s}=sover~ start_ARG italic_s end_ARG = italic_s unchanged. And for the additive perturbation case, where ϕ=ϵcos(2πx/λ)italic-ϕitalic-ϵ2𝜋𝑥𝜆\phi=\epsilon\cos(2\pi x/\lambda)italic_ϕ = italic_ϵ roman_cos ( 2 italic_π italic_x / italic_λ ), if we want to write the transformed ϕ~=1ζϕ(bx~)=ϵ~cos(2πx~/λ~)~italic-ϕ1𝜁italic-ϕ𝑏~𝑥~italic-ϵ2𝜋~𝑥~𝜆\tilde{\phi}=\frac{1}{\zeta}\phi(b\tilde{x})=\tilde{\epsilon}\cos(2\pi\tilde{x% }/\tilde{\lambda})over~ start_ARG italic_ϕ end_ARG = divide start_ARG 1 end_ARG start_ARG italic_ζ end_ARG italic_ϕ ( italic_b over~ start_ARG italic_x end_ARG ) = over~ start_ARG italic_ϵ end_ARG roman_cos ( 2 italic_π over~ start_ARG italic_x end_ARG / over~ start_ARG italic_λ end_ARG ), we will arrive the results ϵ~=ϵ/b2~italic-ϵitalic-ϵsuperscript𝑏2\tilde{\epsilon}=\epsilon/b^{2}over~ start_ARG italic_ϵ end_ARG = italic_ϵ / italic_b start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and λ~=λ/b~𝜆𝜆𝑏\tilde{\lambda}=\lambda/bover~ start_ARG italic_λ end_ARG = italic_λ / italic_b. Similarly, for the multiplicative perturbation case, since ϕ=x2ϵcos(2πx/λ)italic-ϕsuperscript𝑥2italic-ϵ2𝜋𝑥𝜆\phi=x^{2}\epsilon\cos(2\pi x/\lambda)italic_ϕ = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_ϵ roman_cos ( 2 italic_π italic_x / italic_λ ) and ϕ~=1ζϕ(bx~)=x~2ϵ~cos(2πx~/λ~)~italic-ϕ1𝜁italic-ϕ𝑏~𝑥superscript~𝑥2~italic-ϵ2𝜋~𝑥~𝜆\tilde{\phi}=\frac{1}{\zeta}\phi(b\tilde{x})=\tilde{x}^{2}\tilde{\epsilon}\cos% (2\pi\tilde{x}/\tilde{\lambda})over~ start_ARG italic_ϕ end_ARG = divide start_ARG 1 end_ARG start_ARG italic_ζ end_ARG italic_ϕ ( italic_b over~ start_ARG italic_x end_ARG ) = over~ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT over~ start_ARG italic_ϵ end_ARG roman_cos ( 2 italic_π over~ start_ARG italic_x end_ARG / over~ start_ARG italic_λ end_ARG ), we will have ϵ~=ϵ~italic-ϵitalic-ϵ\tilde{\epsilon}=\epsilonover~ start_ARG italic_ϵ end_ARG = italic_ϵ and λ~=λ/b~𝜆𝜆𝑏\tilde{\lambda}=\lambda/bover~ start_ARG italic_λ end_ARG = italic_λ / italic_b. Since x~=x/b~𝑥𝑥𝑏\tilde{x}=x/bover~ start_ARG italic_x end_ARG = italic_x / italic_b is a one-to-one mapping, we know that changing the set of conditions {s,ϵ,λ,x(0)}𝑠italic-ϵ𝜆superscript𝑥0\{s,\epsilon,\lambda,x^{(0)}\}{ italic_s , italic_ϵ , italic_λ , italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT } to {s,ϵ~,λ~,x(0)/b}𝑠~italic-ϵ~𝜆superscript𝑥0𝑏\{s,\tilde{\epsilon},\tilde{\lambda},x^{(0)}/b\}{ italic_s , over~ start_ARG italic_ϵ end_ARG , over~ start_ARG italic_λ end_ARG , italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT / italic_b } will only yield a rescaled GD trajectory but not change whether the trajectory diverge or not. In other words, the conditions {ϵ,λ,x(0)}italic-ϵ𝜆superscript𝑥0\{\epsilon,\lambda,x^{(0)}\}{ italic_ϵ , italic_λ , italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT } have the same trainability boundary as {ϵ~,λ~,x(0)/b}~italic-ϵ~𝜆superscript𝑥0𝑏\{\tilde{\epsilon},\tilde{\lambda},x^{(0)}/b\}{ over~ start_ARG italic_ϵ end_ARG , over~ start_ARG italic_λ end_ARG , italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT / italic_b }.

acknowledgments

It is a pleasure to thank Weijie Su for introducing this problem and highlighting the importance, Ziming Liu for discussions on the intuitions, Jörn Dunkel for valuable discussions on aspects to explore, Jeff Gore for the support and inspiring discussions, and Jascha Sohl-Dickstein for valuable suggestions.

References

  • Choromanska et al. [2015] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun, The Loss Surfaces of Multilayer Networks, in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 38, edited by G. Lebanon and S. V. N. Vishwanathan (PMLR, San Diego, California, USA, 2015) pp. 192–204.
  • Jin et al. [2018] C. Jin, L. T. Liu, R. Ge, and M. I. Jordan, On the local minima of the empirical risk, in Neural Information Processing Systems (2018).
  • Zhang et al. [2022] Y. Zhang, Q. Qu, and J. Wright, From symmetry to geometry: Tractable nonconvex problems (2022), arXiv:2007.06753 [cs.LG] .
  • Su et al. [2016] W. Su, S. Boyd, and E. J. Candès, A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights, Journal of Machine Learning Research 17, 1 (2016).
  • Wibisono et al. [2016] A. Wibisono, A. C. Wilson, and M. I. Jordan, A variational perspective on accelerated methods in optimization, Proceedings of the National Academy of Sciences 113, E7351 (2016)https://www.pnas.org/doi/pdf/10.1073/pnas.1614734113 .
  • Kong and Tao [2020] L. Kong and M. Tao, Stochasticity of deterministic gradient descent: Large learning rate for multiscale objective function, in Advances in Neural Information Processing Systems, Vol. 33, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Curran Associates, Inc., 2020) pp. 2625–2638.
  • Shi et al. [2023] B. Shi, W. Su, and M. I. Jordan, On Learning Rates and Schrödinger Operators, Journal of Machine Learning Research 24, 1 (2023).
  • Chen et al. [2023] X. Chen, K. Balasubramanian, P. Ghosal, and B. Agrawalla, From stability to chaos: Analyzing gradient descent dynamics in quadratic regression (2023), arXiv:2310.01687 [cs.LG] .
  • Sohl-Dickstein [2024] J. Sohl-Dickstein, The boundary of neural network trainability is fractal (2024), arXiv:2402.06184 [cs.LG] .
  • Mandelbrot [1982] B. B. Mandelbrot, The fractal geometry of nature, Vol. 1 (WH freeman New York, 1982).
  • Strogatz [2018] S. H. Strogatz, Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering (CRC press, 2018).
  • Falconer [2007] K. Falconer, Fractal geometry: mathematical foundations and applications (John Wiley & Sons, 2007).
  • Reuther et al. [2018] A. Reuther, J. Kepner, C. Byun, S. Samsi, W. Arcand, D. Bestor, B. Bergeron, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein, L. Milechin, J. Mullen, A. Prout, A. Rosa, C. Yee, and P. Michaleas, Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis, in 2018 IEEE High Performance Extreme Computing Conference (HPEC) (2018) pp. 1–6.

Supplementary Information

All the raw data and comparison between fitted line and raw data points are available online.

We show two examples of training loss updating on our constructed loss landscapes in Fig. S1. The sanity check of our method on the quadratic function f0=x2subscript𝑓0superscript𝑥2f_{0}=x^{2}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT is presented in Fig. S2.

The multiplicative case is always non-convex and should always display fractal trainability boundary. But if we study the second derivative,

2f×(x)=superscript2subscript𝑓𝑥absent\displaystyle\nabla^{2}f_{\times}(x)=∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT ( italic_x ) = 2+2ϵcos(2πxλ)4ϵ2πxλsin(2πxλ)22italic-ϵ2𝜋𝑥𝜆4italic-ϵ2𝜋𝑥𝜆2𝜋𝑥𝜆\displaystyle 2+2\epsilon\cos\left(\frac{2\pi x}{\lambda}\right)-4\epsilon% \frac{2\pi x}{\lambda}\sin\left(\frac{2\pi x}{\lambda}\right)2 + 2 italic_ϵ roman_cos ( divide start_ARG 2 italic_π italic_x end_ARG start_ARG italic_λ end_ARG ) - 4 italic_ϵ divide start_ARG 2 italic_π italic_x end_ARG start_ARG italic_λ end_ARG roman_sin ( divide start_ARG 2 italic_π italic_x end_ARG start_ARG italic_λ end_ARG )
ϵ(2πxλ)2cos(2πxλ),italic-ϵsuperscript2𝜋𝑥𝜆22𝜋𝑥𝜆\displaystyle-\epsilon\left(\frac{2\pi x}{\lambda}\right)^{2}\cos\left(\frac{2% \pi x}{\lambda}\right),- italic_ϵ ( divide start_ARG 2 italic_π italic_x end_ARG start_ARG italic_λ end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_cos ( divide start_ARG 2 italic_π italic_x end_ARG start_ARG italic_λ end_ARG ) , (S1)

we find that 2f×(x)>0superscript2subscript𝑓𝑥0\nabla^{2}f_{\times}(x)>0∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT ( italic_x ) > 0 in a region |x|𝑥|x|| italic_x | is small enough. If we set an upper bound fmaxsubscript𝑓maxf_{\rm max}italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT to tell whether training is classified to be bounded or divergent and do not care the dynamics once it goes beyond fmaxsubscript𝑓maxf_{\rm max}italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT, GD actually only sees the loss within the region |x|<fmax𝑥subscript𝑓max|x|<\sqrt{f_{\rm max}}| italic_x | < square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG roughly. And if ϵitalic-ϵ\epsilonitalic_ϵ is too small, in the region |x|<fmax𝑥subscript𝑓max|x|<\sqrt{f_{\rm max}}| italic_x | < square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG, the loss may be convex and we may not see any fractal behaviors, which is a numerical artifact. To test the idea, we set different fmaxsubscript𝑓maxf_{\rm max}italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT values and stop GD and regard it as divergent once the loss reaches fmaxsubscript𝑓maxf_{\rm max}italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT. We found too small ϵitalic-ϵ\epsilonitalic_ϵ indeed will make fractal behaviors vanish and the transition to fractal behaviors differs for different upper bounds fmaxsubscript𝑓maxf_{\rm max}italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT (Fig. S3). We next analyze when the loss within |x|<fmax𝑥subscript𝑓max|x|<\sqrt{f_{\rm max}}| italic_x | < square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG can be non-convex and see if this case corresponds to the emergence of fractal behaviors. The critical situation is that 2f×superscript2subscript𝑓\nabla^{2}f_{\times}∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT becomes zero near |x|=fmax𝑥subscript𝑓max|x|=\sqrt{f_{\rm max}}| italic_x | = square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG. Since fmaxsubscript𝑓max\sqrt{f_{\rm max}}square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG is very large, the quadratic term dominants among terms having ϵitalic-ϵ\epsilonitalic_ϵ, we have the minimum second derivative near |x|=fmax𝑥subscript𝑓max|x|=\sqrt{f_{\rm max}}| italic_x | = square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG being

min2f×2ϵfmax(2πλ)2.superscript2subscript𝑓2italic-ϵsubscript𝑓maxsuperscript2𝜋𝜆2\displaystyle\min\nabla^{2}f_{\times}\approx 2-\epsilon f_{\rm max}\left(\frac% {2\pi}{\lambda}\right)^{2}.roman_min ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT ≈ 2 - italic_ϵ italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( divide start_ARG 2 italic_π end_ARG start_ARG italic_λ end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (S2)

By setting the above estimation to be zero, we reach the boundary of non-convexity for loss within |x|<fmax𝑥subscript𝑓max|x|<\sqrt{f_{\rm max}}| italic_x | < square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG as

ϵ=1fmaxλ22π2.italic-ϵ1subscript𝑓maxsuperscript𝜆22superscript𝜋2\epsilon=\frac{1}{f_{\rm max}}\frac{\lambda^{2}}{2\pi^{2}}.italic_ϵ = divide start_ARG 1 end_ARG start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG divide start_ARG italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_π start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG . (S3)

These estimated boundaries are plotted in the (ϵ,λ)italic-ϵ𝜆(\epsilon,\lambda)( italic_ϵ , italic_λ ) plane for different fmaxsubscript𝑓maxf_{\rm max}italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT values (red curves in Fig. S3). Note greater ϵitalic-ϵ\epsilonitalic_ϵ means non-convexity, we found it is true that fractal behaviors only show up when the part of loss function GD can see becomes non-convex (non-zero fractal dimensions are all above the red curves). We also note that with increasing upper bound fmaxsubscript𝑓maxf_{\rm max}italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT, the boundary of non-convexity (red curves) tend to be smaller and smaller than the boundary of fractal behaviors. This might due to the fact the region |x|<fmax𝑥subscript𝑓max|x|<\sqrt{f_{\rm max}}| italic_x | < square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG is expanding with larger fmaxsubscript𝑓maxf_{\rm max}italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT and near critical (ϵ,λ)italic-ϵ𝜆(\epsilon,\lambda)( italic_ϵ , italic_λ ) for non-convexity, it is more difficult to see the non-convex part near |x|fmax𝑥subscript𝑓max|x|\approx\sqrt{f_{\rm max}}| italic_x | ≈ square-root start_ARG italic_f start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG. In conclusion, the numerical artifact help to further prove the idea that loss becoming non-convex leads to the emergence of fractal behaviors in training.

We checked the dependence of fractal dimension on the initial condition of parameter x(0)superscript𝑥0x^{(0)}italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT in Fig. S4 and S5, which suggest initial parameter x(0)superscript𝑥0x^{(0)}italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT may not affect fractal dimension.

Refer to caption
Figure S1: Training loss can be bounded or divergent. (a) An example obtained based on the multiplicative noise case with amplitude ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2, wavelength λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1, and learning rate s=0.01𝑠0.01s=0.01italic_s = 0.01, where the loss will decay and be bounded. (b) An example of divergent training based on the multiplicative noise case with amplitude ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2, wavelength λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1, and learning rate s=0.2𝑠0.2s=0.2italic_s = 0.2.
Refer to caption
Figure S2: Sanity check on the quadratic loss function indicates our numerical method is not wrong. (a) The quadratic function f0=x2subscript𝑓0superscript𝑥2f_{0}=x^{2}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, on which we know there is only one trainability boundary s=0𝑠0s=0italic_s = 0. (b) We found with our numerical method |BN|subscript𝐵𝑁|B_{N}|| italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | is always 1111, consistent with the theory.
Refer to caption
Figure S3: Numerical artifact supports that fractal behaviors emerge when the landscape is non-convex. When we increase the loss upper bound for classifying bounded and divergent training, GD can run on greater regions. On different parameter (x𝑥xitalic_x) regions, the function parameters (ϵ,λ)italic-ϵ𝜆(\epsilon,\lambda)( italic_ϵ , italic_λ ) for multiplicative case (f×(x)subscript𝑓𝑥f_{\times}(x)italic_f start_POSTSUBSCRIPT × end_POSTSUBSCRIPT ( italic_x )) to be non-convex are different (red curves in the figure is boundary of convexity and non-convexity). Changing the upper bound from (a) 1e+3, (b) 1e+4, (c) 1e+5, to (d) 1e+6, the fractal behaviors all show up after the region GD runs on becomes non-convex (above the red curves).
Refer to caption
Figure S4: Tests suggest that initial parameter x(0)superscript𝑥0x^{(0)}italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT may not affect fractal dimension. We set for the additive noise case ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2 and λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1, and sampled 100 different initial conditions x(0)superscript𝑥0x^{(0)}italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT uniformly from [5,5]55[-5,5][ - 5 , 5 ]. The fractal dimension averaged over initial conditions is 0.980.980.980.98 with a standard deviation 0.080.080.080.08.
Refer to caption
Figure S5: Tests suggest that initial parameter x(0)superscript𝑥0x^{(0)}italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT may not affect fractal dimension. We set for the multiplicative noise case ϵ=0.2italic-ϵ0.2\epsilon=0.2italic_ϵ = 0.2 and λ=0.1𝜆0.1\lambda=0.1italic_λ = 0.1, and sampled 100 different initial conditions x(0)superscript𝑥0x^{(0)}italic_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT uniformly from [5,5]55[-5,5][ - 5 , 5 ]. The fractal dimension averaged over initial conditions is 1.001.001.001.00 with a standard deviation 0.010.010.010.01.