Publications

2024
H. Hu, Y. M. Lu, and T. Misiakiewicz, “Asymptotics of Random Feature Regression Beyond the Linear Scaling Regime,” 2024. arXiv:2403.08160 [stat.ML]
Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data. It was shown, e.g., through the double descent phenomenon, that the number of parameters is a poor proxy for the model complexity and generalization capabilities. This leaves open the question of understanding the impact of parametrization on the performance of these models. How do model complexity and generalization depend on the number of parameters p? How should we choose p relative to the sample size n to achieve optimal test error?
In this paper, we investigate the example of random feature ridge regression (RFRR). This model can be seen either as a finite-rank approximation to kernel ridge regression (KRR), or as a simplified model for neural networks trained in the so-called lazy regime. We consider covariates uniformly distributed on the d-dimensional sphere and compute sharp asymptotics for the RFRR test error in the high-dimensional polynomial scaling, where $p, n, d \to \infty$ while $p/d^{\kappa_1}$ and $n/d^{\kappa_2}$ stay constant, for all $\kappa_1, \kappa_2 \in \mathbb{R}_{>0}$. These asymptotics precisely characterize the impact of the number of random features and regularization parameter on the test performance. In particular, RFRR exhibits an intuitive trade-off between approximation and generalization power. For $n = o(p)$, the sample size n is the bottleneck and RFRR achieves the same performance as KRR (which is equivalent to taking $p = \infty$). On the other hand, if $p = o(n)$, the number of random features p is the limiting factor and RFRR test error matches the approximation error of the random feature model class (akin to taking $n = \infty$). Finally, a double descent appears at $n = p$, a phenomenon that was previously only characterized in the linear scaling $\kappa_1 = \kappa_2 = 1$.
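As a companion to this abstract, here is a minimal numerical sketch of the RFRR model it studies; the unit-sphere covariates match the setting above, while the ReLU activation, toy target, problem sizes, and regularization level are illustrative assumptions of this sketch, not the paper's choices.

# Random feature ridge regression (RFRR) on sphere data: an illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)
d, n, p, lam = 50, 2000, 1000, 1e-3       # dimension, samples, random features, ridge level

def sphere(m, d):
    # m points drawn uniformly from the unit sphere in R^d
    Z = rng.standard_normal((m, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

X, X_test = sphere(n, d), sphere(1000, d)
f_star = lambda A: np.sqrt(d) * A[:, 0] + d * A[:, 0] * A[:, 1]   # toy low-degree target
y, y_test = f_star(X), f_star(X_test)

W = sphere(p, d)                                   # random first-layer directions
act = lambda Z: np.maximum(Z, 0.0)                 # ReLU activation
Phi = act(np.sqrt(d) * (X @ W.T))                  # n x p random feature matrix
Phi_test = act(np.sqrt(d) * (X_test @ W.T))

# Ridge fit of the second-layer weights: a = (Phi^T Phi + n*lam*I)^{-1} Phi^T y
a = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(p), Phi.T @ y)
print("RFRR test error:", np.mean((Phi_test @ a - y_test) ** 2))
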
H. Cui, et al., “Asymptotics of feature learning in two-layer networks after one gradient-step,” 2024. arXiv:2402.04980 [stat.ML]
In this manuscript we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging a connection from (Ba et al., 2022) with a non-linear spiked matrix model and recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error in the high-dimensional limit where the number of samples n, the width p and the input dimension d grow at a proportional rate. We characterize exactly how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient, where at initialization it can only express linear functions in this regime. To our knowledge, our results provide the first tight description of the impact of feature learning on the generalization of two-layer neural networks in the large learning rate regime $\eta = \Theta_d(d)$, beyond perturbative finite width corrections of the conjugate and neural tangent kernels.
2023
R. Dudeja, S. Sen, and Y. M. Lu, “Spectral Universality of Regularized Linear Regression with Nearly Deterministic Sensing Matrices,” IEEE Transactions on Information Theory, in revision, 2023. arXiv:2208.02753 [cs.IT]
It has been observed that the performances of many high-dimensional estimation problems are universal with respect to underlying sensing (or design) matrices. Specifically, matrices with markedly different constructions seem to achieve identical performance if they share the same spectral distribution and have “generic” singular vectors. We prove this universality phenomenon for the case of convex regularized least squares (RLS) estimators under a linear regression model with additive Gaussian noise. Our main contributions are two-fold: (1) We introduce a notion of universality classes for sensing matrices, defined through a set of deterministic conditions that fix the spectrum of the sensing matrix and precisely capture the previously heuristic notion of generic singular vectors; (2) We show that for all sensing matrices that lie in the same universality class, the dynamics of the proximal gradient descent algorithm for solving the regression problem, as well as the performance of RLS estimators themselves (under additional strong convexity conditions) are asymptotically identical. In addition to including i.i.d. Gaussian and rotational invariant matrices as special cases, our universality class also contains highly structured, strongly correlated, or even (nearly) deterministic matrices. Examples of the latter include randomly signed versions of incoherent tight frames and randomly subsampled Hadamard transforms. As a consequence of this universality principle, the asymptotic performance of regularized linear regression on many structured matrices constructed with limited randomness can be characterized by using the rotationally invariant ensemble as an equivalent yet mathematically more tractable surrogate.
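As an illustration of the sensing ensembles this universality statement covers, the sketch below runs proximal gradient descent (ISTA) for the LASSO with a randomly signed, randomly subsampled Hadamard matrix and with an i.i.d. Gaussian matrix of the same dimensions. All sizes, the sparsity level, the noise level, and the regularization are placeholder assumptions; the script only exhibits the two constructions and the algorithm, not the universality proof.

# Proximal gradient descent (ISTA) for the LASSO under two sensing ensembles:
# (a) randomly signed + randomly subsampled Hadamard, (b) i.i.d. Gaussian.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(1)
n, m, k, lam, sigma = 1024, 512, 50, 0.05, 0.1   # signal dim, measurements, sparsity, reg., noise

# (a) randomly signed, randomly subsampled Hadamard sensing matrix (nearly deterministic)
A_had = hadamard(n)[rng.choice(n, m, replace=False)] / np.sqrt(m)
A_had = A_had * rng.choice([-1.0, 1.0], size=n)
# (b) i.i.d. Gaussian sensing matrix with matching column normalization
A_gau = rng.standard_normal((m, n)) / np.sqrt(m)

x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

def ista(A, y, lam, iters=300):
    # proximal gradient descent for (1/2)||y - A x||^2 + lam * ||x||_1
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = x - (A.T @ (A @ x - y)) / L        # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)   # soft-thresholding prox
    return x

for name, A in [("signed subsampled Hadamard", A_had), ("i.i.d. Gaussian", A_gau)]:
    y = A @ x0 + sigma * rng.standard_normal(m)
    x_hat = ista(A, y, lam)
    print(f"{name:>27s}  MSE = {np.mean((x_hat - x0) ** 2):.5f}")
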
L. Xiao, H. Hu, T. Misiakiewicz, Y. M. Lu, and J. Pennington, “Precise learning curves and higher-order scaling limits for dot-product kernel regression,” Journal of Statistical Mechanics: Theory and Experiment, 2023.
S. Dubova, Y. M. Lu, B. McKenna, and H.-T. Yau, “Universality for the global spectrum of random inner-product kernel matrices in the polynomial regime,” preprint, 2023. arXiv:2310.18280 [math.PR]
R. Dudeja, Y. M. Lu, and S. Sen, “Universality of Approximate Message Passing with Semi-Random Matrices,” Annals of Probability, vol. 51, no. 5, pp. 1616-1683, 2023. arXiv:2204.04281 [math.PR]
Approximate Message Passing (AMP) is a class of iterative algorithms that have found applications in many problems in high-dimensional statistics and machine learning. In its general form, AMP can be formulated as an iterative procedure driven by a matrix M. Theoretical analyses of AMP typically assume strong distributional properties on M such as M has i.i.d. sub-Gaussian entries or is drawn from a rotational invariant ensemble. However, numerical experiments suggest that the behavior of AMP is universal, as long as the eigenvectors of M are generic. In this paper, we take the first step in rigorously understanding this universality phenomenon. In particular, we investigate a class of memory-free AMP algorithms (proposed by Çakmak and Opper for mean-field Ising spin glasses), and show that their asymptotic dynamics is universal on a broad class of semi-random matrices. In addition to having the standard rotational invariant ensemble as a special case, the class of semi-random matrices that we define in this work also includes matrices constructed with very limited randomness. One such example is a randomly signed version of the Sine model, introduced by Marinari, Parisi, Potters, and Ritort for spin glasses with fully deterministic couplings.
Y. M. Lu and H.-T. Yau, “An Equivalence Principle for the Spectrum of Random Inner-Product Kernel Matrices,” preprint, 2023. arXiv:2205.06308 [math.PR]
We consider random matrices whose entries are obtained by applying a (nonlinear) kernel function to the pairwise inner products between \(n\) independent data vectors drawn uniformly from the unit sphere in \(\mathbb{R}^d\). Our study of this model is motivated by problems in machine learning, statistics, and signal processing, where such inner-product kernel random matrices and their spectral properties play important roles. Under mild conditions on the kernel function, we establish the weak-limit of the empirical spectral distribution of these matrices when \(d, n \to \infty\) such that \(n/d^\ell \to \kappa \in (0, \infty)\), for some fixed \(\ell \in \mathbb{N}\) and \(\kappa \in \mathbb{R}\). This generalizes an earlier result of Cheng and Singer (2013), who studied the same model in the linear scaling regime (with \(\ell = 1\) and \(n/d \to \kappa\)). The main insight of our work is a general equivalence principle: the spectrum of the random kernel matrix is asymptotically equivalent to that of a simpler matrix model, constructed as the linear combination of a (shifted) Wishart matrix and an independent matrix drawn from the Gaussian orthogonal ensemble. The aspect ratio of the Wishart matrix and the coefficients of the linear combination are determined by \(\ell\) and by the expansion of the kernel function in the orthogonal Hermite polynomial basis. Consequently, the limiting spectrum of the random kernel matrix can be characterized as the free additive convolution between a Marchenko-Pastur law and a semicircle law.
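A small simulation of the random matrix model in question can make the object concrete. The sketch below samples sphere data, applies a nonlinear kernel to the pairwise inner products, and inspects the empirical spectrum; the \(\sqrt{d}\) argument scaling and the \(1/\sqrt{n}\) spectral normalization are assumptions of this sketch rather than the paper's conventions.

# Empirical spectrum of an inner-product kernel random matrix (illustrative normalization).
import numpy as np

rng = np.random.default_rng(2)
d, n = 30, 900                              # a small instance of the polynomial regime n ~ d^2
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # n points uniform on the unit sphere

f = np.tanh                                 # a generic nonlinear kernel function
K = f(np.sqrt(d) * (X @ X.T))               # kernel applied to rescaled pairwise inner products
np.fill_diagonal(K, 0.0)                    # keep only the off-diagonal kernel entries
K /= np.sqrt(n)                             # rescale so the bulk spectrum stays O(1)

eigs = np.linalg.eigvalsh(K)
print("empirical spectral range:", float(eigs.min()), float(eigs.max()))
print("quartiles of the spectrum:", np.percentile(eigs, [25, 50, 75]))
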
2022
H. Hu and Y. M. Lu, “Universality Laws for High-Dimensional Learning with Random Features,” IEEE Transactions on Information Theory, in press, 2022. arXiv:2009.07669 [cs.IT]
We prove a universality theorem for learning with random features. Our result shows that, in terms of training and generalization errors, the random feature model with a nonlinear activation function is asymptotically equivalent to a surrogate Gaussian model with a matching covariance matrix. This settles a conjecture based on which several recent papers develop their results. Our method for proving the universality builds on the classical Lindeberg approach. Major ingredients of the proof include a leave-one-out analysis for the optimization problem associated with the training process and a central limit theorem, obtained via Stein's method, for weakly correlated random variables.
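For intuition, here is a sketch of a Gaussian surrogate in one common parameterization (an assumption of this sketch, not a restatement of the paper's exact model): each nonlinear feature sigma(<w, x>) is replaced by mu0 + mu1*<w, x> + mu_star*z with z an independent standard normal, the coefficients being chosen so that the second moments match.

# A Gaussian-equivalent surrogate for random features, under assumed scalings.
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

rng = np.random.default_rng(3)
d, n, p = 100, 400, 300
sigma = np.tanh

# Gauss-Hermite quadrature (probabilists' weight) for expectations over g ~ N(0, 1)
nodes, weights = hermegauss(80)
weights = weights / weights.sum()
mu0 = np.sum(weights * sigma(nodes))
mu1 = np.sum(weights * nodes * sigma(nodes))
mu_star = np.sqrt(max(np.sum(weights * sigma(nodes) ** 2) - mu0**2 - mu1**2, 0.0))

# Data and feature directions normalized so that <w, x> is approximately N(0, 1)
X = rng.standard_normal((n, d)) / np.sqrt(d)
W = rng.standard_normal((p, d))

Phi_nonlinear = sigma(X @ W.T)
Phi_surrogate = mu0 + mu1 * (X @ W.T) + mu_star * rng.standard_normal((n, p))

# The variances of the two feature ensembles approximately match by construction
print("nonlinear feature variance :", Phi_nonlinear.var())
print("surrogate feature variance :", Phi_surrogate.var())
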
H. Hu and Y. M. Lu, “Asymptotics and Optimal Designs of SLOPE for Sparse Linear Regression,” IEEE Transactions on Information Theory, vol. 68, no. 11, pp. 7627-7664, 2022. arXiv:1903.11582 [cs.IT]
In sparse linear regression, the SLOPE estimator generalizes LASSO by assigning magnitude-dependent regularizations to different coordinates of the estimate. In this paper, we present an asymptotically exact characterization of the performance of SLOPE in the high-dimensional regime where the number of unknown parameters grows in proportion to the number of observations. Our asymptotic characterization enables us to derive optimal regularization sequences to either minimize the MSE or to maximize the power in variable selection under any given level of Type-I error. In both cases, we show that the optimal design can be recast as certain infinite-dimensional convex optimization problems, which have efficient and accurate finite-dimensional approximations. Numerical simulations verify our asymptotic predictions. They also demonstrate the superiority of our optimal design over LASSO and a regularization sequence previously proposed in the literature.
L. Xiao, H. Hu, T. Misiakiewicz, Y. M. Lu, and J. Pennington, “Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression,” in Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.
As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes. Currently, theoretical understanding of the learning curves that characterize how the prediction error depends on the number of samples is restricted to either large-sample asymptotics ($m\to\infty$) or, for certain simple data distributions, to the high-dimensional asymptotics in which the number of samples scales linearly with the dimension ($m\propto d$). There is a wide gulf between these two regimes, including all higher-order scaling relations $m\propto d^r$, which are the subject of the present paper. We focus on the problem of kernel ridge regression for dot-product kernels and present precise formulas for the mean of the test error, bias, and variance, for data drawn uniformly from the sphere with isotropic random labels in the $r$th-order asymptotic scaling regime $m\to\infty$ with $m/d^r$ held constant. We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales. 
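A minimal kernel ridge regression run in this setting (sphere data, a dot-product kernel) is sketched below; the particular kernel $h(t) = e^t$, the toy target, and the sample size $m \approx 2 d^2$ are assumptions chosen only for illustration.

# Kernel ridge regression with a dot-product kernel on sphere data: an illustrative sketch.
import numpy as np

rng = np.random.default_rng(4)
d, m, lam = 20, 800, 1e-3           # dimension, samples (roughly 2*d^2), ridge parameter

def sphere(k, d):
    Z = rng.standard_normal((k, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

X, X_test = sphere(m, d), sphere(500, d)
f_star = lambda A: np.sqrt(d) * A[:, 0] + d * A[:, 0] * A[:, 1]   # toy target
y, y_test = f_star(X), f_star(X_test)

kernel = lambda A, B: np.exp(A @ B.T)       # a simple dot-product kernel h(<x, x'>)
K = kernel(X, X)
alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
y_hat = kernel(X_test, X) @ alpha
print("KRR test error:", np.mean((y_hat - y_test) ** 2))
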
H. Hu and Y. M. Lu, “Sharp Asymptotics of Kernel Ridge Regression Beyond the Linear Regime,” technical report, 2022. arXiv:2205.06798 [cs.LG]
The generalization performance of kernel ridge regression (KRR) exhibits a multi-phased pattern that crucially depends on the scaling relationship between the sample size n and the underlying dimension d. This phenomenon is due to the fact that KRR sequentially learns functions of increasing complexity as the sample size increases; when $d^{k-1} \ll n \ll d^k$, only polynomials with degree less than k are learned. In this paper, we present sharp asymptotic characterization of the performance of KRR at the critical transition regions with $n \asymp d^k$, for $k \in \mathbb{Z}_+$. Our asymptotic characterization provides a precise picture of the whole learning process and clarifies the impact of various parameters (including the choice of the kernel function) on the generalization performance. In particular, we show that the learning curves of KRR can have a delicate "double descent" behavior due to specific bias-variance trade-offs at different polynomial scaling regimes.
B. Çakmak, Y. M. Lu, and M. Opper, “Analysis of Random Sequential Message Passing Algorithms for Approximate Inference,” Journal of Statistical Mechanics: Theory and Experiment, no. 073401, 2022. arXiv:2202.08198 [cs.LG]
We analyze the dynamics of a random sequential message passing algorithm for approximate inference with large Gaussian latent variable models in a student-teacher scenario. To model nontrivial dependencies between the latent variables, we assume random covariance matrices drawn from rotation invariant ensembles. Moreover, we consider a model mismatch setting, where the teacher model and the one used by the student may be different. By means of the dynamical functional approach, we obtain exact dynamical mean-field equations characterizing the dynamics of the inference algorithm. We also derive a range of model parameters for which the sequential algorithm does not converge. The boundary of this parameter range coincides with the de Almeida-Thouless (AT) stability condition of the replica symmetric ansatz for the static probabilistic model.
2021
Y. M. Lu, “Householder Dice: A Matrix-Free Algorithm for Simulating Dynamics on Gaussian and Random Orthogonal Ensembles,” IEEE Transactions on Information Theory, vol. 67, no. 12, pp. 8264-8272, 2021. arXiv:2101.07464 [cs.IT]
This paper proposes a new algorithm, named Householder Dice (HD), for simulating dynamics on dense random matrix ensembles with translation-invariant properties. Examples include the Gaussian ensemble, the Haar-distributed random orthogonal ensemble, and their complex-valued counterparts. A "direct" approach to the simulation, where one first generates a dense $n \times n$ matrix from the ensemble, requires at least $O(n^2)$ resource in space and time. The HD algorithm overcomes this $O(n^2)$ bottleneck by using the principle of deferred decisions: rather than fixing the entire random matrix in advance, it lets the randomness unfold with the dynamics. At the heart of this matrix-free algorithm is an adaptive and recursive construction of (random) Householder reflectors. These orthogonal transformations exploit the group symmetry of the matrix ensembles, while simultaneously maintaining the statistical correlations induced by the dynamics. The memory and computation costs of the HD algorithm are $O(nT)$ and $O(nT^2)$, respectively, with T being the number of iterations. When $T \ll n$, which is nearly always the case in practice, the new algorithm leads to significant reductions in runtime and memory footprint. Numerical results demonstrate the promise of the HD algorithm as a new computational tool in the study of high-dimensional random systems.
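The deferred-decisions principle can be illustrated in its simplest special case: a non-symmetric i.i.d. Gaussian matrix that the dynamics only queries through products A x. The sketch below is an assumption-laden illustration of that restricted case and is not the Householder Dice algorithm itself, which uses recursive Householder reflectors to handle symmetric and Haar orthogonal ensembles as well.

# Matrix-free simulation of products A @ x_t for a (never instantiated) n x n matrix A
# with i.i.d. N(0,1) entries, using deferred decisions: A acting on a fresh orthonormal
# direction is an independent standard Gaussian vector.
import numpy as np

class GaussianMatvec:
    def __init__(self, n, seed=0):
        self.n = n
        self.rng = np.random.default_rng(seed)
        self.Q = np.zeros((n, 0))      # orthonormal basis of the directions queried so far
        self.AQ = np.zeros((n, 0))     # revealed images A @ Q

    def matvec(self, x):
        # Split x into its component in span(Q) and a new orthogonal direction.
        c = self.Q.T @ x
        r = x - self.Q @ c
        out = self.AQ @ c
        nrm = np.linalg.norm(r)
        if nrm > 1e-12 * np.linalg.norm(x):
            q_new = r / nrm
            aq_new = self.rng.standard_normal(self.n)     # fresh, independent Gaussian image
            self.Q = np.column_stack([self.Q, q_new])
            self.AQ = np.column_stack([self.AQ, aq_new])
            out = out + nrm * aq_new
        return out

# Usage: T iterations of a toy power-like dynamic, with O(nT) memory instead of O(n^2).
n, T = 5000, 20
A = GaussianMatvec(n)
x = np.ones(n) / np.sqrt(n)
for _ in range(T):
    x = A.matvec(x)
    x /= np.linalg.norm(x)
print("done; directions revealed:", A.Q.shape[1])
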
O. Dhifallah and Y. M. Lu, “A Precise Performance Analysis of Learning with Random Features,” technical report, 2021. arXiv:2008.11904 [cs.IT]
We study the problem of learning an unknown function using random feature models. Our main contribution is an exact asymptotic analysis of such learning problems with Gaussian data. Under mild regularity conditions for the feature matrix, we provide an exact characterization of the asymptotic training and generalization errors, valid in both the under-parameterized and over-parameterized regimes. The analysis presented in this paper holds for general families of feature matrices, activation functions, and convex loss functions. Numerical results validate our theoretical predictions, showing that our asymptotic findings are in excellent agreement with the actual performance of the considered learning problem, even in moderate dimensions. Moreover, they reveal an important role played by the regularization, the loss function and the activation function in the mitigation of the "double descent phenomenon" in learning.
A. Maillard, F. Krzakala, Y. M. Lu, and L. Zdeborová, “Construction of optimal spectral methods in phase retrieval,” in Mathematical and Scientific Machine Learning, 2021. arXiv:2012.04524 [cs.IT]
We consider the phase retrieval problem, in which the observer wishes to recover an n-dimensional real or complex signal X⋆ from the (possibly noisy) observation of |ΦX⋆|, in which Φ is a matrix of size m×n. We consider a high-dimensional setting where n,m→∞ with m/n=O(1), and a large class of (possibly correlated) random matrices Φ and observation channels. Spectral methods are a powerful tool to obtain approximate observations of the signal X⋆ which can then be used as initialization for a subsequent algorithm, at a low computational cost. In this paper, we extend and unify previous results and approaches on spectral methods for the phase retrieval problem. More precisely, we combine the linearization of message-passing algorithms and the analysis of the Bethe Hessian, a classical tool of statistical physics. Using this toolbox, we show how to derive optimal spectral methods for arbitrary channel noise and right-unitarily invariant matrix Φ, in an automated manner (i.e., with no optimization over any hyperparameter or preprocessing function).
O. Dhifallah and Y. M. Lu, “On the Inherent Regularization Effects of Noise Injection During Training,” in International Conference on Machine Learning (ICML), 2021. arXiv:2102.07379 [cs.LG]
Randomly perturbing networks during the training process is a commonly used approach to improving generalization performance. In this paper, we present a theoretical study of one particular way of random perturbation, which corresponds to injecting artificial noise to the training data. We provide a precise asymptotic characterization of the training and generalization errors of such randomly perturbed learning problems on a random feature model. Our analysis shows that Gaussian noise injection in the training process is equivalent to introducing a weighted ridge regularization, when the number of noise injections tends to infinity. The explicit form of the regularization is also given. Numerical results corroborate our asymptotic predictions, showing that they are accurate even in moderate problem dimensions. Our theoretical predictions are based on a new correlated Gaussian equivalence conjecture that generalizes recent results in the study of random feature models.
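For intuition, the classical linear-regression version of this equivalence (a standard identity, included here only as a simpler stand-in for the paper's weighted ridge formula for random feature models): averaging the perturbed squared loss over i.i.d. Gaussian input noise \(E \in \mathbb{R}^{n \times d}\) with \(\mathcal{N}(0, \sigma^2)\) entries produces an explicit ridge penalty,
\[
\mathbb{E}_{E}\,\|y - (X + E)w\|^2
= \|y - Xw\|^2 - 2\,\mathbb{E}[Ew]^\top (y - Xw) + \mathbb{E}\,\|Ew\|^2
= \|y - Xw\|^2 + n\sigma^2 \|w\|^2,
\]
since \(\mathbb{E}[E] = 0\) and \(\mathbb{E}[E^\top E] = n\sigma^2 I_d\). The paper's result can be read as the analogue of this identity for random feature models with nonlinear activations, where the penalty becomes a weighted ridge.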
O. Dhifallah and Y. M. Lu, “Phase Transitions in Transfer Learning for High-Dimensional Perceptrons,” Entropy, Special Issue "The Role of Signal Processing and Information Theory in Modern Machine Learning", vol. 23, no. 4, 2021. arXiv:2101.01918 [cs.LG]
Transfer learning seeks to improve the generalization performance of a target task by exploiting the knowledge learned from a related source task. Central questions include deciding what information one should transfer and when transfer can be beneficial. The latter question is related to the so-called negative transfer phenomenon, where the transferred source information actually reduces the generalization performance of the target task. This happens when the two tasks are sufficiently dissimilar. In this paper, we present a theoretical analysis of transfer learning by studying a pair of related perceptron learning tasks. Despite the simplicity of our model, it reproduces several key phenomena observed in practice. Specifically, our asymptotic analysis reveals a phase transition from negative transfer to positive transfer as the similarity of the two tasks moves past a well-defined threshold.
2020
H. Hu and Y. M. Lu, “The Limiting Poisson Law of Massive MIMO Detection with Box Relaxation,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 695-704, 2020. arXiv:2006.08416 [cs.IT]
Estimating a binary vector from noisy linear measurements is a prototypical problem for MIMO systems. A popular algorithm, called the box-relaxation decoder, estimates the target signal by solving a least squares problem with convex constraints. This paper shows that the performance of the algorithm, measured by the number of incorrectly-decoded bits, has a limiting Poisson law. This occurs when the sampling ratio and noise variance, two key parameters of the problem, follow certain scalings as the system dimension grows. Moreover, at a well-defined threshold, the probability of perfect recovery is shown to undergo a phase transition that can be characterized by the Gumbel distribution. Numerical simulations corroborate these theoretical predictions, showing that they match the actual performance of the algorithm even in moderate system dimensions.
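A direct sketch of the box-relaxation decoder described here, written with a generic bounded least-squares solver; the problem sizes, noise level, and ±1 signal model are illustrative assumptions of this sketch.

# Box-relaxation decoder for +/-1 signals:
# solve min_x ||y - A x||^2 subject to -1 <= x_i <= 1, then round by sign.
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(5)
n, m, sigma = 200, 300, 0.5              # signal dim, measurements, noise std (assumed values)

x0 = rng.choice([-1.0, 1.0], size=n)     # binary transmitted vector
A = rng.standard_normal((m, n)) / np.sqrt(n)
y = A @ x0 + sigma * rng.standard_normal(m)

res = lsq_linear(A, y, bounds=(-1.0, 1.0))   # convex box-constrained least squares
x_hat = np.sign(res.x)
x_hat[x_hat == 0] = 1.0
print("bit errors:", int(np.sum(x_hat != x0)), "out of", n)
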
Y. M. Lu and G. Li, “Phase Transitions of Spectral Initialization for High-Dimensional Nonconvex Estimation,” Information and Inference: A Journal of the IMA, vol. 9, no. 3, pp. 507-541, 2020. arXiv:1702.06435 [cs.IT]
We study a spectral initialization method that serves as a key ingredient in recent work on using efficient iterative algorithms for estimating signals in nonconvex settings. Unlike previous analysis in the literature, which is restricted to the phase retrieval setting and which provides only performance bounds, we consider arbitrary generalized linear sensing models and present a precise asymptotic characterization of the performance of the spectral method in the high-dimensional regime. Our analysis reveals a phase transition phenomenon that depends on the sampling ratio. When the ratio is below a minimum threshold, the estimates given by the spectral method are no better than a random guess drawn uniformly from the hypersphere; above a maximum threshold, however, the estimates become increasingly aligned with the target signal. The computational complexity of the spectral method is also markedly different in the two phases. Worked examples and numerical results are provided to illustrate and verify the analytical predictions. In particular, simulations show that our asymptotic formulas provide accurate predictions even at moderate signal dimensions.
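A generic instance of the spectral method analyzed here, for a magnitude-type generalized linear model: take the leading eigenvector of a weighted sample covariance of the sensing vectors. The preprocessing function, noise level, and sampling ratio below are arbitrary illustrative choices, not the paper's worked examples or thresholds.

# Spectral initialization for a generalized linear model y_i = f(<a_i, x*>, noise).
import numpy as np

rng = np.random.default_rng(6)
d, m = 200, 1200                              # dimension and number of samples
x_star = rng.standard_normal(d)
x_star /= np.linalg.norm(x_star)

A = rng.standard_normal((m, d))
y = np.abs(A @ x_star) + 0.1 * rng.standard_normal(m)   # noisy magnitude-only observations

T = lambda y: np.minimum(y ** 2, 3.0)         # a simple (trimmed) preprocessing function
D = (A * T(y)[:, None]).T @ A / m             # D = (1/m) sum_i T(y_i) a_i a_i^T
eigvals, eigvecs = np.linalg.eigh(D)
x_spec = eigvecs[:, -1]                       # leading eigenvector as the initial estimate

print("cosine similarity |<x_spec, x*>| =", abs(x_spec @ x_star))
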

B. Aubin, Y. M. Lu, F. Krzakala, and L. Zdeborová, “Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization,” in Conference on Neural Information Processing Systems (NeurIPS), 2020. arXiv:2006.06560 [stat.ML]
We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer neural network with random iid inputs. We study the generalization performances of standard classifiers in the high-dimensional regime where α=n/d is kept finite in the limit of a high dimension d and number of samples n. Our contribution is three-fold: First, we prove a formula for the generalization error achieved by ℓ2 regularized classifiers that minimize a convex loss. This formula was first obtained by the heuristic replica method of statistical physics. Secondly, focussing on commonly used loss functions and optimizing the ℓ2 regularization strength, we observe that while ridge regression performance is poor, logistic and hinge regression are surprisingly able to approach the Bayes-optimal generalization error extremely closely. As α→∞ they lead to Bayes-optimal rates, a fact that does not follow from predictions of margin-based generalization error bounds. Third, we design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error.
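A small-scale version of this experiment, assuming a noiseless sign teacher, Gaussian inputs, and logistic loss with ℓ2 regularization (the sizes and regularization strength are placeholder choices, and this sketch does not reproduce the paper's Bayes-optimal analysis). For Gaussian inputs, the generalization error of a linear classifier against a teacher perceptron equals the normalized angle between the two weight vectors.

# l2-regularized logistic regression on data labeled by a random teacher perceptron.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
d, alpha, lam = 100, 3.0, 0.1
n = int(alpha * d)                       # n/d = alpha kept finite

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)         # teacher perceptron
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.sign(X @ w_star)

def objective(w):
    # l2-regularized logistic loss; np.logaddexp keeps log(1 + e^{-z}) numerically stable
    z = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -z)) + 0.5 * lam * np.dot(w, w) / d

w_hat = minimize(objective, np.zeros(d), method="L-BFGS-B").x

# Classification error = angle between student and teacher, divided by pi
gen_error = np.arccos(np.clip(w_hat @ w_star / np.linalg.norm(w_hat), -1.0, 1.0)) / np.pi
print(f"alpha = {alpha}, estimated generalization error = {gen_error:.3f}")
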
