Publications by Type: Conference Paper

2024

H. Cui, et al., “Asymptotics of feature learning in two-layer networks after one gradient-step,” 2024. arXiv:2402.04980 [stat.ML]Abstract

In this manuscript we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging a connection from (Ba et al., 2022) with a non-linear spiked matrix model and recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error in the high-dimensional limit where the number of samples n, the width p and the input dimension d grow at a proportional rate. We characterize exactly how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient -- where at initialization it can only express linear functions in this regime. To our knowledge, our results provides the first tight description of the impact of feature learning in the generalization of two-layer neural networks in the large learning rate regime η=Θd(d), beyond perturbative finite width corrections of the conjugate and neural tangent kernels.

2022

L. Xiao, H. Hu, T. Misiakiewicz, Y. M. Lu, and J. Pennington, “Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression,” in Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.Abstract

As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes. Currently, theoretical understanding of the learning curves that characterize how the prediction error depends on the number of samples is restricted to either large-sample asymptotics ($m\to\infty$) or, for certain simple data distributions, to the high-dimensional asymptotics in which the number of samples scales linearly with the dimension ($m\propto d$). There is a wide gulf between these two regimes, including all higher-order scaling relations $m\propto d^r$, which are the subject of the present paper. We focus on the problem of kernel ridge regression for dot-product kernels and present precise formulas for the mean of the test error, bias, and variance, for data drawn uniformly from the sphere with isotropic random labels in the $r$th-order asymptotic scaling regime $m\to\infty$ with $m/d^r$ held constant. We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales.

2021

A. Maillard, F. Krzakala, Y. M. Lu, and L. Zdeborova, “Construction of optimal spectral methods in phase retrieval,” in Mathematical and Scientific Machine Learning, 2021. arXiv:2012.04524 [cs.IT]Abstract

We consider the phase retrieval problem, in which the observer wishes to recover a n-dimensional real or complex signal X⋆ from the (possibly noisy) observation of |ΦX⋆|, in which Φ is a matrix of size m×n. We consider a \emph{high-dimensional} setting where n,m→∞ with m/n=O(1), and a large class of (possibly correlated) random matrices Φ and observation channels. Spectral methods are a powerful tool to obtain approximate observations of the signal X⋆ which can be then used as initialization for a subsequent algorithm, at a low computational cost. In this paper, we extend and unify previous results and approaches on spectral methods for the phase retrieval problem. More precisely, we combine the linearization of message-passing algorithms and the analysis of the \emph{Bethe Hessian}, a classical tool of statistical physics. Using this toolbox, we show how to derive optimal spectral methods for arbitrary channel noise and right-unitarily invariant matrix Φ, in an automated manner (i.e. with no optimization over any hyperparameter or preprocessing function).

O. Dhifallah and Y. M. Lu, “On the Inherent Regularization Effects of Noise Injection During Training,” in International Conference on Machine Learning (ICML), 2021. arXiv:2102.07379 [cs.LG]Abstract

Randomly perturbing networks during the training process is a commonly used approach to improving generalization performance. In this paper, we present a theoretical study of one particular way of random perturbation, which corresponds to injecting artificial noise to the training data. We provide a precise asymptotic characterization of the training and generalization errors of such randomly perturbed learning problems on a random feature model. Our analysis shows that Gaussian noise injection in the training process is equivalent to introducing a weighted ridge regularization, when the number of noise injections tends to infinity. The explicit form of the regularization is also given. Numerical results corroborate our asymptotic predictions, showing that they are accurate even in moderate problem dimensions. Our theoretical predictions are based on a new correlated Gaussian equivalence conjecture that generalizes recent results in the study of random feature models.

2020

B. Aubin, Y. M. Lu, F. Krzakala, and L. Zdeboravá, “Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization,” in Conference on Neural Information Processing Systems (NeurIPS), 2020. arXiv:2006.06560 [stat.ML]Abstract

We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer neural network with random iid inputs. We study the generalization performances of standard classifiers in the high-dimensional regime where α=n/d is kept finite in the limit of a high dimension d and number of samples n. Our contribution is three-fold: First, we prove a formula for the generalization error achieved by ℓ2 regularized classifiers that minimize a convex loss. This formula was first obtained by the heuristic replica method of statistical physics. Secondly, focussing on commonly used loss functions and optimizing the ℓ2 regularization strength, we observe that while ridge regression performance is poor, logistic and hinge regression are surprisingly able to approach the Bayes-optimal generalization error extremely closely. As α→∞ they lead to Bayes-optimal rates, a fact that does not follow from predictions of margin-based generalization error bounds. Third, we design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error.

F. Mignacco, F. Krzakala, Y. M. Lu, and L. Zdeboravá, “The role of regularization in classification of high-dimensional noisy Gaussian mixture,” in International Conference on Machine Learning (ICML), 2020. arXiv:2002.11544 [stat.ML]Abstract

We consider a high-dimensional mixture of two Gaussians in the noisy regime where even an oracle knowing the centers of the clusters misclassifies a small but finite fraction of the points. We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number n of samples and their dimension d go to infinity while their ratio is fixed to α=n/d. We discuss surprising effects of the regularization that in some cases allows to reach the Bayes-optimal performances. We also illustrate the interpolation peak at low regularization, and analyze the role of the respective sizes of the two clusters.

2019

C. Wang, H. Hu, and Y. M. Lu, “A Solvable High-Dimensional Model of GAN,” in Proc. Thirty-third Conference on Neural Information Processing Systems (NeurIPS), 2019. arXiv:1805.08349 [cs.LG]Abstract

Despite the remarkable successes of generative adversarial networks (GANs) in many applications, theoretical understandings of their performance is still limited. In this paper, we present a simple shallow GAN model fed by high-dimensional input data. The dynamics of the training process of the proposed model can be exactly analyzed in the high-dimensional limit. In particular, by using the tool of scaling limits of stochastic processes, we show that the macroscopic quantities measuring the quality of the training process converge to a deterministic process that is characterized as the unique solution of a finite-dimensional ordinary differential equation (ODE). The proposed model is simple, but its training process already exhibits several different phases that can mimic the behaviors of more realistic GAN models used in practice. Specifically, depending on the choice of the learning rates, the training process can reach either a successful, a failed, or an oscillating phase. By studying the steady-state solutions of the limiting ODEs, we obtain a phase diagram that precisely characterizes the conditions under which each phase takes place. Although this work focuses on a simple GAN model, the analysis methods developed here might prove useful in the theoretical understanding of other variants of GANs with more advanced training algorithms.

L. Saglietti, Y. M. Lu, and C. Lucibello, “Generalized Approximate Survey Propagation for High-Dimensional Estimation,” in Proc. International Conference on Machine Learning (ICML), 2019. arXiv:1905.05313 [cond-mat.dis-nn]Abstract

In Generalized Linear Estimation (GLE) problems, we seek to estimate a signal that is observed through a linear transform followed by a component-wise, possibly nonlinear and noisy, channel. In the Bayesian optimal setting, Generalized Approximate Message Passing (GAMP) is known to achieve optimal performance for GLE. However, its performance can significantly degrade whenever there is a mismatch between the assumed and the true generative model, a situation frequently encountered in practice. In this paper, we propose a new algorithm, named Generalized Approximate Survey Propagation (GASP), for solving GLE in the presence of prior or model mis-specifications. As a prototypical example, we consider the phase retrieval problem, where we show that GASP outperforms the corresponding GAMP, reducing the reconstruction threshold and, for certain choices of its parameters, approaching Bayesian optimal performance. Furthermore, we present a set of State Evolution equations that exactly characterize the dynamics of GASP in the high-dimensional limit.

H. Hong and Y. M. Lu, “Asymptotics and optimal designs of SLOPE for sparse linear regression,” in Prof. International Symposium on Information Theory (ISIT), 2019. Longer version with full technical details Abstract

In sparse linear regression, the SLOPE estimator generalizes LASSO by assigning magnitude-dependent regular- izations to different coordinates of the estimate. In this paper, we present an asymptotically exact characterization of the performance of SLOPE in the high-dimensional regime where the number of unknown parameters grows in proportion to the number of observations. Our asymptotic characterization enables us to derive optimal regularization sequences to either minimize the MSE or to maximize the power in variable selection under any given level of Type-I error. In both cases, we show that the optimal design can be recast as certain infinite-dimensional convex optimization problems, which have efficient and accurate finite-dimensional approximations. Numerical simulations verify our asymptotic predictions. They also demonstrate the superi- ority of our optimal design over LASSO and a regularization sequence previously proposed in the literature.

2017

O. Dhifallah, C. Thrampoulidis, and Y. M. Lu, “Phase Retrieval via Linear Programming: Fundamental Limits and Algorithmic Improvements,” in 55th Annual Allerton Conference on Communication, Control, and Computing, 2017. arXiv:1710.05234 [cs.IT]Abstract

A recently proposed convex formulation of the phase retrieval problem estimates the unknown signal by solving a simple linear program. This new scheme, known as PhaseMax, is computationally efficient compared to standard convex relaxation methods based on lifting techniques. In this paper, we present an exact performance analysis of PhaseMax under Gaussian measurements in the large system limit. In contrast to previously known performance bounds in the literature, our results are asymptotically exact and they also reveal a sharp phase transition phenomenon. Furthermore, the geometrical insights gained from our analysis led us to a novel nonconvex formulation of the phase retrieval problem and an accompanying iterative algorithm based on successive linearization and maximization over a polytope. This new algorithm, which we call PhaseLamp, has provably superior recovery performance over the original PhaseMax method.

C. Wang and Y. M. Lu, “The Scaling Limit of High-Dimensional Online Independent Component Analysis,” in Conference on Neural Information Processing Systems (NIPS), 2017.Abstract

We analyze the dynamics of an online algorithm for independent component analysis in the high-dimensional scaling limit. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measure of the target feature vector and the estimates provided by the algorithm will converge weakly to a deterministic measured-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE, which involves two spatial variables and one time variable, can be efficiently obtained. These solutions provide detailed information about the performance of the ICA algorithm, as many practical performance metrics are functionals of the joint empirical measures. Numerical simulations show that our asymptotic analysis is accurate even for moderate dimensions. In addition to providing a tool for understanding the performance of the algorithm, our PDE analysis also provides useful insight. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically “decoupled”, with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent. Exploiting this insight to design new algorithms for achieving optimal trade-offs between computational and statistical efficiency may prove an interesting line of future research.

nips_2017.pdf

Spotlight paper (acceptance rate: 112/3240 = 3.5%)

O. Dhifallah and Y. M. Lu, “Fundamental Limits of PhaseMax for Phase Retrieval: A Replica Analysis,” in the 7th IEEE Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2017. arXiv:1708.03355 Abstract

We consider a recently proposed convex formulation, known as the PhaseMax method, for solving the phase retrieval problem. Using the replica method from statistical mechanics, we analyze the performance of PhaseMax in the high-dimensional limit. Our analysis predicts the exact asymptotic performance of PhaseMax. In particular, we show that a sharp phase transi- tion phenomenon takes place, with a simple analytical formula characterizing the phase transition boundary. This result shows that the oversampling ratio required by existing performance bounds in the literature can be significantly reduced. Numerical results confirm the validity of our replica analysis, showing that the theoretical predictions are in excellent agreement with the actual performance of the algorithm, even for moderate signal dimensions.

This paper won the Best Student Paper Award (First Prize) at the 2017 IEEE CAMSAP Workshop.

The predictions made in this paper via the non-rigorous replica method has since been rigorously established in our latest work.

Y. M. Lu and G. Li, “Spectral Initialization for Nonconvex Estimation: High-Dimensional Limit and Phase Transitions,” in IEEE International Symposium on Information Theory (ISIT), 2017.Abstract

We study a simple spectral method that serves as a key ingredient in a growing line of work using efficient iterative algorithms for estimating signals in nonconvex settings. Unlike previous work, which focuses on the phase retrieval setting and provides only bounds on the performance, we consider arbitrary generalized linear sensing models and provide an exact characterization of the performance of the spectral method in the high-dimensional regime. Our analysis reveals a phase transition phenomenon that depends on the sampling ratio. When the ratio is below a critical threshold, the estimates given by the spectral method are no better than random guesses drawn uniformly from the hypersphere; above the threshold, however, the estimates become increasingly aligned with the underlying signal. Worked examples and numerical simulations are provided to illustrate and verify the analytical predictions.

sp_init_isit.pdf

C. Wang, Y. Eldar, and Y. M. Lu, “Subspace Estimation from Incomplete Observations: A Precise High-Dimensional Analysis,” in Signal Processing with Adaptive Structured Representatives (SPARS) Workshop, 2017.Abstract

The problem of estimating and tracking low-rank subspaces from incomplete observations has received a lot of attention recently in the signal processing and learning communities. Popular algorithms, such as GROUSE and PETRELS, are often very effective in practice, but their performance depends on the careful choice of algorithmic parameters. Important questions, such as the global convergence of these algorithms and how the noise level, subsampling ratio, and various other parameters affect the performance, are not fully understood. In this paper, we present a precise analysis of the performance of these algorithms in the asymptotic regime where the ambient dimension tends to infinity. Specifically, we show that the time-varying trajectories of estimation errors converge weakly to a deterministic function of time, which is characterized as the unique solution of a system of ordinary differential equations (ODEs.) Analyzing the limiting ODEs also reveals and characterizes sharp phase transition phenomena associated with these algorithms. Numerical simulations verify the accuracy of our asymptotic predictions, even for moderate signal dimensions.

sparse17_ode.pdf

2016

C. Wang and Y. M. Lu, “Online Learning for Sparse PCA in High Dimensions: Exact Dynamics and Phase Transitions,” in IEEE Information Theory Workshop (ITW), 2016.Abstract

We study the dynamics of an online algorithm for learning a sparse leading eigenvector from samples generated from a spiked covariance model. This algorithm combines the classical Oja's method for online PCA with an element-wise nonlinearity at each iteration to promote sparsity. In the high-dimensional limit, the joint empirical measure of the underlying sparse eigenvector and its estimate provided by the algorithm is shown to converge weakly to a deterministic, measure-valued process. This scaling limit is characterized as the unique solution of a nonlinear PDE, and it provides exact information regarding the asymptotic performance of the algorithm. For example, performance metrics such as the cosine similarity and the misclassification rate in sparse support recovery can be obtained by examining the limiting dynamics. A steady-state analysis of the nonlinear PDE also reveals an interesting phase transition phenomenon. Although our analysis is asymptotic in nature, numerical simulations show that the theoretical predictions are accurate for moderate signal dimensions.

spca.pdf

J. Onativia, P. L. Dragotti, and Y. M. Lu, “ProSparse denoise: Prony's based sparsity pattern recovery in the presence of noise,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

2015

G. Li, Y. Gu, and Y. M. Lu, “Phase Retrieval Using Iterative Projections: Dynamics in the Large Systems Limit,” in Allerton Conference on Communications, Control, and Computing, 2015.Abstract

We study phase retrieval by using a simple non-convex algorithm based on iterative projections. Our main contribution in this work is to establish an exact analysis of the dynamics of the algorithm in an online setting, with Gaussian measurement vectors. We show that the algorithm dynamics, measured by the squared distance between the current estimate and the true solution, can be fully characterized by a 2D Markov process, irrespective of the underlying signal dimension. Furthermore, in the large systems limit (i.e., as the signal dimension tends to infinity), the random sample paths of the 2D Markov process converge to the solutions of two deterministic and coupled ordinary differential equations (ODEs). Numerical simulations verify the accuracy of our analytical predictions, even for moderate system sizes. This suggests that the ODE approach presented in this work provides an effective tool for analyzing the performance and convergence of the algorithm.

phase_dynamics.pdf

S. H. Chan, T. Zickler, and Y. M. Lu, “Understanding symmetric smoothing filters via Gaussian mixtures,” in IEEE International Conference on Image Processing, 2015.

J. Onativia, P. L. Dragotti, and Y. M. Lu, “Sparsity according to Prony, Average Performance Analysis,” in Signal Processing with Adaptive Sparse Structured Representations (SPARS) Workshop, Cambridge, England, 2015.

2015_spars.pdf

2014

A. Agaskar and Y. M. Lu, “Optimal hypothesis testing with combinatorial structure: Detecting random walks on graphs,” in Proc. of Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, 2014.Abstract

Suppose we have a time series of observations for each node in a network, and we wish to detect the presence of a particle undergoing a random walk on the network. If there is no such particle, then we observe only zero-mean Gaussian noise. If present, however, the location of the particle has an elevated mean. How well can we detect the particle at low signal-to-noise ratios? This is a special case of the problem of detecting hidden Markov processes (HMPs).

The performance metric we analyze is the error exponent of the optimal detector, which measures the exponential rate of decay in the miss probability if the false alarm probability is held fixed as the observation time increases. This problem exhibits deep connections to a problem in statistical physics: computing the free energy density of a spin glass.

We develop a generalized version of the random energy model (REM) spin glass, whose free energy density provides a lower bound for our error exponent, and compute the bound using large deviations techniques. The bound closely matches empirical results in numerical experiments, and suggests a phase transition phenomenon: below a threshold SNR, the error exponent is nearly constant and near zero, indicating poor performance; above the threshold, there is rapid improvement in performance as the SNR increases. The location of the phase transition depends on the entropy rate of the Markov process.

Yue M. Lu

Gordon McKay Professor of Electrical Engineering and of Applied Mathematics

Publications by Type: Conference Paper

Pages

Publications by Year