Publications by Year: 2024

2024
R. Dudeja, S. Sen, and Y. M. Lu, “Spectral Universality of Regularized Linear Regression with Nearly Deterministic Sensing Matrices,” IEEE Transactions on Information Theory, to appear, 2024. arXiv:2208.02753 [cs.IT]. Abstract:
It has been observed that the performances of many high-dimensional estimation problems are universal with respect to underlying sensing (or design) matrices. Specifically, matrices with markedly different constructions seem to achieve identical performance if they share the same spectral distribution and have “generic” singular vectors. We prove this universality phenomenon for the case of convex regularized least squares (RLS) estimators under a linear regression model with additive Gaussian noise. Our main contributions are two-fold: (1) We introduce a notion of universality classes for sensing matrices, defined through a set of deterministic conditions that fix the spectrum of the sensing matrix and precisely capture the previously heuristic notion of generic singular vectors; (2) We show that for all sensing matrices that lie in the same universality class, the dynamics of the proximal gradient descent algorithm for solving the regression problem, as well as the performance of RLS estimators themselves (under additional strong convexity conditions), are asymptotically identical. In addition to including i.i.d. Gaussian and rotationally invariant matrices as special cases, our universality class also contains highly structured, strongly correlated, or even (nearly) deterministic matrices. Examples of the latter include randomly signed versions of incoherent tight frames and randomly subsampled Hadamard transforms. As a consequence of this universality principle, the asymptotic performance of regularized linear regression on many structured matrices constructed with limited randomness can be characterized by using the rotationally invariant ensemble as an equivalent yet mathematically more tractable surrogate.
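To make the algorithmic object in this abstract concrete, here is a minimal sketch (illustrative only, not the paper's code) of the proximal gradient descent iteration for a convex RLS problem, instantiated with the ℓ1 regularizer; the i.i.d. Gaussian sensing matrix, noise level, regularization weight, and step size are placeholder assumptions.

```python
# A minimal sketch of proximal gradient descent (ISTA) for l1-regularized least
# squares. Illustrative only: the Gaussian sensing matrix, noise level, step
# size, and regularization weight are placeholder assumptions, not the paper's.
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, y, lam, n_iters=200):
    """Iterate x <- prox_{step*lam*||.||_1}(x - step * A^T (A x - y))."""
    n, d = A.shape
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the smooth part
    x = np.zeros(d)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Illustrative run with an i.i.d. Gaussian sensing matrix
rng = np.random.default_rng(0)
n, d = 300, 500
A = rng.normal(size=(n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[:25] = rng.normal(size=25)
y = A @ x_true + 0.05 * rng.normal(size=n)
x_hat = proximal_gradient_lasso(A, y, lam=0.02)
```

Per the abstract, replacing A with another sensing matrix from the same universality class (which may be highly structured or nearly deterministic) leaves the asymptotic behavior of these iterates, and of the resulting estimator, unchanged.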
B. Cakmak, Y. M. Lu, and M. Opper, “A Convergence Analysis of Approximate Message Passing with Non-Separable Functions and Applications to Multi-Class Classification,” IEEE International Symposium on Information Theory (ISIT), 2024. arXiv:2402.08676 [cs.LG]. Abstract:
Motivated by the recent application of approximate message passing (AMP) to the analysis of convex optimization in multi-class classification [Loureiro et al., 2021], we present a convergence analysis of AMP dynamics with non-separable multivariate nonlinearities. As an application, we give a complete (and independent) analysis of the convex optimization problem that motivated this work.
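For orientation, the sketch below shows the generic shape of an AMP recursion with a non-separable denoiser: each iteration applies the nonlinearity to an effective observation and corrects the residual with an Onsager term built from the denoiser's divergence. The block soft-thresholding denoiser, the Monte Carlo divergence estimate, and all parameter values are illustrative assumptions, not the construction analyzed in the paper.

```python
# A hedged sketch of a generic AMP recursion with a non-separable denoiser.
# Everything here (denoiser, divergence estimate, parameters) is illustrative.
import numpy as np

def block_soft_threshold(v, tau, block_size):
    """Non-separable denoiser: shrink each block of coordinates by its l2 norm."""
    blocks = v.reshape(-1, block_size)               # assumes len(v) % block_size == 0
    norms = np.maximum(np.linalg.norm(blocks, axis=1, keepdims=True), 1e-12)
    return (blocks * np.maximum(1.0 - tau / norms, 0.0)).ravel()

def amp(A, y, tau=0.1, block_size=5, n_iters=30, rng=None):
    """AMP for y = A x + w, with an Onsager term built from the denoiser's divergence."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = A.shape
    x, z = np.zeros(d), y.copy()
    for _ in range(n_iters):
        v = x + A.T @ z                              # effective (noisy) observation of x
        x_new = block_soft_threshold(v, tau, block_size)
        # Hutchinson-style estimate of the divergence (trace of the denoiser's Jacobian)
        u = rng.normal(size=d)
        eps = 1e-6
        div = u @ (block_soft_threshold(v + eps * u, tau, block_size) - x_new) / eps
        z = y - A @ x_new + (div / n) * z            # residual with Onsager correction
        x = x_new
    return x

# Illustrative run (A has i.i.d. N(0, 1/n) entries; d is a multiple of block_size)
rng = np.random.default_rng(0)
n, d = 250, 500
A = rng.normal(size=(n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[:20] = 1.0
y = A @ x_true + 0.05 * rng.normal(size=n)
x_hat = amp(A, y)
```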
H. Hu, Y. M. Lu, and T. Misiakiewicz, “Asymptotics of Random Feature Regression Beyond the Linear Scaling Regime,” 2024. arXiv:2403.08160 [stat.ML]. Abstract:
Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data. It was shown, e.g., through the double descent phenomenon, that the number of parameters is a poor proxy for the model complexity and generalization capabilities. This leaves open the question of understanding the impact of parametrization on the performance of these models. How do model complexity and generalization depend on the number of parameters p? How should we choose p relative to the sample size n to achieve optimal test error?
In this paper, we investigate the example of random feature ridge regression (RFRR). This model can be seen either as a finite-rank approximation to kernel ridge regression (KRR), or as a simplified model for neural networks trained in the so-called lazy regime. We consider covariates uniformly distributed on the d-dimensional sphere and compute sharp asymptotics for the RFRR test error in the high-dimensional polynomial scaling, where p, n, d → ∞ while p/d^{κ₁} and n/d^{κ₂} stay constant, for all κ₁, κ₂ > 0. These asymptotics precisely characterize the impact of the number of random features and the regularization parameter on the test performance. In particular, RFRR exhibits an intuitive trade-off between approximation and generalization power. For n = o(p), the sample size n is the bottleneck and RFRR achieves the same performance as KRR (which is equivalent to taking p = ∞). On the other hand, if p = o(n), the number of random features p is the limiting factor and the RFRR test error matches the approximation error of the random feature model class (akin to taking n = ∞). Finally, a double descent appears at n = p, a phenomenon that was previously only characterized in the linear scaling κ₁ = κ₂ = 1.
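For readers who want the model pinned down in code, here is a minimal sketch of random feature ridge regression: draw p frozen random first-layer weights, map the inputs through a nonlinearity, and fit the second layer by ridge regression. The ReLU feature map, placeholder target function, and parameter choices (n, d, p, λ) are illustrative assumptions rather than the setting of the theorem.

```python
# A minimal sketch of random feature ridge regression (RFRR). The ReLU feature
# map, placeholder target, and the values of n, d, p, and lam are illustrative.
import numpy as np

def random_feature_ridge(X_train, y_train, X_test, p, lam, rng):
    """Fit ridge regression on p frozen random ReLU features and predict on X_test."""
    d = X_train.shape[1]
    W = rng.normal(size=(d, p)) / np.sqrt(d)       # frozen random first-layer weights
    Z_train = np.maximum(X_train @ W, 0.0)         # ReLU random features
    Z_test = np.maximum(X_test @ W, 0.0)
    # Ridge solution: a_hat = (Z^T Z + lam I)^{-1} Z^T y
    a_hat = np.linalg.solve(Z_train.T @ Z_train + lam * np.eye(p), Z_train.T @ y_train)
    return Z_test @ a_hat

def sample_sphere(m, d, rng):
    """Covariates uniform on the sphere of radius sqrt(d) (an illustrative convention)."""
    X = rng.normal(size=(m, d))
    return np.sqrt(d) * X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, p = 2000, 50, 400                            # sample size, input dimension, number of features
X, X_test = sample_sphere(n, d, rng), sample_sphere(500, d, rng)
target = lambda x: x[:, 0] + 0.5 * x[:, 0] * x[:, 1]   # placeholder low-degree target
y, y_test = target(X), target(X_test)
y_pred = random_feature_ridge(X, y, X_test, p, lam=1e-3, rng=rng)
test_error = np.mean((y_pred - y_test) ** 2)
```

In the regime studied in the paper, one would track this test error as n and p scale polynomially in d.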
H. Cui et al., “Asymptotics of feature learning in two-layer networks after one gradient-step,” 2024. arXiv:2402.04980 [stat.ML]. Abstract:
In this manuscript we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging a connection from (Ba et al., 2022) with a non-linear spiked matrix model and recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error in the high-dimensional limit where the number of samples n, the width p, and the input dimension d grow at a proportional rate. We characterize exactly how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient, whereas at initialization it can only express linear functions in this regime. To our knowledge, our results provide the first tight description of the impact of feature learning on the generalization of two-layer neural networks in the large learning rate regime η = Θ_d(d), beyond perturbative finite-width corrections of the conjugate and neural tangent kernels.
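The training protocol underlying these results can be summarized in a few lines. The sketch below is a rough illustration, in the spirit of the one-gradient-step setting of Ba et al. (2022): take a single large full-batch gradient step (learning rate of order d) on the first-layer weights, then refit the second layer by ridge regression on the updated features. The tanh activation, single-index placeholder target, and all parameter values are assumptions for illustration, not the paper's exact setup.

```python
# A rough sketch of the one-gradient-step protocol (in the spirit of Ba et al., 2022).
# The tanh activation, single-index placeholder target, and all parameter values,
# including the learning rate eta = Theta_d(d), are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 4000, 100, 200                  # samples, input dimension, width
eta = 1.0 * d                             # large learning rate, eta = Theta_d(d)
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.tanh(np.sqrt(d) * X[:, 0])         # placeholder single-index target

W = rng.normal(size=(p, d)) / np.sqrt(d)  # first-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)   # second layer, frozen during the step

# One full-batch gradient step on W for the squared loss of f(x) = a . tanh(W x)
Z = np.tanh(X @ W.T)                      # (n, p) features before the step
resid = Z @ a - y
grad_W = ((resid[:, None] * (1.0 - Z ** 2)) * a[None, :]).T @ X / n
W1 = W - eta * grad_W

# Refit the second layer by ridge regression on the post-step features
Z1 = np.tanh(X @ W1.T)
lam = 1e-3
a_hat = np.linalg.solve(Z1.T @ Z1 + lam * np.eye(p), Z1.T @ y)
```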