MCMC computations for Bayesian mixture models using repulsive point processes

Repulsive mixture models have recently gained popularity for Bayesian cluster detection. Compared to more traditional mixture models, repulsive mixture models produce a smaller number of well-separated clusters. The most commonly used methods for posterior inference either require fixing the number of components a priori or are based on reversible jump MCMC computation. We present a general framework for mixture models in which the prior on the “cluster centers” is a finite repulsive point process depending on a hyperparameter, specified by a density which may depend on an intractable normalizing constant. By investigating the posterior characterization of this class of mixture models, we derive an MCMC algorithm which avoids the well-known difficulties associated with reversible jump MCMC computation. In particular, we use an ancillary variable method, which eliminates the problem of having intractable …
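
The ancillary-variable idea mentioned at the end of the abstract can be illustrated, in heavily reduced form, by an exchange-type Metropolis-Hastings step for a parameter whose model has an intractable normalizing constant. The sketch below is a generic illustration under assumed ingredients (log_h, log_prior, and sample_from_model are hypothetical user-supplied callables), not the authors' actual algorithm.

import numpy as np

# Minimal sketch of an exchange-type Metropolis-Hastings step for a parameter
# theta whose likelihood f(x | theta) = h(x, theta) / Z(theta) has an intractable
# normalizing constant Z(theta). log_h, log_prior and sample_from_model are
# hypothetical callables, not part of the paper.

def exchange_step(theta, x, log_h, log_prior, sample_from_model, rng, step=0.5):
    """One MH update; the auxiliary data w makes the unknown Z(theta) terms cancel."""
    theta_prop = theta + step * rng.standard_normal()         # symmetric random walk
    w = sample_from_model(theta_prop, size=len(x), rng=rng)   # auxiliary draw from f(. | theta_prop)
    log_ratio = (log_h(x, theta_prop) - log_h(x, theta)
                 + log_prior(theta_prop) - log_prior(theta)
                 + log_h(w, theta) - log_h(w, theta_prop))    # Z(theta) terms cancel exactly
    if np.log(rng.uniform()) < log_ratio:
        return theta_prop
    return theta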

Journal

Journal of Computational and Graphical Statistics

Published On

2022/4/3

Volume

31

Issue

2

Page

422-435

Authors

Jesper Møller

Aalborg Universitet

Position

Professor in Statistics

H-Index(all)

46

H-Index(since 2020)

23

Research Interests

Mathematical Statistics

Probability Theory

Mario Beraha

Politecnico di Milano

Position

Università degli Studi di Bologna

H-Index(all)

5

H-Index(since 2020)

5

Research Interests

bayesian statistics

bayesian nonparametrics

machine learning

Other Articles from authors

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2402.03231

Improved prediction of future user activity in online A/B testing

In online randomized experiments or A/B tests, accurate predictions of participant inclusion rates are of paramount importance. These predictions not only guide experimenters in optimizing the experiment's duration but also enhance the precision of treatment effect estimates. In this paper we present a novel, straightforward, and scalable Bayesian nonparametric approach for predicting the rate at which individuals will be exposed to interventions within the realm of online A/B testing. Our approach stands out by offering dual prediction capabilities: it forecasts both the quantity of new customers expected in future time windows and, unlike available alternative methods, the number of times they will be observed. We derive closed-form expressions for the posterior distributions of the quantities needed to form predictions about future user activity, thereby bypassing the need for numerical algorithms such as Markov chain Monte Carlo. After a comprehensive exposition of our model, we test its performance on experiments on real and simulated data, where we show its superior performance with respect to existing alternatives in the literature.

Jesper Møller

Aalborg Universitet

arXiv preprint arXiv:2404.09525

Coupling results and Markovian structures for number representations of continuous random variables

A general setting for nested subdivisions of a bounded real set into intervals defining the digits of a random variable with a probability density function is considered. Under the weak condition that the density is almost everywhere lower semi-continuous, a coupling between the random variable and a non-negative integer-valued random variable N is established, so that the first N digits have an interpretation as the “sufficient digits”: the distribution of the remaining digits conditioned on the first N does not depend on them. Adding a condition about a Markovian structure of the lengths of the intervals in the nested subdivisions, the remaining digits become a Markov chain of a certain order; when that order is zero they are IID with a known distribution. When the order is positive and the Markov chain is uniformly geometrically ergodic, a further coupling with a random time is established so that the chain after that time is stationary and follows a simple known distribution. The results are related to several examples of number representations generated by a dynamical system, including base-q expansions, generalized Lüroth series, β-expansions, and continued fraction representations. The importance of the results and some suggestions and open problems for future research are discussed.

Jesper Møller

Aalborg Universitet

arXiv preprint arXiv:2404.08387

The asymptotic distribution of the scaled remainder for pseudo golden ratio expansions of a continuous random variable

We consider the base-β expansion of a continuous random variable on the unit interval, where β is a pseudo golden ratio, that is, a generalization of the golden mean indexed by an integer. We study the asymptotic distribution and convergence rate of the scaled remainder when the number of digits tends to infinity.

Jesper Møller

Aalborg Universitet

Methodology and Computing in Applied Probability

How many digits are needed?

Let X_1, X_2, … be the digits in the base-q expansion of a random variable X defined on [0, 1), where the base q is an integer. For each n, we study the probability distribution of the (scaled) remainder obtained after removing the first n digits: if X has an absolutely continuous CDF, then this distribution converges in the total variation metric to the Lebesgue measure on the unit interval. Under weak smoothness conditions we establish, first, a coupling between X and a non-negative integer-valued random variable N so that the scaled remainder after the first N digits is uniformly distributed and independent of those digits, and, second, exponentially fast convergence of the remainder distribution and its PDF. We discuss how many digits are needed and show examples of our results.
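
As a toy numerical companion to the abstract, the following sketch computes the first n base-q digits of a number in [0, 1) and the corresponding scaled remainder; it only illustrates the quantities involved and is not code from the paper.

import numpy as np

def digits_and_remainder(x, q=10, n=8):
    """First n base-q digits of x in [0, 1) and the scaled remainder, i.e. what is
    left of x after removing those digits, rescaled back to the unit interval."""
    digits, r = [], x
    for _ in range(n):
        r *= q
        d = int(r)          # next digit
        digits.append(d)
        r -= d              # scaled remainder so far
    return digits, r

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0)      # any random variable with a density on [0, 1)
print(digits_and_remainder(x, q=10, n=8))
# For large n the remainder is approximately uniform on [0, 1) when X has a density.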

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2401.14722

A Nonparametric Bayes Approach to Online Activity Prediction

Accurately predicting the onset of specific activities within defined timeframes holds significant importance in several applied contexts. In particular, accurate prediction of the number of future users that will be exposed to an intervention is an important piece of information for experimenters running online experiments (A/B tests). In this work, we propose a novel approach to predict the number of users that will be active in a given time period, as well as the temporal trajectory needed to attain a desired user participation threshold. We model user activity using a Bayesian nonparametric approach which allows us to capture the underlying heterogeneity in user engagement. We derive closed-form expressions for the number of new users expected in a given period, and a simple Monte Carlo algorithm targeting the posterior distribution of the number of days needed to attain a desired number of users; the latter is important for experimental planning. We illustrate the performance of our approach via several experiments on synthetic and real world data, in which we show that our novel method outperforms existing competitors.

Mario Beraha

Politecnico di Milano

Bayesian Analysis

Bayesian Nonparametric Model-based Clustering with Intractable Distributions: An ABC Approach

Bayesian nonparametric mixture models offer a rich framework for model-based clustering. We consider the situation where the kernel of the mixture is available only up to an intractable normalizing constant. In this case, the most commonly used Markov chain Monte Carlo (MCMC) methods are unsuitable. We propose an approximate Bayesian computational (ABC) strategy, whereby we approximate the posterior to avoid the intractability of the kernel. We derive an ABC-MCMC algorithm which combines (i) the use of the predictive distribution induced by the nonparametric prior as proposal and (ii) the use of the Wasserstein distance and its connection to optimal matching problems. To overcome the sensitivity of the algorithm to its tuning parameters, we further propose an adaptive strategy. We illustrate the use of the proposed algorithm with several simulation studies and an application on real data, where we …
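
The distance-based acceptance in point (ii) can be sketched with a bare-bones ABC-MCMC loop built around the one-dimensional Wasserstein distance; the simulator, prior, and tolerance below are hypothetical placeholders, and this is not the authors' adaptive algorithm.

import numpy as np
from scipy.stats import wasserstein_distance

# Bare-bones ABC-MCMC: a proposed parameter is considered only if pseudo-data
# simulated from it falls within a Wasserstein tolerance eps of the observed data.
# simulate and log_prior are hypothetical user-supplied callables.

def abc_mcmc(x_obs, simulate, log_prior, n_iter=5000, eps=0.1, step=0.3, seed=0):
    rng = np.random.default_rng(seed)
    theta, chain = 0.0, []
    for _ in range(n_iter):
        theta_prop = theta + step * rng.standard_normal()      # symmetric proposal
        x_sim = simulate(theta_prop, len(x_obs), rng)
        if wasserstein_distance(x_obs, x_sim) < eps:           # ABC acceptance region
            if np.log(rng.uniform()) < log_prior(theta_prop) - log_prior(theta):
                theta = theta_prop
        chain.append(theta)
    return np.array(chain)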

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2303.17844

Transform-scaled process priors for trait allocations in Bayesian nonparametrics

Completely random measures (CRMs) provide a broad class of priors, arguably the most popular, for Bayesian nonparametric (BNP) analysis of trait allocations. As a peculiar property, CRM priors lead to predictive distributions that share the following common structure: for fixed prior parameters, a new data point exhibits a Poisson (random) number of “new” traits, i.e., traits not appearing in the sample, which depends on the sampling information only through the sample size. While the Poisson posterior distribution is appealing for analytical tractability and ease of interpretation, its independence from the sampling information is a critical drawback, as it makes the posterior distribution of “new” traits completely determined by the estimation of the unknown prior parameters. In this paper, we introduce the class of transform-scaled process (T-SP) priors as a tool to enrich the posterior distribution of “new” traits arising from CRM priors, while maintaining the same analytical tractability and ease of interpretation. In particular, we present a framework for posterior analysis of trait allocations under T-SP priors, showing that Stable T-SP priors, i.e., T-SP priors built from Stable CRMs, lead to predictive distributions such that, for fixed prior parameters, a new data point displays a negative-Binomial (random) number of “new” traits, which depends on the sampling information through the number of distinct traits and the sample size. Then, by relying on a hierarchical version of T-SP priors, we extend our analysis to the more general setting of trait allocations with multiple groups of data or subpopulations. The empirical effectiveness of our methods is …

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2312.13992

Bayesian nonparametric boundary detection for income areal data

Recent discussions on the future of metropolitan cities underscore the pivotal role of (social) equity, driven by demographic and economic trends. More equitable policies can foster and contribute to a city's economic success and social stability. In this work, we focus on identifying metropolitan areas with distinct economic and social levels in the greater Los Angeles area, one of the most diverse yet unequal areas in the United States. Utilizing American Community Survey data, we propose a Bayesian model for boundary detection based on income distributions. The model identifies areas with significant income disparities, offering actionable insights for policymakers to address social and economic inequalities. Our approach, formalized as a Bayesian structural learning framework, models areal densities through finite mixture models. Efficient posterior computation is facilitated by a transdimensional Markov chain Monte Carlo sampler. The methodology is validated via extensive simulations and applied to the income distributions in the greater Los Angeles area. We identify several boundaries in the income distributions which can be explained in light of other social dynamics such as crime rates and healthcare, showing the usefulness of such an analysis to policymakers.

2023/12/21

Mario Beraha

Politecnico di Milano

Statistical learning of random probability measures

The study of random probability measures is a lively research topic that has attracted interest from different fields in recent years. In this thesis, we consider random probability measures in the context of Bayesian nonparametrics, where the law of a random probability measure is used as a prior distribution, and in the context of distributional data analysis, where the goal is to perform inference given a sample from the law of a random probability measure. The contributions contained in this thesis can be subdivided according to three different topics: (i) the use of almost surely discrete repulsive random measures (i.e., whose support points are well separated) for Bayesian model-based clustering, (ii) the proposal of new laws for collections of random probability measures for Bayesian density estimation of partially exchangeable data subdivided into different groups, and (iii) the study of principal component analysis and regression models for probability distributions seen as elements of the 2-Wasserstein space. Specifically, for point (i) above, we propose an efficient Markov chain Monte Carlo algorithm for posterior inference, which sidesteps the need for split-merge reversible jump moves that are typically associated with poor performance; we propose a model for clustering high-dimensional data by introducing a novel class of anisotropic determinantal point processes; and we study the distributional properties of the repulsive measures, shedding light on important theoretical results which enable more principled prior elicitation and more efficient posterior simulation algorithms. For point (ii) above, we consider several models suitable for clustering …

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2310.09818

MCMC for Bayesian nonparametric mixture modeling under differential privacy

Estimating the probability density of a population while preserving the privacy of individuals in that population is an important and challenging problem that has received considerable attention in recent years. While the previous literature focused on frequentist approaches, in this paper, we propose a Bayesian nonparametric mixture model under differential privacy (DP) and present two Markov chain Monte Carlo (MCMC) algorithms for posterior inference. One is a marginal approach, resembling Neal's algorithm 5 with a pseudo-marginal Metropolis-Hastings move, and the other is a conditional approach. Although our focus is primarily on local DP, we show that our MCMC algorithms can be easily extended to deal with global differential privacy mechanisms. Moreover, for certain classes of mechanisms and mixture kernels, we show how standard algorithms can be employed, resulting in substantial efficiency gains. Our approach is general and applicable to any mixture model and privacy mechanism. In several simulations and a real case study, we discuss the performance of our algorithms and evaluate different privacy mechanisms proposed in the frequentist literature.

2023/10/15

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2303.15029

Random measure priors in Bayesian frequency recovery from sketches

Given a lossy-compressed representation, or sketch, of data with values in a set of symbols, the frequency recovery problem considers the estimation of the empirical frequency of a new data point. Recent studies have applied Bayesian nonparametrics (BNPs) to develop learning-augmented versions of the popular count-min sketch (CMS) recovery algorithm. In this paper, we present a novel BNP approach to frequency recovery, which is not built from the CMS but still relies on a sketch obtained by random hashing. Assuming data to be modeled as random samples from an unknown discrete distribution, which is endowed with a Poisson-Kingman (PK) prior, we provide the posterior distribution of the empirical frequency of a symbol, given the sketch. Estimates are then obtained as mean functionals. An application of our result is presented for the Dirichlet process (DP) and Pitman-Yor process (PYP) priors, and in particular: i) we characterize the DP prior as the sole PK prior featuring a property of sufficiency with respect to the sketch, leading to a simple posterior distribution; ii) we identify a large sample regime under which the PYP prior leads to a simple approximation of the posterior distribution. Then, we develop our BNP approach to a "traits" formulation of the frequency recovery problem, not yet studied in the CMS literature, in which data belong to more than one symbol (trait), and exhibit nonnegative integer levels of associations with each trait. In particular, by modeling data as random samples from a generalized Indian buffet process, we provide the posterior distribution of the empirical frequency level of a trait, given the sketch. This result is …

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2309.15408

Frequency and cardinality recovery from sketched data: a novel approach bridging Bayesian and frequentist views

We study how to recover the frequency of a symbol in a large discrete data set, using only a compressed representation, or sketch, of those data obtained via random hashing. This is a classical problem in computer science, with various algorithms available, such as the count-min sketch. However, these algorithms often assume that the data are fixed, leading to overly conservative and potentially inaccurate estimates when dealing with randomly sampled data. In this paper, we consider the sketched data as a random sample from an unknown distribution, and then we introduce novel estimators that improve upon existing approaches. Our method combines Bayesian nonparametric and classical (frequentist) perspectives, addressing their unique limitations to provide a principled and practical solution. Additionally, we extend our method to address the related but distinct problem of cardinality recovery, which consists of estimating the total number of distinct objects in the data set. We validate our method on synthetic and real data, comparing its performance to state-of-the-art alternatives.
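
Since the abstract builds on the classical count-min sketch, a compact toy implementation may help fix ideas; the hashing scheme below is a simple illustrative choice rather than the construction analyzed in the paper.

import numpy as np

class CountMinSketch:
    """Toy count-min sketch: depth hash rows of a given width; the frequency
    estimate of an item is the minimum of its counters (never an underestimate)."""

    def __init__(self, width=2048, depth=5, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.seeds = [int(s) for s in rng.integers(0, 2**31 - 1, size=depth)]
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _index(self, item, seed):
        return hash((seed, item)) % self.width      # simple illustrative hashing

    def add(self, item, count=1):
        for row, s in enumerate(self.seeds):
            self.table[row, self._index(item, s)] += count

    def estimate(self, item):
        return int(min(self.table[row, self._index(item, s)]
                       for row, s in enumerate(self.seeds)))

cms = CountMinSketch()
for token in ["a", "b", "a", "c", "a"]:
    cms.add(token)
print(cms.estimate("a"))    # at least 3; exactly 3 unless hash collisions occur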

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2303.02438

Bayesian clustering of high-dimensional data via latent repulsive mixtures

Model-based clustering of moderate or large dimensional data is notoriously difficult. We propose a model for simultaneous dimensionality reduction and clustering by assuming a mixture model for a set of latent scores, which are then linked to the observations via a Gaussian latent factor model. This approach was recently investigated by Chandra et al. (2020). The authors use a factor-analytic representation and assume a mixture model for the latent factors. However, performance can deteriorate in the presence of model misspecification. Assuming a repulsive point process prior for the component-specific means of the mixture for the latent scores is shown to yield a more robust model that outperforms the standard mixture model for the latent factors in several simulated scenarios. To favor well-separated clusters of data, the repulsive point process must be anisotropic, and its density should be tractable for efficient posterior inference. We address these issues by proposing a general construction for anisotropic determinantal point processes.

Mario Beraha

Politecnico di Milano

Journal of the Royal Statistical Society Series B: Statistical Methodology

Normalised latent measure factor models

We propose a methodology for modelling and comparing probability distributions within a Bayesian nonparametric framework. Building on dependent normalised random measures, we consider a prior distribution for a collection of discrete random measures where each measure is a linear combination of a set of latent measures, interpretable as characteristic traits shared by different distributions, with positive random weights. The model is nonidentified and a method for postprocessing posterior samples to achieve identified inference is developed. This uses Riemannian optimisation to solve a nontrivial optimisation problem over a Lie group of matrices. The effectiveness of our approach is validated on simulated data and in two applications to two real-world data sets: school student test scores and personal incomes in California. Our approach leads to interesting insights for populations and easily …

Jesper Møller

Aalborg Universitet

arXiv preprint arXiv:2312.09652

The asymptotic distribution of the remainder in a certain base- expansion

We consider the base-β expansion of a continuous random variable on the unit interval, where β is the golden ratio. We study the asymptotic distribution and convergence rate of the scaled remainder when the number of digits tends to infinity.

2023/12/15

Jesper Møller

Aalborg Universitet

Proceedings of the London Mathematical Society

Realizability and tameness of fusion systems

A saturated fusion system over a finite p-group S is a category whose objects are the subgroups of S and whose morphisms are injective homomorphisms between the subgroups satisfying certain axioms. A fusion system over S is realized by a finite group G if S is a Sylow p-subgroup of G and morphisms in the category are those induced by conjugation in G. One recurrent question in this subject is to find criteria as to whether a given saturated fusion system is realizable or not. One main result in this paper is that a saturated fusion system is realizable if all of its components (in the sense of Aschbacher) are realizable. Another result is that all realizable fusion systems are tame: a finer condition on realizable fusion systems that involves describing automorphisms of a fusion system in terms of those of some group that realizes it. Stated in this way, these results depend on the …

Jesper Møller

Aalborg Universitet

ACM Transactions on Spatial Algorithms and Systems

Stochastic Routing with Arrival Windows

Arriving at a destination within a specific time window is important in many transportation settings. For example, trucks may be penalized for early or late arrivals at compact terminals, and early and late arrivals at general practitioners, dentists, and so on, are also discouraged, in part due to COVID. We propose foundations for routing with arrival-window constraints. In a setting where the travel time of a road segment is modeled by a probability distribution, we define two problems where the aim is to find a route from a source to a destination that optimizes or yields a high probability of arriving within a time window while departing as late as possible. In this setting, a core challenge is to enable comparison between paths that may potentially be part of a result path with the goal of determining whether a path is uninteresting and can be disregarded given the existence of another path. We show that existing solutions …

2023/11/21

Jesper Møller

Aalborg Universitet

Spatial Statistics

Fitting the grain orientation distribution of a polycrystalline material conditioned on a Laguerre tessellation

The description of distributions related to grain microstructure helps physicists to understand the processes in materials and their properties. This paper presents a general statistical methodology for the analysis of crystallographic orientations of grains in a 3D Laguerre tessellation dataset which represents the microstructure of a polycrystalline material. We introduce complex stochastic models which may substitute expensive laboratory experiments: conditional on the Laguerre tessellation, we suggest interaction models for the distribution of cubic crystal lattice orientations, where the interaction is between pairs of orientations for neighbouring grains in the tessellation. We discuss parameter estimation and model comparison methods based on maximum pseudolikelihood as well as graphical procedures for model checking using simulations. Our methodology is applied for analysing a dataset representing a nickel …

Jesper Møller

Aalborg Universitet

Methodology and Computing in Applied Probability

Singular distribution functions for random variables with stationary digits

Let F be the cumulative distribution function (CDF) of the base-q expansion ∑_{n≥1} X_n q^{-n}, where q ≥ 2 is an integer and X_1, X_2, … is a stationary stochastic process with state space {0, 1, …, q − 1}. In a previous paper we characterized the absolutely continuous and the discrete components of F. In this paper we study special cases of models, including stationary Markov chains of any order and stationary renewal point processes, where we establish a law of pure types: F is then either a uniform or a singular CDF on [0, 1]. Moreover, we study mixtures of such models. In most cases expressions and plots of F are given.
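
A quick way to visualize the CDF F described here is to simulate digits from a stationary Markov chain and assemble the corresponding base-q number; the transition matrix below is an arbitrary illustrative choice, not a model taken from the paper.

import numpy as np

# Simulate X = sum_{n>=1} X_n q^(-n) with binary digits (q = 2) following a
# stationary two-state Markov chain; the empirical CDF of many such draws
# approximates F. P and its stationary distribution pi are illustrative only.

q = 2
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                       # digit transition probabilities
pi = np.array([0.75, 0.25])                      # stationary distribution of P

def sample_x(n_digits, rng):
    d = rng.choice(q, p=pi)                      # start the chain in stationarity
    x, scale = 0.0, 1.0 / q
    for _ in range(n_digits):
        x += d * scale
        scale /= q
        d = rng.choice(q, p=P[d])
    return x

rng = np.random.default_rng(1)
xs = np.sort([sample_x(40, rng) for _ in range(10000)])
ecdf = np.arange(1, len(xs) + 1) / len(xs)       # empirical CDF, an estimate of F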

Mario Beraha

Politecnico di Milano

arXiv preprint arXiv:2302.09034

Normalized Random Measures with Interacting Atoms for Bayesian Nonparametric Mixtures

The study of almost surely discrete random probability measures is an active line of research in Bayesian nonparametrics. The idea of assuming interaction across the atoms of the random probability measure has recently spurred significant interest in the context of Bayesian mixture models. This allows the definition of priors that encourage well separated and interpretable clusters. In this work, we provide a unified framework for the construction and the Bayesian analysis of random probability measures with interacting atoms, encompassing both repulsive and attractive behaviors. Specifically we derive closed-form expressions for the posterior distribution, the marginal and predictive distributions, which were not previously available except for the case of measures with i.i.d. atoms. We show how these quantities are fundamental both for prior elicitation and to develop new posterior simulation algorithms for hierarchical mixture models. Our results are obtained without any assumption on the finite point process that governs the atoms of the random measure. Their proofs rely on new analytical tools borrowed from the theory of Palm calculus and that might be of independent interest. We specialize our treatment to the classes of Poisson, Gibbs, and Determinantal point processes, as well as to the case of shot-noise Cox processes.

Other articles from Journal of Computational and Graphical Statistics journal

Donatello Telesca

University of California, Los Angeles

Journal of Computational and Graphical Statistics

Functional mixed membership models

Mixed membership models, or partial membership models, are a flexible unsupervised learning method that allows each observation to belong to multiple clusters. In this article, we propose a Bayesian mixed membership model for functional data. By using the multivariate Karhunen-Loève theorem, we are able to derive a scalable representation of Gaussian processes that maintains data-driven learning of the covariance structure. Within this framework, we establish conditional posterior consistency given a known feature allocation matrix. Compared to previous work on mixed membership models, our proposal allows for increased modeling flexibility, with the benefit of a directly interpretable mean and covariance structure. Our work is motivated by studies in functional brain imaging through electroencephalography (EEG) of children with autism spectrum disorder (ASD). In this context, our work formalizes the …

Marco Riani

Università degli Studi di Parma

Journal of Computational and Graphical Statistics

Robust transformations for multiple regression via additivity and variance stabilization

Outliers can have a major effect on the estimated transformation of the response in linear regression models, as they can on the estimates of the coefficients of the fitted model. The effect is more extreme in the Generalized Additive Models (GAMs) that are the subject of this article, as the forms of terms in the model can also be affected. We develop, describe and illustrate robust methods for the nonparametric transformation of the response and estimation of the terms in the model. Numerical integration is used to calculate the estimated variance stabilizing transformation. Robust regression provides outlier free input to the polynomial smoothers used in the calculation of the response transformation and in the backfitting algorithm for estimation of the functions of the GAM. Our starting point was the AVAS (Additivity and VAriance Stabilization) algorithm of Tibshirani. Even if robustness is not required, we have made four …

Cesar Goncalves de Lima

Universidade de São Paulo

Journal of Computational and Graphical Statistics

A multi-attribute evaluation of genotype-environment experiments using biplots and joint plots graphics

In plant breeding studies, typical objectives are to study the genotype-environment interaction (GEI) and to evaluate genotypic stability and adaptability. The additive model with multiplicative interaction (AMMI) has been widely used for cases in which there is only one response trait. In this work we propose the combined use of the Tucker3 model, joint plot graphics, and Procrustes analysis to analyze data from a GEI experiment with multiple responses. The joint use of these methodologies allows a direct comparison with the results of the AMMI analysis. This method was applied to a dataset from a 2016 experiment, laid out in a randomized complete block design, that evaluated the darkening of carioca bean grains under the natural grain darkening method and the accelerated darkening method, with nineteen carioca bean genotypes in six environments in the State of São Paulo, Brazil, and eight …

Shu Yang

North Carolina State University

Journal of Computational and Graphical Statistics

Mixed Matrix Completion in Complex Survey Sampling under Heterogeneous Missingness

Modern surveys with large sample sizes and growing mixed-type questionnaires require robust and scalable analysis methods. In this work, we consider recovering a mixed dataframe matrix, obtained by complex survey sampling, with entries following different canonical exponential distributions and subject to heterogeneous missingness. To tackle this challenging task, we propose a two-stage procedure: in the first stage, we model the entry-wise missing mechanism by logistic regression, and in the second stage, we complete the target parameter matrix by maximizing a weighted log-likelihood with a low-rank constraint. We propose a fast and scalable estimation algorithm that achieves sublinear convergence, and the upper bound for the estimation error of the proposed method is rigorously derived. Experimental results support our theoretical claims, and the proposed estimator shows its merits compared to other …

Byron C. Jaeger

University of Alabama at Birmingham

Journal of Computational and Graphical Statistics

Accelerated and interpretable oblique random survival forests

The oblique random survival forest (RSF) is an ensemble supervised learning method for right-censored outcomes. Trees in the oblique RSF are grown using linear combinations of predictors, whereas in the standard RSF, a single predictor is used. Oblique RSF ensembles have high prediction accuracy, but assessing many linear combinations of predictors induces high computational overhead. In addition, few methods have been developed for estimation of variable importance (VI) with oblique RSFs. We introduce a method to increase the computational efficiency of the oblique RSF and a method to estimate VI with the oblique RSF. Our computational approach uses Newton-Raphson scoring in each non-leaf node. We estimate VI by negating each coefficient used for a given predictor in linear combinations, and then computing the reduction in out-of-bag accuracy. In benchmarking experiments, we find our …
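
The negation-based variable importance described above can be mimicked on a single linear ("oblique") split; the classification stand-in below only illustrates negating one coefficient at a time and measuring the drop in held-out accuracy, and is not the oblique random survival forest implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# A single linear split stands in for a non-leaf node of an oblique tree.
# Negation importance: flip the sign of one coefficient, re-score held-out data,
# and record the resulting drop in accuracy (a larger drop suggests a more
# important predictor). Data below are simulated for illustration only.

rng = np.random.default_rng(0)
n, p = 400, 4
X = rng.standard_normal((n, p))
y = (X[:, 0] - 0.8 * X[:, 1] + 0.2 * rng.standard_normal(n) > 0).astype(int)

train, test = slice(0, 300), slice(300, None)
clf = LogisticRegression().fit(X[train], y[train])
base_acc = clf.score(X[test], y[test])

importance = {}
for j in range(p):
    coef = clf.coef_[0].copy()
    coef[j] *= -1.0                                    # negate predictor j's coefficient
    pred = (X[test] @ coef + clf.intercept_[0] > 0).astype(int)
    importance[j] = base_acc - np.mean(pred == y[test])
print(importance)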

John Kalbfleisch

University of Michigan-Dearborn

Journal of Computational and Graphical Statistics

Competing Risk Modeling with Bivariate Varying Coefficients to Understand the Dynamic Impact of COVID-19

The coronavirus disease 2019 (COVID-19) pandemic has exerted a profound impact on patients with end-stage renal disease relying on kidney dialysis to sustain their lives. A preliminary analysis of dialysis patient postdischarge hospital readmissions and deaths in 2020 revealed that the COVID-19 effect has varied significantly with postdischarge time and time since the pandemic onset. However, the complex dynamics cannot be characterized by existing varying coefficient models. To address this issue, we propose a bivariate varying coefficient model for competing risks, where tensor-product B-splines are used to estimate the surface of the COVID-19 effect. An efficient proximal Newton algorithm is developed to facilitate the fitting of the new model to the massive data for Medicare beneficiaries on dialysis. Difference-based anisotropic penalization is introduced to mitigate model overfitting and effect wiggliness …

Javier Cabrera

Rutgers, The State University of New Jersey

Journal of Computational and Graphical Statistics

Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure

Big data, in which the number of observations N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by the required calculations. To circumvent this problem, typically the clustering technique is applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input …

Emi Tanaka

Monash University

Journal of Computational and Graphical Statistics

A Plot is Worth a Thousand Tests: Assessing Residual Diagnostics with the Lineup Protocol

Regression experts consistently recommend plotting residuals for model diagnosis, despite the availability of many numerical hypothesis test procedures designed to use residuals to assess problems with a model fit. Here we provide evidence for why this is good advice using data from a visual inference experiment. We show how conventional tests are too sensitive, which means that too often the conclusion would be that the model fit is inadequate. The experiment uses the lineup protocol which puts a residual plot in the context of null plots. This helps generate reliable and consistent reading of residual plots for better model diagnosis. It can also help in an obverse situation where a conventional test would fail to detect a problem with a model due to contaminated data. The lineup protocol also detects a range of departures from good residuals simultaneously. Supplemental materials for the article are available …

Lu Yang

University of Minnesota-Twin Cities

Journal of Computational and Graphical Statistics

Double Probability Integral Transform Residuals for Regression Models with Discrete Outcomes

The assessment of regression models with discrete outcomes is challenging and has many fundamental issues. With discrete outcomes, standard regression model assessment tools such as Pearson and deviance residuals do not follow the conventional reference distribution (normal) under the true model, calling into question the legitimacy of model assessment based on these tools. To fill this gap, we construct a new type of residuals for regression models with general discrete outcomes, including ordinal and count outcomes. The proposed residuals are based on two layers of probability integral transformation. When at least one continuous covariate is available, the proposed residuals closely follow a uniform distribution (or a normal distribution after transformation) under the correctly specified model. One can construct visualizations such as QQ plots to check the overall fit of a model straightforwardly, and the …

Bernardo Nipoti

Università degli Studi di Milano-Bicocca

Journal of Computational and Graphical Statistics

Accelerated structured matrix factorization

Matrix factorization exploits the idea that, in complex high-dimensional data, the actual signal typically lies in lower-dimensional structures. These lower dimensional objects provide useful insights, with interpretation favored by sparse structures. Sparsity, in addition, is beneficial in terms of regularization and, thus, to avoid over-fitting. By exploiting Bayesian shrinkage priors, we devise a computationally convenient approach for high-dimensional matrix factorization. The dependence between row and column entities is modeled by inducing flexible sparse patterns within factors. The availability of external information is accounted for in such a way that structures are allowed while not imposed. Inspired by boosting algorithms, we pair the proposed approach with a numerical strategy relying on a sequential inclusion and estimation of low-rank contributions, with a data-driven stopping rule. Practical advantages of the …

Martina Morris

University of Washington

Journal of Computational and Graphical Statistics

Improving and Extending STERGM Approximations Based on Cross-Sectional Data and Tie Durations

Temporal exponential-family random graph models (TERGMs) are a flexible class of models for network ties that change over time. Separable TERGMs (STERGMs) are a subclass of TERGMs in which the dynamics of tie formation and dissolution can be separated within each discrete time step and may depend on different factors. The Carnegie et al. approximation improves estimation efficiency for a subclass of STERGMs, allowing them to be reliably estimated from inexpensive cross-sectional study designs. This approximation adapts to cross-sectional data by attempting to construct a STERGM with two specific properties: a cross-sectional equilibrium distribution defined by an exponential-family random graph model (ERGM) for the network structure, and geometric tie duration distributions defined by constant hazards for tie dissolution. In this article we focus on approaches for improving the behavior of the …

Pariya Behrouzi

Wageningen Universiteit

Journal of Computational and Graphical Statistics

Copula graphical models for heterogeneous mixed data

This article proposes a graphical model that handles mixed-type, multi-group data. The motivation for such a model originates from real-world observational data, which often contain groups of samples obtained under heterogeneous conditions in space and time, potentially resulting in differences in network structure among groups. Therefore, the iid assumption is unrealistic, and fitting a single graphical model on all data results in a network that does not accurately represent the between group differences. In addition, real-world observational data is typically of mixed discrete-and-continuous type, violating the Gaussian assumption that is typical of graphical models, which leads to the model being unable to adequately recover the underlying graph structure. Both these problems are solved by fitting a different graph for each group, applying the fused group penalty to fuse similar graphs together and by treating the …

qiang sun

Shandong University

Journal of Computational and Graphical Statistics

Supervised Principal Component Regression for Functional Responses with High Dimensional Predictors

We propose a supervised principal component regression method for relating functional responses with high-dimensional predictors. Unlike the conventional principal component analysis, the proposed method builds on a newly defined expected integrated residual sum of squares, which directly makes use of the association between the functional response and the predictors. Minimizing the integrated residual sum of squares gives the supervised principal components, which is equivalent to solving a sequence of nonconvex generalized Rayleigh quotient optimization problems. We reformulate the nonconvex optimization problems into a simultaneous linear regression with a sparse penalty to deal with high dimensional predictors. Theoretically, we show that the reformulated regression problem can recover the same supervised principal subspace under certain conditions. Statistically, we establish nonasymptotic …

Cheng Meng

Renmin University of China

Journal of Computational and Graphical Statistics

Nonparametric Additive Models for Billion Observations

The nonparametric additive model (NAM) is a widely used nonparametric regression method. Nevertheless, due to the high computational burden, classic statistical techniques for fitting NAMs are not well-equipped to handle massive data with billions of observations. To address this challenge, we develop a scalable element-wise subset selection method, referred to as Core-NAM, for fitting penalized regression spline based NAMs. Specifically, we first propose an approximation of the penalized least squares estimation, based on which we develop an efficient variant of generalized cross-validation (GCV) to select the smoothing parameter and approximate the Bayesian confidence intervals for statistical inference. Theoretically, we show that the proposed estimator approximately minimizes an upper bound of the estimation mean squared error. Moreover, we provide a non-asymptotic approximation guarantee for …

Marc A. Suchard

University of California, Los Angeles

Journal of Computational and Graphical Statistics

Massive parallelization of massive sample-size survival analysis

Large-scale observational health databases are increasingly popular for conducting comparative effectiveness and safety studies of medical products. However, increasing number of patients poses computational challenges when fitting survival regression models in such studies. In this article, we use Graphics Processing Units (GPUs) to parallelize the computational bottlenecks of massive sample-size survival analyses. Specifically, we develop and apply time- and memory-efficient single-pass parallel scan algorithms for Cox proportional hazards models and forward-backward parallel scan algorithms for Fine-Gray models for analysis with and without a competing risk using a cyclic coordinate descent optimization approach. We demonstrate that GPUs accelerate the computation of fitting these complex models in large databases by orders of magnitude as compared to traditional multi-core CPU parallelism. Our …

Jian Kang

University of Michigan

Journal of Computational and Graphical Statistics

Competing Risk Modeling with Bivariate Varying Coefficients to Understand the Dynamic Impact of COVID-19

The coronavirus disease 2019 (COVID-19) pandemic has exerted a profound impact on patients with end-stage renal disease relying on kidney dialysis to sustain their lives. A preliminary analysis of dialysis patient postdischarge hospital readmissions and deaths in 2020 revealed that the COVID-19 effect has varied significantly with postdischarge time and time since the pandemic onset. However, the complex dynamics cannot be characterized by existing varying coefficient models. To address this issue, we propose a bivariate varying coefficient model for competing risks, where tensor-product B-splines are used to estimate the surface of the COVID-19 effect. An efficient proximal Newton algorithm is developed to facilitate the fitting of the new model to the massive data for Medicare beneficiaries on dialysis. Difference-based anisotropic penalization is introduced to mitigate model overfitting and effect wiggliness …

Ray Bai

University of South Carolina

Journal of Computational and Graphical Statistics

Generative quantile regression with variability penalty

Quantile regression and conditional density estimation can reveal structure that is missed by mean regression, such as multimodality and skewness. In this article, we introduce a deep learning generative model for joint quantile estimation called Penalized Generative Quantile Regression (PGQR). Our approach simultaneously generates samples from many random quantile levels, allowing us to infer the conditional distribution of a response variable given a set of covariates. Our method employs a novel variability penalty to avoid the problem of vanishing variability, or memorization, in deep generative models. Further, we introduce a new family of partial monotonic neural networks (PMNN) to circumvent the problem of crossing quantile curves. A major benefit of PGQR is that it can be fit using a single optimization, thus, bypassing the need to repeatedly train the model at multiple quantile levels or use …

Trevor Hastie

Stanford University

Journal of Computational and Graphical Statistics

Smooth multi-period forecasting with application to prediction of COVID-19 cases

Forecasting methodologies have always attracted a lot of attention and have become an especially hot topic since the beginning of the COVID-19 pandemic. In this article we consider the problem of multi-period forecasting that aims to predict several horizons at once. We propose a novel approach that forces the prediction to be “smooth” across horizons and apply it to two tasks: point estimation via regression and interval prediction via quantile regression. This methodology was developed for real-time distributed COVID-19 forecasting. We illustrate the proposed technique with the COVIDcast dataset as well as a small simulation example. Supplementary materials for this article are available online.

David Nott

National University of Singapore

Journal of Computational and Graphical Statistics

Improving the accuracy of marginal approximations in likelihood-free inference via localization

Likelihood-free methods are an essential tool for performing inference for implicit models which can be simulated from, but for which the corresponding likelihood is intractable. However, common likelihood-free methods do not scale well to a large number of model parameters. A promising approach to high-dimensional likelihood-free inference involves estimating low-dimensional marginal posteriors by conditioning only on summary statistics believed to be informative for the low-dimensional component, and then combining the low-dimensional approximations in some way. In this article, we demonstrate that such low-dimensional approximations can be surprisingly poor in practice for seemingly intuitive summary statistic choices. We describe an idealized low-dimensional summary statistic that is, in principle, suitable for marginal estimation. However, a direct approximation of the idealized choice is difficult in …

Christopher Drovandi

Queensland University of Technology

Journal of Computational and Graphical Statistics

Improving the accuracy of marginal approximations in likelihood-free inference via localization

Likelihood-free methods are an essential tool for performing inference for implicit models which can be simulated from, but for which the corresponding likelihood is intractable. However, common likelihood-free methods do not scale well to a large number of model parameters. A promising approach to high-dimensional likelihood-free inference involves estimating low-dimensional marginal posteriors by conditioning only on summary statistics believed to be informative for the low-dimensional component, and then combining the low-dimensional approximations in some way. In this article, we demonstrate that such low-dimensional approximations can be surprisingly poor in practice for seemingly intuitive summary statistic choices. We describe an idealized low-dimensional summary statistic that is, in principle, suitable for marginal estimation. However, a direct approximation of the idealized choice is difficult in …