Discussion of ‘Event history and topological data analysis’


Although topological data analysis has been around for many decades with well-grounded theoretical development, it still suffers from numerous statistical and computational issues, and for these reasons it has not yet become a standard tool for data scientists. The authors point out the difficulty of directly applying existing statistical models to persistent homology, owing to the heterogeneous nature of topological features. Statistical development in topological data analysis over the last decade has therefore focused on turning heterogeneous features into homogeneous, structured data by transformation or smoothing. Against this background, the idea of applying survival analysis techniques to the birth and death process of topological features is very intriguing. The authors have succeeded in elucidating the connection between event history methods and the lifetimes of topological features, and the paper stimulates many interesting new questions.
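The discussion's central idea — treating the births and deaths of topological features as event-history data — can be pictured with a small, purely illustrative sketch (not the authors' method): take the lifetimes (death minus birth) from a hypothetical persistence barcode and summarize them with a Kaplan-Meier survival curve.

```python
# Illustrative sketch only: lifetimes from a made-up persistence barcode
# treated as fully observed survival times, summarized by a Kaplan-Meier curve.

def kaplan_meier(lifetimes):
    """Return (time, survival-probability) pairs for fully observed lifetimes."""
    surv = 1.0
    at_risk = len(lifetimes)
    curve = []
    for t in sorted(set(lifetimes)):
        deaths = lifetimes.count(t)        # features dying at time t
        surv *= 1 - deaths / at_risk       # Kaplan-Meier product step
        curve.append((t, surv))
        at_risk -= deaths
    return curve

# Hypothetical barcode: (birth, death) pairs of topological features.
barcode = [(0.1, 0.4), (0.2, 0.9), (0.15, 0.5), (0.3, 0.9), (0.05, 0.25)]
lifetimes = [round(d - b, 10) for b, d in barcode]

for t, s in kaplan_meier(lifetimes):
    print(f"P(lifetime > {t:.2f}) = {s:.2f}")
```

Real barcodes also raise censoring issues (features still alive at the end of the filtration), which is exactly where event-history machinery becomes relevant.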

Journal

Biometrika

Published On

2021/12/1

Volume

108

Issue

4

Pages

775-778

Authors

Hernando Ombao

King Abdullah University of Science and Technology

Position

Professor of Statistics

H-Index(all)

46

H-Index(since 2020)

30

Research Interests

Time Series

Signal Processing

Space-Time Models

Neuroscience

Jesper Møller

Aalborg Universitet

Position

Professor in Statistics

H-Index(all)

46

H-Index(since 2020)

23

Research Interests

Mathematical Statistics

Probability Theory

Moo K. Chung

University of Wisconsin-Madison

H-Index(all)

44

H-Index(since 2020)

32

Research Interests

Computational Anatomy

Shape Analysis

Functional Data Analysis

Brain Connectivity

Topological Data Analysis

Peter Bubenik

University of Florida

Position

Professor of Mathematics

H-Index(all)

19

H-Index(since 2020)

17

Research Interests

Applied Topology

Topological Data Analysis

Christophe Biscio

Aalborg Universitet

H-Index(all)

10

H-Index(since 2020)

10

Research Interests

Spatial Statistics

Topological Data Analysis

Other Articles from authors

Hernando Ombao

King Abdullah University of Science and Technology

IEEE Journal of Biomedical and Health Informatics

Graph autoencoders for embedding learning in brain networks and major depressive disorder identification

Brain functional connectivity (FC) networks inferred from functional magnetic resonance imaging (fMRI) have shown altered or aberrant brain functional connectomes in various neuropsychiatric disorders. Recent applications of deep neural networks to connectome-based classification mostly rely on traditional convolutional neural networks (CNNs) that use input FCs on a regular Euclidean grid to learn spatial maps of brain networks, neglecting the topological information of the brain networks and leading to potentially sub-optimal performance in brain disorder identification. We propose a novel graph deep learning framework that leverages non-Euclidean information inherent in the graph structure for classifying brain networks in major depressive disorder (MDD). We introduce a novel graph autoencoder (GAE) architecture, built upon graph convolutional networks (GCNs), to embed the topological structure and node …

Hernando Ombao

King Abdullah University of Science and Technology

Journal of the Royal Statistical Society Series B: Statistical Methodology

Hernando Ombao’s contribution to the Discussion of ‘the Discussion Meeting on Probabilistic and statistical aspects of machine learning’

Detecting change points in data is challenging because of the range of possible types of change and types of behaviour of data when there is no change. Statistically efficient methods for detecting a change will depend on both of these features, and it can be difficult for a practitioner to develop an appropriate detection method for their application of interest. We show how to automatically generate new offline detection methods based on training a neural network. Our approach is motivated by many existing tests for the presence of a change point being representable by a simple neural network, and thus a neural network trained with sufficient data should have performance at least as good as these methods. We present theory that quantifies the error rate for such an approach, and how it depends on the amount of training data. Empirical results show that, even with limited training data, its performance is competitive …

Moo K. Chung

University of Wisconsin-Madison

arXiv preprint arXiv:2403.06687

Advancing Graph Neural Networks with HL-HGAT: A Hodge-Laplacian and Attention Mechanism Approach for Heterogeneous Graph-Structured Data

Graph neural networks (GNNs) have proven effective in capturing relationships among nodes in a graph. This study introduces a novel perspective by considering a graph as a simplicial complex, encompassing nodes, edges, triangles, and k-simplices, enabling the definition of graph-structured data on any k-simplices. Our contribution is the Hodge-Laplacian heterogeneous graph attention network (HL-HGAT), designed to learn heterogeneous signal representations across k-simplices. The HL-HGAT incorporates three key components: HL convolutional filters (HL-filters), simplicial projection (SP), and simplicial attention pooling (SAP) operators, applied to k-simplices. HL-filters leverage the unique topology of k-simplices encoded by the Hodge-Laplacian (HL) operator, operating within the spectral domain of the k-th HL operator. To address computation challenges, we introduce a polynomial approximation for HL-filters, exhibiting spatial localization properties. Additionally, we propose a pooling operator to coarsen k-simplices, combining features through simplicial attention mechanisms of self-attention and cross-attention via transformers and SP operators, capturing topological interconnections across multiple dimensions of simplices. The HL-HGAT is comprehensively evaluated across diverse graph applications, including NP-hard problems, graph multi-label and classification challenges, and graph regression tasks in logistics, computer vision, biology, chemistry, and neuroscience. The results demonstrate the model's efficacy and versatility in handling a wide range of graph-based scenarios.

Hernando Ombao

King Abdullah University of Science and Technology

International Journal of Environmental Science and Development

Clustering Provinces with Drought Risk Based on Daily Maximum Temperature

Changes in global weather patterns are sweeping the world, including Indonesia. One cause of this change is the El Niño event, in which sea surface temperatures in the central Pacific Ocean increase. Besides raising temperatures, El Niño also reduces rainfall intensity, leading to drought disasters. Anticipating natural disasters and disaster mitigation need to be carried out to reduce their negative impacts. Efforts can be made by identifying areas with a high potential for drought and clustering areas based on the level of potential drought. This article focuses on extreme data from maximum temperatures in 34 provinces in Indonesia. Clustering was performed using the k-means and k-medoids methods and evaluated using the Davies-Bouldin index. The highest maximum temperature in a specific period was predicted using the return level. The results show that the k-means method is more suitable and better implemented, with a Davies-Bouldin index of 0.9945.
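For readers unfamiliar with the evaluation criterion mentioned above, here is a dependency-free sketch of the Davies-Bouldin index (lower values indicate better-separated, more compact clusters), applied to hypothetical one-dimensional temperature data rather than the article's dataset:

```python
# Minimal Davies-Bouldin index for 1-D data (illustrative, not the article's code).
# DB = mean over clusters i of max_{j != i} (S_i + S_j) / M_ij, where S_i is the
# average distance of cluster members to their centroid and M_ij the distance
# between centroids.

def davies_bouldin(points, labels):
    clusters = sorted(set(labels))
    members = {c: [p for p, l in zip(points, labels) if l == c] for c in clusters}
    centroid = {c: sum(ms) / len(ms) for c, ms in members.items()}
    scatter = {c: sum(abs(p - centroid[c]) for p in ms) / len(ms)
               for c, ms in members.items()}
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j]) / abs(centroid[i] - centroid[j])
                     for j in clusters if j != i)
    return total / len(clusters)

# Hypothetical daily maximum temperatures (°C) and a 2-cluster assignment.
temps  = [31.0, 31.5, 32.0, 36.5, 37.0, 37.5]
labels = [0, 0, 0, 1, 1, 1]
print(round(davies_bouldin(temps, labels), 4))
```

The same index generalizes to multivariate data by replacing the absolute differences with Euclidean distances.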

Peter Bubenik

University of Florida

arXiv preprint arXiv:2402.15058

Mixup Barcodes: Quantifying Geometric-Topological Interactions between Point Clouds

We combine standard persistent homology with image persistent homology to define a novel way of characterizing shapes and interactions between them. In particular, we introduce: (1) a mixup barcode, which captures geometric-topological interactions (mixup) between two point sets in arbitrary dimension; (2) simple summary statistics, total mixup and total percentage mixup, which quantify the complexity of the interactions as a single number; (3) a software tool for playing with the above. As a proof of concept, we apply this tool to a problem arising from machine learning. In particular, we study the disentanglement in embeddings of different classes. The results suggest that topological mixup is a useful method for characterizing interactions for low and high-dimensional data. Compared to the typical usage of persistent homology, the new tool is sensitive to the geometric locations of the topological features, which is often desirable.

Hernando Ombao

King Abdullah University of Science and Technology

Foundations of Data Science

Dynamic topological data analysis of functional human brain networks

Developing reliable methods to discriminate different transient brain states that change over time is a key neuroscientific challenge in brain imaging studies. Topological data analysis (TDA), a novel framework based on algebraic topology, can handle such a challenge. However, existing TDA has been somewhat limited to capturing the static summary of dynamically changing brain networks. We propose a novel dynamic-TDA framework that builds persistent homology over a time series of brain networks. We construct a Wasserstein distance based inference procedure to discriminate between time series of networks. The method is applied to the resting-state functional magnetic resonance images of the human brain. We demonstrate that our proposed dynamic-TDA approach can distinctly discriminate between the topological patterns of male and female brain networks. MATLAB code for implementing this method …

2023/12/18

Moo K. Chung

University of Wisconsin-Madison

Foundations of Data Science

Dynamic topological data analysis of functional human brain networks

Developing reliable methods to discriminate different transient brain states that change over time is a key neuroscientific challenge in brain imaging studies. Topological data analysis (TDA), a novel framework based on algebraic topology, can handle such a challenge. However, existing TDA has been somewhat limited to capturing the static summary of dynamically changing brain networks. We propose a novel dynamic-TDA framework that builds persistent homology over a time series of brain networks. We construct a Wasserstein distance based inference procedure to discriminate between time series of networks. The method is applied to the resting-state functional magnetic resonance images of the human brain. We demonstrate that our proposed dynamic-TDA approach can distinctly discriminate between the topological patterns of male and female brain networks. MATLAB code for implementing this method …

2023/12/18

Hernando Ombao

King Abdullah University of Science and Technology

Frontiers in Human Neuroscience

Unleashing the potential of fNIRS with machine learning: classification of fine anatomical movements to empower future brain-computer interface

In this study, we explore the potential of using functional near-infrared spectroscopy (fNIRS) signals in conjunction with modern machine-learning techniques to classify specific anatomical movements, so as to increase the number of control commands for possible fNIRS-based brain-computer interface (BCI) applications. The study focuses on novel individual finger-tapping, a well-known task in fNIRS and fMRI studies but so far limited to left/right distinctions or a few fingers. Twenty-four right-handed participants performed the individual finger-tapping task. Data were recorded using sixteen sources and detectors placed over the motor cortex according to the 10-10 international system. The event-average oxygenated (ΔHbO) and deoxygenated (ΔHbR) hemoglobin data were utilized as features to assess the performance of diverse machine learning (ML) models in a challenging multi-class classification setting. These methods include …

Peter Bubenik

University of Florida

arXiv preprint arXiv:2402.04242

Exact weights and path metrics for triangulated categories and the derived category of persistence modules

We define exact weights on a triangulated category to be nonnegative functions on objects satisfying a subadditivity condition with respect to exact triangles. Such weights induce a metric on objects in the triangulated category, which we call a path metric. Our exact weights generalize the rank functions of J. Chuang and A. Lazarev and are analogous to the exact weights for an exact category given by the first author with J. Scott and D. Stanley. We show that cohomological functors from a triangulated category to an abelian category with an additive weight induce an exact weight on the triangulated category. We prove that triangle equivalences induce an isometry for the path metrics induced by cohomological functors. In the perfectly generated or compactly generated case, we use Brown representability to express the exact weight on the triangulated category. We give three characterizations of exactness for a weight on a triangulated category and show that they are equivalent. We also define Wasserstein distances for triangulated categories. Finally, we apply our work to derived categories of persistence modules and to representations of continuous quivers of type A.

Hernando Ombao

King Abdullah University of Science and Technology

The Journal of Laryngology & Otology

Brain waves spectral analysis of human responses to odorous and non-odorous substances: a preliminary study

Objective: The aim of this study was to identify potential electrophysiological biomarkers of human responses by comparing electroencephalogram brain wave changes towards lavender versus normal saline in a healthy human population. Method: This study included a total of 44 participants without subjective olfactory disturbances. Lavender and normal saline were used as the olfactory stimulant and control. The electroencephalogram was recorded and power spectra were analysed by spectral analysis for each alpha, beta, delta, theta and gamma bandwidth frequency upon exposure to lavender and normal saline independently. Results: The oscillatory brain activities in response to the olfactory stimulant indicated that the lavender smell decreased beta activity in the left frontal (F7 electrode) and central region (C3 electrode), with a reduction in gamma activity in the right parietal region (P4 electrode) (p …

Hernando Ombao

King Abdullah University of Science and Technology

The Annals of Applied Statistics

Filtrated common functional principal component analysis of multigroup functional data

The online Supplementary Material includes technical proofs, additional simulation and real data analysis results, and pseudocode for the developed community detection algorithm.

Christophe Biscio

Aalborg Universitet

Multi-Sensor Multi-Scan Radar Sensing of Multiple Extended Targets

We propose an efficient solution to the state estimation problem in multi-scan multi-sensor multiple extended target sensing scenarios. We first model the measurement process by a doubly inhomogeneous-generalized shot noise Cox process and then estimate the parameters using a jump Markov chain Monte Carlo sampling technique. The proposed approach scales linearly in the number of measurements and can take spatial properties of the sensors into account, herein, sensor noise covariance, detection probability, and resolution. Numerical experiments using radar measurement data suggest that the algorithm offers improvements in high clutter scenarios with closely spaced targets over state-of-the-art clustering techniques used in existing multiple extended target tracking algorithms.

Jesper Møller

Aalborg Universitet

arXiv preprint arXiv:2404.09525

Coupling results and Markovian structures for number representations of continuous random variables

A general setting is considered for nested subdivisions of a bounded real set into intervals defining the digits of a random variable X with a probability density function f. Under the weak condition that f is almost everywhere lower semi-continuous, a coupling between X and a non-negative integer-valued random variable N is established so that the first N digits have an interpretation as the "sufficient digits", since the distribution of the remaining digits conditioned on the first N does not depend on them. Adding a condition about a Markovian structure of the lengths of the intervals in the nested subdivisions, the digit sequence becomes a Markov chain of a certain order; if this order is one, the digits after the sufficient ones are IID with a known distribution. When the Markov chain is uniformly geometrically ergodic, a coupling is established with a random time after which the chain is stationary and follows a simple known distribution. The results are related to several examples of number representations generated by a dynamical system, including base-q expansions, generalized Lüroth series, β-expansions, and continued fraction representations. The importance of the results and some suggestions and open problems for future research are discussed.

Jesper Møller

Aalborg Universitet

arXiv preprint arXiv:2404.08387

The asymptotic distribution of the scaled remainder for pseudo golden ratio expansions of a continuous random variable

Let X_1, X_2, … be the base-β expansion of a continuous random variable on the unit interval, where β is the positive solution to a quadratic equation determined by an integer parameter (i.e., β is a generalization of the golden mean, which is recovered in the simplest case). We study the asymptotic distribution and convergence rate of the scaled remainder when the number of digits tends to infinity.

Jesper Møller

Aalborg Universitet

Methodology and Computing in Applied Probability

How many digits are needed?

Let X_1, X_2, … be the digits in the base-q expansion of a random variable X defined on [0, 1), where q ≥ 2 is an integer. For each number of digits n, we study the probability distribution of the scaled remainder of the expansion: if X has an absolutely continuous CDF, this distribution converges in the total variation metric to the Lebesgue measure on the unit interval. Under weak smoothness conditions we establish first a coupling between X and a non-negative integer-valued random variable N so that the scaled remainder after N digits follows the Lebesgue measure and is independent of the first N digits, and second exponentially fast convergence of the remainder distribution and its PDF. We discuss how many digits are needed and show examples of our results.
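The objects this abstract studies are easy to compute exactly. The sketch below (with arbitrary illustrative values of x and q, not taken from the paper) produces the first n base-q digits of x in [0, 1) and the scaled remainder q^n·x mod 1, using exact rational arithmetic to avoid floating-point drift:

```python
# Base-q digits and scaled remainder of x in [0, 1), computed exactly.
from fractions import Fraction

def digits_and_remainder(x, q, n):
    """First n base-q digits of x in [0, 1) and the scaled remainder q^n * x mod 1."""
    digs = []
    r = Fraction(x)
    for _ in range(n):
        r *= q
        d = int(r)      # next digit, in {0, ..., q-1}
        digs.append(d)
        r -= d          # remainder stays in [0, 1)
    return digs, r

digs, rem = digits_and_remainder(Fraction(5, 16), 2, 4)
print(digs, rem)   # binary digits of 5/16 and the exact remainder
```

For an absolutely continuous X, the theorem says the distribution of this remainder flattens toward the uniform (Lebesgue) measure as n grows, and the coupling quantifies how many digits suffice.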

Peter Bubenik

University of Florida

Journal of Applied and Computational Topology

Topological and metric properties of spaces of generalized persistence diagrams

Motivated by persistent homology and topological data analysis, we consider formal sums on a metric space with a distinguished subset. These formal sums, which we call persistence diagrams, have a canonical 1-parameter family of metrics called Wasserstein distances. We study the topological and metric properties of these spaces. Some of our results are new even in the case of persistence diagrams on the half-plane. Under mild conditions, no persistence diagram has a compact neighborhood. If the underlying metric space is σ-compact then so is the space of persistence diagrams. However, under mild conditions, the space of persistence diagrams is not hemicompact and the space of functions from this space to a topological space is not metrizable. Spaces of persistence diagrams inherit completeness and separability from the underlying metric space. Some spaces of persistence diagrams inherit being …
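As a concrete companion to the Wasserstein distances discussed in this abstract, here is a brute-force sketch (illustrative only, and exponential in diagram size) of a 1-Wasserstein distance between two tiny persistence diagrams, where points may also be matched to their projections onto the diagonal:

```python
# Brute-force 1-Wasserstein distance between small persistence diagrams.
# Each diagram is a list of (birth, death) pairs; unmatched points are sent
# to their nearest diagonal point, and diagonal-diagonal pairs cost nothing.
from itertools import permutations

def diag_proj(p):
    m = (p[0] + p[1]) / 2
    return (m, m)

def wasserstein1(A, B):
    # Augment each diagram with the diagonal projections of the other's points,
    # so both sides have equal size and a perfect matching always exists.
    A_aug = list(A) + [diag_proj(q) for q in B]
    B_aug = list(B) + [diag_proj(p) for p in A]

    def cost(p, q):
        if p == diag_proj(p) and q == diag_proj(q):
            return 0.0                              # diagonal-diagonal pair
        return abs(p[0] - q[0]) + abs(p[1] - q[1])  # L1 ground metric

    n = len(A_aug)
    return min(sum(cost(A_aug[i], B_aug[pi[i]]) for i in range(n))
               for pi in permutations(range(n)))

print(wasserstein1([(0.0, 2.0)], [(0.0, 1.0)]))
```

Practical libraries replace the factorial search with optimal-transport solvers, but the construction — augmenting with diagonal projections and matching — is the same one underlying the metrics studied here.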

Hernando Ombao

King Abdullah University of Science and Technology

Econometrics and Statistics

Bayesian Nonparametric Multivariate Mixture of Autoregressive Processes with Application to Brain Signals

One of neuroscience’s goals is to study the interactions between different brain regions during rest and while performing specific cognitive tasks. Multivariate Bayesian autoregressive decomposition (MBMARD) is proposed as an intuitive and novel Bayesian non-parametric model to represent high-dimensional signals as a low-dimensional mixture of univariate uncorrelated latent oscillations. Each latent oscillation captures a specific underlying oscillatory activity and, hence, is modeled as a unique second-order autoregressive process due to a compelling property, namely, that its spectral density’s shape is characterized by a unique frequency peak and bandwidth, parameterized by a location and a scale parameter. The posterior distributions of the latent oscillation parameters are computed using a Metropolis-within-Gibbs algorithm. One of the advantages of the MBMARD model is its higher robustness against …

Hernando Ombao

King Abdullah University of Science and Technology

arXiv preprint arXiv:2404.09157

Statistics of Extremes for Neuroscience

This chapter illustrates how tools from univariate and multivariate statistics of extremes can complement classical methods used to study brain signals and enhance the understanding of brain activity and connectivity during specific cognitive tasks or abnormal episodes, such as an epileptic seizure.

Peter Bubenik

University of Florida

Neuroinformatics

Topological Data Analysis Captures Task-Driven fMRI Profiles in Individual Participants: A Classification Pipeline Based on Persistence

BOLD-based fMRI is the most widely used method for studying brain function. The BOLD signal, while valuable, is beset with unique vulnerabilities. The most notable of these are the modest signal-to-noise ratio and the relatively low temporal and spatial resolution. However, the high dimensional complexity of the BOLD signal also presents unique opportunities for functional discovery. Topological Data Analysis (TDA), a branch of mathematics optimized to search for specific classes of structure within high dimensional data, may provide particularly valuable applications. In this investigation, we acquired fMRI data in the anterior cingulate cortex (ACC) using a basic motor control paradigm. Then, for each participant and each of three task conditions, fMRI signals in the ACC were summarized using two methods: a) TDA based methods of persistent homology and persistence landscapes and b) non-TDA based …

Hernando Ombao

King Abdullah University of Science and Technology

arXiv preprint arXiv:2401.16928

Dynamic MRI reconstruction using low-rank plus sparse decomposition with smoothness regularization

The low-rank plus sparse (L+S) decomposition model has enabled better reconstruction of dynamic magnetic resonance imaging (dMRI) with separation into background (L) and dynamic (S) components. However, use of a low-rank prior alone may not fully explain the slow variations or smoothness of the background part at the local scale. In this paper, we propose a smoothness-regularized L+S (SR-L+S) model for dMRI reconstruction from highly undersampled k-t-space data. We exploit joint low-rank and smooth priors on the background component of dMRI to better capture both its global and local temporal correlated structures. Extending the L+S formulation, the low-rank property is encoded by the nuclear norm, while the smoothness is encoded by a general ℓp-norm penalty on the local differences of the columns of L. The additional smoothness regularizer can promote piecewise local consistency between neighboring frames. By smoothing out the noise and dynamic activities, it allows accurate recovery of the background part and, subsequently, more robust dMRI reconstruction. Extensive experiments on multi-coil cardiac and synthetic data show that the SR-L+S model outperforms …

Other articles from Biometrika journal

Julia A. Palacios

Stanford University

Biometrika

Statistical summaries of unlabelled evolutionary trees

Rooted and ranked phylogenetic trees are mathematical objects that are useful in modelling hierarchical data and evolutionary relationships with applications to many fields such as evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods. However, assessing uncertainty and summarizing distributions remains challenging for these types of structures. While labelled phylogenetic trees have been extensively studied, relatively less literature exists for unlabelled trees that are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In our paper, we exploit recently proposed distance metrics of unlabelled ranked binary trees …

Colin B. Fogarty

Massachusetts Institute of Technology

Biometrika

No-harm calibration for generalized Oaxaca-Blinder estimators

In randomized experiments, adjusting for observed features when estimating treatment effects has been proposed as a way to improve asymptotic efficiency. However, among parametric methods, only linear regression has been proven to form an estimate of the average treatment effect that is asymptotically no less efficient than the treated-minus-control difference in means regardless of the true data generating process. Randomized treatment assignment provides this do-no-harm property, with neither truth of a linear model nor a generative model for the outcomes being required. We present a general calibration method that confers the same no-harm property onto estimators leveraging a broad class of nonlinear models. This recovers the usual regression-adjusted estimator when ordinary least squares is used, and further provides noninferior treatment effect estimators using methods such as logistic and …

Changliang Zou

Nankai University

Biometrika

Selective conformal inference with false coverage-statement rate control

Conformal inference is a popular tool for constructing prediction intervals. We consider here the scenario of post-selection/selective conformal inference, that is, prediction intervals are reported only for individuals selected from unlabelled test data. To account for multiplicity, we develop a general split conformal framework to construct selective prediction intervals with false coverage-statement rate control. We first investigate the false coverage rate-adjusted method of Benjamini and Yekutieli (2005) in the present setting, and show that it is able to achieve false coverage-statement rate control but yields uniformly inflated prediction intervals. We then propose a novel solution to the problem called selective conditional conformal prediction. Our method performs selection procedures on both the calibration set and test set, and then constructs conformal prediction intervals for the selected test candidates with the aid of …

Fan Li

Duke University

Biometrika

Covariate adjustment in randomized experiments with missing outcomes and covariates

Covariate adjustment can improve precision in analysing randomized experiments. With fully observed data, regression adjustment and propensity score weighting are asymptotically equivalent in improving efficiency over unadjusted analysis. When some outcomes are missing, we consider combining these two adjustment methods with inverse probability of observation weighting for handling missing outcomes, and show that the equivalence between the two methods breaks down. Regression adjustment no longer ensures efficiency gain over unadjusted analysis unless the true outcome model is linear in covariates or the outcomes are missing completely at random. Propensity score weighting, in contrast, still guarantees efficiency over unadjusted analysis, and including more covariates in adjustment never harms asymptotic efficiency. Moreover, we establish the value of using partially observed covariates …

Jared D. Huling

University of Minnesota-Twin Cities

Biometrika

Robust Sample Weighting to Facilitate Individualized Treatment Rule Learning for a Target Population

Learning individualized treatment rules is an important topic in precision medicine. Current literature mainly focuses on deriving individualized treatment rules from a single source population. We consider the observational data setting when the source population differs from a target population of interest. Compared with causal generalization for the average treatment effect that is a scalar quantity, individualized treatment rule generalization poses new challenges due to the need to model and generalize the rules based on a prespecified class of functions that may not contain the unrestricted true optimal individualized treatment rule. The aim of this paper is to develop a weighting framework to mitigate the impact of such misspecification, and thus facilitate the generalizability of optimal individualized treatment rules from a source population to a target population. Our method seeks covariate balance over a …

Stefano Peluso

Università degli Studi di Milano-Bicocca

Biometrika

Bayesian learning of network structures from interventional experimental data

Directed acyclic graphs provide an effective framework for learning causal relationships among variables given multivariate observations. Under pure observational data, directed acyclic graphs encoding the same conditional independencies cannot be distinguished and are collected into Markov equivalence classes. In many contexts, however, observational measurements are supplemented by interventional data that improve directed acyclic graph identifiability and enhance causal effect estimation. We propose a Bayesian framework for multivariate data partially generated after stochastic interventions. To this end, we introduce an effective prior elicitation procedure leading to a closed-form expression for the directed acyclic graph marginal likelihood and guaranteeing score equivalence among directed acyclic graphs that are Markov equivalent post intervention. Under the Gaussian setting, we show, in …

Simone Vantini

Politecnico di Milano

Biometrika

Populations of unlabelled networks: Graph space geometry and generalized geodesic principal components

Statistical analysis for populations of networks is widely applicable, but challenging, as networks have strongly non-Euclidean behaviour. Graph space is an exhaustive framework for studying populations of unlabelled networks that are weighted or unweighted, uni- or multilayered, directed or undirected. Viewing graph space as the quotient of a Euclidean space with respect to a finite group action, we show that it is not a manifold, and that its curvature is unbounded from above. Within this geometrical framework we define generalized geodesic principal components, and we introduce the align-all-and-compute algorithms, all of which allow for the computation of statistics on graph space. The statistics and algorithms are compared with existing methods and empirically validated on three real datasets, showcasing the potential utility of the framework. The whole framework is implemented within the geomstats …

Yuekai Sun

University of Michigan

Biometrika

A linear adjustment-based approach to posterior drift in transfer learning

We present new models and methods for the posterior drift problem where the regression function in the target domain is modelled as a linear adjustment, on an appropriate scale, of that in the source domain, and study the theoretical properties of our proposed estimators in the binary classification problem. The core idea of our model inherits the simplicity and the usefulness of generalized linear models and accelerated failure time models from the classical statistics literature. Our approach is shown to be flexible and applicable in a variety of statistical settings, and can be adopted for transfer learning problems in various domains including epidemiology, genetics and biomedicine. As concrete applications, we illustrate the power of our approach (i) through mortality prediction for British Asians by borrowing strength from similar data from the larger pool of British Caucasians, using the UK Biobank data, and (ii …
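The linear-adjustment model described above can be sketched on the logit scale: the target-domain regression function is an affine transformation of the source-domain one. The source probability and the coefficients (a, b) below are hypothetical placeholders; in practice the source model would be fitted on the large source sample and (a, b) on a small labelled target sample.

```python
# A minimal sketch of the linear-adjustment model for posterior drift in
# binary classification: logit(p_target) = a + b * logit(p_source).
# The inputs here are hypothetical; no fitting is shown.
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def target_probability(p_source, a, b):
    """Posterior-drift adjustment of a source-domain class probability."""
    return sigmoid(a + b * logit(p_source))

# With a = 0 and b = 1 there is no drift: target equals source.
assert abs(target_probability(0.8, a=0.0, b=1.0) - 0.8) < 1e-12
# A negative intercept shifts all target probabilities downwards.
print(round(target_probability(0.8, a=-1.0, b=1.0), 3))  # → 0.595
```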

Guanhua Chen

University of Wisconsin-Madison

Biometrika

Robust sample weighting to facilitate individualized treatment rule learning for a target population

Learning individualized treatment rules is an important topic in precision medicine. Current literature mainly focuses on deriving individualized treatment rules from a single source population. We consider the observational data setting when the source population differs from a target population of interest. Compared with causal generalization for the average treatment effect that is a scalar quantity, individualized treatment rule generalization poses new challenges due to the need to model and generalize the rules based on a prespecified class of functions that may not contain the unrestricted true optimal individualized treatment rule. The aim of this paper is to develop a weighting framework to mitigate the impact of such misspecification, and thus facilitate the generalizability of optimal individualized treatment rules from a source population to a target population. Our method seeks covariate balance over a …

Peng Ding

University of California, Berkeley

Biometrika

Power and sample size calculations for rerandomization

Power analyses are an important aspect of experimental design, because they help determine how experiments are implemented in practice. It is common to specify a desired level of power and compute the sample size necessary to obtain that power. Such calculations are well known for completely randomized experiments, but there can be many benefits to using other experimental designs. For example, it has recently been established that rerandomization, where subjects are randomized until covariate balance is obtained, increases the precision of causal effect estimators. This work establishes the power of rerandomized treatment-control experiments, thereby allowing for sample size calculators. We find the surprising result that, while power is often greater under rerandomization than complete randomization, the opposite can occur for very small treatment effects. The reason is that inference under …
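The rerandomization scheme discussed above can be sketched in a few lines: redraw the treatment assignment until covariate balance is achieved. Real implementations accept or reject based on the Mahalanobis distance of covariate means; the single covariate and absolute-difference criterion below are simplifications for illustration.

```python
# Hypothetical sketch of rerandomization: reshuffle the assignment until the
# difference in covariate means between arms falls below a threshold.
# (Mahalanobis balance on multiple covariates is the usual criterion.)
import random

def rerandomize(covariate, threshold, seed=0):
    """Return a balanced half-treated assignment as sorted treated indices."""
    rng = random.Random(seed)
    n = len(covariate)
    units = list(range(n))
    while True:
        rng.shuffle(units)
        treated = set(units[: n // 2])
        mean_t = sum(covariate[i] for i in treated) / (n // 2)
        mean_c = sum(covariate[i] for i in range(n) if i not in treated) / (n - n // 2)
        if abs(mean_t - mean_c) < threshold:  # accept only balanced draws
            return sorted(treated)

x = [0.1, 2.3, 1.1, 0.4, 1.9, 0.8, 1.5, 0.2]
treated = rerandomize(x, threshold=0.2)
print(treated)
```

Repeating this acceptance-rejection step across simulated experiments is what power and sample-size calculations for rerandomization must account for.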

Wanjie Wang

National University of Singapore

Biometrika

Network-adjusted covariates for community detection

Community detection is a crucial task in network analysis that can be significantly improved by incorporating subject-level information, i.e., covariates. Existing methods have shown the effectiveness of using covariates on the low-degree nodes, but rarely discuss the case where communities have significantly different density levels, i.e. multiscale networks. In this paper, we introduce a novel method that addresses this challenge by constructing network-adjusted covariates, which leverage the network connections and covariates with a node-specific weight for each node. This weight can be calculated without tuning parameters. We present novel theoretical results on the strong consistency of our method under degree-corrected stochastic blockmodels with covariates, even in the presence of misspecification and multiple sparse communities. Additionally, we establish a general lower bound for the community …
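The construction above can be caricatured as blending each node's own covariates with those of its neighbours. The paper derives a specific tuning-free, node-specific weight; the degree-based weight used below is a hypothetical stand-in chosen purely to make the sketch runnable.

```python
# Illustrative sketch only: blend each node's covariates with the neighbour
# average. The degree-based weight w is a hypothetical stand-in for the
# paper's tuning-free node-specific weight.
def network_adjusted_covariates(adj, X):
    """Per node, blend own covariates with the mean of neighbours' covariates."""
    n = len(adj)
    p = len(X[0])
    out = []
    for i in range(n):
        neighbours = [j for j in range(n) if adj[i][j]]
        deg = len(neighbours)
        w = deg / (deg + 1)  # hypothetical node-specific weight, no tuning
        nbr_mean = [sum(X[j][k] for j in neighbours) / deg if deg else 0.0
                    for k in range(p)]
        out.append([w * nbr_mean[k] + (1 - w) * X[i][k] for k in range(p)])
    return out

adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # node 0 linked to nodes 1 and 2
X = [[1.0], [3.0], [5.0]]
print(network_adjusted_covariates(adj, X))
```

Spectral clustering would then be run on the adjusted covariates rather than on the adjacency matrix alone.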

Mark van der Laan

University of California, Berkeley

Biometrika

One-step targeted maximum likelihood estimation for targeting cause-specific absolute risks and survival curves

This paper considers the one-step targeted maximum likelihood estimation methodology for multi-dimensional causal parameters in general survival and competing risk settings where event times take place on the positive real line and are subject to right censoring. We focus on effects of baseline treatment decisions possibly confounded by pretreatment covariates, but remark that our work generalizes to settings with time-varying treatment regimes and time-dependent confounding. We point out two overall contributions of our work. First, our methods can be used to obtain simultaneous inference for treatment effects on multiple absolute risks in competing risk settings. Second, our methods can be used to achieve inference for the full survival curve, or a full absolute risk curve, across time. The one-step targeted maximum likelihood procedure is based on a one-dimensional universal least favourable submodel …

Zach Branson

Carnegie Mellon University

Biometrika

Power and sample size calculations for rerandomization

Power analyses are an important aspect of experimental design, because they help determine how experiments are implemented in practice. It is common to specify a desired level of power and compute the sample size necessary to obtain that power. Such calculations are well known for completely randomized experiments, but there can be many benefits to using other experimental designs. For example, it has recently been established that rerandomization, where subjects are randomized until covariate balance is obtained, increases the precision of causal effect estimators. This work establishes the power of rerandomized treatment-control experiments, thereby allowing for sample size calculators. We find the surprising result that, while power is often greater under rerandomization than complete randomization, the opposite can occur for very small treatment effects. The reason is that inference under …

Subha Maity

University of Michigan-Dearborn

Biometrika

A linear adjustment-based approach to posterior drift in transfer learning

We present new models and methods for the posterior drift problem where the regression function in the target domain is modelled as a linear adjustment, on an appropriate scale, of that in the source domain, and study the theoretical properties of our proposed estimators in the binary classification problem. The core idea of our model inherits the simplicity and the usefulness of generalized linear models and accelerated failure time models from the classical statistics literature. Our approach is shown to be flexible and applicable in a variety of statistical settings, and can be adopted for transfer learning problems in various domains including epidemiology, genetics and biomedicine. As concrete applications, we illustrate the power of our approach (i) through mortality prediction for British Asians by borrowing strength from similar data from the larger pool of British Caucasians, using the UK Biobank data, and (ii …

Xinran Li

University of Illinois at Urbana-Champaign

Biometrika

Treatment effect quantiles in stratified randomized experiments and matched observational studies

Evaluating the treatment effect has become an important topic for many applications. However, most existing literature focuses mainly on average treatment effects. When the individual effects are heavy tailed or have outlier values, not only may the average effect not be appropriate for summarizing treatment effects, but also the conventional inference for it can be sensitive and possibly invalid due to poor large-sample approximations. In this paper we focus on quantiles of individual treatment effects, which can be more robust in the presence of extreme individual effects. Moreover, our inference for them is purely randomization based, avoiding any distributional assumptions on the units. We first consider inference in stratified randomized experiments, extending recent work in this area. We show that the computation of valid p-values for testing null hypotheses on quantiles of individual effects can be transformed into …

Iván Díaz

Cornell University

Biometrika

Nonparametric efficient causal mediation with intermediate confounders (vol 108, pg 627, 2021)

Interventional effects for mediation analysis were proposed as a solution to the lack of identifiability of natural (in)direct effects in the presence of a mediator-outcome confounder affected by exposure. We present a theoretical and computational study of the properties of the interventional (in)direct effect estimands based on the efficient influence function in the nonparametric statistical model. We use the efficient influence function to develop two asymptotically optimal nonparametric estimators that leverage data-adaptive regression for the estimation of nuisance parameters: a one-step estimator and a targeted minimum loss estimator. We further present results establishing the conditions under which these estimators are consistent, multiply robust, √n-consistent and efficient. We illustrate the finite-sample performance of the estimators and corroborate our theoretical results in a simulation study. We also …

Dehan Kong

University of Toronto

Biometrika

Promises of parallel outcomes

A key challenge in causal inference from observational studies is the identification and estimation of causal effects in the presence of unmeasured confounding. In this paper, we introduce a novel approach for causal inference that leverages information in multiple outcomes to deal with unmeasured confounding. An important assumption in our approach is conditional independence among multiple outcomes. In contrast to existing proposals in the literature, the roles of multiple outcomes in the conditional independence assumption are symmetric, hence the name parallel outcomes. We show nonparametric identifiability with at least three parallel outcomes and provide parametric estimation tools under a set of linear structural equation models. Our proposal is evaluated through a set of synthetic and real data analyses.

Haojie Ren

Penn State University

Biometrika

Selective conformal inference with false coverage-statement rate control

Conformal inference is a popular tool for constructing prediction intervals. We consider here the scenario of post-selection/selective conformal inference, that is, prediction intervals are reported only for individuals selected from unlabelled test data. To account for multiplicity, we develop a general split conformal framework to construct selective prediction intervals with false coverage-statement rate control. We first investigate the false coverage rate-adjusted method of Benjamini & Yekutieli (2005) in the present setting, and show that it is able to achieve false coverage-statement rate control but yields uniformly inflated prediction intervals. We then propose a novel solution to the problem called selective conditional conformal prediction. Our method performs selection procedures on both the calibration set and test set, and then constructs conformal prediction intervals for the selected test candidates with the aid of …
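The split conformal construction underlying this framework is short enough to sketch: calibrate absolute residuals on a held-out set and report a symmetric interval around each test prediction. The constant predictor below is a toy stand-in for any fitted regression model, and the selection step of the paper is omitted.

```python
# Minimal split conformal sketch (no selection step): the interval half-width
# is a conformal quantile of held-out absolute residuals. The constant
# predictor is a toy stand-in for a fitted model.
import math

def split_conformal_interval(calib, predict, x_test, alpha=0.1):
    """(1 - alpha) prediction interval from split conformal calibration."""
    scores = sorted(abs(y - predict(x)) for x, y in calib)
    n = len(scores)
    # conformal rank: ceil((n + 1)(1 - alpha)), clipped to the largest score
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    mu = predict(x_test)
    return (mu - q, mu + q)

predict = lambda x: 0.0  # toy model
calib = [(0, y) for y in [-0.3, 0.1, 0.5, -0.8, 0.2, 0.9, -0.1, 0.4, -0.6]]
lo, hi = split_conformal_interval(calib, predict, x_test=0, alpha=0.2)
print((lo, hi))  # → (-0.8, 0.8)
```

Reporting such intervals only for data-dependently selected test points is what breaks the marginal coverage guarantee and motivates the false coverage-statement rate adjustment studied in the paper.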

Aldo Solari

Università degli Studi di Milano-Bicocca

Biometrika

Flexible control of the median of the false discovery proportion

We introduce a multiple testing procedure that controls the median of the proportion of false discoveries in a flexible way. The procedure only requires a vector of p-values as input and is comparable to the Benjamini–Hochberg method, which controls the mean of the proportion of false discoveries. Our method allows free choice of one or several values of alpha after seeing the data, unlike the Benjamini–Hochberg procedure, which can be very anti-conservative when alpha is chosen post hoc. We prove these claims and illustrate them with simulations. Our procedure is inspired by a popular estimator of the total number of true hypotheses. We adapt this estimator to provide simultaneously median unbiased estimators of the proportion of false discoveries, valid for finite samples. This simultaneity allows for the claimed flexibility. Our approach does not assume independence. The time complexity of our method …
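The baseline the abstract compares against, the Benjamini–Hochberg step-up procedure controlling the mean of the false discovery proportion, can be sketched directly; the p-values below are made up for illustration.

```python
# The Benjamini–Hochberg step-up rule, which controls the *mean* of the false
# discovery proportion at level alpha; the procedure in the paper instead
# controls its median. The p-values are hypothetical.
def benjamini_hochberg(pvalues, alpha):
    """Return indices of hypotheses rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # largest rank k (1-indexed) with p_(k) <= k * alpha / m
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, alpha=0.05))  # → [0, 1]
```

Note that running BH at several post hoc choices of alpha invalidates its guarantee, which is exactly the flexibility the median-controlling procedure is designed to provide.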

Jonathan Terhorst

University of Michigan

Biometrika

A linear adjustment-based approach to posterior drift in transfer learning

We present new models and methods for the posterior drift problem where the regression function in the target domain is modelled as a linear adjustment, on an appropriate scale, of that in the source domain, and study the theoretical properties of our proposed estimators in the binary classification problem. The core idea of our model inherits the simplicity and the usefulness of generalized linear models and accelerated failure time models from the classical statistics literature. Our approach is shown to be flexible and applicable in a variety of statistical settings, and can be adopted for transfer learning problems in various domains including epidemiology, genetics and biomedicine. As concrete applications, we illustrate the power of our approach (i) through mortality prediction for British Asians by borrowing strength from similar data from the larger pool of British Caucasians, using the UK Biobank data, and (ii …