BIOS Colloquium - Spring 2020

Organizer: A. Chapple, PhD

Speaker: Sameer Deshpande, PhD
Affiliation: Postdoctoral Researcher at MIT, Cambridge, MA
Title: Bayesian Clustering with Particle Optimization
Date and Time: January 13th, 2020, 3:00PM
Room: LEC, Room 303
Abstract: In this talk, we describe a new optimization approach to Bayesian clustering that simultaneously identifies multiple high posterior probability clusterings. We illustrate this method with a case study about the spatiotemporal variation in crime patterns in the city of Philadelphia. Bayesian hierarchical modeling is a natural way to study spatial variation in urban crime dynamics at the neighborhood level, since it facilitates principled “sharing of information” between spatially adjacent neighborhoods. Typically, however, cities contain many physical and social boundaries that may manifest as spatial discontinuities in crime patterns. In this setting, standard prior choices often yield overly-smooth parameter estimates and miscalibrated forecasts. To prevent potential over-smoothing, we introduce a prior that first partitions the set of neighborhoods into several clusters and then encourages spatial smoothness within each cluster. In terms of model implementation, conventional stochastic search techniques are computationally prohibitive as they must traverse a combinatorially vast space of partitions. At a high level, our proposed particle optimization approach runs several “mutually aware” greedy searches that are discouraged from visiting the same point in the latent discrete space.

January 20th, OFF MLK Day

Speaker: Lynn LaMotte, PhD
Affiliation: Biostatistics Program, LSUHSC, New Orleans, LA
Title: A Look Back at Some People and Currents in Statistics
Date and Time: February 3rd, 2020, 3:00PM
Room: LEC, Room 303
Abstract: My hope in this talk is to mention some of the currents, people, controversies, and institutions that have shaped our discipline in the relatively few generations that it has existed. Some have said that modern applied statistics began with Karl Pearson’s 1900 paper on the chi-squared distribution. That paper spawned a famous feud with R. A. Fisher over degrees of freedom. For a field mostly regarded as dry and uncontroversial, it is perhaps surprising that there should have arisen several fierce, even personal, disputes. In addition to disputes, it is interesting to trace some of the institutions and bloodlines that have influenced the spread and development of our discipline. I’ll present some documentation and commentary that I have assembled on these topics. Sources are easy to find online. In addition, there are still people around who have lots of interesting anecdotes about the people and places that shaped modern statistics. I hope this discussion might stimulate you to take some time to explore your academic roots, too.

Speaker: Kevin Potcner
Affiliation: JMP Pro
Title: Data Visualization, Analysis and Modeling with JMP Pro
Date and Time: February 6th, 2020, 12:00PM
Room: LEC, Room 303
Abstract: JMP is an easy-to-use, standalone statistics and graphics software from SAS Institute. It includes comprehensive capabilities for every academic field, and its interactive point-and-click interface and linked analyses and graphics make it ideal for research and for use in the classroom, from the introductory to the advanced levels. JMP runs on Windows and Macintosh operating systems and also functions as an interface to SAS®, R, Python, MATLAB and Excel. Come and see how to use JMP for data summary, analysis, visualization, and predictive modeling.

Speaker: Hua He, PhD
Affiliation: Biostatistics Department, School of Public Health and Tropical Medicine, Tulane University
Title: Statistical Tests for Latent Class in Censored Data due to Detection Limit
Date and Time: February 10th, 2020, 3:00PM
Room: LEC, Room 303
Abstract: Measures of substance concentration in urine, serum or other biological matrices often have an assay limit of detection. When concentration levels fall below the limit, the exact measures cannot be obtained. Instead, the measures are censored, as the only information known is that the levels are under the limit. Assuming the concentration levels are from a single population with a normal distribution or follow a normal distribution after some transformation, Tobit regression models, or censored normal regression models, are the standard approach for analyzing such data. However, in practice, the data often exhibit more censored observations than would be expected under the Tobit regression models. One common cause is the heterogeneity of the study population, caused by the existence of a latent group of subjects who lack the substance measured. For such subjects, the measurements will always be under the limit. If a censored normal regression model is appropriate for modeling the subjects with the substance, the whole population follows a mixture of a censored normal regression model and a degenerate distribution of the latent class. While there are some studies on such mixture models, the fundamental question of whether such mixture modeling is necessary, i.e., whether such a latent class exists, has not yet been studied. In this talk, three tests (the Wald test, the likelihood ratio test, and the score test) are developed for testing for such a latent class. Simulation studies are conducted to evaluate the performance of the tests, and two real data examples are employed to illustrate the tests.
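The mixture described in the abstract can be sketched as a log-likelihood. This is only an illustrative, intercept-only version written for this listing, not the speaker's implementation: the function name `mixture_loglik`, the parameter `p` (the latent-class proportion), and the absence of covariates are all assumptions. A censored observation contributes `p + (1 - p) * Phi((limit - mu) / sigma)`, and an observed value contributes the normal density scaled by `1 - p`.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mixture_loglik(values, censored, limit, mu, sigma, p):
    """Log-likelihood of a censored normal mixed with a latent class.

    Hypothetical, intercept-only sketch (no covariates):
      values   -- measurements (ignored where the observation is censored)
      censored -- booleans, True when the value fell below the detection limit
      limit    -- detection limit
      mu, sigma -- mean and standard deviation of the normal component
      p        -- assumed proportion of subjects who lack the substance
    """
    total = 0.0
    for y, is_censored in zip(values, censored):
        if is_censored:
            # Below the limit: either in the latent class, or a normal
            # value that happened to fall under the limit.
            below = norm_cdf((limit - mu) / sigma)
            total += math.log(p + (1.0 - p) * below)
        else:
            # Observed value: must come from the normal component.
            total += math.log((1.0 - p) * norm_pdf((y - mu) / sigma) / sigma)
    return total
```

Setting `p = 0` recovers the ordinary censored normal (Tobit-type) likelihood, which is the null model that a test for the latent class would compare against.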

Speaker: Heping Zhang, PhD
Affiliation: Yale University, Department of Biostatistics, New Haven, CT
Title: Back to the Basics: Residuals and Diagnostics for Generalized Linear Models
Date and Time: February 17th, 2020, 3:00PM
Room: LEC, Room 303
Abstract: Ordinal outcomes are common in scientific research and everyday practice, and we often rely on regression models to make inference. A long-standing problem with such regression analyses is the lack of effective diagnostic tools for validating model assumptions. The difficulty arises from the fact that an ordinal variable has discrete values that are labeled with, but are not, numerical values. The values merely represent ordered categories. In this paper, we propose a surrogate approach to defining residuals for an ordinal outcome Y. The idea is to define a continuous variable S as a “surrogate” of Y and then obtain residuals based on S. For the general class of cumulative link regression models, we study the residual’s theoretical and graphical properties. We show that the residual has null properties similar to those of the common residuals for continuous outcomes. Our numerical studies demonstrate that the residual has power to detect misspecification with respect to 1) mean structures; 2) link functions; 3) heteroscedasticity; 4) proportionality; and 5) mixed populations. The proposed residual also enables us to develop numeric measures for goodness-of-fit using classical distance notions. Our results suggest that, compared to a previously defined residual, our residual can reveal deeper insights into model diagnostics. We stress that this work focuses on residual analysis, rather than hypothesis testing. The latter has limited utility as it only provides a single p-value, whereas our residual can reveal what components of the model are misspecified and advise how to make improvements. This is joint work with Dungang Liu, University of Cincinnati Lindner College of Business.

February 25th, OFF Mardi Gras

Speaker: Sandip Barui, PhD
Affiliation: Department of Math and Statistics, University of South Alabama, Mobile, AL
Title: Semiparametric Methods for Survival Data with Measurement Error under Additive Hazards Cure Rate Models
Date and Time: March 2nd, 2020, 3:00PM
Room: LEC, Room 303
Abstract: It is well established that measurement error has a drastically negative impact on data analysis. Not only can it bias parameter estimates, but it may also cause a loss of power when testing relationships between variables. Although survival analysis of error-contaminated data has attracted extensive interest, relatively little attention has been paid to dealing with survival data with error-contaminated covariates when the underlying population is characterized by a cured fraction. In this paper, we consider this problem, in which the lifetimes of the non-cured individuals are characterized by the additive hazards model and the measurement error process is described by an additive model. Unlike the proportional hazards model, which estimates the relative risk, the additive hazards model allows us to estimate the absolute risk difference associated with the covariates. To allow for model flexibility, we incorporate time-dependent covariates in the model. We develop estimation methods for the two scenarios, with and without measurement error. The proposed methods are evaluated from both theoretical and numerical perspectives.

Speaker: Yang Ni, PhD
Affiliation: Statistics Department, Texas A&M University, College Station, TX
Title: Bayesian nonparametric bi-clustering of microbiome data
Date and Time: March 9th, 2020, 3:00 pm
Room: LEC, Room 303
Abstract: We develop a novel Bayesian nonparametric bi-clustering algorithm for microbiome data. We propose a mixture model framework to dynamically dichotomize multinomial data into two categories. On top of the mixture layer, a double feature allocation model is imposed on the binary mixture indicators. The double feature allocation model clusters both observations and variables (OTUs). Moreover, it allows for overlapping clustering structures. We demonstrate the utility of our method with case studies.

Speaker: David Anderson, PhD
Affiliation: Mathematics Department, Xavier University of Louisiana, New Orleans, LA
Title: Multifractal and Gaussian Fractional Sum-Difference Models for Internet Traffic
Date and Time: March 16th, 2020, 3:00 pm
Room: LEC, Room 303
Abstract: A multifractal fractional sum-difference model (MFSD) is a monotone transformation of a Gaussian fractional sum-difference model (GFSD). The GFSD is the sum of two independent components: a moving sum of length two of discrete fractional Gaussian noise (fGn), and white noise. Internet traffic packet interarrival times are very well modeled by an MFSD in which the marginal distribution is Weibull; this is validated by extensive model checking for 715,665,213 measured arrival times on three Internet links. The simplicity of the model provides a mathematical tractability that results in a foundation for understanding the statistical properties of the arrival process. The current foundation is time scaling: properties of aggregate arrivals in successive equal-length time intervals and how the properties change with the interval length. This scaling is also the basis for the widely discussed multifractal wavelet models. The MFSD provides a more fundamental foundation that is based on how changes in the fGn and white noise components result in changes in the arrival process as various factors change, such as the aggregation time length or the traffic packet rate. Logistic models relate the MFSD model parameters to the packet rate, so only the rate needs to be specified in using the MFSD model to generate synthetic packet arrivals for network engineering simulation studies.

Speaker: Larry Smolinsky, PhD
Affiliation: Mathematics Department, Louisiana State University, Baton Rouge, LA
Title: A version of the Mantel-Haenszel statistic in altmetrics
Date and Time: March 30th, 2020, 3:00 pm
Room: LEC, Room 303
Abstract: Scientometrics includes analyzing information about publications in the sciences, including the performance of individuals and institutions. Many traditional measures are based on citation and publication data. An alternative measure of how a scientist or institution performs might be to measure its mentions in Wikipedia, Facebook, or Twitter or views on Mendeley, ResearchGate, or arXiv. Metrics based on alternative measures, in contrast to the more traditional measures, are called altmetrics. This altmetric data is sparse scientometric data. Lutz Bornmann and Robin Haunschild introduced an altmetric indicator to measure the relative performance of researchers, institutions, or other units in these nontraditional theaters. Their indicator, denoted MHq, is based on the Mantel-Haenszel statistic for odds ratios and risk ratios. The definition of MHq is a problematic variation of the Mantel-Haenszel statistic. It is not clear that MHq has a consistent philosophical meaning, and Bornmann and Haunschild’s published confidence interval is wrong. I will discuss MHq and its relationship to Mantel-Haenszel statistics for odds ratios and risk ratios.
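For background, the classical Mantel-Haenszel pooled odds ratio that the abstract refers to can be sketched as follows. This is the textbook estimator, not the MHq indicator discussed in the talk; the function name and the `(a, b, c, d)` layout of each stratum's 2x2 table are assumptions made for illustration. In an altmetrics setting one might, hypothetically, let strata be fields or publication years, rows be a unit versus a reference set, and columns be mentioned versus not mentioned.

```python
def mantel_haenszel_or(tables):
    """Classical Mantel-Haenszel pooled odds ratio over stratified 2x2 tables.

    tables -- iterable of (a, b, c, d) counts, one tuple per stratum, laid out as
                           outcome   no outcome
              group 1         a          b
              group 2         c          d
    (This layout is an assumption made for the sketch.)
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0.0:
        raise ValueError("pooled odds ratio is undefined: sum of b*c/n is zero")
    return numerator / denominator
```

With a single stratum the estimator reduces to the ordinary odds ratio `a*d / (b*c)`.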

Speaker: Yongli Sang, PhD
Affiliation: Mathematics Department, University of Louisiana at Lafayette, Lafayette, LA
Title: A Jackknife Empirical Likelihood Approach for K-sample Tests
Date and Time: April 13th, 2020, 3:00 pm
Room: LEC, Room 303
Abstract: The categorical Gini correlation is an alternative measure of dependence between a categorical and a numerical variable, which characterizes the independence of the variables. A nonparametric test for the equality of K distributions has been developed based on the categorical Gini correlation. By applying the jackknife empirical likelihood approach, the standard limiting chi-squared distribution with K-1 degrees of freedom is established and is used to determine the critical value and p-value of the test. Simulation studies show that the proposed method is competitive with existing methods in terms of the power of the tests in most cases. The proposed method is illustrated in an application to a real dataset.