I am a research scientist and entrepreneur focused on developing and using machine learning techniques to improve quality of life. I work to help radiologists and other clinicians increase their diagnostic accuracy through the assistance of deep learning technology for medical image analysis. Additionally, my academic research focuses on using statistical machine learning methods to build "smart" tutoring systems for students.
2015 - Present
Chief Scientist, Co-Founder
Imagen is a machine learning startup that has raised $21 million in funding and recruited notable scientific advisors such as Michael Mozer, Rob Fergus, Serge Belongie, and Rich Zemel. I am the technical founder of Imagen and lead a team of scientists toward our goal of improving the diagnostic accuracy of radiologists. I am responsible for the scientific direction of the company and guide the full pipeline of machine learning product development, including ideation and prototyping, labeled dataset acquisition and quality control, deep learning model development and evaluation, clinical trial design and execution, and interactions with the FDA concerning our scientific work.
2014 - 2015
Deep Learning Consultant
Deep learning and statistical machine learning consulting for the U.S. defense industry.
2014 - 2015
Machine Learning Research Scientist
I developed novel computer vision and deep learning models for face verification. They are designed to be resistant to impostor attacks and, unlike most other biometric security systems, work on-device (without internet access). The models are being used in many consumer electronic devices, including LG phones.
2008 - 2014
University of Colorado, Boulder
PhD, Computer Science
Advisor: Michael Mozer
NSF Graduate Research Fellow
2005 - 2008
Rensselaer Polytechnic Institute
BS, Dual Major in Computer Science and Philosophy
Summa Cum Laude
Advisor: Wayne Gray
Michael Mozer, Denis Kazakov, Robert Lindsey
Abstract We investigate recurrent neural network architectures for event-sequence processing. Event sequences, characterized by discrete observations stamped with continuous-valued times of occurrence, are challenging due to the potentially wide dynamic range of relevant time scales as well as interactions between time scales. We describe four forms of inductive bias that should benefit architectures for event sequences: temporal locality, position and scale homogeneity, and scale interdependence. We extend the popular gated recurrent unit (GRU) architecture to incorporate these biases via intrinsic temporal dynamics, obtaining a continuous-time GRU. The CT-GRU arises by interpreting the gates of a GRU as selecting a time scale of memory, and the CT-GRU generalizes the GRU by incorporating multiple time scales of memory and performing context-dependent selection of time scales for information storage and retrieval. Event time-stamps drive decay dynamics of the CT-GRU, whereas they serve as generic additional inputs to the GRU. Despite the very different manner in which the two models consider time, their performance on eleven data sets we examined is essentially identical. Our surprising results point both to the robustness of GRU and LSTM architectures for handling continuous time, and to the potency of incorporating continuous dynamics into neural architectures.
Big Data in Cognitive Science
Michael Mozer, Robert Lindsey
Abstract Cognitive psychology has long had the aim of understanding mechanisms of human memory, with the expectation that such an understanding will yield practical techniques that support learning and retention. Although research insights have given rise to qualitative advice for students and educators, we present a complementary approach that offers quantitative, individualized guidance. Our approach synthesizes theory-driven and data-driven methodologies. Psychological theory characterizes basic mechanisms of human memory shared among members of a population, whereas machine-learning techniques use observations from a population to make inferences about individuals. We argue that despite the power of big data, psychological theory provides essential constraints on models. We present models of forgetting and spaced practice that predict the dynamic time-varying knowledge state of an individual student for specific material. We incorporate these models into retrieval-practice software to assist students in reviewing previously mastered material. In an ambitious year-long intervention in a middle-school foreign language course, we demonstrate the value of systematic review on long-term educational outcomes, but more specifically, the value of adaptive review that leverages data from a population of learners to personalize recommendations based on an individual’s study history and past performance.
Mohammad Khajah, Robert Lindsey, Michael Mozer
Best Paper Award
Available on GitHub
Abstract In theoretical cognitive science, there is a tension between highly structured models whose parameters have a direct psychological interpretation and highly complex, general-purpose models whose parameters and representations are difficult to interpret. The former typically provide more insight into cognition but the latter often perform better. This tension has recently surfaced in the realm of educational data mining, where a deep learning approach to predicting students’ performance as they work through a series of exercises (termed deep knowledge tracing, or DKT) has demonstrated a stunning performance advantage over the mainstay of the field, Bayesian knowledge tracing, or BKT. In this article, we attempt to understand the basis for DKT’s advantage by considering the sources of statistical regularity in the data that DKT can leverage but which BKT cannot. We hypothesize four forms of regularity that BKT fails to exploit: recency effects, the contextualized trial sequence, inter-skill similarity, and individual variation in ability. We demonstrate that when BKT is extended to allow it more flexibility in modeling statistical regularities, using extensions previously proposed in the literature, BKT achieves a level of performance indistinguishable from that of DKT. We argue that while DKT is a powerful, useful, general-purpose framework for modeling student learning, its gains do not come from the discovery of novel representations, the fundamental advantage of deep learning. To answer the question posed in our title, knowledge tracing may be a domain that does not require ‘depth’; shallow models like BKT can perform just as well and offer us greater interpretability and explanatory power.
Mohammad Khajah, Brett Roads, Robert Lindsey, Yun-En Liu, Michael Mozer
Abstract We use Bayesian optimization methods to design games that maximize user engagement. Participants are paid to try a game for several minutes, at which point they can quit or continue to play voluntarily with no further compensation. Engagement is measured by player persistence, projections of how long others will play, and a post-game survey. Using Gaussian process surrogate-based optimization, we conduct efficient experiments to identify game design characteristics (specifically those influencing difficulty) that lead to maximal engagement. We study two games requiring trajectory planning; the difficulty of each is determined by a three-dimensional continuous design space. Two of the design dimensions manipulate the game in a user-transparent manner (e.g., the spacing of obstacles); the third does so in a subtle and possibly covert manner (incremental trajectory corrections). Converging results indicate that overt difficulty manipulations are effective in modulating engagement only when combined with the covert manipulation, suggesting the critical role of a user’s self-perception of competence.
NIPS 2016 Workshop on Machine Learning for Education
Kevin Wilson, Xiaolu Xiong, Mohammad Khajah, Robert Lindsey, Siyuan Zhao, Yan Karklin, Eric G. Van Inwegen, Bojian Han, Chaitanya Ekanadham, Joseph E. Beck, Neil Heffernan, Michael C. Mozer
Robert Lindsey, Mohammad Khajah, Michael Mozer
Available on GitHub
Abstract To master a discipline such as algebra or physics, students must acquire a set of cognitive skills. Traditionally, educators and domain experts use intuition to determine what these skills are and then select practice exercises to hone a particular skill. We propose a technique that uses student performance data to automatically discover the skills needed in a discipline. The technique assigns a latent skill to each exercise such that a student’s expected accuracy on a sequence of same-skill exercises improves monotonically with practice. Rather than discarding the skills identified by experts, our technique incorporates a nonparametric prior over the exercise-skill assignments that is based on the expert-provided skills and a weighted Chinese restaurant process. We test our technique on datasets from five different intelligent tutoring systems designed for students ranging in age from middle school through college. We obtain two surprising results. First, in three of the five datasets, the skills inferred by our technique support significantly improved predictions of student performance over the expert-provided skills. Second, the expert-provided skills have little value: our technique predicts student performance nearly as well when it ignores the domain expertise as when it attempts to leverage it. We discuss explanations for these surprising results and also the relationship of our skill-discovery technique to alternative approaches.
Psychonomic Bulletin & Review
Sean Kang, Robert Lindsey, Michael Mozer, Harold Pashler
Abstract If multiple opportunities are available to review to-be-learned material, should a review occur soon after initial study and recur at progressively expanding intervals, or should the reviews occur at equal intervals? Landauer and Bjork (1978) argued for the superiority of expanding intervals, whereas more recent research has often failed to find any advantage. However, these prior studies have generally compared expanding versus equal-interval training within a single session, and have assessed effects only upon a single final test. We argue that a more generally important goal would be to maintain high average performance over a considerable period of training. For the learning of foreign vocabulary spread over four weeks, we found that expanding retrieval practice (i.e., sessions separated by increasing numbers of days) produced recall equivalent to that from equal-interval practice on a final test given eight weeks after training. However, the expanding schedule yielded much higher average recallability over the whole training period.
Robert Lindsey, Jeff Shroyer, Harold Pashler, Michael Mozer
Abstract Human memory is imperfect; thus, periodic review is required for the long-term preservation of knowledge and skills. However, students at every educational level are challenged by an ever-growing amount of material to review and an ongoing imperative to master new material. We developed a method for efficient, systematic, personalized review that combines statistical techniques for inferring individual differences with a psychological theory of memory. The method was integrated into a semester-long middle school foreign language course via retrieval-practice software. In a cumulative exam administered after the semester’s end that compared time-matched review strategies, personalized review yielded a 16.5% boost in course retention over current educational practice (massed study) and a 10.0% improvement over a one-size-fits-all strategy for spaced study.
Maximizing students' retention via spaced review: Practical guidance from computational models of memory
Topics in Cognitive Science
Mohammad Khajah, Robert Lindsey, Michael Mozer
Abstract During each school semester, students face an onslaught of material to be learned. Students work hard to achieve initial mastery of the material, but when they move on, the newly learned facts, concepts, and skills degrade in memory. Although both students and educators appreciate that review can help stabilize learning, time constraints result in a trade-off between acquiring new knowledge and preserving old knowledge. To use time efficiently, when should review take place? Experimental studies have shown benefits to long-term retention with spaced study, but little practical advice is available to students and educators about the optimal spacing of study. The dearth of advice is due to the challenge of conducting experimental studies of learning in educational settings, especially where material is introduced in blocks over the time frame of a semester. In this study, we turn to two established models of memory (ACT-R and MCM) to conduct simulation studies exploring the impact of study schedule on long-term retention. Based on the premise of a fixed time each week to review, converging evidence from the two models suggests that an optimal review schedule obtains significant benefits over haphazard (suboptimal) review schedules. Furthermore, we identify two scheduling heuristics that obtain near optimal review performance: (a) review the material from μ-weeks back, and (b) review material whose predicted memory strength is closest to a particular threshold. The former has implications for classroom instruction and the latter for the design of digital tutors.
Mohammad Khajah, Rowan Wing, Robert Lindsey, Michael Mozer
Best Paper Award
Abstract An effective tutor—human or digital—must determine what a student does and does not know. Inferring a student’s knowledge state is challenging because behavioral observations (e.g., correct vs. incorrect problem solution) provide only weak evidence. Two classes of models have been proposed to address the challenge. Latent-factor models employ a collaborative filtering approach in which data from a population of students solving a population of problems is used to predict the performance of an individual student on a specific problem. Knowledge-tracing models exploit a student’s sequence of problem-solving attempts to determine the point at which a skill is mastered. Although these two approaches are complementary, only preliminary, informal steps have been taken to integrate them. We propose a principled synthesis of the two approaches in a hierarchical Bayesian model that predicts student performance by integrating a theory of the temporal dynamics of learning with a theory of individual differences among students and problems. We present results from three data sets from the DataShop repository indicating that the integrated architecture outperforms either alone. We find significant predictive value in considering the difficulty of specific problems (within a skill), a source of information that has rarely been exploited.
Abstract This thesis uses statistical machine learning techniques to construct predictive models of human learning and to improve human learning by discovering optimal teaching methodologies. In Chapters 2 and 3, I present and evaluate models for predicting the changing memory strength of material being studied over time. The models combine a psychological theory of memory with Bayesian methods for inferring individual differences. In Chapter 4, I develop methods for delivering efficient, systematic, personalized review using the statistical models. Results are presented from three large semester-long experiments with middle school students which demonstrate how this “big data” approach to education yields substantial gains in the long-term retention of course material. In Chapter 5, I focus on optimizing various aspects of instruction for populations of students. This involves a novel experimental paradigm which combines Bayesian nonparametric modeling techniques and probabilistic generative models of student performance. In Chapters 6 and 7, I present supporting laboratory behavioral studies and theoretical analyses. These include an examination of the relationship between study format and the testing effect and a parsimonious theoretical account of long-term recency effects.
Robert Lindsey, Michael Mozer, Bill Huggins, Harold Pashler
Oral Presentation (1% acceptance rate)
Abstract Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena.
Maximizing students' retention via spaced review: Practical guidance from computational models of memory
Mohammad Khajah, Robert Lindsey, Michael Mozer
Abstract During each school semester, students face an onslaught of material to be learned. Students work hard to achieve initial mastery of the material, but when they move on, the newly learned facts, concepts, and skills degrade in memory. Although both students and educators appreciate that review can help stabilize learning, time constraints result in a trade-off between acquiring new knowledge and preserving old knowledge. To use time efficiently, when should review take place? Experimental studies have shown benefits to long-term retention with spaced study, but little practical advice is available to students and educators about the optimal spacing of study. The dearth of advice is due to the challenge of conducting experimental studies of learning in educational settings, especially where material is introduced in blocks over the time frame of a semester. In this study, we turn to two established models of memory—ACT-R and MCM—to conduct simulation studies exploring the impact of study schedule on long-term retention. Based on the premise of a fixed time each week to review, converging evidence from the two models suggests that an optimal review schedule obtains significant benefits over haphazard (suboptimal) review schedules. Furthermore, we identify two scheduling heuristics that obtain near optimal review performance: (a) review the material from μ-weeks back, and (b) review material whose predicted memory strength is closest to a particular threshold. The former has implications for classroom instruction and the latter for the design of digital tutors.
Robert Lindsey, William Headden, Michael Stipicevic
Abstract Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.
Robert Lindsey, Erik Polsdofer, Michael Mozer, Sean Kang, Harold Pashler
Abstract When tested on a list of items, individuals show a recency effect: the more recently a list item was presented, the more likely it is to be recalled. For short interpresentation intervals (IPIs) and retention intervals (RIs), this effect may be attributable to working memory. However, recency effects also occur over long timescales where IPIs and RIs stretch into the weeks and months. These long-term recency (LTR) effects have intrigued researchers because of their scale-invariant properties and the sense that understanding the mechanisms of LTR will provide insights into the fundamental nature of memory. An early explanation of LTR posited that it is a consequence of memory trace decay, but this decay hypothesis was discarded in part because LTR was not observed in continuous distractor recognition memory tasks (Glenberg & Kraus, 1981; Bjork & Whitten, 1974; Poltrock & MacLeod, 1977). Since then, a diverse collection of elaborate mechanistic accounts of LTR have been proposed. In this article, we revive the decay hypothesis. Based on the uncontroversial assumption that forgetting occurs according to a power-law function of time, we argue that not only is the decay hypothesis a sufficient qualitative explanation of LTR, but also that it yields excellent quantitative predictions of LTR strength as a function of list size, test type, IPI, and RI. Through fits to a simple model, this article aims to bring resolution to the subject of LTR by arguing that LTR is nothing more than ordinary forgetting.
Michael Mozer, Harold Pashler, Matt Wilder, Robert Lindsey, Matt Jones, Michael Jones
Abstract For over half a century, psychologists have been struck by how poor people are at expressing their internal sensations, impressions, and evaluations via rating scales. When individuals make judgments, they are incapable of using an absolute rating scale, and instead rely on reference points from recent experience. This relativity of judgment limits the usefulness of responses provided by individuals to surveys, questionnaires, and evaluation forms. Fortunately, the cognitive processes that transform internal states to responses are not simply noisy, but rather are influenced by recent experience in a lawful manner. We explore techniques to remove sequential dependencies, and thereby decontaminate a series of ratings to obtain more meaningful human judgments. In our formulation, decontamination is fundamentally a problem of inferring latent states (internal sensations) which, because of the relativity of judgment, have temporal dependencies. We propose a decontamination solution using a conditional random field with constraints motivated by psychological theories of relative judgment. Our exploration of decontamination models is supported by two experiments we conducted to obtain ground-truth rating data on a simple length estimation task. Our decontamination techniques yield an over 20% reduction in the error of human judgments.
Robert Lindsey, Owen Lewis, Harold Pashler, Michael Mozer
Abstract Testing students as they study a set of facts is known to enhance their learning (Roediger & Karpicke, 2006). Testing also provides tutoring software with potentially valuable information regarding the extent to which a student has mastered study material. This information, consisting of recall accuracies and response latencies, can in principle be used by tutoring software to provide students with individualized instruction by allocating a student’s time to the facts whose further study it predicts would provide greatest benefit. In this paper, we propose and evaluate several algorithms that tackle the benefit-prediction aspect of this goal. Each algorithm is tasked with calculating the likelihood a student will recall facts in the future given recall accuracy and response latencies observed in the past. The disparate algorithms we tried, which range from logistic regression to a Bayesian extension of the ACT-R declarative memory module, proved to all be roughly equivalent in their predictive power. Our modeling work demonstrates that, although response latency is predictive of future test performance, it yields no predictive power beyond that which is held in response accuracy.
Michael Mozer, Harold Pashler, Nicholas Cepeda, Robert Lindsey, Ed Vul
Abstract When individuals learn facts (e.g., foreign language vocabulary) over multiple study sessions, the temporal spacing of study has a significant impact on memory retention. Behavioral experiments have shown a nonmonotonic relationship between spacing and retention: short or long intervals between study sessions yield lower cued-recall accuracy than intermediate intervals. Appropriate spacing of study can double retention on educationally relevant time scales. We introduce a Multiscale Context Model (MCM) that is able to predict the influence of a particular study schedule on retention for specific material. MCM’s prediction is based on empirical data characterizing forgetting of the material following a single study session. MCM is a synthesis of two existing memory models (Staddon, Chelaru, & Higa, 2002; Raaijmakers, 2003). On the surface, these models are unrelated and incompatible, but we show they share a core feature that allows them to be integrated. MCM can determine study schedules that maximize the durability of learning, and has implications for education and training. MCM can be cast either as a neural network with inputs that fluctuate over time, or as a cascade of leaky integrators. MCM is intriguingly similar to a Bayesian multiscale model of memory (Kording, Tenenbaum, & Shadmehr, 2007), yet MCM is better able to account for human declarative memory.
Robert Lindsey, Michael Mozer, Nicholas Cepeda, Harold Pashler
Abstract When individuals learn facts (e.g., foreign language vocabulary) over multiple sessions, the durability of learning is strongly influenced by the temporal distribution of study (Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006). Computational models have been developed to explain this phenomenon known as the distributed practice effect. These models predict the accuracy of recall following a particular study schedule and retention interval. To the degree that the models embody mechanisms of human memory, they can also be used to determine the spacing of study that maximizes retention. We examine two memory models (Pavlik & Anderson, 2005; Mozer, Pashler, Lindsey, & Vul, submitted) that provide differing explanations of the distributed practice effect. Although both models fit experimental data, we show that they make robust and opposing predictions concerning the optimal spacing of study sessions. The Pavlik and Anderson model robustly predicts that contracting spacing is best over a range of model parameters and retention intervals; that is, with three study sessions, the model suggests that the lag between sessions one and two should be larger than the lag between sessions two and three. In contrast, the Mozer et al. model predicts equal or expanding spacing is best for most material and retention intervals. The limited experimental data pertinent to this disagreement appear to be consistent with the latter prediction. The strong contrast between the models calls for further empirical work to evaluate their opposing predictions.
Robert Lindsey, Michael Stipicevic, Dan Veksler, Wayne Gray
Published as an undergraduate
Abstract We describe Vector Generation from Explicitly-defined Multidimensional semantic Space (VGEM), a method for converting a measure of semantic relatedness (MSR) into vector form. We also describe Best path Length on a Semantic Self-Organizing Map (BLOSSOM), a semantic relatedness technique employing VGEM and a connectionist, nonlinear dimensionality reduction technique. The psychological validity of BLOSSOM is evaluated using test cases from a large free-association norms dataset; we find that BLOSSOM consistently shows improvement over VGEM. BLOSSOM matches the performance of its base-MSR using a 21 dimensional vector-space and shows promise to outperform its base-MSR with a more rigorous exploration of the parameter space. In addition, BLOSSOM provides benefits such as document relatedness, concept-path formation, intuitive visualizations, and unsupervised text clustering.
Robert Lindsey, Dan Veksler, Alex Grintsvayg, Wayne Gray
Published as an undergraduate
Abstract Measures of Semantic Relatedness (MSRs) provide models of human semantic associations and, as such, have been applied to predict human text comprehension (Lemaire, Denhiere, Bellissens, & Jhean-Larose, 2006). In addition, MSRs form key components in more integrated cognitive modeling such as models that perform information search on the World Wide Web (WWW) (Pirolli, 2005). However, the effectiveness of an MSR depends on the algorithm it uses as well as the text corpus on which it is trained. In this paper, we examine the impact of corpus selection on the performance of two popular MSRs, Pointwise Mutual Information and Normalised Google Distance. We tested these measures with corpora derived from the WWW, books, news articles, emails, web-forums, and encyclopedia. Results indicate that for the tested MSRs, the traditionally employed books and WWW-based corpora are less than optimal, and that using a corpus based on the New York Times news articles best predicts human behavior.
Alex Grintsvayg, Dan Veksler, Robert Lindsey, Wayne Gray
Published as an undergraduate
Dan Veksler, Alex Grintsvayg, Robert Lindsey, Wayne Gray
Published as an undergraduate