Data Repositories


The LearnLab DataShop is a data repository and web application for learning science researchers. It provides secure data storage as well as an array of analysis and visualization tools available through a web-based interface. DataShop was funded by National Science Foundation grants (SBE-0836012, SBE-0354420) to LearnLab.


DataShop 10.5 Updated! (12/09/19)


DataStage is provided by the Vice Provost Office for Online Learning (VPOL) at Stanford, which facilitates the teaching of online classes. The instruction delivery platforms are instrumented to collect a variety of data on participants' interactions with the study material. Examples include participants manipulating video players as they view portions of a class, solution submissions to problem sets, use of the online forum available for some classes, peer grading activities, and some demographic data. Through DataStage, VPOL makes some of this data available for research on learning processes and for explorations into improving instruction.


MITx course data, in a variety of forms, is made available for research purposes by the Institutional Research section of the Office of the Provost at MIT. The release of data from MITx courses is subject to compliance with student privacy regulations. Researchers may request access to Learner Data for research to improve teaching and curriculum or contribute to scholarship on teaching and learning.

The ASSISTments data repository contains datasets from secondary school students' interactions with an online tutoring system, in many cases collected as part of online experiments on what works best for learning. You can also submit studies at www.assistmentstestbed.org and find extensive information there on how to interpret your data.


The Databrary project aims to promote data sharing, archiving, and reuse among researchers who study the development of humans and other animals. The project focuses on creating tools for scientists to store, manage, preserve, analyze, and share video and other temporally dense streams of data. The project is based at New York University and at Penn State. The U.S. National Science Foundation (NSF BCS-1238599) and the U.S. National Institutes of Health (NIH U01-HD-076595) have provided the funding for this project.


TalkBank is an interdisciplinary research project to promote the study of human and animal communication. The subfields of study include first language acquisition, second language acquisition, conversation analysis, classroom discourse and aphasic language. TalkBank has been funded by grants from the National Science Foundation (including BCS-998009, 0324883) as well as the National Institutes of Health.


The Child Language Data Exchange System (CHILDES) is the part of TalkBank focused on child language, or first language acquisition. CHILDES provides tools for studying conversational interactions, including a transcripts database, programs for analyzing transcripts, methods for linguistic coding and systems for linking audio and video. CHILDES is supported by grants from the National Institutes of Health (R01-HD23998, R01-HD051698).

MITx and HarvardX Dataverse

The MITx and HarvardX Dataverse contains deidentified student-level data from the first year of HarvardX and MITx courses.

Computer Science Education Workshop

The Computer Science Education Workshop was held in Pittsburgh on June 5-6, 2017. This document outlines the discussion around Data, Analytics and Tool sharing.

Lastinger Center at The University of Florida

The University of Florida Lastinger Center is an education innovation hub that blends cutting-edge academic research and practice to transform education. We create equitable educational systems where every child and educator, regardless of circumstances, experiences high-quality learning every day to support children’s achievement of critical milestones that are predictive of success in school and life.

Head Start Data Analytics Playbook

The Head Start Data Analytics Playbook is a collection of data visualizations used by your very own colleagues in Head Start. Have you ever said to yourself, “I wonder how others looked at their data about…”? Now, you don’t have to just wonder. Go to the Playbook and you’ll see examples of how others answered the same questions you have. Each example includes a picture of the data used, details of what you’re looking at, and a story of how the program used the information. The examples also include the technical details you need to be able to replicate the example using your own data.

University of Pittsburgh English Language Institute Corpus (PELIC)

The University of Pittsburgh English Language Institute Corpus (PELIC) is a large learner corpus of written and spoken texts. These texts were collected in an English for Academic Purposes (EAP) context over seven years in the University of Pittsburgh’s Intensive English Program, and were produced by students with a wide range of linguistic backgrounds and proficiency levels. PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting. An overview of PELIC is available here.

Processing & Analytics


The MOOCdb project aims to bring together education researchers, computer science researchers, machine learning researchers, technologists, and database and big data experts to advance MOOC data science. The project, founded at MIT, includes a platform-agnostic functional data model for data exhaust from MOOCs; a collaborative, open-source, open-access data visualization framework; a crowdsourced knowledge discovery framework; and a privacy-preserving software framework. The team is currently working to release a number of these tools and frameworks as open source.


DiscourseDB is a data infrastructure project in the space of collaborative and discussion-based learning. It aims to provide a common data model that accommodates diverse sources including, but not limited to, chat, threaded discussions, blogs, Twitter, wikis and text messaging. In the future, the project will make available analytics to support research questions about the mediating and moderating effects of role taking, help exchange, collaborative knowledge construction and other factors.


There are additional DiscourseDB instances at three other sites: University of Toronto, SUNY Albany and Beijing Normal University (BNU). Contact us for information on accessing these instances.

MOOCdb Learner Project

Each time a learner interacts with an e-learning system, it is possible to capture a record of their engagement. Data comprising mouse clicks, video controls, problem responses, programming, collaborations and discussions then becomes available to learning science. MLP’s goal is to tap into the immense potential of this data to provide insights into how students learn and how instructors can effectively teach. The challenge is to provide technology and develop new approaches that transform this fundamentally different set of observations into actionable knowledge.

DataShop External Tools

Free tools submitted by developers in the educational data mining and intelligent tutoring systems communities.


The Simon DataLab is an emerging intellectual data commons to drive continuous improvement in student learning outcomes with a particular focus on supporting instructors and course developers in using data to improve their courses.

EDM Workbench

The Educational Data Mining Workbench will support learning scientists in performing a number of analytic tasks, including 1) defining and modifying behavior categories of interest (e.g., gaming, unresponsiveness, off-task conversation, help avoidance), 2) labeling previously collected educational log data with the categories of interest, 3) validating inter-rater reliability between multiple labelers of the same educational log data corpus, and 4) running the labeled data through a machine-learning tool, such as WEKA or RapidMiner.

LearnPlatform

LearnPlatform is a comprehensive edtech management and rapid-cycle evaluation system for educators and administrators to organize, streamline and analyze their classroom technology to improve instructional, operational and financial decisions.


Cortex is an integrated SIS and LMS that creates personalized learning progressions, allowing students to own their data and drive their own learning. With a focus on mastery of skills, Cortex lets teachers track mastery against the Common Core or other state-based learning standards. Students and teachers can track completion of a progression across goals and subjects through a common data visualization of content mastery.

The Learner Data Institute

The Learner Data Institute (LDI) is a National Science Foundation (NSF)-funded project to lay the foundation of a data science institute for learner data. The institute is a collaboration among academia, industry and government with a mission to harness the data revolution to further our understanding of how people learn, how to improve adaptive instructional systems (AISs), and how to make emerging learning ecologies that include online and blended learning with AISs more effective, efficient, engaging, and affordable.


Unizin is a learning data and analytics platform for higher education that enables institutions to capture, own, use, and capitalize on their learning data and analytics. The platform is now available to any institution via Google Cloud. Read more here.

Generate Data

Content Providers

Educational Technology Development Tools

Assessment and Tutoring