Speakers

Lucas Beyer

Google Brain

Stella Biderman

EleutherAI

Executive Director

Aditi Raghunathan

Carnegie Mellon University

Assistant Professor

Topics and Theme

This is the fifth edition of a highly successful series of workshops focused on data-centric AI, following the Data-Centric AI workshops at NeurIPS 2021, ICML 2022, ICML 2023, and ICLR 2024.

Theme

Large-scale foundation models are revolutionizing machine learning, particularly in the vision and language domains. While model architecture received significant attention in the past, recent focus has shifted towards the importance of data quality, size, diversity, and provenance.

This workshop aims to highlight cutting-edge advancements in data-centric approaches for large-scale foundation models in new domains beyond language and vision, and to engage the vibrant interdisciplinary community of researchers, practitioners, and engineers who tackle practical data challenges related to foundation models. By featuring innovative research and facilitating collaboration, it aims to bridge the gap between dataset-centric methodologies and the development of robust, versatile foundation models that can work in and across a variety of domains in service of humanity.

Topics will include, but are not limited to:

Data sources for large-scale datasets
Construction of datasets from large quantities of unlabeled/uncurated data
Model-assisted dataset construction
Quality signals for large-scale datasets
Datasets for evaluation
Datasets for specific applications
Impact of dataset drifts in large-scale models
Ethical considerations for and governance of large-scale datasets
Data curation and HCI
Submissions to benchmarks such as DataPerf, DynaBench, and DataComp

If you are looking for examples of works previously presented at DMLR, you can find a list of papers here.

You can find an overview of the history and vision behind DMLR, including links to previous keynotes, in our editorial DMLR: Data-centric Machine Learning Research – Past, Present and Future.

Awards

A few exceptional research papers from the DMLR workshop 2024 will be invited to submit extended versions to the DMLR journal, the latest member of the JMLR family, which aims to provide a top archival venue for high-quality scholarly articles focused on the data aspect of machine learning research.

Logistics

Important Dates

(Time zone: Anywhere on Earth)

Paper Submission deadline: ~~May 24, 2024~~ May 30, 2024
Notification of Acceptance: June 17, 2024
Camera Ready Copy due: July 12, 2024

Session organization: virtual + in-person engagement

We aim for a discussion-centric workshop that allows in-depth coverage of state-of-the-art and work-in-progress efforts, through panel discussions and poster presentations, along the data lifecycle in machine learning research and engineering: creation, quality and processing, governance and management/infrastructure.

The workshop will be organized in four components:

Keynotes and invited talks
Open panel discussions
Poster sessions
Networking sessions

For accepted papers, please see the details on the Program page.

About DMLR

DMLR is an open, distributed community organizing activities to discuss and advance research in data-centric machine learning.

We organize workshops and research retreats, maintain a journal, and run a working group at Machine Learning Commons (MLC) to support infrastructure projects.

You can find more details about the scope and history of our activities in the editorial DMLR: Data-centric Machine Learning Research – Past, Present and Future.