Data Integration for Driver Telematics with Selection Biases |
Hashan Peiris (Simon Fraser University); Himchan Jeong (Simon Fraser University)*; Jae Kwang Kim (Iowa State University) |
pdf |
|
|
|
|
|
Accelerating Batch Active Learning Using Continual Learning Techniques |
Gantavya Bhatt (University of Washington, Seattle)*; Arnav Das (University of Washington); Megh M Bhalerao (University of Washington); Rui Yang (Memorial Sloan Kettering Cancer Center); Vianne R Gao (Weill Medical College); Jeff Bilmes (UW) |
pdf |
|
|
|
|
|
Taming Small-sample Bias in Low-budget Active Learning |
Linxin Song (Waseda University)*; Jieyu Zhang (University of Washington); Xiaotian Lu (Kyoto University); Tianyi Zhou (University of Maryland, College Park) |
pdf |
|
|
|
|
|
Training with Low-Label-Quality Data: Rank Pruning and Multi-Review |
Yue Xing (Michigan State University)*; Ashutosh Pandey (Meta Platforms); David Yan (Meta Platforms); Fei Wu (Meta); Michael Fronda (Meta Platforms); Pamela Bhattacharya (Meta Platforms) |
pdf |
|
|
|
|
|
Training on Thin Air: Improve Image Classification with Generated Data |
Yongchao Zhou (University of Toronto)*; Hshmat U Sahak (University of Toronto); Jimmy Ba (University of Toronto) |
pdf |
|
|
|
|
|
DMOps: Data Management Operations and Recipes |
Eujeong Choi (Upstage); Chanjun Park (Upstage)* |
pdf |
|
|
|
|
|
Inter-Annotator Agreement in the Wild: Uncovering Its Emerging Roles and Considerations in Real-World Scenarios |
NamHyeok Kim (Upstage); Chanjun Park (Upstage)* |
pdf |
|
|
|
|
|
Transcending Traditional Boundaries: Leveraging Inter-Annotator Agreement (IAA) for Enhancing Data Management Operations (DMOps) |
Damrin Kim (Konkuk University); NamHyeok Kim (Upstage); Chanjun Park (Upstage)*; Harksoo Kim (Konkuk University) |
pdf |
|
|
|
|
|
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining |
Sang Michael Xie (Stanford University)*; Hieu Pham (Google); Xuanyi Dong (University of Technology Sydney); Nan Du (Google Brain); Hanxiao Liu (Google Brain); Yifeng Lu (Google Brain); Percy Liang (Stanford University); Quoc Le (Google Brain); Tengyu Ma (Stanford); Adams Wei Yu (Google Brain) |
pdf |
|
|
|
|
|
On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training |
Jieyu Zhang (University of Washington)*; Bohan Wang (University of Science and Technology of China); Zhengyu Hu (NA); Pang Wei Koh (University of Washington); Alexander J Ratner (University of Washington) |
pdf |
|
|
|
|
|
Algorithm Selection for Deep Active Learning with Imbalanced Datasets |
Jifan Zhang (University of Wisconsin)*; Shuai Shao (Meta); Saurabh Verma (Meta); Robert Nowak (University of Wisconsin, Madison) |
pdf |
|
|
|
|
|
How to Improve Imitation Learning Performance with Sub-optimal Supplementary Data? |
Ziniu Li (The Chinese University of Hong Kong, Shenzhen)*; Tian Xu (Nanjing University); Zeyu Qin (HKUST); Yang Yu (Nanjing University); Zhiquan Luo (The Chinese University of Hong Kong, Shenzhen and Shenzhen Research Institute of Big Data) |
pdf |
|
|
|
|
|
Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction |
Chanjun Park (Upstage)*; Seonmin Koo (Korea University); Seolhwa Lee (University of Copenhagen); Jaehyung Seo (Korea University); Sugyeong Eo (Korea University); Hyeonseok Moon (Korea University); Heuiseok Lim (Korea University) |
pdf |
|
|
|
|
|
Contrastive clustering of tabular data |
Piotr Przemielewski (Jagiellonian University)*; Witold Wydmański (Jagiellonian University); Marek Śmieja (Jagiellonian University) |
pdf |
|
|
|
|
|
Detecting Errors in Numerical Data via any Regression Model |
Hang Zhou (UC Davis); Jonas Mueller (Cleanlab)*; Mayank Kumar (Cleanlab); Jane-Ling Wang (UC Davis); Jing Lei (Carnegie Mellon University) |
pdf |
|
|
|
|
|
THOS: A Benchmark Dataset for Targeted Hate and Offensive Speech |
Saad A Almohaimeed (University of Central Florida)*; Saleh Almohaimeed (University of Central Florida); Ashfaq Ali Shafin (Florida International University); Bogdan Carbunar (Florida International University); Ladislau Boloni (University of Central Florida) |
pdf |
|
|
|
|
|
Understanding Unfairness via Training Concept Influence |
Yuanshun Yao (ByteDance); Yang Liu (UC Santa Cruz)* |
pdf |
|
|
|
|
|
Promises and Pitfalls of Threshold-based Auto-labeling |
Harit Vishwakarma (University of Wisconsin Madison)*; Heguang Lin (University of Wisconsin-Madison); Frederic Sala (University of Wisconsin-Madison); Ramya Korlakai Vinayak (University of Wisconsin-Madison) |
pdf |
|
|
|
|
|
Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors |
Jesse E Cummings (MIT)*; Jonas Mueller (Cleanlab); Elías Snorrason (Cleanlab) |
pdf |
|
|
|
|
|
Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation |
Seungjun Lee (Korea University)*; Hyeonseok Moon (Korea University); Chanjun Park (Upstage); Heuiseok Lim (Korea University) |
pdf |
|
|
|
|
|
Towards Declarative Systems for Data-Centric Machine Learning |
Stefan Grafberger (University of Amsterdam); Bojan Karlaš (Harvard University); Paul Groth (University of Amsterdam); Sebastian Schelter (University of Amsterdam)* |
pdf |
|
|
|
|
|
DataCI: A Platform for Data-Centric AI on Streaming Data |
Huaizheng Zhang (BreezeML)*; Yizheng Huang (BreezeML); Yuanming Li (Independent Researcher) |
pdf |
|
|
|
|
|
EPIC: Graph Augmentation with Edit Path Interpolation via Learnable Cost |
Jaeseung Heo (POSTECH)*; Seungbeom Lee (POSTECH); Sungsoo Ahn (POSTECH); Dongwoo Kim (POSTECH) |
pdf |
|
|
|
|
|
Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning |
Patrik Okanovic (ETH Zurich); Roger Waleffe (University of Wisconsin-Madison)*; Vasilis Mageirakos (ETH Zurich); Konstantinos Nikolakakis (Yale University); Amin Karbasi (Yale); Dionysios Kalogerias (Yale University); Nezihe Merve Gürel (ETH Zürich); Theodoros Rekatsinas (ETH Zurich) |
pdf |
|
|
|
|
|
Data-Centric Defense: Shaping Loss Landscape with Augmentations to Counter Model Inversion |
Si Chen (Virginia Tech)*; Feiyang Kang (Virginia Tech); Nikhil Abhyankar (Virginia Tech); Ming Jin (Virginia Tech); Ruoxi Jia (Virginia Tech) |
pdf |
|
|
|
|
|
Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation |
Joshua L Vendrow (MIT)*; Saachi Jain (MIT); Logan Engstrom (MIT); Aleksander Madry (MIT) |
pdf |
|
|
|
|
|
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources |
Feiyang Kang (Virginia Tech)*; Hoang Anh Just (Virginia Tech); Anit Kumar Sahu (Amazon Alexa AI); Ruoxi Jia (Virginia Tech) |
pdf |
|
|
|
|
|
Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline |
Seonmin Koo (Korea University)*; Chanjun Park (Upstage); Jinsung Kim (Korea University); Jaehyung Seo (Korea University); Sugyeong Eo (Korea University); Hyeonseok Moon (Korea University); Heuiseok Lim (Korea University) |
pdf |
|
|
|
|
|
A Skew-Sensitive Evaluation Framework for Imbalanced Data Classification |
Min Du (Palo Alto Networks)*; Nesime Tatbul (Intel Labs and MIT); Brian Rivers (Intel); Akhilesh Kumar Gupta (University of Pennsylvania); Lucas Hu (Palo Alto Networks); Wei Wang (Palo Alto Networks); Ryan C Marcus (MIT); Shengtian Zhou (Snap); Insup Lee (University of Pennsylvania); Justin Gottschlich (Merly and Stanford University) |
pdf |
|
|
|
|
|
Investigating minimizing the training set fill distance in machine learning regression |
Paolo Climaco (Institut für Numerische Simulation, Universität Bonn)*; Jochen Garcke (University of Bonn) |
pdf |
|
|
|
|
|
Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data |
Nathan Vaska (MIT Lincoln Laboratory)*; Victoria Helus (MIT Lincoln Laboratory) |
pdf |
|
|
|
|
|
Addressing Discrepancies in Semantic and Visual Alignment in Neural Networks |
Natalie Abreu (MIT Lincoln Laboratory); Nathan Vaska (MIT Lincoln Laboratory)*; Victoria Helus (MIT Lincoln Laboratory) |
pdf |
|
|
|
|
|
Knowledge Graph-Augmented Korean Generative Commonsense Reasoning |
Dahyun Jung (Korea University)*; Jaehyung Seo (Korea University); Jaewook Lee (Korea University); Chanjun Park (Upstage); Heuiseok Lim (Korea University) |
pdf |
|
|
|
|
|
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias |
Yue Yu (Georgia Institute of Technology)*; Yuchen Zhuang (Georgia Institute of Technology); Jieyu Zhang (University of Washington); Yu Meng (University of Illinois Urbana-Champaign); Alexander J Ratner (University of Washington); Ranjay Krishna (University of Washington); Jiaming Shen (Google Research); Chao Zhang (Georgia Institute of Technology) |
pdf |
|
|
|
|
|
Principlism Guided Responsible Data Curation |
Jerone T A Andrews (Sony AI)*; Dora Zhao (Sony AI); William Thong (Sony AI); Apostolos Modas (Sony); Orestis Papakyriakopoulos (Sony AI); Alice Xiang (Sony AI) |
pdf |
|
|
|
|
|
RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting |
Lei Shu (Google); Liangchen Luo (Google)*; Jayakumar Hoskere (Google); Yun Zhu (Google); Yinxiao Liu (Google); Simon Tong (Google); Jindong Chen (Google); Lei Meng (Google) |
pdf |
|
|
|
|
|
Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value |
Yongchan Kwon (Columbia University); James Zou (Stanford University)* |
pdf |
|
|
|
|
|
Partial Label Learning meets Active Learning: Enhancing Annotation Efficiency through Binary Questioning |
Shivangana Rawat (Indian Institute of Technology, Hyderabad)*; Chaitanya Devaguptapu (Fujitsu Research); Vineeth Balasubramanian (Indian Institute of Technology Hyderabad) |
pdf |
|
|
|
|
|
Characterizing Risk Regimes for Safe Deployment of Deep Regression Models |
Jayaraman J. Thiagarajan (Lawrence Livermore National Laboratory)*; Vivek Narayanaswamy (Lawrence Livermore National Laboratory); Puja Trivedi (University of Michigan); Rushil Anirudh (Lawrence Livermore National Laboratory) |
pdf |
|
|
|
|
|
Improve Model Inference Cost with Image Gridding |
Shreyas Krishnaswamy (University of California, Berkeley)*; Lisa Dunlap (UC Berkeley); Lingjiao Chen (University of Wisconsin-Madison); Matei Zaharia (Stanford and Databricks); James Zou (Stanford University); Joey Gonzalez (Berkeley) |
pdf |
|
|
|
|
|
Towards an Efficient Algorithm for Time Series Forecasting with Anomalies |
Hao Cheng (University of California, Santa Cruz); Qingsong Wen (Alibaba DAMO Academy)*; Yang Liu (UC Santa Cruz); Liang Sun (Alibaba Group) |
pdf |
|
|
|
|
|
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models |
Mayee Chen (Stanford University)*; Nicholas Roberts (University of Wisconsin-Madison); Kush Bhatia (Stanford University); Jue Wang (Zhejiang University); Ce Zhang (ETH); Frederic Sala (University of Wisconsin-Madison); Christopher Re (Stanford University) |
pdf |
|
|
|
|
|
The Matrix Reloaded: A Counterfactual Perspective on Bias in Machine Learning |
Andre V Carreiro (Fraunhofer Portugal AICOS); Mariana Pinto (Faculty of Science and Technology, Nova University of Lisbon); Pedro S Madeira (Fraunhofer Portugal AICOS)*; Alberto Lopez (Imprensa Nacional - Casa da Moeda); Hugo Gamboa (LIBPhys, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa) |
pdf |
|
|
|
|
|
Graphtester: Exploring Theoretical Boundaries of GNNs on Graph Datasets |
M. Eren Akbiyik (ETH Zurich)*; Florian Grötschla (ETH Zürich); Beni Egressy (ETH Zurich); Roger Wattenhofer (ETH Zurich) |
pdf |
|
|
|
|
|
No Imputation without Representation |
Oliver U Lenz (Universiteit Gent)*; Daniel Peralta (Ghent University); Chris Cornelis (Ghent University) |
pdf |
|
|
|
|
|
L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models |
Aabha Pingle (Pune Institute of Computer Technology)*; Aditya Vyawahare (Pune Institute of Computer Technology); Isha Joshi (Pune Institute of Computer Technology); Rahul Tangsali (SCTR’s Pune Institute of Computer Technology); Raviraj Joshi (Indian Institute of Technology Madras) |
pdf |
|
|
|
|
|
Point Cloud Classification with ModelNet40: What is left? |
Jarne Van den Herrewegen (Oqton / Ghent University)*; Tom Tourwé (Oqton); Francis Wyffels (Ghent University) |
pdf |
|
|
|
|
|
In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation |
Julian Bitterwolf (University of Tübingen)*; Maximilian Mueller (University of Tübingen); Matthias Hein (University of Tübingen) |
pdf |
|
|
|
|
|
Localized Data Work as a Precondition for Data-Centric ML: A Case Study of Full Lifecycle Crop Disease Identification in Ghana |
Darlington Akogo (minoHealth); Issah A Samori (minoHealth AI Labs); Cyril S K Akafia (minoHealth AI Labs); Harriet Dede Fiagbor (minoHealth AI Labs); Andrews A Kangah (KaraAgro AI Labs); Donald Donald (KaraAgro); Kwabena Fuachie (Kara Agro AI); Luis Oala (Dotphoton AG)* |
pdf |
|
|
|
|
|
Offline Reinforcement Learning with Imbalanced Datasets |
Li Jiang (Tsinghua University)*; Sijie Chen (Fudan University); Jielin Qiu (Carnegie Mellon University); Haoran Xu (JD Technology); Victor Chan (TBSI); Ding Zhao (Carnegie Mellon University) |
pdf |
|
|
|
|
|
Bayesian Optimisation Against Climate Change: Applications and Benchmarks |
Sigrid Passano Hellan (University of Edinburgh)*; Christopher Lucas (University of Edinburgh); Nigel Goddard (University of Edinburgh) |
pdf |
|
|
|
|
|
On the Usefulness of Synthetic Tabular Data Generation |
Dionysis Manousakas (Amazon)*; Sergul Aydore (Amazon) |
pdf |
|
|
|
|
|
Active learning for time instant classification |
Nauman Ahad (Georgia Institute of Technology)*; Namrata Nadagouda (Georgia Institute of Technology); Eva L Dyer (Georgia Tech); Mark Davenport (Georgia Institute of Technology) |
pdf |
|
|
|
|
|
Speech Wikimedia: A 77 Language Multilingual Speech Dataset |
Rafael Mosquera Gómez (MLCommons); Julian Eusse (MLCommons); Juan Manuel Ciro (Factored); Daniel Galvez (NVIDIA)*; Ryan Hileman (Talon Voice); Kurt Bollacker (The Long Now Foundation); David Kanter (MLCommons) |
pdf |
|
|
|
|
|
Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least |
Siddharth Joshi (UCLA)*; Baharan Mirzasoleiman (UCLA) |
pdf |
|
|
|
|
|
Characterizing the Impacts of Semi-supervised Learning for Weak Supervision |
Jeffrey Li (University of Washington)*; Jieyu Zhang (University of Washington); Ludwig Schmidt (University of Washington); Alexander J Ratner (University of Washington) |
pdf |
|
|
|
|
|
Estimating label quality and errors in semantic segmentation data via any model |
Vedang Lad (MIT); Jonas Mueller (Cleanlab)* |
pdf |
|
|
|
|
|
STG-MTL: Scalable Task Grouping for Multi-Task Learning Using Data Maps |
Ammar Sherif (Nile University)*; Abubakar Abid (Hugging Face); Mustafa Elattar (Nile University); Mohamed ElHelw (Nile University) |
pdf |
|
|
|
|
|
ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data |
Ulyana Tkachenko (Cleanlab); Aditya Thyagarajan (Cleanlab); Jonas Mueller (Cleanlab)* |
pdf |
|
|
|
|
|
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data |
Alycia Y Lee (Stanford University)*; Brando Miranda (Stanford University); Sanmi Koyejo (Stanford University) |
pdf |
|
|
|
|
|
Learning pipeline-invariant representation for robust brain phenotype prediction |
Xinhui Li (Georgia Institute of Technology)*; Alex Fedorov (Georgia Institute of Technology); Mrinal Mathur (Georgia State University); Anees Abrol (TReNDS); Gregory Kiar (Child Mind Institute); Sergey Plis (Georgia State University); Vince Calhoun (TReNDS) |
pdf |
|
|
|
|
|
Is Pre-training Truly Better Than Meta-Learning? |
Brando Miranda (Stanford University)*; Patrick Yu (University of Illinois Urbana-Champaign); Saumya Goyal (Stanford University); Yu-Xiong Wang (University of Illinois at Urbana-Champaign); Sanmi Koyejo (Stanford University) |
pdf |
|
|
|
|
|
Adaptive Aggregated Drift Detector |
Beverly A Quon (University of California, Irvine)*; Jean-Luc Gaudiot (University of California, Irvine) |
pdf |
|
|
|
|
|
On Estimating the Epistemic Uncertainty of Graph Neural Networks using Stochastic Centering |
Puja Trivedi (University of Michigan)*; Mark Heimann (Lawrence Livermore); Rushil Anirudh (Lawrence Livermore National Laboratory); Danai Koutra (U Michigan); Jayaraman J. Thiagarajan (Lawrence Livermore National Laboratory) |
pdf |
|
|
|
|
|
LabelBench: A Comprehensive Framework for Benchmarking Label-Efficient Learning |
Jifan Zhang (University of Wisconsin)*; Yifang Chen (University of Washington); Gregory H Canal (University of Wisconsin-Madison); Stephen O Mussmann (University of Washington); Yinglun Zhu (University of Wisconsin-Madison); Simon Du (University of Washington); Kevin Jamieson (U Washington); Robert Nowak (University of Wisconsin, Madison) |
pdf |
|
|
|
|
|
Internet Explorer: Targeted Representation Learning on the Open Web |
Alexander C Li (Carnegie Mellon University)*; Ellis L Brown (Carnegie Mellon University); Alexei A Efros (UC Berkeley); Deepak Pathak (Carnegie Mellon University) |
pdf |
|
|
|
|
|
Uncovering Neural Scaling Law in Molecular Representation Learning |
Dingshuo Chen (University of Chinese Academy of Sciences)*; Yanqiao Zhu (University of California, Los Angeles); Jieyu Zhang (University of Washington); Yuanqi Du (Cornell University); Zhixun Li (The Chinese University of Hong Kong); Qiang Liu (Institute of Automation, Chinese Academy of Sciences); Shu Wu (NLPR, China); Liang Wang (NLPR, China) |
pdf |
|
|
|
|
|
MultiLegalPile: A 689GB Multilingual Legal Corpus |
Joel Niklaus (University of Bern)*; Veton Matoshi (Bern University of Applied Sciences); Matthias Stürmer (University of Bern); Ilias Chalkidis (University of Copenhagen); Daniel Ho (Stanford Law) |
pdf |
|
|
|
|
|
Self-supervised Autoencoder for Correlation-Preserving in Tabular GANs |
Siddarth Ramesh (Adobe); Surgan Jandial (MDSR Labs, Adobe)*; Gauri Gupta (MIT); Piyush Gupta (Adobe Systems India Pvt Ltd); Balaji Krishnamurthy () |
pdf |
|
|
|
|
|
D4: Improving LLM Pretraining via Document De-Duplication and Diversification |
Kushal Tirumala (FAIR)*; Daniel Simig (Meta AI); Armen Aghajanyan (FAIR); Ari S Morcos (Facebook AI Research (FAIR)) |
pdf |
|
|
|
|
|
Ensemble Fractional Imputation for Incomplete Categorical Data with a Graphical Model |
Yonghyun Kwon (Iowa State University)*; Jae Kwang Kim (Iowa State University) |
pdf |
|
|
|
|
|
Put on your detective hat: What’s wrong in this video? |
Rohith Peddi (The University of Texas at Dallas)*; Shivvrat Arya (The University of Texas at Dallas); Bharath Challa (The University of Texas at Dallas); Likhitha Pallapothula (University of Texas at Dallas); Akshay Vyas (University of Texas at Dallas); Qifan Zhang (The University of Texas at Dallas); Jikai Wang (University of Texas at Dallas); Vasundhara Komaragiri (UT Dallas); Eric Ragan (University of Florida); Nicholas Ruozzi (UT Dallas); Yu Xiang (The University of Texas at Dallas); Vibhav Gogate (UT Dallas) |
pdf |
|