Day 1 - June 07
Opening Keynote: Bridging the Last Mile: Applying Foundation Models with Data-Centric AI
Today, large language models and other foundation models (FMs) represent one of the most powerful new ways to build AI; however, they still struggle to achieve production-level accuracy out of the box on complex, high-value, and/or dynamic use cases, often “hallucinating” facts, propagating data biases, and misclassifying domain-specific edge cases. This “last mile” problem is always the hardest part of shipping real AI applications, especially in the enterprise, and while FMs provide powerful foundations, they do not “build the house”.
In this talk, I’ll explain how this last-mile adaptation is increasingly all about the data (not, e.g., the model architecture, hyperparameters, or algorithms). I’ll give an overview of modern data-centric AI development approaches to this problem, and preview new state-of-the-art techniques and tools for handling all stages of data-centric development for foundation models, from pre-training to instruction tuning and alignment, to task-specific fine-tuning and distillation.
Fireside Chat: Alex Ratner and Gideon Mann on building BloombergGPT
Join us for a fireside chat with Gideon Mann, Head of Machine Learning Product and Research, CTO Office at Bloomberg about how they built a domain-specific LLM, BloombergGPT.
Fireside Chat: The role of data in building Stable Diffusion and Generative AI
Discover the transformative power of data in developing Stable Diffusion and Generative AI as Emad Mostaque shares insights into the pivotal role data plays in creating these groundbreaking technologies. Explore the journey of leveraging data-driven approaches to drive innovation, unlock new possibilities, and shape the future of AI.
Panel – The Linux Moment of AI: The Open-Source AI Stack
In this panel, seasoned experts Julien, Ed, and Travis will delve into how open-source models and tools can revolutionize AI. Julien will shed light on projects like Big Science and explore how open-source projects can lead to a more adaptable AI stack, empowering developers to create use-case-specific solutions. With his vast experience in deploying and monitoring AI systems, Ed will discuss how open source aids these processes, and the challenges and potential solutions when scaling these systems. Meanwhile, Travis will share insights from his work with Ludwig, demonstrating how open-source innovation fosters faster, easier, and more collaborative development. As we witness the evolution of applications like ChatGPT, our panelists will discuss the open-source community’s crucial role in steering future developments and ensuring the ethical and responsible use of such technologies.
A Practical Guide to Data Centric AI – A Conversational AI Use Case
In this talk, we will provide real-world examples of how data-centric AI is being used to solve complex problems at Ally. We will dive deep into an innovative use of data-centric AI, specifically using Generative AI and LLMs to set up Conversational AI for Ally Auto customers. Overall, this talk will provide insights into how data-centric AI can be used in a practical sense to drive innovation and create value in industry.
Panel – Adopting AI: With Power Comes Responsibility
Vijay Janapa Reddi
In our panel session, we’ll dissect the complexities inherent to responsibly leveraging Generative AI in the midst of an escalating ML arms race. We’ll probe into the ethical implications of large-scale AI experiments and the ongoing parameter wars, weighing the computational demand against potential fallout. As AI regulation efforts globally accelerate, we’ll discuss their influence on deep learning trajectories and the necessary proactive engagement from organizations. In anticipation of a rise in AI incidents due to rapid ML model scaling and hardware complexity, we’ll explore risk mitigation strategies. Lastly, we’ll delve into the increasing significance of data-centric ML systems, focusing on high-quality data acquisition and the role of data quality in minimizing risks associated with swift AI advancements. Join us as we navigate the technical intricacies of AI innovation and responsibility.
The Future is Neurosymbolic
Generating Synthetic Tabular Data That’s Differentially Private
While generative models are able to produce synthetic datasets that preserve the statistical qualities of the training dataset without identifying any particular record in the training dataset, most generative models to date do not offer mathematical guarantees of privacy that can be used to facilitate information sharing or publishing. Without such mathematical guarantees, each adversarial attack on these models and the synthetic data they generate needs to be thwarted reactively. We can never be aware of adversarial attacks that might become feasible in the future. This is exactly the problem that differential privacy (DP) solves by bounding the probability that a compromising event occurs. By introducing calibrated noise into an algorithm, DP defends against all future privacy attacks with a high probability. In this session, we’ll explore approaches to applying differential privacy, including one that relies on measuring low dimensional distributions in a dataset combined with learning a graphical model representation. We'll end with a preview of Gretel's new generative model that applies this method to create high-quality synthetic tabular data that is differentially private.
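The core idea behind calibrated noise can be sketched with the classic Laplace mechanism. This is generic DP machinery, not Gretel’s model, and the numbers below are purely illustrative:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a numeric query result with epsilon-differential privacy
    by adding Laplace noise calibrated to the query's sensitivity."""
    rng = rng or np.random.default_rng(0)  # seeded here for reproducibility
    scale = sensitivity / epsilon          # larger scale => stronger privacy
    return true_value + rng.laplace(loc=0.0, scale=scale)

# A counting query has sensitivity 1: adding or removing a single record
# changes the count by at most 1.
noisy_count = laplace_mechanism(true_value=1000, sensitivity=1, epsilon=1.0)
```

The released `noisy_count` is close to the true count, yet the noise bounds what any adversary, present or future, can infer about one individual record.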
Fireside Chat: The Building Blocks of Modern Enterprise AI
Aparna Lakshmi Ratan
In this illuminating fireside chat, we dive into the heart of modern enterprise AI, exploring the dynamic intersection of data, models, and MLops platforms that define the new ML stack. We’ll investigate how factors such as model form factors, data types, use case variety, enterprise constraints, and the use of private data in AI applications shape this landscape, all while casting an anticipatory gaze towards the future of AI. As we decode the intricacies of today’s AI environment and predict tomorrow’s game-changers, this session offers a comprehensive insight into the building blocks of modern enterprise AI.
Panel: Navigating the LLM Labyrinth in a World of Rules
In this session, we’ll dive into the intricacies of Large Language Models (LLMs) within regulated industries. Our expert panel will discuss strategies for tuning LLMs to reduce misinterpretations and errors in conversational AI applications, emphasizing the necessity of precision in such sectors. They’ll explore the challenges and potential solutions organizations might encounter when transitioning from rules-based approaches to LLMs. Further, they will shed light on strategies to ensure LLMs’ compliance and predictability, given their capacity for creativity, and discuss how to manage data privacy concerns in LLM training. Finally, our panelists will delve into the role of data-centric approaches and programmatic labeling in aligning LLMs’ behavior with industry-specific requirements and ethical norms.
DataComp: In search of the next generation of multimodal datasets
Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets. We provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing on 38 downstream test sets. Our benchmark consists of multiple scales, which facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources.
Our baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. We introduce a new dataset DataComp-1B and show that CLIP models trained on this dataset outperform OpenAI’s CLIP model by 3.7 percentage points on ImageNet while using the same compute budget. Compared to LAION-5B, our data improvement corresponds to a 9x improvement in compute cost.
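The simplest family of filtering baselines in this setting is a threshold filter on a per-pair quality score. A schematic sketch (field names and scores here are invented; in practice the score would come from, e.g., CLIP image-text similarity):

```python
def filter_pool(pairs, score_fn, threshold=0.3):
    """Keep only image-text pairs whose quality score clears the threshold."""
    return [p for p in pairs if score_fn(p) >= threshold]

# Toy candidate pool with stand-in similarity scores.
pool = [
    {"caption": "a photo of a dog", "clip_score": 0.35},
    {"caption": "asdf1234",         "clip_score": 0.02},
]
kept = filter_pool(pool, score_fn=lambda p: p["clip_score"])
```

The benchmark then holds the training code fixed, so any downstream gain is attributable to the curated training set alone.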
Poster Competition: Procedure-Aware Pretraining for Instructional Video Understanding
Instructional videos depict humans demonstrating how to perform multi-step tasks such as cooking, repairing, etc. Building good video representations from instructional videos is challenging due to the small amount of video annotations available. This makes extracting procedural knowledge, such as the identity of the task (e.g., ‘make latte’) and its steps (e.g., ‘pour milk’), challenging. Our insight is that instructions for procedures depict sequences of steps that repeat between instances of the same or different tasks, and that this structure can be well represented by a Procedural Knowledge Graph, where nodes are discrete steps and edges connect steps that occur sequentially in the instructional activities. This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form, to generalize to multiple procedure understanding tasks. We call this Procedural-Knowledge-Graph-based pre-training method and the resulting model Paprika, for Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask for procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art: up to 11.23% gains in accuracy in 12 evaluation settings.
Poster Competition: JoinBoost: Tree Training with just SQL
Data and machine learning (ML) are crucial for enterprise operations. Enterprises store data in databases for management and use ML to gain business insights. However, there is a mismatch between the way ML expects data to be organized (a single table) and the way data is organized in databases (a join graph of multiple tables). Current specialized ML libraries (e.g., LightGBM, XGBoost) necessitate data denormalization, data export, and data import, as they operate as separate programs incompatible with databases. This workflow not only increases operational complexity but also suffers from scalability limits, slower performance, and security risks. But what if there was a way to achieve competitive tree-training performance with just SQL? We present JoinBoost, a lightweight Python library that transforms tree-training algorithms over normalized databases into pure SQL queries. Compatible with any DBMS and data stack, JoinBoost is a simplified, all-in-one data stack solution that avoids data denormalization, export, and import. JoinBoost delivers exceptional performance and scalability tailored to the capabilities of the underlying DBMS. Our experiments reveal that JoinBoost is 3x (1.1x) faster for random forests (gradient boosting) when compared to LightGBM, and scales well beyond LightGBM in terms of features, DB size, and join graph complexity.
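To make the “tree training with just SQL” idea concrete, here is a minimal, hypothetical sketch (not JoinBoost’s actual code) that scores candidate decision-stump splits entirely with SQL aggregates, since the variance-reduction criterion reduces to per-side sums and counts:

```python
import sqlite3

# Toy table: one numeric feature x and a regression target y.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (x REAL, y REAL)")
conn.executemany("INSERT INTO train VALUES (?, ?)",
                 [(1, 0.0), (2, 0.1), (3, 0.9), (4, 1.0)])

def split_score(threshold):
    # Per-side counts and label sums, computed in a single SQL query.
    left_n, left_s, right_n, right_s = conn.execute(
        """SELECT
             SUM(CASE WHEN x <= ? THEN 1 ELSE 0 END),
             SUM(CASE WHEN x <= ? THEN y ELSE 0 END),
             SUM(CASE WHEN x >  ? THEN 1 ELSE 0 END),
             SUM(CASE WHEN x >  ? THEN y ELSE 0 END)
           FROM train""", (threshold,) * 4).fetchone()
    # Variance-reduction criterion, up to constants: S_L^2/n_L + S_R^2/n_R.
    score = 0.0
    if left_n:
        score += left_s ** 2 / left_n
    if right_n:
        score += right_s ** 2 / right_n
    return score

best_threshold = max([1, 2, 3], key=split_score)  # candidate split points
```

Because only aggregates cross the database boundary, the training data never has to be denormalized or exported.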
Poster Competition: Data-IQ: Characterize & Audit your training data with 2 lines of code!
High model performance, on average, can hide that models may systematically underperform on subgroups of the data. To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes, allowing users to audit their tabular, image, or text data with just two lines of extra code!
We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and then subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We show that Data-IQ's characterization of examples is most robust to variation across similarly performant (yet different models), compared to baselines. Since Data-IQ can be used with any ML model (including neural networks, gradient boosting etc.), this property ensures consistency of data characterization, while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable us to construct new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
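As a rough illustration (not the Data-IQ library’s actual API; thresholds and probabilities are invented), the confidence/aleatoric-uncertainty stratification can be sketched as:

```python
import numpy as np

def stratify(probs_over_epochs, conf_thresh=0.75, unc_thresh=0.1):
    """Toy Data-IQ-style stratification. probs_over_epochs has shape
    (epochs, examples): the model's probability of the true label for
    each example at each training checkpoint."""
    p = np.asarray(probs_over_epochs)
    confidence = p.mean(axis=0)             # average confidence during training
    aleatoric = (p * (1 - p)).mean(axis=0)  # aleatoric (data) uncertainty
    return np.where(
        aleatoric > unc_thresh, "Ambiguous",
        np.where(confidence >= conf_thresh, "Easy", "Hard"))

# Three examples over three checkpoints: consistently confident,
# consistently low-confidence, and fluctuating.
probs = [[0.90, 0.10, 0.30],
         [0.95, 0.15, 0.80],
         [0.92, 0.05, 0.40]]
groups = stratify(probs)
```

Because the stratification depends only on per-example predicted probabilities collected during training, the same recipe applies to any model that emits them, neural networks and gradient boosting alike.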
LLMOps: Making LLM Applications Production-Grade
Large language models are fluent text generators, but they struggle at generating factual, correct content. How can we convert these capabilities into reliable, production-grade applications? In this talk, I'll cover several techniques to do this based on my work and experience at Stanford and Databricks. On the research side, we've been developing programming frameworks such as Demonstrate-Search-Predict (DSP) that reliably connect an LLM to factual information and automatically improve the app's performance over time. On the industry side, Databricks has been building a stack of simple yet powerful tools for "LLMOps" into the MLflow open source framework.
Data-Driven Government: A Fireside Chat with the Former U.S. Chief Data Scientist
Join us for an engaging fireside chat as we delve into data science's history, impact, and challenges in the United States government. Our esteemed guest, the former U.S. Chief Data Scientist, will share insights into the origins of this vital role and their experiences in managing critical initiatives. Discover the strategies to drive data-driven decisions within the complex governmental landscape and explore the balance between data scientists and engineers. Gain valuable perspectives on the future of AI/ML, the ethical considerations in data science, and the transformative potential of leveraging data to better society. Don't miss this opportunity to unravel the fascinating data science journey at the highest levels of governance.
Day 1 Recap + Poster Competition Winners
A look at the highlights of day one and winners of the poster competition announced.
Day 2 - June 08
Opening Keynote: New introductions from Snorkel AI
Join Snorkel AI CEO, Alex Ratner, as he introduces our latest programmatic AI solutions. Designed to optimize foundation model development, enhance data development, and accelerate customization, these new solutions mark a leap in AI technology. Be the first to explore how Snorkel AI makes it practical for enterprises to leverage LLMs and foundation models.
Fireside Chat: Journey of Data: Transforming the Enterprise with Data-Centric Workflows
Join Nurtekin Savas, Head of Enterprise Data Science at Capital One, as he embarks on an insightful exploration of data’s journey across the enterprise. In this session, Nurtekin will unravel how data, from its creation to its ultimate insights, navigates and transforms within the complex enterprise stack. He will spotlight the power of data-centric workflows and their crucial role in driving business decisions, improving operational efficiency, and fueling AI innovation.
The Opportunity of Data Centric AI in Insurance
Alejandro Zarate Santovena
Accelerate ML Adoption by Addressing Hidden Needs
While there has been and continues to be a substantial amount of strategic investment in ML, that investment is rarely rewarded with an attractive return. The industry has seen rapid evolution in capabilities and even dramatic improvement in cost efficiencies; yet, an attractive return on investment remains elusive. This discussion will focus on hidden needs that must be addressed before ML can break free from strategic investment and enjoy broad adoption within the enterprise.
Transforming the Customer Experience with AI: Wayfair’s Data-Centric Way
In this talk, we will walk through the problems we solve at Wayfair using machine learning, which impact all aspects of a customer's journey. We will provide insights on how we use ML to understand our customers as well as the products in our catalog. We will also discuss some of the challenges we face in our space and how we are using ML best practices, state of the art foundation models, and data-centric approaches to solve these problems.
One way we help our customers find products is by cleaning and enriching our catalog. We do this by automating image tagging using a data-centric approach. We will provide insights on how we have accomplished this and share our findings.
Finally, we will touch on an important aspect of our approach: the collaboration between subject matter experts (SMEs) and data scientists (DS). By working closely together, we are able to quickly iterate on model development and testing, ultimately leading to a faster time-to-market for the models we develop.
Unleashing Human Potential with AI Augmentation
In this presentation, we will delve into the innovative ways artificial intelligence can augment and assist human capabilities, leading to novel applications in various domains. We will touch upon personal experience that demonstrates the power of AI in enhancing human creativity. Further, we will draw parallels to commercial work, showcasing how these techniques can be applied generically across multiple industries. By illustrating the vast potential of AI in both unique and professional contexts, this talk aims to inspire and inform attendees about the limitless possibilities AI offers in enhancing human potential.
Tackling advanced classification using Snorkel Flow
In this talk, we’ll discuss the key challenges and approaches for productionizing classification models in the age of foundation models. To start, we’ll highlight common but underrated challenges related to label schema definition, high cardinality, and multi-label problem formulations. We’ll dive into specific user experiences in Snorkel Flow to overcome these challenges, including ways to leverage foundation models, targeted error analysis, and supervision from subject matter experts. Finally, we’ll zoom out with a few case studies to describe how enterprise teams leverage data-centric workflows to build high-quality production models and unblock previously untenable problems in Snorkel Flow.
Combining domain knowledge with data to track and predict heavy-equipment service events
In this talk, we will illustrate how a century-old company like Caterpillar combines its domain knowledge with data to develop modern analytics that provides value to the enterprise, its dealership network, and its customers. In particular, we will describe how domain expertise and data are used to classify and predict repair events of heavy equipment.
Accelerating information extraction with data-centric iteration
During this session, we’ll discuss practical workflows for building enterprise information extraction applications. We’ll start with an end-to-end deep dive into “sequence tagging” tasks in Snorkel Flow, where we’ll highlight how teams of data scientists and subject matter experts can rapidly build powerful, zero-to-one models. In doing so, we’ll cover the key annotation, error analysis, and model-guided iteration capabilities that have helped our customers unblock models that power high-value use cases in production. Finally, we’ll discuss exciting opportunities for even further acceleration of these workflows in an FM-first world.
Data Driven AI for Threat Detection
Network security has been a complex area to apply traditional machine learning to. The number of possible threats is vast, but at the same time, the number of labeled attack samples is very small. Moreover, by the time enough sample data is collected for a particular type of threat, the threat vector changes.
While collecting samples for the true positives is difficult, security analysts usually have good mental heuristics about how threats behave. They manually “execute” these heuristics to identify the threat among the massive network data. Typically, the heuristics are applied after unsupervised techniques identify the anomalies and outliers in the data. While this works well in practice, the approach is computationally expensive, due to the very nature of the unsupervised algorithms, and its accuracy in the field is unpredictable.
Weak supervision provides an alternative approach to utilizing the heuristics to identify the threats. It allows us to push the heuristics to the raw data to help us build more efficient models with predictable accuracy. In this talk, I will discuss one prototype of using weak supervision in the cyber security domain with exciting results.
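A minimal sketch of this idea, with invented heuristics and a simple majority vote standing in for a learned label model:

```python
# Each labeling function encodes one analyst heuristic; it votes
# THREAT, BENIGN, or abstains (None) on a network-flow record.
# Field names and thresholds are hypothetical.
THREAT, BENIGN = 1, 0

def lf_beaconing(flow):
    # Regular, small outbound connections can indicate C2 beaconing.
    return THREAT if flow["interval_stddev"] < 0.5 and flow["bytes_out"] < 1e4 else None

def lf_known_port(flow):
    # Bulk traffic on standard web ports is usually benign.
    return BENIGN if flow["dst_port"] in (80, 443) and flow["bytes_out"] > 1e6 else None

def lf_rare_domain(flow):
    # Freshly registered domains are a common threat indicator.
    return THREAT if flow["domain_age_days"] < 7 else None

def weak_label(flow, lfs=(lf_beaconing, lf_known_port, lf_rare_domain)):
    """Majority vote over non-abstaining heuristics; None if all abstain."""
    votes = [v for lf in lfs if (v := lf(flow)) is not None]
    return max(set(votes), key=votes.count) if votes else None

flow = {"interval_stddev": 0.2, "bytes_out": 512,
        "dst_port": 8080, "domain_age_days": 3}
label = weak_label(flow)
```

The weakly labeled flows can then train an ordinary supervised classifier, replacing a full unsupervised pass over the raw traffic.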
Comcast SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale
End-to-end automatic speech recognition systems represent the state of the art, but they rely on thousands of hours of manually annotated speech for training, as well as heavyweight computation for inference. Of course, this impedes commercialization since most companies lack vast human and computational resources. In this paper, we explore training and deploying an ASR system in the label-scarce, compute-limited setting. To reduce human labor, we use a third-party ASR system as a weak supervision source, supplemented with Snorkel labeling functions derived from implicit user feedback. To accelerate inference, we propose to route production-time queries across a pool of CUDA graphs of varying input lengths, the distribution of which best matches the traffic's. Compared to our third-party ASR, we achieve a relative improvement in word-error rate of 8% and a speedup of 600%. Our system, called SpeechNet, currently serves 12 million queries per day on our voice-enabled smart television.
Applying weak supervision and foundation models for computer vision
Ravi Teja Mullapudi
In this session, we will explore the latest advancements in computer vision that enable data-centric image classification model development. We will showcase how visual prompts and fast parameter-efficient models built on top of foundation models provide immediate feedback to rapidly iterate on data quality and model performance resulting in significant time-savings and performance improvements. Moreover, we will delve into the importance of adapting model representations via large-scale fine-tuning on weakly labeled data to address the limitations of fast but small models trained on fixed features. Finally, we will discuss the necessary scaling and model adaptations needed to transition from image-level classification to object-level detection and segmentation. Overall, this talk aims to provide insights into how computer vision data and models can be effectively improved in tandem and adjusted for downstream applications.
AI and the Future of Tax
AI is transforming the tax services sector. Learn how organizations are adapting and leveraging generative AI and machine learning to prepare for the future.
Leveraging Data-centric AI for Document Intelligence and PDF Extraction
Extracting entities from semi-structured documents is often a challenging task, requiring complex and time-consuming manual processes. In this session, we will explore how data-centric AI can be leveraged to simplify and streamline this process. We will start by discussing the challenges associated with extracting from PDFs and other semi-structured documents. We will explore how they can be overcome using Snorkel’s data-centric approach. Finally, we will dive into how foundation models can be utilized to further accelerate development of these extraction models.
Leveraging foundation models and LLMs for enterprise-grade NLP
In recent years, large language models (LLMs) have shown tremendous potential in solving natural language processing (NLP) problems. However, deploying LLMs in enterprise comes with its own set of challenges, especially when it comes to adapting the models to customer-specific data and incorporating domain knowledge. In this talk, we will explore how Snorkel AI can help address these challenges and enable businesses to leverage LLMs to extract insights from text data. We will walk through how Snorkel Flow can enable businesses to drive value from LLMs today, making the most of enterprise-grade NLP.
Bias Busters: Strategies for Monitoring, Managing, and Mitigating AI Bias
Dive into the world of AI Bias. This talk explores the pervasive issue of AI bias and its implications. Understand various forms of bias, from data and perception bias to survivorship and availability bias, and how they influence AI models. Learn practical strategies to counteract these biases, such as A/B testing, bias detection during model training, and comprehensive monitoring during model scoring. The talk concludes with a focus on action steps post-detection, including model retraining and selection of challenger models, intending to achieve equitable outcomes, enhance transparency, and meet evolving regulatory requirements.
Lessons From a Year with Snorkel: Data-Centric Workflows with SMEs at Georgetown
When the Center for Security and Emerging Technology began experimenting with Snorkel, we had two high-level goals: to address recurring bottlenecks in our ML projects, and to improve collaborative workflows between data scientists and subject-matter experts. In this talk, we share takeaways from the half-dozen project teams who used Snorkel in the past year. We identify friction points in adoption, summarize feedback from SMEs, and discuss which challenges Snorkel has helped us address and which remain.
The future of AI is hybrid
As generative AI adoption grows at record-setting speeds and computing demands increase, hybrid processing is more important than ever. But just like traditional computing evolved from mainframes and thin clients to today’s mix of cloud and edge devices, AI processing must be distributed between the cloud and devices for AI to scale and reach its full potential.
In this talk, you’ll learn:
• Why on-device AI is key
• Which generative AI models can run on device
• Why the future of AI is hybrid
• Qualcomm Technologies’ role in making hybrid AI a reality
Fireside Chat: Building RedPajama
Foundation models such as GPT-4 have driven rapid improvement in AI. However, the most powerful models are closed commercial models or only partially open. RedPajama is a project to create a set of leading, fully open-source models. In this session Ce Zhang, Together CTO, will discuss the data collection and training processes that went into building the RedPajama models.
Day 2 Recap
In-person meetup: Snorkel AI Headquarters – Redwood City, California
Bay Area Data Science Community
Join us for an in-person meetup at Snorkel Headquarters in Redwood City! Brews, artisan pizza, a stand-up comedy moment, and an unofficial Cornhole tournament (seeking a challenger for a Snorkel AI co-founder who has real game). Bring two friends!
Date: June 8
Time: 4 PM - 7 PM
Address: 55 Perry Street, Redwood City, CA
Parking: Plentiful and close by. Metered at $1/hour.
CalTrain: Redwood City stop, 3-minute walk