Friday, June 09, 2023

Future of Data Centric AI -- Trip Report

I attended the Future of Data Centric AI 2023 this week, a free virtual conference organized by Snorkel AI. Snorkel.AI is a company built around the open-source Snorkel framework for programmatic data labeling. The project originally started at Stanford University's Hazy Research group, and many (all?) of the company's founders and some engineers are from the original research team. Snorkel.AI has been building and improving its flagship product, Snorkel Flow, an integrated tool for iterative data labeling and model building, so some presentations centered around that. In addition, it's 2023, the year of generative LLMs (or GoLLuMs, or Foundation Models), so Snorkel's ability to interface with these Foundation Models (FMs) also featured prominently. Maybe it's a Stanford thing, but the presenters seemed to prefer calling them FMs, so I will do the same, if only to distinguish them from the BERT / BART style large language models (LLMs).

If you are unfamiliar with what Snorkel does, I recommend checking out Snorkel and the Dawn of Weakly Supervised Machine Learning (Ratner et al, 2017) for a high-level understanding. For those familiar with the original open-source Snorkel (and Snorkel METAL), Snorkel Flow is primarily a no-code, web-based tool to support the complete life-cycle of programmatic data labeling and model development. Because it is no-code, it is usable by domain experts who don't necessarily know how to program. While the suite of built-in no-code Labeling Function (LF) templates is quite extensive, it supports adding programmatic LFs as well if you need them. In addition, it provides various conveniences, such as cold-start LF recommendations, error analysis, and recipes for addressing various classes of error, to support an iterative approach to model development that resembles a programmer's edit-compile-run cycle. Over the last few months, they have added LLMs as another source of weak supervision and a possible source of LFs as well.
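
If you haven't seen the open-source library, here is a minimal sketch of the core idea, loosely following Snorkel's own spam-vs-ham tutorial (APIs as in the 0.9.x releases of the snorkel package; Snorkel Flow wraps this same workflow in a no-code UI):

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

# Label values; -1 means the LF abstains on that example
ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Heuristic: messages with URLs are often spam
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Heuristic: very short messages tend to be legitimate
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "check out http://example.com for free stuff",
    "ok see you at noon",
    "win a prize now http://spam.example",
    "thanks, got it",
]})

# Apply every LF to every example, yielding a (noisy) label matrix
applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_message])
L_train = applier.apply(df=df_train)

# The label model denoises the overlapping, conflicting LF votes
# into one probabilistic label per example
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=42)
preds = label_model.predict(L=L_train)
```

The generated labels are then used to train a conventional discriminative model, which is what actually gets deployed.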

The last bit is important, because I think it points to the pragmatism of the Snorkel team. The FM applications ecosystem currently seems filled with pipelines that feature the FM front and center, i.e. use the FM for everything it can possibly do. Given the high infrastructure costs of running FMs and their high latencies, these pipelines don't seem very practical. Most of us were taught to cache (or pre-cache) as much as possible, so the customer does not pay the price during serving, or they will soon cease to be customers. Matthew Honnibal, creator of spaCy, makes a similar, though probably better argued, point in his Against LLM Maximalism blog post, where he advocates for smaller, more reliable models for most tasks in the pipeline, reserving the FM for tasks that truly need its capabilities. Snorkel Flow goes one step further by taking FMs out of the pipeline altogether -- instead using them to help generate good labels, thus benefiting from the FM's world-knowledge while still retaining the flexibility, reliability, and explainability of the generated models.
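
To make that pattern concrete, here is a hedged sketch of FM-assisted labeling followed by distillation into a small serving model. The call_llm function and the example prompt are hypothetical stand-ins, not any particular vendor's API:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for whatever FM client you use
    raise NotImplementedError("plug in your FM provider's client here")

def weak_label(text: str) -> int:
    # The FM is consulted once, offline, at labeling time -- never at serving time
    answer = call_llm(
        "Is this product review positive or negative? "
        f"Answer POSITIVE or NEGATIVE.\n\n{text}"
    )
    return 1 if "POSITIVE" in answer.upper() else 0

raw_texts = [...]                             # your unlabeled corpus
y_weak = [weak_label(t) for t in raw_texts]   # pay the FM cost once, up front

# A small, fast, explainable model serves the actual traffic
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(raw_texts, y_weak)
```

In Snorkel terms, the FM is just one more (fairly strong) labeling function whose votes can be combined with cheaper heuristic LFs.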

However, Snorkel.AI is addressing the needs of the FM market as well, through their soon-to-be-announced new tools -- Foundry and GenFlow -- which Alex Ratner (CEO and co-founder of Snorkel.AI) mentioned in his keynote addresses. They classify the usage of FMs into four stages -- pre-training (either from scratch or from trained weights, where it becomes more of a domain adaptation exercise), instruction tuning for behavior, fine-tuning for a particular task, and distillation of the model into a smaller, more easily deployable model. As the DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Xie et al, 2023) paper shows, the mix of data used to train or adapt the FM can have a significant impact on its quality, and Foundry and GenFlow are aimed at improving data and prompt quality for the first and second stages respectively, by ensuring optimum sampling, filtering and ranking.
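
As a rough illustration of where mixture weights enter the picture (this is just fixed-weight sampling; DoReMi's actual contribution is learning the weights with a small proxy model and group DRO), consider something like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretraining corpora bucketed by domain
domains = {
    "web":    ["web doc"] * 1000,
    "code":   ["code file"] * 300,
    "papers": ["paper"] * 100,
}

# Mixture weights over domains; tuning these (rather than the model)
# is the data-centric knob DoReMi optimizes
weights = {"web": 0.6, "code": 0.25, "papers": 0.15}

def sample_batch(batch_size):
    names = list(domains)
    p = np.array([weights[n] for n in names])
    picks = rng.choice(names, size=batch_size, p=p / p.sum())
    return [domains[n][rng.integers(len(domains[n]))] for n in picks]

batch = sample_batch(32)  # a pretraining batch drawn under the mixture
```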

Over the course of the conference, presenters repeatedly talked about the importance of having high quality data to train models. Not surprising, since the conference has "Data-Centric AI" in its name, a term coined by Andrew Ng, who was the first to emphasize this idea. However, the Snorkel team has really taken this idea to heart, and along with their customers, has developed some really cool applications, some of which they showcased at this conference. Apart from the keynotes and some panel discussions, presentations ran in two parallel tracks. I chose the ones that emphasized practice over theory, and skipped a few, so the list below may be slightly biased. Videos of the talks will become available on the Snorkel YouTube channel in about a month; I will update the links once that happens (if I remember).

  • Bridging the Last Mile: Applying Foundation Models with Data-Centric AI (Alex Ratner) -- the basic idea is that FMs are analogous to generalists that (think they) know lots of things, but for specific tasks they need to be trained to do well. Alex envisions data scientists of the future being less machine learning experts and more domain and product experts. Alex's talks contain many interesting observations, too numerous to list here, and strike just the right mixture of academic and practical for lay people such as myself.
  • Fireside Chat: Building BloombergGPT (Gideon Mann and Alex Ratner) -- interesting insights into the rationale for BloombergGPT and the work that went into building it.
  • Fireside Chat: Stable Diffusion and Generative AI (Emad Mostaque and Alex Ratner) -- lots of cool technical insights about FMs from Emad Mostaque, CEO of Stability AI (the company behind Stable Diffusion).
  • A Practical Guide to Data Centric AI -- A Conversational AI Use Case (Daniel Lieb and Samira Shaikh) -- practical tips for building an intent classifier for conversational chatbots. The similarity function for clustering conversations was adapted from the paper Modeling Semantic Containment and Exclusion in Natural Language Inference (MacCartney and Manning, 2008).
  • The Future is Neurosymbolic (Yoav Shoham) -- a somewhat philosophical discussion, from the founder of AI21 Labs, of why FMs can never do the kinds of things humans can do.
  • Generating Synthetic Tabular Data that is Differentially Private (Lipika Ramaswamy) -- a somewhat technical discussion arguing for differential privacy as a way to generate synthetic datasets that could be used to train FMs, thereby addressing the problem of FMs memorizing sensitive training data.
  • DataComp: Significance of Data for Multimodal AI (Ludwig Schmidt) -- discusses DATACOMP, a benchmark which aims to improve an image-text dataset used to train multi-modal models such as CLIP, by keeping the model fixed and improving the dataset. By applying a simple quality filter to the original dataset, they were able to train a model that was smaller, took 7x less time to train, and outperformed a larger model (see the filtering sketch after this list). More details in the DATACOMP: In search of the next generation of multimodal datasets (Gadre et al, 2023) paper.
  • New Introductions from Snorkel AI (Alex Ratner) -- second day keynote where Alex formally announced Snorkel Foundry and GenFlow, among other things, some of which were repeats from the previous day's keynote.
  • Transforming the Customer Experience with AI: Wayfair's Data Centric Way (Archana Sapkota and Vinny DeGenova) -- this was a really cool presentation, showing how they labeled their product images programmatically with Snorkel for design, pattern, shape and theme, and used that to fine-tune a CLIP model, which they now use in their search pipeline. More info about this work in this blog post.
  • Tackling advanced classification with Snorkel Flow (Angela Fox and Vincent Chen) -- the two big use cases where people leverage Snorkel are document classification and sequence labeling. Here they discuss several strategies for multi-label and single-label document classification.
  • Accelerating information extraction with data-centric iteration (John Smardijan and Vincent Chen) -- this presentation includes a demo of Snorkel Flow labeling documents with keywords for a specific use case (for which off-the-shelf NER models do not exist). The demo shows how one can rapidly reach a good score (precision and coverage) by iterating: create and apply an LF, train and evaluate a model on the labels it produces, do error analysis, write another LF to correct the issues surfaced, and repeat until the desired metrics are reached. They called this the Data-Model flywheel.
  • Applying Weak Supervision and Foundation Models for Computer Vision (Ravi Teja Mullapudi) -- talked about using Snorkel for image classification, including a really cool demo of Snorkel Periscope (an internal Labs tool) applied to satellite data, building classifiers that look for images of a particular type using UMAP visualizations and cosine-similarity distributions.
  • Leveraging Data-Centric AI for Document Intelligence and PDF Extraction (Ashwini Ramamoorthy) -- a talk about information extraction from PDF documents, similar to the one listed earlier; as with that one, Ashwini shared a huge amount of practical information that I found very useful.
  • Leveraging Foundation Models and LLMs for Enterprise Grade NLP (Kristina Lipchin) -- a slightly high-level but very interesting take on FMs from a product manager's viewpoint; it echoes many of the same ideas about last-mile handling covered in earlier talks, but identifies Domain Adaptation and Distillation as the primary use cases for most organizations.
  • Lessons from a year with Snorkel Data-Centric with SMEs and Georgetown (James Dunham) -- this is a hugely informative talk about Georgetown University's experience with using Snorkel Flow for a year. Their domain experts adapted to it readily and loved the experience, and both data scientists and domain experts benefited from it. Some major benefits noted: the ability to ramp up labeling efforts faster and with less risk, since it is easier to iterate on labels (adding/removing/merging classes, etc.) as your understanding of the data grows; the ability to fail fast without too much sunk cost; and an overall lowering of project risk. If you are contemplating purchasing a Snorkel Flow subscription, this talk provides lots of useful information.
  • Fireside chat: Building RedPajama (Ce Zhang and Braden Hancock) -- RedPajama is an open-source initiative to produce a clean-room reimplementation of the popular LLaMA FM from Meta. The focus is on carefully replicating Meta's dataset recipe, but using open-source documents, and training base and instruction-tuned versions of the LLaMA model on this data, which does not block commercial adoption. Ce is a co-founder of Together Computer, the company behind RedPajama, and Braden and Ce discuss the work that has been done so far in this project.
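
On the DataComp talk above: the "simple quality filter" referenced there boils down to keeping only image-text pairs whose CLIP image and text embeddings agree. Here is a hedged sketch of that idea over precomputed embeddings (the 0.3 threshold is illustrative, not the paper's exact value; real embeddings would come from a CLIP encoder such as open_clip):

```python
import numpy as np

def clip_filter(img_embs: np.ndarray, txt_embs: np.ndarray,
                threshold: float = 0.3) -> np.ndarray:
    """Return indices of image-text pairs whose embeddings have
    cosine similarity >= threshold."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    sims = np.sum(img * txt, axis=1)          # cosine similarity per pair
    return np.nonzero(sims >= threshold)[0]   # keep only well-aligned pairs

# Toy usage: random vectors standing in for real CLIP embeddings
rng = np.random.default_rng(0)
keep = clip_filter(rng.normal(size=(8, 512)), rng.normal(size=(8, 512)))
```

The counterintuitive payoff, per the talk, is that the smaller filtered dataset trains faster and yields a better model than the larger noisy one.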

In many cases, it is not the lack of data, but the lack of labeled data, that is the major hurdle to Machine Learning adoption within a company. Snorkel's support for weak supervision provides a practical path to generating labels using a programmatic approach. As someone who came to Machine Learning from Search, where featurization is basically TF-IDF (and, more lately, a trained tokenizer feeding a neural model), I was initially not particularly skilled at detecting features in data. However, over time, as I started looking at data, initially for error analysis and later for feature extraction in cases where labels were not available a priori, the process has become easier, so hopefully my next experience with Snorkel will be smoother. Furthermore, Snorkel's focus on FMs also provides a path to harnessing this powerful new resource as an additional source of weak supervision.
