Great Data Products


A podcast about the ergonomics and craft of data. Brought to you by Source Cooperative.

→ Episode 1: Why LLM Progress is Getting Harder



Show notes

Jed Sundwall and Drew Breunig explore why LLM progress is getting harder by examining the foundational data products that powered AI breakthroughs. They discuss how we’ve consumed the “low-hanging fruit” of internet data and graphics innovations, and what this means for the future of AI development.

The conversation traces three datasets that shaped AI: MNIST (1994), the handwritten-digit dataset that became machine learning’s “Hello World”; ImageNet (2008), Fei-Fei Li’s image dataset that launched deep learning through AlexNet’s 2012 breakthrough; and Common Crawl (2007), Gil Elbaz’s web-crawling project that supplied 60% of GPT-3’s training data. Drew argues that great data products create ecosystems around themselves, using the Enron email dataset as an example of how a single data release can generate thousands of research papers and enable countless startups. The episode concludes with a discussion of benchmarks as modern data products and the challenge of creating sustainable data infrastructure for the next generation of AI systems.

Key Takeaways

  1. Great data products create ecosystems - They don’t just provide data; they enable entire communities and industries to flourish
  2. Benchmarks are data products with intent - They encode values and shape the direction of AI development
  3. We’ve consumed the easy wins - The internet and graphics innovations that powered early AI breakthroughs are largely exhausted
  4. The future is specialized - Progress will come from domain-specific datasets, benchmarks, and applications rather than general models
  5. Data markets need new models - Traditional approaches to data sharing may not work in the AI era
