Great Data Products


A podcast about the ergonomics and craft of data. Brought to you by Source Cooperative. Subscribe ↓


→ Episode 1: Why LLM Progress is Getting Harder



Show notes

Jed Sundwall and Drew Breunig explore why LLM progress is getting harder by examining the foundational data products that powered AI breakthroughs. They discuss how we’ve consumed the “low-hanging fruit” of internet data and graphics innovations, and what this means for the future of AI development.

The conversation traces three datasets that shaped AI: MNIST (1994), the handwritten digits dataset that became machine learning’s “Hello World”; ImageNet (2008), Fei-Fei Li’s image dataset that launched deep learning through AlexNet’s 2012 breakthrough; and Common Crawl (2007), Gil Elbaz’s web crawling project that fueled 60% of GPT-3’s training data. Drew argues that great data products create ecosystems around themselves, using the Enron email dataset as an example of how a single data release can generate thousands of research papers and enable countless startups. The episode concludes with a discussion of benchmarks as modern data products and the challenge of creating sustainable data infrastructure for the next generation of AI systems.

Key Takeaways

  1. Great data products create ecosystems - They don’t just provide data, they enable entire communities and industries to flourish
  2. Benchmarks are data products with intent - They encode values and shape the direction of AI development
  3. We’ve consumed the easy wins - The internet and graphics innovations that powered early AI breakthroughs are largely exhausted
  4. The future is specialized - Progress will come from domain-specific datasets, benchmarks, and applications rather than general models
  5. Data markets need new models - Traditional approaches to data sharing may not work in the AI era

Transcript

(this is an auto-generated transcript and may contain errors)

Jed Sundwall (01:00.661)

All right, well, Drew, welcome to Great Data Products, episode one. Thanks for doing this with us.

Drew Breunig (01:11.182)

Not a problem.

Jed Sundwall (01:12.537)

Yeah, as I said, I'm going to ask you to introduce yourself in a second, but before I do, I just want to explain a little bit about why we started this podcast, which is that we believe that

understanding what makes a good data product is just very understudied. We've been doing it as a species for a while now, every now and then sharing data. There have been laws on the books saying, you know, thou shalt open your data, or policies from research funders saying that researchers need to open their data. Sometimes it goes well and sometimes nothing really happens with it. And I think we have enough experience under our belts now that we can see there are a handful of data products that have come out

that have had a huge impact on research. And we're at the point where we've got to figure out why. Why those? What made them good? Elinor Ostrom said this somewhat famously, at least for me; I'm a big fan of hers. She spent all of her life working on trying to understand how people share common resources that are limited, like a fishery or a forest or grazing fields and things like that.

And she's like, look, we know this happens. Humans have figured out how to do this. We know it works in practice; now we have to figure out how it works in theory. And I love that. So that's what we're doing: trying to figure it out. We know that some data products are really great, and we want to tease out some theories to explain why. So, for reasons that are obvious to me but might not be obvious to everybody tuning in or listening, you were one of the first people I wanted to talk to about this. So could you explain a little bit about your

background and what you do.

Drew Breunig (02:57.574)

Yeah, first I want to put a pin in that quote you said, because I think one of the things that's crazy about that is that a fishery is a zero-sum game. That is an exhaustible resource. Data products have entirely different dynamics. You can go full old-school Boing Boing, Cory Doctorow, data-wants-to-be-free, it's-not-theft-if-you-can-reproduce-it, but at the same time, it grants you this immense advantage

that then allows you to create more data in a way that isn't free. It's kind of, anyway. So yeah, my name is Drew Breunig. I write a bunch on AI and data. I've been working in data. I helped run, or ran, data science and products at a company called PlaceIQ for about a decade. And then led strategy at Precisely when it came to the data and

Jed Sundwall (03:30.325)

Right. Yeah.

Drew Breunig (03:55.15)

intelligence business. I see data as a really interesting space because it's an intersection between humans and compute, essentially. You're essentially converting humans, or the work of humans, or observations made by humans, into something that is programmatically readable, so you can build products upon it. And I also think the other thing that's interesting about that is that it's not a one-way street.

It’s a two-way street. So you are converting humans into data, but at the same time you’re preparing data and figuring out how it can be leveraged to inform those humans. So kind of making data human, making humans data. And that is an active negotiation of borderlands as it were, rather than just one way that comes in and goes out.

Jed Sundwall (04:47.425)

Oh man, fantastic. All right. This is a rich well to draw on. Yeah. And what you just said about Cory Doctorow and sort of the economics of this, I think, and I'll just keep saying it out loud over and over again, this is the Nobel Prize challenge: can we figure out how data functions as a market good? Because it's weird, right? Like to your point about

Drew Breunig (04:52.844)

Yeah, we can go, but.

Jed Sundwall (05:13.909)

What Ostrom was studying was, limited resources, which she called the common pool resources, but with the assumption that they were, they were limited and you needed governance to manage access to them. And just to, yeah, just quick primer on Ostrom for a lot of people. And I’m not like a full Ostrom scholar, but like a lot of what made that work was the fact that like, you had to live with the people that you shared the resource with. And so if you were a jerk about it, like you would get punched. Like that’s just part of it. And yeah.

Drew Breunig (05:35.938)

Yeah.

Drew Breunig (05:42.006)

Yeah, I mean, I guess you can kind of say that exists when it comes to licenses, which is a whole different messy world, which is like, god, please don't. So much of my beef with licenses is that they impose the will of people when the data wants to be free. And the real way that you can kind of put your fingerprint on the market is you actually put the data out there in the shape and the form that is

Jed Sundwall (05:47.783)

yeah, we’re going to talk about licenses.

Drew Breunig (06:10.286)

what you want that makes what you want in the world to happen. But the idea of releasing it and then gating it is just insane to me. It doesn’t make any sense. It’s backwards. You kind of want the option, but you want to control how people use it, which is just like, why are you even bothering in the first place? But yeah, and I think that’s like, now you’re getting into the familiar terrain of like the data is the new oil claims and other things like that. And I feel like that’s a quote even that we debated and wandered around.

and talked about for decades. And part of the reason we talked about it is because it made people who work in data feel important. It made them feel like, this justifies my paycheck, my job title, my power within the organization. But I don't really feel we got to the point where "data is the new oil" became somewhat true until LLMs and post-ChatGPT,

Jed Sundwall (06:46.505)

yeah. yeah!

Drew Breunig (07:05.71)

specifically. Those were the engines needed. It's like, you can create oil, but if no one owns an engine, no one has anything they can do with it. That's kind of the era we were in while we were figuring it out. We could drill it, we knew there was potential energy there, but what do we actually turn it into? And prior to that, there was one thing you turned it into, which was ad products. That was the one thing you turned it into. That was the way to monetize. And now we're turning it into large language models and other things like that.

Jed Sundwall (07:25.889)

Right.

Drew Breunig (07:35.104)

Figuring out the economics of it, I believe, is hard. Because, I don't know, I think one of the things is you can find so many different metaphors for this, because it's a complex thing and no single bucket kind of reins it in. But I do think one of the king metaphors for data is that it's the platypus. Because, well, what is a platypus, Jed?

Jed Sundwall (07:56.021)

Go? Go on.

It’s all sorts of crazy stuff.

Drew Breunig (08:03.404)

Yeah, it’s got a bill. It’s poisonous, lays eggs, mammal, it’s got fur. Yeah, it’s like that’s that’s data. Like it sometimes you can, you can make it like oil. Other times you can make it like a lighthouse, which is like a public good that makes it so ships don’t crash. And you can put it at the right

Jed Sundwall (08:07.755)

Yeah. Lactates in a really weird way. Yeah. Sure.

Jed Sundwall (08:25.493)

Mm-hmm.

Drew Breunig (08:29.998)

point that encourages very specific trade routes to occur and economic activity to occur. And so you influence the world by putting it out there. And because it's a public good that can't be gated, it became something governments did. And you could make the same argument, and you can also find countless other metaphors for data as you kind of run into them. But I do think when you put a data product in the world,

getting towards the definition of what this podcast is, a great data product creates an ecosystem around itself, I think is the way I would say it. And I would say like, perhaps, and this can happen intentionally, it can also happen accidentally. And so by way of kicking this off, like I almost wanna pose to you,

Jed Sundwall (09:05.099)

Yes, yeah.

Drew Breunig (09:26.476)

what I think is the best data product ever created, or one of them, the Enron email data set. Are you familiar with this one?

Jed Sundwall (09:30.081)

Let’s go.

Jed Sundwall (09:34.85)

Ah, I am. Because, so just a flashback here: when I joined AWS in 2014 to build the open data program there, AWS already had this thing called the Public Datasets Program, which sort of preceded me. AWS had already been dabbling in sharing open data, but there was no kind of program around it, and this program was somewhat abandoned. But

Drew Breunig (09:51.266)

Yes.

Jed Sundwall (10:04.021)

how it had been set up was using Elastic Block Store volumes. So this is data that, to access it, you had to turn on EC2. You had to turn on a server and then attach one of these volumes to that server. Then you could access it. We had all these EBS snapshots, these volumes of data that you could load up. One of them was the Enron email database, but there were some other funny ones: there was a cannabis genome,

maybe the Marvel Cinematic Universe, there’s something to do with like, it was like a graph database of like Marvel characters or something like that. And some Japanese census data that someone found. And it was just, it was kind of this fascinating snapshot. I’m sure there’s plenty of like internet archive screenshots of the site. It was just sort of like, here’s some random data that engineers at AWS found circa 2012. But yeah, the Enron database was in there. So go on, let’s talk about it.
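(A quick aside for anyone curious what that EBS-era workflow actually looked like: a minimal sketch in Python with boto3, restoring a public dataset snapshot to a volume and attaching it to an instance. The snapshot and instance IDs are hypothetical placeholders, not identifiers from the old program.)

    # Sketch of the old Public Datasets access pattern: restore the dataset's
    # EBS snapshot to a volume, attach it to a running EC2 instance, then mount it.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # 1. Create a volume from the dataset snapshot (placeholder ID).
    volume = ec2.create_volume(
        SnapshotId="snap-0123456789abcdef0",   # hypothetical dataset snapshot
        AvailabilityZone="us-east-1a",
    )

    # 2. Attach it to your own server (placeholder instance ID).
    ec2.attach_volume(
        VolumeId=volume["VolumeId"],
        InstanceId="i-0123456789abcdef0",      # hypothetical EC2 instance
        Device="/dev/sdf",
    )
    # From there you would SSH in, mount /dev/sdf, and browse the data.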

Drew Breunig (10:58.509)

Yeah.

Drew Breunig (11:01.976)

Well, I just think Enron email database, so for those of you who aren’t familiar with the Enron email database, so Enron was a company that blew up in spectacular fashion. When did it blow up? Like 2001, 2002?

Jed Sundwall (11:19.013)

And you’re talking blow up pejoratively, like it was catastrophic.

Drew Breunig (11:21.612)

Yes. Yes, it was not a physical literal blow up. It was just a mountain of fraud. when the case kind of, there was a ton of public anger. A lot of people lost their pensions, a lot of people lost their stock, and effectively it went to zero and gets taken over. And it was a big company. In 2003, as part of the court proceedings,

Jed Sundwall (11:29.077)

Yeah. Yeah.

Drew Breunig (11:51.406)

I think it was the Federal Energy Regulatory Commission released the emails from about 150 senior Enron executives. So this is about 1.6 million emails that get released. And this is 2003, keep in mind. That is an amount of email that would be out of reach for most people because

it's just incredibly hard to download. Though putting it in AWS, I'm sure, made it very popular. If you search "Enron email dataset MapReduce," you will find hundreds and hundreds and hundreds of tutorials. And so it became this incredibly popular data set that people wrote papers about, about internal dynamics of workplace culture and language. I think at one point there were like 30,000 papers a year

that were citing this. And when I checked Google Scholar, it maxed out; it was over 100k. Then you start to look at the companies that were booted up around it. I know multiple startups who started building email software or enterprise SaaS software that would start with the Enron email data set. You would start with it to kind of build your products around it, because there was no other email data set. Even today, you see it used in AI evals and pipelines.
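(The MapReduce tutorials Drew mentions almost always start with a word count. Here is a minimal single-machine sketch of that map/reduce pattern over the Enron corpus, assuming the maildir-style dump has been extracted to a local maildir/ directory.)

    # Map each email to (word, 1) pairs, then reduce the pairs into counts --
    # the same shape as the classic Hadoop demo, just in one process.
    from collections import Counter
    from email import message_from_string
    from pathlib import Path

    def map_email(raw):
        # Map step: emit (word, 1) for each word in the message body.
        msg = message_from_string(raw)
        body = "" if msg.is_multipart() else str(msg.get_payload())
        for word in body.lower().split():
            yield word.strip(".,!?\"'()"), 1

    def reduce_counts(pairs):
        # Reduce step: sum the 1s per word.
        counts = Counter()
        for word, n in pairs:
            if word:
                counts[word] += n
        return counts

    pairs = (
        pair
        for path in Path("maildir").rglob("*")
        if path.is_file()
        for pair in map_email(path.read_text(errors="ignore"))
    )
    print(reduce_counts(pairs).most_common(20))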

Jed Sundwall (13:17.536)

Interesting.

Drew Breunig (13:18.408)

It's just this, it's the only large email data set that is friendly licensed, free to use. And it has generated an immense amount; I think it would be a very fun study for someone to calculate the economic benefits of this email release from this absolutely failed company and how much it generated. So to me, that has the qualities of a great data product, which is: it provides data that didn't exist anywhere else. So, one,

there was no competing offering, and any competing offering was just a minuscule, minuscule amount. Two, it has legs. We are, I want to say, 22 years since the release and it remains as relevant as ever. It was freely available and accessible and easy to work with despite its size. It was a very common MapReduce demo, as I said, which would be the first step you would take in dealing with it.

And it created an ecosystem around it, which I think is the biggest test for good data: do things grow out of it? And so it's kind of like, I was at the Monterey Bay Aquarium this weekend and they had an exhibit on whale falls: when a whale dies, it goes to the bottom and starts to decompose, and all of the critters and everything come to eat it. It's this feasting moment. The Enron email data set was the equivalent of a whale fall.

Jed Sundwall (14:42.699)

Yeah.

Very juicy. mean, yeah, so much, so much material in there. No, I love this. I you’re making me think about, I have this white paper that will come out eventually. I’ve been working on it for way too long. I may have mentioned this to you, but it’s called emergent standards where I make the case that the web is an engine for people to come up with new standards. and so basically like the way, like the server client dynamic of the web is that like,

If you have a server and a client that can talk to each other in a way that makes sense to one another, it works. It's worked with HTML, and then, you know, we've figured out other ways to send more and more complex things over it. Including, and what I talk about in the paper is, RSS, where we wanted to figure out how do we syndicate stuff to one another; STAC catalogs; and, what's the other one, GTFS, the General Transit Feed Specification. And basically

Drew Breunig (15:17.656)

Yeah.

Jed Sundwall (15:41.602)

what people don't understand, or what a lot of people in policy don't understand, is that this is an emergent thing that happens as communities grow around types of data. So I'm agreeing with you, but one conclusion I try to land on in that white paper is that this is effectively like language. If it's useful to people, it will be adopted. Right. And so to your point about this collection of emails,

Drew Breunig (16:05.55)

Yeah.

Jed Sundwall (16:10.109)

It’s practically useful in a way to a lot of people. so people have adopted it and it’s become a thing.
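(Part of why a standard like GTFS spreads the way Jed describes is that a feed is just a zip of plain CSV files with agreed-upon names and columns, so any client can read it. A minimal sketch; the feed URL is a hypothetical placeholder for any transit agency's published feed.)

    import io
    import urllib.request
    import zipfile

    import pandas as pd

    FEED_URL = "https://example.com/transit/gtfs.zip"   # hypothetical agency feed

    with urllib.request.urlopen(FEED_URL) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))

    # stops.txt is one of the required GTFS files; its columns are standardized,
    # which is exactly what makes the format an emergent standard.
    stops = pd.read_csv(archive.open("stops.txt"))
    print(stops[["stop_id", "stop_name", "stop_lat", "stop_lon"]].head())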

Drew Breunig (16:16.265)

Yeah, and I think the other thing too is that it's so much easier to create that standard or have a successful data set if you're operating in the white space where it doesn't exist. So, I work on the Overture Maps Foundation, as you know, and that's a little bit of hard mode because you're competing with a lot, trying to establish a standard where standards already exist to some degree.

Like, OpenStreetMap is really built more to be a map rather than a data set. So it doesn't have great data standards for easy data usage. It's starting to adopt a lot of the moves that we've made at Overture, but at the same time it exists, it provides an alternative. And so it means we have to be that much better. Whereas with the Enron dataset, there's still no replacement for it. I was just looking at the Pile. The Pile is a big data set. That's about,

what is it? It's about 900 gigabytes, and it was used to train Llama. It's used to train lots of open models. We can assume it's being used to train closed models as well. Again, it's what, 900 gigabytes, and the Enron emails are still in there. They're still one of like 25 sources. There is no better

Jed Sundwall (17:42.305)

Amazing.

Drew Breunig (17:43.82)

email dataset. So operating in the white space means you get more rein to create those standards as you go.

Jed Sundwall (17:51.638)

Right. Interesting. Give me one interlude here. We have to, we got some technical difficulties. We’ve got to make sure the YouTube live stream is working or the chat is working. It’s apparently disabled, I’m going to, I’m going to do a thing. Hey everybody. mean, there are people on YouTube. I’m going to click on something and I don’t know what’s going to happen.

Jed Sundwall (18:21.183)

Now I’m like delayed. I’m watching myself on YouTube with the delay.

Jed Sundwall (18:31.657)

Okay, I think it works.

Drew Breunig (18:34.711)

You got it?

Jed Sundwall (18:36.811)

I think so. All right. Now how do I get out of here?

Drew Breunig (18:41.998)

I mean, look, you got your first episode here.

Jed Sundwall (18:45.441)

All right, we did it. No, we’re good, we’re good. We got, can see, it’s like all my friends. It’s like, this is so great. This is like romper room. Like, I don’t know if you ever watched that. It like a show and I was like really little and it’s like, yeah, it’s like, I see Alex and Camilla and Linda. This is good. Okay, so we’re good. So hold on, I do wanna talk more about the white space though.

Drew Breunig (18:49.077)

Nice.

Drew Breunig (18:54.22)

Ha ha ha

Drew Breunig (18:59.116)

Yeah, you can wave goodbye to them and you can’t hear us back.

Jed Sundwall (19:13.026)

define it more. You’re just saying like creating an entirely new kind of data product or working in entirely new domain.

Drew Breunig (19:17.088)

Yeah, well, I mean, I just think there are some things where, and I think you see this a lot in culture and technology too, which is: if you're the first to come out, you have a longer shelf life than the best thing, which may come out later, technically. And so you have more ability to shape the standard, which is hard and a lot of pressure, because you can sit there and think about it forever, or you can just release it

and then evolve it quickly as they come. But it’s hard when it’s a dataset because you release it and then it ceases. It’s the whale fall moment. You don’t get to go back and rebuild the whale and then drop it again.

Jed Sundwall (19:53.814)

Yeah. No, well, and this goes back to like what I was saying about like the Nobel Prize challenge of like, what are the economics of data? And I think you know this under working at Overture. It is expensive to produce good data. I cut my teeth.

Drew Breunig (20:08.546)

very expensive. It’s expensive to maintain good data too. I think like one of the things that like allow for longevity of these data sets are things where you don’t need that timeliness. Like it’s okay that the people in the Enron email data set are not still emailing and we aren’t still capturing those emails for the last 23 years. Because that’s not the function of that data set. It is a demonstration of how people use email rather than

And there's been no competitor. Whereas if someone came out and said, I'm going to make a business of selling select emails so people can see them, well, we aren't going to see that there, but we do see it in other spaces.

Jed Sundwall (20:50.209)

Yeah. Well, let's talk about this for a little bit, the shape of a data product. They can take on many shapes, right? So, my first job out of grad school. My life story is I studied foreign policy. I got a master's in foreign policy, thought I was going to work for the State Department. I wanted to be a diplomat, and I was like, I grew up in DC.

It had no appeal to me, like it had no luster to it. So I was like, actually, I just want to work on the internet. Like I had what I’ve called like a coming out process in 2006 where I was like, I care about the internet, like, and I don’t care who knows, like, this is just who I am and worked. So I took a job as a marketing enthusiast at eventful.com, which was like a web 2.0 company.

Drew Breunig (21:32.844)

Well, I mean, look, that’s a title that comes in the Web 2.0 era. Marketing enthusiasts. That was a special time for titles.

Jed Sundwall (21:36.62)

True. Yeah. Yeah. It’s like not a ninja. Like, like definitely like a amateur. Yeah. just an enthusiast, but it was my foot in the door and it ultimately I think was a very good decision. But what eventful did, there’s a site called eventful that was like, they gathered all the world’s events data that they could find by scraping websites and getting access to feeds and then standardizing it and making available via an API.

Drew Breunig (21:42.638)

Not a rock star, not a ninja. Yeah. Just an enthusiast.

Jed Sundwall (22:04.553)

And what we learned very painfully was like, this is very expensive and the bulk of our database becomes useless every day. yeah. Yeah.

Drew Breunig (22:11.318)

Yeah, no, exactly. That’s like the opposite, which is like it’s event data. It’s just gone. It’s done. And I think you see other people who have to struggle with this as well and try to figure it out, which was like, I think satellite imagery providers, you and I know many cases where like there’s several satellite imagery companies who are like, trying to figure out how to build a product that makes their old data valuable.

Jed Sundwall (22:16.063)

Mm-hmm. Yeah.

Drew Breunig (22:40.172)

because right now most satellite imagery providers are, their stuff is valuable because it gives you that snapshot of what’s going on right now. But they want to figure out everything else. And like, you’re not gonna crack that at Eventful. You’re not gonna crack that at, you know, anything that is, you know, temporal in nature.

Jed Sundwall (22:58.357)

Yeah. Yeah. Well, it actually, so this is, this is actually very timely. This, Antoine on the, on the chat, I love this is asking like, what about, you know, what about Freebase? This is the issue. It’s like, what about Eventful? Like Eventful never pretended to be an open data resource. was doing the hard work of taking a lot of open data or data that was like, you know, small enough that it didn’t feel like we were just ripping people off because also we were like,

Drew Breunig (23:05.427)

Yeah.

Drew Breunig (23:19.032)

Yeah.

Jed Sundwall (23:27.083)

highlighting events that people wanted to highlight, but then assembling it into a white-pages-like product, a huge compendium where the product is: we have everything in one place, and then we sell access to it. Long story short, I don't think Eventful exists anymore. The problem it solved has been solved in other ways, and anyway, events are still kind of a difficult space to aggregate. But Freebase, awesome example from around the same era. I think it was started in

Drew Breunig (23:42.454)

No, it doesn’t.

Jed Sundwall (23:56.514)

hang on, I just looked up the Wikipedia page. was launched in 2007. 2007, for what it’s worth, the year that AWS announces its first service and the year that the iPhone is announced. It’s a very consequential year. Very heady days of Web 2.0, like seeing what the internet can become. And so…

Drew Breunig (24:12.718)

Yeah.

succeeded, according to Wikipedia, by Wikidata.

Jed Sundwall (24:20.277)

Yeah. So, so the, mean, there’s, there’s room for this WikiData. think people like it. It seems good in some ways. I’ve never really relied on it very much, but.

Drew Breunig (24:30.082)

Well, Wikidata is a good example of the importance of data UX. One of the things that was so nice about Freebase, and it's kind of what Overture tries to do with its GERS identifiers, is that for everything there would be an entity that you could then walk. Like, here's an entity for Jed; now we can find everything attached to Jed. And yeah, I think Wikidata is kind of sneakily one of the best

crosswalks on the web. I think they track over 800 different crosswalk identifiers, like Apple Maps ID, Google Maps ID, a lot of federal IDs and everything else. And it is fairly successful. Its API, I think, has a little learning curve. And when trying to build products off of it, it's incredibly good for crosswalking data, though oftentimes you have to jump through a few

hurdles to get the data down for that crosswalk. But again, that's like a whale fall. It's the same thing, which is, once Google walked away, it allowed Wikidata to exist in a way and utilize the Freebase data as its core. But then it had to kind of supply the revenue, or at least the donation model, to keep it going.
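(A minimal sketch of the crosswalk idea Drew describes: pull one Wikidata entity and list the external identifiers attached to it. Q90, Paris, is used purely as an illustrative entity; any QID works the same way.)

    import json
    import urllib.request

    QID = "Q90"   # Paris, chosen only as an example entity
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{QID}.json"

    with urllib.request.urlopen(url) as resp:
        entity = json.load(resp)["entities"][QID]

    # Claims are keyed by property; "external-id" properties are the crosswalk
    # to other systems (library IDs, map provider IDs, government registries...).
    for prop, claims in entity["claims"].items():
        for claim in claims:
            snak = claim["mainsnak"]
            if snak.get("datatype") == "external-id" and "datavalue" in snak:
                print(prop, snak["datavalue"]["value"])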

Jed Sundwall (25:56.13)

Right. And it all goes back to the fact that this is expensive and hard. These days, in the year 2025, there's a lot of concern about sources of data that we had long thought were kind of unimpeachable and were going to be reliably provided by governments. And that's just no longer a safe assumption to make.

Drew Breunig (26:00.6)

Yes.

Jed Sundwall (26:18.365)

And I've actually been a voice, you know, shouting into the void for years that this was never a safe assumption to make, that we need to think a lot harder about this kind of infrastructure. Because it's hard. It's expensive to produce. And if we could figure out the economics of it and have better markets for data, I think we would have more data. One of the hard things to grapple with here, though, is that nothing is free, and

What you were saying before about the difference between like a fishery and a dataset is that like, there’s this phenomenon that I chalk this up to what’s called nano economics, which is like the economics of like individual, like very small transactions. so if you examine like voting behavior, people are like, my vote, like, how could it possibly count? It doesn’t matter, but like votes do matter, right? And like,

Drew Breunig (27:13.847)

Yeah.

Jed Sundwall (27:16.137)

We don’t perceive the emissions that we create by living our lives, but like they obviously add up. And so same thing, like Wikipedia, it feels free to open up an article on Wikipedia, kind of to all involved. Like Wikipedia itself doesn’t really register one page load. And it’s, certainly seems free to you, but Jimmy Wales is going to ask you, he’s going to nag you to donate because they need money. Like, yeah.

Drew Breunig (27:41.63)

Yeah. And, and, and I think there’s also the flip side to that as well, which is something that we see. So during the, when the advertising ecosystem was the way you monetize data, I’m sure many people talk to you about like the dream that everybody wanted to figure out is, how can we, we’ve, I’ve solved the privacy problem in advertising. I’m going to create a system where people can opt in to share their data and they get paid for

Countless, I know countless companies or people who dreamed of trying to figure this out, because they're like, look, people get real value when they sell their data. The advertising ecosystem is incredibly huge. The problem is that your data on its own is worth nothing, absolutely nothing. It's worth something in aggregate, but

Jed Sundwall (28:34.773)

Nothing. Yeah, exactly.

Drew Breunig (28:40.744)

nothing by itself. And so people would make runs at this: we're a co-op, we get to band together, you try to get some economic innovation, like, okay, you have a longer timeline, take advantage of compound interest, all these other things. But it's kind of the same thing, which is: your usage of Wikipedia is a rounding error, but it's expensive. And the value of the data you create

is a rounding error. And we saw this during the ad era and we’re seeing it again. There was, what’s the mobile phone network that launched a couple of days ago where it’s like, we get training data on all your calls. And so you get cheaper voicemail or cheaper phone service.

Jed Sundwall (29:23.626)

Whoa.

How about this one? fascinating. Tons of people are going to sign up.

Drew Breunig (29:28.278)

Yes. I don’t think, but again, like I can’t, like I haven’t looked at the cost. It can’t be high. Like, like how much of a discount can it actually apply? I’m looking it up because I want to see, I just saw it. Cause it’s, it’s way easier for someone like Meta or Google to just give you the service and the service is predicated on sharing data. But we will just never see that go away.

Jed Sundwall (29:35.947)

Yeah, right.

Jed Sundwall (29:56.726)

No, no, because in aggregate it’s just too, too powerful, too seductive and they provide really good services. Yeah.

Drew Breunig (30:00.95)

And now we're seeing the flip side of this with the Anthropic case right now. How much per book was that settlement? It was like $3,000 per book, which, if you're an author, $3,000 for a book, for a lot of authors that's going to be a lot; for a lot of authors it is not going to be. But it is more than you would expect. And they're going back to the well, because the judge took away the settlement.

Jed Sundwall (30:19.659)

Yeah, yeah.

Drew Breunig (30:27.084)

And so we'll see where that does net out. I do think trying to figure out the cost in training is hard. I don't know about the idea of opting into training; I think you're going to get applications that rise up too quickly that are just going to take your training data. So ChatGPT, Anthropic, they just asked everybody to re-opt in, change their privacy settings, because they're going to be training on that. Meta always has, always will. And so

Jed Sundwall (30:49.582)

interesting.

Drew Breunig (30:57.818)

how are you going to create an ecosystem to pay people within that? They’re just going to go use these services and kind of knock it out. So, I don’t know.

Jed Sundwall (31:07.881)

Amazing. Okay. Well, let’s, let’s, let’s shift to your blog post now then, cause let’s talk about large language models, talking about Anthropic, and the basis of these things. So you, in your blog post, which I highly recommend, it’s, it’s in the, whatever we linked to it when people registered for the thing, you can put it in the chat. but great overview of, these three data products. And, and again, this is another sort of chance for us to talk about.

what is a data product. So let’s start with the beginning and talk about MNIST. Yeah.

Drew Breunig (31:43.2)

Yeah, so one of the reasons I think large language models and AI in general are the fulfillment of data is the new oil is because previously, if you wanted to write a computer program, you had to worry or make a computer program, we really had to worry about three things. You always worry about your software and your hardware. Actually, two things really. That’s it. Just write my software, run it on hardware. I’m done.

With machine learning, deep learning, and now what we call AI and all those subsets of it, you have to have software, hardware, and then data. The data bit is non-negotiable. You need the data because the way machine learning and deep learning works is rather than having the programmer write the rules for what the program does,

You take a sufficient volume of data and present it to a computer program for making machine learning or deep learning models. You give it instructions and you ask it to interpret the patterns in the data and figure it out for itself. And with deep learning, there's even another layer on top of that, which is it figures it out without you even telling it what to pay attention to. You aren't labeling it. You aren't telling it. It's just: here's a pile of data, go find the patterns.

Now in the early days, there wasn’t a lot of data because think about it this way, which is if you were an early adopter of computers, let’s say to 1994 in this case, you would go to the computer store, you buy your computer, you bring it home, you plug it in. And that was that. If you got any data into your computer, it was because you typed it out or you inserted a floppy disk.

that you got in the mail or picked up at the store, maybe a CD-ROM if you were real fancy. That's it. Bottom line, there was no internet connection. There was no downloading. So to acquire data was an incredible exercise. And so as a result, could you build machine learning systems? Not really. You had to have this access to data that you just weren't going to get. So people didn't do that. And so

Drew Breunig (34:06.154)

it wasn't a field. It wasn't a thing. People are going to say neural networks were around back in the seventies, and it's true, but there weren't many who could play with them because the access to the data was so limited. And then what we found, though, and this gets back to the white space, is that really any data that was delivered to your door was brand new data. There was no competition for it.

Like, I don't know about you, but you'd get maybe a CD-ROM in your magazine. What would you get for data? What was the consistency of data? I think the only thing you would have is maybe some Project Gutenberg floppy disks you would pass around, maybe some Encyclopedia Britannica CD-ROMs you would pull out. There wasn't a world of data. And in this environment comes the first data set we're going to talk about, because we're going to explain the history of AI

in three data sets. And the first data set is the MNIST data set, the M-N-I-S-T data set. Now, this data set is on Hugging Face, as you can see. You can install the Hugging Face datasets pip library and download it. And it's also bundled with almost every machine learning library. So if you install TensorFlow

or Keras or whatever the backend, and then you say, load MNIST, it's almost certainly there, because it is the data set that is the effective "Hello World" of machine learning. Because back in 1994, or even longer ago. So what is MNIST? MNIST is a collection of 28 by 28 pixel square images, and they are handwritten letters.

Actually, no, it's digits. It's not even letters, just digits, just numbers. They collected these from two sources: one of them from, I think, Census employees, and the other one from a high school class. So this is a classic case of someone having access to two groups of people; they were getting values out of them, they're writing down numbers, either

Jed Sundwall (36:04.991)

It’s just digits. Yeah. It’s just numbers. Yeah.

Drew Breunig (36:30.942)

doing, filling out forms, filling out tests, and just someone in the right position is like, this could be useful or we’re scanning these anyway. And so they took some time. We don’t really know how this happened. They basically realized, hey, let’s make a data set of handwritten digits. They didn’t put a lot of thought into it or how it might be used for machine learning. Like one of the issues is, when you’re building machine learning systems, you have a test and a train.

subset and you should never mix the data. So your train is what you build your model on. You train, you learn from this, and then you test the quality of the model in your test data set. In the initial distribution, one of those data sets was like the high schoolers and then one of them was the census people, which is a terrible way. You should have it all mixed up and scrambled because you can make some assumptions that the census people may have different handwriting than a bunch of teenagers.

that have had no training. So later they improved this. But again, they put no thought into this. And they decided to distribute it, and distributing was literally burning CD-ROMs. You would get it in the mail; you'd have to order it. And this was the NIST data set, the first one.
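(For anyone following along at home, the modern distribution really is one line away, with the writers mixed and the train and test splits Drew describes already kept separate. A minimal sketch using the Hugging Face datasets library mentioned above, assuming the standard "mnist" dataset id on the Hub.)

    from datasets import load_dataset

    mnist = load_dataset("mnist")       # downloads the standard MNIST distribution
    print(mnist["train"].num_rows)      # 60,000 training images
    print(mnist["test"].num_rows)       # 10,000 held-out test images
    print(mnist["train"][0]["label"])   # each example is a 28x28 image plus a digit label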

Jed Sundwall (37:59.094)

Yeah. So, and again, I think we need to maybe tell people like what NIST is. It’s the national something, something national is. Yeah. Yeah. So the government agency. Yeah.

Drew Breunig (38:04.832)

Institute of Standards and Technology. So, the type of people who would be looking at pictures of numbers, and they're the type of people who think there's something here. Did you ever watch the movie Ed Wood? One of my favorite movies, great movie. You should watch Ed Wood. But there's a scene in the beginning where he's on the studio lot. So Ed Wood is famous as, like, the worst movie director of all time. And he's walking the studio lot and he

Jed Sundwall (38:19.615)

No, I really should.

Drew Breunig (38:34.07)

walks into someone’s office and they’re reviewing the new stock photo, stock video they just shot or stock film they just shot, which they just keep in the studio library to like insert into movies later. And he’s just watching like disconnected random scenes. And he’s like, man, you could make a whole movie out of this. Just like highlighting how bad his taste is. But at the same time, looking at pictures of numbers and saying, we have something here is something you expect from the Bureau of

Standards and Technology. So they put it on CD-ROMs and mailed them out. And one of the people they mailed them to was a computer programmer at Bell Labs, back when Bell Labs was still the institutional research standard. And the guy who got it there was Yann LeCun, who is one of the godfathers of neural networks, one of the AI leaders at Meta.

Jed Sundwall (39:24.683)

Amazing.

Drew Breunig (39:28.054)

He led, kind of, Llama and other things, just released a world model last week. He's just kind of a godfather of this stuff. And he had been working on the problem of trying to recognize numbers, because he worked at Bell Labs. This is something they would want to do: they had to automate, look at mail, look at ZIP codes. That was all it was trying to do: can we point a camera at ZIP codes and automate the entire thing? And so using MNIST, he

trained a neural network, one of the first neural networks, and basically delivered a watershed moment in accuracy. The error rate was down to 0.8. He modified NIST, mixed up the sample sets so it wasn't just high schoolers and census employees, and it became the Hello World. And at its peak, AT&T was using this original neural network software to read more than 10% of all the checks deposited in the US,

which was then software that got sold by Bell Labs. You will find this in almost every machine learning textbook, every deep learning textbook. And part of it was just that it was staged, because once Yann got it, he reformatted the data, and this is touching on a question someone just asked, specifically for his task of training neural networks,

which is why this data set is so valuable and why it’s become this hello world is that you can do a one line install for MNIST data and it’s ready for you to use. It’s segmented into the different data sets. It’s all standardized. The levels of contrast and anti-aliasing, the flipping reversals, all of those things are all ready for it to be used. And it has kind of survived this test of time and enabled the foundation of the very first neural networks. Again,

This is a data set that was distributed on CD-ROM. It was sneakernet. It was mail. And it, I would argue, birthed what would later become our deep learning ecosystem that would lead to AI.
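(A minimal "Hello World" sketch in the spirit of what Drew describes: a small convolutional network trained on MNIST with Keras. The architecture and hyperparameters here are illustrative, not a reconstruction of LeCun's original network.)

    from tensorflow import keras
    from tensorflow.keras import layers

    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train = x_train[..., None] / 255.0   # scale pixels to [0, 1], add channel dim
    x_test = x_test[..., None] / 255.0

    model = keras.Sequential([
        layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D(),
        layers.Conv2D(16, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),   # one output per digit class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))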

Jed Sundwall (41:34.614)

Yeah. No, I mean, this guy, I'm trying to pull it up because I think I have his name right. Donoho, this guy at Harvard, or sorry, at Stanford, David Donoho, wrote this paper that I still have not finished. It's very long. I'm putting it in the chat. But look, Donoho is a smart guy, but the title is a little clickbaity for my tastes. It's "Data Science at the Singularity." Not a terrible title, though. I mean, I think he makes the case that there's something going on here,

Drew Breunig (41:44.429)

on how.

Jed Sundwall (42:04.469)

but he credits LeCun as the godfather. He would agree completely with what you just said. And the gist of what Donoho says in this paper is that machine learning has made the enormous strides it has because its community has adopted a practice of frictionless reproducibility. It's one of these fantastic phrases,

similar to "undifferentiated heavy lifting." It's almost impossible to say, but very useful. But this idea of frictionless reproducibility within the machine learning space, where people have been able to share these great data products, compete around them, going back to your point about a great data product having a community around it, have leaderboards, and it's just been, like, to the moon. And this sort of tees up Alex's question in the chat:

How would we get, for example, environmental data to be seen by AI models? How do we do that? My answer would be, and this is behind everything that we do with this podcast and also with Source Cooperative, that we would improve access to great data products. Like, we would work hard at that. Yeah. Sure. Yeah.

Drew Breunig (43:17.05)

Well, I think there's two steps, which is cheating ahead. But there's a couple things that come in. This idea of reproducibility, that was great in machine learning and deep learning; it's really hard now. I mean, Mira Murati, she left OpenAI and founded Thinking Machines Lab, her own lab, one of the many OpenAI people who have left to found one. And right now they're focused on reproducibility, because it's near impossible because of the probabilistic software and

the way inference works at test time. And so it's almost impossible now, and they're innovating in that sense. But the other thing I would say is, and we'll get to this, but I think the other interesting thing is benchmarks. You don't just need to put the data out there; you need to define the problem and provide the means for testing against it. And so it's not enough to get seen by an AI model,

Jed Sundwall (44:00.076)

Yes.

Drew Breunig (44:16.472)

because guess what? They don't care. They're just gonna go suck up everything else. What you need to worry about is that the people building them have a benchmark to build against, so that now it's the, what's the phrase, a metric becomes a target, it becomes the, exactly. And that's what it is. And this gets back to, I would even say, funding. It's not enough to just be there. You have to

Jed Sundwall (44:32.213)

Metrics become targets. mean, yeah.

Drew Breunig (44:43.714)

you challenge these things and provide a mechanism for measuring success. If you don’t do that, no one’s going to care about it. But yeah, so that’s Yann LeCun. He’s doing his thing with CD-ROMs, sending it out. And it’s crazy to think part of what the internet has done and broadband is it speeds everything up because it makes exchange so much easier. And yes, the test benchmarks need to be actually relevant to the use cases. Yes.

The thing about benchmarks is that they are shipped by people who care about specific things. If you're shipping a benchmark and you don't have an understanding of why it's important and why you care about it, and you don't have some stake in what that is, you're wasting your time. Why are you shipping a benchmark in the first place? The point of putting the benchmark out there is to challenge people to perform against the thing that you care about. And there's lots of

great examples of that.

Jed Sundwall (45:43.778)

Actually, can you help educate me on something I’m like very naive about and this is embarrassing, but I’m just going to be vulnerable on this podcast. is so to Tyler’s point, there’s been my understanding, there’s a lot of discussion about benchmarking with like earth observation, AI, AI models and stuff like that. And, and a gripe is that you can benchmark these things based on some sort of like, you can create like a technical benchmark or something like that, but it is divorced from reality, like from like what’s actually happening on the ground.

And it’s basically, like you can test, you can run a model and then test it to see if it’s performed in a certain way that like indicates that it’s a good model, but that does not indicate if it’s actually useful. Can you explain this to me a little bit more?

Drew Breunig (46:22.509)

Yes.

Yes.

Well, I disagree with that. I think there’s lots of ways you can game benchmarks, but here’s the best way to think about benchmarks in my opinion, is that they are an encapsulation of knowledge with an opinion that allows you to test your performance against that encapsulation of knowledge. Yeah, we’ll talk about overfitting in a second, Joey. That’s very much a thing.

But the problem that I have is that a lot of people in earth sciences, or sciences in general, go to big private companies and say, my thing is really important, you need to build against it. And first off, you have to get them to believe your thing is important. And then B,

they have to get up and running and understand that space really, really, really well. And then they have to build against it and follow it to create their own benchmark. So when you create a benchmark, you are doing that work for them. And when you do that work for them, you get to encode the things you care about.

Drew Breunig (47:52.81)

It comes back to the like, there’s a I think it’s Louis Pasteur quote, which is, give me a laboratory and I will move the world. And he was talking about it in the case of like being able to freeze benchmarks and maintain science or freeze a variable and maintain science. And so if you can create a benchmark, you are creating the eval reality that you are asking that model to be held against. And this happens for lots of things. And so I think right now,

The two most successful benchmarks are the ARC-AGI benchmark, which François Chollet built. He basically said, everybody's talking about AGI, but it's not reasoning, it's really just fact memorization and repetition. He has a different thing, which is all about pattern recognition. It should be incredibly easy for a human to do, but incredibly hard for a model to do. And so that has been

kind of the thing. He has been in the deep learning space for over a decade. He is a leading voice. He created this, and all of a sudden it became the thing that everyone starts to brag about when they get it, because it's really hard. When o1, OpenAI's o1, was the first one to do it even somewhat passably, it was a really big deal. And ever since, we're still kind of chasing it. So both

his design, his leadership, his brand helped set that as this big thing. The other, more tangible example, where you don't have to be a leader in the space, you just found the white space, is a benchmark called Terminal-Bench. Terminal-Bench tests a model's ability to use the terminal, to use tools in the terminal. With coding agents, this is so important.

Jed Sundwall (49:46.017)

Mmm. Yeah.

Drew Breunig (49:47.988)

Why do I care about having MCPs? Why do I care about having all these crazy tool sets? Just teach the model how to use the terminal and all the problems are solved. And this was put out by a really great team, and they designed it in a specific way to basically get the agents they want. They spent a lot of time on this. This is out of Stanford and funded by Laude. And this has now become

the thing that people benchmark against. Like Anthropic, if you look at when their models come out, they will always put the Terminal-Bench benchmark as, like, their top thing. When they bumped Claude Opus from 4 to 4.1, the main thing they cited was their Terminal-Bench improvement. So that's a good example of: I'm creating the package of the reality I want from this. So someone in the chat replied to your…

Jed Sundwall (50:42.027)

Yeah.

Drew Breunig (50:45.216)

Earth observation benchmark, is like, all right, benchmarks are great, but my gripe is that most Earth observation benchmarks, so it’s looking at satellite imagery, they’re focused on object detection. Very few are focused on temporal signatures of change. Well, what that says to me, Tyler, is that’s an opportunity for you to create a benchmark or for someone to create a benchmark to measure this capability that you want to build into this model. A benchmark is a data product.

It is honed, and I think it's kind of the current way that data products are released, or one of the main form factors they can take at this moment. Yes, you have to worry about overfitting. SWE-bench is the software engineering benchmark. Again, it was one of the biggest, one of the first to market: can a model take GitHub issues and submit changes

Jed Sundwall (51:15.2)

Hmm.

Drew Breunig (51:44.382)

and submit PRs that pass? And it was adopted quickly as the main thing people were building against. I talk to AI researchers at foundation model companies and they're like, I'm just trying to get another point on SWE-bench; that is what keeps me up every single day. But again, it has its own shortcomings. Like, 50% of SWE-bench is just the Python Django library. So it's really good at building the Django library, but maybe not very good at some

Rust, or maybe not very good at, you know, some data pipeline you're building. So again, these things shape the outcomes, and communities grow up and private companies grow up. And so that's kind of why I think benchmarks are kind of a modern data product.
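(Stripped to its essentials, a benchmark like the ones Drew describes is a frozen set of tasks with expected answers plus a scoring script. A minimal sketch; the tasks file format and the model_answer function are hypothetical stand-ins for whatever system you want to evaluate.)

    import json

    def model_answer(prompt):
        # Stand-in for a call to the model under evaluation.
        raise NotImplementedError

    def run_benchmark(tasks_path):
        # Score a model against a JSONL file of {"prompt": ..., "expected": ...} tasks.
        with open(tasks_path, encoding="utf-8") as f:
            tasks = [json.loads(line) for line in f]
        correct = sum(
            model_answer(t["prompt"]).strip() == t["expected"].strip() for t in tasks
        )
        return correct / len(tasks)

    # Publishing the frozen tasks file plus this scorer is what "shipping a
    # benchmark" amounts to; the opinions live in which tasks you chose to freeze.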

Jed Sundwall (52:30.913)

Interesting. Okay. There's a lot to think about. I'm looking at the clock. Boy, where are we going? I want to talk about Common Crawl, but also, we did not specify an end time for this, because good podcasts just go off the rails. Do you have a hard stop at the top of the hour? Oh, then okay.

Drew Breunig (52:37.196)

Yeah.

Drew Breunig (52:46.72)

I have a hard stop, but not at the top of the hour. At one.

Jed Sundwall (52:54.977)

Okay.

Drew Breunig (52:55.958)

Yeah, we got an hour and 10.

Jed Sundwall (52:57.727)

yeah. So we booked our own time. for those listening, we’ve blocked our calendar for two hours so we can go this long. We’re going to go for as long as we want, but, but no further than 1 PM Pacific.

Drew Breunig (53:03.192)

There you go.

Yes.

But I think this, the benchmark thing, that’s how we talk about it today. But transitioning into, from MNIST, we went to ImageNet. And that is something that Fei-Fei Li created when starting at Princeton, because she built it as that challenge. She saw that there was a WordNet,

Jed Sundwall (53:15.766)

Yes.

Jed Sundwall (53:22.496)

Yes.

Drew Breunig (53:37.922)

which was out there, which was essentially a natural language processing training data set. And she said, well, I want this for images, because I want people to build better image recognition models. And so to do that, she realized they needed a way to test it and train it. And it became not only a thing you could train models on to improve the software, it also became, like, the foundational improvement of deep learning in general. Again, you put out your challenge

and you make people go to it. It was a benchmark as much as a data set.

Jed Sundwall (54:11.583)

Well, right. like built in somehow, you know, I don’t know how she did this in terms of like funding and her stature at Stanford or whatever, like challenges. was just sort of like, this is a data product, you know, that we’re putting out there and we’re going to run challenges. And it was not, this is one of these overnight successes that took something like six years or something like that. don’t know like when, it was a long time before AlexNet came out.

Drew Breunig (54:26.262)

Mm-hmm.

Drew Breunig (54:32.322)

It was a long time. And I think the other thing, too, is they had to create it the only way they were able. So MNIST came out on a CD-ROM, pre-internet. ImageNet could only have been created after the internet existed, because they leveraged Mechanical Turk, they leveraged Google Image Search. They basically were just paying to label images at a price that otherwise they just would not have.

Jed Sundwall (54:52.427)

Mm-hmm.

Drew Breunig (55:01.806)

So I think there was a couple years that, because ImageNet was pretty, it was after Common Crawl first launched, but its breakout moment came before Common Crawl’s breakout moment occurred. And so AlexNet was 2012, whereas ImageNet was like, I think 2008, 2007. But yeah, and so go ahead.

Jed Sundwall (55:22.987)

Okay. Yeah.

Jed Sundwall (55:28.361)

Well, ImageNet’s another interesting example though of, well, this is when I said I wanted to talk about licenses because when I was at AWS, people were like, hey, you should host ImageNet in the open data program. And I’m like, I mean, sure. Like I think that would be cool if we did. Also like people can get it. Like you don’t need, you didn’t need S3 necessarily to get ImageNet. Like people, you could download it. Like it wasn’t like so huge.

Drew Breunig (55:38.893)

Yeah.

Jed Sundwall (55:53.09)

that it mattered so much. But I was also like, look, my lawyers aren’t going to like this. Like if we’re going to host these images, we don’t know. Yeah, they’re just like random licenses all over the place. And, but it just reveals just how like, how brittle this sort of like licensing regime is for this sort of stuff where it’s like, look, who’s, who’s going to sue you honestly, because you’re using some like, like 120 by 120 pixels square picture of like a dog.

Drew Breunig (55:59.726)

peeled off of Google search, like…

Jed Sundwall (56:22.719)

Like, you know, like.

Drew Breunig (56:23.006)

Yeah, I mean, do like it is that weird thing where it’s like, it’s fine to bootstrap it. But if like, you’re really successful, someone comes knocking, it’s kind of like, like, you know, Google looks the other way on people using Street View images, even though they know, they know that they are being crawled in some way or another.

Jed Sundwall (56:33.249)

Yeah.

Jed Sundwall (56:44.159)

Yeah. Yeah. No, I mean, or, you know, they come for Anthropic once they've raised enormous amounts of money, and they'll be like, sure, great. Actually, it's an honor to pay this, because we know that no one can come up behind us now. You know, because we've got the cash.

Drew Breunig (56:51.18)

There you go.

Drew Breunig (56:57.28)

Yeah. And that's what you're paying for. I mean, some would argue that's why Google bought YouTube — purely to buy the court case, or that was one of the main reasons. But yeah, so ImageNet was basically a database of, I think, about 1.4 million images that were labeled. A thousand categories, 1.4 million images. And then they just said, hey, every year we're going to hold a contest

Jed Sundwall (57:07.731)

Interesting. Yeah. Okay.

Jed Sundwall (57:19.201)

Something like that.

Drew Breunig (57:27.362)

to see who can get the best one. Now, the idea of waiting every year is a positively quaint notion. People just download and run the benchmarks every single day. Back then you had to upload it, the whole thing. But I do think ImageNet was every year. And so that went around for a while. Side note — go ahead.

Jed Sundwall (57:33.664)

Yeah.

Jed Sundwall (57:48.097)

There you go. Do this.

Drew Breunig (57:49.314)

Side note, I was thinking about this last night. So in a former life, I was a media strategist at a large media buying company. And in 2009, I was writing media strategy for Nvidia. And I was thinking about this last night because Nvidia had a new technology that they were very excited about called CUDA. And they…

I remember going down to the briefing and they're like, here's what we're going to show on the floor at our next big conference. Here's all the demos for CUDA. CUDA is this idea of, we can use GPUs for general computing, and we can use it for immense parallel processing. We think this is going to be really big. And we would ask, all right, well, what are people going to use it for?

And they had like eight demos. None of them were machine learning or deep learning. There were a couple of biotech ones about protein folding or what have you. There were a lot of cloth simulators — like, hey, we can sell this to fashion designers to simulate how cloth is going to drape over someone. They just had tons of different things. And they had no idea what people were going to use CUDA for. They just knew it was going to be this big thing.

Jed Sundwall (59:05.875)

Yeah.

Drew Breunig (59:07.544)

But they had no idea. And you could make a very strong argument that CUDA is the reason Nvidia is in the position it is today, being one of the most valuable companies in the world. And they had no idea what it was for. So that was — CUDA came out in 2008, I worked on it in 2009. And it was still this thing no one knew about. You would go around, and they just would look at you, and they're like, well, we can do these things. And you're like, that's kind of interesting. And it wasn't until 2012

Jed Sundwall (59:17.416)

yeah.

Drew Breunig (59:37.408)

where they kind of had the first glimpse. So we're hitting the big names here: Geoffrey Hinton, who did win a Nobel Prize for deep learning and machine learning; Ilya Sutskever, who later would co-found OpenAI; Alex Krizhevsky — I can never pronounce his last name, Krizhevsky. They built AlexNet, which basically performed against ImageNet

with a score of 84.7. And you have to understand, this was a 10-point-plus difference compared to any previous competitor that year or before. And it was the first time that they had used deep learning accelerated by a GPU. And they were just using two consumer GPU cards — basically what you would buy to game at that time. And that basically started deep learning.

Deep learning was like how we talked about AI before AI. And so that was kind of what set it off. And I think the big step change here is that, again, this comes back to the benchmark thing, which is Fei-Fei Li created this space essentially out of a benchmark.

which is: deep learning became a thing because its value was proven, because someone built a data set and then people competed to see how well they could perform against it. A benchmark is essentially a data set with intent. And when you ship that out into the world, you get people to do things against it if you make it exciting, if you make it collaborative, and if you're operating in the white space.
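To make the "data set with intent" idea concrete, here is a minimal sketch of what an ImageNet-style challenge boils down to: a frozen set of labeled examples plus one scoring rule everyone runs the same way. The examples and the stand-in model below are invented for illustration; this is not the actual ILSVRC evaluation harness.

```python
# A benchmark, reduced to its essentials: frozen labeled examples plus an agreed scoring rule.
# The image IDs, labels, and "model" here are made up for illustration.

benchmark = [
    {"image_id": "img_001", "label": "dog"},
    {"image_id": "img_002", "label": "cat"},
    {"image_id": "img_003", "label": "dog"},
]

def score(predict, examples):
    """Return top-1 accuracy of `predict` (image_id -> label) over the benchmark."""
    correct = sum(1 for ex in examples if predict(ex["image_id"]) == ex["label"])
    return correct / len(examples)

# A trivial stand-in entrant; a real submission would run inference on the images.
baseline = lambda image_id: "dog"

print(f"top-1 accuracy: {score(baseline, benchmark):.2%}")  # 66.67% on this toy set
```

The intent lives in the choice of examples and labels: whoever freezes that set decides what "better" means for everyone who competes against it.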

Jed Sundwall (01:01:16.331)

Yeah. Yeah. This is — I mean, I'm going back. Linda said something like — wait, yeah — she said data is typically purpose-built. Understanding this will force us to examine our data more rigorously, creating a significant demand for data repurposing, especially with AI. What I'm hearing, or where this is coming together for me, is that you can produce a data product — and we're going to talk about Common Crawl next, I think we should — and then

and then you have to produce benchmarks attendant to that data set or that data product, which are basically any number of arbitrary goalposts that you want to set. Because Common Crawl is so rich, obviously it can be used for so many things, so you just need a benchmark for each of those things, you know, and just say, well, can you do this? Can you do that? Yeah.

Drew Breunig (01:02:02.241)

Yeah.

Drew Breunig (01:02:06.334)

Yeah, and I think some of the most interesting things out there are benchmarks. So we talked about Terminal Bench, which is one of my favorite examples. The other is the Berkeley Function Calling Leaderboard, which is just testing how well LLMs can use tools that are given to them for agentic purposes. And it's really, really interesting. And then—

What’s the other one that I really like? It’s not empathy bench. What is it?

There's another one. Sorry, here it is: Sam Paech has a great benchmark, EQ-Bench. And he maintains this himself — just some dude, love his stuff. He's like, I'm interested in having LLMs become better writers. And again, it's one of those things that's really hard to quantify, how to make something a better writer. So again, he just—

He's like, here's one metric, here's another metric. One metric is, how often do you reuse the same phrases? Okay, great, we can count that. But two, long-form writing. It's all of these. And it's a really interesting thing. And he admits, he's like, this isn't perfect. But again, you start to see people building against it, and it does start to influence and shape the arc of development.
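As an illustration of the kind of countable writing metric being described here — not Sam Paech's actual implementation, just a sketch of the idea — here is a tiny function that measures how often a text reuses the same three-word phrases.

```python
from collections import Counter

def phrase_reuse_rate(text: str, n: int = 3) -> float:
    """Fraction of n-word phrases that are repeats of an earlier phrase in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values() if c > 1)
    return repeats / len(ngrams)

sample = "the ship sailed at dawn and the ship sailed at dusk"
print(f"{phrase_reuse_rate(sample):.2f}")  # higher scores mean more repetitive prose
```

A single number like this is obviously imperfect, which is the point being made: a handful of imperfect but countable metrics is enough to give people something to build against.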

Jed Sundwall (01:03:40.822)

Yeah. Well, let's talk about Common Crawl a bit more in depth, but I've got to shout out Sam at Common Crawl. She's like, hey, we have an event coming up. So for people who are in the Bay Area: at Stanford on October 22nd, there's an event called Preserving Humanity's Knowledge and Making It Accessible,

Drew Breunig (01:03:45.95)

yes.

Jed Sundwall (01:04:06.913)

addressing challenges of public web data. This is the kind of thing I would love to go to. I'm unfortunately booked at another event — the Chan Zuckerberg Initiative, I think that's what CZI stands for. I'm going to be at one of their open science events. But man, if I weren't booked with CZI, I would definitely be trying to go to this thing. And you can watch online. So I put the link in the chat and we'll share this. I think we should share this podcast before October 22nd for sure. So—

Drew Breunig (01:04:20.16)

ooo

Jed Sundwall (01:04:36.929)

Shout out to Common Crawl. Drew, tell me all of your deepest thoughts and feelings about Common Crawl. It’s a great story.

Drew Breunig (01:04:42.434)

I mean, Common Crawl is novel for how early it started and that it wasn't really built with machine learning or AI in mind. So to give you some perspective, the Common Crawl project is essentially — the idea is that, hey, we're going to scrape the internet and put it in one data file ready for people to use, so you don't have to go scrape it

Jed Sundwall (01:04:48.223)

Yeah.

Drew Breunig (01:05:11.878)

because, again, we believe that lots of people can build things if all of this is accessible, and so the net value out of it would be tremendous. It began in 2007, the same year that Fei-Fei Li launched ImageNet. And so Gil Elbaz — yeah, Gil, good old Gil — he started it,

Jed Sundwall (01:05:33.409)

Yeah.

Drew Breunig (01:05:38.354)

and he formed the Common Crawl Foundation. It's funny, he founded it as he left Google, so it kind of tells you what his motivations were, which is: I don't want Google to get a lock on the internet. I want to expose the thing that's really expensive to bootstrap and stand up, especially in 2007, which is crawling and preparing all of the files. And now it's a single data set, essentially, with

250 billion web pages collected over nearly 18 years. And about three to five billion pages are added a month — though, sadly, Common Crawl is getting shaped a little differently because its crawlers are getting blocked. And the reason its crawlers are getting blocked is because of AI-driven crawling. So in a weird twist of fate, Common Crawl became one of the foundational things that early language models would train on.

It would become a critical ingredient in The Pile, in Google's C4 data set — basically subsequent data sets, kind of child data sets, which is like, hey, we're not going to include every single forum, or we're not going to include duplicative data, we're going to filter all this stuff down to the high-quality stuff. But then once you start building on it — and this is where it gets into the data-as-oil thing — let's say I use that to build my model that later becomes ChatGPT. I have so much that

Jed Sundwall (01:06:52.075)

Right.

Drew Breunig (01:07:07.118)

I'm not going to rely on Common Crawl anymore. I'm going to start building my own crawlers and go out to the things that I care about and do it with a much greater frequency so that I can improve my model. You get enough of these, which you do — there are a lot of people out there hosting websites right now that are having to think about how to gate their content to prevent legitimate and gray-market crawlers that are just hammering their sites. And so now,

Common Crawl created this thing, but now we're kind of having a tragedy of the commons, which is: everyone who grew up around it now sees running their own crawler as a competitive differentiation, and they're going out there and kind of doing that themselves. All the while, Common Crawl is still going, but its surface area is starting to shrink a little bit because different web pages are shutting off access to crawlers because of this mess. So I do think it's the closest thing that we have in data to a tragedy of the commons.

but yeah, I’ll pause right there before I talk about why the text is so important.

Jed Sundwall (01:08:10.433)

Yeah, no, I mean, I think it's — that's an amazing story. Gil has told me, he's like, I'm pretty sure Common Crawl is the most impactful nonprofit ever. There's definitely a case to be made there. I don't know exactly how you'd quantify that, but holy cow. Yeah. Yeah.

Drew Breunig (01:08:25.842)

Yeah, I mean, because everything grew up around it. Even — you'll look and people will say, so-and-so didn't use Common Crawl. But then you look at the data sets they did use, and they were derived from Common Crawl. So it basically fueled the entire first wave of large language models, which is what percentage of our GDP at this moment?

Jed Sundwall (01:08:50.081)

I think it's 140% of our GDP. Yeah. Yeah. None of the math makes sense when you're hearing what people are saying about large language models now.

Drew Breunig (01:08:52.074)

Yeah, 100 % it is. We don’t know how that’s possible, but it is. That’s what we’re

Yeah, and this is one of those weird things — when he built this... One of the weird things about large language models is that everyone was kind of surprised when the first large language model worked, like "Attention Is All You Need." Because previously, you would have to put structured data into these deep learning models, and then they would have to figure out the relationships. At the time, when people thought of structured data,

Jed Sundwall (01:09:21.878)

Right.

Drew Breunig (01:09:28.962)

they thought of the work that Fei-Fei Li put together with ImageNet, which is: here's an image and here are some labels. And so the big gate for deep learning was, anyone who wanted to build on deep learning would say, all right, well, where am I going to get that labeled data? Where am I going to get that structured data? With large language models, the thing that was shocking to everybody is: wait, language is structured,

because we can see the order of the words — some words come before others, some come after, in all of these assemblages. And we don't need to label language because it's already organized and structured. We just have to have enough of it. That was the thing. The magic thing was that you built something big enough that it would display spooky, intelligent qualities. And that was what Common Crawl enabled. Because if you didn't have that, you couldn't test that
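A minimal sketch of why raw text is "already structured": the training signal for a language model is just (context, next word) pairs read straight off the order of the words, with no human labeling step. Real pipelines work on subword tokens at enormous scale; this only shows the shape of the idea.

```python
def next_token_pairs(text: str, context_size: int = 4):
    """Yield (context, next_word) training pairs straight from word order -- no labels needed."""
    words = text.split()
    for i in range(context_size, len(words)):
        yield words[i - context_size:i], words[i]

corpus = "the quick brown fox jumps over the lazy dog"
for context, target in next_token_pairs(corpus):
    print(context, "->", target)
# ['the', 'quick', 'brown', 'fox'] -> jumps
# ['quick', 'brown', 'fox', 'jumps'] -> over
# ...
```

The "label" for every example is simply the word that comes next, which is why a giant pile of ordered text like Common Crawl was usable as training data without anyone annotating it.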

Jed Sundwall (01:10:17.461)

It’s wild. Yeah.

Drew Breunig (01:10:22.952)

randomly, because you would have had to stand up your own crawlers before that. So the fact that it just existed allowed for that discovery to be made, which is why I think I wouldn't argue with Gil's claim.

Jed Sundwall (01:10:36.459)

Yeah. No, it's incredible. I have another apocryphal story — I mean, we hosted Common Crawl; my program at AWS was the home of Common Crawl. I have stories that I probably shouldn't tell, so I won't. It's phenomenal and kind of insane. And I was joking about this last week at an event at Climate Week,

because I was in a room with a bunch of organizations — I'll just say very large corporations, not a government in sight — talking about sustainability data for global supply chains. I won't go into much more detail than that. But I said, you've got to understand, there's this story about this guy, this one dude, granted a billionaire, who's just like, here's a thing I'm gonna do, and does it. And it has this huge impact.

And I'm like, it's this heartwarming story of the impact that one billionaire can have on the world. But the point also being that it is possible to create a data product that has a very consequential impact. And if you feel like there's something there, there might be something there. In Gil's case — my story, at least from what I recall of him explaining this to me, is that he creates AdSense,

Drew Breunig (01:11:49.474)

Yeah.

Jed Sundwall (01:11:58.658)

it's acquired by Google, he spends his time at Google, and he's like, there's gotta be some kind of fail-safe for this kind of thing, where we can't have one company that, you know, owns all of the world's information. There's some irony in the fact that what Anthropic and OpenAI are becoming is just sort of the next version of that sort of thing. But you know, I'm not mad about it. Like, yeah.

Drew Breunig (01:12:19.946)

But I mean, I think about that a lot. I think it's interesting that now we've gone from the crawl being the thing that's valuable to the interaction data. So when they were talking about breaking up Google, one of the things they were talking about was making the ranking data — making the index — open, which isn't just the data, it's also the relationships that exist in the data. But again, one of the things that I'm shocked about with LLMs, which I

Jed Sundwall (01:12:32.959)

Right.

Jed Sundwall (01:12:39.083)

Yeah.

Drew Breunig (01:12:50.498)

find to be really interesting, is that no one's running away with it. Sonnet 4.5 came out and said, hey, this is the best model this week, the best coding model. But the thing is, the difference between Sonnet 4.5, GPT-5, even the open models, the larger Qwen coding models — they might not be perfect, but they're a lot closer than you'd think.

And it's to the point where everybody jumps on whatever the newest thing is, but you could have been sitting on GPT-4o for a year and you would have been fine. And I do think what's wild is that the floor is coming up faster than the ceiling. The ability of 7-billion-parameter models to effectively, you know, double in quality every year is just absolutely insane. And so,

you will get some things from throughput and other things like that. But I think the weird thing is that even if these guys win, you may end up having free access to something running on your device. It's bizarre and it's really weird to think about.

Jed Sundwall (01:14:02.658)

That's incredible. Yeah. It is. Well, let me — I want to go back to the data-is-oil thing and how LLMs change this sort of stuff. And Alex left another comment about, you know, people trying to use robots.txt, or there's llms.txt, to try to influence how the bots navigate the web. So I have this theory, I'll just bounce it off of you. I don't know if it's a theory, but it's this idea that,

Drew Breunig (01:14:20.739)

Yeah.

Jed Sundwall (01:14:30.091)

so, the internet has been full of really amazing data for a very long time. And what a lot of us who've worked in open data have just been scratching our heads about is, well, why doesn't it get used? You know, there are all these open data portals that don't get used. And one of my answers to that is that humans don't know how to use data, by and large. You take a sample of a million humans, you're going to get a very small percentage that actually know how to do stuff with data.

And who also have time. I mean, this was always kind of the funny thing — an early realization for me when I was working in civic tech was that there are people who are like, yeah, we'll just open up our city's data and then people will just do cool stuff with it. And I'm like, hey, even if someone knows how to do anything with your data — which is not that good, it's kind of a pain to work with — they have a job. You have a narrow window of college kids and civic tech activist people

who, before they — exactly, have kids, I was just gonna say, get a wife or a husband and have a job — are willing to do that sort of stuff. And that's it, and they just kind of go away after a while. But LLMs can do stuff with data 24/7. And so we are at the point where I think we might have created a market for data, if we can get — and here's my crazy idea.

Drew Breunig (01:15:32.994)

have children and full-time jobs. Yeah.

Drew Breunig (01:15:55.213)

Yeah.

Jed Sundwall (01:15:59.244)

Tell me if I'm crazy. Also, I think this is already happening: OpenAI and Anthropic should pay for data. They should just — hey, they come to some data portal thing where it's like, hey, we maintain this data. If you're a bot, we're gonna charge you a ten-thousandth of a penny per request here, so that you can — you know, it's basically your research budget. Yeah, I think it's a good idea. I don't think Cloudflare should do it; I think Source Cooperative should do it,
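Here is a hedged sketch of the metering idea being floated — every name in it (the per-request rate, the ledger, the bot check) is hypothetical, not an existing Source Cooperative or Cloudflare API. It just shows how cheap per-request accounting for crawlers could look.

```python
from collections import defaultdict

PRICE_PER_REQUEST = 0.01 / 10_000  # a ten-thousandth of a penny, in dollars (hypothetical rate)
KNOWN_BOTS = ("gptbot", "claudebot", "ccbot")  # illustrative user-agent fragments

ledger = defaultdict(float)  # dollars owed, keyed by crawler name

def meter_request(user_agent: str) -> bool:
    """Record a charge if the request comes from a known crawler; return whether it was billed."""
    ua = user_agent.lower()
    for bot in KNOWN_BOTS:
        if bot in ua:
            ledger[bot] += PRICE_PER_REQUEST
            return True
    return False  # human traffic stays free

meter_request("Mozilla/5.0 (compatible; GPTBot/1.0)")
meter_request("CCBot/2.0")
print(dict(ledger))  # tiny per-request amounts that only matter at crawler volume
```

The individual charges are negligible; the design bet is that crawler volume is enormous, so the totals become a meaningful "research budget" flowing back to data maintainers.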

Drew Breunig (01:16:20.27)

Well, I mean, that’s what Cloudflare is trying to do.

Jed Sundwall (01:16:28.961)

because we’re not owned by anyone, but anyway.

Drew Breunig (01:16:30.676)

Yeah, no, I think it's an interesting one. And the incentives are absolutely crazy to think about.

Drew Breunig (01:16:41.003)

I mean…

Jed Sundwall (01:16:48.353)

Don’t loan your mind.

Drew Breunig (01:16:48.462)

I'm thinking about what angle to approach that from. What do you optimize for? Also, do you mind if I take a quick break, a one-minute break, and come back while I think about this? We'll handle it in the edit. One second. Someone's knocking.

Jed Sundwall (01:17:00.577)

Sure.

Okay. Okay. All right. For those of you who are watching the live stream, someone knocked on Drew's door and he had to get it. I'm going to use this chance, because I don't know when this is going to end, but we still have some people on here. For those of you who don't know about the Cloud-Native Geospatial Forum, we did an event in Utah this year at the end of April, early May, at Snowbird. It was fantastic. Everyone loved it. We polled everybody at the end of it and

we got like five stars — I don't know, 97% of people said they would come back, and they loved it. So we're doing it again. So you're hearing it here first. We just lost a follower, but anyway, we're gonna be doing the Cloud-Native Geospatial Forum conference again, October 6th to 9th — not next week, but October 6th to 9th, 2026. So we're gonna do it in the fall next year, but we're gonna do it again. We'll have a landing page up before too long,

and, you know, we'll have links to share out, but anyhow, it's very exciting. Alex left another comment. Yeah, so exactly — Alex left a comment saying, you know, what a lot of journalism orgs, Reddit, and orgs like Wikimedia are doing with their enterprise APIs is locking them down. I think this is fine. You know, I think people are coming out of the Web 2.0 era, and I think a lot of the excitement around having open APIs

Drew Breunig (01:18:22.688)

So.

Jed Sundwall (01:18:31.741)

is understandable, but now we're realizing — we now have about a decade of knowledge to understand that this has a cost. Yeah.

Drew Breunig (01:18:38.594)

Well, I mean, the other thing that's crazy about it too is that a lot of the Web 2.0 dream is being enabled by LLMs, but now you go to the meme: not like that. We dreamed of and loved the idea of a semantic web, where you could ask questions and just access things. And it has been delivered to us — but it has been delivered not as an open force, but as an intermediating force. And now we're having lots of second

Jed Sundwall (01:18:50.314)

Yeah. Yeah.

Drew Breunig (01:19:07.8)

questions about that.

Jed Sundwall (01:19:10.751)

Yeah. So, I mean, yeah, we're going to have to figure it out. But I think what I would want to say is that it's fine — I think we should just be sober about this and say, if we want to have reliable access to data in these ways, someone should pay for it. And what's interesting about ChatGPT is that people pay for ChatGPT. I pay for ChatGPT. It should have a research budget.

Some fraction of those pennies could go towards maintaining accurate, up-to-date data about school enrollment in America or whatever it is, whatever kind of research I wanna do. There's actually money flowing, because that kind of stuff was never gonna be supported by an ad model. Yeah.

Drew Breunig (01:19:48.183)

Yeah.

Drew Breunig (01:19:54.55)

Yeah. I mean, I don't know — it's going to be supported by an ad model eventually. Don't worry, it'll come. I don't know if you've seen the announcements OpenAI has made over the last couple of days. They're very much ad-model friendly. They're selling stuff. They want to give you a morning report where they browse the web for you and go find all the things you should be looking at. And that's going to have an ad in there. I mean—

Jed Sundwall (01:20:05.812)

Yeah.

Well, either way, they're selling stuff through ChatGPT.

Jed Sundwall (01:20:16.714)

man.

Drew Breunig (01:20:24.686)

I mean, well — and this is where I'm gonna play devil's advocate to you, because if there's one thing I get frustrated about in the open space, it's people saying, well, we should be paid. This idea of: you're making money off of my library, you should be paying us, you are freeloading. And it bothers me because

I agree with it — in a perfect world, I want every open project that gets usage to be funded. But your argument cannot be, we should be paid because you're making money off of it. It needs to be a realistic, practical, pragmatic exchange for how you deliver that. And so I do think there is a mechanism — there could be a mechanism — for the way information gets distributed and accessed.

And I think it’s going to get really fraught right now because like…

The whole ad model is going to go crazy, not just because it's going to get intermediated. The ad model is based on attention, and if we have these agents out there making decisions for us as proxies, that attention is now theoretically infinite. How do we govern that relationship, and how does it get re-monetized? So—

Jed Sundwall (01:21:50.422)

Yeah. Yeah. So I'm with you. I mean — I'm putting it in the chat, just to flog my own blog — the gazelles blog post from a while back, where I'm just like... One thing I haven't been explicit about, and there's going to be a follow-up blog post at some point, but the idea of a gazelle is that we should have entities that are, I would say, non-owned — not owned by investors exclusively —

Drew Breunig (01:22:01.144)

Sure.

Jed Sundwall (01:22:19.649)

that provide some sort of — usually they're providing data — and they are accountable to the market. And so I'm with you in that the conversation needs to go way beyond "we should be paid." There's so much entitlement in the open community, it drives me insane. It's like: you should give me data for free, it's a public good. I'm like—

Drew Breunig (01:22:34.563)

Yeah.

And everybody should be giving data away for free. Like, I want people to think about their monetization policies because it gives them control over their own future. And that is me clarifying why I get frustrated when I hear open people begging for money, because that's what it is. They don't have the leverage. They've never thought about it before, and now we're finally having to come back to it. And I encourage everyone to think about money before you think you need to, because it's going to help you control

Jed Sundwall (01:22:41.312)

Yeah.

Jed Sundwall (01:22:48.998)

That’s right.

Jed Sundwall (01:22:59.521)

That’s right.

Drew Breunig (01:23:08.438)

your future and your destiny and not end up being beholden to something else.

Jed Sundwall (01:23:13.237)

That's exactly right. And it's hard. I'd say I'm fighting on two fronts with my notions of gazelles and new public sector organizations. One is the easy one, where it's like, these billionaires have too much power and some of these tech companies are out of control with too much power. People are like, yeah, blah, blah, blah — we all kind of agree. The harder battle to fight, though, is for me to go to my colleagues in the open world and say, hey, we should maybe put a price on what we do

and think about the value of what we're doing and see if the market supports that. And they're like, what? I believe a huge part of this — the cultural legacy of the philanthropic world — comes from European aristocracy, where it's like, we do not touch money, I don't work for a living. It's leisure-class stuff.

Drew Breunig (01:23:53.676)

Yes.

Drew Breunig (01:24:01.13)

Or the money thing — well, I think it goes back to Stallman and others, the cathedral-and-the-bazaar type thing, which is: we should have this free exchange, everything is better with exchange, everything is better with open. But then we often get issues.

Jed Sundwall (01:24:20.117)

Well, you get steamrolled by people who actually have market power.

Drew Breunig (01:24:22.952)

I see the Ruby community right now. I don’t know if you’ve been following that, but that’s a good example.

Jed Sundwall (01:24:27.585)

A little bit. I saw they created a foundation. Tell me more.

Drew Breunig (01:24:31.538)

No, there's just a governance argument right now about who has control over what, what org has control over what, and how much power Shopify has as the big entity bankrolling everything in this example. And so you have all of these things that stack together until you get into these uncomfortable scenarios, when the incentives are not aligned, or not aligned the way you expect them to be aligned,

Jed Sundwall (01:24:48.979)

Interesting.

Drew Breunig (01:25:00.002)

which is almost just as dangerous. And so I do think there is a market for data, but you have to provide the utility of it. And I do think it comes back to data discovery and data democratization, but we're not going to create these things just because we want them. We have to create them and build the structures around them.

Jed Sundwall (01:25:20.065)

That's right. That's right. And so that's what I need to figure out: can we create some sort of mechanism whereby — look, I'll just talk about Source, the vision for Source; this is the Source Cooperative podcast. We have this notion of data products, which in our opinion is a collection of files or objects that have been shared by an organization or a person, and you know who they are. That's fundamental to Source — this is a data product that came from Planet,

Drew Breunig (01:25:45.261)

Yeah.

Jed Sundwall (01:25:49.878)

you know, the satellite company, for example. And it is up to the user, the beholder, to determine whether or not that data is worth their time. And what is interesting to figure out is how we could communicate that to an LLM. Could somebody say, hey, ChatGPT, I wanna know this information, but I only wanna get data directly from Planet or from NASA or the census department or whatever it is?

And then it's up to OpenAI to determine — it's like, yeah, sure, we're willing to throw a few shekels over to Planet to get access to this data and return it. Because, you know, my assumption is that OpenAI is just gonna hoover up whatever they can get. Is the credibility and provenance of data actually important to consumers? Maybe sometimes, but who—

Drew Breunig (01:26:37.763)

Yeah.

Jed Sundwall (01:26:46.922)

It's weird because who's making that determination? Many times it's not going to be the user. They're just going to be asking an idle question.

Drew Breunig (01:26:52.946)

Yeah, I also think it matters in the domain. And that's where you're seeing a lot of random startups. Like, I was just talking to someone who's starting a company that's based on medical spending records — so looking at Medicare receipts and Medicaid receipts. It's a highly regulated industry. You can't have hallucinations. You have to have

provenance figure in when you start to build different products with this. And they've had to build their own custom pipeline. And this is getting into the question Alex just asked — I wonder how RAG changes the play. Look, even if you build your own custom pipeline — they were doing text-to-SQL, which kind of predates RAG; the first use case was text-to-SQL — they're still having to figure out, all right, well, how do we then validate and subsequently confirm? And so,

getting back to what you're saying, with RAG it's like self-subscribing confirmation, and that's kind of where the messiness comes in. The challenge here is that they're working in one specific domain. Their surface area is a lot tighter, both in terms of the questions being asked and the data that can answer them, so their needle-in-a-haystack exercise is different. And you're gonna see the same types of companies come up in law — like, how do I cite legal cases that actually exist, so I don't get chewed out by a judge and told to,

you know, go f off. And you're going to see that in each little domain where there's regulation, where there are penalties, and where you can sell that higher quality. I think the challenge that Anthropic and OpenAI and all these guys have is that there are really two markets right now: their chatbot market and their coding market. So they'll care about citation in coding stuff.

For the rest, they're just like, all right, how do I drive down hallucination and improve citation? They do have citation benchmarks — there are benchmarks and evals for people to go judge their ability to correctly name things without hallucinating. But coming back to what you're saying, I think the challenge here too is that with LLMs, you also have to worry about multiple stages in the pipeline.

Drew Breunig (01:29:09.208)

So what I mean by that is, there are different stages when you build the pipeline. You have pre-training, which is when you train on the super messy Common Crawl-type data that builds up your base English capabilities — or base language capabilities — and establishes your knowledge base. Then you have post-training. Post-training is when you teach the model how to talk with an interface; that's when you train it to reason, that's when you train it to chat and go back and forth.

That's when you train it to use tools. And then after that, people might fine-tune it, or they might put further tools on top of that — data, RAG, other similar things. And so what you're talking about is providing function from basically post-training all the way through to fine-tuning, to tool deployment, to the framework around it, to the actual application. It's this wide spectrum of applicability

that also has different pricing terms as you start to come in. And the problem I have with paying for it is — it's one thing if you're Reddit and you cover everything. It's another thing if you're a really, really, really narrow niche, because again, you're selling into a model that does everything. So how do they value that use case to justify your acquisition?

Jed Sundwall (01:30:34.431)

Yeah, well, mean, so this is where we’re.

Drew Breunig (01:30:37.388)

as I drink with an anthropic sticker on my water bottle.

Jed Sundwall (01:30:40.385)

Cool, man. I wish I had an Anthropic sticker. I've got Cloud-Native Geo stickers here still. Nice, nice. Okay. No, I mean — so Alex brings up another really interesting point here, though, that's very important. You're mentioning that if you're working in a very, very narrow space, the applicability of whatever you're putting out there is very broad.

Drew Breunig (01:30:49.932)

I have one of those, I just read it.

Jed Sundwall (01:31:09.883)

I am a hundred percent — my perspective is that the best gazelles create far more value than they capture, right? They should be the kind of thing that's only putting something out there that's quite small and simple and you can vouch for it, and then what people can do with it — go nuts. If people can become billionaires off of it, that's great. With climate stuff, this is just what we have to acknowledge head on — we actually talked about this right before we started rolling — is that we are at this point

Drew Breunig (01:31:17.143)

Yes.

Jed Sundwall (01:31:39.86)

where we are actually talking about making interventions to perturb the environment in order to protect the world as it suits humans — the roughly however many of us there are right now — basically to cool it off. We're like, we're gonna make this decision now: we're gonna gather up a bunch of climate data, a bunch of information about the planet, and it will be used for us to manipulate the environment in a way that is much more deliberate than we have done in the past.

As we discussed, we've been messing with the environment quite a bit, not deliberately, but now we're like, we're gonna do this sort of stuff on purpose. This has huge, huge repercussions on global governance. And we do have to figure out models that allow us to make huge volumes of data available reliably. And I would say they absolutely should be available to AIs, but how do we — who pays for that?

Somebody’s gotta pay for it. And I’m with you. The answer should not be, well, we should get paid to do it.

Drew Breunig (01:32:44.222)

Well, I thought you were going to say the answer is not communism or something similar.

Jed Sundwall (01:32:50.195)

No, but I mean, I do think — I mean, that's the other thing — we don't have the luxury of being too idealistic now. Ideally, it wouldn't be shaking down billionaires, but there are enough billionaires around that we should be shaking them down. I think philanthropy has a role to play here. I'm very interested in endowments for, you know, guaranteeing access to data over time. So there's something to be done here, but it will be—

Drew Breunig (01:33:11.779)

Mm-hmm.

Jed Sundwall (01:33:19.937)

It’s this is a huge challenge. It’s an exciting challenge though. Yeah.

Drew Breunig (01:33:22.744)

I mean, I think that comes down to discovery. And I think that’s one of the big challenges, which is like, I mean.

Like, that's the — so I shared a paper with Jed yesterday, which is brand new; a PhD student just came out with it yesterday. I'm gonna link it — wait, let me find the link I sent you. And it's about teaching LLMs to search for data and assess data.

Jed Sundwall (01:33:38.576)

yeah.

Drew Breunig (01:33:57.494)

And I think of it as a natural extension of — you know, one of the first things that happened when ChatGPT came out, in that first year, was a lot of text-to-SQL applications. I think it's a further extension of layers upon that, which is: I'm going to understand a data source and build a representation artifact that is queryable,

so that then we can query it on top of that. And so I think we're starting to see these systems. And the good news — here's the thing that I do think is incredibly valuable — is you look at

this application and you can see why a company would fund it, because you can say, all right, would Databricks fund this? Would AWS fund this? Would Microsoft fund this? Would Tableau fund this? 100% they would, because they want people to find more data, and the right data — because if you find more data and the right data and it's valuable to you, you have to generate the compute to actually utilize it. So I do think that we're going to see

things that are aligned with these functionalities when it comes to data discovery, because there is a huge market opportunity for it. And I do think maybe that's the value that gets put on it — not the access to the data, but the discovery of the data and the service of finding it. That, to me, would be a huge problem to solve for tons of enterprises that I've talked to.
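A minimal sketch of the "representation artifact" idea discussed just above: boil a data source down to a compact, text-searchable description (tables, columns, row counts) that could be dropped into an LLM's context or a discovery index. This uses a throwaway SQLite database purely for illustration; the paper Drew mentions almost certainly does something far more sophisticated.

```python
import sqlite3

def catalog_card(conn: sqlite3.Connection) -> str:
    """Summarize every table in a SQLite database as a short, searchable 'catalog card'."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        lines.append(f"table {table} ({count} rows): columns {', '.join(cols)}")
    return "\n".join(lines)

# Throwaway example database, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (state TEXT, year INTEGER, students INTEGER)")
conn.execute("INSERT INTO enrollment VALUES ('UT', 2024, 674000)")
print(catalog_card(conn))
```

The artifact, not the raw data, is what gets searched; once a relevant source is found, the system can fall back to text-to-SQL or direct file access to answer the actual question.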

Jed Sundwall (01:35:25.75)

Yeah.

Jed Sundwall (01:35:34.146)

Hmm. Okay. Well, very relevant to what I want to do, so I'm going to read it. So Camilla asks an important question: does the ability to have some sort of royalty model disappear with the complexity and lack of explainability of how inputs are ultimately used in these models, at the end of the day? So yeah, basically, it's like, OpenAI, ChatGPT says, yeah, we just got the coolest data from

the Gates Foundation, here’s our answer. You know, and it’s like.

Drew Breunig (01:36:08.758)

Yeah, I mean…

Jed Sundwall (01:36:09.289)

A lot of people are gonna be like, okay, I trust your interpretation of this.

Drew Breunig (01:36:13.132)

Yeah, let me tell you a story based off that. So, one of the best ways to learn about new companies, especially new models — and this is something... So at PlaceIQ, we cared about privacy a lot. We embraced new privacy mechanisms and regulations. We designed our systems with privacy in mind. And so I learned a lot about privacy during those eras.

Jed Sundwall (01:36:17.109)

Okay.

Drew Breunig (01:36:42.004)

OpenAI came out with ChatGPT — they launched ChatGPT and they launched the model. I knew something about how the model was made. And so the first thing I'm thinking is, there are a lot of privacy issues inherent in this, especially because once you train the model, going in and selecting out the data the model learned from your private data is basically impossible. You can only kind of add to it. You can't go in and surgically remove it.

So, just for fun, because I'm weird, I filed a CCPA request with OpenAI. A CCPA request comes from a California privacy regulation that allows you to contact any company that has your data and say, hey, do you have my PII — my personally identifiable information? What is it? And I also have the right to correct it or delete it if I require. So

Jed Sundwall (01:37:39.499)

Hmm.

Drew Breunig (01:37:41.482)

you read their privacy policy, and it was all about the account you create when you sign up. It wasn't about the model or the training data they used for the model. They seem to have kind of deliberately skirted that question, because it would be a really big question. But at the same time, it's still PII and they still have it. And I know for a fact that they have my website, because I know my website's in Common Crawl. And so I filed the request, and

this was in the first year after ChatGPT, and the person on the other end had no idea what to do with it. They're like, well, here's your email. And I'm like, no, no, no, I want to know about the training data. And they're like, I don't know. So it would go through periods of being very quiet, then it would get escalated, then it would get very quiet, then it was escalated. And finally, they're just like, well, your email's not in our training data; we have processes for removing your email.

Jed Sundwall (01:38:34.895)

Hmm.

Drew Breunig (01:38:36.502)

So I used a prompt exploit to get my email out of ChatGPT. You can use all sorts of tricks to get around its alignment and safety protocols, and I did that. And I got it to say, Drew Breunig's email is — well, what Drew Breunig's email is; I'm not going to say it here. So I emailed them back and said, here is proof that you have my email. Somewhere in your data banks, it exists.

Jed Sundwall (01:38:53.758)

Yeah.

Drew Breunig (01:39:05.805)

And they're like, can you share the prompt? And it got escalated, escalated, escalated. And finally, they closed the issue, because they said, well, your email is actually something that could be really easily guessed, and we could have learned it from other things and then inferred the naming pattern. And so that's how it came out. But this is the crazy thing: it's still my email.

Jed Sundwall (01:39:24.769)

Hmm.

Jed Sundwall (01:39:29.663)

man.

Drew Breunig (01:39:34.764)

So from a privacy perspective, it still happened. The email existed. Whether it guessed it or not is kind of immaterial — especially if it guessed it, then it falls afoul of the CAN-SPAM Act, which covers using software to automate the guessing, the brute-forcing, of emails. If it didn't guess it, then it has my PII in its training data, which it almost certainly does. And I'm not gonna lawyer up and go fight this fight, but it's a good example of how even they

Jed Sundwall (01:39:39.638)

Yeah.

Drew Breunig (01:40:03.16)

can't tell what the model was trained on. And so, to Camilla's question, the royalty model does kind of disappear, because there are different scenarios you can plan for: did the model just hallucinate it? Did the model figure it out because it had seen previous patterns that are similar, and your question combined with the weights managed to evoke your email? Or is it recalling it from where it's buried deep in the depths of the weights?

And so there will be ways where they try to do this. Like, Anthropic just a couple months ago had a big thing — hey, we can explain what's actually happening inside these models. And they could, but they had to train a special model just for explainability. And then they had to train a different model — it was a model that was the equivalent of, like, Claude 2 — and then they had another one that looked at it.

Jed Sundwall (01:40:33.333)

Right.

Jed Sundwall (01:40:57.057)

Yeah.

Drew Breunig (01:40:59.724)

And then it would have to go through the output, and it was like two expert researchers would have to spend two months of their time just unbundling all the traces to figure out what actually happened for one query. So it's not a scalable mechanism, and it doesn't even work on the largest models. So yes, no one knows where the data is coming from. In fact, a lot of people say that's why reasoning models are a net good, because you can kind of see the logic of how they arrive at their conclusion. But—

Jed Sundwall (01:41:10.837)

pray.

Drew Breunig (01:41:29.486)

I think it is. Yeah, it’s a challenge.

Jed Sundwall (01:41:35.423)

Yeah. Yeah. I mean, well, this is kind of, again, going to the philosophy of Source, which is that you should be able to view source. If the model can't be explained, then whenever possible there should be some sort of auditable layer of data. That's not always going to happen, but there are things — I mean, I'm going back to Alex's point about climate data. If we're talking about environmental data where this is

Drew Breunig (01:41:55.502)

Yeah.

Jed Sundwall (01:42:05.461)

deliberately being shared so that we can impact the environment, it's an impact on everyone. There are layers of the internet that have to be auditable. And yes, the large companies are gonna wanna have plenty of secret sauce in their models, but there's some stuff that can't be secret. We should fight for it being auditable.

Drew Breunig (01:42:25.4)

But like, so then like.

I don't know. I don't think you can make the model auditable. I think we're past that point.

Jed Sundwall (01:42:34.943)

Yeah, I agree. I don’t think we’re making the models auditable, but at least you should be able to say, we know where some of this data came from and you can do your own research if you want to.

Drew Breunig (01:42:48.288)

Yeah, it's funny, because I do think a lot of labs would like that. The problem is that they see their competition as actively stealing their stuff. And so, how do you enforce that internationally is the big question that comes in. And then also the desire to not fall behind, you know, other countries, I think, is the other issue — you start to get into the politics of the thing.

Jed Sundwall (01:43:16.203)

Yeah. Yeah. Yeah.

Drew Breunig (01:43:18.958)

So, I don't know. I do think — getting into the goal of training data products so that LLMs can understand them, is that what you're angling at? And if that's the case, I think you need to make them — Bryan Bischof talks about the map versus the terrain when it comes to creating data systems that LLMs can query. You do have to create that thing that fits within the context well and allows them to

navigate and negotiate with it.

Jed Sundwall (01:43:50.336)

Right. And that's what I'm saying — that's what I was trying to say before. We could create a great catalog at Source Cooperative and talk to our friends — or I need to make friends — at Anthropic and OpenAI, I've got a few, and be like, do you want to use this catalog? And if you use this catalog, are you willing to pay to access stuff from it? How do you train a model to know what data is worth paying for versus not paying for?

Drew Breunig (01:44:19.629)

Mm.

Jed Sundwall (01:44:21.025)

I don't know. I mean, and I don't know if it could just be sort of a brute force thing, which is to say, OpenAI agrees that — I'm going to use the Gates Foundation again; the Gates Foundation maintains a lot of useful data. Actually, a better example is USAFacts — another Microsoft guy, Ballmer, created USAFacts, which is his nonprofit that shares statistical data. Yeah. Fantastic. And they say, like,

Drew Breunig (01:44:45.036)

Great outlet.

Jed Sundwall (01:44:50.663)

OpenAI is like, Ballmer can't afford to keep this thing running himself, so we're going to pay. This is where the argument falls apart with both Gates and Ballmer — these are groups that do not need to be charging for access to this data. But still, we want to have a market for data to make sure that it's continually being produced. Yeah.

Drew Breunig (01:45:09.022)

I do think one of the things there is getting into the provenance stack, which is: if you're merging datasets, you're going to have a ranked stack order for which ones you trust more than others. And so I think that's the service that may be a thing — validating and normalizing the data so that it can be referenced confidently.

That, to me, is the service to provide. Because I love the question of, when does an LLM know when to pay for the data? Or when does it present that option? And like—
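To make the "ranked stack order" idea concrete, here is a small sketch of merging records from multiple sources by a trust order. The source names, fields, and figures are invented for illustration; real provenance tracking would carry far more metadata than a single rank.

```python
# Hypothetical trust order: sources earlier in the list win when they disagree.
TRUST_ORDER = ["census_bureau", "usafacts", "random_blog"]

records = [
    {"source": "random_blog",   "state": "UT", "population": 3_500_000},
    {"source": "census_bureau", "state": "UT", "population": 3_417_734},
    {"source": "usafacts",      "state": "UT", "population": 3_417_000},
]

def merge_by_trust(records, key="state"):
    """Keep, for each key, the record from the most trusted source that mentions it."""
    rank = {source: i for i, source in enumerate(TRUST_ORDER)}
    best = {}
    for rec in records:
        k = rec[key]
        if k not in best or rank[rec["source"]] < rank[best[k]["source"]]:
            best[k] = rec
    return best

print(merge_by_trust(records))  # the census_bureau record wins for UT
```

The normalization and the trust ranking are exactly the parts a clearinghouse could sell: the raw records may be free, but a defensible answer to "which one do I cite?" is work someone has to do.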

Drew Breunig (01:45:53.516)

What do you think goes into that question? Like, what do you think are the inputs that you can think about in that one?

Jed Sundwall (01:45:57.493)

Well, right. So, I mean, again, what I've said about Source is that what you find at Source are files that have been put there by people or organizations who you may or may not trust, but you at least know who put them there. And so then the question is how — we're building a UI for Source where we want people to be able to tell at first sight whether or not the data they find there is worth their time. We then have to answer the same question for a model.

Drew Breunig (01:46:19.596)

Yeah.

Jed Sundwall (01:46:25.513)

It's like, is this worth my time? Is this worth me spending some of my research budget on? And I think part of that just has to be brute-forced through partnerships, to say that OpenAI recognizes this as a useful data source. Does it make sense to charge for the data at that point, based on some metering thing, at fractions of a penny? Or is it just a partnership where they pay to go in and out, and they get as much as they want? I don't know.

Drew Breunig (01:46:51.212)

Yeah, and that's the thing — you are the thing that is vouching for the data; I think that's the service that's provided. But then you need to be a quality clearinghouse.

Jed Sundwall (01:47:06.251)

That's right. Well, right, you have to have — okay, so here we go, and then we've got to start wrapping this up. But there's a bunch of stuff we can do once we have these — so we have these files, we know who produced them. We can also have DOIs, right? Bear with me if you cringe at the notion of DOIs, as I sometimes do. But we can say this data actually gets cited a lot.

We could track how many citations the data has gotten. We also have metrics that we want to share about how much the data gets used. Hugging Face is great at this with their datasets product, which I love — it's kind of my envy. There's so much signal when you get to a Hugging Face dataset landing page; there's a lot of signal for you to be able to tell, is this being used or not? And that's one way of motivating it. I mean, it's the way

you shop on Amazon, and it's like, this is a best seller. So you're like, okay, if the whole market agrees that this is a good thing to buy, then it's probably good enough for me to buy too. But it's a matter of communicating that both to humans and to agents.

Drew Breunig (01:48:18.616)

I think that's — I mean, maybe you need to build a benchmark. Maybe you need to build a benchmark on quality retrieval from Source datasets, which is: can you correctly augment? And so that, to me, is an interesting thing — can you correctly augment and cite without hallucination? Because that's the challenge: you may get the right pull, but then you don't adhere to the prompt and you rely on something in your weights.

Jed Sundwall (01:48:25.546)

Yeah.

Drew Breunig (01:48:48.756)

So it's kind of like recall on a moving-target data set, which I think is a really interesting idea.
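Here is a rough sketch of the kind of check such a benchmark could run — given what a system cited and the actual source files, flag citations that point at nothing and quoted text the source doesn't contain. The data structures and example values are assumptions for illustration, not an existing eval harness.

```python
def audit_citations(answer_citations, corpus):
    """
    answer_citations: list of {"source_id": ..., "quoted": ...} produced by the system under test.
    corpus: dict mapping source_id -> full source text.
    Returns citations that are hallucinated (unknown id) or unsupported (quote not in the source).
    """
    problems = []
    for cite in answer_citations:
        source = corpus.get(cite["source_id"])
        if source is None:
            problems.append(("hallucinated_source", cite))
        elif cite["quoted"] not in source:
            problems.append(("unsupported_quote", cite))
    return problems

# Toy example data.
corpus = {"enrollment_2024": "Total K-12 enrollment in 2024 was 49.5 million students."}
citations = [
    {"source_id": "enrollment_2024", "quoted": "49.5 million students"},
    {"source_id": "enrollment_1999", "quoted": "52 million students"},
]
print(audit_citations(citations, corpus))  # flags the second citation as hallucinated
```

Because the underlying datasets keep changing, the benchmark has to be re-scored against whatever the sources say today, which is what makes it a recall exercise on a moving target.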

Jed Sundwall (01:48:58.013)

Hmm. Okay. Well, I’m going to have to talk to you about that another time.

Drew Breunig (01:49:02.318)

Because, I mean, that's the challenge. You have a bunch of data, you want to check against it and then validate that the system is actually repeating back what's there. Because I think that's the thing — having high-quality data isn't enough. You need to have high-quality data, and then you need to ship the yardstick for measuring that high-quality data when it's used.

Jed Sundwall (01:49:13.791)

Yeah, right.

Jed Sundwall (01:49:30.401)

All right. Well, the lesson from this conversation is benchmarks. We've got to talk about benchmark design, not just designing great data products, but—

Drew Breunig (01:49:36.418)

Well, I think benchmarks are — because this comes back to Common Crawl, which didn't do anything to its data; it just made it easier to access. It didn't make any choices or anything like that. But I do think it's a really good exercise: all right, if Gil launched Common Crawl to build a better Google, or to build better information recall, or to not have Google monopolize it,

the benchmark he should ship is: you're building a search engine — here are the queries, and here are the ID records you should be finding to serve each query, and you can start to rank against it. I do think that even if you don't ship a benchmark, performing the exercise of what benchmark you would ship for the data product you're looking to ship is a good exercise, because it forces you to say, well, what do I want people to be able to do with this?

And then it focuses the way you package it up.
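And here is a minimal sketch of the search-engine-style benchmark Drew describes: frozen queries, the document IDs a good system should surface, and a recall score to rank entrants against. All the names and IDs below are illustrative, not an actual Common Crawl benchmark.

```python
def recall_at_k(expected_ids, retrieved_ids, k=10):
    """Fraction of the expected documents that show up in the top-k results."""
    hits = sum(1 for doc_id in expected_ids if doc_id in retrieved_ids[:k])
    return hits / len(expected_ids)

# A benchmark is just frozen queries plus the IDs a good system should find.
benchmark = {
    "utah ski resort snowfall": ["crawl-doc-118", "crawl-doc-204"],
    "common crawl license":     ["crawl-doc-031"],
}

def evaluate(search_fn, benchmark, k=10):
    scores = [recall_at_k(expected, search_fn(q), k) for q, expected in benchmark.items()]
    return sum(scores) / len(scores)

# `search_fn` would be an entrant's search engine: query string -> ranked list of doc IDs.
toy_search = lambda q: ["crawl-doc-118"]
print(f"mean recall@10: {evaluate(toy_search, benchmark):.2f}")
```

Even as a thought experiment, writing out the queries and expected results forces the question the hosts land on: what do you actually want people to be able to do with the data?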

Jed Sundwall (01:50:39.829)

Yeah, I mean — so many consequential decisions come from that. So, okay. Well—

Drew Breunig (01:50:51.182)

So who’s building the temporal benchmark? Who do we assign that to?

Jed Sundwall (01:50:54.625)

That's Tyler Erickson — it will be built. No, it's useful feedback. We actually have a bit of funding right now to work on some GeoAI benchmarking work. So, yeah. Anyway, this has been awesome. I couldn't be happier with our first episode. We'll get this out there. We've got a link — Michelle cooked up a website for CNG Conference 2026. So mark your calendars. It's official. There's a URL.

Drew Breunig (01:51:05.196)

that it was my voice Tyler. Yeah.

Jed Sundwall (01:51:24.331)

So, yeah, Drew, I announced this while you were answering the door. We're doing CNG 2026 — same location, but in the fall, the 6th to the 9th of October. No snow, which some people…

Drew Breunig (01:51:33.984)

Ooh, so no snow. That’s a big plus for me.

Jed Sundwall (01:51:42.635)

See, thank you. I'm glad you said that, because people are like, no, I like skiing. It's like — few people have the time and energy to ski. I think most people wanted to get out on the mountain and just couldn't because there was too much snow. Anyway, thanks so much, man. We are going to do this again. I predict you'll be a many-times repeat guest. And thank you everyone for tuning in. This has been a lot of fun to do with a live chat.

Drew Breunig (01:51:58.114)

Yeah, I know.

Jed Sundwall (01:52:12.553)

Really appreciate everybody who chimed in. Anything else? Do you have anything to plug? Okay.

Drew Breunig (01:52:14.944)

Awesome. Well, no — I'll be at the Spatial Data Science Conference talking about GERS, talking about standards. And yeah, now I'm just thinking about evals, man.

Jed Sundwall (01:52:23.091)

good.

Jed Sundwall (01:52:32.437)

All right, well, stay tuned. We’ve got some work going on evals too. So, all right. Bye everybody. Thanks. Bye. Bye.

Drew Breunig (01:52:37.23)

Talk to you later, Jed. Always a pleasure. Bye.