Episodes
Show notes
Jed talks with Denice Ross, Senior Fellow at the Federation of American Scientists and former U.S. Chief Data Scientist, about federal data’s role in American life and what happens when government data tools sunset. Denice led efforts to use disaggregated data to drive better outcomes for all Americans during her time as Deputy U.S. Chief Technology Officer, and now works on building a Federal Data Use Case Repository documenting how federal datasets affect everyday decisions.
The conversation explores why open data initiatives have evolved over the years and how administrative priorities shape public data tool availability. Denice emphasizes that federal data underpins economic growth, public health decisions, and governance at every level. She describes how data users can engage with data stewards to create feedback loops that improve data quality, and why nonprofits and civil society organizations play an essential role in both data collection and advocacy.
Throughout the discussion, Denice and Jed examine the balance between official government data products and innovative tools built by external organizations. They discuss creative solutions for filling data gaps, the importance of identifying tools as “powered by federal data” to preserve datasets, and strategies for protecting federal data accessibility for the long term.
Links and Resources
Takeaways
- Federal data underpins daily life — From public health decisions to economic planning, federal datasets inform choices that affect Americans whether they realize it or not.
- Data tools require active protection — When administrative priorities shift, public data tools can disappear. Building awareness of data dependencies helps preserve access.
- Feedback loops improve data quality — Data users should engage directly with data stewards. Public participation in the data lifecycle leads to better, more relevant datasets.
- Civil society fills critical gaps — Nonprofits and external organizations can collect data and advocate for data resources in ways government cannot.
- Disaggregated data drives equity — Breaking down aggregate statistics reveals disparities and enables targeted interventions that benefit underserved communities.
- External innovation complements government stability — A healthy ecosystem keeps federal data stable while enabling community-driven tools to evolve and serve specific needs.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
Yes. Hello, Denice. Welcome to the Great Data Products. Thanks for joining us from Virginia. Okay. That’s right. Okay. Want to make sure. No, but very, I mean, happy 2026, really, really interesting time to be talking about these things. Just a bit of housekeeping as we get started. This is a, what I like to call a
Denice Ross:
Good to be here.
Denice Ross:
Northern Virginia.
Jed Sundwall:
live stream webinar podcast thing, where we talk about the craft and ergonomics of data and talk to people who, you know, professionals who’ve worked in the production and distribution of data about, you know, what works, what doesn’t work and what we’re working on. you are currently at the Federation of American Scientists as a, how do you describe yourself? Senior advisor, former chief data scientist of the United States. How else do you describe yourself?
Denice Ross:
Senior advisor.
Denice Ross:
That’s a good question. You know, I really like the title former, the former chief data scientist of the United States is serving me well. Yeah, I always wondered why my predecessor DJ Patil used that, you know, after he left his position. He went by the former, and I see it’s a good title.
Jed Sundwall:
Hahaha
Jed Sundwall:
Good, yeah.
Jed Sundwall:
Yeah, it is a good title. Well, and I think, so we also share, I mean, we share a lot of interests, but I think one thing we have in common is that you were a leader in New Orleans in open data back in the day. I also created a thing called Open San Diego back in the day. Can you just share a little bit about your experience in New Orleans and how that got started?
Denice Ross:
Yep.
Denice Ross:
Yeah, absolutely. So I moved to New Orleans in 2001. It was the first time that the internet was a thing as the decennial census data were being released. There was this idea that we could democratize the data, instead of decisions being made about communities behind closed doors by people in power and with resources to analyze the data and access it.
Jed Sundwall:
wow.
Denice Ross:
that neighborhoods and community organizations could have access to that data to advocate on their own behalf. And so, you know, I think when the civic tech movement arrived, you know, in sort of the 2005 to 2010, New Orleans was very primed to be a leader in that space, as was San Diego.
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah, yeah. Well, I mean, so you were early on. I mean, I think Open San Diego came 10 years after that. So you’re way ahead of the game there. That’s fascinating. Well, OK, so as discussed, I mean, you know, as we planned for this, I’m curious to know what you’re looking forward to this year, both what you’re working on and sort of more broadly where you see things going.
Denice Ross:
Yeah, absolutely. So 2025 was tumultuous. I think we can all agree on that from the data perspective. And as we head into 2026, what we have, though, is a pretty activated and informed citizenry around the role that the federal government plays in our everyday lives and our economy.
Jed Sundwall:
He’s really polite, yeah.
Denice Ross:
and just running a modern society, and also the role that data play, that federal data play. I think there’s less of a tendency now to take for granted data like the weather and data on jobs and the economy. So that to me feels like a good foundation to start building out a plan for what we want for the future of federal data. And at the same time also really protect the core of the federal data that we depend on, that we may not really be paying attention to yet and perhaps have been taking for granted.
Jed Sundwall:
Yeah, actually, I mean, this idea of taking things for granted, I think, is really something worth dwelling on. There is so much that we take for granted that we don’t notice it until it’s gone or until it’s disrupted. I think, you know, my dad worked in public health his whole career. And when COVID hit, suddenly the pandemic put
notions of public health and response and interventions and hard decisions into, you know, into people’s minds, and everyone starts freaking out. They’re like, why is the government telling me what to do? And he realized, I mean, I think this is pretty insightful, that public health sort of had become a victim of its own success, in that everyone just sort of takes for granted the fact that everyone learns to wash their hands
Denice Ross:
You
Jed Sundwall:
growing up, you know, like there’s sort of the basic cultural norms around hygiene and behavior and things like that. It actually took a ton of work to figure out how to get that out into the world and to train everybody on that. And that was done by public servants for the most part. And you don’t want to do a rug pull on those sorts of things, because, anyway, we just take them all for granted. But I’m curious to get your take: what do you consider when you talk about core data?
Denice Ross:
Mm-hmm.
Jed Sundwall:
Are there specific data products that you have in mind or categories of data or what?
Denice Ross:
There are. And you know, actually, though, as you’re talking about this idea of taking things for granted, I’m reminded of early in my career, I worked in lunar and planetary sciences. And I talked to this real old-school planetary scientist. And his take was that the American space program had suffered because of science fiction, because Americans thought we could do so much more
in terms of, like, you know, exploring space than we actually could. And after Katrina, we used to joke, because people just assumed that we would have the information on, like, who’s moving back and what do they need and what are their characteristics and how many households have access to a vehicle and how many sexually active teenagers of this particular demographic live in Marrero. You know, like people thought that we had this really detailed data.
Jed Sundwall:
Right.
Denice Ross:
And we used to joke that they thought that maybe the Star Trek Enterprise could just scan the planet and get us the data that we need. And so I think there’s two things. We take for granted the data that are flowing. And we also just, like, assume that we have access to data that are really important. And as you know, like, it takes…
It takes a lot of effort and resources and coordination to create a data collection, a lot of intentionality. It doesn’t happen accidentally. And so as we think about the future, yeah, as we think about the future, it’s not just like what data do we, what are the core data that we are currently collecting, but also what data should we be collecting moving forward.
Jed Sundwall:
yeah, no exactly.
Jed Sundwall:
Right. Well, yeah. So I guess, well, yeah, there’s a lot of threads to pull on here. I mean, you’ve been outspoken talking about the need for federal data. So maybe we can start there and just kind of like, what is that category? And before I let you answer, I’ll just make one point. We grappled with this when I created Open San Diego, because we’re like, well, whose data are we talking about?
What are we advocating for? And what we landed on was data about San Diego, because there’s a San Diego County, there’s a City of San Diego. There’s also, I think, maybe the most heavily trafficked border crossing in the world, at San Ysidro. So there’s Mexico data and trade data. There’s all sorts of data, we realized. Like, there’s a lot of data about San Diego that’s independent of the city government or the county government.
Denice Ross:
Right. Mm.
Jed Sundwall:
So when you talk about federal data as a category, what are you talking about?
Denice Ross:
Yeah, and that’s a really good distinction. So federal data are data that are produced by the federal government or with funding from the federal government. A lot of scientific data, health, climate, and environment are created through relationships with universities and whatnot. But I would call all of that federal data.
There’s two ways to think about what is core to me. One is thinking about the primary collection of the data: what types of data sets need scale and real comprehensiveness, so we’re not leaving any places or people behind, such that only the federal government can do it. And so that’s sort of the horizontal of the core data. And then there’s the vertical. And that is, maybe the federal government
collects the data, but then they also create different ways of accessing the data through lookup tools and maps and various APIs and resources. And that’s always a tension within federal government is how much do you build out those derivative works so that you can meet the needs of specific populations of Americans who need to make decisions or navigate some process.
Jed Sundwall:
Yeah. Yeah, that’s actually, I mean, I’m very curious to get your take on this. I mean, the naming of this podcast comes from this, you know, this one weird trick that we do at Radiant Earth, where we just really yammer on about this: we have to talk about data products. Like, I think one of the challenges that people like you and I have faced over our years working on this sort of stuff is that it’s very easy and fun and
apparently you can just talk about data in the abstract for as long as you want. But that doesn’t always get you, that might not get you very far. We find it’s really useful to talk about products. So like what you’re describing in this vertical thing, which is like APIs, maps, other tools and things like that, those are products and they have users in mind. And I’m curious to know, who are the users that, throughout your experience, you’ve engaged with most in that?
Denice Ross:
Mm-hmm.
Jed Sundwall:
Yeah, like who are these people? Because it’s not like average citizens, I think, in most cases.
Denice Ross:
Well, interestingly, so I’ll just mention a few recent examples of federal data that I’ve seen in the wild. I was getting money out of the ATM the other day, and I bank with USAA. They serve the military community. And the screen, when I was going to get my money, talked about firearm safety and suicide prevention.
Jed Sundwall:
You can correct me on that.
Denice Ross:
The reason that that campaign has been so successful is because it was based on evidence from the National Death Index that found that veteran suicide rates went down when veterans were locking up their firearms. And so that federal data spurred this very successful social media campaign that then made it to my ATM.
Another example is, you know, we go camping with the Scouts a lot. And when you get to a campsite, you know, there’s that old school wooden sign that tells you what the fire danger is. Well, that’s an official federal data set that’s informing which wooden sign gets hung on the hooks. And another example is when you go to the pharmacy and, you know, you might be prescribed a generic equivalent.
there’s an official data set out of the FDA, it’s called the Orange Book, that determines the generic drug equivalency for brand name drugs. And so those are just like a few touch points where, you every day we’re interacting with federal data that has made it into the real world.
Jed Sundwall:
wow. Yeah.
Jed Sundwall:
Yeah, I love that. I mean, there’s this reminder that, like, that wooden sign at a campsite is a data visualization. That’s a user interface, right? You never think of it that way, but that’s actually what it is. Actually, yeah, this is a good segue into this other thing I wanted to ask you about. So when we publish this as a podcast episode, we’ll put this in the show notes, but you were on Marketplace last year,
Denice Ross:
Great.
Jed Sundwall:
which I’m jealous of because I love Marketplace. But you said in that segment how you’ve felt like a lot of those tools and interfaces that the federal government has provided are maybe like almost like demos that should inspire others to build on top of. I think the USAA example is a really interesting one where that’s taking data to this weird endpoint, which is an ATM screen, but it’s actually a good channel to get the data out there.
I’m just curious if you could say more about how you see that playing out or how you’d like to see more of, I don’t want to say private sector, but like other actors taking federal data and building on top.
Denice Ross:
Yeah.
Denice Ross:
Yeah, this, you know, my thinking on this really solidified in the years after Hurricane Katrina, because I was on the outside of federal government working for a data intermediary. And federal data couldn’t keep up with the rapid changing of, you know, both the exodus when 80% of the city flooded and then the…
people rapidly coming back and also sort of different types of people as we were rebuilding. And we desperately needed information from local government in order to track those changes and to be able to have some community participation so that the recovery was complete and equitable. And I remember going to City Hall and asking…
I think it was like a parcel layer, or a list of childcare centers or something like that. And the contractor who was running the data at that time tried to set up sort of a quid pro quo. Like, well, I’ll give you this data if you give me this data that you have. I’m like, but that’s your job, like you guys are the ones who produce this data. Like you’re the primary data producer and you’re the only ones who can give this data to the citizenry.
And although they were making the data available in maps, they weren’t making the raw data available, which you remember was an issue in the early days of the open data movement. And so at that point, I became pretty fixed in my sense that if a data set can only be produced by government, then that should absolutely be their priority. Like as resources come and go, protect the core of that collection because as long as it’s made open, then
others can build on it and innovate on it. But if the federal government or the local government’s not doing their job with that primary data collection and the publishing of it, then everything sort of falls apart and you have to get creative with inadequate proxies. so just given the limited resources that governments have, I really do focus on that primary role of collection and publishing.
Jed Sundwall:
Yeah.
Denice Ross:
and maintaining the high data quality and then comparability and the continuity across time and space. That said, as you start to think about the different uses for any specific data set, there’s so many. Think about the American Community Survey or Landsat data, for example. Both of them have such broad uses across very different domains that it would be unreasonable to expect that the federal government would build tools to meet all of those use cases. And especially, you know, we’ve interacted with government websites, right? Like the government doesn’t generally do a good job at creating websites and tools, and maybe they do a good job once, but then, you know, it starts to age and, you know, isn’t sustained in the way that a more modern product life cycle outside of government might sustain it.
Jed Sundwall:
Alright.
Yeah, yeah.
Jed Sundwall:
Right. Yeah. I mean, we don’t need to pick on government people too, too hard, but it’s easy to fall into that. We can talk about procurement issues and why the government’s not that great at managing digital services or improving them over time. But like, I totally agree. I’ve felt this way for a long time. A lot of this came from our work when I was at AWS; we worked a lot with NOAA on publishing their data. And it was this kind of funny,
now that I think about it, it’s sort of a funny relationship in that we all sort of agreed. NOAA was like, look, we can produce the data, but we really need you to get it out to more people. And we’re like, okay, that makes sense. But then also, like, I can talk about my former employer, AWS doesn’t make great user interfaces either. Like AWS is, I mean, as far as infrastructure as a service goes, hard to beat, you know, they’ve done very well.
Denice Ross:
Mm-hmm.
Denice Ross:
Right.
Jed Sundwall:
But like when it comes to producing consumer-facing end user interfaces that can reach a lot of people, constitutionally the company just doesn’t seem that great at it. That’s not really what AWS is built to do. Other people build those interfaces on top of AWS, and that’s how we did it. But I’m just agreeing with you pretty violently that, like, it’s okay to have the government stop at some point and let other actors take over to get things
Denice Ross:
Right.
Jed Sundwall:
the last mile.
Denice Ross:
Yeah, I think it’s how we build resilience into the system, frankly. Like, you know, let the federal government focus on the core. What is missing, though, to make this really work are the feedback loops, so that federal data stewards have a really good sense of both how the data are being used, how the data could be improved to better meet the use cases, and then what untapped
possibilities are there for the data to better serve the American people if the federal data collection adjusts to changing conditions or data needs. And those feedback loops, when I was in the Biden administration, we did talk about how we might infuse more public participation and community engagement around federal data.
And it’s tough, like right now, the main avenue of giving feedback on a given data set really only applies to data sets that are collected through forms and surveys and subject to the Paperwork Reduction Act, which triggers this sort of public notice and comment period. And then you have to be, like, watching the Federal Register to know that a comment period just opened. For example, just…
Jed Sundwall:
Right? Right.
Denice Ross:
Tomorrow, so, I’m working on a project, two projects right now, which I should mention. The first is dataindex.us. It’s a collective of federal data watchers. We started with that Paperwork Reduction Act data on changes to forms and surveys and are expanding to scientific and health and environmental and other types of data.
But we’re monitoring changes to the federal data and looking for opportunities for public input, because when those policy windows open, those are going to be the times when public input is going to make the biggest difference. And so tomorrow we’ve got a webinar about the Pregnancy Risk Assessment Monitoring System, I believe it’s called. But it’s basically the only way that we understand maternal and infant mortality
in America. And that collection, interestingly, if you think about how those data are collected, it has to come from local public health institutions. Then it reports up into the states and then the CDC. And recently, the CDC stepped back
on aggregating the data at the national level. So now researchers, if you want to study maternal and infant health, you have to go to every state individually and ask for the data, which introduces so much friction into the system, right?
Jed Sundwall:
Yeah. Oh man. I mean, we dealt with this all the time when I was running the open data program at AWS. It was almost clockwork, like at least once a month, at this pretty regular cycle, some people were like, hey, wouldn’t it be cool if we had all of the X data about cities in the country? Like crime. I mean, the crime one came up a lot. It was like, wouldn’t it be cool if we had a data set of all of the crime in different cities in America? And I’m like, that would be cool. Who does that?
Like who would do it? It’s a very expensive process to carry out. And I agree it would be cool, but we have to find somebody who actually is intended to do that. CDC, very clear, obvious mission here that’s, you know, historically been funded to do this sort of thing. Um, so I’ll just go ahead and say it, you know, although it’s already 2026, like,
we can talk about core data and these sorts of things, but then what happens when the arbiter of the core data might not be seen as trustworthy?
Denice Ross:
Right, or just drops the ball, as is happening with PRAMS now. Or if you think about what happened for the first year of COVID, where civil society, the COVID Tracking Project and Hopkins and others, filled in that role of harvesting the data from state and local health departments. And then it took about a year until the federal government really was on the ball with that.
Jed Sundwall:
Right.
Denice Ross:
There’s another example, though, recently, speaking of crime. So historically, the FBI has released their crime data once a year. The year closes out at the end of December, and then it takes nine months to process the data. It’s the official statistics, so quality and continuity and all these things are really important. So it takes nine months, and then they’re published. But that’s not timely enough for really understanding, for example,
you know, is carjacking becoming a problem, or, like, what are the trends that we’re seeing in murder, and informing the national dialogue and local policies. So last September, Jeff Asher and his colleagues created the Real-Time Crime Index, where they are hoovering up data directly from the nation’s law enforcement agencies and then creating a monthly estimate.
And I was in the White House the month that that first monthly estimate dropped. And it was amazing. Like immediately, every policymaker who was working on violence, especially gun violence in America, they changed the way that they consume their data about crime in America. And so they go to this Real-Time Crime Index for the monthly updates. But then it’s still essential to…
Jed Sundwall:
interesting.
Denice Ross:
benchmark that to the official data coming out of the FBI. And what I really love about the resilience that that builds into the system: we need both. We need the official, slower, but really comprehensive and high-quality data coming out of federal agencies. Data that, you know, the FBI director can go before Congress and talk about with confidence. So we need that. And then we also need
some of the scrappier sort of civil society best guesses of how things are going. They don’t have to go testify before Congress to talk about the quality of the data, right? They can have their methodology, and it might be a little black-boxy. And there might even be competitors in the space giving slightly different perspectives on what’s happening. We see that happen with flood risk, for example, where there are different models that consume a lot of federal data and tell you how at risk your particular property is.
Jed Sundwall:
Right. Yeah.
Denice Ross:
And I think that that combination of the official data plus the innovative data that might trade a little bit of quality for timeliness is important given how fast things are changing in America around crime and climate and society.
Jed Sundwall:
Yeah. Well, I mean, I also think it’s super useful to acknowledge that it’s always a, I don’t want to say a negotiation, but, like, I think, you know, all models are wrong, but some are useful. That idea is to understand that authoritative data is useful in the sense that there’s a methodology, you might
Denice Ross:
Mm-hmm.
Jed Sundwall:
be more comfortable about how it’s governed and produced. But it doesn’t always mean that it’s the end-all be-all absolute truth, you know. It might be data that you’re required for some regulatory reason to rely on. It might be the safest data to use, so if you are hauled in front of Congress, you can say where you got your numbers from. But like, I think it’s worthwhile to
Denice Ross:
Yep.
Jed Sundwall:
engage with that idea that, like, okay, it is useful to have authoritative data for some reasons, but we shouldn’t just sort of rest on our laurels and say, oh, that’s the data from the government, so it must be true, you know? Yeah.
Denice Ross:
Right. Yeah, absolutely. And the other nice thing about having authoritative data then plus the innovation happening in civil society is, for example, with the crime data, the FBI sets the standards for that data. And then every software vendor in America serving law enforcement agencies conforms to those standards. So that gives you the comparability on the basics.
But then often law enforcement agencies need more details. So, for example, some innovations were happening over the last few years because jurisdictions realized that they needed data on non-fatal shootings, not just the fatal ones. And the FBI standards didn’t include that. And so cities like Philadelphia and other cities started collecting data on non-fatal shootings to inform
their policing practices and community engagement. And so that innovation started to happen at the local level. And then the slower process of incorporating that into the official government standards was happening at the same time. And then in the last few months, that became an official part of the new standard, which would then be propagated across all of the nation’s law enforcement agencies. So there’s a really nice interplay between
Jed Sundwall:
Interesting to see it.
Denice Ross:
the slow building of standards and the sort of field expedient data collections that communities need in order to answer the questions that are before them.
Jed Sundwall:
That’s a great story. Have I ever shared with you this white paper that we published last year called Emergent Standards? I’ll send it to you. I’ll put it in the show notes. Like, I tell the story of RSS, which is what’s used to publish blogs and podcasts and things like that, and GTFS, which is the General Transit Feed Specification. That’s how transit authorities share data,
Denice Ross:
Mm-mm.
Jed Sundwall:
largely with, like, Google Maps and the big map apps, like Apple Maps. But it tells stories similar to what you’re just saying, which is that you do have to have kind of large institutions that can give the imprimatur or set standards or sort of define requirements in a way. But they should negotiate and engage with the data practitioners and learn from one another. And the web is actually really good at enabling that kind of
negotiation. And then after a while, people are like, okay, yeah, this is the standard. This is how we describe this data. This is what counts as a shooting, like in your case. You know, but that’s a negotiation among a bunch of different actors and data users that has to happen. And it’s never as simple as saying the standard that some government agency set is the one and everyone agrees. I think you’ve probably lived this, I mean, many times,
Denice Ross:
Mm-hmm. Yeah.
Jed Sundwall:
why that’s not true. Okay. Well, I’m also curious to get, I mean, this was relevant to that. Actually, hold on, before I go on, you said you were working on two projects. You mentioned dataindex.us. What’s the other thing? You should brag about what you’re doing.
Denice Ross:
Mm-hmm.
Yeah.
Denice Ross:
In the first Trump administration, when there were concerns about data, especially around climate and environment, disappearing, and also concerns about the decennial census that took place in 2020, it became clear to me that we as data users and stakeholders and advocates had not done a good job of telling the story about why data matter.
And so that’s been some serious unfinished business for me. And as I saw things unfold almost a year ago with the pulldown of so many data sets to remove elements that were not compatible with administration priorities like DEI and gender and climate,
I saw the narrative in the media about how researchers were going to be harmed by the disappearing data. And I was like, no, actually, all Americans are going to be harmed by the degradation of federal data capacity. I realized as I started to look at how we generally think about data use cases, we center the user of the data and what task they need to accomplish.
for some outcome that they’re trying to reach. And I thought, well, what if we flip the script a little bit and focus on the beneficiary of the data rather than the user of the data? So for example, a cancer patient can find a clinical trial that’s a good fit for them because the clinicaltrials.gov data set
is easily available and they can sort by the condition that they have. Or a football coach knows to move practice inside when it gets too hot so his players don’t get heat stroke, because the National Weather Service publishes the heat index. And so what we’ve done with a website called essentialdata.us is we’ve been crowdsourcing and building up
Denice Ross:
these little one-sentence love letters about how specific federal data sets benefit everyday Americans and their livelihoods. And we’re almost at 100 data sets, about nine months in. And it’s just been such a delight. But I’ll tell you, it takes about 20 to 30 minutes talking with a data user to shift their perspective from centering the users of the data
Jed Sundwall:
Nice.
Denice Ross:
to centering those who benefit from the data. I had these doubts at the beginning. I was like, this is just too obvious. But it's actually a big mindset shift, and I think anyone who cares about data needs to undergo that shift so that we can talk about how data benefits people in their everyday lives.
Jed Sundwall:
Interesting.
Jed Sundwall:
Yeah. Oh man, I have so many thoughts about this issue. A weird one, though, comes from a book I read years ago called Entangled Life, which is about fungi. It's an awesome book, actually a great book, but there's one insight in it where the author points out: we're humans that live on the surface of the earth, and we see things above the soil. So
we look at a plant or a tree and we're like, yeah, that's a tree. There it is, I'm looking at it. And he's like, well, you don't see all of the fungal activity in the soil that's transferring nutrients, and, we've learned, information, from that tree to other plants and other life forms around it. So there's all this stuff going on underneath that we just cannot see and never consider at all. We think of a tree as a tree, and sure, it's a tree, but it's a part of so much else.
Denice Ross:
Right.
Jed Sundwall:
And this is going back to the whole taking things for granted thing. We live on this substrate that no one thinks about at all. We're the beneficiaries of all of it, but it's totally invisible to people. Yeah.
Denice Ross:
Yeah, I love that metaphor. And it reminds me of digital tools and how they consume federal data. For example, all the real estate apps like Zillow and Redfin consume data from the Department of Education about school performance. But it actually takes a lot of work to figure out that that data is federal data.
Jed Sundwall:
Right. Yeah.
Denice Ross:
And that’s one of the tricky things about these digital tools that we build is that we make it look like the data are all there and we sort of hide where it’s coming from and how it might be at risk. I remember…
Jed Sundwall:
Right.
Denice Ross:
I remember a survey question about attitudes around the decennial census data. People were asked whether the decennial census data is unique, like, is it something that only the federal government can produce? And a common answer was, no, no, you can get that data from Google.
Jed Sundwall:
Yeah.
Jed Sundwall:
Wow, amazing. Yeah. Yeah.
Denice Ross:
Right? And it's like, yeah, you can, but Google wouldn't have the data if the census didn't exist. And we've had some rough patches, right? Like with the economic data during the shutdown, where the private sector was able to sort of fill the gaps. But you have to have that federal benchmark to snap to, or the private sector data is going to veer further and further from reality.
Jed Sundwall:
Oh yeah. Well, this is going back to this feedback loop thing, which, you know, we don't have great feedback loops, right? Federal data providers, or a lot of government data providers, really don't have many ways to know how their data is being used and how valuable it is. And this is where I'm approaching a third rail, because I'm going to talk about data markets and pricing and things like that.
This Google example is kind of funny, because Landsat has a sort of similar story. Landsat had been around for a long time and was very widely used. For those who don't know, though I think most people listening to this podcast are familiar with it, Landsat is satellite earth observation data provided by USGS. But then Google Earth Engine gets created.
I won't go into the whole history of how it was created, but suddenly Google has this thing called Google Earth Engine, an incredibly powerful tool that makes Landsat so much more accessible to people and leads to an explosion in usage of Landsat. I should take some credit: at AWS, we subsequently did something similar, putting Landsat data into AWS. But I do know that there was some consternation at USGS that Google Earth Engine was getting all this credit for Landsat.
Denice Ross:
Right.
Jed Sundwall:
Which is fair, you know. It's like, well, hang on, we've been doing this forever. Google didn't fly the satellite or take the risk in the seventies of developing this program and keeping it going for decades. But this is where we get into the third rail territory, which is that Google Earth Engine was able to do what they did, and I was able to do what I did at AWS, because the data was free and open. And because of that…
Denice Ross:
Yeah.
Jed Sundwall:
There's a recent study from USGS showing that the value of Landsat is billions of dollars for the economy. And I'm like, well, if that's true, why can't you defend yourself? How are you not able to capture any of that value to make sure you continue to exist? And I guess I'll just leave that there for you to respond to, because I do think this.
Those of us who are open data enthusiasts have divorced ourselves from getting useful signal from markets. And I don’t know if that’s worth re-examining.
Denice Ross:
It’s a really good time for the private sector to step up and advocate for the continued flow of the data that they depend on.
Jed Sundwall:
Agree.
Denice Ross:
We haven't seen a lot of that, frankly. I mean, if you think about data advocacy, it tends to be more nonprofits and academics. And I think Steve Ballmer, the former Microsoft leader, with USA Facts, is one of the few private sector folks who's been really advocating for the continued flow of federal data.
One thing to keep in mind, and I know there’s concern about appearing to be anti-administration, but there’s nothing inherently political about wanting data to keep flowing. And in fact, the Evidence Act was signed by President Trump in his first term.
and has a section in there that requires federal data stewards to engage with the public so that they can better understand how the data are used and how the data can be improved. So that type of public engagement is baked into the law that President Trump signed in 2019. In the federal government, we just haven't done a great job of creating those feedback loops.
And that's why, with the work that we're doing at dataindex.us, we're trying to bridge that gap so that people who care about data don't need to monitor the Federal Register on their own or keep an eagle eye on LinkedIn to see if their favorite data set is at risk. We sort of centralize the heavy lifting. And then when there's an opportunity where public input can be really useful, we mobilize folks
to submit their public comments.
Jed Sundwall:
Yeah, great. Well, what I'll add to that, though, is that there are also just basic analytics we should be better at doing. It's crazy to me how hard it is to count data usage. In fact, I had a text exchange about this earlier: on Source Cooperative, we host three petabytes of data now, and we're logging over 150 million requests a month. And I was saying,
Denice Ross:
Oh my gosh, so true. Yeah.
Denice Ross:
Right.
Jed Sundwall:
shout out to Avery Cohen earlier today, I'm like, it gets really annoying when you're counting tens of millions of things, you know, requests, and then filtering through those and figuring out which data sets are being accessed. Do we know anything about who's accessing them? What is this data even telling us? But in any event, at a minimum we should be able to know, and this is also a hard conversation that's starting to happen more and more often, that
some data just never gets used, and maybe we should let some of it go. I think the term I've heard a lot in 2025 is "joyous funeral": there are probably some data products where we're like, okay, we can let these ones go. It's okay. You know.
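[Editor's note: the counting problem Jed describes, millions of log lines filtered down to per-dataset tallies, can be sketched in a few lines of Python. This is a hypothetical illustration only; the log format and dataset paths below are invented, and Source Cooperative's actual pipeline is certainly more involved.]

```python
from collections import Counter

# Hypothetical access-log lines ("timestamp path status"); real logs
# from any host will differ in format and volume.
log_lines = [
    "2025-01-01T00:00:01Z /noaa/landsat/scene1.tif 200",
    "2025-01-01T00:00:02Z /usgs/bats/records.csv 200",
    "2025-01-01T00:00:03Z /noaa/landsat/scene2.tif 200",
    "2025-01-01T00:00:04Z /noaa/landsat/missing.tif 404",
]

def count_by_dataset(lines):
    """Tally successful requests per dataset, taking the first two
    path segments as the dataset identifier."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        _, path, status = parts
        if status == "200":
            dataset = "/".join(path.strip("/").split("/")[:2])
            counts[dataset] += 1
    return counts

print(count_by_dataset(log_lines))
# Counter({'noaa/landsat': 2, 'usgs/bats': 1})
```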
Denice Ross:
No, I like that. I like the concept of a joyous funeral. I have enough humility now, having been in the field of data for 20 years, to know that I don't know what all the use cases are. And you just never know. So I'll mention one of my favorite data sets: the North American Bat Monitoring Program database. Yeah, it's this geospatial data set out of USGS. And there are 400
Jed Sundwall:
Ooh.
Denice Ross:
organizations around the country that contribute to it, information on bat species, their locations, what they’re doing. And you might think like, well, why is the federal government collecting data on bats? Well, it turns out that bats provide billions of dollars of free services every year to America’s farmers. And if you want to protect that free service, you have to protect the bats. And if you want to protect bats, you need to know where they are. And if you’re building like a
wind farm or expanding a mining operation or renovating a highway overpass, that all requires permitting, which will require you to make sure you're not harming bats. And if the bat database didn't exist, every one of those developers would have to, what? I don't know, count the bats themselves to figure out what the impact would be.
And so this streamlines permitting and makes it easier for development to happen in a responsible way. And then there's also some research showing that in agricultural areas where there have been precipitous declines in bat populations, due to disease for example, infant mortality goes up,
which is strange, right? But the hypothesis here is that if the bats aren't providing that free service of insect removal, then farmers need to use more pesticides,
Jed Sundwall:
Yeah, okay.
Denice Ross:
which gets into the bloodstream of pregnant women. So for an infant's death, you wouldn't say, well, that's attributable to the fact that the North American bat monitoring database went away. But you have to be really careful about what data we say are not important anymore. And that's, frankly, one of the blind spots that we have: who's using this data? They're probably quietly in their basement,
Jed Sundwall:
Interesting. Has its own issue. Wow.
Jed Sundwall:
Right.
Denice Ross:
you know, deep in some building, using this data, but it could have some super high impact application that just isn't that public.
Jed Sundwall:
Yeah. No, I mean, it's kind of inevitable that I bring this up at some point. I've never talked about this on the podcast, but there's a famous xkcd comic about open source dependencies. Hang on, I'll put it in there. I guarantee there are people I know who've memorized the URL for this: it's xkcd 2347.
It's the comic where there's this huge towering, complex bit of digital infrastructure, and it's all running off of one random thing that some guy in a basement is maintaining. Or, you know, a bat database that a very dedicated and continually abused public servant has been heroically maintaining forever.
And this is why I say I'm always very cautious and get nervous when I talk about market signal to support data: there are data that are maybe very valuable, but for which the market signal is going to be extremely weak. The market won't tell us that they're valuable. And actually, this is where, and I think you'll agree with me, the government's role is so important, because there's
all sorts of stuff that there’s no market signal for, but that we should probably be doing. And it’s the government’s responsibility to make those things happen.
Denice Ross:
Yeah, and that's one thing. So, having served in both the Obama administration and the Biden administration: in Obama, the focus was on open government, which was exciting and sent shockwaves, really good shockwaves, throughout the nation and through state and local governments. And then the
first Trump administration was so focused on building evidence and data capacity, and they installed a chief data officer in every major agency. And so when I came back in the Biden administration, there was so much more data capacity in federal agencies. And what Biden really leaned into, and what my role as the chief data scientist was about, was how can we build the data backbone across agencies so that
we're delivering better outcomes for all Americans. If you want to do that, you need to disaggregate the data in ways that the market may not be interested in. So you need to understand, you know, veteran status, caregivers, survivors; you need to understand rural versus urban, the role of sexual orientation and gender identity in outcomes, race, ethnicity, gender,
primary language spoken at home, whether you have access to a vehicle. There are just so many ways to slice and dice the data to see which populations or areas might be overburdened or left behind, and then adjust our policies and our programs so that we're benefiting all Americans. And if you don't…
If you don't disaggregate the data to identify those disparities, it's really easy to look at a number like, you know, we're serving 99% of America, and declare mission accomplished. But if you look at that 1%, it's almost never evenly distributed. If you look at it geographically, what you see is that the places left behind are Appalachia, the Southern Black Belt,
Jed Sundwall:
yeah.
Denice Ross:
tribal communities, the border with Mexico, rural America. You know, the same places and the same groups of people are left behind repeatedly. Market forces aren't going to raise those data to consciousness.
Jed Sundwall:
Absolutely, yeah. I'll agree with you a hundred percent. Well, okay, I'm going to shift gears a little bit, because I'm leading you into talking about a dataset and a story that I think is really interesting.
Historically, you know, if we go back far enough, for a while there it was only the federal government that even had a computer. So we've historically had to look to the government to gather and store data, just because you needed the most powerful nation state in the world to even be able to do it in the first place. Those days are long gone. There's all sorts of data that can be produced by non-government actors. You can call them commercial actors or other groups. I mean,
Denice Ross:
Hahaha
Jed Sundwall:
the Environmental Defense Fund famously launched their own satellite, which was lost, which is sad, but they did it. They launched a satellite that produced data. So we're well past the point where we necessarily need the federal government to do all this sort of stuff. Do you have any thoughts on when it's okay for other organizations to take over or to step in
Denice Ross:
Hmm.
Jed Sundwall:
to support this kind of work and how do we know when that’s appropriate or not?
Denice Ross:
Yeah, I have a few thoughts. Maybe three examples come to mind. The first goes back to that idea of primary data production and the unique role that the federal government has in producing core primary data, and then there are the data products that can be built with those data. A recent example is the billion-dollar weather and climate disasters data set.
It was terminated in 2025, but it's a NOAA data product, and Climate Central hired the NOAA researcher behind that data set. They are using similar methodology as was used when it was inside of government, but improving upon it. They're talking about reducing the threshold so that they can track million-dollar disasters.
So, you know, maybe that's the best place for the billion-dollar disaster data set, as long as the federal data that feed it keep flowing.
Jed Sundwall:
Yeah, yeah, yeah, right.
Denice Ross:
So that’s the big if there, right? So that’s one thing. But then if you talk about something like the Framingham Heart Study, that’s a federally funded study that completely transformed our understanding of heart disease.
Jed Sundwall:
Yes, this is the one I was…
Denice Ross:
It was a federal program that was initiated after World War II. Our president had recently died of heart disease. I think 40-plus percent of American men had heart disease at the time, so heart disease was very much in the national consciousness. This was a priority. Congress funded the study for 20 years, and at the end of that 20-year span, the National Heart Institute announced that it was going to phase out the study the next year.
So the researchers, similar to what’s happening right now with climate and health and other research that’s been federally funded, that’s been producing essential data, the researchers started looking for other funding sources and they ended up raising money to keep this collection alive from unlikely groups, including the Tobacco Research Council and Oscar Mayer Meat Processing.
So they went to the private sector to fund the collection during the in-between years. But the really cool part of this story is that it's one thing to find a way to keep the collection going, to maintain that continuity, right? Because that's what turns science into knowledge, into action: the continuity across time and space. But you also have to have a policy game there, because the federal government
Jed Sundwall:
Yeah.
Denice Ross:
really should be the steward of these really critical data collections. And it turned out that President Nixon's personal physician was a real stakeholder in this heart study, and he talked Nixon into advocating to get the funding turned back on for the Framingham Heart Study. So it was this DC-style interaction between the president's doctor and the president
Jed Sundwall:
Interesting.
Denice Ross:
that then got the funding back on track. And it came back stronger than ever when it was funded again. They recruited the children of the original volunteers, and now that study is three generations long. And as the demographics of Framingham, Massachusetts changed, they started to widen the sample beyond those initial families so that they could be more representative of the demographics of the US.
Jed Sundwall:
wow.
Denice Ross:
So, you know, I think there are some parallels for where we are right now, where we might be seeing some gaps in federal support. And so maybe we think about this as, let's create sort of a heart-lung bypass machine for our data, right? To keep it alive, keep the continuity there, but then let's figure out what the long-term policy plays are to make sure that the data we need as a nation continue to flow and come back stronger.
Jed Sundwall:
Fascinating. Yeah.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. I mean, this is where I will advocate for what I talk about a lot at Radiant Earth, which is new institutions, new data institutions. Which is to say, I won't say I disagree, but maybe the federal government isn't always the right steward; they're a very important stakeholder, right? So I guess framinghamheartstudy.org, I assume, I just found the website, is
some kind of independent nonprofit or entity in which the federal government is a large stakeholder, as is Oscar Mayer. I don't know if Oscar Mayer is still involved, or Altria, or whatever Philip Morris is now called. But the point is, it is actually an independent entity that is able to receive resources from
Denice Ross:
Hahaha
Denice Ross:
Right.
Jed Sundwall:
a lot of different stakeholders. And yes, I would agree that the federal government should be involved; this should be a national priority, to understand these things. Yeah.
Denice Ross:
No, and I agree. And I think those types of more creative arrangements that you often see in the sciences can build resilience into the system. Some data sets don't have that luxury. For example, with the Federal Employee Viewpoint Survey that OPM runs every year: during the greatest disruption ever to the federal workforce, there won't be any data collected on
Jed Sundwall:
Yeah, great example.
Denice Ross:
how employees feel about it. And so the Partnership for Public Service stepped in and they're running a lighter-weight version of the survey, but they don't have the Rolodex to reach out to every federal employee. I'm grateful that the Partnership for Public Service is running it, but it's not a replacement for what the Office of Personnel Management should be doing.
Jed Sundwall:
Yeah. Well, then we can start landing this plane, but with a pretty big question: knowing what we know now, how would we protect a data product like that survey? Do you have any ideas?
Denice Ross:
I do, I do. If I could just go back for a second, though. So I talked about the billion-dollar disaster data set and the heart study. And then the third example is data that I think really do belong in the private sector but have a really important public use.
Jed Sundwall:
please.
Jed Sundwall:
Yeah, you said three examples. I wasn't sure if that was all of them.
Denice Ross:
And this is when there’s a disaster, one of the important pieces for response and recovery is knowing which gas stations are open.
Jed Sundwall:
Okay.
Jed Sundwall:
Makes sense.
Denice Ross:
And so right after Superstorm Sandy, the Energy Information Administration was literally calling gas stations to see if they were open and if they had gas. And I don’t know if you remember the news coverage from that time, but gas was in short supply and tempers were flaring and there were lines of cars at gas stations just trying to get fuel so they could evacuate or go wherever they needed to go.
Jed Sundwall:
Amazing.
Denice Ross:
And so you can imagine how well received the phone call from the federal government was by that poor gas station owner, trying to get a sense for whether the station was open or closed. And then the data were so volatile that who knows what the actual status was. It turns out that a company like GasBuddy, a crowdsourcing tool that's used especially by truckers, rideshare drivers, and taxi drivers, has solved this.
The way it works is that you go get gas and you type in the amount that you paid, and then you get rewards that you can spend in the little shop at the gas station. So there's this whole incentive structure built in. And GasBuddy, it turns out, actually has the best data in the country on gas station status. Yeah. And I know from my friends in the National Security Council that it causes them much consternation to have to cite GasBuddy
Jed Sundwall:
Okay.
Jed Sundwall:
Wow!
Denice Ross:
when they’re reporting up to their superiors on the status of our fuel supply in a disaster impacted area, but GasBuddy actually is the best data set for that. So the question there is how might the federal government create some sort of agreement with GasBuddy so that those data can be reliably available to serve the public good when needed?
Jed Sundwall:
Yeah, interesting. Okay. Well, this is kind of going back to the whole "wouldn't it be cool if we had all this crime data," and I'm like, well, who's going to do that? So many of these just end up being collective action problems, right? Like that gas data. You can just imagine what an incredibly vast and complex data product that would be to create.
Denice Ross:
Right.
Jed Sundwall:
And also, it's the perfect sort of thing where a nerd would be like, well, why isn't there just an API that every gas station reports its prices into? Anyway, it's like that.
Denice Ross:
Right.
That would be nice, but we don't even have that for power outages. The Department of Energy has to scrape power outage data from the public websites of the electric service providers.
Jed Sundwall:
No, that’s it. Yeah. Yeah.
Jed Sundwall:
Yeah, I'm not surprised. And again, collective action problems. But it's a bummer, because I think people like us who work in this know that this is not a hard technological problem anymore. The tech required to do it isn't hard; it's the coordination that's hard. Okay, well then, what was my question? My other question. Yeah. So how would we make these things, especially things like this,
Denice Ross:
Right.
Denice Ross:
less vulnerable.
Jed Sundwall:
I want to be charitable. You've said you've worked in both the Obama and Biden administrations. I live in Seattle. I run a nonprofit. I think people can guess how we feel about things politically. But the truth is that, for better or worse, half the country seems to be pretty mad at the president no matter who's in office.
I'm not going to start talking about popular vote versus electoral college stuff. But regardless, we live in a country where people disagree with each other, and actually I think it's a great feature of America that we're very skeptical of our leaders. Right? So we're lucky to have decades of precedent behind us where there's a functional bureaucracy
that has produced data accurately and reliably for a long time. In the past year or so, though, we've started to see data getting taken down. Data really appears to be actively distorted in some ways. We've now crossed that threshold. Is there a way back from this, or do you have thoughts on how to protect federal data in the future?
Denice Ross:
Yeah, I think the most important thing that we can do comes back to the idea of not taking the data for granted: making visible and explicit the role that federal data play in our everyday lives. And there are probably three levels of intervention for that. We're starting with the people who use data, including the private sector entities that are using federal data,
and making it easier for them to mobilize, to share with federal data stewards and policymakers the ways that they use data, the way they depend on the federal data and why it’s really important for the economy, for example, that these data keep flowing. So my contention there is that anyone who’s a data user should also be a data advocate. And that is completely independent of who’s in office.
Jed Sundwall:
Yeah. Yeah. Okay.
Denice Ross:
And then the second audience for this is policymakers and the federal data stewards themselves, because they often aren't aware of the deep impact that these data sets have. So, for example, we've heard stories of federal data stewards who are able to collect
use cases about why their data collections matter to industries that this administration prioritizes. And that can have a real protective effect on the flow of data that can be used across a whole bunch of different domains. And then more broadly, it's just raising awareness with the general public about things like the "no campfires" sign
at a national park, and how that also comes from federal data, so that we stand behind the investment in these essential data resources.
Jed Sundwall:
Yeah, that's a great answer. I mean, again, the policy guy in me is nerding out a little bit, but a government's job is effectively to understand what's going on within its borders, for a bunch of reasons. It's a pretty easy story to tell. As you pointed out, the OPEN Government Data Act, the Evidence Act, this is bipartisan legislation.
This shouldn't be that hard. And I would say, and it maybe sounds a little bit cynical, but I'm okay with it: every administration cares about businesses and economic growth in the country, and data is vital to that. But this is always the tricky thing: I think there's an obvious, easy case to be made for a lot of data to be produced. Weather data is a good one, where the economy would grind to a halt without it.
Denice Ross:
Right.
Jed Sundwall:
Maybe not a halt, but it would be really bad if we didn't have weather data. But then there's this other universe of data for which there might not be great market signal, but which is just really important for governance, for public health or wellbeing or scientific research. I don't know, it doesn't seem like this should be that hard to advocate for. Anyway. Okay.
Denice Ross:
Yep. Well, in this interview you mentioned you're a policy person. I think I was in this field for 15 years before I realized I did data policy. And if you think about it, there's not really a pipeline of data policy wonks, right? We've got data users who just use the data and assume it will keep flowing. They often use the data as is, and they complain about its shortcomings. But they don't…
Jed Sundwall:
Yeah.
Jed Sundwall:
No!
Denice Ross:
go back to the data steward and say, hey, can you improve this? Because those feedback loops haven't been put in place. And so I think we have a real opportunity to build the field of data policy, so that anyone who's a data user, especially of public data, also has a little bit of policy understanding and recognizes that this is their data infrastructure to co-create as members of American society.
Jed Sundwall:
Yeah, no, that's beautiful. And actually, you're helping me realize what I was just trying to say, and I think we could be much more forceful about it: it's a core function of government to understand what's happening within its boundaries, and that's done with data, you know? So yes, there are dozens of us data policy nerds, but we should be more powerful. I think we can all agree. Yeah. Well, this has been awesome.
Denice Ross:
Hahaha.
Denice Ross:
So true.
Jed Sundwall:
I just checked in on the live stream. Apparently we weren't live streaming on LinkedIn; we'll have to look into what happened there. But that's okay, because this will still go out afterward. And no comments or questions from YouTube, so we're in the clear. We don't have to answer any hard questions, only softballs from me. Anything else you want to share about your work or what people should be thinking about before we go?
Denice Ross:
Hahaha.
Denice Ross:
Yeah, I would say: think about your favorite federal data set, the one that you might be taking for granted, the one you wish were a little bit better but you couldn't live without, and start practicing talking to people about why it matters, so that you build your skills, because it'll definitely be useful in the coming year. And if you come up with a good story about why these data matter,
let us know at essentialdata.us, because many of the use cases that are up there came from people who have deep expertise in a specific data set, and we were able to turn it into a one-sentence love letter for that data set.
Jed Sundwall:
All right. Yeah. We'll point people to essentialdata.us. Thanks for setting it up. I mean, thanks for everything you do. Thanks for coming on. This has been great. This conversation will continue, so we'll do it again sometime too. Thank you. All right. Okay.
Denice Ross:
Thank you, Jed.
Video also available on LinkedIn

Show notes
Jed talks with Matt Hanson from Element 84 about the SpatioTemporal Asset Catalog (STAC) specification and its role in making geospatial data findable and usable. Matt describes STAC as “a simple, developer-friendly way to describe geospatial data so that people can actually find it and use it.” The conversation covers how STAC emerged from a 2017 sprint in Boulder with 20 people and grew into a specification now adopted by NASA, USGS, and commercial satellite companies worldwide.
Matt promotes Howard Butler’s concept of “guerrilla standards” – a grassroots approach where stakeholders build something that serves everyone’s needs rather than making bespoke solutions. The central thesis: adoption is the only metric that matters. You can have the most elegant standard, but if nobody uses it, it’s not a success. STAC succeeded through community collaboration, simplicity of the core spec, an ecosystem of open source tooling, and timing—arriving just as cloud storage matured and satellite data exploded.
The conversation ranges into the limitations of remote sensing (“Remote sensing sucks,” Matt says, pointing to 20-30% error rates in land cover products), the future of purpose-built satellites, and why new data institutions are needed to validate emerging data products. Matt and Jed also discuss the credibility problem: launching a successful standard requires champions who have earned trust in the community. As Matt notes, “You have to earn credibility” – there’s no shortcut to building the relationships that make standards adoption possible.
Links and Resources
Takeaways
- Adoption is the only metric that matters — An elegant standard nobody uses isn’t a success. A “crappy” standard everyone adopts improves lives and enables interoperability.
- Guerrilla standards work through buy-in — When people are part of the process, their needs get addressed and they become champions who use the standard internally.
- Simplicity drives adoption — STAC focused on meeting 80% of needs with a simple core spec rather than trying to cover every possibility.
- Timing matters — STAC arrived when cloud storage matured, COGs gained traction, and satellite companies were launching rapidly. The previous methods weren’t working.
- Credibility can’t be skipped — Standards efforts need champions with established reputations. Chris Holmes’s involvement and relationships were essential to STAC’s early traction.
- Remote sensing has real limitations — 20-30% disagreement between land cover products is common. The value of remote sensing is in relative differences and time series, not absolute measurements.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
Welcome to Great Data Products. This is a live stream webinar podcast thing from Source Cooperative where we talk to data practitioners about their craft. We do this every month and you can visit us at greatdataproducts.com to see previous episodes and find links to subscribe on YouTube or wherever you get your podcasts. If you follow Source Cooperative on LinkedIn, we notify people about it there also. And then we also have a Luma calendar where
you can see the next episode. That's on greatdataproducts.com, but we actually have episodes scheduled out in January and February that you can see on Luma. I'll talk about that in a minute. But today we're joined by Matt Hanson from Element 84, a good old friend, I would say. And we're going to talk about the SpatioTemporal Asset Catalog specification. Matt, do you want to introduce yourself?
Matt Hanson:
Yeah, thanks Jed. Really happy to be here. Thanks for inviting me. I'm Matt Hanson. I work at Element 84, and I'll give a brief background: I've been working in the remote sensing field for, geez, close to 30 years now. I got into open source about 15 years ago; I went to FOSS4G and was instantly like, this is it. This is what I want to do.
I started contributing to GeoNode, which was the first open source project that I contributed to. And then I started working on other projects and eventually got into STAC and standards.
Jed Sundwall:
All right.
Jed Sundwall:
Nice. Well, I can say we've been lucky to have you in the community for a long time. And, yeah, I mean, we've got a lot to talk about. You've done a lot, you've accomplished a lot. And I would say your involvement in STAC has really secured your legacy. I mean, among others; it's a community effort, which is partially what we're going to talk about here. So, you recently, boy, let's actually back way up.
How do you describe STAC to people? With the caveat that this podcast is not necessarily a geospatial podcast; we do want to reach more people who don't necessarily have expertise in geospatial. So how do you describe STAC at a very high level?
Matt Hanson:
Yeah, so I describe STAC as a family of specifications as well as an open source ecosystem. And that's maybe not a really layman's way to describe it. So let's say that it's a simple, developer-friendly way to describe geospatial data so that people can actually find it and use it. That's the quick one-sentence version.
Jed Sundwall:
Okay. And okay, so I'm going to play layperson, and I actually don't even have to pretend that much; I'm actually this naive in a lot of ways. I've heard Mark Korver, another esteemed colleague in this world, describe STAC as solving the problem of listing objects in S3.
Of course, I can't help but be very nerdy here, but part of the problem that we're facing, and it's not just in the geospatial community, it's in many other domains, is that we're dealing with so much data that even just listing the files that you have is expensive. Like, it takes time. And so you can imagine having a corpus of millions and millions of satellite images, and you have to go through that haystack to find stuff. One way to characterize STAC is that it makes it
Matt Hanson:
Mm-hmm.
Jed Sundwall:
Basically easier to index all that stuff to find what you want. Is that fair to say?
Matt Hanson:
Yeah, I think that's definitely fair to say. The tying it to S3 is not necessarily required, right? Like, STAC could describe data files wherever; it doesn't have to be in object storage. But no, I think that's a good way to talk about it. When I give a STAC presentation for new folks, like a STAC 101, I often will talk about
exactly this issue of the explosion of geospatial data. There's been so much data, and if you look at just NASA's holdings and their projected holdings over the next five years, we see so much data. If you don't index the data, well, I had this saying: if your data is not indexed, it might as well not exist. Because if nobody can find the data, and, as you say, you're just getting a listing of all the files, how can you actually find
the data that you want if there's a billion files in object storage, let's say? And that's not far-fetched; that number is not all that far-fetched. Yeah. As I was saying, if we look at Sentinel-2, right? If you look at the entire Sentinel-2 archive, there are 25 million scenes, and for each scene there are 20 files. So it starts adding up really quickly.
Jed Sundwall:
No, yeah, not at all. Go ahead.
Jed Sundwall:
Okay. And then when you say easy, easy for whom? Like, you know, STAC stores its data in JSON. So who's the typical user of STAC? What kind of software do they use? What kind of job title do they usually have? Yeah.
Matt Hanson:
Yeah, geez, that's a good question. I think that the ultimate data user is probably a data scientist. And I think that's who the original target was. When we first started looking at this, we were primarily looking at public data sets, because that's what is available. And that's what we were looking to index: NAIP and Landsat and Sentinel-2. And it was really a
data science user problem. And that was my background. That was where I come from: working with scientists and working with different types of data and having to use different formats and different tooling just in order to find and access the data. And so I think that really was the primary user. We talk about it being developer-friendly because of the open source ecosystem,
and that's really developers working in tandem with data scientists in order to leverage and use the data.
Jed Sundwall:
Great. Yeah. I mean, I'm leading you here a little bit, the point being that, you know, I've worked in the open data space for my entire career, basically, at this point. And so many conversations have revolved around making data easy for anyone, or something like that. And I argue that that hasn't worked out super well. You actually need to find who the actual practitioners are that are going to use the data, and what they will be comfortable with,
or what will actually help them, rather than having a kind of nebulous everyone thing. Yeah.
Matt Hanson:
Yeah, yeah, it's clearly not everyone. I mean, we have had, like, journalists; people have reached out to us from, like, the New York Times, and they're creating stories and they want to access geospatial data. And so they've used some of the tooling around that. That's as close to a layperson, I think, that we've really worked with:
journalists who want to tell a story and they just want to find data. They just want data from five years ago and today, to look at a change over time and use it to write a story. And they were able to use the tooling, like pystac-client, and even before that there was sat-search, which was an earlier tool set, and they were able to figure that out. But they were still leveraging developers to
do that.
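The change-over-time workflow Matt describes, finding an item from five years ago and one from today over the same spot, boils down to a spatiotemporal filter. Here is a toy, stdlib-only illustration of the kind of query a STAC API answers on a user's behalf; the item dicts, IDs, and bounding boxes are invented for the example, and in practice a client such as pystac-client would send this as a search request to a real API:

```python
# A tiny in-memory "catalog": each item mimics a STAC item, with a
# bbox [west, south, east, north] and an acquisition datetime.
items = [
    {"id": "scene-2019", "bbox": [-122.5, 47.5, -122.0, 48.0],
     "properties": {"datetime": "2019-06-01T19:00:00Z"}},
    {"id": "scene-2024", "bbox": [-122.5, 47.5, -122.0, 48.0],
     "properties": {"datetime": "2024-06-01T19:00:00Z"}},
    {"id": "elsewhere", "bbox": [10.0, 50.0, 11.0, 51.0],
     "properties": {"datetime": "2024-06-01T10:00:00Z"}},
]

def bbox_intersects(a, b):
    """True if two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(items, bbox, start, end):
    """Mimic a STAC API search: spatial filter plus a temporal window.

    ISO 8601 strings in a uniform format compare correctly as strings,
    so no datetime parsing is needed for this sketch.
    """
    return [
        item for item in items
        if bbox_intersects(item["bbox"], bbox)
        and start <= item["properties"]["datetime"] <= end
    ]

# "Five years ago and today" over one neighborhood-sized box:
hits = search(items, bbox=[-122.4, 47.6, -122.2, 47.8],
              start="2019-01-01T00:00:00Z", end="2024-12-31T23:59:59Z")
```

Both same-location scenes match while the unrelated footprint is filtered out, which is exactly the narrowing-the-haystack job discussed above.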
Jed Sundwall:
Right. Well, but I think then there's another clue here, and we'll go on with journalists: you have an audience that typically has not been able to engage with imagery or geospatial data. But they are, and we've watched this happen throughout our lives, becoming more savvy and more aware of the need to be able to use software and data to tell stories and things like that. But they're coming to us from
a completely different place than I think most geospatial data practitioners were in previously. And so the key there, I mean, you mentioned pystac, you know, for whatever reason a lot of journalists use Python; there are different communities that use different tools. Yeah.
Matt Hanson:
Yeah, right. Sure, yeah. Language of data science, yeah.
Jed Sundwall:
Yeah. Okay. We actually already have a question on YouTube from someone who I'm just going to call Sig. I'm not sure if that's his or her name; I can't tell. But they're asking: STAC is built around sharing data easily with anyone. Let's say you want to use it to share more secret data, with access control, SSO, encryption, et cetera, and different users that have different access to different data sets. I have some thoughts on this, but as you mentioned, STAC doesn't have to be explicitly tied to
a cloud object store or a public bucket. Do you want to take that? I imagine you have some actual examples here. Yeah.
Matt Hanson:
Yeah, so this question comes up a lot, right? So I will get a little bit more technical here. We have an API called Earth Search that indexes public data sets on AWS, and that's an implementation of the STAC API. And that implementation has no authentication in it, because we were using it originally to index public data. And so
we didn't have need for controlling access; all the data was public, and so we hadn't added that. And so we get that question a lot. And stac-fastapi is another implementation that didn't have core built-in authentication at the time it was first created. So there's a couple of ways to do this. I'll jump to the end first, which is that there's a more
modern solution for this called STAC Auth Proxy, which DevSeed has created. That can be used to control access to individual items and collections based on attributes in the data. So that works pretty well. But what we've generally done is use a proxy. So you have your catalog, and that's open. Or it's behind a firewall, but it's available to anyone who can access it.
Jed Sundwall:
Okay.
Jed Sundwall:
Interesting.
Matt Hanson:
And then we have a proxy in front of that that handles the authentication, queries the catalog, knows what people can see, and then returns that result. So it's going through the proxy. But these tend to all be one-off solutions. So I think STAC Auth Proxy, if you haven't seen it, is definitely something to look at, and you can combine it
with stac-fastapi, or potentially any STAC API implementation.
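The proxy pattern Matt describes, an open or firewalled catalog with an auth layer in front that filters what each user may see, can be sketched in a few lines. Everything here is invented for illustration: the user-to-collection permission table, the hard-coded catalog response, and the function names are all hypothetical, and the real STAC Auth Proxy works differently (it filters on attributes in the data). The sketch only shows the shape of the idea:

```python
# Hypothetical access rules: which collections each user may see.
PERMISSIONS = {
    "alice": {"sentinel-2-l2a", "landsat-c2-l2"},
    "bob": {"sentinel-2-l2a"},
}

def catalog_search(**params):
    """Stand-in for querying the upstream, unprotected STAC catalog."""
    return [
        {"id": "S2A_123", "collection": "sentinel-2-l2a"},
        {"id": "LC09_456", "collection": "landsat-c2-l2"},
    ]

def proxied_search(user, **params):
    """The proxy layer: authenticate, query the catalog, filter the result.

    The catalog itself stays simple and unaware of users; access control
    lives entirely in front of it.
    """
    allowed = PERMISSIONS.get(user, set())
    return [item for item in catalog_search(**params)
            if item["collection"] in allowed]
```

With this arrangement, `proxied_search("bob")` returns only the Sentinel-2 item, and an unknown user gets nothing, while the upstream catalog never changes.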
Jed Sundwall:
Okay. So yeah, one thing I'll underscore here also is that STAC is a metadata spec. It doesn't itself say anything about authentication or anything like that. It's been built to be very flexible, useful in all sorts of environments, and extensible. I want to just stay in the weeds of STAC a little bit longer. So the specification
Matt Hanson:
That’s right.
Jed Sundwall:
is made up of other specifications. So you have the idea of, and I'm going to go in order, a catalog, a collection, and an item. Can you walk through each of those and what they encompass?
Matt Hanson:
Yeah, sure thing. Well, we start at the top: that's a catalog. A catalog is really just a container. It's a JSON file. It contains really simple fields: you have a name, you've got a title, you have a description. And then, most importantly, all of these entities within STAC have links. And links are probably the most important part of STAC, right? Because
Jed Sundwall:
Yeah. Okay.
Matt Hanson:
when we got into this at the beginning, the ability to crawl a catalog was really important, because that's the way the internet works, right? By crawling things. And so we wanted to be able to link a whole catalog together, and link down to items and link back up, so that you could visit any part of the data in this catalog and crawl it in both directions.
So the catalog is the starting point. In an API especially, the catalog is your landing page, and it's going to contain links to the collections underneath it. And each collection really looks a lot like a catalog. At one point a collection even was a catalog; it was derived from a catalog.
Technically that's not the case anymore; it's its own entity, but it looks a lot like a catalog. Collections are ways to group together items and data that are similar to each other. The most obvious case is when we look at the big public data sets, Sentinel-2 or Landsat: Sentinel-2 Level-2 data, that is a collection.
Right, it contains a bunch of items, and that's your next level down: an item. And an item is where we move from JSON to GeoJSON, because an item actually represents a specific location and a specific time or range of times. And that's really where your data is. You can think of it as a scene; you can think of it as a footprint containing data. The data is contained
in what are called assets. So that's really the fourth entity type, except assets are embedded directly in the GeoJSON of items. So you have the catalog, collections, and then items, and that's the general hierarchy. And we have links that allow you to go all the way down from catalogs to items. Now, there are some nuances between
Jed Sundwall:
Right. Okay.
Matt Hanson:
what we call a static catalog, which is really just a bunch of linked JSON files on disk or as blobs in an object store, and a dynamic catalog, or what we call an API. That's an important distinction, because you can have, for instance, sub-catalogs within a static catalog.
That might be a little confusing or not, but it's a way to partition the data, basically. You can use sub-catalogs to organize it. So you might have a collection, and then underneath that we'll have a catalog for each continent; then you go into the continent, and that's where your items are. It's just a way to partition and organize the data. In an API, and this question comes up a lot, which is why I have the whole
narrative around it here, you don't need those sub-catalogs, because you don't need to partition the data: you can search for the data by what continent it's in, or what path/row it is if it's gridded data, or you can essentially partition on the fly any way you want. So that's the important distinction between static catalogs and an API. We get the question a lot:
people have static catalogs and they ask, how can I search this? And you can't really search it; you have to index it first. There's that missing piece. But originally Chris Holmes really wanted us to focus on being able to have static catalogs, because not everybody wants to stand up a server and incur the cost of that. They just want to make data available and share it with people.
And so the easiest way to do that is just to have the metadata on disk, all linked together so you can crawl it and index it if you wanted to do that.
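The catalog, collection, and item hierarchy Matt walks through can be made concrete with a minimal static-catalog sketch. The IDs, hrefs, footprint, and asset below are invented for illustration; only the shape, three linked entity types with assets embedded in the item's GeoJSON, follows the pattern he describes:

```python
# Top level: a catalog is just a container with links to its children.
catalog = {
    "type": "Catalog", "stac_version": "1.0.0", "id": "demo-catalog",
    "description": "A tiny demo catalog.",
    "links": [{"rel": "child", "href": "./sentinel-2-l2a/collection.json"}],
}

# Middle level: a collection groups similar items (e.g. one sensor/product).
collection = {
    "type": "Collection", "stac_version": "1.0.0", "id": "sentinel-2-l2a",
    "description": "Items that are similar to each other.",
    "license": "proprietary",
    "extent": {"spatial": {"bbox": [[-180, -90, 180, 90]]},
               "temporal": {"interval": [["2015-06-27T00:00:00Z", None]]}},
    "links": [{"rel": "root", "href": "../catalog.json"},
              {"rel": "item", "href": "./items/scene-001.json"}],
}

# Bottom level: an item is a GeoJSON Feature, a specific footprint at a
# specific time, with the actual data files embedded as assets.
item = {
    "type": "Feature", "stac_version": "1.0.0", "id": "scene-001",
    "collection": "sentinel-2-l2a",
    "geometry": {"type": "Polygon",
                 "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]},
    "bbox": [0, 0, 1, 1],
    "properties": {"datetime": "2024-06-01T10:00:00Z"},
    "assets": {
        "visual": {"href": "s3://example-bucket/scene-001/visual.tif",
                   "type": "image/tiff; application=geotiff; "
                           "profile=cloud-optimized"},
    },
    "links": [{"rel": "collection", "href": "../collection.json"}],
}
```

Written to disk as three JSON files, the relative `links` are what let a crawler walk down from the catalog to the item and back up, which is the "static catalog" case; an API serves the same entities dynamically.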
Jed Sundwall:
That's right. Yeah, I can speak to this. I mean, this was a long time ago now, when all this stuff happened. And it's relevant, actually, to another comment or question from Sig on YouTube asking: so did the chicken or the egg come first, i.e., STAC, or S3 and the cloud-optimized formats? I assume STAC wouldn't exist with only old files on disk. So there's a lot to respond to there. First I'll say:
this is a fundamental issue about the distinction between file storage and object storage that is just not obvious to most people, because they never have to think about it. If you're using a file system, like if you're using a normal computer with a GUI and stuff like that, you probably are interacting with the file system. Your computer needs to have an understanding of what the files on your hard drive are, and it has an index of them.
It also has an index of how the directories are nested and things like that. And you can search your computer for files and stuff like that. Otherwise, a lot of applications would be a huge pain to use if you didn't have that index. Object storage like S3 has nothing like that. Object storage is just: you have a file, you give it a key name, and you put it in a cloud. And it's there. If you know that key name, you can get it back out. And so,
going back to the discussion before about too much data: you can imagine a scenario where you have so many objects, so many files you're dealing with, that even the index of them would be too large for your laptop. Just listing the names of the files would be too large for a lot of people's local storage. This is not a crazy idea, let alone metadata about all those sorts of things. And so
Matt Hanson:
Hmm.
Jed Sundwall:
STAC and a lot of the cloud-optimized approaches are an attempt at standardizing, or finding patterns whereby we can break up all of this content in ways that are manageable. That has to do with things like the STAC catalog, as you described, Matt, with all these JSON files pointing the way, and also things like naming conventions, which all add up to make that stuff work. The only other thing I'll say is that when we brought
Landsat onto AWS, the metadata that USGS would provide in its tarballs with the imagery was just this weird text file that was space-delimited or something like that. Do you remember these? Yeah, the MTL files, right? And I was just like, you know, I think it'd be better if this were at least JSON. So what we did is we created a process that happened at the end of every image that we
Matt Hanson:
MTL. Yeah. Yeah.
Jed Sundwall:
brought in and turned into a cog, as soon as it all landed in the bucket, we would run a Lambda function to take that MTL file and turn it into a JSON version of it. And I think that was kind of the kernel of like the sort of the first notion of doing something like this, where it’s like, you should be able to get to an image and you should have a reliable little machine readable, you know, or like easily parsable bit of metadata that you can find right by it.
Matt Hanson:
Yeah.
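The Lambda step Jed describes, turning Landsat's MTL text metadata into JSON, is mostly line parsing. Below is a rough stdlib-only sketch; the MTL snippet is abbreviated and its values are illustrative, and real MTL files carry many more fields, but the KEY = VALUE lines nested inside GROUP blocks are the actual shape of the format:

```python
import json

# Abbreviated example of the space-delimited MTL format.
MTL_SAMPLE = """\
GROUP = PRODUCT_METADATA
  SPACECRAFT_ID = "LANDSAT_8"
  WRS_PATH = 47
  WRS_ROW = 27
END_GROUP = PRODUCT_METADATA
"""

def mtl_to_dict(text):
    """Parse KEY = VALUE lines (with GROUP nesting) into a plain dict."""
    root, group_stack = {}, []
    current = root
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "END":
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip('"')
        if key == "GROUP":
            # Descend into a nested group.
            group_stack.append(current)
            current = current.setdefault(value, {})
        elif key == "END_GROUP":
            # Pop back out to the enclosing scope.
            current = group_stack.pop()
        else:
            current[key] = value
    return root

print(json.dumps(mtl_to_dict(MTL_SAMPLE), indent=2))
```

The resulting dict serializes straight to JSON, which is essentially what made the metadata "findable right by the image" once it sat next to each COG in the bucket.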
Jed Sundwall:
And then I guess, just to close this off, also with the understanding that, yeah, there are a lot of people that are never going to run their own API. They can't stand up a service, and there are a lot of data products out there that do just need to land somewhere. And if somebody else wants to index them, they can. And I think the static STAC catalogs make that easier, I would say.
Matt Hanson:
Yeah, yeah, yeah, exactly. Yeah.
Jed Sundwall:
Okay, so now let's talk about the blog posts that you wrote, the sort of history of STAC. Give us the high-level overview. I mean, we've included the link as we've promoted this, and I'll have to put it back in the chat, but it's really good. But summarize it quick. It's a comprehensive post. Tell the story again.
Matt Hanson:
Okay.
Matt Hanson:
Okay. So yeah, it is a bit lengthy. So I did these two blog posts. The first one I wrote a couple of years ago, and I always meant to write a part two, and two years passed. And then I'm like, you know what? I've long been wanting to do it; I had drafts in various conditions. So finally I'm like, this is the time, you know?
Jed Sundwall:
Yeah, you did it.
Matt Hanson:
Just as we were publishing it, STAC was accepted as an OGC community standard, so it seemed like a good time to actually publish it. So the most recent post is called Why STAC Was Successful, and it really looks at, like, how on earth did this effort that started back in 2017,
with 20 people in a small room at the Marriott in Boulder, how did this turn into something that is now being adopted by commercial companies that are launching satellites, as well as space agencies? NASA and USGS, for the Landsat program, were definitely early adopters, and that helped a lot. So I talk about this idea of guerrilla standards.
I gave a hat tip to Howard Butler on that, because I love the term guerrilla standards; it really encapsulates what this process is and how it's different than traditional standards work. And so that's a big part of it, and we could talk more about the guerrilla standards. But it's this grassroots approach where you get people that are interested, you get stakeholders that are interested in
doing something better and working within a community rather than making a bespoke thing on their own. And you build something that will serve everybody's needs. And this is critical, because, and I'll skip to the end a little bit again here,
the conclusion of this is that when we talk about standards, there's really only one metric. Well, as a bit of a joke I say there are three metrics that matter: adoption, adoption, and adoption. And that's true. You can have the most elegant standard that could exist. You could spend lots of time and make sure it covers every possibility, and it's very elegant and very nice.
Matt Hanson:
But if it doesn't get used, that's not a success story at all. You can have something that's maybe a little crappy, and if everybody uses it, it's hard to argue that the crappiness was a bad thing. If everybody's using it, it's improving everybody's lives and it's making interoperability easier. And so the central thesis of
Jed Sundwall:
Yeah. Yeah.
Matt Hanson:
the post was that adoption is the only thing that matters. And then I examine, like, how did we drive that adoption? That's the question, right? It was successful because it's been adopted pretty widely, so what was it that we did that drove that adoption? And part of that is the guerrilla standards approach:
getting stakeholders and getting champions and getting people excited about it and having buy-in. That's an important piece of this: when people are part of a process, they're more likely to use it. Their concerns and their needs are being listened to, and they're more likely to go back and champion it and use it internally for their own projects as well.
Jed Sundwall:
Yeah.
Matt Hanson:
And then another aspect is the simplicity of it, the core spec. This wasn't about trying to make a standard for everybody and everything. This was about creating a spec that was going to meet 80% of the needs, and really focusing on what those needs were. How do we find data? How do we have consistent metadata across
different providers? How do we have something really simple, and how do we encourage people to use it? We encourage people to use it by creating an ecosystem of tooling so that there's a low barrier to entry. The ecosystem is part of the guerrilla standards approach: you need to start building implementations. And at that first sprint back in Boulder, at the end of the day, thanks to Rob Emanuele and Seth Fitzsimmons,
Jed Sundwall:
Yeah.
Matt Hanson:
we had a server working at the end of one day that was serving up NAIP data. I don't think we went back to it; it doesn't really resemble much of what STAC looks like today. But that wasn't the point. The point was that we got some ideas together, we stood it up, and it worked, and then we could continue to iterate on it. So let's see what other
aspects of the post I feel like I should call out. The community collaboration is critical, like having in-person sprints that are open for anybody to join. That is key as well. And I would be remiss if I didn't mention the timing. The timing, I think, was just serendipity perhaps.
But the timing of STAC was critical to its success. We were at a point where the public clouds were maturing, and we were starting to see more geospatial data on them. You were just talking about your effort on bringing Landsat to AWS. COGs were really starting to gain traction. There was an explosion of
private companies launching satellites. So there was just a real need there; the previous methods weren't working, and no one else was really solving that. And so it just filled the missing layer at exactly the right time.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Yeah, no, it's great. I mean, it's
such a fascinating example of really what we're actually trying to do with this live stream webinar podcast thing, which is: we know some things have worked, and we need to understand why they worked, what made the difference. It's very easy to look back at failed attempts at foisting standards on the world, so many standards that have not been adopted at all,
despite all the good intentions and the need and things like that. And so it feels mysterious why STAC was successful, but I think your post and everything you just said make it not a mystery. I think we can look back at the things that made it successful. And it's actually kind of interesting timing: we got another comment on YouTube from, I don't know, the username is bent quarter.
So, bent quarter.
Who knows? But they're asking: is there a GUI for building a STAC? Which is a super interesting question, because everything you're talking about, you know, we got all these people together and it was easy for them, and we had a server running by the end of the day. The people that we're talking about are data practitioners. It's a pretty esoteric cool kids club; these sprints, they're not huge. It's a small group of people who really have practical experience and needs.
Jed Sundwall:
They understand each other, which I'd say allowed you to gain traction really, really quickly. But yeah, we are at the point, I think, where this question, is there a GUI for creating a STAC, becomes interesting. It certainly wasn't the priority, but where are we now?
Matt Hanson:
Yeah, it is an interesting question. And the answer is no. There are interfaces for browsing catalogs: there's STAC Browser; we have a user interface that we stand up for Earth Search, called FilmDrop UI, that is an interface for the STAC API; Microsoft Planetary Computer has a user interface; and there are others out there as well. But these are all kind of focused on
being able to search and browse existing APIs, not actually creating your own. And I think that's just because those are different user bases. The people building the STAC metadata are generally developers, and you have a bunch of data and you generally want to programmatically create the
metadata from it: extracting the footprint, or pulling metadata fields that are important from the original metadata or from the headers of the data file. So that really is done in a programmatic way. I think someone might have created a user interface for creating collections.
It would just be a form where you can go in and fill things out. But it's not a bad idea either, having some sort of user interface to make this easier. I think it would have to be combined with some back end where maybe you're dragging and dropping a series of files, and then it's going to try to fill stuff in, but gives the user an option to add additional details,
and then extend that to ingesting a bunch of other scenes. Maybe there's something there that actually could be useful and make it easier for users to make their own. There's some CLI tooling for creating STAC, like rio-stac, which can be used to create a bare-bones STAC item from COGs. But no one's
Jed Sundwall:
Yeah.
Matt Hanson:
really come up with a GUI for building a STAC.
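The programmatic path Matt describes, deriving the footprint and key fields from each file rather than filling out forms, looks roughly like this. The scene ID, corner coordinates, and timestamp are invented for the sketch; in practice a tool like rio-stac reads the footprint and other fields from the raster's own header:

```python
def item_from_scene(scene_id, corners, acquired):
    """Build a bare-bones STAC-style item from a scene's corner coordinates.

    corners: list of (lon, lat) pairs tracing the footprint, in order.
    acquired: ISO 8601 acquisition timestamp.
    """
    lons = [c[0] for c in corners]
    lats = [c[1] for c in corners]
    # GeoJSON polygons must be closed: repeat the first vertex at the end.
    ring = [list(c) for c in corners] + [list(corners[0])]
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": scene_id,
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "bbox": [min(lons), min(lats), max(lons), max(lats)],
        "properties": {"datetime": acquired},
        "assets": {},
        "links": [],
    }

# Hypothetical drone scene over a few city blocks:
item = item_from_scene(
    "demo-scene-001",
    corners=[(-122.42, 47.65), (-122.35, 47.65),
             (-122.35, 47.71), (-122.42, 47.71)],
    acquired="2025-01-15T19:30:00Z",
)
```

Run once per file in an ingest pipeline, a function like this is what replaces the form-filling a GUI would otherwise require.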
Jed Sundwall:
Yeah, that's an interesting question. But it also gets at, I think, a huge challenge, a challenge that a lot of government executives need to understand, a lot of people working in policy, people working on workforce development, people educating future leaders and data scientists: the volume of data that we're working with is so large that
the notion of creating tools that are really designed for humans to click and drag and point at things and track with your eyes, that's not how it's going to be done. Yeah.
Matt Hanson:
Right, right. It has to be programmatic. And that's why I said a GUI that allows you to set that up, right? Set up the programmatic creation of it. That might be useful. But you're right: you're not going to manually create a STAC item for every scene, for every image.
Jed Sundwall:
Yeah.
Jed Sundwall:
No. Yeah. And this is not to dismiss the idea: should there be a GUI? It still remains an interesting question. But I think it reveals the fact that STAC emerged because we suddenly found ourselves dealing with so much data that it required a purely programmatic approach at first.
Matt Hanson:
Yeah. And those were the first use cases too, right? It was these big archives. It was Landsat, it was Sentinel, it was NAIP. It wasn't small amounts of commercial imagery, because we didn't have access to those; that cost money. So the primary use case was: how can we make it easier for users to access public data sets?
Jed Sundwall:
Yeah. Yeah.
Jed Sundwall:
Right. I'm imagining now an entirely local use case. As I mentioned to Matt before we started streaming, there's a mudslide in my neighborhood in Ballard. I don't know any details about it, and I hope no one's hurt or anything like that, but literally right now there's a mudslide in my neighborhood. You could imagine somebody going out there with a laptop and a drone, flying some imagery, producing a relatively small product, and wanting to package that up in a nice tidy STAC catalog that they can then get out somehow. I could see that as being a very lay-person, not-touching-the-cloud, Dropbox-scale kind of thing that you could do. And maybe the use case is emergency response for something like this.
Matt Hanson:
Yeah, for sure. And some people, I think, have created STAC catalogs for small data sets like that. But then that raises the next question, which is: how do people find the catalogs?
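Once a catalog is registered with a STAC API, "finding" items becomes a search request. As a sketch of what that looks like: a STAC API search is typically a POST of a small JSON body to the service's `/search` endpoint. The collection name and bounding box below are illustrative values (the bbox roughly covers the Ballard area Jed mentions); this code only builds the request body, it does not contact any service.

```python
import json

# Body for a hypothetical POST to a STAC API /search endpoint.
search_body = {
    "collections": ["sentinel-2-l2a"],          # example collection id
    "bbox": [-122.46, 47.64, -122.36, 47.72],   # west, south, east, north
    "datetime": "2024-01-01T00:00:00Z/2024-12-31T23:59:59Z",
    "limit": 10,                                # items per page
}
payload = json.dumps(search_body)
```

With an HTTP client you would POST `payload` to a STAC API endpoint (Earth Search, mentioned earlier, is one such API) and get back a GeoJSON FeatureCollection of matching items.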
Jed Sundwall:
Well, I want Source Cooperative to be a place where people find these things. So: brought to you by Source Cooperative. This is our podcast, so I get to do stuff like that. Thank you. Actually, that is a prompt to do what I said I was going to do. We're going to do some housekeeping really quickly, because we know that some people have joined midstream.
Matt Hanson:
Right, so there's, yeah.
Yeah, you got that. That was a lead-in for you to plug it.
Jed Sundwall:
So this is Great Data Products. It is a live stream webinar podcast thing brought to you by Source Cooperative, which is a data publishing utility that we manage. You can go to source.coop to learn about it. This is the time where we talk to data practitioners about their craft, and this month we're talking to Matt Hanson about the SpatioTemporal Asset Catalog, or STAC, metadata specification, which has been wildly successful.

And then, to do a little bit more self-promotion: there's Great Data Products, the live stream webinar podcast thing, and we also published a blog post a little while ago called "Great Data Products" that has, I think, done pretty well. You can go to Radiant Earth at radiant.earth/great and read that. But I'm going to share something.
Let's see if I can do this. Can I share my screen? Yeah, I'm going to share a window in response to, again, the question about GUIs. This is a drum I've been beating for a really long time, a graph I've been talking about for many years, but it's been enshrined in this blog post. I'm going to expound on this in a future post, but it's useful for thinking about how you maximize the usability of data, and why a programmatically accessible approach is so important. If you have raw data off of a sensor, it is not going to be that useful to that many people. There's an inherent cost required to extract any sort of value from it, and satellite imagery is notoriously difficult here,
Jed Sundwall:
which we can talk about, all the reasons why that is. But what often gets funded is: I want a thing that's going to track mudslide risk in the Pacific Northwest, for example, right? So you can spend a lot of money sorting through the data, processing it, creating an interface, doing user testing, to create a tool that helps you understand mudslide risk in the Pacific Northwest. You've gone over this huge arc where you spend a ton of money, but then the potential value of the data is diminished again, right? This is always my warning against focusing on GUIs or dashboards and stuff like that: by creating an interface like this, you're making a ton of decisions about what the value of the data is. Instead, what we should be trying to do is maximize the queryability of the data. I call this the sweet-spot graph: we have to find this place where we're taking out a lot of the annoying, undifferentiated heavy lifting required to get the data in a way that's queryable, without over-determining it. Anyway, I'm preaching to the choir with you, Matt.
Matt Hanson:
Yeah, you know what a great example of that is, too? Landsat. Let's take a look at Landsat. There are two processing streams that Landsat does. They have an ARD product, which is in one projection; it's actually an Albers projection. There are five different Albers projections, maybe seven, depending on the continent and the place on the Earth.
Jed Sundwall:
Yeah.
Jed Sundwall:
What’s your favorite? Sorry, I’m just kidding. Yeah.
Matt Hanson:
Favorite continent? Favorite Albers projection? I don't know.
Jed Sundwall:
I'm sorry, just go on. I'm trolling you. Sorry.
Matt Hanson:
So there's the ARD stream, and that's distributed as these ARD tiles. And then there's the regular stream of data, which delivers UTM tiles. So the question is: why these two different things, right? And the reason is that people like, and are used to, UTM, because it makes a nice pretty picture, but it introduces more errors than the Albers projection does. The Albers projection minimizes the distortion errors from the original raw data. So I have this thing that I like to say, which is: as soon as you pick a projection, it's the wrong one. And this is in the graph, because if you want to maximize value, then you should try to avoid making assumptions about how people are going to use that data.
Jed Sundwall:
Right. Yeah. Yeah.
Matt Hanson:
Projection is a perfect example. Rather than picking a projection that you think is going to be useful for everybody, just pick the one that's going to minimize the potential errors, because you know that people are going to reproject it. MODIS does this great: MODIS uses a sinusoidal projection, which is the best projection for minimizing distortions due to the orbit of the craft. Everybody hates it, because it doesn't make for very pretty pictures if you open it up and look at it directly in QGIS. It looks all wonky, but it really is the best choice for that case.
Jed Sundwall:
Interesting.
Jed Sundwall:
Fascinating. Oh, wow. Okay. You know your stuff. No, it's great though. We have a pretty interesting question on this note of what the right way to present data is, from the great Maxime Lenormand, who I'll just embarrass a little bit more: there's no way we'd even be having this podcast if it wasn't for
Matt Hanson:
I know a couple things and I just keep on reusing the same stuff.
Jed Sundwall:
Minds Behind Maps and the approach that he took with that. So he asked: does this still hold in a world where it's so much easier to make custom dashboards, GUIs, and front ends with AI? I have a response to that, but I'm curious to hear what you think, especially since Element 84 does so much great work producing really interesting tools. What are your thoughts on this?
Matt Hanson:
Well, I like GUIs. They're pretty, you know, but they're also pretty impractical, aren't they? If we look at the data, 99.99% of the data out there, no one's ever going to look at.
And so I do think we spend an inordinate amount of time focusing on visualizing remote sensing data, when that's actually not really a great use case outside of demos and pretty pictures; maybe journalists like that, if you're telling a story. So, you know, it's great that it's easy to make custom dashboards, and I've been working on some UI stuff recently and it's fun. But I think from a practical standpoint, we need to be focusing more on unlocking the value in the data with programmatic back ends.

I don't know if that really answers the question.
I don’t know if that really answers the question now.
Jed Sundwall:
Well, yeah, I think I agree with you. I think GUIs are maybe the wrong thing to be thinking about. I use this example all the time; I may have already mentioned it on this podcast and I probably will again: so many attempts at making Earth observation data useful for agriculture, especially in low- and middle-income countries, are like, "It's great, we're going to give the farmers an app and then they'll know what to do." And I'm like, no one's going to use your app. The farmer's not going to install your app. They're not going to open it. It's not going to become a part of their life. It's possible; there are sticky technologies that do become part of people's lives. But it is so expensive to make that happen, and it's so rare for it to actually happen. My hypothetical Earth observation application for the farmer in a poor country is: suddenly
Jed Sundwall:
they can get insurance for some reason. They don't know why, but there's a flyer for them to get insurance, or a salesperson comes and visits them and says, hey, we can actually sell you affordable insurance now. The basis of that insurance product is Earth observation data that allows the insurance product to exist. It is a product of the data, but the farmer doesn't have to know anything about that. The value gets built into the price of the insurance, and that's how the value is delivered. Is there a GUI or some sort of UI to the data between the receipt of the data and the creation of that insurance product? Maybe, maybe not. But increasingly, and this partially answers Max's question about the age of AI, and Matt, I know you've said stuff about this before, it's just going to be a model doing all the analysis.
Matt Hanson:
Mm-hmm.
Jed Sundwall:
And what's derived out of it is going to be some sort of index or figure that gets put into a spreadsheet or database, or informs some other process. Yeah. Which, by the way: you can preview CSVs on Source Cooperative now, which is amazing. Go.
Matt Hanson:
That’s right. It’s tabular data. The future is tabular data. Yeah.
Matt Hanson:
Nice, that's great. So, all right, this is a bit of a tangent, but I feel like it's maybe a good time to say this. I used to give a presentation where I talk about this a little bit. I don't know, this is going to seem like a tangent, but…
Jed Sundwall:
Go for it. That’s why we’re here.
Matt Hanson:
You talked about the farmer getting the app, and there's another reason why that doesn't really work. It's because remote sensing sucks. All right? RSS: remote sensing sucks. And what I mean by that is, I've been in this space for a while, and if you look at old research papers and new research papers, and take a look at land cover products, for instance,
you can get land cover products from different producers, for the same year, using the same data, and they might have 20 or 30 percent disagreement with each other. Because there's a lot of stuff that goes into how the image is formed. The entire radiative transfer equation for how that light propagates and becomes the image means a lot of variability. And when we talk about level-two data, we have atmospheric correction, which also introduces a tremendous amount of variability. So I have this issue with the ag community, and I think lots of other industries have done this as well, where they've over-promised and under-delivered on what remote sensing can do. You know, 20 or 30 percent errors are not uncommon. But if you go to an engineer doing space exploration, or any other engineering discipline, and say, oh, 30 percent errors are normal, they're going to laugh at you, right? We didn't send people to the moon with 30 percent errors. You're going to miss the moon. So I think there's an aspect here of
Jed Sundwall:
Yeah.
Jed Sundwall:
Right.
Okay.
Matt Hanson:
having realistic expectations around what remote sensing is capable of. Traditionally, back before Landsat was available on S3, the people doing that work were scientists, so I don't think it really came up that people were misusing remote sensing data in a bad way.

But once that data became available to the masses, and this ties into some of the stuff you were saying before, everybody started using it. Startup companies were starting to leverage it to generate NDVI. I remember working with one company back then that was using that Landsat data to calculate NDVI. And the problem was, and I think I've told you this before, Jed, that data was not appropriate for doing that. The original Landsat data that was on AWS was level-one data. It wasn't even level-one top-of-atmosphere data; it was like top-of-atmosphere prime, so it didn't even account for angles. And I think that ended up causing more of the same problem: people continually being over-promised on what remote sensing can do. So that's my issue with the ag community: I feel like they've over-promised what it's capable of. Remote sensing is very powerful because, while I might not be able to measure the water quality in a lake very well, within some error, I can look at every lake in the world, every day. And what it's really, really good at is looking at relative differences. So: time series.
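To illustrate the NDVI problem Matt describes: NDVI is a simple band ratio, and if you compute it from uncorrected top-of-atmosphere data, atmospheric differences between two days show up as apparent vegetation change. The reflectance values below are invented for illustration; the point is only the mechanism, not the specific numbers.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for a single pixel."""
    return (nir - red) / (nir + red)

# Made-up surface reflectance for one healthy-vegetation pixel.
surface = ndvi(nir=0.45, red=0.08)

# Made-up top-of-atmosphere values for the SAME pixel on two days:
# haze on day 2 scatters extra light into the red band, so the
# uncorrected index drops even though the ground did not change.
toa_day1 = ndvi(nir=0.44, red=0.10)
toa_day2 = ndvi(nir=0.42, red=0.16)

# Apparent "change" that is purely atmospheric, not vegetation.
change = toa_day2 - toa_day1
```

This is why comparing NDVI across dates calls for atmospherically corrected (surface reflectance) data, which the original level-one Landsat product on AWS was not.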
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Yeah.
Matt Hanson:
is where remote sensing really shines: being able to look at change over time and differences. And that leads into a whole other segue, which is why most commercial satellite data providers have bad business models.
Jed Sundwall:
Yeah.
Jed Sundwall:
Okay. We should keep going down this path, I think.
Matt Hanson:
It's that they're focused on this idea of selling imagery scene by scene, and there's really limited use for that. Maybe for photogrammetrists, looking at it the way we originally used it: we have a high-resolution image and we're going to look at it and identify things. But the real value in all of these archives of data is the time dimension.
Jed Sundwall:
Right. That’s right. Yeah.
Jed Sundwall:
That’s right.
Matt Hanson:
And so, I don't know, I hope for a future where those archives are unlocked. Maybe there's a subscription model where you can access the whole entire archive. But this whole piecemeal, image-by-image thing just seems a little ridiculous.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jed Sundwall:
Absolutely. Okay. I mean, yeah, we're tipping into the philosophical, which is great; that's what we get to do here. I like to say that imagery is a metaphor for the data. Imagery is one way to see the data, because you want to see it, right? I went through this a bunch when I was at AWS building the open data program. I'd have executives who were like, where do I see the pictures?
Matt Hanson:
Alright, there’s a bunch of things there.
Jed Sundwall:
What will it look like? And I'm like, well, do you know what an S3 bucket looks like? It's just a bunch of objects with names; it doesn't look like much. We had the same issue when we started hosting Hubble Space Telescope data, where people were like, I want to see pictures of the galaxies and stuff. And I'm like, yeah, that would be cool, but that's not what's in here. This is telescope data in a weird format called FITS that has its own
Matt Hanson:
yeah. Right.
Jed Sundwall:
great, wonderful people trying to figure out how to cloud-optimize it. But the imagery is a derived product that's made for a human to look at with human eyes. That's just one tiny sliver, one tiny slice, of how this data can be interpreted or used. So, yeah. I do feel like I want to defend myself with the Landsat stuff. I'll first of all just say:
Matt Hanson:
Yeah.
Jed Sundwall:
I didn't know what I was doing. I was just like, well, look, we're going to bring the Landsat data onto AWS. I had some ideas that I bandied about with Peter Becker from Esri and Frank Warmerdam at Planet. I consider them the two people who said, you should do this internal tiling and overview thing, which ultimately became known as the COG. And
Matt Hanson:
Yeah.
Jed Sundwall:
that was it. I was just like, well, we'll just see what happens. But I guess my question is: is that a solvable problem? Is any data fit for, you know, safe for, public use and distribution?
Matt Hanson:
Probably not, I mean, right? Any data can always be misused. And don't get me wrong: that move of Landsat to the cloud was huge. It really popularized Landsat. We wouldn't be where we are today if that very important data set wasn't there. But at the time,
Jed Sundwall:
Yeah, I don’t think so. Yeah.
Matt Hanson:
that data wasn't really available. Well, it was available, but that's not who was using it. It was scientists, and anybody using it probably should have opened up the Landsat data users handbook and read what the data was and what needed to be done with it in order to do things like compare NDVI across two different days, because you couldn't just do that.

But people did it anyway. And I could point to other data; I'm sure that happens all over the place. Education, right, is a good thing, and relying on experts. These are things that companies need to do: value that expertise in the geospatial and remote sensing domains, and not just assume that because data is easily accessible
Jed Sundwall:
Yeah. Right.
Matt Hanson:
and you can easily find it, you can do things without really knowing what you're doing.
Jed Sundwall:
Right, right. Well, I'd advocate for permanent, constant vigilance and skepticism around everything. I mean, the history of the internet so far, which was designed explicitly to improve the sharing of research data; that was Tim Berners-Lee's goal: I want to be able to share stuff with my colleagues more easily. Epistemologically, it's very hard to say whether or not we're better off, because yes, there's a lot more information out there, and I would assume a lot of it is accurate and great and pristine in a lot of ways, but there's really never anything stopping anybody from twisting it, interpreting it, turning it into a narrative that fits whatever their agenda is. Let me go to the comments again. Sig asked about WMS and how it made it easy to get
Matt Hanson:
Yeah.
Jed Sundwall:
many large raster images. Well, I'll just put it on the stream here: to get imagery into legacy desktop and web apps, might STAC be implemented in a similar fashion? It has been. I mean, Esri has supported STAC for a long time. Do you have comments on that?
Matt Hanson:
Yeah. There's a new QGIS feature, a STAC plugin, that actually works fantastically. So yeah, I think that's already happening.
Jed Sundwall:
Yeah, it is happening. And then from CJ Levinson: "I'm curious to hear how this conversation extends to modeled data sets, as opposed to remotely sensed data, and how this relates to Jed's point of good data products being about making fewer decisions. Thinking about climate models and weather models, mostly modeling outputs, which would be the main geospatial artifacts." So yeah, Element 84 has done some great thinking on embeddings data products and things like that; I think that's relevant here. What's your thought on this, Matt?
Matt Hanson:
Yeah, well, there are a couple of aspects here. There's the aspect of how these generally large, homogeneous model data sets fit into STAC. But I'm not sure that's the question. Is that the question?
Jed Sundwall:
No, less about STAC. More about how we're talking about data that's fit to be shared and fit to be used, and now we're dealing with data products that are just model outputs, where a model's done a bunch of magic on them.
Matt Hanson:
Right.
Matt Hanson:
Yeah. So I think that gets into your curve, right? Where we are on the curve is that modeled output, and generally speaking, I think that's what we want. This is what users want: they want the modeled output. They don't want level-two Landsat data; they don't even want level three. What they want is something like Planet's Planetary Variables. That data set is exactly the type of thing we need to see more of, I think: this isn't imagery, this is, I'm looking for a particular type of data variable, and I can get it. It's been derived from imagery, but it's gone through a process that weeds out all those edge cases and everything. I think Planetary Variables are great.

That's a great data product right there.
Jed Sundwall:
Yeah. I would also say, and this is the time to shout out Dynamical: their podcast, which is called Weathering, is an absolute delight. This is from the people who built Upstream Tech. They have this great podcast where they'll actually read papers on weather forecasting and advances in weather forecasting. In a recent episode, let me see if I can remember which one it was…

Yeah, it's the most recent one: "A taxonomy of bias, sense-making, heretical physics and the Tom Hanks, Bill Murray multiverse." It's a good episode. They discuss how we already interact with a lot of models and develop opinions of them over time based on their usefulness. So, like you were saying before, a lot of satellite imagery has substantial error rates, right? It still might be useful. There's the adage that all models are wrong, but some are useful. So I guess I'm just going to agree with you: this is what we want, to have models that are able to distill data into things like Planetary Variables, basically things that can support decision-making. And people aren't idiots;

they'll figure out whether it's useful to them or not. It's possible that sometimes the model gives you something that's catastrophically bad and you lose money on it, and you'll be able to make a decision about whether or not you want to trust that model again. That's the way the world works. I think it's so easy to overthink this sort of stuff.
Matt Hanson:
Right.
Matt Hanson:
Mm-hmm. Mm-hmm.
Jed Sundwall:
Man, I've missed out on LinkedIn. People have been saying stuff.
Matt Hanson:
Uh-huh. So, okay, before you do that: have I told you about my Star Trek theory of remote sensing? Have I ever? Okay. Well, we're on a podcast, so I'll have to explain it now anyway, even if you had said, yes, I've heard this before. So if we look at Star Trek: my whole vision of the future, I hope, is way more Star Trek
Jed Sundwall:
Yeah. Go. No, no, no. Go for it.
Jed Sundwall:
I love this. Remind me.
Jed Sundwall:
Yeah, yeah.
Matt Hanson:
than a more dystopian version. In Star Trek, you have tricorders, right? And you have sensors. And what do those sensors not do? They're not sending back images that are then analyzed. You're scanning for life. You're scanning for a particular element. You're scanning for specific variables. And I think maybe there's an aspect here. We've been creating general-purpose satellites,

historically, like Landsat: well, we don't really know, this could be used for a bunch of different things. But we're increasingly, I think, seeing companies come up and create satellites for specific verticals, selling the satellites and satellite-as-a-service. And I think ultimately maybe that's where remote sensing goes: there isn't a satellite that's taking an image, downlinking it, and then figuring out a bunch of different use cases for it. Rather, and you see this with GHGSat, it's: no, this is a satellite for detecting methane. It's a single-purpose thing. It's the Star Trek "scan for life." It might actually be an optical satellite, or SAR, or something like that, but it's doing something on board and then just sending back the thing.
Jed Sundwall:
Yeah.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah, okay, sorry, now I’ve got it, this is great.
Jed Sundwall:
Go Star Trek. It's funny, I'm not a Trekkie by any means. I did watch The Next Generation a bit when I was a kid and really liked it. But I brought Star Trek up at a recent open data event I was at, because people were asking, are there any examples of literature or stories about the future of technology where things are good? And I'm like, I think Star Trek is one of those, you know?
Matt Hanson:
Yeah.
Matt Hanson:
yeah, yeah, it’s, yeah.
Jed Sundwall:
Because we're just so steeped, and have been for many years, in dystopian technological stories and stuff like that, and I think we should keep Star Trek in mind as a vision of where we could take things. You reminded me, though: last week a bunch of our friends were at a National Academies of Sciences workshop on Earth observation and the future of data stewardship. And I pitched basically what you just said, in a way.
Matt Hanson:
Yeah, absolutely.
Jed Sundwall:
We worked within groups to come up with a 20-year strategy, and I had some license to kind of steer the Ouija board, as I would say. We were all hacking on these ideas, but this really wasn't my idea; it really did come out of the group. It was just this realization that we know a few things we want to accomplish in terms of governance and, let's say, environmental management or something like that.
Matt Hanson:
Right.
Jed Sundwall:
And rather than looking at the next 20 years of Earth observations and thinking, well, what sensors do we need? What file format should they be in? What should the standards be, and who should pay for it? What I led with when I was reading out from the group was: if we're thinking 20 years ahead, we should assume there will be more sensors. There are going to be more data products. There are going to be more models producing all sorts of stuff, more users doing weird things that we could never have anticipated. And what we should probably do,
Jed Sundwall:
and I cannot emphasize how hard this was for me to say out loud, is maybe look at something like the Sustainable Development Goals. I like to make fun of the Sustainable Development Goals, because it's like, that's nice that you created these goals, but really? Is anybody going to do anything about this? But the truth is, well, we should, you know.
Matt Hanson:
No.
Matt Hanson:
Yeah, we should. Yeah.
Jed Sundwall:
I make fun of them, sorry, everybody. But the UN doesn't really have the ability to herd the cats that are nation-states and get them to do stuff, right? I think this has been demonstrated. But the Sustainable Development Goals are really good goals. So it's like, hey, we really want to ensure that everyone in the world has access to clean drinking water. And going back to your point: what do we need to do that? It could be
Matt Hanson:
Yeah. Yeah.
Jed Sundwall:
any number of different types of sensors, and we should have some sort of entity that is actually held accountable for making the end result happen. Who knows what kind of sensors they're going to use; we don't need to specify that. It might turn out that we need something like GHGSat, and the community driving at that specific goal can determine that.
Matt Hanson:
I know.
Matt Hanson:
Yeah. And they'll need dedicated satellites to do that, right? With shared satellites serving all these different use cases, there's just not enough tasking capacity. The power is in time series, and you're maybe lucky to get an image every other month. You really need a dedicated satellite for the purpose, I think.
Jed Sundwall:
Hmm.
Jed Sundwall:
Interesting.
Okay, I don't have strong opinions about this. I've kind of always thought there's likely latent capacity in the satellites that we do have up, capacity that people just can't get access to, right? So, huge fan of Common Space, for example. Well, it's an example worth debating. I mean, we were fiscal sponsors of Common Space, you know,
Matt Hanson:
Yeah, I mean there might be, but yeah, Common Space, right? This is a great example.
Jed Sundwall:
If Bill is listening in here: a glorious initiative. But I think there's still plenty of debate to be had, which is, does Common Space need its own satellite? Or is there actually just a legal, financial, or policy hack that could make existing sensors useful for the humanitarian realm? It might be easier just to launch your own satellite at this point,
which is why I'm glad they're trying to do it. But I think it's a worthwhile debate.
Matt Hanson:
Yeah, I think it is. Especially if you want full control over it and you want to revisit the same areas over and over again. Even for a disaster, right? We focus on
imagery after there's some disaster, but ideally you'd want to continue to look at that same area for some months afterwards to see about the recovery efforts. Or if there's flooding, how long does it take for the flood waters to recede? I just don't see how you could get that much data unless you're actually controlling the satellite and have the ability to look at the same areas over and over again. Same thing with infrastructure, right?
Companies that own and operate global infrastructure: it totally makes sense for them to just own their own satellites, pointed at the exact same areas day after day.
Jed Sundwall:
Yeah. Huh. I wonder, is Munich Re going to fly its own satellites soon? It seems like it, yeah.
Matt Hanson:
I mean, it's getting more and more cost effective, right? We're seeing companies pivot towards, you know what, we're not actually going to sell pixels anymore, we're going to build satellites. And I think for big companies, and lots of countries in the world too, this seems like where the business is heading: smaller, cheaper, purpose-built satellites.
Jed Sundwall:
Yeah. All right, we're in agreement. This is again what I was saying at this National Academy of Sciences thing last week. There were plenty of people who were like, oh no, only the government can do this, everyone knows that. And I'm like, I don't think that's true. I think we're going to see more satellites being flown by more actors. Linda's chiming in on LinkedIn saying she agrees with the need for dedicated, purpose-built satellites. And Bill's open for the debate. But yeah,
Matt Hanson:
Yeah.
Matt Hanson:
Nice.
Jed Sundwall:
I find this compelling. I want to get to a question Tim Bailey asked earlier, about the issue of human inspection to validate interpretation. He says, I work in the forest wildfire resilience field, where there's a stampede of new data products that are not great data products. So yeah, we're going back to the error rate issue, and kind of the issue of
models and accuracy, and actually informing decision support systems; he posted this a while ago when we were talking about that. I'll also bring up something relevant: I think Bloomberg published a story that's been going around on LinkedIn this week about Zillow removing climate risk information from its listings.
Zillow and Redfin used to show flood risk and fire risk, data that comes from First Street Foundation, and they took it off. The issue being that people are increasingly encountering decision support information, like fire risk for your house, or for the house you're thinking about buying, that's coming from entities people aren't sure whether they can trust.
And I think First Street, kudos to them, have demonstrably produced models that are better than FEMA's models, or anything the government's been able to produce. But still, validating that sort of information is difficult. I'm perceiving a need for, I always say new data institutions, but really arbiters that can actually help validate this stuff. Anyway.
Matt Hanson:
Yeah.
Jed Sundwall:
Over to you, in case you have anything; I want to go back there.
Matt Hanson:
Yeah. So, I think Tim has a great idea for a name there. You can start another podcast called Not Great Data Products, where you evaluate really crappy datasets. Like, this is the worst, you know?
Jed Sundwall:
We should do special episodes every now and then where we just talk trash.
Matt Hanson:
Yeah, not great. This is the worst. So, I feel like I just keep ranting on this podcast, but I think we have a real problem with startup companies especially. I don't know, maybe this is a worldwide problem, but I see it
Jed Sundwall:
Do it! That’s the whole thing.
Matt Hanson:
really prevalent in the US here: startup companies doing really questionable science. Because they're at odds, right? The business model they have is completely at odds with the science. We've seen this in very high profile cases outside of the geospatial industry, but we see it in the geospatial industry as well.
And people making promises for things that really just aren't practical, over-promising and under-delivering. So, yeah, I'm not surprised that Tim has come across a lot of really not great data products. I don't know about the source of those, but I've seen that quite a bit.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Well, look, it's constant. And I'll say, this is why I'm at Radiant Earth, right? It's not why I left Amazon; Amazon is great, and I had a very good eight years there. But what I realized was, we do need to have institutions that understand how to provide data but that aren't owned by investors, so they don't have
Matt Hanson:
It’s just constant.
Matt Hanson:
Yes.
Jed Sundwall:
the same sort of forever-growth incentive. Which is not to say... I should say, some of my best friends are investors. Okay, that might not be true, but I have plenty of friends who are investors, and we are funded by investors. I don't think investors have inherently malicious intent. What I would say is that investor-owned or investor-governed companies that are united by the need to grow constantly
are not always going to be the best stewards of data. And I would say in almost all cases, they almost can't be. The pressure to enshittify is unavoidable. And also the competitive need precludes them from being truly open about their models and how they operate, right? It has to be secret sauce. Which, I think, if you're saying,
Matt Hanson:
Mm-hmm.
Jed Sundwall:
if you're going out there and saying, hey, we have the data that is going to be used to regulate the environment and the real estate market, and to assess risks to human life on Earth, then you need to be held to a higher standard than, it's good, trust us, it's our proprietary secret sauce. So.
Matt Hanson:
Yeah, and we can bring that back to STAC, actually, because there's been an effort coordinating STAC with CEOS. Matthias Mohr has done a bunch of this work, and I was involved. So CEOS, which is an international committee of space agencies,
Jed Sundwall:
Yes. Yeah. Bring it home.
Jed Sundwall:
Well, yeah, and he does that under the umbrella of Radiant Earth.
Matt Hanson:
has a thing called CEOS-ARD, analysis-ready data. Matthias has been doing work on mapping their requirements for ARD back to STAC. When I was involved with this a bit some years ago, in the early days, we were identifying what fields really need to be included
for a product to get the ARD certification, or whatever it is, from CEOS. And the immediate problem I saw was that you really need to require radiometric and geometric accuracy to be published in that metadata. And I don't think, I could be wrong, but I don't think there are a ton of commercial
satellite companies that are really willing to do that.
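To make the point above concrete, here is a minimal sketch of what publishing accuracy metadata on a STAC Item could look like. The property names below (`ard:geometric_accuracy_m`, `ard:radiometric_uncertainty_pct`) and the thresholds are invented placeholders, not the actual CEOS-ARD or STAC extension fields; the idea is just that accuracy becomes machine-readable metadata that a certification check, or a user, can query.

```python
# Hypothetical sketch: a STAC Item carrying per-scene accuracy metadata.
# The "ard:" property names are invented for illustration; real CEOS-ARD /
# STAC extension field names differ.

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "scene-2025-06-01",
    "properties": {
        "datetime": "2025-06-01T10:32:00Z",
        # Invented accuracy fields a provider could choose to publish:
        "ard:geometric_accuracy_m": 12.5,       # positional error, meters
        "ard:radiometric_uncertainty_pct": 5.0, # reflectance uncertainty
    },
    "assets": {},
    "links": [],
}

def meets_ard_threshold(item, max_geo_m=15.0, max_rad_pct=6.0):
    """Return True if the item publishes accuracy metadata within bounds."""
    props = item["properties"]
    geo = props.get("ard:geometric_accuracy_m")
    rad = props.get("ard:radiometric_uncertainty_pct")
    if geo is None or rad is None:
        return False  # unpublished accuracy fails the check outright
    return geo <= max_geo_m and rad <= max_rad_pct

print(meets_ard_threshold(item))  # True for the sample item above
```

The point of the `None` check is the one Matt raises: a provider that simply omits the accuracy fields cannot pass, so certification pressure falls on publishing the numbers at all.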
Jed Sundwall:
Interesting. Because of the proprietary nature of what they do?
Matt Hanson:
Because their satellites suck, for the most part. They're CubeSats; they're low-cost, cheap things. Now, maybe I'll get a whole bunch of people mad at me, which somebody recently told me means you're doing something right. But I don't want to make a blanket statement about all of them. I love satellite companies, right? Some of my best friends are satellite companies.
Jed Sundwall:
Okay.
Jed Sundwall:
Amazing.
Jed Sundwall:
Some of my best friends are satellite companies. Yeah.
Matt Hanson:
But the realistic assessment is that these are lower-cost, cheaper satellites, and the radiometric accuracy is not going to be up to snuff compared to giant school-bus-size satellites like Landsat.
Jed Sundwall:
Yeah. Well, interesting. But this is a solvable problem; I think you're just highlighting that it needs to be solved. If we're talking about a future in which more people are deploying sensors, with more low-cost sensors going up, a lot of those are going to be CubeSats. But again, I guess the requirements are going to be bespoke in the case of every sensor,
Matt Hanson:
Mm-hmm.
Jed Sundwall:
to determine, okay, does this meet our needs? You know, I'm a reinsurer, I need to have control of my own satellite that I can task, and this is what I need. Interesting.
Okay, I put links in the chat to a blog post that Matthias wrote about the cloud-native approach to ARD. And shout out and thanks to NASA for funding us to be able to do that work with Matthias, because it's been great. Okay. Well,
we've covered a lot of ground here. I love talking to you; this has been, secretly, one of the great things about doing this, because I don't know the last time I had an hour and a half or so to just talk to you about stuff. So it's been a real treat for me. Is there anything else you want to mention? We didn't talk about my white paper. Did you read my white paper? Okay.
Matt Hanson:
Yeah, I did, yesterday. A lot of great alignment with a lot of things. There's one thing, though; perhaps we can talk about this credibility issue. Because, as I told you yesterday, I wrote this blog post, and afterwards a colleague of mine
Jed Sundwall:
Yeah, yeah.
Matt Hanson:
was like, well, there's something missing from this post; there's something else that's required here that you didn't mention. And I think that thing is this credibility issue. What I mean by that is, if some random person, and this happens a lot, right, creates a really cool thing and then goes out there saying, hey, help me with this thing, I want to create a standard, they just might not get a whole lot of
traction from that. With STAC, we had some credibility because of Chris Holmes. Chris started it, he had a good reputation, and he'd been involved with OSGeo. He knew a lot of people, and he brought that credibility to it. And we see companies, like you mentioned the New York Times with RSS, or Google and Meta,
Jed Sundwall:
Yeah.
Jed Sundwall:
Yep, yeah, that’s right.
Matt Hanson:
come out with standards all the time, because they have this credibility. They're not guerrilla standards, right? They don't actually build them in a community, but they have enough credibility and weight behind them that they can accomplish a similar thing: this is a standard, use it, and people start using it.
Jed Sundwall:
Yeah. For some reason I feel very compelled to share, and I'll also put it in the chat, a link to "You Just Haven't Earned It Yet, Baby" by The Smiths. Morrissey at his finest. It's a harsh truth that you will confront throughout your life, whenever you're trying to do anything: you do need to earn that credibility. Right? And so,
my white paper is called Emergent Standards, and basically it's an exploration of how standards emerge without an authority coming in and saying, thou shalt do this. Linda just commented on LinkedIn: H3 and Uber is another great example, where Uber clearly knows what they're doing and H3 was obviously good
Matt Hanson:
Yes. Yep.
Matt Hanson:
Mm-hmm.
Jed Sundwall:
for what it does. They opened it up, it's great, and now we talk about H3 a lot. So it's this interesting sweet spot. We have many examples of institutions that are powerful and have sway in a lot of ways trying to decree standards that just don't work, because
they are not actually aligned with what practitioners want. So practitioners can come up with their own thing, but you still have to have a Chris Holmes in the group. You have to have somebody who has the convening power or the credibility or something to actually get people to pay attention. Which is a drag, because it's like, well, how do you do that? And I actually don't know.
Matt Hanson:
Yeah, exactly.
Jed Sundwall:
It feels like a historical accident when stuff like that works out. And that's probably true. Most of history is a series of accidents. Yeah.
Matt Hanson:
I think that's true. There's been a lot of research into this; you're probably more familiar with it than I am. If you look at the path of Bill Gates and other folks who have become billionaires or founded big companies, it's a lot of being in the right place at the right time. It's a lot of happenstance, a lot of luck. It's not just because he was brilliant and he just did stuff.
Jed Sundwall:
Yeah.
Matt Hanson:
If Bill Gates lived in another time, or Elon lived in another time, they wouldn't be the billionaires they are today. Our whole lives are pretty much dictated by luck.
Jed Sundwall:
Oh yeah. Actually, one final bit of self-promotion that I'm allowed to do here: on the latest episode of Texts on Texts, my other podcast, about literature, we talk about a short story called "Anxiety Is the Dizziness of Freedom" by Ted Chiang, which is awesome. It's totally relevant to what you just said.
Matt Hanson:
Okay, cool.
Jed Sundwall:
It describes a device where you flip a switch and it creates a parallel universe that you can communicate with; you can communicate with what's called a paraself, a parallel version of yourself. And it drives people crazy. It causes all sorts of issues for people. There's a guy who says, my parallel self has a girlfriend and I don't, so what's wrong with me? It basically reveals to people how much of their lives are
pretty much out of their control. Anyway, we've gone very far afield. But to bring it back to STAC and how this stuff gets created: you do just have to try to do this sort of stuff. You've got to try. And I think what STAC has demonstrated is that it is possible. And I do think that there are
Matt Hanson:
You gotta try.
Jed Sundwall:
parts of this playbook that can be documented and repeated. But part of that includes, as you said before, building with the community and finding champions. And you have to do that on purpose. So.
Matt Hanson:
Yeah, you do. And even just being engaged with the community helps; even if you are building stuff internally, I do feel like the more you are engaged with the community, the better that thing is going to be. So if you're working with STAC, even if it's for internal use, come to the STAC community meetings, let people know what you're up to, and maybe you'll get some good feedback.
Jed Sundwall:
Yes.
Matt Hanson:
You're going to be better off, I think. You're going to be in a better position the more you work with a larger, diverse group of people.
Jed Sundwall:
Absolutely. Well, where should we point people? We can send people to stacspec.org, where you can learn everything you need to know. As far as getting involved in the community meetings, where do we point people?
Matt Hanson:
There’s a Google group that should be, is it on the webpage?
Jed Sundwall:
I'm looking around, and I'm noticing the stacspec.org site is directing people to our Discourse, which we don't support anymore. So.
Matt Hanson:
Okay, yeah, so there are some more things that we need to do. The STAC Steering Committee actually has a meeting, I think, in the next week; maybe it's tomorrow. We're trying to clean up some of these things. So, yeah.
Jed Sundwall:
All right, well, stay tuned then: stacspec.org. Matt, you're easy to connect with on LinkedIn and places like that. You can also join the Cloud-Native Geospatial Forum; there are plenty of people in our Slack, though we do ask people to pay to join that. It's not a lot of money. But yeah, there are lots of places to get involved. Stay tuned, and look at stacspec.org and see what you can find there.
Matt Hanson:
Yeah, yeah.
Jed Sundwall:
All right, this has been awesome. Thanks, Matt, for coming on. I predict that we’ll have you on again, because we’ll be doing this forever. And thanks for everything you’ve done for the community.
Matt Hanson:
Yeah, thanks for doing this, Jed. This has been fun. I love the chat, so, you know, anytime.
Jed Sundwall:
Any time. All right. Well, happy holidays. All right. Bye.
Matt Hanson:
All right, you too. Bye bye.
Jed Sundwall:
Okay, stay in
Video also available on LinkedIn

Show notes
Jed talks with Jack Cushman, director of the Harvard Law School Library Innovation Lab, about how libraries are adapting to technological change while preserving their mission to collect, preserve, and share knowledge. From the printing press to the internet to artificial intelligence, libraries have continuously evolved their methods. The Lab focuses on bridging traditional library principles with cutting-edge technology to empower individuals with better access to information.
The conversation explores the Data.gov Archive project, which aims to preserve approximately 17 terabytes of federal datasets - not just the metadata from Data.gov, but the actual underlying datasets that are at risk of being lost. Jack explains the challenges of collecting these datasets, particularly the limitations of web crawling technology that often fails to retrieve underlying data. The team successfully collected more than 311,000 datasets, with particular attention to smaller datasets that might otherwise disappear, demonstrating their commitment to knowledge stability in an era where governmental data can be fragile.
Jack discusses how they use BagIt - a Library of Congress standard for packaging digital content - to ensure long-term preservation through comprehensive metadata, checksums for verification, and cryptographic signatures for authenticity. This approach addresses data provenance and integrity, creating complete packages that can be cited and verified decades from now. The discussion also covers their innovative client-side viewer that runs entirely in the browser without server-side software, making 17.9 TB of datasets searchable while reducing infrastructure dependencies. They explore the importance of user-centric design, the role of well-supported tools like DuckDB, the “one copy problem” that highlights data fragility in the digital age, and collaboration with institutions like the Smithsonian. The episode also touches on Perma.cc, another Lab project that addresses link rot in legal documents by creating permanent links to online resources.
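As a rough sketch of what BagIt packaging involves, here is a minimal bag written by hand: a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-sha256.txt` of checksums. This is a simplified illustration of the layout, not the Lab's actual pipeline; real bags also carry `bag-info.txt` metadata, tag manifests, and signatures, and the filenames here are invented.

```python
# Minimal BagIt-style packaging sketch (simplified; real bags add
# bag-info.txt metadata, tag manifests, and cryptographic signatures).
import hashlib
from pathlib import Path

def make_bag(bag_dir: Path, files: dict) -> None:
    """Write a minimal bag: declaration, payload, and SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # 1. Declaration file identifying the bag format version.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )

    # 2. Payload files go under data/.
    manifest_lines = []
    for name, content in files.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # 3. The checksum manifest lets anyone verify the payload decades later.
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")

def verify_bag(bag_dir: Path) -> bool:
    """Recompute every checksum in the manifest against the payload."""
    for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
        expected, relpath = line.split("  ", 1)
        actual = hashlib.sha256((bag_dir / relpath).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True

# Example: bag a tiny hypothetical dataset and verify it round-trips.
bag = Path("example_bag")
make_bag(bag, {"dataset.csv": b"state,population\nVT,647464\n"})
print(verify_bag(bag))  # True
```

The verification step is the payoff: because the manifest travels inside the bag, any future holder of a copy can confirm, file by file, that the payload is exactly what was originally archived.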
Links and Resources
Key takeaways
- Libraries evolve while preserving their mission - From the printing press to AI, libraries continuously adapt their methods for collecting and sharing knowledge while staying true to their core purpose of preserving information for future generations.
- Small datasets matter as much as big ones - The Data.gov Archive project prioritizes preserving smaller governmental datasets that might otherwise disappear, recognizing that knowledge stability depends on capturing everything, not just the high-profile datasets.
- Web crawling alone isn’t enough - Traditional web crawling technology often fails to retrieve the actual data files linked from catalog pages, requiring more sophisticated approaches to truly preserve datasets rather than just their metadata.
- Client-side viewers reduce infrastructure dependencies - Running search and visualization entirely in the browser without server-side software makes 17.9 TB of datasets accessible while eliminating the fragility and cost of maintaining server infrastructure.
- The one copy problem threatens data persistence - Data in the digital age is more fragile than physical artifacts; without robust systems and collaboration across institutions, valuable datasets can disappear when a single server or organization goes away.
- BagIt enables verifiable long-term preservation - Using Library of Congress standards for packaging data with checksums, metadata, and cryptographic signatures creates complete packages that can be cited, verified, and trusted decades from now.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
Okay, all right. Well, thanks, Jack, for joining us here for the third episode. Lucky number three. And I want to point out, this is kind of an exciting moment, because historically Radiant Earth has really dealt in geospatial data. That's our wheelhouse. That's our
Jack Cushman:
Good. Thank you so much for having me. I really appreciate it.
Jed Sundwall:
Our origin story at Radiant Earth was an effort to make satellite and drone imagery easier to work with. One of the things I did about three years ago, when I came in as executive director, was realize that a lot of what we had figured out with the geospatial community was broadly useful, in terms of adopting object storage and things like that. Anyway, this is all to say, I'm excited to have you on because you're not a geospatial person.
Our first two guests have been geospatial. Yeah. Okay, good. So this is going to be a great conversation, to learn a little more about how we've been working together on Source Cooperative, your background as a librarian, and your perspective on these things. Before we get into it, though, I do want to point out to everybody, and I'll figure out how to put this in the chat, that you are currently tuned into
Jack Cushman:
Absolutely, I would never pretend.
Jed Sundwall:
Great Data Products, the livestream webinar podcast thing, as we call it. There's also Great Data Products, the blog post, now. I gave a talk about a month ago at the Chan Zuckerberg Initiative's open science meeting, and the name of the talk was Great Data Products. Then we published a blog post called Great Data Products. So this is an exercise in brand confusion; perhaps this podcast could sue Radiant Earth for taking the title
for the blog post. But in any event, the name of the game these days is Great Data Products, and we've got a great blog post about it. I'm very happy with it, and I'll put that in the chat in case people haven't seen it. And with that, let's hand it over to you. How do you introduce yourself to people?
Jack Cushman:
Hi, everyone. I’m Jack Cushman. I direct the Library Innovation Lab. I’m really happy to be on the livestream webinar podcast thing. I love working with you, Jed, on Source Co-op. The lab I direct, the Library Innovation Lab, is a research and development lab, a software lab that’s built into one of the world’s largest law libraries. So we’re doing novel things in a very traditional place and drawing on the best of both of those worlds.
Personally, I'm a lawyer; I've worked as an appellate lawyer. And I'm a computer programmer; I've been programming computers since I was 12 years old, so very many years. I'm more of a newcomer to libraries, but I've been here for about 10 years. You asked how I introduce myself, which is always a challenge for me on the tax form: what are you supposed to write in for your job? I've come to say information scientist. Really, I'm a person who thinks about how we consume information and how we turn it into knowledge.
And how do we help our society over time have better and better access to knowledge? And that’s why the Library Innovation Lab has become such a great fit. Because our mission is to bring library principles to technological frontiers, which means to understand where people are actually getting their knowledge. How is that really happening, which often is outside of the walls of a library? And how can we take the things that we’ve learned in libraries over many centuries and help new technologies to go better? So really core things like libraries are here to…
collect information, preserve it, and share it to empower people. And we’ve been doing that since before the printing press. But when you invent the printing press, you have to change how you collect and share information. Now you need like a written list of the books you have, because there’s enough that you can’t remember them all. When we invented databases, we needed new ways of thinking about libraries. When you invent the internet and data that is digital first, government’s publishing data that is only online and never on paper, you need new ways again to think about information.
Jed Sundwall:
Right.
Jack Cushman:
And now in this AI era, we need yet again new ways to think about what it means to collect and preserve and share knowledge.
Jed Sundwall:
Amazing. So this is interesting. I didn’t realize you were a lawyer. I mean, I guess it makes sense. You’re at the law school.
Jack Cushman:
Clearly I hide it. I’m a recovering lawyer. You know, I have not practiced law since probably 2014, 2015. And happy to leave that to the experts.
Jed Sundwall:
Okay.
Yeah, it's interesting. We have a kinship here, because I studied foreign policy and thought I was going to be a diplomat or something like that. I would never call myself a programmer, but I was making websites in like 1994, on Mosaic; I was enamored with the web from the very beginning. That was always kind of a hobby
for me. Anyway, I think we've ended up in similar places, interested in sharing data and things like that, so it's cool to hear your story. Can you say a little more about the Library Innovation Lab and what you all are thinking about these days? Everything you just hinted at was great, like pointing out that we had libraries before books. What are you thinking about in 2025, as we go into 2026?
Jack Cushman:
Absolutely. And I’ll say, you know, we need 100 library innovation labs. Anything that we pick to focus on is one of many things that we could have. And I hope that all of those flowers will bloom. But for the direction that we go in, the core organizing principle is your society needs knowledge to plan and to direct itself. If we have poor short-term memory or long-term memory as an individual, it’s very hard to navigate your life. If we have poor short-term and long-term memory,
as a culture, as a community, as a government, whatever layer you want to look at, it becomes very hard to navigate. And all the projects that we look at address that in different ways. We build Perma.cc, for example, which fixes link rot in law decisions, in published cases, and in law journal articles, and it's used by law firms. It makes documents reliable in the long term instead of the short term. When you cite a URL in a document,
you include a permalink, and that permalink is on file as a copy of the web page with the Harvard Law Library. That means the link is going to work in perpetuity. It goes from kind of an Etch A Sketch memory, where you can have a case and a month later the domain doesn't resolve and you don't know what they meant, to having permanent memory again. So what that means for LIL is we're looking at how you preserve knowledge for the long term and how you interpret it. On the preservation side, we're working on projects like Perma.
We’re working on projects like we’re going to talk about our public data project, which is how do we make sure we don’t lose the public data we all create together? And then we’re also looking at the access and interpretation side. We have a research program looking at law and artificial intelligence, because law is such a wonderful playground for understanding how AI changes our ways of knowing. The law is kind of done by words. I think of how I want to say it. You think of how you want to say it. The judge picks something, and those words become meaning in the real world.
Jed Sundwall:
Yeah.
Jack Cushman:
which means that systems that can interpret and juggle and shuffle words to make meaning all of a sudden have this real practical impact in our field. And it lets us study things like: how are we going to help law students actually learn in a world where the tools can do much of the reading for them? How are we going to evaluate how good tools are at the fine-grained thing you're trying to do, benchmarking the thing you actually care about instead of abstract benchmarks of other things? And how are we going to navigate a field where employment is rapidly changing?
Law employment used to be very pyramid-shaped: you hire a bunch of people down at the bottom to read through piles of paper in a box, and now the need for reading through piles of paper in a box is really changing. We have to reinterpret what it means to be a junior lawyer who works their way up. So we're doing a bunch of things that are about how to make sense of the data once we have it. And you're seeing both sides of that in the work we're doing with you: how do we responsibly collect things, and then how do we responsibly share them so that people can really find what they need?
Jed Sundwall:
Yeah. Well, let’s talk about the data.gov archive and how that came about. Because I think the conversation started about a year ago, when we thought maybe it would be a good idea to start backing up data.gov. But I’ll confess I don’t have a clean answer to what’s in this collection. How do you describe it to people?
Jack Cushman:
Yeah, yeah, great question. So what’s the point of the data.gov archive? It did start because we wanted to do some broad reaching collection of federal data sets. And you mentioned, like, you know, there’s a geopolitical context where you might say, it’s important right now to save data. And at the same time, our law library has been saving data for the federal government since the early 1800s. I don’t know quite when Harvard’s relationship started, but.
The first act where Congress started asking organizations like ours to preserve documents was around 1813 — the federal depository library act, I’m going to get the name wrong — but it’s been over 200 years that Congress has been saying, please help us collectively preserve the stuff that matters. And with data.gov, we were saying, well, what does that mean for 2024, 2025?
We already knew that the End of Term Archive, which we’re part of, was doing a wonderful job of collecting the web pages of the federal web, including anything under .gov, but also including their Twitter pages and their YouTube and anywhere that the federal government had a footprint, getting a snapshot before and after the transition so you could understand what changed. And End of Term Archive has been doing that since 2008. It’s not a kind of this year or that year thing. As a citizen, you should be able to see what your government was and what it’s become. And you should be able to see that repeatedly as the government evolves.
So we knew that was happening. Then we said, well, what’s not happening? And the real risk that we saw is you can easily end up, if you do a web crawl, getting the manual for the data but not getting the data itself. Because the way web preservation will work is you have a browser, like any of us would use, and it clicks from link to link. And it tries to click all the links on the page, and it clicks all the links on the pages it finds, and then it clicks all the links of the pages it found there. But it can’t do things like interact with a form. It can’t do things like if you need to send an email to get data or
If you need to script an API — it’s only going to get the stuff that you can get by clicking, which is wonderful, but might mean that you end up with a submerged layer of, wish we had the actual data that this report was based on, and that is just gone if it disappears. There was a data rescue community that emerged around that time, a bunch of different groups working on wonderful projects. The part that we worked on was to see if we could save the underlying data behind the data.gov website.
Jack Cushman:
Data.gov itself is an index. It lists datasets across the federal government and also some states. But it doesn’t store the data. It just says, you can go here to read this, you can go here to read that. They do have an API. So what we did is script that API, get a list of all 300,000 datasets in there, and then find everything they link to and call that the collection. So, you know, dataset number 2,104, which is a dataset of…
you know, traffic congestion in medium-sized cities, or whatever part of measuring our society — it’s going to link out to this CSV and this Excel file and this PDF and this zip file. And that list of objects becomes what we want to put in a collection. Then the goal is to accurately collect each of those things: grab the metadata from the API, grab all of the URLs that link out from it, and package those up as one of 300,000 objects that we were making in a new
Jed Sundwall:
Got it.
Jack Cushman:
collection of collections.
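The harvesting step Jack describes — walk the index API, then collect everything each record links to — can be sketched roughly like this. Data.gov’s catalog exposes a CKAN-style `package_search` endpoint; the paging details are simplified here, and the sample record with its example.gov URLs is invented for illustration:

```python
import json
import urllib.request

def resource_urls(package: dict) -> list[str]:
    """Collect the downloadable file URLs listed for one dataset record."""
    return [r["url"] for r in package.get("resources", []) if r.get("url")]

def fetch_page(start: int, rows: int = 100) -> list[dict]:
    """Fetch one page of dataset records from the data.gov CKAN index."""
    url = ("https://catalog.data.gov/api/3/action/package_search"
           f"?rows={rows}&start={start}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["result"]["results"]

# A trimmed-down record in the shape package_search returns (invented data):
sample = {
    "name": "traffic-congestion-2023",
    "resources": [
        {"url": "https://example.gov/congestion.csv", "format": "CSV"},
        {"url": "https://example.gov/congestion.zip", "format": "ZIP"},
    ],
}
print(resource_urls(sample))
```

Looping `fetch_page` over all ~300,000 records and downloading each `resource_urls` result is the “one hop deep” crawl described below.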
Jed Sundwall:
Okay, but then obviously, you know, in our world — going back to the geospatial world — we deal with federally produced data sets that are petabytes in scale: weather data and model outputs and satellite imagery, things like that. You don’t have that stuff. So this is just what’s linked to, I guess. My question is, how many layers deep did you go?
Jack Cushman:
Yeah, great question. So we went one hop deep. You have the listing on data.gov. It links to a set of files, and it says, these are the files in this data set. And we grabbed those files. I think what that meant is we ended up collecting the smaller data sets. Because for the smaller ones, it would be linking right to an object, a file that was the data of that collection. And for the larger ones, yeah, it had the problem that those links would go to a landing page that said, for this petabyte-scale collection, here are the steps you go through to get it, which are very individual to that collection.
For those, we would only get the landing page. We wouldn’t get the actual data. And what that meant is we added up to about 17 terabytes of data, which is a bunch of small data sets and then a bunch of landing pages for large data sets. I think the size kind of tells you both what it succeeds at and what it fails at. Because it tells you on the one hand, no, we didn’t get the massive uncompressed image collections or that kind of thing. It also tells you we didn’t just get landing pages. Like 300,000 landing pages is not 17 terabytes by any means.
Jed Sundwall:
Right. Right.
Jack Cushman:
We got a ton of the smaller data sets. And I kind of liked that as a first pass. We just wanted to do something to stabilize what exists now, not be losing things. And I think it gets you a very broad reach: small, significant data sets are going to be in there and are going to be preserved. And then it sets up the question of, well, what else got missed? And you know what? That was true at every level. There was one piece we knew: of the things in data.gov, we’re going to get some of them and we’re going to miss some —
that’s necessary at this scale. We were also told going into it that data.gov itself is a partial listing of the federal government. I talked to technical folks working in the government at that time to get an idea of, where’s the list? What would I download if I wanted to download the data sets of the federal government? First I asked, do you know where that list is? And then, who could you ask? And they said, no, I don’t have a list. And second, no, I can’t even think of a group of people I could ask who would collectively know what it is.
What we have is a sort of sprawling, overlapping set of independent agencies and groups just making data. And if you look at data.gov, it’s like, here’s a cool snapshot: 300,000 out of X, out of we don’t know how many.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah, man. You’re taking me back. You know, many years ago I worked for USA.gov. I was at GSA as a contractor when data.gov was launched, so I had a front-row seat to all of that. And I have a similar story. At USA.gov, I was leading the social media strategy. And to give you a sense of what this meant, I started before Obama was elected — like the end of
W’s second term — and Facebook and Twitter were already becoming a thing. And it was like, we need to learn how to use this. How do we do it? At some point somebody was like, we need to keep track of every federal social media account. And it was like, well, what are you gonna do? Open Excel, create a spreadsheet, and just add them as you find them? And we’re like, that’s obviously not gonna work. This is too big now. And so we created a thing that I’m pretty sure
Jack Cushman:
Mm-hmm.
Jed Sundwall:
I don’t know, it might still exist in some form. It may have been deprecated, but we called it the USA.gov social media registry. Basically, we let anybody with a .gov email address submit a social media account that they managed. And then we would send them an email, because we’re like, okay, you’ve got a .gov email address. We also asked them to put in their phone number just to scare them, just to be like, this is serious, don’t spam this thing. But basically you would get an email with a token in it.
You’d click on that so that we would know that you actually owned the .gov email address that you put in. And we’d say, okay, this does look like the Twitter account for the embassy in Myanmar, or whatever it was. And it worked really well. We called it fed-sourcing — we’re going to kind of crowdsource all this stuff. But one of the things we needed for the form was the list of government agencies, which I know that you’ve dealt with.
Jack Cushman:
Not sure. That seems like that list would exist.
Jed Sundwall:
Yeah, well, it’s actually something I was going to ask you about, because you guys have built — and this is also a segue into the viewer that you all produced — this awesome data.gov archive search. I’ll let you talk about it. But one thing I want to get out right away is that you have things listed by organizations, publishers, and bureaus. And I’m curious to know whether you all had the same conversation where you’re like,
what are the government agencies? Because as far as I know, that list still doesn’t really exist anywhere. We had to make one up based on a Wikipedia article. That was the best source we could find.
Jack Cushman:
I love that story. Well, before we get into our archive, I think that question of what is the denominator — what is the set of data that’s out there that we wish we could save — really helped me appreciate the goals we have behind this thing. Because I started to picture where this data is coming from. And rather than, I don’t know, there’s the DOJ, these objects out there doing things like a giant unit — what we’re really talking about is federal employees.
Jed Sundwall:
Yeah.
Jack Cushman:
you might know the number better than me, maybe 2 million federal employees who are out there doing things for us, making things for us, like go to work and in some way facilitate the functioning of the country. And in the course of their business, making data, making data sets, whether it’s how are the crops growing or how’s the water in the aquifers or what’s going on in this little section of the economy or what’s going on in this little section of education or whatever it is, people going about their day and along the way recording things that help us understand what’s happening.
Jed Sundwall:
Yeah.
Jack Cushman:
And it helps to understand why there’s not a central list — of course those two million people would be generating millions of Excel files, things that are just like, here’s some stuff you should know, here’s something I learned in the course of my day that is worth writing down. Many of them very deliberate and collective, across a group of people. But in many ways, as people who live here, as people invested in our society, we would want all of that. We would have this kind of relationship that is not a citizen and a government, but a person and a person.
that those people should be able to publish the things they learn that will help us. And we collectively should be able to access those and use them. And at that level, the mission starts to feel much more palpable and meaningful to me. That’s like, how do we help those people who are learning things or trying to help us record the things that they’re learning so that they are permanent? And so they’re findable. And if we can have the right taxonomies, let’s do it. If we can have processes, let’s do it. But at the end of the day,
Jed Sundwall:
Yeah.
Jack Cushman:
let’s just have the stuff that we paid for — the people we employed to help us, able to share the things they learn, and us able to preserve those. And then let’s back into how we would get that list. How would we index it? How would we organize it? One thing I’m really curious about — I think there’s a project out there; I don’t know if this is a you-and-me project or who should do this — but I would love to use the Common Crawl and the End of Term Archive to try to just make the list.
Like, what if you went through every web page we know about, maybe ask an LLM — do some automation in there — and ask, what clues does this page give you about a data set that exists? And then see if we can find all of that, aggregate it, combine it, deduplicate, and come out with the world’s first denominator of what data the federal government has published. How many data sets would that be on top of the 300,000 we know about? The number you got would be
Jed Sundwall:
Yeah.
Jack Cushman:
barely related to reality, but it’d be the first time someone has planted a stake for, I think this might be the list. This might be our inheritance as people who live here, with people trying to share data with us. This could be what ought to exist. Because I’d love to be able to see that. I’d love to be able to see that constellation and look up and say, yes, that is the thing that we have built.
Jed Sundwall:
Yeah, so you’re reminding me of two things. One is — are you familiar with this story? I don’t even know what you would call it. It’s an essay by Jorge Luis Borges called “The Analytical Language of John Wilkins.” I imagine this has to be right up your alley. I’m putting it in the chat — librarians should love this. A lot of computer scientists love this story. Because it’s a story about an effort at creating
Jack Cushman:
don’t know that one.
Jed Sundwall:
an actual language. It’s an attempt at sort of taxonomizing the universe, and it doesn’t really work out very well. And Borges points out, the reason we can’t do this is because we don’t know what kind of thing the universe is. We don’t have a handle on it. And to your point about the government being perceived as a monolith, as something that is in DC — that’s just obviously not true. And that’s the other…
Jack Cushman:
Yes.
Jed Sundwall:
The other thing you remind me of is another essay. I didn’t know who wrote it off the top of my head — it’s just something a guy wrote on the internet — but the title says it all: “Reality Has a Surprising Amount of Detail,” by a guy named John Salvatier. I’m not sure how to say his name. Both fantastic little essays.
We’ve both lived through this, where you can see in open data policies the attitude of, the government produces data, the government should make the data open. And those of us who then start looking hard at it are like, man, this is not an easy task.
Jack Cushman:
I absolutely love this duality. There’s an abstraction we wish we could have — the perfect data that exists in the abstract — and then there’s the reality that what we’re talking about is the subjective views of a bunch of human beings. And this comes up very practically in the kind of work that we do, both you and I, when you’re trying to do archiving work. Reality kind of doesn’t wanna fit your taxonomy, and you have to make a lot of choices. When we were doing the Caselaw Access Project, where we scanned
the collective case law of the United States from historical times up to 2018, we found cases that came from imaginary dates. Courts would just publish a case, and there in the book it would say, February 29, 1911 — a date that doesn’t exist. And we were trying to put it in a database, and Postgres was like, that’s not a real date. I can’t save that in my database. And we’re like, okay, but it’s a real case. It really has that date on it. It is precedential. It’s part of the law that you and I are supposed to know and follow.
Jed Sundwall:
Sure. Yeah.
Jed Sundwall:
Wow. Yeah.
Jack Cushman:
We just have to infer, well, from what date did it become part of the law? I guess maybe midnight on February 28th — it existed in this magic hour. And I love that example because there’s this thing that we’re trying to do — why do all of this? — which is, we’re owed ground truth. And the ground truth is both subjective and objective. We all live on a planet made of atoms. And how much water is in the aquifer is just how much there is. You can’t change that by describing it differently.
But we’re all kind of observing and touching reality with different means and levers. And what we’ve come away with, our measurements are all different and subjective. They add a layer of subjectivity. And if you’re the collector of collections of collections at the end of it, which is kind of where we’re trying to be, you end up with both of those at once. We have an objective reality that we’re measuring and we have a subjective attempt to measure it that we’re trying to make sense of. And I just love that game. I love that work that we get to do of like,
help to see the world for what it is and also help to see people for what they are, which is, you know, very imperfect observers of everything we see.
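The impossible-date problem Jack mentions is easy to reproduce: any calendar-aware date type rejects February 29, 1911, just as Postgres does. A minimal sketch, including one pragmatic workaround an archivist might choose (clamping to the last real day of the month — an editorial choice, not what the Caselaw Access Project necessarily did):

```python
import calendar
from datetime import date

# A court really printed this date, but 1911 was not a leap year,
# so the calendar says the day never existed:
try:
    date(1911, 2, 29)
except ValueError as err:
    print(err)  # same class of complaint Postgres raises for an invalid DATE

# Hypothetical workaround: clamp to the last real day of that month.
last_day = calendar.monthrange(1911, 2)[1]
print(date(1911, 2, min(29, last_day)))
```

The point of the example is that the archive must record *something* storable while staying honest that the source said otherwise — which usually means also keeping the original string as metadata.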
Jed Sundwall:
Yeah, I love it. I mean, this is a reminder — everything we do is part of Radiant Earth, which is a nonprofit, right? But our mission is to increase shared understanding of our world by making data easier to access and use, for exactly that reason. I always refer to the blind men and the elephant. I always use this framing that we’re feeling our way in the dark.
We’re increasingly adding new capabilities for measuring reality and trying to understand it. And I’m like, well, let’s make sure we do that together. And I’ll say, what I love about my job and the approach we’re taking is that it gives us so much freedom to be happy anytime anybody takes a swing. We’re like, yeah, go for it. Yeah, exactly. And people are like, I wanna try some weird new file format. And people are like, well, that’s not
Jack Cushman:
Yes, get that up there too.
Jed Sundwall:
that’s not the one that we use. And I’m like, it doesn’t matter — let them try. So that’s a segue. We should talk about the archive search, but I want to talk about BagIt first. How do you describe BagIt to people, and why do you use it?
Jack Cushman:
Sure. BagIt is a sort of collective product from the library community writ large, but it was strongly endorsed by the Library of Congress, so it really got some traction there. I think that was around the 2010s; I don’t remember the exact date. The notion was to have a data transfer format that is as simple as it can possibly be, where every moving part has been stripped away,
so that you can implement it reliably and make readers that can reliably pass things around regardless of what’s inside. Because part of the issue is you end up with, well, here’s how you encode a web archive, and here’s how you encode an image or an image collection, and here’s how you encode a novel. You get a proliferation of formats, and you get things that fall in between them, and you have this same taxonomy question we were just discussing. So what if you had something that can correctly encode anything, in a very loose way? So: a bag is a folder.
Jed Sundwall:
Yeah.
Jack Cushman:
And the folder has inside it another folder, which is the data folder. And whatever is in there is the thing that you bagged. And then it has a little bit of metadata. It has an index that says, here’s the hash of everything that is in me as data that I’m recording. And here’s the date I was made and some things like that. And beyond that, it’s up to the implementer to decide what substantive metadata to record. So it becomes a lowest common denominator way to pass around data in the library and archives community. And certainly,
Jed Sundwall:
Okay.
Jack Cushman:
you want to specialize from there. You want to have image collections with a bunch of image-specific things that they standardize on. But you don’t want to be stuck with that. You want to also be able to step down to a lowest common denominator to do interchange. We reached for BagIt with data.gov because it looked exactly like that kind of problem: a very heterogeneous collection, 300,000 data sets, you don’t know what’s in them. You want to get them all, and get them correctly, regardless of whether there are new file formats you don’t know about. So something that was like, take the files you care about, put them in this folder,
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah.
Jack Cushman:
was a really nice place to start. And then we had to build a bunch of stuff on top of it.
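Jack’s description — a folder, a `data/` folder inside it, a declaration file, and a checksum index — maps directly onto the BagIt layout. A minimal sketch of writing a bag with only the standard library (the `report.csv` payload is invented; real tools like the Library of Congress `bagit-python` package add more tag files and validation):

```python
import hashlib
import tempfile
from pathlib import Path

def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write a minimal BagIt bag: a data/ folder plus a checksum manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    # The declaration file every bag starts with.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    # One manifest line per payload file: "<sha256>  data/<name>"
    lines = []
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")

bag = Path(tempfile.mkdtemp()) / "example-bag"
make_bag(bag, {"report.csv": b"year,value\n2024,1\n"})
print(sorted(p.name for p in bag.iterdir()))
```

Because the manifest records a hash for every payload file, a reader can later verify nothing was corrupted in transfer — which is most of what the format exists to guarantee.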
Jed Sundwall:
Yeah. Okay. But the idea, though, is the folder is an object? It’s a binary that gets uploaded to S3, a single BagIt file?
Jack Cushman:
Yeah, so if you’re passing it around — I think we zip them. We put them in a format where they’re compressed, but also, with an index, you can pull out individual files from the compressed thing. And this is kind of an elaboration on top of BagIt itself. BagIt doesn’t specify a single-file expression of itself. The bag is actually the unzipped folder: it has this file, it has this file, it has this file. And if you have a folder that complies with that, then it’s a BagIt object.
Jed Sundwall:
Okay.
Jed Sundwall:
okay.
Jed Sundwall:
Interesting.
Jack Cushman:
But we don’t actually share folders on the internet. You always have to turn it into a single file one way or another. So when we share them, the way that we did it is to zip them and index the zips. And if you do that right, then you can get a set of ranges where like, do you want this CSV out of the bag? Just fetch this range directly from the file, and it’ll give you that CSV. And that’s kind of the best of both worlds for serving in terms of it’s small, but it’s also accessible.
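The range trick Jack describes relies on the zip format keeping a central directory that records where each member’s bytes live. A sketch of computing the byte range of an uncompressed (stored) member — which is what would let a client issue a single HTTP Range request for just that CSV. This is an illustration of the general zip mechanics, not Harvard’s actual index format:

```python
import io
import struct
import zipfile

def member_range(zip_bytes: bytes, name: str) -> tuple[int, int]:
    """Locate the raw bytes of a STORED zip member, so a client could
    fetch just that slice with an HTTP Range request."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(name)
        assert info.compress_type == zipfile.ZIP_STORED
        # Local file header: 30 fixed bytes, then filename and extra field.
        fixed = zip_bytes[info.header_offset:info.header_offset + 30]
        name_len, extra_len = struct.unpack("<HH", fixed[26:30])
        start = info.header_offset + 30 + name_len + extra_len
        return start, info.file_size

# Build a tiny zipped "bag" with one stored CSV and pull it back out by range.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("data/report.csv", "year,value\n2024,1\n")
raw = buf.getvalue()
start, length = member_range(raw, "data/report.csv")
print(raw[start:start + length].decode())
```

Compressed members work the same way if the index also records the compression method, so the client can inflate the slice after fetching it.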
Jed Sundwall:
Right.
Jed Sundwall:
Yeah, it’s interesting. We have to think through this on source. we’re just, as far as like features go, is that the way source works is you’re just navigating an object store. for those who know, you’re not clicking through folders. You’re navigating prefixes and then enlisting what’s in there. But then when you get to an individual object, we want to tell you and show you everything we can about that object. And something we need to do for baguettes and zips and tars is
show you that index. so it’ll be a kind of like, it’s just a new view that we have to think through a little bit where it’s like, yes, you’ve landed on an individual object, but also you should think about it as still part of this kind of directory structure. Yeah.
Jack Cushman:
That’s right. And your podcast listeners may know this, but I think I should plug the mission that you’re describing, which is — I like to say we collect collections of collections of collections. And I think you then collect collections of collections of collections of collections. So you end up with this very meta, here is a thing: Harvard made this collection of data.gov objects. But you don’t want it to just be bits that people have to download and have a local viewer for.
Jed Sundwall:
Yeah.
Jack Cushman:
What I’ve heard from you is we really should help people understand what it is they’re getting. A little try-before-you-buy: what would be in there if I pulled it down? That’s easy for a few standard things — showing the beginning of a CSV or an Excel file is very straightforward. And you’ve done things with mapping, which I think is also wonderful. But what do you do when you have a zip file? Are there ways that we can start to show that? I love this vision that our community can do that together. We can start to say, I’d love to be able to try-before-you-buy this kind of object too. There’s a bunch of these and I’m curious what they are.
And then just contribute that viewer and have that happen too. I think that vision is so key to this. One thing that you and I’ve talked about a bit is that some of it is really very specific to one collection. We have a custom viewer for data.gov — I actually think you probably want a custom viewer. Because you don’t want a BagIt viewer in general. BagIt is a very general format, so it’s hard to expose much detail there. You want a Jack Cushman-flavored BagIt viewer —
Jed Sundwall:
Yeah.
That’s right.
Jack Cushman:
a viewer that will tell you what’s specifically in these ones, that with a little bit of elbow grease on our side, you can have it actually be able to see what’s in there very specifically. And I think this game is like, how much can we use standard formats and how much do you end up with a bunch of viewers?
Jed Sundwall:
Yeah. So first of all, it’s very nice to hear you repeat back what we’re trying to do, and you nailed it. Well, it’s better to have you toot the horn for me. So that’s great. I love it. Fancy Harvard guy agrees that what we’re doing is a good idea. I just put in the chat the archive search viewer, because absolutely —
Jack Cushman:
You skipped past tooting your own horn, but I think it’s such a good strategy.
Jed Sundwall:
So this is a callback to the Great Data Products blog post, where I finally published again what I call the sweet spot graph, which is something I’d come up with when I was working at AWS. I still have more work to do on this idea — we’re gonna write another paper about it — but the notion is that you don’t wanna over-determine how data is interpreted. It’s everything you were saying before.
But you do still want to give people some assistance in seeing the data, right? So you have to find the sweet spot between, here’s the raw data, we refuse to interpret it in any way, let the universe decide what it’s good for — but also, let’s be honest: if you download a hundred-thousand-row CSV, you can’t just open it in Excel. And if you’re
properly nerdy, you’re gonna do a `head` in the terminal and just look at the first few rows. We can do that in the browser now, trivially, so we should. And that’s something we want to build in. But then also, to your other point with the viewer that you built: if you have a handle on your collection of collections that you’ve put together, you should also, in the browser, be able to show people around. Give them, like, Jack’s tour, which is great.
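The browser version of `head` that Jed wants can lean on HTTP Range requests: fetch only the first chunk of the file and parse whatever complete lines arrived. A sketch — the `fetch_prefix` helper and the 64 KB cutoff are illustrative choices, and the sample bytes are invented:

```python
import csv
import io
import urllib.request

def head_rows(prefix: bytes, n: int) -> list[list[str]]:
    """Parse up to n rows from the first bytes of a CSV, dropping the
    final (possibly truncated) line."""
    text = prefix.decode("utf-8", errors="replace")
    complete = text[: text.rfind("\n") + 1]  # keep only whole lines
    return list(csv.reader(io.StringIO(complete)))[:n]

def fetch_prefix(url: str, nbytes: int = 65536) -> bytes:
    """Ask the server for just the first nbytes — the HTTP equivalent
    of `head` in the terminal (requires the server to honor Range)."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Simulated first chunk of a large CSV, cut off mid-line:
sample = b"year,value\n2024,1\n2025,2\n2026,3 this row is cut of"
print(head_rows(sample, 2))
```

In a real viewer the same logic would run in JavaScript against a `fetch()` with a `Range` header, but the shape of the problem — grab a prefix, discard the torn last line, render the rest — is identical.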
Jack Cushman:
Yeah, very much. There’s this semi-opinionated — because I’m not opinionated about the details, but I’m opinionated about, what’s the most sensible way to explore this? One place where I think that’s getting more urgent: as a data rescue community, as an archival community, we have a real challenge with preserving the interfaces to things. One thing you’ll get those 2 million employees doing is, well, here’s some data — and I think you actually might want to see it on a map, combined with this other data, so you can understand how
Jed Sundwall:
Yeah.
Jed Sundwall:
Yes.
Jack Cushman:
your housing choice relates to your school choice, relates to your hospital choice, whatever the things are. There are all these semi-opinionated viewers that just combine two sources that are helpful to see in a shared visualization. And those we mostly lose because when you move from saving the underlying data to saving the software, you’re moving from the business of data preservation to the business of software preservation, which is its own field that is just much more complicated. You have to understand.
Is the source open? Is there a way to host it? Is there a way it will be patched in the future? How does it need to evolve? Software preservation is just a much more challenging, one-at-a-time kind of business. So the point is, we’re losing a ton of our viewers if we disinvest in publishing data. And that means we need to ask — because the archival community cannot replace that; we’re not two million people who can come build things — can we
make more general-purpose viewers that help people actually see the part of it they need? And so the question of what would be the sweet spot of a general-purpose viewer that helps any given person understand what they’re looking at becomes so important, I think.
Jed Sundwall:
Yeah, yeah. Well, I guess I’ll say to everybody, stay tuned. This is something we’ll definitely be doing a lot more of. And what’s actually kind of funny — people tend to think this is funny, at least a lot of the people I hang out with, because they’re climate model nerds — but I’m like, we really need to make it easier for people to see CSVs on the web. They’re like, what? I’m like, trust me.
Jack Cushman:
Sure.
Jack Cushman:
Yeah. I think that user feedback is so important. One piece of feedback we got for our Caselaw Access Project is, we were publishing JSON Lines files — one line of JSON per record. And that was really useful for Python programmers; there are great tools for reading that. It was very confusing for R programmers, if I’m remembering right. In R, it was a lot easier to read a CSV than a JSON Lines file. And I just got this feedback: can you make it CSVs? That works better in my environment.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
And it was these little things — if you can get past that friction, then people are able to use the thing.
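The JSONL-to-CSV conversion Jack’s users asked for is a small flattening step, assuming reasonably uniform records. A sketch — the case records below are invented, and real tabular exports need a policy for nested fields and missing keys:

```python
import csv
import io
import json

def jsonl_to_csv(jsonl: str) -> str:
    """Flatten a JSON Lines stream into CSV, using the first record's
    keys as the header row."""
    records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

cases = '{"id": 1, "court": "Mass."}\n{"id": 2, "court": "N.Y."}\n'
print(jsonl_to_csv(cases))
```

Publishing both formats side by side is often the cheapest way to serve the Python crowd and the R crowd at once.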
Jed Sundwall:
That’s right. And also, I think the story you just told highlights something we feel really strongly about, which is that you really have to focus on the practitioner community. This goes back to the sweet spot concept of not over-determining how data gets presented. If you go too far the other way — well, people just want a dashboard, or you just want a visualization for an executive — you’re cutting out a whole user community that
could really surprise you and do interesting things with the data. But could you say a little bit more about your viewer — how it was built?
Jack Cushman:
Absolutely. Very practically, if you go to this link, you can browse our collection. And the way we’re structuring it is sort of a tasteful use of the metadata that came with data.gov. So this owes a lot of DNA and credit to data.gov for structuring the data — offering metadata for how to shuffle these 300,000 data sets. We’re really just replicating that. Going back to your question of whether we have a separate list of US agencies:
Jed Sundwall:
Okay.
Jack Cushman:
We really just have the list that came with the data of what metadata entries that they have. And we let you search by data set title, organization, and so on. And then we let you narrow down by categories what we saw as the most useful chunks, metadata fields that were in our raw data to let you browse. The really important thing about this, what makes it little more interesting than a million other pages you’ve seen that let you browse a large data set and narrow it down.
is that it’s running entirely in your browser. There’s no server-side component to it. And for folks who might be on the less techie side of things: in a typical website, you have your own browser that runs on your computer, and it fetches HTML and JavaScript and so on from a server. The server is also running custom software. And when you send in your request for “just give me the ones that came from the US Geological Survey,” the server filters out all the others, narrows it down, bundles up exactly what you need, and sends it down to you. Which means the person who’s providing this to you
is doing sort of ongoing work for you. They’re keeping this software up to date and running and paid for. And so you’re dependent on them still existing. If you want to come back tomorrow or next year and still be able to narrow things down to just US Geological Survey, you’re depending on the person who’s really providing a service for you, still being there to narrow it down for you and hand it to you live when you need it. And that creates a lot of precarity in the digital humanities space. And there’s a…
We now have enough decades of experience making digital humanities projects and putting them online and then running out of money for them and having them crash again. You can study this. You can look at 100 projects and what made them live or what didn’t. And that server-side software load really becomes an issue because it’s the first thing that’s going to kill your project. It’s a huge difference between print books and libraries and digital books. And I love this contrast. Given some climate control, given a roof that doesn’t leak,
books are pretty happy to be left alone for a year. If you’re like, you know what, we just don’t have staff to open up this part of the library for the next year, we’re going to close the door, set the thermostat to the right level, and you’re probably just going to find them in better condition in a year than they would be if people had been looking at them. With digital, it’s not like that. If you’re like, we just don’t have the people to maintain this for the next year, there’s a good chance it’s gone and unrecoverable when you come back for it. You didn’t pay some server bill, and something got deleted, and no one’s around who knows how to put it back together, and it’s just gone.
Jack Cushman:
So this viewer, the really exciting thing about it is that it’s really not subject to that kind of rot, because it’s client-side only. When we give you the data, we hand you the entire software to view it, right alongside. And the idea is, if you’re making a copy of this, you get the original and you get the software too. Your copy becomes just as good as the original. And you can see right now, it’s kind of clunky. When you click around it, it’s slower to load than it would be if we had a powerful server running it.
We’re kind of pushing the edges of what’s possible to do in the client. I think we can push those edges a lot further. I think a lot of the clunkiness can be fixed by more indexes and more optimization here and there. But what you’re really having to do is think through: if all we could do is write static files, what static files would we need to make the experience I want very efficient? And just like you have seen,
geo data that is structured very carefully so that you can fetch the parts you need from the server without needing server-side software. We can use DuckDB and write out custom Parquet files that have the indexes you need to serve this experience, with the data you most need right at the top. And the better we have that structure, the faster the thing can run. A cool thing about that is it ends up being the same skill that you need to make fast server-side software. So, like,
If your data is poorly indexed and you’re sending a bunch of queries to the server that require it to do a bunch of work, the server is going to crash if a bunch of people use it. So you try to use indexes where the server has to do very little work. If you get those really right and really pristine, you don’t even need the server. You can just fetch the index data directly. That’s the plan. We should talk about cryptography too, because I think that’s a necessary piece of this vision. But let me know if we should jump to that now or stick to the client side.
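A minimal sketch of the “static files only” idea Jack describes, with an invented catalog and file layout. The real project uses Parquet files and DuckDB rather than JSON shards; everything here is illustrative:

```python
import json

# A toy catalog standing in for data.gov metadata records.
datasets = [
    {"title": "Streamflow", "organization": "USGS"},
    {"title": "Census Blocks", "organization": "Census Bureau"},
    {"title": "Earthquakes", "organization": "USGS"},
]

# Publish step (run once, offline): group records into one static
# file per organization, plus a manifest listing what exists.
shards = {}
for row in datasets:
    shards.setdefault(row["organization"], []).append(row)
static_site = {f"by_org/{org}.json": json.dumps(rows) for org, rows in shards.items()}
static_site["manifest.json"] = json.dumps(sorted(shards))

# Client step: "fetching" a static file is just a dict lookup here;
# in production it would be an HTTP GET against object storage. No
# code runs server-side to answer the query.
hits = json.loads(static_site["by_org/USGS.json"])
print([d["title"] for d in hits])  # → ['Streamflow', 'Earthquakes']
```

The publish step is where all the “server” work happens, once, before anyone queries anything.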
Jed Sundwall:
No, let’s just, I mean, let me just linger on that a little bit. So yeah, when you open the search, you have a little spinner there. And I assume what’s happening there is that, is it WebAssembly loading? Do you know?
Jack Cushman:
I think it’s DuckDB loading. There are about five megabytes in the current client that have to load just for raw DuckDB. And this was a technical choice we had to make early on. Do we use a well-supported, off-the-shelf library that does make you load a few megabytes? Or, the core work that we’re doing could be done with a lot less software to send down, but you’d have to do a lot more custom work.
Jed Sundwall:
Okay.
Jack Cushman:
We ended up deciding to go with the off-the-shelf thing with DuckDB, because it makes us part of a larger community, and we think it’ll feed back and forth with the open source community better that way. But it was a tough decision. I think the state of this technique right now is that it’s still pretty bleeding edge. You find a bunch of libraries that are like, someone made it and thought it was cool but stopped supporting the GitHub repo, or it was a one-maintainer project and now they’re gone. Or it’s a large project that’s planning to implement it, but they haven’t got around to it yet, and you have to find a branch where it kind of works.
So working this way ends up kind of pushing you into some creative coding. And part of what you’re seeing is loading that DuckDB software, for now. I think DuckDB itself could be a lot smaller, and that’s one direction I’d love to see it grow. The other thing you’re seeing is loading the data. So in addition to fetching DuckDB, at some points, as you click around through here, it’s going to say, to answer that query, I would need to have loaded this index that I know exists.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
And so it’ll go back to the server and say, can you please send me 500K or a megabyte of this index? It’ll help me show the answer to this. And as you click around, you’ll see less of that, because you’ll be loading into your browser the parts that you need to see the experience that you want to have.
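The incremental index fetching Jack describes can be pictured as byte-range reads against one static file. The page size and layout here are made up for illustration:

```python
# Simulate byte-range reads against a single static index file, the
# way a browser client reads Parquet from object storage.
PAGE = 1024
index_file = b"".join(bytes([i]) * PAGE for i in range(10))  # ten 1 KiB pages

def fetch_range(start, end):
    # Stands in for an HTTP GET with a "Range: bytes=start-end" header.
    return index_file[start:end + 1]

# The client already knows (from a small footer or manifest) that
# page 3 answers its query, so it pulls one kilobyte, not the file.
page = fetch_range(3 * PAGE, 4 * PAGE - 1)
print(len(page), page[0])  # → 1024 3
```

Object storage serves range requests natively, which is why no custom server code is needed.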
Jed Sundwall:
Okay, yeah. I just want to put in the chat also that, you know, Hacker News picked it up. They thought it was pretty interesting, what you’d done. And the only other thing I’ll say is, in the last episode, Brandon Liu, who created Protomaps, which is an amazing vector tile file format and serving tool. I feel terrible, I don’t know exactly how to characterize how awesome Protomaps is as a project. But he’s like, look,
it’s very, very simpatico with what you’re saying. He’s like, you should also be able to put Protomaps data onto an SD card and walk into a forest, you know, and give it to somebody on a laptop and visualize it there. Now, you still need to run a browser. So everything you just said hints at these decisions that you have to think about when you’re trying to find that sweet spot, which is like, okay, we’re going to use a very widely adopted
platform or tool, DuckDB, because there’s a community there for it. And obviously we’re using object storage and browsers because they’re very distributed technologies that people have access to. These are the kinds of decisions and thinking that I think, well, whatever, I’m preaching to the choir here. Yeah.
Jack Cushman:
It’s exactly right. I think of David Rosenthal, who founded LOCKSS. The way he likes to say this is: no one’s ever going to make hardware specifically for the archiving community. We are too small. So when you’re designing a system, you figure out what you can do with off-the-shelf parts that are designed for other communities. That was how it led him in the early 2000s to say, we need to figure out how to make this work on commodity hard drives. Because we can’t be buying special custom media for ourselves. We’re
way too small for that to ever be as good. We need to figure out what’s the media that other people use, and use it. And I see that repeat in all kinds of ways, you know, communities and structures.
Jed Sundwall:
Man, yeah. I’m gonna be maybe a gadfly here. I don’t know, I don’t think anybody’s listening to this right now, but I’ve had conversations with big funders that want to do big stuff for climate, and they’re like, we need really gnarly hardware. And I’m like, do not do that. Please don’t go down this path. I mean, they’re talking about building their own data centers, and I’m just like, stop, please stop.
Jack Cushman:
Mm-hmm.
Jed Sundwall:
What you’re doing is very important, supremely important. I’m glad you want to put money towards this, but you should be focusing on the commodity layer. Anyway.
Jack Cushman:
That’s right. I feel like we build strong, robust community layers, and then we identify specific technical weak spots where a real technical breakthrough will make a difference. So most of the work is kind of building the community that’s going to pass things around. And then we recognize something like: if we can make this client-side, if we can make this cryptographically signed, we can have a breakthrough here. So let’s put some tech into that, but spend it very carefully.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. All right. Well, now let’s talk about cryptography.
Jack Cushman:
Absolutely. So here’s the philosophy. Every copy should be as good as the original. If I make an archive at Harvard and you grab a copy of it and put it on your desktop, your copy should be just as good as mine for posterity. And that’s because lots of copies keep stuff safe. Those copies all have to be valid. And we really, philosophically, we don’t want to be planning for any one institution to exist in perpetuity.
whether it’s the US government or Harvard or if you shot it into space, it doesn’t matter. You shouldn’t assume that any one of them is still going to be there. And then it becomes really critical to focus on how to make copies, because the history that we’ve seen on the internet is that copies tend to disappear. If you try to maintain two copies of something, say we’re going to have two copies of the census data, then pretty soon you’re like, well, one of these isn’t being used. The internet’s very reliable, so we’re all going to one of them. And the other one just kind of gets cut off eventually.
It gets deprioritized, defunded, disappears. So we have to make copies robust and easy. So it becomes a two-part strategy. When I ship something with the data.gov archive, it’s going to come with a viewer. And it comes with signatures, so that you don’t need me to be around to make sure that it’s real, to understand what its provenance is in library terms. We just talked about the viewer prong of that, which helps make sure that your copy is as good as mine. The signature prong is: when you get data from me, you should be able to tell
who says this is authentic? When did they say it? And what do they say is in it? It’s something that I love. Starling Lab compares this to an evidence bag in court. If you imagine, you know, I don’t know, let’s pick something nice, like a beach ball is found at a crime scene. Then it’ll be put in a bag. Most of the examples are not great, but it’ll…
Jed Sundwall:
Yeah, yeah, yeah, yeah, yeah, that’s true. Yeah.
Jack Cushman:
It’ll be put in a bag. The person who picks it up will sign it and say, I picked this up and put it in this bag on this date. And then when they hand it to someone, they’ll say, then I handed it to so-and-so. And they’re like, yeah, I picked this up, and it was handed to me by them, and I brought it to the evidence locker, and I put it in this locker and locked it. And then the person who takes it out to bring it to court, they’ll say, I took this out, and I held it from here to court. So when you’re admitting that beach ball before a jury of your peers,
you can say these are the people who would have to testify in sequence, every one of them to say hand to hand how it got from that crime scene to you touching it today. And most of the time those people don’t come testify because most of the time that process is reliable and the fact that we have that record means that we can rely on it. Sometimes it’s not. Sometimes we say like, that one person who was working the evidence locker that day turned out to be really sketchy and put things in the wrong places. Let’s figure out which ones they touched and we can revisit that.
So that provenance chain becomes so vital. If you think about proving things in court, it’s really clear. But actually, in libraries, we care about that all the time. Anytime I say, here’s a list of companies: if you could say, well, this is a list of companies that Edward Snowden said were cooperating with the government, and I can prove it, then it’s a really important list. If it’s, this is a list that Jack found on Wikipedia, it’s a meaningless list. The provenance matters. So we need to be able to attach provenance to the things we pass around. And we need it to last longer than we do. Those are the design constraints.
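The chain-of-custody record Jack describes can be sketched as a hash-linked log. This is a toy illustration, not the format the lab actually uses:

```python
import hashlib
import json

# A toy hash-linked custody log: each entry commits to the previous
# entry's hash, so editing or reordering history breaks later links.
def entry_hash(entry):
    body = {k: entry[k] for k in ("actor", "action", "prev")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append(chain, actor, action):
    prev = chain[-1]["hash"] if chain else ""
    e = {"actor": actor, "action": action, "prev": prev}
    e["hash"] = entry_hash(e)
    chain.append(e)

def verify(chain):
    prev = ""
    for e in chain:
        if e["prev"] != prev or e["hash"] != entry_hash(e):
            return False
        prev = e["hash"]
    return True

chain = []
append(chain, "collector", "bagged item at scene")
append(chain, "clerk", "stored in evidence locker 12")
append(chain, "officer", "delivered to court")
print(verify(chain))  # → True

# Tampering with the middle entry is detectable.
chain[1]["action"] = "stored in locker 13"
print(verify(chain))  # → False
```

Real systems add signatures so you also know who wrote each entry, not just that the sequence is intact.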
Jed Sundwall:
Yeah. Yeah.
Jack Cushman:
Cryptography is how we do that. And we attach a signature to it. The signature says, I, Jack, say this is what this is. And you can be convinced that it was Jack who said that. And we attach a timestamp to it that says all of this stuff you’re looking at existed as of this date. No later than this date this came into being. And if you put a signature on it and then you put a timestamp on it, then you can later say reliably, Jack swore in 2024 that this was real and this is what he said it was.
And that existed in 2024; it didn’t happen later. And that doesn’t mean it’s real. It still could all be fake. I could have lied about it. I could have been lied to by the web. A bunch of things could happen. But let’s imagine we go out to 2028, and two people are arguing about water rights in Nebraska. And one of them says, look at this government record from 2010 that proves that these are my water rights. And it’s gone now. It’s no longer on the website. The only copy is in the Harvard archive.
And the other one says, that’s a lie. You just made that up. That’s not a real document. It’s not on the federal web. Then what you get to argue about is: is it plausible that Jack in 2024 wrote down some lies about water rights that would mean that I win this thing in 2028? You greatly narrow down the ways that the lie could happen. And most of the time you’ll say, OK, no, that doesn’t make sense. Jack wouldn’t have known to do that. This must be real. So that’s what we need to do. We need to attach a signature. We need to attach a timestamp. Getting into the technical weeds: we were moving pretty quickly, and we wanted to
ship something, and we wanted it to be reliable for the long term at the same time. So the plan was to use very well-understood, basic, standard, off-the-shelf crypto. I think if you were designing this from scratch, you would use more modern algorithms, but what we reached for is OpenSSL and some standard ways of using OpenSSL to sign and timestamp things. And so we added a little extension to the BagIt format, which you can find in my tool, bag-nabit, which I got to name.
Jed Sundwall:
Hahaha
Jack Cushman:
It’ll put in a signature file that covers all of the hashes of the stuff that’s actually in this thing. I’m going to sign that file of hashes, and I’m going to say, Jack swears that this is real. And it actually chains back to control of our email address at the Library Innovation Lab. So someone who was in control of that email address at this time signed this thing saying it was real. And then we timestamp it, just going out to a timestamp server, like DigiCert, and saying,
someone else out in the world, with no reason to lie, who timestamps a bunch of stuff, says it existed at this time. And that signature plus timestamp can give you a lot of confidence. I also built it so that it can support multiple chains. So you could say, Jack swore this was real and timestamped it, and then someone else swore it was real and timestamped it, and then someone else did it. And you can start to build a collection of people for whom it’s implausible that they all made the thing up. So technically, it’s trying to make a really simple, hard-to-mess-up,
convincing proof that this thing is what it says it is. And if you poke around the bag-nabit source that you just linked, you can see how we made those choices. And the goal was to have a cryptographer not say, how brilliant, you did some really clever things here, but probably to say, you did what we thought was amazing 10 years ago and is now fine, and you didn’t do it wrong. Because that was really the goal: don’t have any kind of big implementation mistakes in the cryptography,
which is kind of the level of cryptographer that I am. I think I understand the tools we’ve been given and how to use them. And I understand that almost all the time, what goes wrong is not some break in the cryptography itself, but a screw-up along the way in how you use the tools. So I’m going to try to use them in a very straightforward, obvious way. And that’s what this tool offers: just the most obvious, straightforward way to use a very standard tool to verify where something came from and when it appeared. And…
I don’t know, I like to imagine sometime five years from now, 10 years from now, 50 years from now, people saying like, is this real or is this made up to suit our moment? And being able to say, yes, I can trace it back through Jack’s software and say, very implausible that this was made up because you would have had to do a bunch of things that didn’t happen.
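A sketch of the hash-manifest half of this scheme, using only hashing. bag-nabit’s real format goes further: it signs the manifest with an OpenSSL key and attaches a third-party timestamp, which the Python standard library alone can’t do:

```python
import hashlib
import json

# A toy "bag": path -> file bytes. Contents are invented.
bag = {
    "data/datasets.csv": b"id,title\n1,Streamflow\n",
    "data/metadata.json": b'{"source": "data.gov"}',
}

# One content hash per file, in the spirit of BagIt's manifest-sha256.txt.
manifest = {p: hashlib.sha256(blob).hexdigest() for p, blob in bag.items()}

def verify(bag, manifest):
    # A copy is "as good as the original" only if every hash matches.
    return set(bag) == set(manifest) and all(
        hashlib.sha256(blob).hexdigest() == manifest[p] for p, blob in bag.items())

print(verify(bag, manifest))  # → True
tampered = dict(bag, **{"data/datasets.csv": b"id,title\n1,Changed\n"})
print(verify(tampered, manifest))  # → False
```

The signature then needs to cover only the small manifest file, not every byte of the data, which is what makes signing large archives cheap.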
Jed Sundwall:
Yeah, okay. Well, this is great. I mean, we’ve talked about this before, but I’ve never gotten this full spiel from you. This also has to be built into Source. I’ve said this for a long time; it’s an aspiration for Source. I’m glad you’re excited. I want people to be able to use Source to win court cases. Because people are like, we’ll have open data for impact, and I’m like, well, how does that impact actually happen? Because there’s
always this thing I call the data delusion. I got this from Jessica Seddon. She’s on our board, a great co-conspirator forever. She talks about imaginary decision makers, which is this thing in our circles, especially those of us who work in environmental data, where we’re like, well, once we have the data, then the people who are in power will know what to do, and then they’ll do the right thing. And it’s like, no, that won’t happen.
Maybe sometimes that happens, but it’s pretty rare. How do you get people to change their behaviors? I mean, one good way to do it is by suing them, and winning. And so I’m like, all right, well, then what do we need to do to make data actually suitable to be presented as evidence? And you just told that story perfectly. The funny thing, the beach ball thing, is hilarious, because a beach ball is so benign, and then
Jack Cushman:
Yeah, you’re trying to offer a theory of change.
Jed Sundwall:
And then I’m like, how would you commit a crime with a beach ball? You know, then I’m just…
Jack Cushman:
Let’s not. I worked as a lawyer for a while, and I worked some upsetting criminal law cases. My favorites, though, were torts. So I think if you’re looking for the fun, study torts. It’s the law of how you can get paid back after someone accidentally or intentionally hurts you. How do you just go to court and say, well, something bad happened, you should pay me until we’re even? How do you prove what even would be? And when you read a torts book, every case starts with something that feels like the start of a horror movie.
Jed Sundwall:
Mmm.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
It’s like, you know, two brothers were riding a train. The train had no doors. The train was on a high bridge. The train went around a corner, and you’re like, no! And they go on. But I like it when they’re 100-year-old cases and you can kind of have some distance from them.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Well, actually, another thing: you mentioned water rights. And I love that, because again, we work with a lot of environmental data, so water rights always come up. I often like to refer to the Cippus Perusinus. This is text that has been preserved on a stone tablet, or I don’t know what you would call this thing, from
the second or third century BC or something like that. And of course it’s about water rights. It’s basically like, this is our water. So anyway, talk about archival evidence. Okay. Yeah. You go ahead.
Jack Cushman:
Totally.
Jack Cushman:
Yeah, just to plus-one the thing you said: you were saying, well, shouldn’t Source be signing things? And I think that figuring out the technical details of that is such an important thing to sort out. And we’ll have a lot of fun, nitty-gritty design choices in it. But it goes back to this core thing: whenever we pass something from one hand to another, we should write down what was passed. Because it tells you, here’s the chain of people who would have to explain what this is for you to make sense of it. And that can be in court, it can be in research, it can be just,
what is this object and where did it come from? But you have such a wonderful leverage point, because you’re collecting a bunch of stuff where, if you standardize how we get this into a provenance chain now, then from here on it’s going to have a clear record of where it came from. It’s just a wonderful way to be a witness to what has happened, and to start to make it possible for the community to know things more specifically and reliably.
Jed Sundwall:
Yeah. No, I think we’re in a good position to do this. It’s the kind of thing that, when I was building the open data program at AWS, we would not have been able to do. It would be, I think, basically impossible to get Amazon to say, yeah, we’ll validate all this sort of stuff. I think, for good reason, Amazon’s lawyers would be like, that’s not a role that we’re going to play. And then of course my opinion also is quite strong that we should have
differently governed entities do that kind of thing. You don’t want an investor-owned entity to do that, because it’s just not core to the business.
Jack Cushman:
You know, for people who want to get involved in that, right now I think the C2PA coalition is really where that action is. And I was just noticing Amazon is one of the members of that. I think Adobe is really the driver of it. Their vision is: if you take a picture with a camera, pass it to an editor, pass it to a newspaper, at every step of the way, as a photo is handed from one place to another, including through Photoshop, you should get a reliable record of what that person did to it, which is a perfect example of how we use provenance chains.
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah.
Jack Cushman:
They’re making a standard that is right on the cusp of being useful for everyone else too. It’s working with images as its motivating use case, and you can see some parts of it that are really shaped by that. But then you can also see it overlapping almost completely with a general standard for having a provenance chain that gets passed around with a piece of data, where whenever someone touches it, they add on what they did to it, and then they pass it on. And I think if we can get there, it’ll just unlock, like,
a correct answer for how we’re all supposed to be doing this thing. We’ve made our own standard for how to attach provenance to web pages, the WACZ signing standard. We have our own way that we did it with BagIt here. But if we can get this thing to be a generally applicable “here’s the right way you pass things around,” it’ll be so powerful. And I throw that out here because, as you said, incentives can be weird for large corporations. And if one is driving it especially, it can end up kind of
Jed Sundwall:
Mm-hmm.
Jack Cushman:
overly shaped by the ways that they can see it helping, and under-theorized in others. This is just such a good time for people to pile in and help it be useful to everyone. I think OpenAI and the AI platforms have gotten interested in this as a way to say: if you want to prove where this came from, if you’re not trying to hide that it came from AI but trying to document it, here’s how you would document it. And that’s a good sign, because it’s such a different use case, but I’d love to see more of that in there.
Jed Sundwall:
Yeah, yeah. Okay, well then we will. Just making plans for 2026. All right, so there are two other things I wanted to touch on. We still have time. Again, as you said before we started streaming, if we really wanna make it to the top of Spotify, these things need to be three to four hours long, but we’re not there yet. You mentioned once this idea that the internet has created this kind of…
Jack Cushman:
Totally.
Jed Sundwall:
I would call it just sort of a distortion, or it creates this illusion that data is safe when it’s not. It kind of directs everybody into having just one copy somewhere. Could you expound on that? Or did I represent that right?
Jack Cushman:
Yeah, absolutely. That’s exactly right. We’re calling it the one copy problem. And the summary of the one copy problem is that all of the data that we rely on is very fragile. That’s the urgent thing. But the why it’s fragile gets really interesting. And it really comes from the economics of having the internet be very reliable, counter-intuitively. When you’re studying the internet, there’s this network diagram that gets passed around a lot. There are layers. There’s an hourglass where you have like
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
IP and TCP and the DNS system and browsers and applications as a bunch of layers that each take care of their business and let the layers above and below them take care of theirs, so the whole system works. So if you picture our data preservation system as layers, a layer that works incredibly well is the ability to reach out and contact a website. Cloudflare was down yesterday, and everyone is talking about it. It makes headlines that there are some websites you can’t reach within a second right now.
But we’re used to, almost all the time, almost all websites anywhere in the world, you can get in under a second. It’s incredibly robust. If you looked at it in terms of how often is it online and how reliable is it, the system is designed very well and it works very well, and it gets you things immediately with no complaints. There are exceptions to that, but it’s a reasonable way to think about what the internet is and how it works. And that reliability ends up creating fragility in other places in the stack.
Because when you have two versions of something, it’s equally easy for the entire world to all go to one. There’s no kind of incentive to be like, well, this one’s down sometimes, this one’s down other times, this one’s closer to me, this one’s farther away. No. If you have, like, CDC data one and CDC data two, a crowd is going to kind of pick one. And then that one is going to gain momentum, and they’ll tell each other about it. And pretty soon, 100% of people will be going to CDC data one, and no one’s going to CDC data two.
And after a year or two, someone’s going to say, why are we still paying for this thing that no one uses? And it’s going to come off the budget. And that’s true for governments. It’s true for nonprofits and public interest preservation. It’s true for corporations and redundancy, whether we’re storing our archive as the New York Times or Amazon or whoever. Because of the reliability of the networking, we do this economic process of
putting 100% of our reliance on one copy, 0% on the other, and then deleting copy two. And it means that our memory becomes really fragile. There was a story just a few weeks ago of a fire in South Korea that destroyed 800,000 federal workers’ data. And you’re kind of like, oh, what idiots. If I was the sysadmin, I would never have forgotten to do whatever. But no, actually, all of our data is that fragile, where a systemic shock like a fire really could destroy it.
Jack Cushman:
Some is very well backed up, but most of it is subject to one or more correlated failure modes. So you’re not necessarily picturing that they only had one hard drive and should have had two hard drives. You have to picture: they only had it behind one administrator password, and if someone stole that, it could be deleted, and it should have been behind multiple. Or they only had it in one geographic region. It was all in Amazon’s data centers in Virginia, or it was all in California, and when there was a large-scale disaster, it got lost. Or they only had it on one brand of hard drive. And when that
Jed Sundwall:
Yeah.
Jack Cushman:
brand failed, it failed. Or it was all paid for by one source, and when that source changed its priorities or changed its policy, it got deleted. There’s a paper from the early 2000s from LOCKSS that lists their threat model. And they list, I think, about 14 of these kinds of correlated failures. Only one government. Whatever you can think of that is a failure mode. You could even go to, well, it’s only on one planet, and start to think about how to fix that. But for now, even on one planet, there’s a lot of correlated failure.
Jed Sundwall:
Right.
Jed Sundwall:
Right.
Jed Sundwall:
Yep, right.
Jack Cushman:
And so the problem becomes like, how do you beat economics? Like, how do you beat market incentives to have only one copy that is subject to correlated failures for stuff that matters mostly to posterity? We have a public data project, and I’ve thought a lot about what that means, public data. And really the way that I think about it is public data is data that is mostly valuable to people outside of the data custodians. Like, if you’re a company and you collect, you know,
Jed Sundwall:
Yeah.
Jed Sundwall:
Mm-hmm.
Jed Sundwall:
Interesting.
Jack Cushman:
Internet visitor statistics so that you can model traffic and make ads better. That’s private data. You’re collecting it. You’re using it. You’re paying for it. If you delete it, like, you’ll be the one who’s sad about it. If you’re a government and you’re collecting, you know, what have been our tariffs over time, what have been our school crowding over time, you’re doing that primarily for the benefit of people besides you, the person making the spreadsheet, or even your department, but for the world to be able to navigate properly. And so there’s a kind of incentive misalignment.
the people who most value it are not in the room or able to advocate for themselves necessarily. And if you start thinking about what are all the kinds of data where people besides the ones holding the checkbook might care, it’s certainly things like government data sets, but it’s also things like the New York Times archive, all of the archives of news that are behind paywalls. Even like, I don’t know, YouTube. YouTube in many ways is the most important record of a bunch of things that have happened in the last 10 or 15 years. And like,
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
There's Google's interest in preserving that, and then there's society's interest in preserving it, and that's very hard to theorize. So public data becomes this kind of misalignment problem: we need to invest in something where the people who care are not here to advocate for it. And that's what I think of as the one-copy problem. Where do you intervene in the economics of this thing so that we can start to have durable memory of the stuff we most care about?
Jed Sundwall:
Right.
Jed Sundwall:
That's fascinating. I mean, you know, we've talked about this: I'm very interested in raising an endowment. That's going to be a huge area of focus for us, because, going back to this discussion of focusing on the commodity layer, the very good thing about the tech sector that we have right now is that there is competition in it and there's plenty of downward pressure on pricing. And I think we can forecast costs well enough to endow the long-term preservation of data.
And what that could open up is you could say, look, we've endowed this dataset, or, I should use my own terminology, we've endowed this data product to be available via these URLs for 50 years. Would you like to endow a copy of it? And we are at the point where, if it's a terabyte of data, that's thousands of dollars. I mean, don't get me wrong, it's a real thing, but it's a one-time check that
Jack Cushman:
Yes.
Jed Sundwall:
a philanthropist can write, you know, it’s not, yeah.
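The arithmetic Jed is gesturing at can be sketched like this. All of the inputs below are hypothetical assumptions for illustration, not figures quoted in the episode: a present-day object storage price, two replicas, and a steady annual price decline.

```python
# Back-of-envelope sketch of "endowing a terabyte." Assumed inputs:
# $0.023/GB-month object storage today, two independent replicas,
# and storage prices falling ~10% per year.
def endowment_cost(tb=1.0, price_per_gb_month=0.023, replicas=2,
                   annual_price_decline=0.10, years=50):
    total = 0.0
    monthly = tb * 1024 * price_per_gb_month * replicas
    for _ in range(years):
        total += monthly * 12
        monthly *= 1 - annual_price_decline  # storage keeps getting cheaper
    return total

# One-time check, on the order of thousands of dollars under these assumptions.
print(f"${endowment_cost():,.0f}")
```

Under these (made-up) assumptions the 50-year total lands in the low thousands of dollars, which matches the "one-time check a philanthropist can write" intuition; the real number depends entirely on the pricing and replication choices.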
Jack Cushman:
I think it's such an important provocation or design goal. Why can't you endow a terabyte? If you're like, this terabyte should exist for the next 50 or 100 years for humanity, why can't any of us make that choice and say, yes, I'm going to invest in making that possible? I don't know what apparatus you would use for that now. Actually, if you're a Harvard professor, I do know: I would tell you to use the DRS, the Digital Repository Service, that was founded about 20 years ago and is going through a whole reinvention right now.
I think some big institutions have learned how to think about this for themselves. But how do we make it something that is available, not just at Harvard, but across the world, if you have something you care about, how do you endow it? I love that question. I think it’s such a good approach to it to start to realign those incentives, to say that someone now, today, can make an investment in something to pass it to posterity. And then the other thing I love about it is it makes you start to think about
Jed Sundwall:
Yeah.
Jack Cushman:
What does it mean to last for 50 years? What steps should you take with that money, the money that you're handed when someone endows a terabyte? And how do you defend against all of those correlated failure modes that LOCKSS laid out? I think the gnarly thing, the tricky thing at the end of that thought process, is you probably actually need multiple mutually independent institutions to be involved. Because
Jed Sundwall:
That’s right.
Jack Cushman:
you, Jed, become a single point of failure, like, well, if I can buy you, I can end this thing. And that can't be how it works either. So there's a bunch of strategies, but how do we make it so that there is no one of us who can disappear and have the thing disappear?
Jed Sundwall:
That’s right.
Jed Sundwall:
That's right. Yeah. Oh man. Okay. Well, one last point I want to bring up: let's talk about the Smithsonian really quickly, because again, it's very relevant to everything we just said. What are your plans with the Smithsonian? I mean, what are our plans with the Smithsonian? You can say.
Jack Cushman:
Absolutely. So the Smithsonian is our second major data collection after data.gov. And this is something that came up in the data preservation community: whether the Smithsonian's public, out-of-copyright dataset as a whole could be preserved, which is over 700 terabytes stored on Amazon.
Jed Sundwall:
Okay.
Jack Cushman:
And over 700 terabytes becomes enough that most projects are kind of, we can't take that on, that's too big a goal for us. And our public data project felt like we were able to do that, able to make a first collection. And then we talked with you, and very fortunately, you felt like you were able to take it on with us and move it to Source. So we start with this kind of giant blob of 700 terabytes that is really quite an undertaking for
our kind of community. It might not be a huge undertaking if we were Google, but for who we are, it's a big thing. And now we have it. And what we have right now is just a straight copy: get a copy from here, move it to here. I think the first thing we'll do is sign it, just like we talked about with the other thing. Just say: this is the copy I made, and I made it on this date. And from then on, you won't need me to be around to know that this is exactly what the Smithsonian had. But beyond that, we have to start thinking about access.
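The sign-the-copy idea Jack describes can be sketched roughly like this. The files, date, and manifest structure below are invented for illustration; the episode does not specify the real workflow or key handling, and the actual cryptographic signing of the final digest is only indicated, not shown.

```python
import hashlib
import json

# Record a content hash for every file plus the date the copy was made,
# then hash the whole manifest. Signing that final digest with a private
# key (not shown) is what lets anyone verify the copy later without
# needing to trust, or even contact, the copier.
def manifest_entry(path, data):
    return {"path": path, "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data)}

# Hypothetical two-file collection standing in for the real 700 TB archive.
files = {"item1.tif": b"...image bytes...", "item2.json": b"{}"}
manifest = {
    "copied_on": "2025-01-01",
    "entries": [manifest_entry(p, d) for p, d in sorted(files.items())],
}
digest = hashlib.sha256(
    json.dumps(manifest, sort_keys=True).encode()).hexdigest()
print(digest)  # the single value you would sign and publish
```

Sorting the entries and serializing with `sort_keys=True` keeps the digest deterministic, so two people hashing the same copy get the same value.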
And how can people actually benefit from using that thing? One of the things I'm really excited about is whether we can make a kind of access copy that is much smaller and that you could just have for yourself. It's very common with these kinds of preservation datasets that you have a preservation version that is, say, uncompressed full-color images, which can be very large. And that's one of the sources of your 700 terabytes.
But if you accepted a small amount of compression, even visually indistinguishable compression, you could get down to 10% of the size. So I think exploring that: is there an access copy that is more like 70 terabytes instead of 700, that you could just have on your desk? 70 terabytes is still a lot, but you could get an enclosure that you could just plug into your laptop and say, the Smithsonian collection is here on my laptop to talk to. So I love that aspect of it. And then the other piece is we have to figure out discovery. What do you do when you just have
a collection that size land in front of you and you don't understand what's in it? There's one approach that is like, when you click a file, you should be able to try before you buy and see what's in there. But the other approach is, at a millions-of-files level, how do you get a view of, in general, what's in here? What am I going to find if I start sifting through this? It's what people call exploratory data analysis, but I think we have to democratize that and not have it sound like something that only data scientists do.
Jack Cushman:
Or law firms do it too. Here’s the hard drives of your client or the opposing client and just figure out on the hard drive. That’s called forensic analysis. And I think both forensic analysis and exploratory data analysis, we have to move past that to what can I click to understand what I’m looking at? How can we make this more something that everyone can get their hands on?
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Well, actually, that's crazy, because you just teed up next month's episode: the live stream webinar podcast thing will be with Matt Hanson to talk about the SpatioTemporal Asset Catalog, STAC. So this is a metadata spec that has been very rapidly adopted within the geospatial world, and it solves that collection-level problem that you described, which is basically: I have a collection of spatiotemporal assets. The
most common example you would think of is a collection of satellite imagery or drone imagery or something like that. And what it is, is you give people a JSON file at the root that says: here be spatiotemporal assets, collected between these times and covering this spatial extent. So immediately you can kind of tell, is this a timeframe or an area of the planet that I'm interested in or not? And you can move on, right? And those can be indexed, so you can search them.
Jack Cushman:
Yeah, that makes perfect sense.
Jed Sundwall:
That notion of figuring out how to distill a collection at that high level, so that at least you've standardized: here's a bag, any kind of metaphorical bag of a collection. What are you going to say? Like, this is the universe it contains. Do you care or not? And move on. So this is a perennial issue. Yeah.
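The root-descriptor idea Jed describes can be sketched like this. The field names loosely follow the STAC Collection spec's extent object; the dataset id, coordinates, and dates are made up for illustration.

```python
# One JSON document at the root stating what the "bag" contains, so a
# reader can decide whether to care without downloading anything.
collection = {
    "id": "example-imagery",
    "extent": {
        "spatial": {"bbox": [[-125.0, 32.0, -114.0, 42.0]]},  # W, S, E, N
        "temporal": {"interval": [["2018-01-01", "2023-12-31"]]},
    },
}

def overlaps_bbox(bbox, query):
    # Axis-aligned bounding-box overlap: do I care about this bag or not?
    w, s, e, n = bbox
    qw, qs, qe, qn = query
    return not (qe < w or qw > e or qn < s or qs > n)

# Does the collection cover a query area around Los Angeles?
print(overlaps_bbox(collection["extent"]["spatial"]["bbox"][0],
                    [-119.0, 33.0, -117.0, 35.0]))
```

The same one-line check against the temporal interval answers "is this a timeframe I'm interested in," which is exactly the care-or-move-on decision described above.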
Jack Cushman:
Yeah, if I could connect it, trying to wrap it up a bit, I think geodata is out ahead here because geodata has always had this problem. You go to Google Maps, and you can zoom out until you see the whole world. And then you can zoom in until you see just one block. And structuring the data to allow that, to be able to jump in and out and see the right level of detail when it’s all the same data set.
Jed Sundwall:
Yeah.
Jack Cushman:
has meant that geodata has to be very thoughtful about how the data is stored and indexed so that it's discoverable by the software that needs it, efficiently, which is just what we were talking about with how we index our data.gov viewer so that it can be fetched efficiently. We need to start thinking that way about that very clever structuring of data across the board for making things available, and kind of picture it like: we want to enable for everyone that Google Maps experience,
Jed Sundwall:
Right.
Jack Cushman:
that if you want to, you can zoom out and see the world of the 700 terabytes, and if you want to, you can zoom in and see the block. And you should be able to do both of those, and do them very cleanly, which for that community is completely obvious, has been true the whole time, and has wonderful technology behind it. How do we take that technology and make it work for any dataset, I think, is a great challenge. I'll also say, I'm always kind of looking for where the bigger industry is headed, and I think AI is a huge industry that blows us in a direction.
One thing that we're going to find as data people is that indexing is critical to AI research and AI practice. From a library perspective, using an information tool, there's a question of, is the model smart enough? There's a question of, does the ground truth even exist? Is it possible to fetch it? But in between those two is: do you have an index that can get the correct answer, instead of the wrong answer, into your model's context when you need it? And if you can do that, if you have those indexes, then you can make
Jed Sundwall:
Right.
Jack Cushman:
data tools that actually empower individuals, which is what we think about at the library. And if your indexes are bad, then you're going to get the wrong answer in context, and it's going to hallucinate or tell you the wrong thing, and it's going to disempower people. It's going to hurt them. Which means we have this weird position, and I think it's a surprise for me as a library person and maybe a surprise to other folks, that all of a sudden, indexing is cool. How you index your data is going to really matter. And I think it's such an opportunity for us, because we've been thinking about indexing forever. And now that it's cool, let's figure out what we know about it that is cool, that we can share.
Jed Sundwall:
Yeah, your day has finally come. So as we wrap up, your mention of how the geo community is ahead here is, I'm sure, flattering to those of our community who are listening in. But we did get one comment on LinkedIn from Linda Stevens, who we've worked with in the past, and she's worked in the geospatial space for a really long time. She made the comment that you have to certify a map at different layers: you have to track and certify all the layers that make it up.
It underscores the point that you made, that maps are these confections of data that we've been figuring out how to create. I mean, it's such a rich field. Cartography is just amazing, because we've been trying to figure out how to downscale so many things we know about the world into something that's legible for humans, and then assert that in a way that's credible. It's a huge challenge.
Yeah, I would say my theory for why the geo community is out ahead is that most of us gave up on getting super rich a long time ago, as opposed to, say, the life sciences community, where I think there's real gold in those hills. People think they're going to cure cancer and make a ton of money, which is great. I want them to try to do that. But the geo community is just generally much more open, and I think it just has such a long history of sharing information. I mean, it's
Jack Cushman:
Mm-hmm.
Jed Sundwall:
core to what we do.
Jack Cushman:
Maybe try checking your maps for any hills that have gold in them. It’s probably worth a shot.
Jed Sundwall:
We already did that, you know, that's the point. Like, yeah, we already found those. Yeah, I mean, don't get me wrong, in recent years it's been lithium, you know, there's always going to be something else. There's money in understanding spatial data for sure, but the mad rushes are over, and there's a huge community that's just, I think, very generous. And so, yeah.
Jack Cushman:
We found those already.
Jack Cushman:
You know, I love Linda’s point, too, that you do have to certify at every level. I’ve seen some of the work that goes into designing a product like Google or Apple Maps, where things have to appear or disappear as you go in and out. It has to be the right things. That has to be the things I care about at each level. And sometimes it’s better, sometimes it’s worse, as they’re kind of iterating on what should I show you. And it’s such a wonderful little example or crucible for how we do data in general, because you have a bunch of ground truth. People went out and wrote things down.
Jed Sundwall:
Yeah.
Jack Cushman:
It was maybe accurate at the time somebody saw it; it's maybe not anymore. You're integrating a bunch of different views of the world. There's a bunch of research just going into how you tell whether two data points are one store or two stores, all of that integrating of views of the world into one. And then once you've integrated into one view of the world, there's: how do I express this to you so it's not a lie? I could show you a map of your neighborhood where I'm showing you the gas pipes, and you're just confused. I could show you one where I'm showing you the benches and the things that you care about.
And am I meeting you where you are, so that what I'm showing you empowers you instead of disempowering you? And am I doing that without oversimplifying it so much that in fact I'm lying to you, and I'm disempowering you that way? And so it's this perfect combination of seeing the world and getting ground truth, integrating it and deciding which things you're going to believe and which you're not, and then debating, well, how are we going to show this to people so that we're empowering them? What do we share with them? How do we lead them?
Let them get more expertise as they want it. I just love all of the parts of that design problem. And then it's kind of like, now welcome to all the rest of it. What if it was a pile of zip files and some PDFs and some instructions, like the mess of the world? It's something that the data community has thought about for ages: how do you make those wonderful interfaces so that people can find the stuff they need outside of maps, too? I think there's so much more room for us to improve on that, and that'll be really exciting work to do.
Jed Sundwall:
Yeah, well, let's do it. I mean, I think we're very aligned, and we want to create the conditions to let lots of people run those experiments and make that possible. So yeah, let's go. Well, thanks so much, Jack. I think this has been awesome. An hour and 20 minutes, not bad. Yeah. Yeah.
Jack Cushman:
Thank you, Jed. I really appreciate it. Thanks for giving us a chance to talk about this stuff, and thanks to folks for listening. I think we'd love to keep debating more: what are we meant to do, what are we meant to save, how do we save it, and how do we pass on to humanity what we should? I just really appreciate the chance to talk about it with you.
Jed Sundwall:
Okay. Well, we'll keep talking. Thanks, Jack. All right, so we're going to stop, and then…
Jack Cushman:
All right. Take care.
Video also available on LinkedIn

Show notes
Jed talks with Brandon Liu about building maps for the web with Protomaps and PMTiles. We cover why new formats won’t work without a compelling application, how a single-file base map functions as a reusable data product, designing simple specs for long-term usability, and how object storage-based approaches can replace server-based stacks while staying fast and easy to integrate. Many thanks to our listeners from Norway and Egypt who stayed up very late for the live stream!
Links and Resources
Key takeaways
- Ship a killer app if you want a new format to gain traction — The Protomaps base map is the product that makes the PMTiles format matter.
- Single-file, object storage first — PMTiles runs from a bucket or an SD card, with a browser-based viewer for offline use.
- Design simple, future‑proof specifications — Keep formats small and reimplementable with minimal dependencies; simplicity preserves longevity and portability.
- Prioritize the developer experience — Single-binary installs, easy local preview, and eliminating incidental complexity drive adoption more than raw capability.
- Build the right pipeline for the job — Separate visualization-optimized packaging from analysis-ready data; don’t force one format to do everything.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
So I'm going to start it. First of all, happy Halloween, Brandon. Welcome to a special edition of Goth Data Products, in case anyone's wondering why we're both red, if they're watching; those listening to the audio only don't get the benefit of seeing us in this kind of spooky, spooky color scheme. But welcome.
Brandon Liu:
Thanks.
Brandon Liu:
Yeah, so thanks for having me on the podcast. I'm excited to talk about, you know, Protomaps, data, Source Cooperative. So I'm here to answer questions, I guess. Yeah.
Jed Sundwall:
Yeah, no, likewise. Oh, by the way, sorry, I did chicken out, so I've changed the lighting, so I'm not red anymore. But yeah, no, it's great to have you. I mean, when we started this thing, you were really top of mind as somebody who's been very thoughtful about what I would call, what we call, the ergonomics of data: figuring out how to make a lot of data accessible for people. So if you don't mind, let's just start there. Like, can you just
Brandon Liu:
Right.
Jed Sundwall:
How do you describe yourself and what you do?
Brandon Liu:
So the way I describe myself is, I started a project called Protomaps six or seven years ago, and the impetus for this was making it easy to make a map. And the direction that came from was very much just, think about a web developer that is making a website. So for example, they're making a site to look up different cafes in their neighborhood.
They might use something like Google Maps, but that is a proprietary SaaS that they buy. So I really wanted a home-cooked way to make a map, because there are so many things you can publish on the web. You're able to publish videos, you're able to publish pictures or Markdown or HTML, but being able to publish an interactive map has never been that way. So really, the way I approach this is from
the idea of making it accessible for anyone to publish a map.
Jed Sundwall:
Got it. Okay. Amazing. And you've done it. And you've reminded me, so one thing that we were going to be doing, I mean, I'm just going to say these things out loud, which is kind of funny: part of the reason for doing this podcast is that we're doing so much stuff at Radiant Earth and we need more channels to be able to talk about it. So just last week, we put out a white paper, and this will be in the show notes and I'll put it in the chat, but it's called Emergent Standards. What you said is just very relevant to this, which is that in the paper, I argue that the web has turned out to be an engine that helps people come up with new data standards. And if you look at it through that lens, you have HTML, which is, let's share a document in hypertext, you know, hyperlinked documents with one another.
And then you end up, you're like, well, what if I don't want to load up a webpage, but I want a feed of updates? And so RSS emerged out of that. GTFS emerged out of the need for standardized transit information. And I would say what you're doing, specifically with PMTiles, is a way to do this for vector tiles.
Brandon Liu:
Yeah, I have a lot of, I guess, thoughts about the idea of standards in general, both on the web and also for geo. A lot of the web we think about as standards; for example, HTML evolved very early. And maybe on the early web there was more of a design phase, where people would collaborate on creating some spec and that became a standard.
Nowadays, what you see is more like, one of the big companies that makes browsers, like Google or Microsoft, makes everyone adopt a standard because it's in their incentive to do so. If Google can convince everyone to use, what is it, like JPEG 2000 instead of plain JPEG, then they can reduce the amount of bandwidth on the internet by 20%. And all that tech
around things like serving video, serving audio and images is very mature, to where you don't really see a lot of emerging standards being adopted organically. It's more like, there's this committee at these huge companies that all collaborate on a standard. There are some examples of smaller-scale solutions that became adopted, though, and that's really how I see PMTiles fitting in with them.
It's like, I don't want it to be top-down. I don't want people to make their organizations adopt PMTiles. I want people to use it because it solves the problem for them. There's a really cool format for images that I like. It's called QOI. I think it literally stands for the Quite OK Image format. It's very modest, like its name. But I think it was just one guy who came up with
Jed Sundwall:
Right.
Jed Sundwall:
Okay.
Brandon Liu:
a way to do lossless compression of images that is a lot simpler than PNG and is good enough. It's not more optimized, but it's way faster to decode on a CPU thread. And that is one good example of, not a standard from a standards body, but something that had a simple design that became popular. And it was not adopted because it's like,
Jed Sundwall:
How popular? I’ve never heard of it.
Brandon Liu:
I think it's used, the original motivation was for games: if you have game assets and you need to be able to decompress them and move them around in just raw RGB formats, then QOI is supported by some of those engines. But actually, another one you mentioned is GTFS. So GTFS is more geo-adjacent, and that also came out of, I think, Google's requirement to have some
systematic way of storing transit routes. But it wasn't some sort of consortium of transit agencies that came together to design this CSV format. It just became a widely adopted solution because it happened to be good enough.
Jed Sundwall:
Right.
Yeah, well,
Brandon Liu:
And that’s really how I see PMTiles. Yeah.
Jed Sundwall:
Yeah. Well, so yeah, I mean, it happened to be good enough. Also, Google had this cruise ship that everybody wanted to get on. I don't know who first described it that way, but every transit agency in the world was like, we want our data to be in Google Maps, and so they had an incentive to do that. That's a concept we explore in the white paper, which is that you do need this mix of good-enoughness, because that is usually where things land: you have something that's
good enough for a lot of people to adopt. They're like, this is fine. I mean, sorry, what's the acronym for the image format? Like adequate, what is it? Quite okay. I love it. That's usually where things land. Like RSS: the story of RSS is a bunch of people fighting and a bunch of attempts at top-down approaches to syndication, until people kind of threw their hands up. But, tellingly, then the New York Times adopted it
Brandon Liu:
quite okay. Yeah.
Jed Sundwall:
and started publishing RSS feeds and everyone’s like, okay, this is what we’re doing now. So it’s fascinating to see, do you have any sense for the traction of PM tiles as being like this? Like who’s using it?
Brandon Liu:
So I have a couple of proxy ways. I don't actually know how many people are using it, because by nature, I can't track it. I can't add a tracking pixel each time someone looks at a map. The one thing I can track is the number of npm downloads. So npm is the package manager for JavaScript, and that is, I think, the most popular client for reading PMTiles. And it's something that I've
Jed Sundwall:
Yeah.
Brandon Liu:
it's something that I maintain, and that crossed 100,000 downloads, per month or per week, I can't remember, this year. So you can see a growth curve of people using this library. Now, I don't actually know if that means anything, because it could just be an automated CI script on GitHub Actions that is downloading it a thousand times. But it has some correlation with usage. So the only way that I can kind of
see if people are using PMTiles, or if it's being adopted, is through this proxy metric like npm downloads. Or people show me a site that is built using it. So actually, probably the biggest one: I think the New York Times had a visualization on their homepage about space debris falling to Earth, and that used a map dataset that was served from PMTiles.
Jed Sundwall:
Okay.
Brandon Liu:
So probably a dataset that's being served on the New York Times front page in PMTiles format is the most high-traffic use of it.
Jed Sundwall:
Okay.
Jed Sundwall:
Here we are again with the New York Times. It's kind of interesting. I mean, you think about the legacy of the New York Times: the story about them sort of crowning RSS the standard for syndication, that's true, they did that. And they do have the imprimatur to do that kind of thing, which is awesome. That's great. That is a sign that you've made it. Shout out to Tim Wallace, who probably had something to do with
Brandon Liu:
Okay.
Jed Sundwall:
with the New York Times using PMTiles. That's awesome. Okay. Well, so one thing I can say, though, is on Source, you know, we host a lot of PMTiles files, and you can correct me on all of this: there are some kind of base map objects, I think, that are in there or something like that. But if I search GitHub, it's one of my favorite things to do, is to search GitHub for
references to the Source data proxy, which is data.source.coop. And as of earlier today, it's like 612 results that pop up when I search for it. But a lot of them are to PMTiles files. Do you know anything about, do you have any insight into, that?
Brandon Liu:
Right. So the project I run is sort of an umbrella project called Protomaps, and PMTiles is just one part. And that was by design, because I never thought it would be good enough to just design a format. If you design a format, then you also have to have some killer app that makes people actually care. Because just having a spec with some implementations, people are like, that's cool, but
Jed Sundwall:
Or aware of that? Yeah.
Jed Sundwall:
Yes, yeah. Okay.
Jed Sundwall:
All
Brandon Liu:
I can't immediately take advantage of it. So the way I approached it was to have a killer app, which is a base map, or what people think of when they think of a map: you look at it and there are city names and there's water and roads and stuff. That's based on OSM. So the actual data product that is open source and free by default in the PMTiles format is this base map
that's from OSM. And I think a lot of the links to Source are to that, because, going back to what I started with, people just want some solution for showing a map on their site, you know, an open source replacement to Google that they can run themselves, that they can copy, that they can move around, that they can download, as if it was a video or an image. But I imagine a lot of the links are to that just because it's designed to be something that's immediately useful.
Jed Sundwall:
Yeah, that’s what I was guessing.
Brandon Liu:
Now, I think with Source, the CORS policy is quite open. So if there are other datasets, like a scientific dataset that is in PMTiles format, people could link to that. And hopefully people do that more, or they download from Source and mirror to their own buckets and use that.
Jed Sundwall:
Yeah. Yeah. Yeah. So I mean, this is something we're going to have to do our own analysis on at some point, which is: what is the cost of us hosting those objects? Because yeah, our CORS policy is wide open, so people can do that. And we can do the math on this, but, I mean, you know, shout out to AWS. Thank you to the AWS Open Data Program, which still exists
after yesterday. Anyway, it was a tough day for a lot of people at Amazon yesterday. There were a lot of layoffs, but the Open Data Program is alive and kicking. And they subsidize all of our storage and bandwidth for Source. But we do want to get serious about this at some point and have an understanding of, how much should it really cost to do something like this, and at what scale? We have
Brandon Liu:
Yeah.
Jed Sundwall:
all the analytics we need, we just haven't sifted through the data yet to figure out which of those objects are being hit the most, and how much, and what's the throughput that's going out. Because I know you've done analysis on the costs of doing these things. I imagine you have some data on how much it costs to deploy PMTiles. We also have a lot of this data, but we just haven't shared it yet. So.
Brandon Liu:
Right. So going back to that for a moment, though: I wonder, if you think about that idea of searching GitHub for all the links to Source, for people that are hotlinking to it, in some sense I think it's not directly correlated to success, just the number of people that are consuming Source. If people are copying the data they get from Source to their own bucket and then using that,
Jed Sundwall:
Yeah. Yeah.
Jed Sundwall:
Of course not, yeah.
Brandon Liu:
that is still using the platform as intended. By design, I don't know if Source is designed to be an intermediary platform. For example, Airbnb: you go to the site and you look up listings, but they will stop you from trying to go off the platform to make an arrangement with your host, because that's exactly against their business model. Right? That's like,
Jed Sundwall:
Yes.
Jed Sundwall:
Yeah.
Brandon Liu:
So for Airbnb, the entire point is that they're an intermediary between you, your desire for a room, and the host. Now, I don't think Source is designed as a data platform to be an intermediary for all data. There are a lot of open data platforms in the past that have worked that way, where they make it very difficult for you to consume the data outside of the platform. But it feels like, with the cloud-native focus, part of the idea is that you're able to
Jed Sundwall:
Right, right.
Brandon Liu:
you know, just package up data and take it to go, or access it just in chunks, instead of having to be locked into just using Source. So if there was some way to promote that as a first-class way to consume Source, instead of just linking to assets, then maybe that would help alleviate some of these ideas around cost sharing for bandwidth.
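The chunked access Brandon describes is just plain HTTP Range requests against an object store. A minimal sketch follows; the URL and the `fetch_range` helper are hypothetical, invented for illustration, but any S3-compatible store honors the standard `Range` header.

```python
# Illustrative sketch only: the URL and helper names are made up, but the
# Range header itself is standard HTTP and works on any S3-compatible store.
import urllib.request

def range_header(offset: int, length: int) -> str:
    # HTTP byte ranges are inclusive: bytes=0-126 asks for 127 bytes.
    return f"bytes={offset}-{offset + length - 1}"

def fetch_range(url: str, offset: int, length: int) -> bytes:
    req = urllib.request.Request(url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(req) as resp:  # server replies 206 Partial Content
        return resp.read()

# e.g. read just a 127-byte header instead of downloading a whole archive:
# header = fetch_range("https://example.com/data/archive.pmtiles", 0, 127)
```

This is the whole trick behind cloud-optimized formats: the client asks for exactly the bytes it needs, so no server-side application logic is required.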
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah, well, let me address this, and then I want to acknowledge we have a viewer, Sigtil (I'm not exactly sure who they are, but they're Sigtil on YouTube), who is joining us from Norway. So we were like, let's do this at 4 p.m. Pacific. Sorry, everybody in Europe, but we're doing it for Asia-Pacific. And it's, what, 7 a.m. where you are? So we're kind of in a
weird time zone right now. But we had somebody from Norway tuning in to ask what's in the future for PMTiles, and which changes you would like to see in the format itself, or new tools that use the format. Anyway, Sigtil, just don't go to sleep yet; we'll answer your question. The vision of Source is not so much to be an intermediary. Source, by design, doesn't really do much other than provide reliable access to objects.
So we call it a data publishing utility. It's not an analytics tool. I want people to build stuff on top of Source, so yes, I do want people to link to it. However, this is kind of my point in saying we have to do this analysis on our usage: how much is that really going to cost us if we do that? And are there ways for us to
get a handle on bandwidth and usage so that we're not abused? Or rather, abuse isn't the right term, but so that we can afford to do it in a way that's reasonable. And to say: look, if you don't want to host your own objects somewhere, and tons of people don't, that's fine. A core tenet of the product design is that we just know a lot of people don't want to host their own stuff. They don't want to run their own servers. They don't want to think about infrastructure at all. If we can
let them just link to reliable assets that are available, that's great. But we have to figure out a way to do that that could scale to the usage of something like Google Maps without bankrupting us, you know? And that means we have to figure out, for example, with the open CORS policies, do we have to have some sort of way to say: no, you have to be put onto an allow list
Jed Sundwall:
to be able to link to this, or something like that. We're going to have to figure that out. So you're right that I don't want to be an intermediary. We're not really trying to lock people into Source, but we do want to provide a service that allows people to access data without having to download and re-serve their own copies if they don't want to do that.
Brandon Liu:
Right. I mean, on the other hand, I feel like part of the messaging is that having object storage is a commodity. And in my experience, talking to developers that use PMTiles or other cloud data formats, a lot of people find using S3 very accessible, and it's not a huge lift to ask them to go put this thing in their bucket. And that's even among non-experts.
Like, I would say you could just be a front-end developer, someone that spends all their time doing TypeScript programming and knows nothing about servers, and you can figure out object storage. So part of the point I'm trying to make is exactly that: that audience, of people for whom it's too much of a lift to host something like a server, is extremely large.
Jed Sundwall:
That’s my story. Yeah.
Brandon Liu:
But just putting a thing in a bucket is actually like a very good experience. It’s very simple, it has a nice abstraction. And if you can sort of encourage the world to be more object storage-y, that’s the way I think about it. And that’s a big part of why I think PMTiles as a format has succeeded is because that audience is so large.
Jed Sundwall:
Yeah, totally. I mean, so yes, I agree. I'll just tell a bit of history. I've told this story a million times, and I'll probably tell it a lot as we keep doing this podcast, but the origin of the Cloud-Optimized GeoTIFF and all this was when I found myself at AWS building this open data program, and I figured out this one weird trick: I could just get the company to give out free S3.
But I had no engineers. I was embedded within a sales organization, so, due to HR practices, the idea of hiring engineers to build software or tools or anything was out of the question. And so I'm like, what can we get away with if we can only use S3? And I, being kind of, I guess I would say, a front-end guy, although I've never been officially hired as an engineer, loved S3.
It's a very intuitive product, super powerful, very capable. I wasn't afraid of it. And so I'll say this: you, a very talented, smart person, know how to use S3, aren't afraid of it, and neither are your friends. But there are tons of people out there that are afraid of S3. Which brings me to Source, and actually I've got to shout this out: we've been working with Development Seed on Source, and Anthony Lukash, shout out to Anthony at Development Seed, has been
just cranking out new features on Source. Today we pushed out the ability to upload stuff into S3 through the browser, through Source. So for Source users now (you still have to be invited to be a Source user), you don't even have to use the CLI. You don't have to look at the AWS console. I'm just here to tell you there's a whole universe of people out there that are like:
no, I am scared of S3. I'm scared of AWS. I don't want to look at that console. And I saw some tweet somewhere, in reference to Vercel or something like that, that was sort of like: it's amazing how big of a business you can build just by building an abstraction layer on top of the AWS console. And that's really what we're trying to do. And in fact, I do hope there will be people in the future… I mean, we already have a…
Jed Sundwall:
a bunch of other organizations that are hosting their own PMTiles on Source; they would rather put them on Source than host their own S3 bucket, or rather, manage their own AWS account. So I'll leave it at that. Let me make sure… I'm hoping Sigtil is still awake in Norway. Do you want to take this question? What's in the future for PMTiles?
Brandon Liu:
What's in the future? I would say the current version of the spec, version three, is done. There aren't any plans for a version four right now, and I think I kind of got lucky in that sense. Someone at a conference last month in Japan asked me: do you have any regrets about the format design right now? And I thought about it, and, not really. It's not perfect; the design overall has very specific trade-offs, you know?
Jed Sundwall:
Okay.
Jed Sundwall:
Okay.
Brandon Liu:
It's almost stupidly simple in some sense, and I didn't want it to get too carried away. I didn't want to embed CRS information and that kind of thing. I would say the lowest-hanging fruit for PMTiles is better compression methods, but that's blocked on browser implementations. Browsers only support gzip for the DecompressionStream API. If that supported something like Zstandard,
that would be great, but that is blocked on Apple, Microsoft, and Google implementing Zstandard support. What changes would I like to see in the format itself? The format itself is, right now, good enough for static data. I would really like to see another format emerge for dynamic data that is still S3-optimized,
one that handles rapidly changing data. Because right now, if you edited some geodata and created a PMTiles, you'd have to replace the whole file on object storage, and that is a huge trade-off. Thankfully, a lot of the data out there is static enough: you can generate this buildings dataset once, and maybe once a month you run a new job and it generates a new one. Each time, you're replacing it.
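On the compression point above: gzip is the one codec you can count on essentially everywhere, in Python's standard library as below, and in browsers via `DecompressionStream("gzip")`, which is why it is the safe default for tile archives today. A small round-trip sketch:

```python
# gzip ships in the standard library; Zstandard would need a third-party
# package (and, in browsers, vendor support that doesn't exist yet).
import gzip

tile = b'{"type":"FeatureCollection","features":[]}' * 100
packed = gzip.compress(tile)

assert gzip.decompress(packed) == tile
assert len(packed) < len(tile)  # repetitive data compresses well
```

Zstandard would typically compress better and decompress faster, but a format that assumes it cannot be read in the browser until Apple, Microsoft, and Google ship support.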
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Yes.
Brandon Liu:
What I really want to see is a cloud-native storage engine for real-time data. That would be a totally different design than PMTiles, but I think it's still possible to do a cloud-native thing on S3, for example, where maybe you have data in chunks, those chunks are addressed by a hash, and you have a header that is just a reference to hashes. Then, as you upload new data or data changes, you create new chunks, reference those,
and garbage-collect the old ones. So I would like to see some other new formats, separate from PMTiles, that address real-time data. In terms of new tools for the format, along this line, one experimental tool I have for PMTiles is a way to do deltas. You have to replace a PMTiles on S3 each time, but I was thinking about a way to rsync the data.
Like, if you have a 200-gigabyte PMTiles in the cloud, and you have one on your desktop, and they're mostly the same but one part has changed, you can use an algorithm like rsync to just fetch the parts that have changed. So that's one way, from the cloud to your computer, not the other way around. But I would like to see some use cases for that, because I sort of built it as an idea.
But there's not really a strong, compelling use case right now. So those are a lot of my ideas for the PMTiles ecosystem right now.
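The chunk-and-hash design sketched here can be shown in a few lines. This is a toy for illustration, not a real format: the chunk size, the in-memory "store," and the header structure are all invented, but they capture the idea that an update only writes the chunks whose hashes changed, and an rsync-style sync only fetches those.

```python
# Toy content-addressed chunk store: all names and structures are invented
# for illustration; this is not PMTiles or any real format.
import hashlib

CHUNK = 4  # tiny chunk size so the example is easy to trace

def chunk_hashes(data):
    """Split `data` into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def upload(store, data):
    """Write chunks into the store keyed by hash; return the 'header' (hash list)."""
    header = chunk_hashes(data)
    for h, i in zip(header, range(0, len(data), CHUNK)):
        store[h] = data[i:i + CHUNK]  # unchanged chunks rewrite identical bytes
    return header

def delta(old_header, new_header):
    """rsync-style sync: only chunks missing from the old version need fetching."""
    return set(new_header) - set(old_header)

store = {}
v1 = upload(store, b"AAAABBBBCCCC")  # three chunks
v2 = upload(store, b"AAAAXXXXCCCC")  # one chunk edited
# delta(v1, v2) contains exactly one hash: the edited chunk.
# Chunks referenced by no live header could then be garbage-collected.
```

In a real system the store would be an object bucket, the header would itself be a small object, and readers would fetch the header first and then only the chunks they need.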
Jed Sundwall:
Okay, I love that.
You're unearthing some feelings about Source. So we want Source to be kind of a one-to-one proxy for S3, but the idea being that we can create durable URLs that are undergirded by
as many object stores as we want. So if you have an object, you should be able to mirror it in lots of different regions and across clouds. And if you have your own S3-compatible object store, we should be able to point to it, and stuff like that. But a really interesting thing happened. If you go to (you'll have to look around for this) the data.source.coop repo on GitHub, which is the repo for our data proxy: this guy Sylvain Lesage, who we've been working with on viewers,
you've encountered him on GitHub. He's like, it's weird: Hugging Face can stream CSVs, but S3 can't. And he looked into it, and it had something to do with some header stuff I don't remember the details of. But it was an easy add to the proxy: it basically just passes some more information in the headers when you're calling the CSV, and then you can stream the CSV. And so
we've crossed that line. We're going to do something the S3 API doesn't do. And I can see us going down a path where we are
Jed Sundwall:
more than just a very simple abstraction on top of S3; we're extending what object stores can do. So we should keep talking about that.
Brandon Liu:
Right. And also, going back to the idea of a top-down versus a bottom-up standard: S3 has become a de facto standard, a largely undocumented standard, where every other vendor sort of only implements the features they need to be S3-compatible. And if something is wrong or broken, they're like, well, that's how S3 works, you know? So it's become this odd thing, where this quirky design that Amazon came up with
Jed Sundwall:
Yeah. That’s right. Yep.
Jed Sundwall:
Right. Right.
Brandon Liu:
is now what everyone has to do, de facto, because all the tooling is built with the assumption that this XML API exists. They're trying to do new things, though. There's S3 Express One Zone, which works differently. And there's, I think, a new way to do partial uploads: you can define an upload as being copied from a different object, and that's accelerated.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
But yeah, it would be cool if some other company came up with an actual, maybe more featureful, spec for S3. But again, probably why it succeeded to the point it has is because it's so simple. It's dumb, you know? There are no fancy semantics around content hashes and stuff. If you look at how Google Cloud Storage works, it does seem like they had some…
Jed Sundwall:
Yeah?
Jed Sundwall:
Well, right.
Brandon Liu:
whatever, level-seven engineers sit in a basement for months and come up with some cooler design that is more correct or more scalable. So there are platforms like Google Cloud Storage that seem to have more sophistication than S3, but they don't have the adoption of S3 in terms of the API: not the specific Amazon platform, but the API, the interface. And I think that is a fundamental thing, which is there's always going to be this trade-off between
Jed Sundwall:
Yeah. Yeah.
Brandon Liu:
the simpler and dumber you make it, the more likely it is to thrive organically, in terms of people being able to write their own implementations and write tools. That, I think, is also the trade-off between something like PMTiles, which, like I keep saying, is simple and dumb, versus something more full-fledged, like a server application that serves WMS tiles, for example.
Jed Sundwall:
Right, yeah. I mean, we just have to be very careful with how we go about this. I imagine you're familiar with the concept of pace layering, or pace layers. Have you heard of this? Yeah, so I'm just going to be putting stuff in the chat. It's an idea I think Stewart Brand came up with, which is basically the notion
Brandon Liu:
I don’t think so.
Jed Sundwall:
that society, our experience as humans moving through the world, is based on all these things that are moving at different rates. Nature undergirds everything; on top of that we have all kinds of different life forms, and then humans have developed culture and governance and law, and language itself. But these are all layers that evolve at faster and faster rates.
The funny thing is that the top layer of the pace layer diagram is always fashion, which is all over the place. Fashion is this kind of unpredictable, crazy thing that humans do, but it's based on these other, more foundational things like markets and law and language and so on. And so that's how I think about it. I mean, I was at Amazon for eight years, and I totally bought into
the philosophy of AWS, which is to provide primitives: to provide primitive services that are reliable and extremely durable. We had an AWS outage quite recently; things go wrong, but it's a pretty remarkably stable service considering how complex it is and how much stuff it supports. And the way they do that is by being very primitive.
I would say, to your point, there's obviously room to extend that. And I think the right way to go about it is to extend on top of the primitives, but to go slowly. You want to add layers very carefully on top.
Jed Sundwall:
All right, let's see here. Let me make sure that we're… I'm figuring out this chat stream thing. I can see it here in Riverside. Sorry, everybody out there; we're still figuring out how to do this. So I'm curious: when did you realize you could just do a really huge file? Just one gigantic file.
Brandon Liu:
So I started Protomaps, the project, before I created PMTiles. And the original plan was to have a server process that served tiles out of a database. So the original design was not cloud-native or cloud-optimized at all. It did not use range requests. It was still one file
Jed Sundwall:
Yeah.
Brandon Liu:
that you stored on a server, and you had to run this program to be able to serve it over HTTP. And then I eventually figured out that I could cut out that entire part just by making it something you could put onto S3 as a static file. So that actually came probably one or two years into the project.
And in a lot of cases, that idea of being locked into using the server process to serve the tiles is sort of a feature. For most businesses, if you have to run it on a server, that creates lock-in, you know, and you can monetize that. You can add a paywall. You can say: hey, if you want to access this thing, it goes through the server; just get this API key. Once you go over, like, 10,000
Jed Sundwall:
Exactly, yes.
Brandon Liu:
requests, then you pay a subscription, pay as you go. So it's a feature to have it be a file behind a server versus just a single static object. But then my thinking became: okay, what is the long-term way this project succeeds? And I'm like, isn't it more interesting to have it just be this single object
that you can copy around, as if it were a video? So right, the original motivation for the project was being able to create custom maps and host them yourself. Just the nature of how that was hosted evolved, from being a traditional SaaS-y server thing to being this object-storage-focused thing later on.
Jed Sundwall:
Okay, fascinating. Yeah. I mean, there's
this notion that if you control the server, if you have to be this intermediary, you get to control the data flows and also the users. I was thinking that studying Netflix is a really interesting thing to do if you think about a data business. Netflix is a data business: they sell subscriptions to data. And the way they're able to do that is by controlling the entire interface, the entire chain. And so you have to go through them, pay their subscription, and
have the Netflix experience, which is good. The fact is they provide, and there's a huge audience for, that kind of data, which is videos people like to watch. They've just nailed the experience, and people are happy to pay for it. Whereas there are certainly people out there that are like: nope, you have to have your own DVDs, or I'm going to run my own local NAS with a bunch of my own video files, because I want to have control. But most people are like: whatever, I don't want to have to think about this.
So all I'm saying, I'm underscoring the point, is that there is a business in providing that kind of service to people, but the market for maps is way too small to justify that kind of thing. That's why I think so many geospatial SaaS companies have had such a hard time: they might be able to provide a great experience to get some vectors and rasters and stuff delivered over their interface, but
the market for it is just way too small to justify it. So anyway, I'm a fan of your approach, for obvious reasons. And sorry, let me just keep going, because Rachel Googler on LinkedIn asked something relevant to this. She said: with the AWS outage last week and the Azure issues today (which I didn't know about), we've seen how reliant we are as a society on centralized cloud infrastructure. How can cloud-native formats be used in temporary local-area or
Jed Sundwall:
peer-to-peer networks when that centralized connectivity is gone, such as during natural disasters? I think you kind of answered her question already, but do you want to address that idea directly? How do you think about this?
Brandon Liu:
So I think of the Protomaps project as something that works on a server, or works on S3, but also as something that works on an SD card. If you can put a map, or a dataset from Source, like a scientific dataset, onto an SD card and carry it into the forest, then that is…
that's good enough, right? That's how most technology should work. That's how videos work. That's how Word documents work. So I think once you've built the primitives, it addresses a lot of these questions about portability and being resilient against certain failures of networks, for example. There are some interesting things around peer-to-peer. I know one of the contributors to PMTiles was
playing around with IPFS, which is this distributed storage system where everything is addressed by hash. I think it's cool. I don't know a lot about it, but I'm happy to hear that just designing a simple single-file format means it can be directly applied to, or just works with, these things like IPFS. And…
Jed Sundwall:
Yeah. Yeah.
Brandon Liu:
I haven't seen a lot of adoption for that specific peer-to-peer system outside of some more niche use cases. But in theory, you could build a really resilient network of storage for any kind of data, as long as what you're trying to serve is just these simple files.
Jed Sundwall:
Yeah, yeah. Well, I mean, again, I think the Netflix example is a good one to explain this, to highlight Rachel's point about these single points of failure that can occur. If you are relying on one system to deliver content in a very specific way, and that system is brittle and goes down for any reason, you're hosed. But this is
core to the file-based approach to data architectures, or, as I would say specifically, the object-based approach, because I like object storage: resilience in the face of a system going down. To your point, you can put it on an SD card and take it into a forest. That's perfect; that's a great way to think about it. There's kind of no way of getting around the power and effectiveness of sneakernet. However, this opens the
door to a question that I've had about PMTiles, which is that you've created PMTiles as this format, but if I show up with a PMTiles file on an SD card and give it to a random person, they will not be able to open it. They're going to double-click on it and be like: what is this? How do you get away with that? I mean, yeah.
Brandon Liu:
Yeah.
Brandon Liu:
Yeah. I think it's tough, because it sort of depends on the observer, right? The person opening it: are they opening it on Android? Are they opening it on Windows? Can I go talk to Apple and ask them to put a PMTiles viewer into macOS or something? I think my solution is this web viewer. There's a website called pmtiles.io that I maintain where you can just drag and drop
Jed Sundwall:
Right.
Brandon Liu:
a local PMTiles file, or a URL of a PMTiles in the cloud. So the intention was that the viewer emerged at the very beginning. There has to be, essentially, a file preview for these things that works locally too. You shouldn't have to spin up a web server to be able to look at something. The thing about data is people want to look at it. People don't believe it exists until they can see it.
It's just this inherent bias. The machine can read it, but people don't trust it until they can look at it. And that is a lot of why people care about PMTiles overall: they might have geodata in some format, but if they want to visualize it, they have to turn it into some more visualizable format. And that's really what PMTiles is: making visualization easy. So the answer is, as long as they have a copy of that
web viewer, which is open source, on the USB stick, then they should be able to open it offline in a browser and just open up that PMTiles file. That viewer is built using pretty standard web stuff: it uses MapLibre and some browser APIs.
Jed Sundwall:
Right. But is that all… can that viewer be… This is a very naive question: could you just have an HTML file on that stick that contains the entire viewer?
Brandon Liu:
And a JavaScript bundle, yeah. There is a static build of it, because it's hosted on GitHub Pages, actually, and GitHub Pages is just static files. So you could just clone down a copy of that HTML/JavaScript/CSS bundle and have it offline, and that should work. There is this interesting question, though, of, okay, there are certain formats for archiving. I think it's the Library of Congress: they have standards about…
Jed Sundwall:
Yeah, okay.
Jed Sundwall:
Okay. right. Yeah.
Jed Sundwall:
Yeah. Yeah.
Brandon Liu:
they recommend JPEG as a format based on the likelihood that, in 50 years… There are library-science people that are like: we have these historical scans of restaurant menus, but how do we open them? Because there's this image format that was popular back in the 2000s, and now nobody can read it. So there's this open question of:
Jed Sundwall:
Right.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Brandon Liu:
is PMTiles a resilient format by that standard of measure? And I think, the way the format is designed, it could fit on one page. I know people that have written an implementation in a different language, like Rust or Swift or something, and they can do it in a day, because the format is intentionally as simple as possible. It goes back to
Jed Sundwall:
Yeah.
Right.
Brandon Liu:
that QOI format: it needs to fit on one PDF page. It can't be a white paper, a 200-page book, to be able to write a reader. So my hope is that even if GitHub is blasted into the sun and we lose all the code, and you have to write a reader for PMTiles from scratch and all you have is the spec, I don't think it's that hard. It should be doable.
So even if you didn't have that web viewer, or a thing on a USB stick, you could figure it out.
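As a taste of how small a from-scratch reader can be, here is a sketch that parses the start of a PMTiles v3 header. The field layout (a 7-byte "PMTiles" magic string, a one-byte spec version, then little-endian uint64 offset/length pairs) follows my reading of the spec; verify against the published specification before relying on it.

```python
# Sketch of a minimal PMTiles v3 header parse. Field offsets follow my
# reading of the spec; check the published spec before relying on this.
import struct

def parse_header(buf: bytes) -> dict:
    assert buf[0:7] == b"PMTiles", "not a PMTiles archive"
    version = buf[7]
    # Four little-endian uint64s: root dir offset/length, metadata offset/length.
    root_off, root_len, meta_off, meta_len = struct.unpack_from("<4Q", buf, 8)
    return {"version": version,
            "root_dir": (root_off, root_len),
            "metadata": (meta_off, meta_len)}

# Round-trip a synthetic 127-byte header to show the shape of the data:
fake = b"PMTiles" + bytes([3]) + struct.pack("<4Q", 127, 100, 227, 50) + b"\x00" * 87
h = parse_header(fake)
```

Combined with a single HTTP Range request for the first 127 bytes, this is already enough to start walking a real archive's directories.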
Jed Sundwall:
Yeah. Amazing. This is great. We'll be announcing this right away, but the next episode of Great Data Products is, we're pretty sure, going to be with the Harvard Law School Library Innovation Lab. That's where I found my kind of librarians, you know, people who understand the benefits of object storage and these primitive, commoditized layers of storage, but who have a lot of thoughts about this.
We'll be talking about many different types of content, but I hope they hear this, because your thoughtfulness on this, I think, is really great. I mean, the tagline of this podcast is "the ergonomics and craft of data," and you're thinking so far ahead: what are the ergonomics of finding a PMTiles file in the rubble left after the nuclear winter, and people being like, actually, I can figure this out?
What a great experience you're thinking of for the future archaeologists. Yeah.
Brandon Liu:
Right. So, just as a comparison point, and it's probably fine to sort of pick on Esri stuff here (or it's not picking on it): even a file geodatabase, which is the FGDB format. There are city governments that publish FGDBs and expect you to open them, and most developers that are not in the Esri ecosystem cannot open these files.
Jed Sundwall:
Yeah.
Brandon Liu:
I think it might have been New York City; they distribute their road network as an FGDB. And, you know, that format was maybe designed 15 years ago, and even then, most people I talked to are like: what do I do with this file? I have no idea what to do with it. So that's an extreme example where it's not even a question of
being able to open the file in 50 years; it's a question of whether, even five years after you publish it, anyone can deal with this thing. And it's like, well, not really. I think it's kind of proprietary, or maybe there is some spec. But even things like shapefile: shapefile was proprietary from the very beginning, right? And then people sort of reverse-engineered readers for shapefile.
Jed Sundwall:
Right?
Brandon Liu:
And even then, there are undocumented extensions for doing indexing and stuff on top of shapefile. All of those things, I think, sort of fail that library test: are people going to adopt this if they're trying to preserve things for the future?
Jed Sundwall:
Yeah, absolutely. I mean, you're thinking the right way, you know. And what's interesting, Jackson says GeoPackage. Yeah, there's an answer there. I mean, what's remarkable
Brandon Liu:
GeoPackage, yeah.
Jed Sundwall:
about this: just how short the history of the internet and computing really is, you know? It's fun to think about what things will be like a hundred years from now, or whatever. But we went through a blip, I would say, where people were like: oh yeah, the way to control the market is by controlling the standards. Microsoft did that very effectively and developed incredible network effects through the DOC and XLS formats,
which have since been effectively opened, but who cares? By this time the damage is already done; everybody uses Word and Excel. Which, I should also say, I'm not mad about. I think they're great, obviously powerful tools that everyone uses. It's technology that's well distributed, so I'm not mad about that. But in the future, we have to think more about exactly what you're saying, which is: how durable is this going to be, really?
And that means being very thoughtful about how you design the spec, and it's usually going to be something simple. The only other thing I'll say here is that I don't want to seem like I'm picking on PMTiles, because if I double-click on a PMTiles file, nothing will happen. The same is true for Parquet, right? And Parquet is all the rage; so much data on Hugging Face right now is in Parquet. We love having tons of Parquet data on Source.
And I was showing a guy earlier today, who's not really familiar with it, but I opened it up on Source, and these are my favorite demos. My PMTiles demo is the best demo of Source, because we've got a great viewer built in and you can just look at it; it's easy for people. Thank you for the viewer that you created. And then Sylvain also built this Parquet viewer, and it's great: as of today, somebody can drag and drop a Parquet file into Source,
and they can look at it in the browser right away. And I showed this guy: here's a Parquet file, it's 800,000 rows, and it's just streaming through really easily. And we're already at a point where there's so much data out there, and so many formats being adopted, that no one's even bothering to develop desktop viewers for them. It's all being done in the browser. The expectation is that it's all going to be done over the internet, which is amazing.
Jed Sundwall:
we got some comments coming through. Yusuf from Egypt, hello. I don’t know, who knows what time it is over there. He says new versions of GDAL can open up FGDB now.
GDAL for the win.
Brandon Liu:
I think I saw that. Yeah. I think like my standard workflow now is like, I downloaded the FGDB of, it’s like New York City road centerlines. And then I do an ogr2ogr and just get it into a GeoJSON or something. But yeah, I believe there is a solution now. I remember, I think there was one time, like a decade ago, where I downloaded the ArcGIS Pro trial and activated the trial just to be able to open
the FGDB and then save it out as something else. But I think that the status quo is better now. Yeah, for sure.
Jed Sundwall:
Yeah. Yeah.
Yeah, I mean, GDAL, it just…
Shout out to Evan. A few more comments on YouTube. Jackson, hello Jackson. He says he’s in the midst of writing an implementation of GeoPackage in Julia. Good luck. Let us know. If you want to write about that on the CNG blog, we have a process for submitting stuff to the blogs. That’d be cool. It’s 2:52 in the morning where Yusuf is. Brandon, you are very popular. People are like, this is incredible.
Sun never sets on the Brandon Liu Protomaps empire. And then we’ve got Sigtil again from Norway, staying awake. I love this late night energy we’re getting. Asking, how do you see the new kid in town, GeoParquet, versus PMTiles? They have some of the same properties and some differences also. As you said, there’s a lot of nuance. Yeah, so yeah, we have Zarr, COG, FlatGeobuf.
Brandon Liu:
Cheers.
Jed Sundwall:
You’ve explained this to me before, sort of the nuance between what PMTiles does as opposed to what GeoParquet does. I mean, I have my own guesses about this, because GeoParquet is like more about data than PMTiles, which is more about viewing. Is that how you would describe it? Or what’s your response there?
Brandon Liu:
That’s how I see it. Yeah. So I make the distinction between a format that is for analysis versus a format that’s for visualization. And I think that’s maybe not intuitive, because in some cases, those are the same. Like for a COG, viewing it and analyzing it are sort of the same, because analyzing it means, what is the value at this pixel? And viewing it is like, show me the raster, you know, colored in some way.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
For PMTiles, a lot of the use cases right now for PMTiles are vector-based. And for vector, you sort of need to split out the analysis and visualization into separate things. Because if you wanted an overview for a vector dataset, you can’t really show everything. It would be too noisy. So PMTiles is inherently generalized. Like it has like an overview pyramid.
Jed Sundwall:
Yeah.
Brandon Liu:
So you can load it at any scale and it looks correct. But what you actually see at that level is not everything. You have to do some filtering down of the data. Sort of like for COGs, you have to build overviews that are smaller and smaller downsampled resolution images of the full thing. So GeoParquet does not have a lot of use case overlap with PMTiles, because GeoParquet is
an analytical format. It’s just all the raw data, and then only one version of each data point. While PMTiles will have copies of a single data point, because it has to build those overviews. Now, there are approaches to using GeoParquet and visualizing it directly. Like, for example, there’s a project called Lonboard that lets you just show
Jed Sundwall:
Right, right, right.
Brandon Liu:
GeoParquet on a map. Whether or not that’s practical to use on the web really depends, because if you want to be able to download an entire GeoParquet dataset of a city to visualize, that might be 200 megabytes, which is more than people usually expect for a single web page. I mean, it’s possible that in 10 years, bandwidth will be so fast and cheap
that downloading 200 megs for a single webpage might not matter. And maybe at that point, we don’t actually need a visualization format. We can just be downloading raw data everywhere. But I expect some sort of strategy around being able to visualize data with overviews is always going to be necessary, just because some datasets are just really big. Like, there’s building datasets on Source that are maybe half a terabyte, like the open buildings datasets.
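The overview pyramid Brandon describes — smaller and smaller downsampled copies of the full image, as in a COG — can be sketched in a few lines. This is purely illustrative Python, not code from any of the tools discussed:

```python
def pyramid_levels(width, height, tile=256):
    """Successively halved overview levels for an image,
    stopping once the whole image fits in a single tile."""
    levels = [(width, height)]
    w, h = width, height
    while w > tile or h > tile:
        # round up so odd dimensions still cover the full extent
        w, h = (w + 1) // 2, (h + 1) // 2
        levels.append((w, h))
    return levels
```

Each level here is half the resolution of the one before it, which is exactly why any zoom level of the pyramid can be served without touching the full-resolution data.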
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah, the VIDA datasets, those are my favorite demos. They’re like 300 gigs or 230 gigs or something like that. It’s like, yeah, it’s only going to be streamed.
Brandon Liu:
Yeah.
Jed Sundwall:
My assumption is that storage will keep getting cheaper. There’s still plenty of room to progress in terms of the cost of storage itself, but bandwidth, networking, has actual physical limits in terms of the speed of light. The movement of bytes across space is really hard.
One, actually Qiusheng Wu, awesome to have Qiusheng on here, says that DuckDB supports serving vector tiles through Parquet, so that’s on LinkedIn. So, cool. It’s great. And then we have another, I wanna talk to you about the Hilbert curve. We’re at about an hour, so we can maybe start wrapping it up. But then Alex Kovac asks, and I’m gonna test this out, I’m still figuring out how to do this. You can see it, okay, so.
Brandon Liu:
Nice.
Brandon Liu:
I see it.
Jed Sundwall:
How did, I think the people on LinkedIn can’t see this. So this is tooling on PMTiles. And also for the purposes of the people listening after the fact, Alex says tooling around PMTiles, such as the viewer, CLI, Tippecanoe, the basemaps package, et cetera, is super convenient. How did that evolve? And do you think there’s anything big missing? Yeah.
Brandon Liu:
Yes. I think the part I put the most thought into was the overall developer experience of using PMTiles. And from the beginning, it had to be like a single binary you could just download. I did not want you to have to homebrew install or npm install or Python package install a package, just because that’s going to fail for a lot of people.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
If you’ve ever been to a workshop where people use Python, like a scientific workshop where people are like, we’re providing the material as a Jupyter notebook. And then someone’s like, I’m on Windows. And then you’re like, just use Conda. And then you’re trying to fiddle with this Conda setup. And I’m like, I just don’t, like, I feel like it pushes people away. Like, I understand that that tooling is mature, but for me, I think the best developer experience for any sort of data tooling is like…
Jed Sundwall:
You’re right.
Jed Sundwall:
Yeah.
Brandon Liu:
just download a single binary. Those are the tools I see having the most adoption and least problems in terms of the installation. So the installation has to be super simple, like a single download. The viewer we talked about, the web viewer, is for browsing PMTiles files. I would say if there’s something big missing, I think Tippecanoe is great,
and PMTiles support for that is built in, thanks to Felt. But I would say it’s still too hard to install. A lot of people that want to build PMTiles, they get stuck on, how do you install the vector tile generator? I would say that is the biggest missing piece, which is to have a single binary download vector tile engine.
Jed Sundwall:
Okay.
Brandon Liu:
Like a lot of the limitation for that is because the libraries you need to do geometry processing are generally only in a couple of languages, like C++, Java. And right now the CLI is a Go program, and there’s no good libraries for that in Go. Even Rust doesn’t have that great of support. You probably need to bring in GEOS via C++ bindings. So the biggest missing part is still some…
Jed Sundwall:
Yeah.
Brandon Liu:
easy-to-install, large-scale generator for vector tiles. It’s something I do want to work on, but right now I think the Tippecanoe solution is good enough. But it’s the major pain point for using PMTiles.
Jed Sundwall:
Yeah. I mean, talk about ergonomics of data. The way you think about this is so great. Everyone, learn from Brandon. You’re so thoughtful. See, this is you helping level up the species, just by thinking through things this way. Because yeah, it’s so goofy. I mean, I’ve been in all these hackathons, in these rooms where people are like, yeah, like,
you end up spending half the time debugging people’s Python installations. It’s just like, no, there’s got to be a better way. Yeah.
Brandon Liu:
Right. There’s also this idea of different kinds of complexity. There’s like inherent complexity versus incidental complexity. And I think a lot of solving these pain points is around solving incidental complexity, which is just complexity that happens to be there as an artifact that is not related to the actual problem we’re solving. Like maybe you’re trying to solve some route optimization problem. And that is it’s…
is inherently an interesting computer science problem. But then the incidental part is like, I need to install these packages with Conda, and Conda doesn’t like the wrong version of something on my machine. And all that stuff is just the part that, like, we have to eliminate in order to actually get to working on the hard problems.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah, exactly. What’s the line? It’s sort of, you make the hard stuff easy and the impossible stuff possible, or something. There’s some axiom around, you know, guiding software development along these lines, which is like, we should be continually progressing in that direction. But you’re asking all these great questions, or framing it in the right way, which is just sort of like, you imagine somebody who’s coming to a hackathon,
how quickly can you get them up and running? If you’re gonna take an SD card into the forest, what can you actually do with that, realistically? And I often think in terms of, this is what I was saying before about Excel and Word being very successful, is that they are sufficiently distributed technologies. The whole idea that the future’s already here, it’s just not evenly distributed. There are some that are evenly distributed, like spreadsheet software.
Like everyone can open a CSV. That’s awesome. CSVs are great because of that. But you know, as we’re getting better at producing more complex forms of data, we need to think about the ergonomics in that way. Like, what are the experiences of people being introduced to this? So, Yusuf says that Tippecanoe on Windows is a nightmare, by the way. So FYI.
Brandon Liu:
I’ve heard that as well. Yeah. Yeah, I’m aware.
Jed Sundwall:
So I remember years ago I asked you if you’d ever seen the movie Tar.
Brandon Liu:
which I still haven’t, but I need to now that you’ve mentioned it twice.
Jed Sundwall:
Okay, well, I’m just like, it’s a, Tár is a weird, Tár fans come out and tell me if you’ve watched the movie Tár. It’s Tár with an accent on the á. It’s a Todd Field movie, in which David Hilbert is a character of sorts. Like he just shows up in the background, and I think there are references in the movie to the Hilbert curve.
Tell me about the Hilbert curve. Let’s close on this. Why the Hilbert curve and how did you get into space filling curves? I love this stuff.
Brandon Liu:
I kind of ripped it off of S2. So S2 is Google’s geospatial indexing library, and they use the Hilbert curve there. It has some nice properties that make it work well for geodata. And the motivation behind this is even in Cloud-Optimized GeoTIFF,
Jed Sundwall:
Okay. Yeah.
Jed Sundwall:
Yeah.
Jed Sundwall:
Okay.
Brandon Liu:
People argue about, like, so we’re making a cloud-optimized format, but how big should the blocks be? You know, you’re fetching blocks. If you have small blocks, those are good for certain use cases. If you have big blocks, those are good for more bulk downloading use cases; it’s more efficient. And there’s some trade-off between small blocks and large blocks. But the Hilbert curve is, it’s like a lazy way to get around that argument,
because it’s both small blocks and big blocks in the same format. You can actually have any size block, as long as it’s a power of two. And the reason this is good for PMTiles is because one of the operations on PMTiles is extracting one part of the world from a larger file. And the imagined use case for this is, so I host my OpenStreetMap dataset on the cloud.
Jed Sundwall:
Yeah.
Brandon Liu:
But maybe you only care about Seattle. You don’t want to have a copy of 100 gigs of the whole world. You only want Seattle. Or maybe you only want Capitol Hill. So the block size in the archive should be small if you only care about a neighborhood. But if somebody else wants all of Canada instead, then they want to be able to have a format that has big blocks so they can download Canada in one chunk.
So the Hilbert curve is useful because it encompasses both of those use cases without having to make a trade off. Because if you did small blocks, it would be good for Capitol Hill, it would be bad for Canada. If you did big blocks, it’d be good for Canada, it’d be bad for Capitol Hill. So because the Hilbert curve is sort of scale-free, it has the same self-similar structure at every power of two.
you sort of get the best of both worlds in one thing. And that’s really the motivation for why the Hilbert curve was useful for this design. I would say it’s not fundamentally essential. You could build a pretty good format just using other space filling curves, like a Z-order curve. There are some drawbacks in terms of it’s more computationally expensive to decode the Hilbert curve versus other ones.
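For the curious, the curve mapping Brandon is describing can be written compactly. This is the standard textbook Hilbert transform (adapted from the well-known C version on Wikipedia), shown for illustration — it is not Protomaps’ actual implementation, which works on tile IDs:

```python
def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so sub-curves connect end to end."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Map (x, y) on an n x n grid (n a power of two) to its
    distance d along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d

def d2xy(n, d):
    """Inverse: distance d along the curve back to (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y
```

The useful property is visible in small examples: consecutive distances along the curve are always grid neighbors, at every power-of-two scale, which is the self-similarity Brandon calls “scale-free.”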
Jed Sundwall:
Yeah.
Jed Sundwall:
Okay.
Brandon Liu:
For example, there are these Bing quadkey tile indexes that are much faster to compute than the Hilbert curve. For most use cases, though, the cost of decoding and encoding the Hilbert curve is trivial compared to the network. If it spends two milliseconds converting a bunch of tile coordinates on the Hilbert curve, then you’re spending 50 milliseconds fetching something over the network.
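The quadkey scheme Brandon contrasts it with really is much cheaper to compute: each zoom level contributes one base-4 digit, built from one bit of x and one bit of y. A sketch following the published Bing Maps tile system convention:

```python
def quadkey(x, y, z):
    """XYZ tile coordinates -> Bing Maps quadkey string.
    Each digit encodes one zoom level: bit 0 from x, bit 1 from y."""
    digits = []
    for i in range(z, 0, -1):
        mask = 1 << (i - 1)
        digits.append(str((1 if x & mask else 0) + (2 if y & mask else 0)))
    return "".join(digits)
```

Unlike the Hilbert curve, this is just bit interleaving, so there is no rotation bookkeeping — but consecutive keys are not guaranteed to be spatial neighbors.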
Jed Sundwall:
interesting.
Jed Sundwall:
Okay.
Brandon Liu:
So like overall, like holistically, the price you pay for using the Hilbert curve is not that much relative to other things going on in like in some actual use case. But that’s like kind of the whole story as to why we use this like weird thing that is apparently in a movie as well.
Jed Sundwall:
Yeah, I mean, just the movie. I turned the light red again, just because it’s kind of a spooky movie. Let me, there’s BV on YouTube asked a question, if H3 grids are similarly useful. But one thing I want to clarify about the Hilbert curve, to make sure I understand it, which I’m pretty sure I don’t, which is that the idea is that you can map two dimensions along one dimension.
Brandon Liu:
Yeah.
Jed Sundwall:
Right? Like, you just have one string that can be extended into two dimensions, effectively anywhere at any resolution you want. If I’m loading up the Canada tile, am I just loading up one band? Like, how does it work? Or is it making multiple requests to do that? Can you explain that even? It sounds like the kind of thing you would need a whiteboard to describe, but.
Brandon Liu:
Yeah, you’re opening up multiple like, so if the entire world is on one length of string, then Canada is multiple segments of that range of string. Now, where you can adjust is how finely traced the borders of Canada are because
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
If you’re working in a networked environment, you can do some optimizations. You can say, I’m going to grab a little bit more data than I need, but have fewer ranges. I can represent Canada using fewer segments of string, even though I get a little bit of America on the side.
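That optimization — accepting a little over-fetch in exchange for fewer HTTP range requests — is a generic trick for any range-request reader. A rough sketch of the idea in Python (illustrative, not Protomaps’ actual code):

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (start, end) byte ranges whose gaps are at most max_gap,
    trading some wasted bytes for fewer HTTP range requests."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # close enough: extend the previous range, fetching the gap too
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged
```

Tuning `max_gap` is exactly the trade-off in the conversation: zero means no wasted bytes but many requests; a large value means few requests but some “America on the side.”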
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jed Sundwall:
Right.
Brandon Liu:
Pretty much that, like there isn’t really one Canada tile, but you can sort of trace out a contiguous segment of the file that is all next to each other, that is all inside of Canada. And then maybe grab a little bit on the sides for like different outline areas. But the interior of Canada, as long as it’s like an area, you know, like most countries in the world or most regions are not like Chile where it’s just like one long thing.
most of them are like kind of rectangular-ish, you know, they have like an interior and then like a border. So this sort of space filling curve is well suited to how people usually think about areas as having like an internal volume and then being able to slice that into just parts of this space filling curve without having to, you know, like use an excess of
Jed Sundwall:
Yeah.
Jed Sundwall:
Okay.
Got it. And then one follow up question on that from the chat is that, is there a benefit here that also these requests are close to each other? Meaning like, you want to look at the full Canada tile and then like the Vancouver tile, should they be near each other? My intuition though is that that shouldn’t matter with object storage and range requests, because it’s not like you’re.
He’s saying like, it’s similar to like how you defragment an old spinning hard drive, but like, that’s not how object storage works. I mean, we’re not assuming that we’re using spinning disk. We might be, but do you have any insight there? Yeah.
Brandon Liu:
Right, so it matters a lot on HDDs because it’s like on those old spinning hard drives, it’s like you have to move the needle more if they’re not by each other. But I think most storage now is solid state and there’s not a huge difference in the seek time for like a far away chunk versus a near chunk. But yeah, there is also benefits to certain operations. Just having parts that are close in space also be close in the file.
Jed Sundwall:
You have a head. That’s right. That’s right.
Brandon Liu:
that is taken advantage of in some parts of the tool.
Jed Sundwall:
Okay. And then let’s, do you have opinions about H3? I mean, so BV is asking, are H3 grids similarly useful? I see it as probably not, but I don’t know how H3 content is. H3 is more of an indexing concept, you know.
Brandon Liu:
H3 is really useful for visualization. Yeah, I think it’s like, so with H3, you’re usually storing a value in each cell. And it’s really great for making really good looking visualizations of data with hexagons. There are some trade-offs. Like in H3, one hexagon does not perfectly nest
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Brandon Liu:
into its child hexagons, while in tiles there is a perfect nesting. But for certain use cases, like showing aggregate statistics, it doesn’t matter. So I would say H3 grids are the perfect match for certain use cases around visualization that are separate from doing tiling.
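The “perfect nesting” of square tiles that Brandon contrasts with H3’s approximate hexagon hierarchy is easy to state in code: every web-mercator tile at zoom z splits into exactly four children at zoom z+1, and each child has exactly one parent. A quick sketch:

```python
def children(z, x, y):
    """The four child tiles of a web-mercator tile at the next zoom.
    Together they cover the parent exactly, with no overlap."""
    return [(z + 1, 2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]

def parent(z, x, y):
    """Inverse: every tile has exactly one parent at zoom z - 1."""
    return (z - 1, x // 2, y // 2)
```

H3 has no such exact relationship — a hexagon’s seven children only approximately cover it — which is fine for aggregate statistics but rules it out as a tiling scheme of this kind.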
Jed Sundwall:
Right, right.
Jed Sundwall:
Right.
Yeah, exactly. Yeah, that’s sort of my understanding. And it is especially good for visualization, but then also statistics. So if you’re doing analysis on, I mean, you just think about the origins of it, with Uber wanting to measure demand and activity in very certain areas at different grains. It’s like perfect for that. So, okay. Well, look, we’ve been going for an hour and 15 minutes. This is incredible. We’ve got…
people staying up to all sorts of crazy hours. Guys, go to bed. Again, there’s a podcast. This audio will go out, so you can listen to it whenever. But really, we have been honored. People are honoring us with their time. I hope this has been interesting for them. Brandon, I love talking to you. I obviously love what you’re doing. We’re very proud to have you as a Radiant Earth Fellow, and have had you as a fellow for a long time.
Oh man, are you serious? Sigtil in Norway won’t let up. He’s got to go to bed, but he’s asking more questions. Are there some geometries that are not supported, or more difficult? For instance, polygons with, boy, with holes, and holes made of curves, et cetera. What was the most difficult geometry to work with across tiles? This is too hard of a question. Are you seeing this comment? Go for it.
Brandon Liu:
No, I’m able to address this. Yeah, I mean, this is a good, like, deep question, but it goes back to what I was saying, which is that there are certain geometries that are hard to deal with. And a lot of it is, you have to have a geometry library that is very robust against certain numerical precision errors. And the only libraries right now that get it totally right are basically GEOS, which is part of
Jed Sundwall:
All right, do it and then we’ll wrap it up. Okay.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
part of PostGIS, and JTS, which is a Java library that is related to GEOS. And then a couple other ones, like there’s one that Mapbox made. But yeah, that difficult geometry is the limitation in being able to write an easy-to-install vector tile generator. So I’m happy to follow up over email or something if you wanna know more about geometry processing, cause it’s like a really deep
Jed Sundwall:
Yeah.
Brandon Liu:
subject that sort of is a stealth hard problem. People don’t realize how hard that problem is until they find some weird geometry that’s broken. But yeah, that is a good question. And again, I’m happy to talk about it more.
Jed Sundwall:
Okay, and then, so to contact you, I’m gonna just put in the chat, protomaps.com, go to protomaps.com, there’s info down at the bottom with how to reach you. So, you’re easy to reach. Obviously, everyone listening to this knows how thoughtful you are. So, anyway, I mean, thanks so much for what you’ve given to our community.
Can’t thank you enough. Anything else you want to talk about? we missed?
Brandon Liu:
I just wanted to say thanks for having me on the podcast. I am also on the CNG Slack and the Source Cooperative Slack. Which one do you want people to use? If people are CNG members, then they can join.
Jed Sundwall:
That’s right, yeah.
Well, yeah, so CNG members, you got to be a member. For both, you kind of have to be a member. So membership to CNG is pretty cheap. We say it’s a symbolic fee. These memberships don’t really add up to pay many bills, but we ask people to pay to join CNG just to make sure that we know that people are there on purpose. They really want to be there. So join CNG if you’re not, and Brandon’s in the Slack there. Source is still invite-only.
But Source, so yeah, the best point of entry right now is the CNG, the Cloud-Native Geo Slack. You can go to cloudnativegeo.org slash join and learn about it there. I’ll put that in the chat as well. But yeah, thank you. Yeah, it would be great to see people interacting with Brandon on any of our Slacks, but he’s easy to find otherwise.
All right. And then it’s, what is it, 8:17 in the morning there now?
Brandon Liu:
It is, yeah. It’s bright and early.
Jed Sundwall:
You got a whole day ahead of you. All right, well, happy Thursday. Thanks again for doing this. I bet we’ll do it again.
Brandon Liu:
Awesome, yeah, I’m looking forward to the next episodes.
Show notes
Jed Sundwall and Drew Breunig explore why LLM progress is getting harder by examining the foundational data products that powered AI breakthroughs. They discuss how we’ve consumed the “low-hanging fruit” of internet data and graphics innovations, and what this means for the future of AI development.
The conversation traces three datasets that shaped AI: MNIST (1994), the handwritten digits dataset that became machine learning’s “Hello World”; ImageNet (2008), Fei-Fei Li’s image dataset that launched deep learning through AlexNet’s 2012 breakthrough; and Common Crawl (2007), Gil Elbaz’s web crawling project that fueled 60% of GPT-3’s training data. Drew argues that great data products create ecosystems around themselves, using the Enron email dataset as an example of how a single data release can generate thousands of research papers and enable countless startups. The episode concludes with a discussion of benchmarks as modern data products and the challenge of creating sustainable data infrastructure for the next generation of AI systems.
Links and Resources
Key Takeaways
- Great data products create ecosystems - They don’t just provide data, they enable entire communities and industries to flourish
- Benchmarks are data products with intent - They encode values and shape the direction of AI development
- We’ve consumed the easy wins - The internet and graphics innovations that powered early AI breakthroughs are largely exhausted
- The future is specialized - Progress will come from domain-specific datasets, benchmarks, and applications rather than general models
- Data markets need new models - Traditional approaches to data sharing may not work in the AI era
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
All right, well, Drew, welcome to Great Data Products, episode one. Thanks for doing this with us.
Drew Breunig:
Not a problem.
Jed Sundwall:
yeah, as I said, I’m going to ask you to introduce yourself in a second, but before I do, I just want to explain a little bit why we started this podcast, which is that we believe that
understanding what makes a good data product is just very understudied. We’ve been doing it as a species for a while now, every now and then sharing data. There have been laws on the books saying, you know, thou shalt open your data, or policies from research funders saying that researchers need to open their data. Sometimes it goes well and sometimes nothing really happens with it. And I think we have enough experience under our belt now that we can see there are a handful of data products that have come out
that have had a huge impact on research. And we’re at the point where we’ve got to figure out, like, why? Why those? What made them good? Elinor Ostrom said this somewhat famously, at least for me, I’m a big fan of hers, but you know, she’d spent all of her life working on trying to understand how people share common resources that are limited, like a fishery or a forest or, you know, grazing fields and stuff like that.
And she’s like, look, we know this happens. Humans have figured out how to do this. We know it works in practice. Now we have to figure out how it works in theory. And I love that. So that’s what we’re doing, is trying to figure out. We know that some data products are really great. We want to tease out some theories to explain why. So, for reasons that are obvious to me, but might not be obvious to everybody tuning in or listening, you were one of the first people I’d ever want to talk to about this. So why don’t you explain a little bit about your
background and what you do.
Drew Breunig:
yeah, first I want to put a pin in that quote you said, cause I think one of the things that’s crazy about that is, like, a fishery is a zero-sum game. That is an exhaustible resource. Data products have entirely different dynamics. Like, you can go full old-school Boing Boing, Cory Doctorow, data wants to be free, it’s not theft if you can reproduce it. But at the same time, it grants you this immense advantage
that then allows you to create more data in a way that isn’t free. It’s kind of, anyway. So yeah, my name is Drew Breunig. I write a bunch on AI and data. I’ve been working in data. I helped run, or ran, data science and products at a company called PlaceIQ for about a decade. And then led strategy at Precisely when it came to the data and
Jed Sundwall:
Right. Yeah.
Drew Breunig:
intelligence business. I see data as a really interesting space because it’s an intersection between humans and compute, essentially. Because you’re essentially converting humans, or work of humans, or observations made by humans, into something that is programmatically readable, so you can build products upon it. I also think the other thing that’s interesting about that is that’s not a one way street.
It’s a two-way street. So you are converting humans into data, but at the same time you’re preparing data and figuring out how it can be leveraged to inform those humans. So kind of making data human, making humans data. And that is an active negotiation of borderlands as it were, rather than just one way that comes in and goes out.
Jed Sundwall:
Oh man, fantastic. All right, see, this is a rich well to draw on. Um, yeah. And what you just said about Cory Doctorow and sort of the economics of this, I think, and I’ll just keep saying it out loud over and over again, this is the Nobel Prize challenge: can we figure out how data functions as a market good? Because it’s weird, right? Like, to your point about
Drew Breunig:
Yeah, we can go, but.
Jed Sundwall:
What Ostrom was studying was limited resources, which she called common-pool resources, with the assumption that they were limited and you needed governance to manage access to them. And just to, yeah, just a quick primer on Ostrom for a lot of people, and I’m not like a full Ostrom scholar, but a lot of what made that work was the fact that you had to live with the people that you shared the resource with. And so if you were a jerk about it, like, you would get punched. Like, that’s just part of it. And yeah.
Drew Breunig:
Yeah.
Drew Breunig:
Yeah, mean, guess you can kind of say that exists when it comes to licenses, which is a whole different messy world, which is like, god, please don’t. So much of my beef with licenses is that it’s the will of people when the data wants to be free. And the real way that you can kind of put your fingerprint on the market is you actually put the data out there in the shape and the form that is
Jed Sundwall:
yeah, we’re going to talk about licenses.
Drew Breunig:
what you want that makes what you want in the world to happen. But the idea of releasing it and then gating it is just insane to me. It doesn’t make any sense. It’s backwards. You kind of want the option, but you want to control how people use it, which is just like, why are you even bothering in the first place? But yeah, and I think that’s like, now you’re getting into the familiar terrain of like the data is the new oil claims and other things like that. And I feel like that’s a quote even that we debated and wandered around.
and talked about for decades. And part of the reason we talked about it is because it made people who work in data feel important. It made them feel like, this justifies my paycheck, my job title, my power within the organization. But I don’t really feel we got to the point where data is the new oil became somewhat true until LLMs and post-ChatGPT
Jed Sundwall:
yeah. yeah!
Drew Breunig:
Specifically, those were the engines needed. You can create oil, but if no one owns an engine, no one has anything they can do with it. That's kind of the era we were in while we were figuring it out. We could drill it, and we knew there was potential energy there, but what do we actually turn it into? And prior to that, there was one thing you turned it into, which was ad products. That was the one thing. That was the way to monetize. And now we're turning it into large language models and other things like that.
Jed Sundwall:
Right.
Drew Breunig:
Figuring out the economics of it, I believe, is hard. Because, I don't know, I think one of the things is you can find so many different metaphors for this, because it's a complex thing, and there's no single bucket that kind of reins it in. But I do think one of the king metaphors for data is that it's the platypus. Because, well, what is a platypus, Jed?
Jed Sundwall:
Go? Go on.
It’s all sorts of crazy stuff.
Drew Breunig:
Yeah, it’s got a bill. It’s poisonous, lays eggs, mammal, it’s got fur. Yeah, it’s like that’s that’s data. Like it sometimes you can, you can make it like oil. Other times you can make it like a lighthouse, which is like a public good that makes it so ships don’t crash. And you can put it at the right
Jed Sundwall:
Yeah. Lactates in a really weird way. Yeah. Sure.
Jed Sundwall:
Mm-hmm.
Drew Breunig:
point that encourages very specific trade routes to occur and economic activity to occur. And so you influence the world by putting it out there. And because it's a public good that can't be gated, that became something governments did. And you could make the same argument, and you can find countless other metaphors for data as you run into them. But I do think when you put a data product in the world,
getting towards the definition of what this podcast is, a great data product creates an ecosystem around itself, I think is the way I would say it. And perhaps this can happen intentionally, and it can also happen accidentally. And so by way of kicking this off, I almost want to pose to you
Jed Sundwall:
Yes, yeah.
Drew Breunig:
what I think is the best data product ever created, or one of them, the Enron email data set. Are you familiar with this one?
Jed Sundwall:
Let’s go.
Jed Sundwall:
Ah, I am. Because, just a flashback here: when I joined AWS in 2014 to build the open data program there, AWS already had this thing called the Public Data Sets program, which sort of preceded me. AWS had already been dabbling in sharing open data, but there was no kind of program around it, and the program was somewhat abandoned. But.
Drew Breunig:
Yes.
Jed Sundwall:
How it had been set up was using Elastic Block Storage volumes. So this is data that, to access it, you had to turn on EC2. You had to turn on a server and then attach one of these volumes to that server. Then you could access it. We had all these EBS snapshots, these volumes of data that you could load up. One of them was the Enron email database, but there were some other funny ones: there was a cannabis genome,
maybe the Marvel Cinematic Universe, something like a graph database of Marvel characters, and some Japanese census data that someone found. It was kind of this fascinating snapshot. I'm sure there are plenty of Internet Archive screenshots of the site. It was just sort of, here's some random data that engineers at AWS found circa 2012. But yeah, the Enron database was in there. So go on, let's talk about it.
Drew Breunig:
Yeah.
Drew Breunig:
Well, I just think the Enron email database — so for those of you who aren't familiar with it: Enron was a company that blew up in spectacular fashion. When did it blow up? Like 2001, 2002?
Jed Sundwall:
And you’re talking blow up pejoratively, like it was catastrophic.
Drew Breunig:
Yes. Yes, it was not a physical, literal blow-up. It was just a mountain of fraud. And when the case broke, there was a ton of public anger. A lot of people lost their pensions, a lot of people lost their stock, and effectively it went to zero and got taken over. And it was a big company. In 2003, as part of the court proceedings,
Jed Sundwall:
Yeah. Yeah.
Drew Breunig:
I think it was the Federal Energy Regulatory Commission released the emails of about 150 senior Enron executives. So this is about 1.6 million emails that get released. And this is 2003, keep in mind. That is an amount of email that would be out of reach for most people, because it was just incredibly hard to download. Though putting it in AWS, I'm sure, made it very popular. If you search "Enron email dataset MapReduce," you will find hundreds and hundreds and hundreds of tutorials. And so it became this incredibly popular dataset that people wrote papers about, about the internal dynamics of workplace culture and language. I think at one point there were like 30,000 papers a year
that were citing this. And when I checked Google Scholar, it maxed out; it was over 100k. Then you start to look at the companies that were booted up around it. I know multiple startups who started building email software or enterprise SaaS software with the Enron email dataset. You would start with it to build your products around, because there was no other email dataset. Even today, you see it used in AI evals and pipelines.
Jed Sundwall:
Interesting.
Drew Breunig:
is just this, it’s the only large email data set that is friendly license free to use. And it is generated an immense, I think it would be a very fun study for someone to do would be to calculate what the economic benefits of this email release from this absolutely failed company and how much it generated from this. so like, to me, that has the qualities of a great data product, which is it provides data that wasn’t existed anywhere else. It doesn’t, so you
There was no competing offering, and any competing offering was just a minuscule, minuscule amount. Two, it has legs. We are, I want to say, 22 years since the release and it remains as relevant as ever. It was freely available, accessible, and easy to work with despite its size. It was a very common MapReduce demo, as I said, which would be the first step you would take in dealing with it.
And it created an ecosystem around it, which I think is the biggest test for good data: do things grow out of it? I was at the Monterey Bay Aquarium this weekend and they had an exhibit on whale falls: when a whale dies, it sinks to the bottom and starts to decompose, and all the critters come to eat it, and it's this feasting moment. The Enron email dataset was the equivalent of a whale fall.
Jed Sundwall:
Yeah.
Very juicy. I mean, yeah, so much material in there. No, I love this. You're making me think about this white paper that will come out eventually; I've been working on it for way too long. I may have mentioned it to you, but it's called "Emergent Standards," where I make the case that the web is an engine for people to come up with new standards. And so, basically, the server-client dynamic of the web is that
if you have a server and a client that can talk to each other in a way that makes sense to one another, it works. It's worked with HTML, and then we've figured out other ways to send more and more complex things over it. What I talk about in the paper is RSS, because we wanted to figure out how to syndicate stuff to one another. STAC catalogs. And what's the other one, GTFS, the General Transit Feed Specification. And basically,
Drew Breunig:
Yeah.
Jed Sundwall:
what people don't understand, or what a lot of people in policy don't understand, is that this is an emergent thing that happens as communities grow around types of data. So I'm agreeing with you, but one conclusion I try to land on with this white paper is that this is effectively like language. If it's useful to people, it will be adopted, right? And so to your point about this collection of emails,
Drew Breunig:
Yeah.
Jed Sundwall:
it's practically useful to a lot of people, so people have adopted it and it's become a thing.
Drew Breunig:
Yeah, and I think the other thing too is that it's so much easier to create that standard, or have a successful dataset, if you're operating in the white space where it doesn't exist. So I work on the Overture Maps Foundation, as you know, and that's a little bit of hard mode, because you're trying to establish a standard where standards already exist to some degree.
Like, OpenStreetMap is really built more to be a map rather than a dataset, so it doesn't have great data standards for easy data usage. It's starting to adopt a lot of the moves that we've made at Overture, but at the same time it exists, it provides an alternative, and so it means we have to be that much better. Whereas with the Enron dataset, there's still no replacement for it. I was just looking at the Pile. The Pile is a big dataset that's about,
what is it, about 900 gigabytes? It was used to train Llama; it's used to train lots of open models, and we can assume it's being used to train closed models as well. Again, it's what, 900 gigabytes, and the Enron emails are still in there. They're still one of like 25 sources. There is no better
Jed Sundwall:
Amazing.
Drew Breunig:
email dataset. So operating in the white space means you get more rein to create those standards as you go.
Jed Sundwall:
Right. Interesting. Give me one interlude here. We've got some technical difficulties; we've got to make sure the YouTube live stream is working, or the chat is working. It's apparently disabled, so I'm going to do a thing. Hey everybody, there are people on YouTube. I'm going to click on something and I don't know what's going to happen.
Jed Sundwall:
Now I’m like delayed. I’m watching myself on YouTube with the delay.
Jed Sundwall:
Okay, I think it works.
Drew Breunig:
You got it?
Jed Sundwall:
I think so. All right. Now how do I get out of here?
Drew Breunig:
I mean, look, you got your first episode here.
Jed Sundwall:
All right, we did it. No, we're good, we're good. I can see, it's like all my friends. This is so great. This is like Romper Room. I don't know if you ever watched that. It was a show when I was really little, and it's like, I see Alex and Camilla and Linda. This is good. Okay, so we're good. So hold on, I do want to talk more about the white space, though.
Drew Breunig:
Nice.
Drew Breunig:
Ha ha ha
Drew Breunig:
Yeah, you can wave goodbye to them and you can’t hear us back.
Jed Sundwall:
Define it more. Are you saying creating an entirely new kind of data product, or working in an entirely new domain?
Drew Breunig:
Yeah, well, I mean, I just think there are some things where, and I think you see this a lot in culture and technology too, if you're the first to come out, you have a longer shelf life than the technically best thing, which may come out later. And so you have more ability to shape the standard, which is hard and a lot of pressure, because you can sit there and think about it forever, or you can just release it
and then evolve it quickly as things come. But it's hard when it's a dataset, because you release it and then it's fixed. It's the whale fall moment. You don't get to go back and rebuild the whale and then drop it again.
Jed Sundwall:
Yeah. No, well, and this goes back to what I was saying about the Nobel Prize challenge of what are the economics of data. And I think you know this from working at Overture: it is expensive to produce good data. I cut my teeth.
Drew Breunig:
Very expensive. It's expensive to maintain good data too. I think one of the things that allows for the longevity of these datasets is when you don't need timeliness. It's okay that the people in the Enron email dataset are not still emailing and we aren't still capturing those emails for the last 23 years, because that's not the function of that dataset. It is a demonstration of how people use email.
And there's been no competitor. Whereas if someone came out and said, I'm going to make a business of selling select emails so people can see them — well, we aren't going to see that there, but we do see it in other spaces.
Jed Sundwall:
Yeah. Well, yeah, let's talk about this for a little bit, the shape of a data product. They can take on many shapes, right? So, my first job out of grad school. My life story is I studied foreign policy. I got a master's in foreign policy, thought I was going to work for the State Department. I wanted to be a diplomat. And I grew up in DC.
It had no appeal to me; it had no luster. So I was like, actually, I just want to work on the internet. I had what I've called a coming-out process in 2006 where I was like, I care about the internet, and I don't care who knows; this is just who I am. So I took a job as a marketing enthusiast at eventful.com, which was a Web 2.0 company.
Drew Breunig:
Well, I mean, look, that’s a title that comes in the Web 2.0 era. Marketing enthusiasts. That was a special time for titles.
Jed Sundwall:
True. Yeah. Not a ninja, definitely more of an amateur. Just an enthusiast. But it was my foot in the door, and ultimately I think it was a very good decision. What Eventful did: they gathered all the world's events data they could find by scraping websites and getting access to feeds, then standardizing it and making it available via an API.
Drew Breunig:
Not a rock star, not a ninja. Yeah. Just an enthusiast.
Jed Sundwall:
And what we learned very painfully was that this is very expensive, and the bulk of our database becomes useless every day. Yeah.
Drew Breunig:
Yeah, no, exactly. That's the opposite: it's event data, it's just gone. It's done. And I think you see other people who have to struggle with this as well and try to figure it out. Satellite imagery providers — you and I know many cases where there are several satellite imagery companies trying to figure out how to build a product that makes their old data valuable,
Jed Sundwall:
Mm-hmm. Yeah.
Drew Breunig:
because right now most satellite imagery providers' stuff is valuable because it gives you that snapshot of what's going on right now. But they want to figure out everything else. And you're not going to crack that at Eventful. You're not going to crack that at anything that is temporal in nature.
Jed Sundwall:
Yeah. Yeah. Well, this is actually very timely. Antoine in the chat, I love this, is asking, what about Freebase? This is the issue. It's like, what about Eventful? Eventful never pretended to be an open data resource. It was doing the hard work of taking a lot of open data, or data that was small enough that it didn't feel like we were just ripping people off, because also we were
Drew Breunig:
Yeah.
Drew Breunig:
Yeah.
Jed Sundwall:
highlighting events that people wanted to highlight, but then assembling it into a white-pages-like product, a huge compendium where the product is: we have everything in one place, and then we sell access to it. Long story short, I don't think Eventful exists anymore. The problem it solved has been solved in other ways, and anyway, events are still kind of a difficult space to aggregate. But Freebase: awesome example from around the same era. I think it was started in
Drew Breunig:
No, it doesn’t.
Jed Sundwall:
hang on, I just looked up the Wikipedia page. It was launched in 2007. 2007, for what it's worth, was the year that AWS announced its first service and the year that the iPhone was announced. It's a very consequential year. Very heady days of Web 2.0, seeing what the internet could become. And so…
Drew Breunig:
Yeah.
Succeeded, according to Wikipedia, by Wikidata.
Jed Sundwall:
Yeah. So, I mean, there's room for this. Wikidata — I think people like it. It seems good in some ways. I've never really relied on it very much, but…
Drew Breunig:
Well, Wikidata is a good example of the importance of data UX. One of the things that was so nice about Freebase, and it's kind of what Overture tries to do with its GERS identifiers, is that for everything there would be an entity that you could then walk. Like, here's an entity for Jed; now we can find everything linked to it. And yeah, I think Wikidata is sneakily one of the best
crosswalks on the web. I think they track over 800 different crosswalk identifiers: Apple Maps ID, Google Maps ID, a lot of federal IDs, and everything else. And it is fairly successful. Its API, I think there is a little learning curve for it. When trying to build products off it, it's incredibly good for crosswalking data, though oftentimes you have to jump through a few hurdles to get the data down for that crosswalk. But again, that's like a whale fall. It's the same thing: once Google walked away, it was nice because it allowed Wikidata to exist in a way, and to utilize the Freebase data as its core. But then it had to supply the revenue, or at least the donation model, to keep it going.
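[Editor's note: the crosswalk idea is easy to see in Wikidata's entity JSON. The fragment below is hand-written in the shape of a real API response (the actual data comes from wikidata.org's Special:EntityData endpoint); P646 is Wikidata's "Freebase ID" property, though the specific IDs here should be treated as illustrative.]

```python
# A hand-written fragment shaped like Wikidata's entity JSON. In real
# responses, property P646 holds the Freebase ID, which is how Wikidata
# still crosswalks to the dataset Google walked away from.
entity = {
    "id": "Q95",
    "labels": {"en": {"value": "Google"}},
    "claims": {
        "P646": [  # Freebase ID
            {"mainsnak": {"datavalue": {"value": "/m/045c7b"}}}
        ],
    },
}

def external_ids(entity: dict, prop: str) -> list:
    """Collect every value of one external-ID property from an entity."""
    return [
        claim["mainsnak"]["datavalue"]["value"]
        for claim in entity["claims"].get(prop, [])
    ]

print(external_ids(entity, "P646"))  # ['/m/045c7b']
print(external_ids(entity, "P9999"))  # [] -- property not present
```

The same walk works for any of the hundreds of identifier properties Drew mentions, which is what makes the entity graph usable as a crosswalk.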
Jed Sundwall:
Right. And it all goes back to the fact that this is expensive and hard. These days, the year 2025, there's a lot of concern about sources of data that we had long thought were kind of unimpeachable and were going to be reliably provided by governments. And that's just no longer, you know, a safe assumption to make.
Drew Breunig:
Yes.
Jed Sundwall:
And I've actually been a voice, you know, shouting into the void for years: this was never a safe assumption to make, and we need to think a lot harder about this kind of infrastructure. Because it's hard. It's expensive to produce. And if we could figure out the economics of it and have better markets for data, I think we would have more data. But one of the hard things to grapple with here is that nothing is free, and
what you were saying before about the difference between a fishery and a dataset: there's this phenomenon that I chalk up to what's called nanoeconomics, the economics of individual, very small transactions. So if you examine voting behavior, people are like, my vote, how could it possibly count? It doesn't matter. But votes do matter, right? And like,
Drew Breunig:
Yeah.
Jed Sundwall:
we don't perceive the emissions that we create by living our lives, but they obviously add up. And so, same thing with Wikipedia: it feels free to open up an article, kind of to all involved. Wikipedia itself doesn't really register one page load, and it certainly seems free to you. But Jimmy Wales is going to ask you, he's going to nag you to donate, because they need money. Yeah.
Drew Breunig:
Yeah. And I think there's also the flip side to that, which is something that we see. When the advertising ecosystem was the way you monetized data, I'm sure many people talked to you about the dream everybody wanted to figure out: I've solved the privacy problem in advertising. I'm going to create a system where people can opt in to share their data and they get paid for it.
I know countless companies or people who dreamed of trying to figure this out, because they're like, look, people could get real value if they sell their data. The advertising ecosystem is incredibly huge. The problem is that your data on its own is worth nothing, absolutely nothing. It's worth something in aggregate, but
Jed Sundwall:
Nothing. Yeah, exactly.
Drew Breunig:
nothing by itself. And so people would make runs at this: we're a co-op, we band together, and you try to get some economic innovation, like, okay, you have a longer timeline, take advantage of compound interest, all these other things. But it's kind of the same thing: your usage of Wikipedia is a rounding error, but it's expensive; and the value of the data you create
is a rounding error. We saw this during the ad era, and we're seeing it again. What's the mobile phone network that launched a couple of days ago where it's like, we get training data on all your calls, and so you get cheaper voicemail or cheaper phone service?
Jed Sundwall:
Whoa.
How about that one? Fascinating. Tons of people are going to sign up.
Drew Breunig:
Yes. But again, I haven't looked at the cost. It can't be high. Like, how much of a discount can it actually apply? I'm looking it up because I want to see; I just saw it. Because it's way easier for someone like Meta or Google to just give you the service, where the service is predicated on sharing data. We will just never see that go away.
Jed Sundwall:
Yeah, right.
Jed Sundwall:
No, no, because in aggregate it's just too powerful, too seductive, and they provide really good services. Yeah.
Drew Breunig:
And now we're seeing the flip side of this, which is the Anthropic case right now. How much per book was that settlement? It was like $3,000 per book. If you're an author, $3,000 for a book — for a lot of authors that's going to be a lot; for a lot of authors it is not going to be. But it is more than you would expect. And they're going back to the well, because the judge took away the settlement.
Jed Sundwall:
Yeah, yeah.
Drew Breunig:
And so we'll see where that does net out. I do think trying to figure out the cost in training is hard. And the idea of opting into training — I think you're going to get applications that rise up too quickly and just take your training data. ChatGPT, Anthropic just asked everybody to re-opt-in, to change their privacy settings, because they're going to be training on that. Meta always has, always will. And so,
Jed Sundwall:
interesting.
Drew Breunig:
how are you going to create an ecosystem to pay people within that? They’re just going to go use these services and kind of knock it out. So, I don’t know.
Jed Sundwall:
Amazing. Okay. Well, let's shift to your blog post now then, because let's talk about large language models, speaking of Anthropic, and the basis of these things. Your blog post, which I highly recommend — we linked to it when people registered for this thing, and you can put it in the chat — is a great overview of these three data products. And again, this is another chance for us to talk about what a data product is. So let's start at the beginning and talk about MNIST. Yeah.
Drew Breunig:
Yeah, so one of the reasons I think large language models and AI in general are the fulfillment of "data is the new oil" is because previously, if you wanted to make a computer program, you really had to worry about two things: your software and your hardware. That's it. Just write my software, run it on hardware, I'm done.
With machine learning, deep learning, and now what we call AI and all those subsets of it, you have to have software, hardware, and then data. The data bit is non-negotiable. You need the data because the way machine learning and deep learning works is rather than having the programmer write the rules for what the program does,
You take a sufficient volume of data and present it to a computer program for making machine learning or deep learning models. You give it instructions and ask it to interpret the patterns in the data and figure them out for itself. And within deep learning there's even another layer on top of that: it figures things out without you even telling it what to pay attention to. You aren't labeling it. You aren't telling it. It's just, here's a pile of data, go find the patterns.
Now, in the early days, there wasn't a lot of data, because think about it this way: if you were an early adopter of computers, let's say in 1994, you would go to the computer store, buy your computer, bring it home, and plug it in. And that was that. If you got any data into your computer, it was because you typed it out or you inserted a floppy disk
that you got in the mail or picked up at the store, maybe a CD-ROM if you were real fancy. That's it. There was no internet connection. There was no downloading. So acquiring data was an incredible exercise. As a result, could you build machine learning systems? Not really. You had to have access to data that you just weren't going to get. So people didn't do that. And so
Drew Breunig:
it wasn't a field. It wasn't a thing. Now, people are going to say neural networks were around back in the seventies, and it's true, but there weren't many who could play with them, because access to the data was so limited. And then what we found, and this gets back to the white space, is that really any data that was delivered to your door was brand-new data. There was no competition for it.
I don't know about you, but what would you get for data? Maybe a CD-ROM in your magazine. I think the only things you would have are maybe some Project Gutenberg floppy disks you would pass around, maybe some Encyclopaedia Britannica CD-ROMs you would pull out. There wasn't a world of data. And in this environment comes the first dataset we're going to talk about, because we're going to explain the history of AI
in three datasets. And the first is the MNIST dataset, M-N-I-S-T. Now, today it's on Hugging Face; you can install the Hugging Face datasets pip library and download it. And it's also bundled with almost every machine learning library. So if you install TensorFlow
or Keras or whatever the backend, and then you say install MNIST, it's almost certainly there, because it is the dataset that is the effective hello world of machine learning, going back to the nineties. So what is MNIST? MNIST is a collection of 28-by-28-pixel images, and they are handwritten
digits. Actually, not even letters, just digits, just numbers. They collected these from two sources: one of them from, I think, census employees, and the other one was from a high school class. So this is a classic case of someone having access to two groups of people who were writing down numbers,
Jed Sundwall:
It’s just digits. Yeah. It’s just numbers. Yeah.
Drew Breunig:
filling out forms, filling out tests, and someone in the right position said, this could be useful, or, we're scanning these anyway. We don't really know exactly how it happened, but they basically realized, hey, let's make a dataset of handwritten digits. They didn't put a lot of thought into it or how it might be used for machine learning. One of the issues is, when you're building machine learning systems, you have a test and a train
subset, and you should never mix the data. Your train set is what you build your model on; you learn from it, and then you test the quality of the model on your test dataset. In the initial distribution, one of those datasets was the high schoolers and the other was the census people, which is a terrible way to do it. You should have it all mixed up and scrambled, because you can assume the census people may have different handwriting than a bunch of teenagers
who have had no training. Later they improved this. But again, they put no thought into it at first, and they decided to distribute it. Distributing was literally burning CD-ROMs; you'd have to order it and you would get it in the mail. And this was the NIST dataset, the first one.
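[Editor's note: the split problem Drew describes, training on one population and testing on another, can be sketched in a few lines. The records below are made-up stand-ins for the two handwriting sources, not real NIST data.]

```python
import random

# Stand-ins for the two sources: each record is (source, sample_id).
census = [("census", i) for i in range(100)]
students = [("students", i) for i in range(100)]

# The original NIST split: one population for training, the other for
# testing. Any accuracy you measure confounds the model's quality with
# the difference between the two groups' handwriting.
bad_train, bad_test = census, students

# The MNIST-style fix: pool both sources, shuffle, then split, so train
# and test are drawn from the same mixed distribution.
pool = census + students
random.seed(0)  # reproducible shuffle for the demo
random.shuffle(pool)
cut = int(0.8 * len(pool))
train, test = pool[:cut], pool[cut:]

# Both populations now appear on both sides of the split.
print(sorted({src for src, _ in train}))
print(sorted({src for src, _ in test}))
```

The point of the "modified" in MNIST is exactly this shuffle: the evaluation only measures the model once both populations are present in both subsets.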
Jed Sundwall:
Yeah. And again, I think we need to tell people what NIST is. It's the National something-something. Yeah. So, the government agency.
Drew Breunig:
The National Institute of Standards and Technology. So, the type of people who would be looking at pictures of numbers, and the type of people who think there's something here. Did you ever watch the movie Ed Wood? One of my favorite movies, great movie. You should watch Ed Wood. There's a scene in the beginning where he's on the studio lot. So Ed Wood is famous as the worst movie director of all time. And he's walking the studio lot and he
Jed Sundwall:
No, I really should.
Drew Breunig:
walks into someone's office where they're reviewing the stock film they just shot, which they keep in the studio library to insert into movies later. And he's just watching disconnected random scenes, and he's like, man, you could make a whole movie out of this. Just highlighting how bad his taste is. But at the same time, looking at pictures of numbers and saying "we have something here" is something you expect from the National Institute of Standards and Technology. So they put it on CD-ROMs and mailed them out. And one of the people they mailed them to was a computer programmer at Bell Labs, back when Bell Labs was still the institutional research standard. And the guy who got it there was Yann LeCun, who is one of the godfathers of neural networks and one of the AI leaders at Meta.
Jed Sundwall:
Amazing.
Drew Breunig:
He led Llama and other things, and just released a world model last week. He's a godfather of this stuff. And he had been working on the problem of trying to recognize numbers, because he worked at Bell Labs, and this is something they would want to do: they had to automate reading mail, reading zip codes. That was all it was trying to do: can we look through a camera at zip codes and automate the entire thing? And so, using MNIST, he
trained a neural network, one of the first practical neural networks, and basically delivered a watershed moment in accuracy; the error rate got down to 0.8 percent. He modified NIST, mixing up the sample sets so it wasn't just high schoolers versus census workers, and it became the hello world. And at its peak, AT&T was using this original neural network software to read more than 10% of all the checks deposited in the US,
which was software that got sold by Bell Labs. You will find this in almost every machine learning textbook, every deep learning textbook. And part of it was staged: once Yann got it, he reformatted the data, and this touches on a question someone just asked, specifically for his task of training neural networks,
which is why this data set is so valuable and why it’s become this hello world is that you can do a one line install for MNIST data and it’s ready for you to use. It’s segmented into the different data sets. It’s all standardized. The levels of contrast and anti-aliasing, the flipping reversals, all of those things are all ready for it to be used. And it has kind of survived this test of time and enabled the foundation of the very first neural networks. Again,
This is a data set that was distributed on CD-ROM. It was sneaker net. It was mail. And it, would argue birthed what would later become our deep learning ecosystem that would lead to AI.
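The one-line-install convenience Drew describes is a big part of why MNIST became the Hello World, but under the hood the files are a tiny binary format (IDX) simple enough to parse by hand. A minimal stdlib sketch, assuming the raw `train-images-idx3-ubyte`-style bytes are already in memory (in practice you’d just load it through a library like torchvision or Keras):

```python
import struct

def parse_idx_images(buf: bytes) -> list[bytes]:
    """Parse an IDX3 image file: a big-endian header (magic number,
    image count, rows, cols) followed by raw uint8 pixel data."""
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 0x00000803:
        raise ValueError("not an IDX3 image file")
    size = rows * cols
    pixels = buf[16:]
    # One flat bytes object of rows*cols pixels per image.
    return [pixels[i * size:(i + 1) * size] for i in range(n)]
```

That the whole dataset reduces to "read a header, slice bytes" is part of why it survived the jump from CD-ROM to the internet era unchanged.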
Jed Sundwall:
Yeah. I think, no, I mean, and this guy, I’m trying to pull it up because I think his name is right, Donoho, this guy at Harvard or sorry, at Stanford, David Donoho. He wrote this paper that I still have not finished. It’s very long. I’m putting it in the chat, but look, Donoho is a smart guy, but the title is a little clickbaity for my tastes. It’s “Data Science at the Singularity.” Not a terrible title though. I mean, I think he makes the case that there’s something going on here.
Drew Breunig:
on how.
Jed Sundwall:
but he credits LeCun as the godfather. He would agree completely with what you just said. And the gist of what Donoho says in this paper is that machine learning has made the enormous strides it has because its community has adopted a practice of frictionless reproducibility. So one of these fantastic phrases,
similar to undifferentiated heavy lifting. It’s like impossible to say, but very useful. But this idea of frictionless reproducibility within the machine learning space, where people have been able to share these great data products, compete around them to, going back to your point about a great data product has a community around it, have leaderboards, and it’s just been like to the moon. And this sort of tees up Alex’s question in the chat, you know, like
How would we get, for example, environmental data to be seen by AI models? How do we do that? My answer would be, and this is defending everything that we do with this podcast and also with Source Cooperative, is we would improve access to great data products. Like we would then work hard at that. Yeah. Sure. Yeah.
Drew Breunig:
Well, I think there’s two steps, which is cheating ahead. But there’s a couple things that come in, which is this idea of reproducibility, though. That was great in machine learning and deep learning. It’s really hard now. I mean, Mira Murati, she left OpenAI and founded Thinking Machines Lab, her own lab, one of the many OpenAI people who have left to found one. And right now, they’re focused on reproducibility, because it’s near impossible because of the probabilistic software and
the way inference works at test time. And so it’s almost impossible now, and they’re innovating on that sense. But the other thing I would say, is we’ll get to this. But I think the other interesting thing is benchmarks, which is you don’t just need to put the data out there. You need to define the problem and provide the means for testing against it. And so if you want to say, it’s not enough to get seen by AI model,
Jed Sundwall:
Yes.
Drew Breunig:
because guess what? They don’t care. They’re just gonna go suck everything else. What you need to worry about is that the people building them have a benchmark to build against. It’s the, what’s the, a measure becomes a target or something, it becomes the, exactly. And that’s what it is, which is like, and this gets back to, I would even say, my funding. It’s not enough to just be there. You have to.
Jed Sundwall:
Metrics become targets. I mean, yeah.
Drew Breunig:
you challenge these things and provide a mechanism for measuring success. If you don’t do that, no one’s going to care about it. But yeah, so that’s Yann LeCun. He’s doing his thing with CD-ROMs, sending it out. And it’s crazy to think part of what the internet has done and broadband is it speeds everything up because it makes exchange so much easier. And yes, the test benchmarks need to be actually relevant to the use cases. Yes.
The thing about benchmarks is that they are shipped by people who care about specific things. If you’re shipping a benchmark and you don’t have an understanding of why it’s important and why you care about it, and you have some stake in what that is, you’re wasting your time. Like, why are you shipping a benchmark in the first place? The point of putting the benchmark out there is to challenge people to perform against the thing that you care about. And there’s lots of
great examples of that.
Jed Sundwall:
Actually, can you help educate me on something I’m like very naive about, and this is embarrassing, but I’m just going to be vulnerable on this podcast. So to Tyler’s point, there’s been, my understanding, a lot of discussion about benchmarking with like earth observation AI models and stuff like that. And a gripe is that you can benchmark these things based on some sort of like, you can create like a technical benchmark or something like that, but it is divorced from reality, like from what’s actually happening on the ground.
And it’s basically, like you can test, you can run a model and then test it to see if it’s performed in a certain way that like indicates that it’s a good model, but that does not indicate if it’s actually useful. Can you explain this to me a little bit more?
Drew Breunig:
Yes.
Yes.
Well, I disagree with that. I think there’s lots of ways you can game benchmarks, but here’s the best way to think about benchmarks in my opinion, is that they are an encapsulation of knowledge with an opinion that allows you to test your performance against that encapsulation of knowledge. Yeah, we’ll talk about overfitting in a second, Joey. That’s very much a thing.
But the thing that I have that’s a problem is like a lot of people in earth sciences or sciences in general is they go to like big private companies and say, my thing is really important. You need to build against it. And that is first off, you have to get them to believe your thing is important. And then B,
They have to get up and running and understand that space really, really, really, really well. And then they have to build against it and follow it to create their own benchmark. So when you create a benchmark, you are doing that work for them. And when you do that work for them, you get to encode the things you care about.
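Drew’s “encapsulation of knowledge with an opinion” can be made concrete: a benchmark is frozen cases plus a scoring rule. A minimal sketch in Python, where the case set and exact-match scoring are illustrative choices, not any real benchmark’s API:

```python
def score(model, cases) -> float:
    """Score a model against frozen (prompt, expected) cases.
    The opinion lives in two places: which cases you froze,
    and what counts as a match (exact match, here)."""
    hits = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return hits / len(cases)

# The benchmark author encodes what they care about by choosing cases.
CASES = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
]
```

A model that answers the arithmetic case but not the geography case scores 0.5 here; publishing the cases and the scorer together is what lets everyone compete on the same terms.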
Drew Breunig:
It comes back to the like, there’s a I think it’s Louis Pasteur quote, which is, give me a laboratory and I will move the world. And he was talking about it in the case of like being able to freeze benchmarks and maintain science or freeze a variable and maintain science. And so if you can create a benchmark, you are creating the eval reality that you are asking that model to be held against. And this happens for lots of things. And so I think right now,
The two most successful benchmarks are the ARC-AGI benchmark, which François Chollet built, which is, again, he basically said, everybody’s talking about AGI, but it’s not reasoning. It’s really just fact memorization and repetition. He has a different thing, which is like all about pattern recognition. It should be incredibly easy for a human to do, but incredibly hard for a model to do. And so that has been…
kind of the thing. He has been in the deep learning space for over a decade. He is a leading voice. He created this, and all of a sudden it became the thing that everyone starts to brag about when they get this, because it’s really hard. When o1, OpenAI’s o1, was the first one to do it even somewhat passably, it was a really big deal. And ever since, like, we’re still kind of chasing it. So like both
his design, his leadership, his brand helped set that as this big thing. The other more tangible example, of like, you don’t have to be a leader in the space, you just found the white space, is a benchmark called Terminal-Bench. So Terminal-Bench is testing a model’s ability to use the terminal, use tools in the terminal. So with coding agents, this is so important.
Jed Sundwall:
Mmm. Yeah.
Drew Breunig:
Why do I care about having MCPs? Why do I care about having all these crazy tool sets? Just teach the model how to use the terminal and all the problems are solved. And this was put out by a really great team, and they designed it in a specific way to basically get the agents they want. They spent a lot of time on this. This is out of Stanford and funded by Laude. And this has now become
the thing that people gauge against. Like Anthropic, if you look at when their models come out, they will always put the Terminal-Bench benchmark as like their top thing. When they bumped Claude Opus from 4 to 4.1, the main thing they cited was their Terminal-Bench improvement. So that’s a good example of like, I’m creating the package of the reality I want from this. So someone in the chat replied to your…
Jed Sundwall:
Yeah.
Drew Breunig:
Earth observation benchmark, which is like, all right, benchmarks are great, but my gripe is that most Earth observation benchmarks, so it’s looking at satellite imagery, they’re focused on object detection. Very few are focused on temporal signatures of change. Well, what that says to me, Tyler, is that’s an opportunity for you to create a benchmark, or for someone to create a benchmark, to measure this capability that you want to build into this model. A benchmark is a data product.
It is honed, and I think it’s kind of the current way that data products are released, or one of the main form factors they can take in this moment. Yes, you have a worry about overfitting. SWE-bench is the software engineering bench. It was, again, one of the biggest, first to market, which is: can a model take GitHub issues and submit changes?
Jed Sundwall:
Hmm.
Drew Breunig:
and submit PRs that pass. And it was adopted quickly as the main thing people were building against. I talk to AI researchers at foundation model companies and they’re like, I’m just trying to get another point in SWE-bench. That is what keeps me up every single day. But again, it has its own shortcomings. Like 50% of SWE-bench is just the Python Django library. So it’s really good at building the Django library, but maybe not very good at some…
Rust, or maybe not very good at, you know, some data pipeline you’re building. So again, these things shape the outcomes, and communities grow up and private companies grow up. And so that’s kind of why I think benchmarks are kind of a modern data product.
Jed Sundwall:
Interesting. Okay. There’s a lot to think about. I’m looking at the clock. I want to, boy, where are we going? I want to talk about Common Crawl, but also, like, we did not specify an end time for this, because like good podcasts just go off the rails. Do you have a hard stop at the top of the hour? Oh, then okay.
Drew Breunig:
Yeah.
Drew Breunig:
I have a hard stop, but yes, not at the top of the hour, at one.
Jed Sundwall:
Okay.
Drew Breunig:
Yeah, we got an hour and 10.
Jed Sundwall:
Yeah. So we booked our own time. For those listening, we’ve blocked our calendar for two hours so we can go this long. We’re going to go for as long as we want, but no further than 1 PM Pacific.
Drew Breunig:
There you go.
Yes.
But I think this, the benchmark thing, that’s how we talk about it today. But transitioning into, from MNIST, we went to ImageNet. And that is something that Fei-Fei Li created when starting at Princeton, because she built it as that challenge. She saw that there was a WordNet,
Jed Sundwall:
Yes.
Jed Sundwall:
Yes.
Drew Breunig:
which was out there, which was essentially a natural language processing training data set. And she said, well, I want this for images because I want people to build better image recognition models. And so to do that, she realized they needed a way to test it and train it. And it became not only a thing you could train models on to improve the software, it also became like the foundation of the improvements of kind of deep learning in general, which is, again, you put out your challenge
and you make people go to it. It was a benchmark as much as a data set.
Jed Sundwall:
Well, right. And like built in somehow, you know, I don’t know how she did this in terms of like funding and her stature at Stanford or whatever, like challenges. It was just sort of like, this is a data product, you know, that we’re putting out there, and we’re going to run challenges. And this is one of these overnight successes that took something like six years or something like that. I don’t know like when, it was a long time before AlexNet came out.
Drew Breunig:
Mm-hmm.
Drew Breunig:
It was a long time. And they, also, yeah. And I think the other thing too is like they had to create it the only way they were able. So MNIST came out on a CD-ROM pre-internet. ImageNet could only have been created after the internet existed, because they leveraged Mechanical Turk. They leveraged Google image search. They basically were just paying to label images, at a price that just would not have been possible before.
Jed Sundwall:
Mm-hmm.
Drew Breunig:
So I think there was a couple years that, because ImageNet was pretty, it was after Common Crawl first launched, but its breakout moment came before Common Crawl’s breakout moment occurred. And so AlexNet was 2012, whereas ImageNet was like, I think 2008, 2007. But yeah, and so go ahead.
Jed Sundwall:
Okay. Yeah.
Jed Sundwall:
Well, ImageNet’s another interesting example though of, well, this is when I said I wanted to talk about licenses because when I was at AWS, people were like, hey, you should host ImageNet in the open data program. And I’m like, I mean, sure. Like I think that would be cool if we did. Also like people can get it. Like you don’t need, you didn’t need S3 necessarily to get ImageNet. Like people, you could download it. Like it wasn’t like so huge.
Drew Breunig:
Yeah.
Jed Sundwall:
that it mattered so much. But I was also like, look, my lawyers aren’t going to like this. Like if we’re going to host these images, we don’t know. Yeah, they’re just like random licenses all over the place. And, but it just reveals just how like, how brittle this sort of like licensing regime is for this sort of stuff where it’s like, look, who’s, who’s going to sue you honestly, because you’re using some like, like 120 by 120 pixels square picture of like a dog.
Drew Breunig:
peeled off of Google search, like…
Jed Sundwall:
Like, you know, like.
Drew Breunig:
Yeah, I mean, it is that weird thing where it’s like, it’s fine to bootstrap it. But if like you’re really successful, someone comes knocking. It’s kind of like, you know, Google looks the other way on people using Street View images, even though they know, they know that they are being crawled in some way or another.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Yeah. No, I mean, or, you know, come for Anthropic once they’ve raised enormous amounts of money and they’ll be like, sure. Great. Actually, like we’re, it’s an honor to pay this because we know that no one can come up behind us now. It’s like, you know, cause we have got the cash.
Drew Breunig:
There you go.
Drew Breunig:
Yeah. And that’s what you’re paying for. I mean, some would argue that’s why Google bought YouTube, purely to buy the court case, or one of the main reasons. So yeah, so ImageNet was basically a database of, I think, about 1.4 million images that were labeled. A thousand categories. 1.4 million. And then just said, hey, every year we’re going to hold a contest
Jed Sundwall:
Interesting. Yeah. Okay.
Jed Sundwall:
Something like that.
Drew Breunig:
to see who can get the best one. Now the idea of waiting every year is a positively quaint notion. People just download and run the benchmarks every single day. You had to upload it, the whole thing. But I do think ImageNet was every year. And so that went around for a while. Side note. Oh, go ahead.
Jed Sundwall:
Yeah.
Jed Sundwall:
There you go. Do this.
Drew Breunig:
Side note, I was thinking about this last night. So in a former life, I was a media strategist at a large media buying company. And in 2009, I was writing media strategy for Nvidia. And I was thinking about this last night because Nvidia had a new technology that they were very excited about called CUDA. And they…
I remember going down to the briefing and they’re like, here’s what we’re going to show at the floor at our next big conference. Here’s all the demos for CUDA. CUDA is this idea of we can use GPUs for generic computing and we can use it for immense parallel processing. We think this is going to be really big. And we would ask, all right, well, what are people going to use it for?
And they had like eight demos. None of them were machine or deep learning. There was a couple of biotech ones about like protein folding or what have you. There was a lot of cloth simulators. like, hey, we can sell this to fashion designers to simulate how a cloth is going to drape over someone. They just had like tons of different things. And they had no idea what they were going to use CUDA for. They just knew it was going to be this big thing.
Jed Sundwall:
Yeah.
Drew Breunig:
but they had no idea. And so like, and you could make a very strong argument that like CUDA is the reason Nvidia is in the position it is today, being, you know, one of the most valuable companies in the world. And they had no idea what it was for. And so, CUDA came out 2008. I worked on it in 2009. And it was still this thing, like no one knew. You would go around and they just would look at you and they’re like, well, we can do these things. And you’re like, that’s kind of interesting. And it wasn’t until 2012
Jed Sundwall:
yeah.
Drew Breunig:
where they kind of had the first glimpse. So we’re hitting the big names here. Geoffrey Hinton, who did win a Nobel Prize for deep learning and machine learning, Ilya Sutskever, who later would co-found OpenAI, Alex Krizhevsky, I can never pronounce his last name, Krizhevsky. They built AlexNet, which basically performed against ImageNet
with a score of 84.7. And you have to understand this was a 10-point-plus difference compared to any previous competitor that year or before. And it was the first time that they had used deep learning accelerated by a GPU. And they were just using two consumer GPU cards. They were using basically what you would buy to game at that time. And that basically started deep learning.
Deep learning was like how we talked about AI before AI. And so that was kind of what set it off. And I think the big step change here is that, again, this comes back to the benchmark thing, which is Fei-Fei Li created this space essentially out of a benchmark.
which is deep learning became a thing because its value was proven because someone built a data set and then people gamed to see how well they could perform against it. A benchmark is essentially a data set with intent. And when you ship that out into the world, you get people to do things against it if you make it exciting, if you make it collaborative, and if you’re operating in the white space.
Jed Sundwall:
Yeah. Yeah. This is, I mean, I’m going back. Linda said something like, wait, yeah. She said data is typically purpose-built. Understanding this will force us to examine our data more rigorously, creating a significant demand for data repurposing, especially with AI. What I’m hearing, or where this is coming together for me, is that you can produce a data product, and we’re going to talk about Common Crawl next. I mean, we should. And then
you have to produce benchmarks attendant to that data set or that data product, which are basically just any number of arbitrary goalposts that you want to set. Maybe like, because Common Crawl is so rich, obviously it can be used for so many things. So you just need a benchmark for each of those things, you know, and just say like, well, can you do this? Can you do that? Yeah.
Drew Breunig:
Yeah.
Drew Breunig:
Yeah, and I think some of the most interesting things out there are benchmarks. So we talked about Terminal Bench is one of my favorite examples. The other is the Berkeley Function Calling Leaderboard, which is just testing how well LLMs can use tools that are given to them for agentic purposes. And it’s really, really interesting. And then
What’s the other one that I really like? It’s not empathy bench. What is it?
There’s another one. Sorry, John, here it is: Sam Paech has a great benchmark, EQ-Bench, one of my favorites. And he maintains this himself. Just some dude, love his stuff. He’s like, I’m interested in having LLMs become better writers. And again, it’s like one of those things that’s really hard to quantify, how to make it a better writer. So again, he just, he…
He’s like, here’s one metric. Here’s another metric. One metric is like, how often do you reuse the same phrases? OK, great. That’s great. We can do this. But two, long-form writing. It’s like all of these. And it’s a really interesting thing. And he admits, he’s like, this isn’t perfect. But again, you start to see people building against it. And it does start to influence and shape the arc of development.
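One of the metrics Drew mentions, phrase reuse, is easy to sketch: count how often the same short phrases recur in a piece of writing. A toy illustration in Python, not the actual scoring any real benchmark uses:

```python
from collections import Counter

def repeated_trigram_rate(text: str) -> float:
    """Fraction of 3-word phrases in the text that appear more than once.
    Higher values suggest more repetitive writing."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeats = sum(c for c in counts.values() if c > 1)
    return repeats / len(trigrams)
```

Any single number like this is gameable, which is why such benchmarks combine many imperfect metrics rather than trusting one.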
Jed Sundwall:
Yeah. Well, let’s talk about Common Crawl a bit more in depth, but I’ve got to shout out Sam at Common Crawl. She’s like, hey, we have an event coming up. So for people who are in the Bay Area, at Stanford on October 22nd there’s an event called Preserving Humanity’s Knowledge and Making It Accessible,
Drew Breunig:
yes.
Jed Sundwall:
addressing challenges of public web data. This is the kind of thing I would love to go to. I’m unfortunately booked at another event. Chan Zuckerberg Initiative, I think, whatever CZI stands for. I’m going to be at one of their open science events. But man, if I weren’t with CZI, I would definitely be trying to go to this thing. So, and you can watch online. So I put the link in the chat and we’ll share this. I think we should share this podcast before October 22nd for sure. So.
Drew Breunig:
ooo
Jed Sundwall:
Shout out to Common Crawl. Drew, tell me all of your deepest thoughts and feelings about Common Crawl. It’s a great story.
Drew Breunig:
mean, Common Crawl is novel for how early it started and that it wasn’t really built with machine learning or AI in mind. It was, so to give you some perspective, the Common Crawl project, is essentially, it’s like the idea is that, hey, we’re going to scrape the internet and put it in one data file ready for people to use. So you don’t have to go scrape it.
Jed Sundwall:
Yeah.
Drew Breunig:
because again, we believe that lots of people can build things if all of this is accessible, and so the net value out of it would be tremendous. It began in 2007, the same year that Fei-Fei Li launched ImageNet. And so Gil Elbaz, yeah, good old Gil, he started it
Jed Sundwall:
Yeah.
Drew Breunig:
and he formed the Common Crawl Foundation. It’s funny, he founded it as he left Google, so it kind of tells you, you know, what his motivations were, which is like, I want to build, essentially, I don’t want Google to get a lock on the internet. I want to kind of expose the thing that’s really expensive to bootstrap and start up, especially in 2007, which is crawling and preparing all of the files. And now it’s a single data set, essentially, with
250 billion web pages collected over nearly 18 years. And about three to five billion pages are added a month, though, sadly, Common Crawl is getting shaped a little differently because its crawlers are getting blocked. And the reason its crawlers are getting blocked is because of AI-driven crawling. In a weird twist of fate, Common Crawl became one of the foundational things that early language models would train on.
It would become a critical ingredient in The Pile, Google’s C4 data set, basically subsequent data sets, kind of child data sets, which is like, hey, we’re not going to include every single forum, or we’re not going to include, you know, duplicative data, we’re going to filter all this stuff down to the high-quality stuff. But then once you start building it, and this is where it gets into the data-is-like-oil thing: let’s say I use that to build my model that later becomes ChatGPT. I have so much,
Jed Sundwall:
Right.
Drew Breunig:
I’m not going to rely on common crawl anymore. I’m going to start building my own crawlers and go out to the things that I care about and do it with a much greater frequency so that I can improve my model. You get enough of these, which you do. There are a lot of people out there hosting websites right now that are having to think about how to gate their content to prevent legitimate and gray market crawlers that are just hammering their sites. And so now, like,
Common Crawl created this thing, but now we’re kind of having a tragedy of the commons, which is everyone who grew up around it now sees running their own crawler as a competitive differentiation. And they’re going out there and kind of doing that themselves. All the while, Common Crawl is still going, but its surface area is starting to shrink a little bit, because different web pages are shutting off access to crawlers because of this mess that it has created. So I do think it’s the closest thing that we have in data to a tragedy of the commons.
But yeah, I’ll pause right there before I talk about why the text is so important.
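Common Crawl’s “scrape the internet into one data file” product is concretely a set of WARC files: each record is a block of plain-text headers, a blank line, then the captured payload. A minimal stdlib sketch of splitting one record (real crawls are gzipped and enormous; in practice you’d iterate them with a library like warcio):

```python
def parse_warc_record(raw: bytes):
    """Split one WARC record into its header dict and payload bytes.
    Headers end at the first blank line; Content-Length bounds the body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    # lines[0] is the version line, e.g. "WARC/1.0".
    headers = dict(line.split(": ", 1) for line in lines[1:])
    length = int(headers["Content-Length"])
    return headers, body[:length]
```

The format’s simplicity is part of the point Drew makes: once someone has done the expensive crawling, anyone can process the result with a few lines of code.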
Jed Sundwall:
Yeah, no, I mean, that’s an amazing story. Gil has told me, he’s like, I’m pretty sure Common Crawl is like the most impactful nonprofit ever. There’s definitely a case to be made there. I don’t know exactly how you’d quantify that, but holy cow. Yeah. Yeah.
Drew Breunig:
Yeah, mean, because everything grew up around it. even you’ll look and people will say, so-and-so didn’t use common crawl. But then you look at the data sets they did use and they were derived from common crawl. So it basically fueled the entire first wave of large language models, which is what percentage of our GDP at this moment?
Jed Sundwall:
I think it’s 140% of our GDP. Yeah. Yeah, none of the math makes sense when you’re hearing what people are talking about large language models now.
Drew Breunig:
Yeah, 100 % it is. We don’t know how that’s possible, but it is. That’s what we’re
Yeah, and this is like one of those weird things that like when he built this, like, one of the weird things about large language models is that everyone was kind of surprised when the first large language model like worked, like, like attention is all you need. Because like, it’s this thing where like previously, you would have to put structured data into these deep learning models, and then they would have to figure out the relationships. No one at the time like when when people thought of structured data,
Jed Sundwall:
Right.
Drew Breunig:
they thought of the work that Fei-Fei Li put together with ImageNet, which is here’s an image and here’s some labels. And so the big gate for deep learning is like, anyone who wants to build on deep learning, they’d say, all right, well, where am I going to get that labeled data? Where am I going to get that structured data? With large language models, the thing that was shocking to everybody is like, wait, language is structured,
because we can see the order of the words. Some words come before each other, and some come after, in all of these assemblages. And we don’t need to label language because it’s already organized and structured. We just have to have enough of it. That was the thing. The magic thing was that you built something big enough that it would display spooky, intelligent qualities. And that was what Common Crawl enabled. Because if you didn’t have that, you couldn’t test that
Jed Sundwall:
It’s wild. Yeah.
Drew Breunig:
randomly, because you would have had to stand up your own crawlers before that. So like the fact that it just existed allowed for that discovery to be made, which is why I think I wouldn’t argue with Gil’s claim.
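Drew’s “language labels itself” point can be shown in a few lines: every position in raw text yields a training example whose label is simply the next token, no human annotation required. A sketch where whitespace splitting stands in for a real tokenizer:

```python
def next_token_pairs(text: str):
    """Turn raw text into (context, next_token) training pairs.
    No labeling needed: the label is just whatever token comes next."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
```

This is why a giant unlabeled corpus like Common Crawl was enough: the supervision signal comes free with the text.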
Jed Sundwall:
Yeah. No, it’s incredible. I have another apocryphal story. I mean, we hosted Common Crawl, my program at AWS is the home of Common Crawl. I have stories that I probably shouldn’t tell, so I won’t. Like, it’s phenomenal and kind of insane. And I was joking about this last week at an event at Climate Week,
because I was in a room with a bunch of organizations, I’ll just say very large corporations, not a government in sight, talking about sustainability data for global supply chains. I won’t go into much more detail than that. But I said, you’ve got to understand, there’s this story about this guy, this one dude, granted a billionaire, who’s just like, here’s a thing I’m gonna do, and does it. And it has this huge impact.
And I’m like, this heartwarming story of the impact that one billionaire can have on the world. But the point also being that like, it is possible to create a data product that has a very consequential impact. And if you feel like there’s something there, there might be something there. In Gil’s case, I mean, my story, at least from what I recall, him explaining this to me is that he creates AdSense.
Drew Breunig:
Yeah.
Jed Sundwall:
it’s acquired by Google, he spends his time at Google, and he’s like, there’s gotta be some kind of fail-safe for this kind of thing, where we can’t have one company that, you know, owns all of the world’s information. There’s some irony in the fact of like what Anthropic and OpenAI are becoming is just sort of the next version of that sort of thing. But you know, I’m not mad about it. Like, yeah.
Drew Breunig:
But I mean, I think about that a lot. I think it’s interesting now we’ve gone from the crawl being the thing that’s valuable to the interaction data. So like when they were talking about breaking up Google, one of the things that they were talking about was making the ranking data, like making the index open, which isn’t just the data. It’s also the relationships that exist in the data. But again, one of the things that I’m shocked about with LLMs, which I
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Drew Breunig:
find to be really interesting, is that no one’s running away with it. Sonnet 4.5 came out and said, hey, this is the best model this week, the best coding model. But the thing is, the difference between Sonnet 4.5, GPT-5, even the open models, the larger Qwen coding models, they might not be perfect, but they’re a lot closer than you’d think.
And it’s to the point where like everybody jumps on whatever the newest thing is, but you could just be like sitting on, you could have been sitting on GPT-4o for a year and you would have been fine. And I do think what’s wild is that the floor is coming up faster than the ceiling. The ability of 7 billion parameter models to effectively, you know, double in quality every year is just absolutely insane. And so like,
You will get some things from like throughput and other things like that. But like, I think the weird thing is that even if these guys win, you may end up having like free access to something running on your device. That will be, it’s bizarre and it’s really weird to think about.
Jed Sundwall:
That’s incredible. Yeah. Yeah. It is. Well, let me, I want to go back to the data-is-oil thing and how LLMs change this sort of stuff. And Alex left another comment about, you know, people trying to use robots.txt, or there’s like llms.txt, to try to influence how the bots can navigate the web. So I have this theory, I’ll just bounce it off of you. I don’t know if it’s a theory, but this idea that like,
Drew Breunig:
Yeah.
Jed Sundwall:
So the internet has been full of really amazing data for a very long time. And a lot of us who've worked in open data have just been scratching our heads about it: well, why doesn't it get used? You know, there's all these open data portals that don't get used. And one of my answers to that is that humans don't know how to use data, by and large. If you just take a sample of, like, a million humans, you're going to get a very small percentage that actually know how to do stuff with data. And...
And also have time. I mean, this was always kind of the funny thing, an early realization for me when I was working in civic tech: there's people that are like, yeah, we'll just open up our city's data, and then people will just do cool stuff with it. And I'm like, hey, if someone knows how to do anything with your data (which is not that good, it's kind of a pain to work with), they have a job. You have a narrow window of, like, college kids and civic tech activist types
who, before they... exactly, have kids, I was just gonna say get a wife or a husband and have a job... they're willing to do that sort of stuff. And that's it, and they just kind of go away after a while. But LLMs, 24/7, can do stuff with data. And so we are at the point where I think that we might have created a market for data, if we can get... and here's my crazy idea.
Drew Breunig:
have children and full-time jobs. Yeah.
Drew Breunig:
Yeah.
Jed Sundwall:
Tell me if I'm crazy. Also, I think this is already happening: like, OpenAI and Anthropic should pay for data. They should just, hey, they come to some data portal thing where it's like, hey, we maintain this data. If you're a bot, we're gonna charge you a ten-thousandth of a penny per request here, so that you can... you know, it's basically your research budget. Yeah, I think it's a good idea. I don't think Cloudflare... I think Source Cooperative should do it.
Drew Breunig:
Well, I mean, that’s what Cloudflare is trying to do.
Jed Sundwall:
because we’re not owned by anyone, but anyway.
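The per-request metering Jed floats here can be sketched in a few lines. Everything below (the bot user-agent markers, the operator names, the fee) is invented for illustration; this is a toy, not a real billing system:

```python
# Hypothetical sketch of metered bot access: recognize known AI crawlers by
# user-agent and charge their operator a ten-thousandth of a penny per request.
from collections import defaultdict

PER_REQUEST_FEE = 0.01 / 10_000  # a ten-thousandth of a penny, in dollars

# Illustrative user-agent markers only; a real system would verify bots properly.
KNOWN_BOTS = {"GPTBot": "OpenAI", "ClaudeBot": "Anthropic", "CCBot": "Common Crawl"}

def meter_request(user_agent: str, ledger: dict) -> bool:
    """Bill the operator's ledger if the request comes from a known bot.

    Humans are served for free; bots are served too, but billed."""
    for marker, operator in KNOWN_BOTS.items():
        if marker in user_agent:
            ledger[operator] += PER_REQUEST_FEE
            return True
    return True  # human traffic: serve without charge

ledger = defaultdict(float)
for _ in range(1_000_000):  # a million bot requests
    meter_request("Mozilla/5.0 (compatible; GPTBot/1.0)", ledger)
print(f"OpenAI owes ${ledger['OpenAI']:.2f}")
```

The point of the arithmetic: at a ten-thousandth of a penny, the money only becomes real at crawler scale, roughly a dollar per million requests.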
Drew Breunig:
Yeah, no, I think it's an interesting one. And the incentives are absolutely crazy to think about.
Drew Breunig:
I mean…
Jed Sundwall:
Don’t loan your mind.
Drew Breunig:
I'm thinking about what angle to approach that from. What do you optimize for? Also, do you mind if I take a quick break, a one-minute break, and be back while I think about this? We'll handle it in the edit. One second. Someone's knocking.
Jed Sundwall:
Sure.
Okay. Okay. All right. For those of you who are watching the live stream, someone knocked on Drew's door and he had to get it. I'm going to use this chance, because I don't know when this is going to end, but we still have some people on here. For those of you who don't know about the Cloud-Native Geospatial Forum, we did an event in Utah this year, at the end of April, early May, at Snowbird. It was fantastic. Everyone loved it. We polled everybody at the end of it and
we got, like, five stars. I don't know, 97% of people said they would come back, and they loved it. So we're doing it again. So you're hearing it here first. We just lost a follower, but anyway, we're gonna be doing the Cloud-Native Geospatial Forum conference again, October 6th to 9th. Not next week, but October 6th to 9th, 2026. So we're gonna do it in the fall next year, but we're gonna do it again. We'll have a landing page up before too long,
and, you know, we'll have links to share out. But anyhow, it's very exciting. Alex left another comment. Yeah, so exactly. Alex leaves a comment saying, you know, a lot of what journalism orgs, Reddit, and orgs like Wikimedia are doing with their enterprise APIs is locking them down. I think this is fine. People were coming out of the Web 2.0 era, and I think a lot of the excitement around having open APIs, like,
Drew Breunig:
So.
Jed Sundwall:
is understandable, but now we're realizing... we have about a decade of knowledge now to understand that this has a cost. Yeah.
Drew Breunig:
Well, I mean, the other thing that's crazy about it too is that a lot of the Web 2.0 dream is being enabled by LLMs, but now you go to the meme: like, "not like that." We dreamed of and loved the idea of a semantic web, where you could ask questions and just access things. And it has been delivered to us, and it has been delivered not as an open force but as an intermediating force. And now we're having lots of second
Jed Sundwall:
Yeah. Yeah.
Drew Breunig:
questions about that.
Jed Sundwall:
Yeah. So, I mean, yeah, we're going to have to figure it out. But what I would want to say is that it's fine. I think we should just be sort of sober about this and say, if we want to have reliable access to data in these ways, someone should pay for it. And what's interesting about ChatGPT is that people pay for ChatGPT. Like, I pay for ChatGPT. It should have a research budget. Like,
some fraction of those pennies could go towards maintaining accurate, up-to-date data about school enrollment in America, or whatever it is, whatever kind of research I wanna do. Then there's actually money flowing, because that kind of stuff was never gonna be supported by an ad model. Yeah.
Drew Breunig:
Yeah.
Drew Breunig:
Yeah. I mean, I don’t know. It’s going to be supported by an ad model eventually. Don’t worry. It’ll come. I don’t know if you’ve seen the announcements OpenAI has made over the last couple of days. They’re very much ad model friendly. They’re selling stuff. They want to give you a morning report where they browse the web for you and go find all the things you should be looking at. And that’s going to have an ad in there. I mean.
Jed Sundwall:
Yeah.
Well, either way, they're selling stuff through ChatGPT.
Jed Sundwall:
man.
Drew Breunig:
I mean, well, and I'm gonna play devil's advocate to you here, because if there's one thing I get frustrated about in the open space, it's people saying, well, we should be paid. This idea of, like, you're making money off of my library, you should be paying us, you are freeloading. And it bothers me because,
I agree with it. In a perfect world, I want every open project that gets usage funded. But your argument cannot be "we should be paid, you're making money off of it." It needs to be a realistic, practical, pragmatic exchange for how you deliver that. And so I do think there is, like, there could be a mechanism for the way information gets distributed and accessed.
And I think it’s going to get really fraught right now because like…
The whole ad model is going to go crazy, not just because it's going to get intermediated. The ad model is based on attention. And if we have these agents out there making decisions for us as proxies, that attention is now theoretically infinite. How do we govern that relationship, and how does it get re-monetized? So.
Jed Sundwall:
Yeah. Yeah. So I'm with you. I'm putting it in the chat, just to flog my own blog: the gazelles blog post from, you know, a year and a half ago. One thing I haven't been explicit about, and there's going to be a follow-up blog post at some point, is that the idea of a gazelle is that we should have entities that are, I would say, non-owned, not owned by investors exclusively,
Drew Breunig:
Sure.
Jed Sundwall:
that provide some sort of... usually they're providing data, but they are accountable to the market. And so I'm with you in that the conversation needs to go way beyond "we should be paid." There's so much entitlement in the open community, it drives me insane. It's like, you should give me data for free, it's a public good. I'm like...
Drew Breunig:
Yeah.
"And everybody should be giving data away for free." Like, I want people to think about their monetization policies, because it gives them control over their own future. And that is me clarifying why I get frustrated when I hear open people begging for money, because that's what it is. They don't have the leverage. They've never thought about it before, and now we finally have to come back to it. And so I encourage everyone to think about money before you think you need to, because it's going to help you control
Jed Sundwall:
Yeah.
Jed Sundwall:
That’s right.
Jed Sundwall:
That’s right.
Drew Breunig:
your future and your destiny and not end up being beholden to something else.
Jed Sundwall:
That's exactly right. And it's hard. I'd say I'm fighting on two fronts with my notions of gazelles and new public sector organizations. One is the easy one, where it's like, these billionaires have too much power, and some of these tech companies are out of control with too much power. People are like, yeah, blah, blah, blah, we all kind of agree. The harder battle to fight, though, is for me to go to my colleagues in the open world and say, hey, we should maybe put a price on what we do,
and think about the value of what we're doing, and see if the market supports that. And they're like, what? I believe a huge part of this, the cultural legacy of the philanthropic world, comes from, like, European aristocracy, where they're like, we do not touch money. I don't work for a living. It's, like, leisure-class stuff.
Drew Breunig:
Yes.
Drew Breunig:
Or the money... I think it goes back to, like, Stallman and others, the cathedral-and-the-bazaar type thing, which is: we should have this free exchange, everything is better with exchange, everything is better with open. But then we often get issues.
Jed Sundwall:
Well, you get steamrolled by people who actually have market power.
Drew Breunig:
I see the Ruby community right now. I don’t know if you’ve been following that, but that’s a good example.
Jed Sundwall:
A little bit. I saw they created a foundation. Tell me more.
Drew Breunig:
No, there's just a governance argument right now about who has control over what, what org has control over what, and how much power Shopify has as the big player bankrolling everything in this example. And so you have all of these things that stack together until you get into these uncomfortable scenarios, when the incentives are not aligned, or not aligned the way you expect them to be aligned,
Jed Sundwall:
Interesting.
Drew Breunig:
which is almost just as dangerous. And so like, I do think there is a market for data, but like you have to provide the utility of it. And I do think like it comes back to data discovery and data democratization, but like, we’re not going to create these things just because we want them. We have to create them and build the structures around them.
Jed Sundwall:
That's right. That's right. And so that's what I need to figure out: can we create some sort of mechanism whereby... look, I'll just talk about Source, the vision for Source. This is the Source Cooperative podcast. We have this notion of data products. A data product, in our opinion, is a collection of files, of objects, that have been shared by an organization or a person, and you know who they are. That's fundamental to Source: this is a data product that came from Planet,
Drew Breunig:
Yeah.
Jed Sundwall:
you know, the satellite company, for example, and it is up to the user, the beholder, to determine whether or not that data is worth their time. And what is interesting to figure out is how we could communicate that to an LLM. Could somebody say, hey, ChatGPT, I wanna know this information, but I only wanna get data directly from Planet, or from NASA, or the Census Bureau, or whatever it is?
And then it's up to OpenAI to determine, yeah, sure, we're willing to throw a few shekels over to Planet to get access to this data and return it. Because, you know, my assumption is that OpenAI is just gonna hoover up whatever they can get. Is the credibility and provenance of data actually important to consumers? Maybe sometimes, but who...
Drew Breunig:
Yeah.
Jed Sundwall:
It's weird, because who's making that determination? Many times it's not going to be the user. They're just going to be asking an idle question.
Drew Breunig:
Yeah, I also think it matters in the domain. And that's where you're seeing a lot of random startups. Like, I was just talking to someone who's starting a company based on medical spending records, looking at Medicare receipts and Medicaid receipts, and it's a highly regulated industry. You can't have hallucinations. You have to have
provenance figured in when you start to build different products with this. And they've had to build their own custom pipeline. And this gets into the question Alex just asked: I wonder how RAG changes the play. Look, even if you build your own custom pipeline... they were doing text-to-SQL, which kind of predates RAG; text-to-SQL was the first use case. But then they're having to figure out, right, well, how do we go validate and subsequently confirm? And so,
getting back to what you're saying, with RAG it's like self-subscribing confirmation, and that's kind of where the messiness comes in. The thing is, they're working on one specific domain. Their surface area is a lot tighter, both in terms of the questions being asked and the data that can answer them. So their needle-in-a-haystack exercise is different. And you're gonna see the same types of companies come up in law. Like, how do I cite legal cases that actually exist, so I don't get chewed out by a judge and told to, like,
you know, go f off. And you're going to see that in each little domain where there's regulation, where there's penalties, and where you can sell that higher quality. I think the challenge that Anthropic and OpenAI and all these guys have is that there are really two markets right now: their chatbot market and their coding market. And so they'll care about citation in coding stuff.
The rest, they're just like, all right, how do I drive down hallucination in citation? They do have citation benchmarks. There are benchmarks and evals for people to go judge their ability to correctly name things without hallucinating. But coming back to what you're saying, I think the challenge here too is that, with LLMs, you also have to worry about multiple stages in the pipeline.
Drew Breunig:
So what I mean by that is, there are different stages when you build the pipeline. You have pre-training, which is when you train on the super-messy, Common Crawl-type data that builds up your base English capabilities, or base language capabilities, and establishes your knowledge base. Then you have post-training. Post-training is when you teach the model how to talk with an interface. That's when you train it to reason, that's when you train it to chat and go back and forth.
That's when you train it to use tools. And then after that, people might fine-tune it, or they might put further tools on top of that: data, RAG, other similar things. And so what you're talking about is providing function from basically post-training all the way through to fine-tuning, to tool deployment, to the framework around it, to the actual application. It's this wide spectrum of applicability
that also has different pricing terms as you start in. And the problem I have with paying for it is, I worry about... it's one thing if you're Reddit and you cover everything. It's another thing if you're a really, really, really narrow niche, because, again, you're selling into a model that does everything. So how do they value that use case to justify your acquisition?
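The validate-and-confirm step Drew described earlier for the regulated text-to-SQL pipeline can be approximated with a dry-run check: parse model-generated SQL against the real schema before trusting it, so hallucinated tables or columns fail fast. The schema and queries below are invented for illustration:

```python
# Toy "validate before you trust" gate for model-generated SQL.
# SQLite's EXPLAIN parses and plans a query without running it, so a
# hallucinated table or column is rejected before execution.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (provider TEXT, amount REAL, year INTEGER)")
conn.executemany("INSERT INTO claims VALUES (?, ?, ?)",
                 [("A", 120.0, 2023), ("B", 80.0, 2023), ("A", 50.0, 2024)])

def validate_and_run(sql: str):
    """Dry-run the query with EXPLAIN; only execute if it parses cleanly."""
    try:
        conn.execute("EXPLAIN " + sql)
    except sqlite3.Error as e:
        return None, f"rejected: {e}"
    return conn.execute(sql).fetchall(), None

rows, err = validate_and_run(
    "SELECT provider, SUM(amount) FROM claims GROUP BY provider")
bad, err2 = validate_and_run("SELECT npi FROM providers")  # hallucinated table
```

This only catches structural hallucinations; checking that the answer is semantically right is the harder "subsequently confirm" half of the loop.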
Jed Sundwall:
Yeah, well, I mean, so this is where we're...
Drew Breunig:
as I drink from a water bottle with an Anthropic sticker on it.
Jed Sundwall:
Cool, man. I wish I had an Anthropic sticker. I've still got Cloud-Native Geo stickers here. Nice, nice. Okay. No, I mean, so Alex brings up another really interesting point here that's very important. You're mentioning that if you're working in a very, very narrow space, the applicability of, you know, whatever you're putting out there is very broad.
Drew Breunig:
I have one of those, I just read it.
Jed Sundwall:
I am a hundred percent... my perspective is that the best gazelles create far more value than they capture, right? They should be the kind of thing that's only putting something out there that's quite small and simple, and you can vouch for it. And then what people can do with it: go nuts. If people can become billionaires off of it, that's great. With climate stuff, this is just what we have to acknowledge head-on. We actually talked about this right before we started rolling: we are at this point
Drew Breunig:
Yes.
Jed Sundwall:
where we are actually talking about making interventions to perturb the environment in order to protect the world as it suits humans, roughly however many of us there are right now, basically to cool it off. We're like, we're gonna make this decision now. We're gonna gather up a bunch of climate data, a bunch of information about the planet, and it will be used for us to manipulate the environment in a way that is much more deliberate than we have done in the past.
As we discussed, we've been messing with the environment quite a bit, not deliberately, but now we're gonna do this sort of stuff on purpose. This has huge, huge repercussions for global governance. And we do have to figure out models that can allow us to make huge volumes of data available reliably. And I would say they absolutely should be available to AIs. But who pays for that?
Somebody's gotta pay for it. And I'm with you: the answer should not be, well, we should get paid to do it.
Drew Breunig:
Well, I thought you were going to say the answer is not communism, or something similar.
Jed Sundwall:
No, but I do think... I mean, that's the other thing: we don't have the luxury of being too idealistic now. Ideally, it wouldn't be shaking down billionaires, but there are enough billionaires around that we should be shaking them down. I think philanthropy has a role to play here. I'm very interested in endowments for, you know, guaranteeing access to data over time. So there's something to be done here, but it will be...
Drew Breunig:
Mm-hmm.
Jed Sundwall:
This is a huge challenge. It's an exciting challenge, though. Yeah.
Drew Breunig:
I mean, I think that comes down to discovery. And I think that's one of the big challenges, which is... I mean...
So I shared a paper with Jed yesterday, which is brand new. A PhD student just came out with it yesterday. I'm gonna link it; wait, let me find the link I sent you. And it's about teaching LLMs to search for data and assess data.
Jed Sundwall:
yeah.
Drew Breunig:
And I think of it as a natural extension. You know, one of the first things that happened when ChatGPT came out, in that first year, is there were a lot of text-to-SQL applications. I think this is a further extension of layers upon that, which is: I'm going to understand a data source, build a representation artifact that is queryable,
so that then we can kind of query on top of that. And so I think we're starting to see these systems. And the good news is, here's the thing that I do think is incredibly valuable: you look at
this application and you can see why a company would fund it, because you can say, all right, would Databricks fund this? Would AWS fund this? Would Microsoft fund this? Would Tableau fund this? A hundred percent they would, because they want people to find more data, and the right data. Because if you find more data, and the right data, and it's valuable to you, you have to generate the compute to actually utilize it. And so I do think that we're going to see
things that are aligned with these functionalities when it comes to data discovery, because there is a huge market opportunity for it. And I do think maybe that's where the value gets put: not on access to the data, but on the discovery of the data, and the service of finding it. That, to me, would be a huge problem to solve for tons of enterprises that I've talked to.
Jed Sundwall:
Yeah.
Jed Sundwall:
Hmm. Okay. Well, very relevant to what I want to do, so I'm going to read it. So Camilla asks an important question: does the possibility of some sort of royalty model disappear with the complexity and lack of explainability of how inputs are ultimately used in these models at the end of the day? So yeah, basically, it's like, OpenAI, ChatGPT says, yeah, we just got the coolest data from
the Gates Foundation, here's our answer. You know, and it's like...
Drew Breunig:
Yeah, I mean…
Jed Sundwall:
A lot of people are gonna be like, okay, I trust your interpretation of this.
Drew Breunig:
Yeah, let me tell you a story based off that. It's one of the best ways I've found to learn about new companies, especially new models. And this is something... So at PlaceIQ, we cared about privacy a lot. We embraced new privacy mechanisms and regulations, we designed our systems with privacy in mind, and so I learned a lot about privacy during those eras.
Jed Sundwall:
Okay.
Drew Breunig:
OpenAI came out with ChatGPT, and they launched ChatGPT, and they launched the model. I knew something about how the model was made. And so the first thing I thought was: there are a lot of privacy issues inherent in this, especially because once you train the model, going in and selecting out the data it learned from your private data is basically impossible. You can only kind of add to it. You can't go in and surgically remove it.
So, just for fun, because I'm weird, I filed a CCPA request with OpenAI. The CCPA is a California privacy regulation that allows you to contact any company that has your data and say, hey, do you have my PII, my private, personally identifiable data? What is it? And I also have the right to correct it or delete it if I require. So,
Jed Sundwall:
Hmm.
Drew Breunig:
you read their privacy policy, and it was all about the accounts you create, when you create an account. It wasn't about the model or the training data they used for the model. They seem to have deliberately skirted that question, because it would be a really big question. But at the same time, it's still PII and they still have it. And I know for a fact that they have my website, because I know my website's in Common Crawl. And so I filed the request, and
this was in the first year after ChatGPT, and the person on the other end had no idea what to do with it. They're like, well, here's your email. And I'm like, no, no, no, I want to know about the training data. And they're like, I don't know. So I would go through periods of very quiet, and then it would get elevated, and then very quiet, and then elevated. And finally, they're just like, well, your email's not in our training data. We have processes for removing your email.
Jed Sundwall:
Hmm.
Drew Breunig:
So I used a prompt exploit to get my email out of ChatGPT. You can use all sorts of tricks to get around its alignment and safety protocols, and I did that. And I got it to say, Drew Breunig's email is... what Drew Breunig's email is. I'm not going to say it here. So I emailed them back and said, here is proof that you have my email. Somewhere in your data banks, it exists.
Jed Sundwall:
Yeah.
Drew Breunig:
And they're like, can you share the prompt? And it got elevated, elevated, elevated. And finally, they closed the issue, because they said, well, your email is actually something that could be really easily guessed. We could have learned it from other things and then inferred the naming pattern. And so that's how it came out. But this is the crazy thing: it's still my email.
Jed Sundwall:
Hmm.
Jed Sundwall:
man.
Drew Breunig:
So from a privacy perspective, it still happened. The email existed. Whether it guessed it or not is kind of immaterial. Especially if it guessed it, then it falls afoul of the CAN-SPAM Act, which prohibits using software to automate the guessing, the brute-forcing, of emails. If it didn't guess it, then it has my PII in its training data, which it almost certainly does. And I'm not gonna lawyer up and go fight this fight, but it's a good example of how even they
Jed Sundwall:
Yeah.
Drew Breunig:
can't tell what the model was trained on. And so, to Camilla's question: the royalty model does kind of disappear, because there are different scenarios you can plan for. Did the model just hallucinate it? Did the model figure it out based on the fact that it has seen previous patterns that are similar, and your question combined with the weights managed to evoke your email? Or is it recalling it from where it's buried deep in the depths of the weights?
And so there will be ways where they try to do this. Anthropic, just a couple months ago, had a big thing: hey, we can explain what's actually happening inside these models. And they could, but they had to train a special model just for explainability. And then they had to train a different model, a model that was the equivalent of, like, Claude 2, and then they had another one that looked at it.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Drew Breunig:
And then it would have to go through the output, and it was like two expert researchers would have to spend two months of their time just unbundling all the traces to figure out what actually happened for one query. So it's not a scalable mechanism, and it doesn't even work on the largest models. So yes, no one knows where the data is coming from. In fact, a lot of people say that's why reasoning models are a net good, because you can kind of see the logic of how they arrive at their conclusion. But...
Jed Sundwall:
pray.
Drew Breunig:
I think it is. Yeah, it’s a challenge.
Jed Sundwall:
Yeah. Yeah. I mean, well, this is, again, sort of going to the philosophy of Source, which is that you should be able to view source. If the model can't be explained, then whenever possible there should be some sort of auditable layer of data. That's not always going to happen, but... I'm going back to Alex's point about climate data. If we're talking about environmental data that's
Drew Breunig:
Yeah.
Jed Sundwall:
deliberately being shared so that we can impact the environment, it's an impact on everyone. There are layers of the internet that have to be auditable. And yes, the large companies are gonna wanna have plenty of secret sauce in their models, but there's some stuff that can't be secret. We should fight for it being auditable.
Drew Breunig:
But like, so then like.
I don’t know. I don’t think you can make the model auditable. I think we’re past that term.
Jed Sundwall:
Yeah, I agree. I don’t think we’re making the models auditable, but at least you should be able to say, we know where some of this data came from and you can do your own research if you want to.
Drew Breunig:
Yeah, it's funny, because I do think a lot of labs would like that. The problem is that they see their competition as actively stealing their stuff. And so, how do you enforce that internationally is the big question that comes in. And then also the desire to not fall behind other countries, I think, is the other issue. You start to get into the politics of the thing.
Jed Sundwall:
Yeah. Yeah. Yeah.
Drew Breunig:
So, I don't know. I do think... getting into the goal of training data products so that LLMs can understand them: is that what you're angling at? And if that's the case, I think... Brian Bischoff talks about the map versus the terrain when it comes to creating data systems that LLMs can query. You do have to create that thing that fits within the context well and allows them to kind of
navigate it, negotiate with it.
Jed Sundwall:
Right. And that's what I'm saying, that's what I was trying to say before: we could create a great catalog at Source Cooperative, and talk to our friends (I need to make friends at Anthropic; at OpenAI I've got a few) and be like, do you want to use this catalog? And if you use this catalog, are you willing to pay to access stuff from it? How do you train a model to know what data is worth paying for versus not paying for?
Drew Breunig:
Mm.
Jed Sundwall:
I don't know. I mean, I don't know if it could just be sort of a brute-force thing, which is to say OpenAI agrees... I'm going to use the Gates Foundation again, you know, the Gates Foundation maintains a lot of useful data. Actually, a better example is USAFacts, from another Microsoft guy, Ballmer, who created USAFacts, his nonprofit that shares statistical data. Yeah. Fantastic. And they say, like,
Drew Breunig:
Great outlet.
Jed Sundwall:
OpenAI is like, Ballmer can't afford to keep this thing running himself, so we're going to pay. This is where the argument falls apart with both Gates and Ballmer: these are groups that do not need to be charging for access to this data. But still, we want to have a market for data, to make sure that it's continually being produced. Yeah.
Drew Breunig:
I do think one of the things there is getting into the provenance stack. If you're merging datasets, you're going to have a ranked stack order for which ones you trust more than others. And so I think that's the service that may be the thing: validating and normalizing the data so that it can be referenced confidently.
That, to me, is the service to provide. Because I love the question of, when does an LLM know when to pay for data? Or when does it present that option? And, like...
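Drew's "ranking stack" for merged datasets could look like this minimal sketch. The source names, trust scores, and record values are all invented for illustration:

```python
# Toy provenance-ranked merge: when two sources disagree on the same record,
# keep the value from the more trusted source, and remember where it came from.
TRUST = {"census": 3, "usafacts": 2, "web_scrape": 1}  # invented ranking

def merge(records):
    """records: iterable of (key, value, source). Per key, keep the value
    from the highest-trust source, along with its provenance."""
    best = {}
    for key, value, source in records:
        rank = TRUST.get(source, 0)  # unknown sources rank lowest
        if key not in best or rank > best[key][1]:
            best[key] = (value, rank, source)
    return {k: {"value": v, "source": s} for k, (v, _, s) in best.items()}

merged = merge([
    ("school_enrollment_2023", 49_500_000, "web_scrape"),
    ("school_enrollment_2023", 49_600_000, "census"),
])
```

The useful by-product is that every merged value carries a source label, which is exactly what an LLM answer would need to cite confidently.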
Drew Breunig:
What do you think goes into that question? Like, what do you think are the inputs that you can think about in that one?
Jed Sundwall:
Well, right. So, I mean, again, what I've said about Source is that what you find at Source are files that have been put there by people or organizations who you may or may not trust, but you at least know who put them there. And so then the question is how... we're building a UI for Source where we want people to be able to tell at first sight whether or not the data they find there is worth their time. We then have to answer the same question for a model.
Drew Breunig:
Yeah.
Jed Sundwall:
It's like, is this worth my time? Is this worth me spending some of my research budget on? And I think part of that just has to be brute-forced through partnerships, to say that OpenAI recognizes this as a useful data source. Does it make sense to charge for the data at that point, based on some metering thing, you know, at fractions of a penny? Or is it just a partnership where they pay to go in and out and get as much as they want? I don't know.
Drew Breunig:
Yeah, and that's the thing: you are the one vouching for the data. I think that's the service being provided. But then you'd need to be a quality clearinghouse.
Jed Sundwall:
That's right. Okay, so here we go, and then we've got to start wrapping this up. But there's a bunch of stuff we can do once we have these files and we know who produced them. We can also have DOIs, right? So bear with me if you cringe at the notion of DOIs, as I sometimes do, but we can say this data actually gets cited a lot.
We could track how many citations the data has gotten. We also have metrics we want to share about how much the data gets used. Hugging Face is great at this on their datasets product, which I love; it's kind of the object of my envy. There's so much signal when you get to a Hugging Face dataset landing page, a lot of signal for you to be able to tell whether it's being used or not. And that's one way of motivating it. It's the way you shop on Amazon: it says this is a best seller, so you're like, okay, if the whole market agrees this is a good thing to buy, then it's probably good enough for me to buy too. But it's a matter of communicating that both to humans and to agents.
Drew Breunig:
I think, I mean, maybe you need to build a benchmark. Maybe you need to build a benchmark on quality retrieval from Source datasets, which is: can you correctly augment and cite without hallucination? Because that's the challenge: you may get the right pull, but then you don't adhere to the prompt and you rely on something in your weights.
Jed Sundwall:
Yeah.
Drew Breunig:
So it's kind of like recall on a moving-target dataset, which I think is a really interesting idea.
Jed Sundwall:
Hmm. Okay. Well, I’m going to have to talk to you about that another time.
Drew Breunig:
Because, I mean, that's the challenge. You have a bunch of data, you want to check against it, and then validate that it actually repeats back what it should. Because I think that's the thing: having high-quality data isn't enough. You need high-quality data, and then you need to ship the yardstick for measuring that quality when the data is actually put to use.
Jed Sundwall:
Yeah, right.
Jed Sundwall:
All right. Well, the lesson from this conversation is benchmarks. We've got to talk about benchmark design, not just designing great data products.
Drew Breunig:
Well, I think benchmarks are, because this comes back to Common Crawl. Common Crawl didn't do anything to its data; it just made it easier to access, didn't make any choices or anything like that. But I do think it's a really good exercise: all right, if Gil launched Common Crawl to build a better Google, or to build better information recall, or to not have Google monopolize it,
the benchmark he should ship is: you're building a search engine, so here are the queries and here are the record IDs you should be finding to answer each query, and you can start to rank against it. And even if you don't ship a benchmark, performing the exercise of what benchmark you would ship for the data product you're looking to ship is a good exercise, because it forces you to say, well, what do I want people to be able to do with this?
And then it focuses the way you package it up.
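The queries-plus-expected-IDs benchmark Drew sketches could look something like this. The queries, record IDs, and `retrieve` interface are all hypothetical stand-ins, just to show the shape of the exercise:

```python
# Minimal sketch of a retrieval benchmark: fixed queries paired with the
# record IDs a system should surface, scored by average recall@k.
BENCHMARK = [
    {"query": "coffee shops in Portland", "expected_ids": {"rec-101", "rec-205"}},
    {"query": "bridges over the Willamette", "expected_ids": {"rec-310"}},
]

def score_retriever(retrieve, k=10):
    """Average recall@k across benchmark queries.
    `retrieve(query, k)` should return a list of record IDs."""
    recalls = []
    for case in BENCHMARK:
        returned = set(retrieve(case["query"], k))
        hits = returned & case["expected_ids"]
        recalls.append(len(hits) / len(case["expected_ids"]))
    return sum(recalls) / len(recalls)

# A toy retriever that always returns the same two records:
toy = lambda query, k: ["rec-101", "rec-310"]
score = score_retriever(toy)  # finds half of query 1's records, all of query 2's
```

Even this toy version forces the question Drew raises: writing down the expected IDs means deciding in advance what you want people to be able to do with the data.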
Jed Sundwall:
Yeah, I mean, so many consequential decisions come from that. So, okay. Well.
Drew Breunig:
So who’s building the temporal benchmark? Who do we assign that to?
Jed Sundwall:
Tyler Erickson, that's who will build it. No, that's useful feedback from Tyler; we actually have a bit of funding right now to work on some GeoAI benchmarking work. So, yeah. Anyway, this has been awesome. I couldn't be happier with our first episode. We'll get this out there. We've got a link; Michelle cooked up a website for CNG Conference 2026. So mark your calendars. It's official. It's on, and there's a URL.
Drew Breunig:
Thought that was my voice, Tyler. Yeah.
Jed Sundwall:
So, yeah, Drew, I announced this while you were answering the door. We're doing CNG 2026. Same location, but in the fall, the sixth to ninth of October. No snow, which some people…
Drew Breunig:
Ooh, so no snow. That’s a big plus for me.
Jed Sundwall:
See, thank you. I'm glad you said that, because people are like, no, I like skiing. But only a few people have the time and energy to ski. I think most people wanted to get out on the mountain and just couldn't because there was too much snow. Anyway, thanks so much, man. We are going to do this again; I predict you'll be a many-times repeat guest. And thank you everyone for tuning in. This has been a lot of fun to do with a live chat.
Drew Breunig:
Yeah, I know.
Jed Sundwall:
Really appreciate everybody who chimed in. Anything else? Do you have anything to plug? Okay.
Drew Breunig:
Awesome. Well, no. I'll be at the Spatial Data Science Conference talking about GERS, talking about standards. And yeah, now I'm just thinking about evals, man.
Jed Sundwall:
Oh, good.
Jed Sundwall:
All right, well, stay tuned. We’ve got some work going on evals too. So, all right. Bye everybody. Thanks. Bye. Bye.
Drew Breunig:
Talk to you later, Jed. Always a pleasure. Bye.