Episodes
Show notes
Jed talks with Denice Ross, Senior Fellow at the Federation of American Scientists and former U.S. Chief Data Scientist, about federal data’s role in American life and what happens when government data tools sunset. Denice led efforts to use disaggregated data to drive better outcomes for all Americans during her time as Deputy U.S. Chief Technology Officer, and now works on building a Federal Data Use Case Repository documenting how federal datasets affect everyday decisions.
The conversation explores why open data initiatives have evolved over the years and how administrative priorities shape public data tool availability. Denice emphasizes that federal data underpins economic growth, public health decisions, and governance at every level. She describes how data users can engage with data stewards to create feedback loops that improve data quality, and why nonprofits and civil society organizations play an essential role in both data collection and advocacy.
Throughout the discussion, Denice and Jed examine the balance between official government data products and innovative tools built by external organizations. They discuss creative solutions for filling data gaps, the importance of identifying tools as “powered by federal data” to preserve datasets, and strategies for protecting federal data accessibility for the long term.
Links and Resources
Takeaways
- Federal data underpins daily life — From public health decisions to economic planning, federal datasets inform choices that affect Americans whether they realize it or not.
- Data tools require active protection — When administrative priorities shift, public data tools can disappear. Building awareness of data dependencies helps preserve access.
- Feedback loops improve data quality — Data users should engage directly with data stewards. Public participation in the data lifecycle leads to better, more relevant datasets.
- Civil society fills critical gaps — Nonprofits and external organizations can collect data and advocate for data resources in ways government cannot.
- Disaggregated data drives equity — Breaking down aggregate statistics reveals disparities and enables targeted interventions that benefit underserved communities.
- External innovation complements government stability — A healthy ecosystem keeps federal data stable while enabling community-driven tools to evolve and serve specific needs.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
Yes. Hello, Denice. Welcome to the Great Data Products. Thanks for joining us from Virginia. Okay. That’s right. Okay. Want to make sure. No, but very, I mean, happy 2026, really, really interesting time to be talking about these things. Just a bit of housekeeping as we get started. This is a, what I like to call a
Denice Ross:
Good to be here.
Denice Ross:
Northern Virginia.
Jed Sundwall:
live stream webinar podcast thing, where we talk about the craft and ergonomics of data and talk to people who, you know, professionals who’ve worked in the production and distribution of data about, you know, what works, what doesn’t work and what we’re working on. you are currently at the Federation of American Scientists as a, how do you describe yourself? Senior advisor, former chief data scientist of the United States. How else do you describe yourself?
Denice Ross:
Senior advisor.
Denice Ross:
That’s a good question. You know, I really like the title former, the former chief data scientist of the United States is serving me well. Yeah, I always wondered why my predecessor DJ Patil used that, you know, after he left his position. He went by the former, and I see it’s a good title.
Jed Sundwall:
Hahaha
Jed Sundwall:
Good, yeah.
Jed Sundwall:
Yeah, it is a good title. Well, and I think, so we also share, I mean, we share a lot of interests, but I think one thing we have in common is that you were a leader in New Orleans in open data back in the day. I also created a thing called Open San Diego back in the day. Can you just share a little bit about your experience in New Orleans and how that got started?
Denice Ross:
Yep.
Denice Ross:
Yeah, absolutely. So I moved to New Orleans in 2001. It was the first time that the internet was a thing as the decennial census data were being released. There was this idea that we could democratize the data, instead of decisions being made about communities behind closed doors by people in power and with resources to analyze the data and access it.
Jed Sundwall:
wow.
Denice Ross:
that neighborhoods and community organizations could have access to that data to advocate on their own behalf. And so, you know, I think when the civic tech movement arrived, you know, in sort of the 2005 to 2010, New Orleans was very primed to be a leader in that space, as was San Diego.
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah, yeah. Well, I mean, so you were early on. I mean, I think Open San Diego came 10 years after that. So you’re way ahead of the game there. That’s fascinating. Well, OK, so as discussed, I mean, you know, as we planned for this, I’m curious to know what you’re looking forward to this year, both what you’re working on and sort of more broadly where you see things going.
Denice Ross:
Yeah, absolutely. So 2025 was tumultuous. I think we can all agree on that from the data perspective. And as we head into 2026, what we have, though, is a pretty activated and informed citizenry around the role that the federal government plays in our everyday lives and our economy.
Jed Sundwall:
He’s really polite, yeah.
Denice Ross:
and just running a modern society, and also the role that data play, that federal data play. I think there’s less of a tendency now to take for granted data like the weather and data on jobs and the economy. So that to me feels like a good foundation to start building out a plan for what we want for the future of federal data. And at the same time also really protect the core of the federal data that we depend on, that we may not really be paying attention to yet and perhaps have been taking for granted.
Jed Sundwall:
Yeah, actually, I mean, this idea of taking things for granted, I think, is really something worth dwelling on. There is so much that we take for granted that we don’t notice it until it’s gone or until it’s disrupted. I think, you know, my dad worked in public health his whole career. And when COVID hit, suddenly the pandemic put
notions of public health and response and interventions and hard decisions into, you know, into people’s minds, and everyone starts freaking out. They’re like, why is the government telling me what to do? And he realized, I mean, I think this is pretty insightful, that public health sort of had become a victim of its own success, in that everyone just sort of takes for granted the fact that everyone learns to wash their hands
Denice Ross:
You
Jed Sundwall:
growing up, you know, like there’s sort of the basic cultural norms around hygiene and behavior and things like that. It actually took a ton of work to figure out how to get that out into the world and to train everybody on that. And that was done by public servants for the most part. And you don’t want to do a rug pull on those sorts of things, because, anyway, we just take them all for granted. But I’m curious to get your take: what do you consider when you talk about core data?
Denice Ross:
Mm-hmm.
Jed Sundwall:
Are there specific data products that you have in mind or categories of data or what?
Denice Ross:
There are. And you know, actually, though, as you’re talking about this idea of taking things for granted, I’m reminded of early in my career, I worked in lunar and planetary sciences. And I talked to this real old-school planetary scientist. And his take was that the American space program had suffered because of science fiction, because Americans thought we could do so much more
in terms of, like, you know, exploring space than we actually could. And after Katrina, we used to joke, because people just assumed that we would have the information on, like, who’s moving back and what do they need and what are their characteristics and how many households have access to a vehicle and how many sexually active teenagers of this particular demographic live in Marrero. You know, like people thought that we had this really detailed data.
Jed Sundwall:
Right.
Denice Ross:
And we used to joke that they thought that maybe the Star Trek Enterprise could just scan the planet and get us the data that we need. And so I think there’s two things. We take for granted the data that are flowing. And we also just, like, assume that we have access to data that are really important. And as you know, like, it takes…
It takes a lot of effort and resources and coordination to create a data collection, a lot of intentionality. It doesn’t happen accidentally. And so as we think about the future, yeah, as we think about the future, it’s not just like what data do we, what are the core data that we are currently collecting, but also what data should we be collecting moving forward.
Jed Sundwall:
yeah, no exactly.
Jed Sundwall:
Right. Well, yeah. So I guess, well, yeah, there’s a lot of threads to pull on here. I mean, you’ve been outspoken talking about the need for federal data. So maybe we can start there and just kind of like, what is that category? And before I let you answer, I’ll just make one point. We grappled with this when I created Open San Diego, because we’re like, well, whose data are we talking about?
What are we advocating for? And what we landed on was data about San Diego, because there’s a San Diego County, there’s a City of San Diego. There’s also, I think, maybe the most heavily trafficked border crossing in the world, at San Ysidro. So there’s Mexico data and trade data. There’s all sorts of data, we realized. Like, there’s a lot of data about San Diego that’s independent of the city government or the county government.
Denice Ross:
Right. Mm.
Jed Sundwall:
So when you talk about federal data as a category, what are you talking about?
Denice Ross:
Yeah, and that’s a really good distinction. So federal data are data that are produced by the federal government or with funding from the federal government. A lot of scientific data, health, climate, and environment are created through relationships with universities and whatnot. But I would call all of that federal data.
There’s two ways to think about what is core to me. One is thinking about the primary collection of the data: what types of data sets need scale and real comprehensiveness, so we’re not leaving any places or people behind, such that only the federal government can do it. And so that’s sort of the horizontal of the core data. And then there’s the vertical. And that is, maybe the federal government
collects the data, but then they also create different ways of accessing the data through lookup tools and maps and various APIs and resources. And that’s always a tension within federal government is how much do you build out those derivative works so that you can meet the needs of specific populations of Americans who need to make decisions or navigate some process.
Jed Sundwall:
Yeah. Yeah, that’s actually, I mean, I’m very curious to get your take on this. I mean, the naming of this podcast comes from this, you know, this one weird trick that we do at Radiant Earth, where we just really yammer on about this: we have to talk about data products. Like, I think one of the challenges that people like you and I have faced over our years working on this sort of stuff is that it’s very easy and fun and
apparently you can just talk about data in the abstract for as long as you want. But that doesn’t always get you, that might not get you very far. We find it’s really useful to talk about products. So like what you’re describing in this vertical thing, which is like APIs, maps, other tools and things like that, those are products and they have users in mind. And I’m curious to know, who are the users that, throughout your experience, you’ve engaged with most in that?
Denice Ross:
Mm-hmm.
Jed Sundwall:
Yeah, like who are these people? Because it’s not like average citizens, I think, in most cases.
Denice Ross:
Well, interestingly, so I’ll just mention a few recent examples of federal data that I’ve seen in the wild. I was getting money out of the ATM the other day, and I bank with USAA. They serve the military community. And the screen, when I was going to get my money, talked about firearm safety and suicide prevention.
Jed Sundwall:
You can correct me on that.
Denice Ross:
The reason that that campaign has been so successful is because it was based on evidence from the National Death Index that found that veteran suicide rates went down when veterans were locking up their firearms. And so that federal data spurred this very successful social media campaign that then made it to my ATM.
Another example is, you know, we go camping with the Scouts a lot. And when you get to a campsite, you know, there’s that old school wooden sign that tells you what the fire danger is. Well, that’s an official federal data set that’s informing which wooden sign gets hung on the hooks. And another example is when you go to the pharmacy and, you know, you might be prescribed a generic equivalent.
there’s an official data set out of the FDA, it’s called the Orange Book, that determines the generic drug equivalency for brand name drugs. And so those are just like a few touch points where, you every day we’re interacting with federal data that has made it into the real world.
Jed Sundwall:
wow. Yeah.
Jed Sundwall:
Yeah, I love that. I mean, there’s this reminder that, like, that wooden sign at a campsite is a data visualization. That’s a user interface, right? You never think of it that way, but that’s actually what it is. Actually, yeah, this is a good segue into this other thing I wanted to ask you about. So when we publish this as a podcast episode, we’ll put this in the show notes, but you were on Marketplace last year,
Denice Ross:
Great.
Jed Sundwall:
which I’m jealous of because I love Marketplace. But you said in that segment how you’ve felt like a lot of those tools and interfaces that the federal government has provided are maybe like almost like demos that should inspire others to build on top of. I think the USAA example is a really interesting one where that’s taking data to this weird endpoint, which is an ATM screen, but it’s actually a good channel to get the data out there.
I’m just curious if you could say more about how you see that playing out or how you’d like to see more of, I don’t want to say private sector, but like other actors taking federal data and building on top.
Denice Ross:
Yeah.
Denice Ross:
Yeah, this, you know, my thinking on this really solidified in the years after Hurricane Katrina, because I was on the outside of federal government working for a data intermediary. And federal data couldn’t keep up with the rapid changing of, you know, both the exodus when 80% of the city flooded and then the…
people rapidly coming back and also sort of different types of people as we were rebuilding. And we desperately needed information from local government in order to track those changes and to be able to have some community participation so that the recovery was complete and equitable. And I remember going to City Hall and asking…
I think it was like a parcel layer, or a list of childcare centers or something like that. And the contractor who was running the data at that time tried to set up sort of a quid pro quo. Like, well, I’ll give you this data if you give me this data that you have. I’m like, but that’s your job, like you guys are the ones who produce this data. Like you’re the primary data producer and you’re the only ones who can give this data to the citizenry.
And although they were making the data available in maps, they weren’t making the raw data available, which you remember was an issue in the early days of the open data movement. And so at that point, I became pretty fixed in my sense that if a data set can only be produced by government, then that should absolutely be their priority. Like as resources come and go, protect the core of that collection because as long as it’s made open, then
others can build on it and innovate on it. But if the federal government or the local government’s not doing their job with that primary data collection and the publishing of it, then everything sort of falls apart and you have to get creative with inadequate proxies. so just given the limited resources that governments have, I really do focus on that primary role of collection and publishing.
Jed Sundwall:
Yeah.
Denice Ross:
and maintaining the high data quality and then comparability and the continuity across time and space. That said, as you start to think about the different uses for any specific data set, there’s so many. Think about the American Community Survey or Landsat data, for example. Both of them have such broad uses across very different domains that it would be unreasonable to expect that the federal government would build tools to meet all of those use cases. And especially, you know, we’ve interacted with government websites, right? Like the government doesn’t generally do a good job at creating websites and tools, and maybe they do a good job once, but then, you know, it starts to age and, you know, isn’t sustained in the way that a more modern product life cycle outside of government might sustain it.
Jed Sundwall:
Alright.
Yeah, yeah.
Jed Sundwall:
Right. Yeah. I mean, we don’t need to pick on government people too, too hard, but it’s easy to fall into that. We can talk about procurement issues and why the government’s not that great at managing digital services or improving them over time. But like, I totally agree. I’ve felt this way for a long time. A lot of this came from our work when I was at AWS; we worked a lot with NOAA on publishing their data. And it was this kind of funny,
now that I think about it, it’s sort of a funny relationship in that we all sort of agreed. NOAA was like, look, we can produce the data, but we really need you to get it out to more people. And we’re like, okay, that makes sense. But then also, like, I can talk about my former employer, AWS doesn’t make great user interfaces either. Like AWS is, I mean, as far as infrastructure as a service goes, hard to beat, you know, they’ve done very well.
Denice Ross:
Mm-hmm.
Denice Ross:
Right.
Jed Sundwall:
But like when it comes to producing consumer-facing end user interfaces that can reach a lot of people, constitutionally the company just doesn’t seem that great at it. That’s not really what AWS is built to do. Other people build those interfaces on top of AWS, and that’s how we did it. But I’m just agreeing with you pretty violently that, like, it’s okay to have the government stop at some point and let other actors take over to get things
Denice Ross:
Right.
Jed Sundwall:
the last mile.
Denice Ross:
Yeah, I think it’s how we build resilience into the system, frankly. Like, you know, let the federal government focus on the core. What is missing, though, to make this really work are the feedback loops, so that federal data stewards have a really good sense of both how the data are being used, how the data could be improved to better meet the use cases, and then what untapped
possibilities are there for the data to better serve the American people if the federal data collection adjusts to changing conditions or data needs. And those feedback loops, when I was in the Biden administration, we did talk about how we might infuse more public participation and community engagement around federal data.
And it’s tough, like right now, the main avenue of giving feedback on a given data set really only applies to data sets that are collected through forms and surveys and subject to the Paperwork Reduction Act, which triggers this sort of public notice and comment period. And then you have to be, like, watching the Federal Register to know that a comment period just opened. For example, just…
Jed Sundwall:
Right? Right.
Denice Ross:
Tomorrow, so, I’m working on a project, two projects right now, which I should mention. The first is dataindex.us. It’s a collective of federal data watchers. We started with that Paperwork Reduction Act data on changes to forms and surveys and are expanding to scientific and health and environmental and other types of data.
But we’re monitoring changes to the federal data and looking for opportunities for public input, because when those policy windows open, those are going to be the times when public input is going to make the biggest difference. And so tomorrow we’ve got a webinar about the Pregnancy Risk Assessment Monitoring System, I believe it’s called. But it’s basically the only way that we understand maternal and infant mortality
in America. And that collection, interestingly, if you think about how those data are collected, it has to come from local public health institutions. Then it reports up into the states and then the CDC. And recently, the CDC stepped back
on aggregating the data at the national level. So now researchers, if you want to study maternal and infant health, you have to go to every state individually and ask for the data, which introduces so much friction into the system, right?
Jed Sundwall:
Yeah. Oh man. I mean, we dealt with this all the time when I was running the open data program at AWS. It was almost clockwork, like at least once a month, at this pretty regular cycle, some people were like, hey, wouldn’t it be cool if we had all of the X data about cities in the country? Like crime. I mean, the crime one came up a lot. It was like, wouldn’t it be cool if we had a data set of all of the crime in different cities in America? And I’m like, that would be cool. Who does that?
Like who would do it? It’s a very expensive process to carry out. And I agree it would be cool, but we have to find somebody who actually is intended to do that. CDC, very clear, obvious mission here that’s, you know, historically been funded to do this sort of thing. Um, so I’ll just go ahead and say it, you know, although it’s already 2026, like,
we can talk about core data and these sorts of things, but then what happens when the arbiter of the core data might not be seen as trustworthy?
Denice Ross:
Right, or just drops the ball, as is happening with PRAMS now. Or if you think about what happened for the first year of COVID, where civil society, the COVID Tracking Project and Hopkins and others, filled in that role of harvesting the data from state and local health departments. And then it took about a year until the federal government really was on the ball with that.
Jed Sundwall:
Right.
Denice Ross:
There’s another example, though, recently, speaking of crime. So historically, the FBI has released their crime data once a year. The year closes out at the end of December, and then it takes nine months to process the data. It’s the official statistics, so quality and continuity and all these things are really important. So it takes nine months, and then they’re published. But that’s not timely enough for really understanding, for example,
you know, is carjacking becoming a problem, or, like, what are the trends that we’re seeing in murder, and informing the national dialogue and local policies. So last September, Jeff Asher and his colleagues created the Real-Time Crime Index, where they are hoovering up data directly from the nation’s law enforcement agencies and then creating a monthly estimate.
And I was in the White House the month that that first monthly estimate dropped. And it was amazing. Like immediately, every policymaker who was working on violence, especially gun violence in America, they changed the way that they consume their data about crime in America. And so they go to this Real-Time Crime Index for the monthly updates. But then it’s still essential to…
Jed Sundwall:
interesting.
Denice Ross:
benchmark that to the official data coming out of the FBI. And what I really love about the resilience that that builds into the system: we need both. We need the official, slower, but really comprehensive and high-quality data coming out of federal agencies. Data that, you know, the FBI director can go before Congress and talk about with confidence. So we need that. And then we also need
some of the scrappier sort of civil society best guesses of how things are going. They don’t have to go testify before Congress to talk about the quality of the data, right? They can have their methodology, and it might be a little black-boxy. And there might even be competitors in the space giving slightly different perspectives on what’s happening. We see that happen with flood risk, for example, where there are different models that consume a lot of federal data and tell you how at risk your particular property is.
Jed Sundwall:
Right. Yeah.
Denice Ross:
And I think that that combination of the official data plus the innovative data that might trade a little bit of quality for timeliness is important given how fast things are changing in America around crime and climate and society.
Jed Sundwall:
Yeah. Well, I mean, I also think it’s super useful to acknowledge that it’s always a, I don’t want to say a negotiation, but, like, I think, you know, all models are wrong, but some are useful. That idea is to understand that authoritative data is useful in the sense that there’s a methodology, you might
Denice Ross:
Mm-hmm.
Jed Sundwall:
be more comfortable about how it’s governed and produced. But it doesn’t always mean that it’s the end-all be-all absolute truth, you know. It might be data that you’re required for some regulatory reason to rely on. It might be the safest data to use, so if you are hauled in front of Congress, you can say where you got your numbers from. But like, I think it’s worthwhile to
Denice Ross:
Yep.
Jed Sundwall:
engage with that idea that, like, okay, it is useful to have authoritative data for some reasons, but we shouldn’t just sort of rest on our laurels and say, oh, that’s the data from the government, so it must be true, you know? Yeah.
Denice Ross:
Right. Yeah, absolutely. And the other nice thing about having authoritative data then plus the innovation happening in civil society is, for example, with the crime data, the FBI sets the standards for that data. And then every software vendor in America serving law enforcement agencies conforms to those standards. So that gives you the comparability on the basics.
But then often law enforcement agencies need more details. So, for example, some innovations were happening over the last few years because jurisdictions realized that they needed data on non-fatal shootings, not just the fatal ones. And the FBI standards didn’t include that. And so cities like Philadelphia and other cities started collecting data on non-fatal shootings to inform
their policing practices and community engagement. And so that innovation started to happen at the local level. And then the slower process of incorporating that into the official government standards was happening at the same time. And then in the last few months, that became an official part of the new standard, which would then be propagated across all of the nation’s law enforcement agencies. So there’s a really nice interplay between
Jed Sundwall:
Interesting to see it.
Denice Ross:
the slow building of standards and the sort of field expedient data collections that communities need in order to answer the questions that are before them.
Jed Sundwall:
That’s a great story. Have I ever shared with you this white paper that we published last year called Emergent Standards? I’ll send it to you. I’ll put it in the show notes. Like, I tell the story of RSS, which is what’s used to publish blogs and podcasts and things like that, and GTFS, which is the General Transit Feed Specification. That’s how transit authorities share data,
Denice Ross:
Mm-mm.
Jed Sundwall:
largely with, like, Google Maps and the big map apps, like Apple Maps. But it tells stories similar to what you’re just saying, which is that you do have to have kind of large institutions that can give the imprimatur or set standards or sort of define requirements in a way. But they should negotiate and engage with the data practitioners and learn from one another. And the web is actually really good at enabling that kind of
negotiation. And then after a while, people are like, okay, yeah, this is the standard. This is how we describe this data. This is what counts as a shooting, like in your case. You know, but that’s a negotiation among a bunch of different actors and data users that has to happen. And it’s never as simple as saying the standard that some government agency set is the one and everyone agrees. I think you’ve probably lived this, I mean, many times,
Denice Ross:
Mm-hmm. Yeah.
Jed Sundwall:
why that’s not true. Okay. Well, I’m also curious to get, I mean, this was relevant to that. Actually, hold on, before I go on, you said you were working on two projects. You mentioned dataindex.us. What’s the other thing? You should brag about what you’re doing.
Denice Ross:
Mm-hmm.
Yeah.
Denice Ross:
In the first Trump administration, when there were concerns about data, especially around climate and environment, disappearing, and also concerns about the decennial census that took place in 2020, it became clear to me that we as data users and stakeholders and advocates had not done a good job of telling the story about why data matter.
And so that’s been some serious unfinished business for me. And as I saw things unfold almost a year ago with the pulldown of so many data sets to remove elements that were not compatible with administration priorities like DEI and gender and climate,
I saw the narrative in the media about how researchers were going to be harmed by the disappearing data. And I was like, no, actually, all Americans are going to be harmed by the degradation of federal data capacity. I realized as I started to look at how we generally think about data use cases, we center the user of the data and what task they need to accomplish.
for some outcome that they’re trying to reach. And I thought, well, what if we flip the script a little bit and focus on the beneficiary of the data rather than the user of the data? So for example, a cancer patient can find a clinical trial that’s a good fit for them because the clinicaltrials.gov data set
is easily available and they can sort by the condition that they have. Or a football coach knows to move practice inside when it gets too hot so his players don’t get heat stroke, because the National Weather Service publishes the heat index. And so what we’ve done with a website called essentialdata.us is we’ve been crowdsourcing and building up
Denice Ross:
these little one-sentence love letters about how specific federal data sets benefit everyday Americans and their livelihoods. And we’re almost at 100 data sets, about nine months in. And it’s just been such a delight. But I’ll tell you, it takes about 20 to 30 minutes talking with a data user to shift their perspective from centering the users of the data
Jed Sundwall:
Nice.
Denice Ross:
to centering those who benefit from the data. I had these doubts at the beginning. I was like, this is just too obvious. But it's actually a big mindset shift, and I think anyone who cares about data needs to undergo that shift so that we can talk about how data benefits people in their everyday lives.
Jed Sundwall:
Interesting.
Jed Sundwall:
Yeah. Oh man, I have so many thoughts about this issue. A weird one, though, comes from a book I read years ago called Entangled Life, which is about fungi. It's an awesome book, actually a great book, but there's one insight in it where the author points out: we're humans that live on the surface of the earth, and we see things above the soil. So
we look at a plant or a tree and we're like, yeah, that's a tree. There it is, I'm looking at it. And he's like, well, you don't see all of the fungal activity in the soil that's transferring nutrients, and, we've learned, information, from that tree to other plants and other life forms around it. So there's all this stuff going on underneath that we just cannot see and never consider at all. We think of a tree as a tree, and sure, it's a tree, but it's a part of so much else.
Denice Ross:
Right.
Jed Sundwall:
And this is going back to the whole taking things for granted thing. We live on this substrate that no one thinks about at all. We're the beneficiaries of all of it, but it's totally invisible to people. Yeah.
Denice Ross:
Yeah, I love that metaphor. And it reminds me of digital tools and how they consume federal data. For example, all the real estate apps like Zillow and Redfin consume data from the Department of Education about school performance. But it actually takes a lot of work to figure out that that data is federal data.
Jed Sundwall:
Right. Yeah.
Denice Ross:
And that’s one of the tricky things about these digital tools that we build is that we make it look like the data are all there and we sort of hide where it’s coming from and how it might be at risk. I remember…
Jed Sundwall:
Right.
Denice Ross:
I remember a survey question about attitudes around the decennial census data. People were asked whether the decennial census data is unique, like, is it something that only the federal government can produce? And a common answer was, no, no, you can get that data from Google.
Jed Sundwall:
Yeah.
Jed Sundwall:
Wow, amazing. Yeah. Yeah.
Denice Ross:
Right? And it's like, yeah, you can, but Google wouldn't have the data if the census didn't exist. And we've had some rough patches, right? Like with the economic data during the shutdown, where the private sector was able to sort of fill the gaps. But you have to have that federal benchmark to snap to, or the private sector data is going to veer further and further from reality.
Jed Sundwall:
Oh yeah. Well, this is going back to this feedback loop thing, which, you know, we don't have great feedback loops, right? Federal data providers, or a lot of government data providers, really don't have many ways to know how their data is being used and how valuable it is. And this is where I'm approaching a third rail, because I'm going to talk about data markets and pricing and things like that.
This Google example is kind of funny, because Landsat has a sort of similar story. Landsat had been around for a long time and was very widely used. For those who don't know, though I think most people listening to this podcast are familiar with it, Landsat is satellite earth observation data provided by USGS. But then Google Earth Engine gets created.
I won't go into the whole history of how it was created, but suddenly Google has this thing called Google Earth Engine, an incredibly powerful tool that makes Landsat so much more accessible to people and leads to an explosion in usage of Landsat. I should take some credit: at AWS, we subsequently did something similar, putting Landsat data into AWS. But I do know that there was some consternation at USGS that Google Earth Engine was getting all this credit for Landsat.
Denice Ross:
Right.
Jed Sundwall:
Which is fair, you know. It's like, well, hang on, we've been doing this forever. Google didn't fly the satellite or take the risk in the seventies of developing this program and keeping it going for decades. But this is where we get into the third rail territory, which is that Google Earth Engine was able to do what they did, and I was able to do what I did at AWS, because the data was free and open. And because of that…
Denice Ross:
Yeah.
Jed Sundwall:
There's a recent study from USGS showing that the value of Landsat is billions of dollars for the economy. And I'm like, well, if that's true, why can't you defend yourself? How are you not able to capture any of that value to make sure you continue to exist? And I guess I'll just leave that there for you to respond to, because I do think this.
Those of us who are open data enthusiasts have divorced ourselves from getting useful signal from markets. And I don’t know if that’s worth re-examining.
Denice Ross:
It’s a really good time for the private sector to step up and advocate for the continued flow of the data that they depend on.
Jed Sundwall:
Agree.
Denice Ross:
We haven't seen a lot of that, frankly. I mean, if you think about data advocacy, it tends to be more nonprofits and academics. And I think Steve Ballmer, the former Microsoft leader, with USA Facts, is one of the few private sector folks who's been really advocating for the continued flow of federal data.
One thing to keep in mind, and I know there’s concern about appearing to be anti-administration, but there’s nothing inherently political about wanting data to keep flowing. And in fact, the Evidence Act was signed by President Trump in his first term.
and has a section in there that requires federal data stewards to engage with the public so that they can better understand how the data are used and how the data can be improved. So that type of public engagement is baked into the law that President Trump signed in 2019. In the federal government, we just haven't done a great job of creating those feedback loops.
And that's why, with the work that we're doing at dataindex.us, we're trying to bridge that gap so that people who care about data don't need to monitor the Federal Register on their own or keep an eagle eye on LinkedIn to see if their favorite data set is at risk. We sort of centralize the heavy lifting. And then when there's an opportunity where public input can be really useful, we mobilize folks
to submit their public comments.
Jed Sundwall:
Yeah, great. Well, what I'll add to that, though, is that there are also just basic analytics we should be better at doing. It's crazy to me how hard it is to count data usage. In fact, I had a text exchange about this earlier: on Source Cooperative, we host three petabytes of data now, and we're logging over 150 million requests a month. And I was saying,
Denice Ross:
Oh my gosh, so true. Yeah.
Denice Ross:
Right.
Jed Sundwall:
shout out to Avery Cohen earlier today, I'm like, it gets really annoying when you're counting tens of millions of things, you know, requests, and then filtering through those and figuring out which data sets are being accessed. Do we know anything about who's accessing them? What is this data even telling us? But in any event, at a minimum we should be able to know, and this is also a hard conversation that's starting to happen more and more often, that
some data just never gets used, and maybe we should let some of it go. I think the term I've heard a lot in 2025 is "joyous funeral": there are probably some data products where we're like, okay, we can let these ones go. It's okay. You know.
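[Editor's note: the counting problem Jed describes, millions of log lines filtered down to per-dataset tallies, can be sketched in a few lines of Python. This is a hypothetical illustration only; the log format and dataset paths below are invented, and Source Cooperative's actual pipeline is certainly more involved.]

```python
from collections import Counter

# Hypothetical access-log lines ("timestamp path status"); real logs
# from any host will differ in format and volume.
log_lines = [
    "2025-01-01T00:00:01Z /noaa/landsat/scene1.tif 200",
    "2025-01-01T00:00:02Z /usgs/bats/records.csv 200",
    "2025-01-01T00:00:03Z /noaa/landsat/scene2.tif 200",
    "2025-01-01T00:00:04Z /noaa/landsat/missing.tif 404",
]

def count_by_dataset(lines):
    """Tally successful requests per dataset, taking the first two
    path segments as the dataset identifier."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        _, path, status = parts
        if status == "200":
            dataset = "/".join(path.strip("/").split("/")[:2])
            counts[dataset] += 1
    return counts

print(count_by_dataset(log_lines))
# Counter({'noaa/landsat': 2, 'usgs/bats': 1})
```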
Denice Ross:
No, I like that. I like the concept of a joyous funeral. I have enough humility now, having been in the field of data for 20 years, to know that I don't know what all the use cases are. And you just never know. So I'll mention one of my favorite data sets: the North American Bat Monitoring Program database. Yeah, it's this geospatial data set out of USGS. And there are 400
Jed Sundwall:
Ooh.
Denice Ross:
organizations around the country that contribute to it, information on bat species, their locations, what they’re doing. And you might think like, well, why is the federal government collecting data on bats? Well, it turns out that bats provide billions of dollars of free services every year to America’s farmers. And if you want to protect that free service, you have to protect the bats. And if you want to protect bats, you need to know where they are. And if you’re building like a
wind farm or expanding a mining operation or renovating a highway overpass, that all requires permitting, which will require you to make sure you're not harming bats. And if the bat database didn't exist, every one of those developers would have to, what? I don't know, count the bats themselves to figure out what the impact would be.
And so this streamlines permitting and makes it easier for development to happen in a responsible way. And then there's also some research showing that in agricultural areas where there have been precipitous declines in bat populations, due to disease for example, infant mortality goes up,
which is strange, right? But the hypothesis here is that if the bats aren't providing that free service of insect removal, then farmers need to use more pesticides,
Jed Sundwall:
Yeah, okay.
Denice Ross:
which gets into the bloodstream of pregnant women. So for an infant's death, you wouldn't say, well, that's attributable to the fact that the North American bat monitoring database went away. But you have to be really careful about what data we say are not important anymore. And that's, frankly, one of the blind spots that we have: who's using this data? They're probably quietly in their basement,
Jed Sundwall:
Interesting. Has its own issue. Wow.
Jed Sundwall:
Right.
Denice Ross:
you know, deep in some building, using this data, but it could have some super high impact application that just isn't that public.
Jed Sundwall:
Yeah. No, I mean, it's kind of inevitable that I bring this up at some point. I've never talked about this on the podcast, but there's a famous xkcd comic about open source dependencies. Hang on, I'll put it in there. I guarantee there are people I know who've memorized the URL for this: it's xkcd 2347.
It's the comic where there's this huge towering, complex bit of digital infrastructure, and it's all running off of one random thing that some guy in a basement is maintaining. Or, you know, a bat database that a very dedicated and continually abused public servant has been heroically maintaining forever.
And this is why I say I'm always very cautious and get nervous when I talk about market signal to support data: there are data that are maybe very valuable, but for which the market signal is going to be extremely weak. The market won't tell us that they're valuable. And actually, this is where, and I think you'll agree with me, the government's role is so important, because there's
all sorts of stuff that there’s no market signal for, but that we should probably be doing. And it’s the government’s responsibility to make those things happen.
Denice Ross:
Yeah, and that's one thing. So, having served in both the Obama administration and the Biden administration: in Obama, the focus was on open government, which was exciting and sent shockwaves, really good shockwaves, throughout the nation and through state and local governments. And then the
first Trump administration was so focused on building evidence and data capacity, and they installed a chief data officer in every major agency. And so when I came back in the Biden administration, there was so much more data capacity in federal agencies. And what Biden really leaned into, and what my role as the chief data scientist was about, was how can we build the data backbone across agencies so that
we're delivering better outcomes for all Americans. If you want to do that, you need to disaggregate the data in ways that the market may not be interested in. So you need to understand, you know, veteran status, caregivers, survivors; you need to understand rural versus urban, the role of sexual orientation and gender identity in outcomes, race, ethnicity, gender,
primary language spoken at home, whether you have access to a vehicle. There are just so many ways to slice and dice the data to see which populations or areas might be overburdened or left behind, and then adjust our policies and our programs so that we're benefiting all Americans. And if you don't…
If you don't disaggregate the data to identify those disparities, it's really easy to look at a number like, you know, we're serving 99% of America, and declare mission accomplished. But if you look at that 1%, it's almost never evenly distributed. If you look at it geographically, what you see is that the places left behind are Appalachia, the Southern Black Belt,
Jed Sundwall:
yeah.
Denice Ross:
tribal communities, the border with Mexico, rural America. You know, the same places and the same groups of people are left behind repeatedly. Market forces aren't going to raise those data to consciousness.
Jed Sundwall:
Absolutely, yeah. I'll agree with you a hundred percent. Well, okay, I'm going to shift gears a little bit, because I'm leading you into talking about a dataset and a story that I think is really interesting.
Historically, you know, if we go back far enough, for a while there it was only the federal government that even had a computer. So we've historically had to look to the government to gather and store data, just because you needed the most powerful nation state in the world to even be able to do it in the first place. Those days are long gone. There's all sorts of data that can be produced by non-government actors. You can call them commercial actors or other groups. I mean,
Denice Ross:
Hahaha
Jed Sundwall:
the Environmental Defense Fund famously launched their own satellite, which was lost, which is sad, but they did it. They launched a satellite that produced data. So we're well past the point where we necessarily need the federal government to do all this sort of stuff. Do you have any thoughts on when it's okay for other organizations to take over or to step in
Denice Ross:
Hmm.
Jed Sundwall:
to support this kind of work and how do we know when that’s appropriate or not?
Denice Ross:
Yeah, I have a few thoughts. Maybe three examples come to mind. The first goes back to that idea of primary data production and the unique role that the federal government has in producing core primary data, and then there are the data products that can be built with those data. A recent example is the billion-dollar weather and climate disasters data set.
It was terminated in 2025, but it's a NOAA data product, and Climate Central hired the NOAA researcher behind that data set. They are using similar methodology as was used when it was inside of government, but improving upon it. They're talking about reducing the threshold so that they can track million-dollar disasters.
So, you know, maybe that's the best place for the billion-dollar disaster data set, as long as the federal data that feed it keep flowing.
Jed Sundwall:
Yeah, yeah, yeah, right.
Denice Ross:
So that’s the big if there, right? So that’s one thing. But then if you talk about something like the Framingham Heart Study, that’s a federally funded study that completely transformed our understanding of heart disease.
Jed Sundwall:
Yes, this is the one I was…
Denice Ross:
It was a federal program that was initiated after World War II. Our president had recently died of heart disease. I think 40-plus percent of American men had heart disease at the time, so heart disease was very much in the national consciousness. This was a priority. Congress funded the study for 20 years, and at the end of that 20-year span, the National Heart Institute announced that it was going to phase out the study the next year.
So the researchers, similar to what’s happening right now with climate and health and other research that’s been federally funded, that’s been producing essential data, the researchers started looking for other funding sources and they ended up raising money to keep this collection alive from unlikely groups, including the Tobacco Research Council and Oscar Mayer Meat Processing.
So they went to the private sector to fund the collection during the in-between years. But the really cool part of this story is that it's one thing to find a way to keep the collection going, to maintain that continuity, right? Because that's what turns science into knowledge, into action: the continuity across time and space. But you also have to have a policy game there, because the federal government
Jed Sundwall:
Yeah.
Denice Ross:
really should be the steward of these really critical data collections. And it turned out that President Nixon's personal physician was a real stakeholder in this heart study, and he talked Nixon into advocating to get the funding turned back on for the Framingham Heart Study. So it was this DC-style interaction between the president's doctor and the president
Jed Sundwall:
Interesting.
Denice Ross:
that then got the funding back on track. And it came back stronger than ever when it was funded again. They recruited the children of the original volunteers, and now that study is three generations long. And as the demographics of Framingham, Massachusetts changed, they started to widen the sample beyond those initial families so that they could be more representative of the demographics of the US.
Jed Sundwall:
wow.
Denice Ross:
So, you know, I think there are some parallels for where we are right now, where we might be seeing some gaps in federal support. And so maybe we think about this as, let's create sort of a heart-lung bypass machine for our data, right? To keep it alive, keep the continuity there, but then let's figure out what the long-term policy plays are to make sure that the data we need as a nation continue to flow and come back stronger.
Jed Sundwall:
Fascinating. Yeah.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. I mean, this is where I will advocate for what I talk about a lot at Radiant Earth, which is new institutions, new data institutions. Which is to say, I won't say I disagree, but maybe the federal government isn't always the right steward; they're a very important stakeholder, right? So I guess framinghamheartstudy.org, I assume, I just found the website, is
some kind of independent nonprofit or entity in which the federal government is a large stakeholder, as is Oscar Mayer. I don't know if Oscar Mayer is still involved, or Altria, or whatever Philip Morris is now called. But the point is, it is actually an independent entity that is able to receive resources from
Denice Ross:
Hahaha
Denice Ross:
Right.
Jed Sundwall:
a lot of different stakeholders. And yes, I would agree that the federal government should be involved; this should be a national priority, to understand these things. Yeah.
Denice Ross:
No, and I agree. And I think those types of more creative arrangements that you often see in the sciences can build resilience into the system. Some data sets don't have that luxury. For example, with the Federal Employee Viewpoint Survey that OPM runs every year: during the greatest disruption ever to the federal workforce, there won't be any data collected on
Jed Sundwall:
Yeah, great example.
Denice Ross:
how employees feel about it. And so the Partnership for Public Service stepped in and they're running a lighter-weight version of the survey, but they don't have the Rolodex to reach out to every federal employee. I'm grateful that the Partnership for Public Service is running it, but it's not a replacement for what the Office of Personnel Management should be doing.
Jed Sundwall:
Yeah. Well, then we can start landing this plane, but with a pretty big question: knowing what we know now, how would we protect a data product like that survey? Do you have any ideas?
Denice Ross:
I do, I do. If I could just go back for a second, though. So I talked about the billion-dollar disaster data set and the heart study. And then the third example is data that I think really do belong in the private sector but have a really important public use.
Jed Sundwall:
please.
Jed Sundwall:
Yeah, you said three examples. I wasn't sure if that was all of them.
Denice Ross:
And this is when there’s a disaster, one of the important pieces for response and recovery is knowing which gas stations are open.
Jed Sundwall:
Okay.
Jed Sundwall:
Makes sense.
Denice Ross:
And so right after Superstorm Sandy, the Energy Information Administration was literally calling gas stations to see if they were open and if they had gas. And I don’t know if you remember the news coverage from that time, but gas was in short supply and tempers were flaring and there were lines of cars at gas stations just trying to get fuel so they could evacuate or go wherever they needed to go.
Jed Sundwall:
Amazing.
Denice Ross:
And so you can imagine how well received the phone call from the federal government was by that poor gas station owner, trying to get a sense for whether the station was open or closed. And then the data were so volatile that who knows what the actual status was. It turns out that a company like GasBuddy, a crowdsourcing tool that's used especially by truckers, rideshare drivers, and taxi drivers, has solved this.
The way it works is that you go get gas and you type in the amount that you paid, and then you get rewards that you can spend in the little shop at the gas station. So there's this whole incentive structure built in. And GasBuddy, it turns out, actually has the best data in the country on gas station status. Yeah. And I know from my friends in the National Security Council that it causes them much consternation to have to cite GasBuddy
Jed Sundwall:
Okay.
Jed Sundwall:
Wow!
Denice Ross:
when they’re reporting up to their superiors on the status of our fuel supply in a disaster impacted area, but GasBuddy actually is the best data set for that. So the question there is how might the federal government create some sort of agreement with GasBuddy so that those data can be reliably available to serve the public good when needed?
Jed Sundwall:
Yeah, interesting. Okay. Well, this is kind of going back to the whole "wouldn't it be cool if we had all this crime data," and I'm like, well, who's going to do that? So many of these just end up being collective action problems, right? Like that gas data. You can just imagine what an incredibly vast and complex data product that would be to create.
Denice Ross:
Right.
Jed Sundwall:
And also, it's the perfect sort of thing where a nerd would be like, well, why isn't there just an API that every gas station reports its prices into? Anyway, it's like that.
Denice Ross:
Right.
That would be nice, but we don't even have that for power outages. The Department of Energy has to scrape power outage data from the public websites of the electric service providers.
Jed Sundwall:
No, that’s it. Yeah. Yeah.
Jed Sundwall:
Yeah, I'm not surprised. And again, collective action problems. But it's a bummer, because I think people like us who work in this know that this is not a hard technological problem anymore. The tech required to do it isn't hard; it's the coordination that's hard. Okay, well then, what was my question? My other question. Yeah. So how would we make these things, especially things like this,
Denice Ross:
Right.
Denice Ross:
less vulnerable.
Jed Sundwall:
I want to be charitable. You've said you've worked in both the Obama and Biden administrations. I live in Seattle. I run a nonprofit. I think people can guess how we feel about things politically. But the truth is that, for better or worse, half the country seems to be pretty mad at the president no matter who's in office.
I'm not going to start talking about popular vote versus electoral college stuff. But regardless, we live in a country where people disagree with each other, and actually I think it's a great feature of America that we're very skeptical of our leaders. Right? So we're lucky to have decades of precedent behind us where there's a functional bureaucracy
that has produced data accurately and reliably for a long time. In the past year or so, though, we've started to see data getting taken down. Data really appears to be actively distorted in some ways. We've now crossed that threshold. Is there a way back from this, or do you have thoughts on how to protect federal data in the future?
Denice Ross:
Yeah, I think the most important thing that we can do comes back to the idea of not taking the data for granted: making visible and explicit the role that federal data play in our everyday lives. And there are probably three levels of intervention for that. We're starting with the people who use data, including the private sector entities that are using federal data,
and making it easier for them to mobilize, to share with federal data stewards and policymakers the ways that they use data, the way they depend on the federal data and why it’s really important for the economy, for example, that these data keep flowing. So my contention there is that anyone who’s a data user should also be a data advocate. And that is completely independent of who’s in office.
Jed Sundwall:
Yeah. Yeah. Okay.
Denice Ross:
And then the second audience for this is policymakers and the federal data stewards themselves, because they often aren't aware of the deep impact that these data sets have. So, for example, we've heard stories of federal data stewards who are able to collect
use cases about why their data collections matter to industries that this administration prioritizes. And that can have a real protective effect on the flow of data that can be used across a whole bunch of different domains. And then more broadly, it's just raising awareness with the general public about things like the "no campfires" sign
at a national park, and how that also comes from federal data, so that we stand behind the investment in these essential data resources.
Jed Sundwall:
Yeah, that's a great answer. I mean, again, the policy guy in me is nerding out a little bit, but a government's job is effectively to understand what's going on within its borders, for a bunch of reasons. It's a pretty easy story to tell. As you pointed out, the OPEN Government Data Act, the Evidence Act, this is bipartisan legislation.
This shouldn't be that hard. And I would say, and it maybe sounds a little bit cynical, but I'm okay with it: every administration cares about businesses and economic growth in the country, and data is vital to that. But this is always the tricky thing: I think there's an obvious, easy case to be made for a lot of data to be produced. Weather data is a good one, where the economy would grind to a halt without it.
Denice Ross:
Right.
Jed Sundwall:
Maybe not a halt, but it would be really bad if we didn't have weather data. But then there's this other universe of data for which there might not be great market signal, but which is just really important for governance, for public health or wellbeing or scientific research. I don't know, it doesn't seem like this should be that hard to advocate for. Anyway. Okay.
Denice Ross:
Yep. Well, in this interview you mentioned you're a policy person. I think I was in this field for 15 years before I realized I did data policy. And if you think about it, there's not really a pipeline of data policy wonks, right? We've got data users who just use the data and assume it will keep flowing. They often use the data as is, and they complain about its shortcomings. But they don't…
Jed Sundwall:
Yeah.
Jed Sundwall:
No!
Denice Ross:
go back to the data steward and say, hey, can you improve this? Because those feedback loops haven't been put in place. And so I think we have a real opportunity to build the field of data policy, so that anyone who's a data user, especially of public data, also has a little bit of policy understanding and recognizes that this is their data infrastructure to co-create as members of American society.
Jed Sundwall:
Yeah, no, that's beautiful. And actually, you're helping me realize what I was just trying to say, and I think we could be much more forceful about it: it's a core function of government to understand what's happening within its boundaries, and that's done with data, you know? So yes, there are dozens of us data policy nerds, but we should be more powerful. I think we can all agree. Yeah. Well, this has been awesome.
Denice Ross:
Hahaha.
Denice Ross:
So true.
Jed Sundwall:
I just checked in on the live stream. Apparently we weren't live streaming on LinkedIn; we'll have to look into what happened there. But that's okay, because this will still go out afterward. And no comments or questions from YouTube, so we're in the clear. We don't have to answer any hard questions, only softballs from me. Anything else you want to share about your work or what people should be thinking about before we go?
Denice Ross:
Hahaha.
Denice Ross:
Yeah, I would say: think about your favorite federal data set, the one that you might be taking for granted, the one you wish were a little bit better but you couldn't live without, and start practicing talking to people about why it matters, so that you build your skills, because it'll definitely be useful in the coming year. And if you come up with a good story about why these data matter,
let us know at essentialdata.us, because many of the use cases that are up there came from people who have deep expertise in a specific data set, and we were able to turn it into a one-sentence love letter for that data set.
Jed Sundwall:
All right. Yeah. We'll point people to essentialdata.us. Thanks for setting it up. I mean, thanks for everything you do. Thanks for coming on. This has been great. This conversation will continue, so we'll do it again sometime too. Thank you. All right. Okay.
Denice Ross:
Thank you, Jed.
Video also available on LinkedIn

Show notes
Jed talks with Matt Hanson from Element 84 about the SpatioTemporal Asset Catalog (STAC) specification and its role in making geospatial data findable and usable. Matt describes STAC as “a simple, developer-friendly way to describe geospatial data so that people can actually find it and use it.” The conversation covers how STAC emerged from a 2017 sprint in Boulder with 20 people and grew into a specification now adopted by NASA, USGS, and commercial satellite companies worldwide.
Matt promotes Howard Butler’s concept of “guerrilla standards” – a grassroots approach where stakeholders build something that serves everyone’s needs rather than making bespoke solutions. The central thesis: adoption is the only metric that matters. You can have the most elegant standard, but if nobody uses it, it’s not a success. STAC succeeded through community collaboration, simplicity of the core spec, an ecosystem of open source tooling, and timing—arriving just as cloud storage matured and satellite data exploded.
The conversation ranges into the limitations of remote sensing (“Remote sensing sucks,” Matt says, pointing to 20-30% error rates in land cover products), the future of purpose-built satellites, and why new data institutions are needed to validate emerging data products. Matt and Jed also discuss the credibility problem: launching a successful standard requires champions who have earned trust in the community. As Matt notes, “You have to earn credibility” – there’s no shortcut to building the relationships that make standards adoption possible.
Links and Resources
Takeaways
- Adoption is the only metric that matters — An elegant standard nobody uses isn’t a success. A “crappy” standard everyone adopts improves lives and enables interoperability.
- Guerrilla standards work through buy-in — When people are part of the process, their needs get addressed and they become champions who use the standard internally.
- Simplicity drives adoption — STAC focused on meeting 80% of needs with a simple core spec rather than trying to cover every possibility.
- Timing matters — STAC arrived when cloud storage matured, COGs gained traction, and satellite companies were launching rapidly. The previous methods weren’t working.
- Credibility can’t be skipped — Standards efforts need champions with established reputations. Chris Holmes’s involvement and relationships were essential to STAC’s early traction.
- Remote sensing has real limitations — 20-30% disagreement between land cover products is common. The value of remote sensing is in relative differences and time series, not absolute measurements.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
Welcome to Great Data Products. This is a live stream webinar podcast thing from Source Cooperative where we talk to data practitioners about their craft. We do this every month and you can visit us at greatdataproducts.com to see previous episodes and find links to subscribe on YouTube or wherever you get your podcasts. If you follow Source Cooperative on LinkedIn, we notify people about it there also. And then we also have a Luma calendar where
you can see the next episode. That's on greatdataproducts.com, but we actually have episodes scheduled out in January and February that you can see on Luma. I'll talk about that in a minute. But today we're joined by Matt Hanson from Element 84, a good old friend, I would say. And we're going to talk about the SpatioTemporal Asset Catalog specification. Matt, do you want to introduce yourself?
Matt Hanson:
Yeah, thanks Jed. Really happy to be here. Thanks for inviting me. I'm Matt Hanson. I work at Element 84, and I'll give a brief background: I've been working in the remote sensing field for, geez, close to 30 years now. I got into open source about 15 years ago; I went to FOSS4G and was instantly like, this is it. This is what I want to do.
I started contributing to GeoNode, which was the first open source project that I contributed to. And then I started working on other projects and eventually got into STAC and standards.
Jed Sundwall:
All right.
Jed Sundwall:
Nice. Well, I can say we've been lucky to have you in the community for a long time. And, yeah, I mean, we've got a lot to talk about. You've done a lot, you've accomplished a lot. And I would say your involvement in STAC has really secured your legacy. I mean, among others; it's a community effort, which is partially what we're going to talk about here. So, you recently, boy, let's actually back way up.
How do you describe STAC to people? With the caveat that this podcast is not necessarily a geospatial podcast; we do want to reach more people who don't necessarily have expertise in geospatial. So how do you describe STAC at a very high level?
Matt Hanson:
Yeah, so I describe STAC as a family of specifications as well as an open source ecosystem. And that's maybe not a really layman's way to describe it. So let's say that it's a simple, developer-friendly way to describe geospatial data so that people can actually find it and use it. That's the quick one-sentence version.
Jed Sundwall:
Okay. And okay, so I'm going to play layperson, and I actually don't even have to pretend that much; I'm actually this naive in a lot of ways. I've heard Mark Korver, another esteemed colleague in this world, describe STAC as solving the problem of listing objects in S3.
Of course, I can't help but be very nerdy here, but part of the problem that we're facing, and it's not just in the geospatial community, it's in many other domains, is that we're dealing with so much data that even just listing the files that you have is expensive. Like, it takes time. And so you can imagine having a corpus of millions and millions of satellite images, and you have to go through that haystack to find stuff. One way to characterize STAC is that it makes it
Matt Hanson:
Mm-hmm.
Jed Sundwall:
Basically easier to index all that stuff to find what you want. Is that fair to say?
Matt Hanson:
Yeah, I think that's definitely fair to say. The tying it to S3 is not necessarily required, right? Like, STAC could describe data files wherever; it doesn't have to be in object storage. But no, I think that's a good way to talk about it. When I give a STAC presentation for new folks, like a STAC 101, I often will talk about
exactly this issue of the explosion of geospatial data. There's been so much data, and if you look at just NASA's holdings and their projected holdings over the next five years, we see so much data. If you don't index the data, well, I had this saying: if your data is not indexed, it might as well not exist. Because if nobody can find the data, and, as you say, you're just getting a listing of all the files, how can you actually find
the data that you want if there's a billion files in object storage, let's say? And that's not far-fetched; that number is not all that far-fetched. Yeah. As I was saying, if we look at Sentinel-2, right? If you look at the entire Sentinel-2 archive, there are 25 million scenes, and for each scene there are 20 files. So it starts adding up really quickly.
Jed Sundwall:
No, yeah, not at all. Go ahead.
Jed Sundwall:
Okay. And then when you say easy, easy for whom? Like, you know, STAC stores its data in JSON. So who's the typical user of STAC? What kind of software do they use? What kind of job title do they usually have? Yeah.
Matt Hanson:
Yeah, geez, that's a good question. I think that the ultimate data user is probably a data scientist. And I think that's who the original target was. When we first started looking at this, we were primarily looking at public data sets, because that's what is available. And that's what we were looking to index: NAIP and Landsat and Sentinel-2. And it was really a
data science user problem. And that was my background. That was where I come from: working with scientists and working with different types of data and having to use different formats and different tooling just in order to find and access the data. And so I think that really was the primary user. We talk about it being developer-friendly because of the open source ecosystem,
and that's really developers working in tandem with data scientists in order to leverage and use the data.
Jed Sundwall:
Great. Yeah. I mean, I'm leading you here a little bit, the point being that, you know, I've worked in the open data space for my entire career, basically, at this point. And so many conversations have revolved around making data easy for anyone, or something like that. And I argue that that hasn't worked out super well. You actually need to find who the actual practitioners are that are going to use the data, and what they will be comfortable with,
or what will actually help them, rather than having a kind of nebulous everyone thing. Yeah.
Matt Hanson:
Yeah, yeah, it's clearly not everyone. I mean, we have had, like, journalists; people have reached out to us from, like, the New York Times, and they're creating stories and they want to access geospatial data. And so they've used some of the tooling around that. That's as close to a layperson, I think, that we've really worked with:
journalists who want to tell a story and they just want to find data. They just want data from five years ago and today, to look at a change over time and use it to write a story. And they were able to use the tooling, like pystac-client, and even before that there was sat-search, which was an earlier tool set, and they were able to figure that out. But they were still leveraging developers to
do that.
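The change-over-time workflow Matt describes, finding an item from five years ago and one from today over the same spot, boils down to a spatiotemporal filter. Here is a toy, stdlib-only illustration of the kind of query a STAC API answers on a user's behalf; the item dicts, IDs, and bounding boxes are invented for the example, and in practice a client such as pystac-client would send this as a search request to a real API:

```python
# A tiny in-memory "catalog": each item mimics a STAC item, with a
# bbox [west, south, east, north] and an acquisition datetime.
items = [
    {"id": "scene-2019", "bbox": [-122.5, 47.5, -122.0, 48.0],
     "properties": {"datetime": "2019-06-01T19:00:00Z"}},
    {"id": "scene-2024", "bbox": [-122.5, 47.5, -122.0, 48.0],
     "properties": {"datetime": "2024-06-01T19:00:00Z"}},
    {"id": "elsewhere", "bbox": [10.0, 50.0, 11.0, 51.0],
     "properties": {"datetime": "2024-06-01T10:00:00Z"}},
]

def bbox_intersects(a, b):
    """True if two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(items, bbox, start, end):
    """Mimic a STAC API search: spatial filter plus a temporal window.

    ISO 8601 strings in a uniform format compare correctly as strings,
    so no datetime parsing is needed for this sketch.
    """
    return [
        item for item in items
        if bbox_intersects(item["bbox"], bbox)
        and start <= item["properties"]["datetime"] <= end
    ]

# "Five years ago and today" over one neighborhood-sized box:
hits = search(items, bbox=[-122.4, 47.6, -122.2, 47.8],
              start="2019-01-01T00:00:00Z", end="2024-12-31T23:59:59Z")
```

Both same-location scenes match while the unrelated footprint is filtered out, which is exactly the narrowing-the-haystack job discussed above.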
Jed Sundwall:
Right. Well, but I think then there's another clue here, and we'll go on with journalists: you have an audience that typically has not been able to engage with imagery or geospatial data. But they are, and we've watched this happen throughout our lives, becoming more savvy and more aware of the need to be able to use software and data to tell stories and things like that. But they're coming to us from
a completely different place than I think most geospatial data practitioners were in previously. And so the key there, I mean, you mentioned pystac, you know, for whatever reason a lot of journalists use Python; there are different communities that use different tools. Yeah.
Matt Hanson:
Yeah, right. Sure, yeah. Language of data science, yeah.
Jed Sundwall:
Yeah. Okay. We actually already have a question on YouTube from someone who I'm just going to call Sig. I'm not sure if that's his or her name; I can't tell. But they're asking: STAC is built around sharing data easily with anyone. Let's say you want to use it to share more secret data, with access control, SSO, encryption, et cetera, and different users that have different access to different data sets. I have some thoughts on this, but as you mentioned, STAC doesn't have to be explicitly tied to
a cloud object store or a public bucket. Do you want to take that? I imagine you have some actual examples here. Yeah.
Matt Hanson:
Yeah, so this question comes up a lot, right? So I will get a little bit more technical here. We have an API called Earth Search that indexes public data sets on AWS, and that's an implementation of the STAC API. And that implementation has no authentication in it, because we were using it originally to index public data. And so
we didn't have need for controlling access; all the data was public, and so we hadn't added that. And so we get that question a lot. And stac-fastapi is another implementation that didn't have core built-in authentication at the time it was first created. So there's a couple of ways to do this. I'll jump to the end first, which is that there's a more
modern solution for this called STAC Auth Proxy, which DevSeed has created. That can be used to control access to individual items and collections based on attributes in the data. So that works pretty well. But what we've generally done is use a proxy. So you have your catalog, and that's open. Or it's behind a firewall, but it's available to anyone who can access it.
Jed Sundwall:
Okay.
Jed Sundwall:
Interesting.
Matt Hanson:
And then we have a proxy in front of that that handles the authentication, queries the catalog, knows what people can see, and then returns that result. So it's going through the proxy. But these tend to all be one-off solutions. So I think STAC Auth Proxy, if you haven't seen it, is definitely something to look at, and you can combine it
with stac-fastapi, or potentially any STAC API implementation.
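The proxy pattern Matt describes, an open or firewalled catalog with an auth layer in front that filters what each user may see, can be sketched in a few lines. Everything here is invented for illustration: the user-to-collection permission table, the hard-coded catalog response, and the function names are all hypothetical, and the real STAC Auth Proxy works differently (it filters on attributes in the data). The sketch only shows the shape of the idea:

```python
# Hypothetical access rules: which collections each user may see.
PERMISSIONS = {
    "alice": {"sentinel-2-l2a", "landsat-c2-l2"},
    "bob": {"sentinel-2-l2a"},
}

def catalog_search(**params):
    """Stand-in for querying the upstream, unprotected STAC catalog."""
    return [
        {"id": "S2A_123", "collection": "sentinel-2-l2a"},
        {"id": "LC09_456", "collection": "landsat-c2-l2"},
    ]

def proxied_search(user, **params):
    """The proxy layer: authenticate, query the catalog, filter the result.

    The catalog itself stays simple and unaware of users; access control
    lives entirely in front of it.
    """
    allowed = PERMISSIONS.get(user, set())
    return [item for item in catalog_search(**params)
            if item["collection"] in allowed]
```

With this arrangement, `proxied_search("bob")` returns only the Sentinel-2 item, and an unknown user gets nothing, while the upstream catalog never changes.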
Jed Sundwall:
Okay. So yeah, one thing I'll underscore here also is that STAC is a metadata spec. It doesn't itself say anything about authentication or anything like that. It's been built to be very flexible, useful in all sorts of environments, and extensible. I want to just stay in the weeds of STAC a little bit longer. So the specification
Matt Hanson:
That’s right.
Jed Sundwall:
is made up of other specifications. So you have the idea of, and I'm going to go in order, a catalog, a collection, and an item. Can you walk through each of those and what they encompass?
Matt Hanson:
Yeah, sure thing. Well, we start at the top: that's a catalog. A catalog is really just a container. It's a JSON file. It contains really simple fields: you have a name, you've got a title, you have a description. And then, most importantly, all of these entities within STAC have links. And links are probably the most important part of STAC, right? Because
Jed Sundwall:
Yeah. Okay.
Matt Hanson:
when we got into this at the beginning, the ability to crawl a catalog was really important, because that's the way the internet works, right? By crawling things. And so we wanted to be able to link a whole catalog together, and link down to items and link back up, so that you could visit any part of the data in this catalog and crawl it in both directions.
So the catalog is the starting point. In an API especially, the catalog is your landing page, and it's going to contain links to the collections underneath it. And each collection really looks a lot like a catalog. At one point a collection even was a catalog; it was derived from a catalog.
Technically that's not the case anymore; it's its own entity, but it looks a lot like a catalog. Collections are ways to group together items and data that are similar to each other. The most obvious case is when we look at the big public data sets, Sentinel-2 or Landsat: Sentinel-2 Level-2 data, that is a collection.
Right, it contains a bunch of items, and that's your next level down: an item. And an item is where we move from JSON to GeoJSON, because an item actually represents a specific location and a specific time or range of times. And that's really where your data is. You can think of it as a scene; you can think of it as a footprint containing data. The data is contained
in what are called assets. So that's really the fourth entity type, except assets are embedded directly in the GeoJSON of items. So you have the catalog, collections, and then items, and that's the general hierarchy. And we have links that allow you to go all the way down from catalogs to items. Now, there are some nuances between
Jed Sundwall:
Right. Okay.
Matt Hanson:
what we call a static catalog, which is really just a bunch of linked JSON files on disk or as blobs in an object store, and a dynamic catalog, or what we call an API. That's an important distinction, because you can have, for instance, sub-catalogs within a static catalog.
That might be a little confusing or not, but it's a way to partition the data, basically. You can use sub-catalogs to organize it. So you might have a collection, and then underneath that we'll have a catalog for each continent; then you go into the continent, and that's where your items are. It's just a way to partition and organize the data. In an API, and this question comes up a lot, which is why I have the whole
narrative around it here, you don't need those sub-catalogs, because you don't need to partition the data: you can search for the data by what continent it's in, or what path/row it is if it's gridded data, or you can essentially partition on the fly any way you want. So that's the important distinction between static catalogs and an API. We get the question a lot:
people have static catalogs and they ask, how can I search this? And you can't really search it; you have to index it first. There's that missing piece. But originally Chris Holmes really wanted us to focus on being able to have static catalogs, because not everybody wants to stand up a server and incur the cost of that. They just want to make data available and share it with people.
And so the easiest way to do that is just to have the metadata on disk, all linked together so you can crawl it and index it if you wanted to do that.
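The catalog, collection, and item hierarchy Matt walks through can be made concrete with a minimal static-catalog sketch. The IDs, hrefs, footprint, and asset below are invented for illustration; only the shape, three linked entity types with assets embedded in the item's GeoJSON, follows the pattern he describes:

```python
# Top level: a catalog is just a container with links to its children.
catalog = {
    "type": "Catalog", "stac_version": "1.0.0", "id": "demo-catalog",
    "description": "A tiny demo catalog.",
    "links": [{"rel": "child", "href": "./sentinel-2-l2a/collection.json"}],
}

# Middle level: a collection groups similar items (e.g. one sensor/product).
collection = {
    "type": "Collection", "stac_version": "1.0.0", "id": "sentinel-2-l2a",
    "description": "Items that are similar to each other.",
    "license": "proprietary",
    "extent": {"spatial": {"bbox": [[-180, -90, 180, 90]]},
               "temporal": {"interval": [["2015-06-27T00:00:00Z", None]]}},
    "links": [{"rel": "root", "href": "../catalog.json"},
              {"rel": "item", "href": "./items/scene-001.json"}],
}

# Bottom level: an item is a GeoJSON Feature, a specific footprint at a
# specific time, with the actual data files embedded as assets.
item = {
    "type": "Feature", "stac_version": "1.0.0", "id": "scene-001",
    "collection": "sentinel-2-l2a",
    "geometry": {"type": "Polygon",
                 "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]},
    "bbox": [0, 0, 1, 1],
    "properties": {"datetime": "2024-06-01T10:00:00Z"},
    "assets": {
        "visual": {"href": "s3://example-bucket/scene-001/visual.tif",
                   "type": "image/tiff; application=geotiff; "
                           "profile=cloud-optimized"},
    },
    "links": [{"rel": "collection", "href": "../collection.json"}],
}
```

Written to disk as three JSON files, the relative `links` are what let a crawler walk down from the catalog to the item and back up, which is the "static catalog" case; an API serves the same entities dynamically.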
Jed Sundwall:
That's right. Yeah, I can speak to this. I mean, this was a long time ago now, when all this stuff happened. And it's relevant, actually, to another comment or question from Sig on YouTube asking: so did the chicken or the egg come first, i.e., STAC, or S3 and the cloud-optimized formats? I assume STAC wouldn't exist with only old files on disk. So there's a lot to respond to there. First I'll say:
this is a fundamental issue about the distinction between file storage and object storage that is just not obvious to most people, because they never have to think about it. If you're using a file system, like if you're using a normal computer with a GUI and stuff like that, you probably are interacting with the file system. Your computer needs to have an understanding of what the files on your hard drive are, and it has an index of them.
It also has an index of how the directories are nested and things like that. And you can search your computer for files and stuff like that. Otherwise, a lot of applications would be a huge pain to use if you didn't have that index. Object storage like S3 has nothing like that. Object storage is just: you have a file, you give it a key name, and you put it in a cloud. And it's there. If you know that key name, you can get it back out. And so,
going back to the discussion before about too much data: you can imagine a scenario where you have so many objects, so many files you're dealing with, that even the index of them would be too large for your laptop. Just listing the names of the files would be too large for a lot of people's local storage. This is not a crazy idea, let alone metadata about all those sorts of things. And so
Matt Hanson:
Hmm.
Jed Sundwall:
STAC and a lot of the cloud-optimized approaches are an attempt at standardizing, or finding patterns whereby we can break up all of this content in ways that are manageable. That has to do with things like the STAC catalog, as you described, Matt, with all these JSON files pointing the way, and also things like naming conventions, which all add up to make that stuff work. The only other thing I'll say is that when we brought
Landsat onto AWS, the metadata that USGS would provide in its tarballs with the imagery was just this weird text file that was space-delimited or something like that. Do you remember these? Yeah, the MTL files, right? And I was just like, you know, I think it'd be better if this were at least JSON. So what we did is we created a process that happened at the end of every image that we
Matt Hanson:
MTL. Yeah. Yeah.
Jed Sundwall:
brought in and turned into a cog, as soon as it all landed in the bucket, we would run a Lambda function to take that MTL file and turn it into a JSON version of it. And I think that was kind of the kernel of like the sort of the first notion of doing something like this, where it’s like, you should be able to get to an image and you should have a reliable little machine readable, you know, or like easily parsable bit of metadata that you can find right by it.
Matt Hanson:
Yeah.
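The Lambda step Jed describes, turning Landsat's MTL text metadata into JSON, is mostly line parsing. Below is a rough stdlib-only sketch; the MTL snippet is abbreviated and its values are illustrative, and real MTL files carry many more fields, but the KEY = VALUE lines nested inside GROUP blocks are the actual shape of the format:

```python
import json

# Abbreviated example of the space-delimited MTL format.
MTL_SAMPLE = """\
GROUP = PRODUCT_METADATA
  SPACECRAFT_ID = "LANDSAT_8"
  WRS_PATH = 47
  WRS_ROW = 27
END_GROUP = PRODUCT_METADATA
"""

def mtl_to_dict(text):
    """Parse KEY = VALUE lines (with GROUP nesting) into a plain dict."""
    root, group_stack = {}, []
    current = root
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "END":
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip('"')
        if key == "GROUP":
            # Descend into a nested group.
            group_stack.append(current)
            current = current.setdefault(value, {})
        elif key == "END_GROUP":
            # Pop back out to the enclosing scope.
            current = group_stack.pop()
        else:
            current[key] = value
    return root

print(json.dumps(mtl_to_dict(MTL_SAMPLE), indent=2))
```

The resulting dict serializes straight to JSON, which is essentially what made the metadata "findable right by the image" once it sat next to each COG in the bucket.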
Jed Sundwall:
And then I guess, just to close this off, also with the understanding that, yeah, there are a lot of people that are never going to run their own API. They can't stand up a service, and there are a lot of data products out there that do just need to land somewhere. And if somebody else wants to index them, they can. And I think the static STAC catalogs make that easier, I would say.
Matt Hanson:
Yeah, yeah, yeah, exactly. Yeah.
Jed Sundwall:
Okay, so now let's talk about the blog posts that you wrote, the sort of history of STAC. Give us the high-level overview. I mean, we've included the link as we've promoted this, and I'll have to put it back in the chat, but it's really good. But summarize it quick. It's a comprehensive post. Tell the story again.
Matt Hanson:
Okay.
Matt Hanson:
Okay. So yeah, it is a bit lengthy. So I did these two blog posts. The first one I wrote a couple of years ago, and I always meant to write a part two, and two years passed. And then I'm like, you know what? I've long been wanting to do it; I had drafts in various conditions. So finally I'm like, this is the time, you know?
Jed Sundwall:
Yeah, you did it.
Matt Hanson:
Just as we were publishing it, STAC was accepted as an OGC community standard, so it seemed like a good time to actually publish it. So the most recent post is called Why STAC Was Successful, and it really looks at, like, how on earth did this effort that started back in 2017,
with 20 people in a small room at the Marriott in Boulder, how did this turn into something that is now being adopted by commercial companies that are launching satellites, as well as space agencies? NASA and USGS, for the Landsat program, were definitely early adopters, and that helped a lot. So I talk about this idea of guerrilla standards.
I gave a hat tip to Howard Butler on that, because I love the term guerrilla standards; it really encapsulates what this process is and how it's different than traditional standards work. And so that's a big part of it, and we could talk more about the guerrilla standards. But it's this grassroots approach where you get people that are interested, you get stakeholders that are interested in
doing something better and working within a community rather than making a bespoke thing on their own. And you build something that will serve everybody's needs. And this is critical, because, and I'll skip to the end a little bit again here,
the conclusion of this is that when we talk about standards, there's really only one metric. Well, as a bit of a joke I say there are three metrics that matter: adoption, adoption, and adoption. And that's true. You can have the most elegant standard that could exist. You could spend lots of time and make sure it covers every possibility, and it's very elegant and very nice.
Matt Hanson:
But if it doesn't get used, that's not a success story at all. You can have something that's maybe a little crappy, and if everybody uses it, it's hard to argue that the crappiness was a bad thing. If everybody's using it, it's improving everybody's lives and it's making interoperability easier. And so the central thesis of
Jed Sundwall:
Yeah. Yeah.
Matt Hanson:
the post was that adoption is the only thing that matters. And then I examine, like, how did we drive that adoption? That's the question, right? It was successful because it's been adopted pretty widely, so what was it that we did that drove that adoption? And part of that is the guerrilla standards approach:
getting stakeholders and getting champions and getting people excited about it and having buy-in. That's an important piece of this: when people are part of a process, they're more likely to use it. Their concerns and their needs are being listened to, and they're more likely to go back and champion it and use it internally for their own projects as well.
Jed Sundwall:
Yeah.
Matt Hanson:
And then another aspect is the simplicity of it, the core spec. This wasn't about trying to make a standard for everybody and everything. This was about creating a spec that was going to meet 80% of the needs, and really focusing on what those needs were. How do we find data? How do we have consistent metadata across
different providers? How do we have something really simple, and how do we encourage people to use it? We encourage people to use it by creating an ecosystem of tooling so that there's a low barrier to entry. The ecosystem is part of the guerrilla standards approach: you need to start building implementations. And at that first sprint back in Boulder, at the end of the day, thanks to Rob Emanuele and Seth Fitzsimmons,
Jed Sundwall:
Yeah.
Matt Hanson:
we had a server working at the end of one day that was serving up NAIP data. I don't think we went back to it; it doesn't really resemble much of what STAC looks like today. But that wasn't the point. The point was that we got some ideas together, we stood it up, and it worked, and then we could continue to iterate on it. So let's see what other
aspects of the post I feel like I should call out. The community collaboration is critical, like having in-person sprints that are open for anybody to join. That is key as well. And I would be remiss if I didn't mention the timing. The timing, I think, was just serendipity perhaps.
But the timing of STAC was critical to its success. We were at a point where the public clouds were maturing, and we were starting to see more geospatial data on them. You were just talking about your effort on bringing Landsat to AWS. COGs were really starting to gain traction. There was an explosion of
private companies launching satellites. So there was just a real need there; the previous methods weren't working, and no one else was really solving that. And so it just filled the missing layer at exactly the right time.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Yeah, no, it's great. I mean, it's
such a fascinating example of really what we're actually trying to do with this live stream webinar podcast thing, which is: we know some things have worked, and we need to understand why they worked, what made the difference. It's very easy to look back at failed attempts at foisting standards on the world, so many standards that have not been adopted at all,
despite all the good intentions and the need and things like that. And so it feels mysterious why STAC was successful, but I think your post and everything you just said make it not a mystery. I think we can look back at the things that made it successful. And it's actually kind of interesting timing: we got another comment on YouTube from, I don't know, the username is bent quarter.
So, bent quarter.
Who knows? But they're asking: is there a GUI for building a STAC? Which is a super interesting question, because everything you're talking about, you know, we got all these people together and it was easy for them, and we had a server running by the end of the day. The people that we're talking about are data practitioners. It's a pretty esoteric cool kids club; these sprints, they're not huge. It's a small group of people who really have practical experience and needs.
Jed Sundwall:
They understand each other, which I'd say allowed you to gain traction really, really quickly. But yeah, we are at the point, I think, where this question, is there a GUI for creating a STAC, becomes interesting. It certainly wasn't the priority, but where are we now?
Matt Hanson:
Yeah, it is an interesting question. And the answer is no. There are interfaces for browsing catalogs: there's STAC Browser; we have a user interface that we stand up for Earth Search, called FilmDrop UI, that is an interface for the STAC API; Microsoft Planetary Computer has a user interface; and there are others out there as well. But these are all kind of focused on
being able to search and browse existing APIs, not actually creating your own. And I think that's just because those are different user bases. The people building the STAC metadata are generally developers, and you have a bunch of data and you generally want to programmatically create the
metadata from it: extracting the footprint, or pulling metadata fields that are important from the original metadata or from the headers of the data file. So that really is done in a programmatic way. I think someone might have created a user interface for creating collections.
It would just be a form where you can go in and fill things out. But it's not a bad idea either, having some sort of user interface to make this easier. I think it would have to be combined with some back end where maybe you're dragging and dropping a series of files, and then it's going to try to fill stuff in, but gives the user an option to add additional details,
and then extend that to ingesting a bunch of other scenes. Maybe there's something there that actually could be useful and make it easier for users to make their own. There's some CLI tooling for creating STAC, like rio-stac, which can be used to create a bare-bones STAC item from COGs. But no one's
Jed Sundwall:
Yeah.
Matt Hanson:
really come up with a GUI for building a STAC.
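The programmatic path Matt describes, deriving the footprint and key fields from each file rather than filling out forms, looks roughly like this. The scene ID, corner coordinates, and timestamp are invented for the sketch; in practice a tool like rio-stac reads the footprint and other fields from the raster's own header:

```python
def item_from_scene(scene_id, corners, acquired):
    """Build a bare-bones STAC-style item from a scene's corner coordinates.

    corners: list of (lon, lat) pairs tracing the footprint, in order.
    acquired: ISO 8601 acquisition timestamp.
    """
    lons = [c[0] for c in corners]
    lats = [c[1] for c in corners]
    # GeoJSON polygons must be closed: repeat the first vertex at the end.
    ring = [list(c) for c in corners] + [list(corners[0])]
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": scene_id,
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "bbox": [min(lons), min(lats), max(lons), max(lats)],
        "properties": {"datetime": acquired},
        "assets": {},
        "links": [],
    }

# Hypothetical drone scene over a few city blocks:
item = item_from_scene(
    "demo-scene-001",
    corners=[(-122.42, 47.65), (-122.35, 47.65),
             (-122.35, 47.71), (-122.42, 47.71)],
    acquired="2025-01-15T19:30:00Z",
)
```

Run once per file in an ingest pipeline, a function like this is what replaces the form-filling a GUI would otherwise require.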
Jed Sundwall:
Yeah, that's an interesting question. But it also gets at, I think, a huge challenge, a challenge that a lot of government executives need to understand, a lot of people working in policy, people working on workforce development, people educating future leaders and data scientists: the volume of data that we're working with is so large that
the notion of creating tools that are really designed for humans to click and drag and point at things and track with your eyes, that's not how it's going to be done. Yeah.
Matt Hanson:
Right, right. It has to be programmatic. And that's why I said a GUI that allows you to set that up, right? Set up the programmatic creation of it. That might be useful. But you're right: you're not going to manually create a STAC item for every scene, for every image.
Jed Sundwall:
Yeah.
Jed Sundwall:
No. Yeah. And this is not to dismiss the idea: should there be a GUI? It still remains an interesting question. But I think it reveals the fact that STAC emerged because we suddenly found ourselves dealing with so much data that it required a purely programmatic approach at first.
Matt Hanson:
Yeah. And those were the first use cases too, right? It was these big archives. It was Landsat, it was Sentinel, it was NAIP. It wasn't small amounts of commercial imagery, because we didn't have access to those; that cost money. So the primary use case was: how can we make it easier for users to access public data sets?
Jed Sundwall:
Yeah. Yeah.
Jed Sundwall:
Right. I'm imagining now an entirely local use case. As I mentioned to Matt before we started streaming, there's a mudslide in my neighborhood in Ballard. I don't know any details about it, and I hope no one's hurt or anything like that, but literally right now there's a mudslide in my neighborhood. You could imagine somebody going out there with a laptop and a drone, flying some imagery, producing a relatively small product, and wanting to package that up in a nice tidy STAC catalog that they can then get out somehow. I could see that as being a very lay-person, not-touching-the-cloud, Dropbox-scale kind of thing that you could do. And maybe the use case is emergency response for something like this.
Matt Hanson:
Yeah, for sure. And some people, I think, have created STAC catalogs for small data sets like that. But then that raises the next question, which is: how do people find the catalogs?
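Once a catalog is registered with a STAC API, "finding" items becomes a search request. As a sketch of what that looks like: a STAC API search is typically a POST of a small JSON body to the service's `/search` endpoint. The collection name and bounding box below are illustrative values (the bbox roughly covers the Ballard area Jed mentions); this code only builds the request body, it does not contact any service.

```python
import json

# Body for a hypothetical POST to a STAC API /search endpoint.
search_body = {
    "collections": ["sentinel-2-l2a"],          # example collection id
    "bbox": [-122.46, 47.64, -122.36, 47.72],   # west, south, east, north
    "datetime": "2024-01-01T00:00:00Z/2024-12-31T23:59:59Z",
    "limit": 10,                                # items per page
}
payload = json.dumps(search_body)
```

With an HTTP client you would POST `payload` to a STAC API endpoint (Earth Search, mentioned earlier, is one such API) and get back a GeoJSON FeatureCollection of matching items.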
Jed Sundwall:
Well, I want Source Cooperative to be a place where people find these things. So: brought to you by Source Cooperative. This is our podcast, so I get to do stuff like that. Thank you. Actually, that is a prompt to do what I said I was going to do. We're going to do some housekeeping really quickly, because we know that some people have joined midstream.
Matt Hanson:
Right, so there's, yeah.
Yeah, you got that. That was a lead-in for you to plug it.
Jed Sundwall:
So this is Great Data Products. It is a live stream webinar podcast thing brought to you by Source Cooperative, which is a data publishing utility that we manage. You can go to source.coop to learn about it. This is the time where we talk to data practitioners about their craft, and this month we're talking to Matt Hanson about the SpatioTemporal Asset Catalog, or STAC, metadata specification, which has been wildly successful.

And then, to do a little bit more self-promotion: there's Great Data Products, the live stream webinar podcast thing, and we also published a blog post a little while ago called "Great Data Products" that has, I think, done pretty well. You can go to Radiant Earth at radiant.earth/great and read that. But I'm going to share something.
Let's see if I can do this. Can I share my screen? Yeah, I'm going to share a window in response to, again, the question about GUIs. This is a drum I've been beating for a really long time, a graph I've been talking about for many years, but it's been enshrined in this blog post. I'm going to expound on this in a future post, but it's useful for thinking about how you maximize the usability of data, and why a programmatically accessible approach is so important. If you have raw data off of a sensor, it is not going to be that useful to that many people. There's an inherent cost required to extract any sort of value from it, and satellite imagery is notoriously difficult here,
Jed Sundwall:
which we can talk about, all the reasons why that is. But what often gets funded is: I want a thing that's going to track mudslide risk in the Pacific Northwest, for example, right? So you can spend a lot of money sorting through the data, processing it, creating an interface, doing user testing, to create a tool that helps you understand mudslide risk in the Pacific Northwest. You've gone over this huge arc where you spend a ton of money, but then the potential value of the data is diminished again, right? This is always my warning against focusing on GUIs or dashboards and stuff like that: by creating an interface like this, you're making a ton of decisions about what the value of the data is. Instead, what we should be trying to do is maximize the queryability of the data. I call this the sweet-spot graph: we have to find this place where we're taking out a lot of the annoying, undifferentiated heavy lifting required to get the data in a way that's queryable, without over-determining it. Anyway, I'm preaching to the choir with you, Matt.
Matt Hanson:
Yeah, you know what a great example of that is, too? Landsat. Let's take a look at Landsat. There are two processing streams that Landsat does. They have an ARD product, which is in one projection; it's actually an Albers projection. There are five different Albers projections, maybe seven, depending on the continent and the place on the Earth.
Jed Sundwall:
Yeah.
Jed Sundwall:
What’s your favorite? Sorry, I’m just kidding. Yeah.
Matt Hanson:
Favorite continent? Favorite Albers projection? I don't know.
Jed Sundwall:
I'm sorry, just go on. I'm trolling you. Sorry.
Matt Hanson:
So there's the ARD stream, and that's distributed as these ARD tiles. And then there's the regular stream of data, which delivers UTM tiles. So the question is: why these two different things, right? And the reason is that people like, and are used to, UTM, because it makes a nice pretty picture, but it introduces more errors than the Albers projection does. The Albers projection minimizes the distortion errors from the original raw data. So I have this thing that I like to say, which is: as soon as you pick a projection, it's the wrong one. And this is in the graph, because if you want to maximize value, then you should try to avoid making assumptions about how people are going to use that data.
Jed Sundwall:
Right. Yeah. Yeah.
Matt Hanson:
Projection is a perfect example. Rather than picking a projection that you think is going to be useful for everybody, just pick the one that's going to minimize the potential errors, because you know that people are going to reproject it. MODIS does this great: MODIS uses a sinusoidal projection, which is the best projection for minimizing distortions due to the orbit of the craft. Everybody hates it, because it doesn't make for very pretty pictures if you open it up and look at it directly in QGIS. It looks all wonky, but it really is the best choice for that case.
Jed Sundwall:
Interesting.
Jed Sundwall:
Fascinating. Oh, wow. Okay. You know your stuff. No, it's great though. We have a pretty interesting question on this note of what the right way to present data is, from the great Maxime Lenormand, who I'll just embarrass a little bit more: there's no way we'd even be having this podcast if it wasn't for
Matt Hanson:
I know a couple things and I just keep on reusing the same stuff.
Jed Sundwall:
Minds Behind Maps and the approach that he took with that. So he asked: does this still hold in a world where it's so much easier to make custom dashboards, GUIs, and front ends with AI? I have a response to that, but I'm curious to hear what you think, especially since Element 84 does so much great work producing really interesting tools. What are your thoughts on this?
Matt Hanson:
Well, I like GUIs. They're pretty, you know, but they're also pretty impractical, aren't they? If we look at the data, 99.99% of the data out there, no one's ever going to look at.
And so I do think we spend an inordinate amount of time focusing on visualizing remote sensing data, when that's actually not really a great use case outside of demos and pretty pictures; maybe journalists like that, if you're telling a story. So, you know, it's great that it's easy to make custom dashboards, and I've been working on some UI stuff recently and it's fun. But I think from a practical standpoint, we need to be focusing more on unlocking the value in the data with programmatic back ends.

I don't know if that really answers the question.
I don’t know if that really answers the question now.
Jed Sundwall:
Well, yeah, I think I agree with you. I think GUIs are maybe the wrong thing to be thinking about. I use this example all the time; I may have already mentioned it on this podcast and I probably will again: so many attempts at making Earth observation data useful for agriculture, especially in low- and middle-income countries, are like, "It's great, we're going to give the farmers an app and then they'll know what to do." And I'm like, no one's going to use your app. The farmer's not going to install your app. They're not going to open it. It's not going to become a part of their life. It's possible; there are sticky technologies that do become part of people's lives. But it is so expensive to make that happen, and it's so rare for it to actually happen. My hypothetical Earth observation application for the farmer in a poor country is: suddenly
Jed Sundwall:
they can get insurance for some reason. They don't know why, but there's a flyer for them to get insurance, or a salesperson comes and visits them and says, hey, we can actually sell you affordable insurance now. The basis of that insurance product is Earth observation data that allows the insurance product to exist. It is a product of the data, but the farmer doesn't have to know anything about that. The value gets built into the price of the insurance, and that's how the value is delivered. Is there a GUI or some sort of UI to the data between the receipt of the data and the creation of that insurance product? Maybe, maybe not. But increasingly, and this partially answers Max's question about the age of AI, and Matt, I know you've said stuff about this before, it's just going to be a model doing all the analysis.
Matt Hanson:
Mm-hmm.
Jed Sundwall:
And what's derived out of it is going to be some sort of index or figure that gets put into a spreadsheet or database, or informs some other process. Yeah. Which, by the way: you can preview CSVs on Source Cooperative now, which is amazing. Go.
Matt Hanson:
That’s right. It’s tabular data. The future is tabular data. Yeah.
Matt Hanson:
Nice, that's great. So, all right, this is a bit of a tangent, but I feel like it's maybe a good time to say this. I used to give a presentation where I talk about this a little bit. I don't know, this is going to seem like a tangent, but…
Jed Sundwall:
Go for it. That’s why we’re here.
Matt Hanson:
You talked about the farmer getting the app, and there's another reason why that doesn't really work. It's because remote sensing sucks. All right? RSS: remote sensing sucks. And what I mean by that is, I've been in this space for a while, and if you look at old research papers and new research papers, and take a look at land cover products, for instance,
you can get land cover products from different producers, for the same year, using the same data, and they might have 20 or 30 percent disagreement with each other. Because there's a lot of stuff that goes into how the image is formed. The entire radiative transfer equation for how that light propagates and becomes the image means a lot of variability. And when we talk about level-two data, we have atmospheric correction, which also introduces a tremendous amount of variability. So I have this issue with the ag community, and I think lots of other industries have done this as well, where they've over-promised and under-delivered on what remote sensing can do. You know, 20 or 30 percent errors are not uncommon. But if you go to an engineer doing space exploration, or any other engineering discipline, and say, oh, 30 percent errors are normal, they're going to laugh at you, right? We didn't send people to the moon with 30 percent errors. You're going to miss the moon. So I think there's an aspect here of
Jed Sundwall:
Yeah.
Jed Sundwall:
Right.
Okay.
Matt Hanson:
having realistic expectations around what remote sensing is capable of. Traditionally, back before Landsat was available on S3, the people doing that work were scientists, so I don't think it really came up that people were misusing remote sensing data in a bad way.

But once that data became available to the masses, and this ties into some of the stuff you were saying before, everybody started using it. Startup companies were starting to leverage it to generate NDVI. I remember working with one company back then that was using that Landsat data to calculate NDVI. And the problem was, and I think I've told you this before, Jed, that data was not appropriate for doing that. The original Landsat data that was on AWS was level-one data. It wasn't even level-one top-of-atmosphere data; it was like top-of-atmosphere prime, so it didn't even account for angles. And I think that ended up causing more of the same problem: people continually being over-promised on what remote sensing can do. So that's my issue with the ag community: I feel like they've over-promised what it's capable of. Remote sensing is very powerful because, while I might not be able to measure the water quality in a lake very well, within some error, I can look at every lake in the world, every day. And what it's really, really good at is looking at relative differences. So: time series.
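To illustrate the NDVI problem Matt describes: NDVI is a simple band ratio, and if you compute it from uncorrected top-of-atmosphere data, atmospheric differences between two days show up as apparent vegetation change. The reflectance values below are invented for illustration; the point is only the mechanism, not the specific numbers.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for a single pixel."""
    return (nir - red) / (nir + red)

# Made-up surface reflectance for one healthy-vegetation pixel.
surface = ndvi(nir=0.45, red=0.08)

# Made-up top-of-atmosphere values for the SAME pixel on two days:
# haze on day 2 scatters extra light into the red band, so the
# uncorrected index drops even though the ground did not change.
toa_day1 = ndvi(nir=0.44, red=0.10)
toa_day2 = ndvi(nir=0.42, red=0.16)

# Apparent "change" that is purely atmospheric, not vegetation.
change = toa_day2 - toa_day1
```

This is why comparing NDVI across dates calls for atmospherically corrected (surface reflectance) data, which the original level-one Landsat product on AWS was not.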
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Yeah.
Matt Hanson:
is where remote sensing really shines: being able to look at change over time and differences. And that leads into a whole other segue, which is why most commercial satellite data providers have bad business models.
Jed Sundwall:
Yeah.
Jed Sundwall:
Okay. We should keep going down this path, I think.
Matt Hanson:
It's that they're focused on this idea of selling imagery scene by scene, and there's really limited use for that. Maybe for photogrammetrists, looking at it the way we originally used it: we have a high-resolution image and we're going to look at it and identify things. But the real value in all of these archives of data is the time dimension.
Jed Sundwall:
Right. That’s right. Yeah.
Jed Sundwall:
That’s right.
Matt Hanson:
And so, I don't know, I hope for a future where those archives are unlocked. Maybe there's a subscription model where you can access the whole entire archive. But this whole piecemeal, image-by-image thing just seems a little ridiculous.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jed Sundwall:
Absolutely. Okay. I mean, yeah, we're tipping into the philosophical, which is great; that's what we get to do here. I like to say that imagery is a metaphor for the data. Imagery is one way to see the data, because you want to see it, right? I went through this a bunch when I was at AWS building the open data program. I'd have executives who were like, where do I see the pictures?
Matt Hanson:
Alright, there’s a bunch of things there.
Jed Sundwall:
What will it look like? And I'm like, well, do you know what an S3 bucket looks like? It's just a bunch of objects with names; it doesn't look like much. We had the same issue when we started hosting Hubble Space Telescope data, where people were like, I want to see pictures of the galaxies and stuff. And I'm like, yeah, that would be cool, but that's not what's in here. This is telescope data in a weird format called FITS that has its own
Matt Hanson:
yeah. Right.
Jed Sundwall:
great, wonderful people trying to figure out how to cloud-optimize it. But the imagery is a derived product that's made for a human to look at with human eyes. That's just one tiny sliver, one tiny slice, of how this data can be interpreted or used. So, yeah. I do feel like I want to defend myself with the Landsat stuff. I'll first of all just say:
Matt Hanson:
Yeah.
Jed Sundwall:
I didn't know what I was doing. I was just like, well, look, we're going to bring the Landsat data onto AWS. I had some ideas that I bandied about with Peter Becker from Esri and Frank Warmerdam at Planet. I consider them the two people who said, you should do this internal tiling and overview thing, which ultimately became known as the COG. And
Matt Hanson:
Yeah.
Jed Sundwall:
that was it. I was just like, well, we'll just see what happens. But I guess my question is: is that a solvable problem? Is any data fit for, you know, safe for, public use and distribution?
Matt Hanson:
Probably not, I mean, right? Any data can always be misused. And don't get me wrong: that move of Landsat to the cloud was huge. It really popularized Landsat. We wouldn't be where we are today if that very important data set wasn't there. But at the time,
Jed Sundwall:
Yeah, I don’t think so. Yeah.
Matt Hanson:
that data wasn't really available. Well, it was available, but that's not who was using it. It was scientists, and anybody using it probably should have opened up the Landsat data users handbook and read what the data was and what needed to be done with it in order to do things like compare NDVI across two different days, because you couldn't just do that.

But people did it anyway. And I could point to other data; I'm sure that happens all over the place. Education, right, is a good thing, and relying on experts. These are things that companies need to do: value that expertise in the geospatial and remote sensing domains, and not just assume that because data is easily accessible
Jed Sundwall:
Yeah. Right.
Matt Hanson:
and you can easily find it, you can do things without really knowing what you're doing.
Jed Sundwall:
Right, right. Well, I'd advocate for permanent, constant vigilance and skepticism around everything. I mean, the history of the internet so far, which was designed explicitly to improve the sharing of research data; that was Tim Berners-Lee's goal: I want to be able to share stuff with my colleagues more easily. Epistemologically, it's very hard to say whether or not we're better off, because yes, there's a lot more information out there, and I would assume a lot of it is accurate and great and pristine in a lot of ways, but there's really never anything stopping anybody from twisting it, interpreting it, turning it into a narrative that fits whatever their agenda is. Let me go to the comments again. Sig asked about WMS and how it made it easy to get
Matt Hanson:
Yeah.
Jed Sundwall:
many large raster images. Well, I'll just put it on the stream here: to get imagery into legacy desktop and web apps, might STAC be implemented in a similar fashion? It has been. I mean, Esri has supported STAC for a long time. Do you have comments on that?
Matt Hanson:
Yeah. There's a new QGIS feature, a STAC plugin, that actually works fantastically. So yeah, I think that's already happening.
Jed Sundwall:
Yeah, it is happening. And then from CJ Levinson: "I'm curious to hear how this conversation extends to modeled data sets, as opposed to remotely sensed data, and how this relates to Jed's point of good data products being about making fewer decisions. Thinking about climate models and weather models, mostly modeling outputs, which would be the main geospatial artifacts." So yeah, Element 84 has done some great thinking on embeddings data products and things like that; I think that's relevant here. What's your thought on this, Matt?
Matt Hanson:
Yeah, well, there are a couple of aspects here. There's the aspect of how these generally large, homogeneous model data sets fit into STAC. But I'm not sure that's the question. Is that the question?
Jed Sundwall:
No, less about STAC. More about how we're talking about data that's fit to be shared and fit to be used, and now we're dealing with data products that are just model outputs, where a model's done a bunch of magic on them.
Matt Hanson:
Right.
Matt Hanson:
Yeah. So I think that gets into your curve, right? Where we are on the curve is that modeled output, and generally speaking, I think that's what we want. This is what users want: they want the modeled output. They don't want level-two Landsat data; they don't even want level three. What they want is something like Planet's Planetary Variables. That data set is exactly the type of thing we need to see more of, I think: this isn't imagery, this is, I'm looking for a particular type of data variable, and I can get it. It's been derived from imagery, but it's gone through a process that weeds out all those edge cases and everything. I think Planetary Variables are great.

That's a great data product right there.
Jed Sundwall:
Yeah. I would also say, and this is the time to shout out Dynamical: their podcast, which is called Weathering, is an absolute delight. This is from the people who built Upstream Tech. They have this great podcast where they'll actually read papers on weather forecasting and advances in weather forecasting. In a recent episode, let me see if I can remember which one it was…

Yeah, it's the most recent one: "A taxonomy of bias, sense-making, heretical physics and the Tom Hanks, Bill Murray multiverse." It's a good episode. They discuss how we already interact with a lot of models and develop opinions of them over time based on their usefulness. So, like you were saying before, a lot of satellite imagery has substantial error rates, right? It still might be useful. There's the adage that all models are wrong, but some are useful. So I guess I'm just going to agree with you: this is what we want, to have models that are able to distill data into things like Planetary Variables, basically things that can support decision-making. And people aren't idiots;

they'll figure out whether it's useful to them or not. It's possible that sometimes the model gives you something that's catastrophically bad and you lose money on it, and you'll be able to make a decision about whether or not you want to trust that model again. That's the way the world works. I think it's so easy to overthink this sort of stuff.
Matt Hanson:
Right.
Matt Hanson:
Mm-hmm. Mm-hmm.
Jed Sundwall:
Man, I've missed out on LinkedIn. People have been saying stuff.
Matt Hanson:
Uh-huh. So, okay, before you do that: have I told you about my Star Trek theory of remote sensing? Have I ever? Okay. Well, we're on a podcast, so I'll have to explain it now anyway, even if you had said, yes, I've heard this before. So if we look at Star Trek: my whole vision of the future, I hope, is way more Star Trek
Jed Sundwall:
Yeah. Go. No, no, no. Go for it.
Jed Sundwall:
I love this. Remind me.
Jed Sundwall:
Yeah, yeah.
Matt Hanson:
than a more dystopian version. In Star Trek, you have tricorders, right? And you have sensors. And what do those sensors not do? They're not sending back images that are then analyzed. You're scanning for life. You're scanning for a particular element. You're scanning for specific variables. And I think maybe there's an aspect here. We've been creating general-purpose satellites,

historically, like Landsat: well, we don't really know, this could be used for a bunch of different things. But we're increasingly, I think, seeing companies come up and create satellites for specific verticals, selling the satellites and satellite-as-a-service. And I think ultimately maybe that's where remote sensing goes: there isn't a satellite that's taking an image, downlinking it, and then figuring out a bunch of different use cases for it. Rather, and you see this with GHGSat, it's: no, this is a satellite for detecting methane. It's a single-purpose thing. It's the Star Trek "scan for life." It might actually be an optical satellite, or SAR, or something like that, but it's doing something on board and then just sending back the thing.
Jed Sundwall:
Yeah.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah, okay, sorry, now I’ve got it, this is great.
Jed Sundwall:
Go Star Trek. It's funny, I'm not a Trekkie by any means. I did watch The Next Generation a bit when I was a kid and really liked it. But I brought Star Trek up at a recent open data event I was at, because people were asking, are there any examples of literature or stories about the future of technology where things are good? And I'm like, I think Star Trek is one of those, you know?
Matt Hanson:
Yeah.
Matt Hanson:
yeah, yeah, it’s, yeah.
Jed Sundwall:
Because we're just so steeped, and have been for many years, in dystopian technological stories and stuff like that, and I think we should keep Star Trek in mind as a vision of where we could take things. You reminded me, though: last week a bunch of our friends were at a National Academies of Sciences workshop on Earth observation and the future of data stewardship. And I pitched basically what you just said, in a way.
Matt Hanson:
Yeah, absolutely.
Jed Sundwall:
We worked within groups to come up with a 20-year strategy, and I had some license to kind of steer the Ouija board, as I would say. We were all hacking on these ideas, but this really wasn't my idea; it really did come out of the group. It was just this realization that we know a few things we want to accomplish in terms of governance and, let's say, environmental management or something like that.
Matt Hanson:
Right.
Jed Sundwall:
And rather than looking at the next 20 years of Earth observations and thinking, well, what sensors do we need? What file format should they be in? What should the standards be, and who should pay for it? What I led with when I was reading out from the group was: if we're thinking 20 years ahead, we should assume there will be more sensors. There are going to be more data products. There are going to be more models producing all sorts of stuff, more users doing weird things that we could never have anticipated. And what we should probably do,
Jed Sundwall:
and I cannot emphasize how hard this was for me to say out loud, is maybe look at something like the Sustainable Development Goals. I like to make fun of the Sustainable Development Goals, because it's like, that's nice that you created these goals, but really? Is anybody going to do anything about this? But the truth is, well, we should, you know.
Matt Hanson:
No.
Matt Hanson:
Yeah, we should. Yeah.
Jed Sundwall:
I make fun of them, sorry, everybody. But the UN doesn't really have the ability to herd the cats that are nation-states and get them to do stuff, right? I think this has been demonstrated. But the Sustainable Development Goals are really good goals. So it's like, hey, we really want to ensure that everyone in the world has access to clean drinking water. And going back to your point: what do we need to do that? It could be
Matt Hanson:
Yeah. Yeah.
Jed Sundwall:
any number of different types of sensors, and we should have some sort of entity that is actually held accountable for making the end result happen. Who knows what kind of sensors they're going to use; we don't need to specify that. It might turn out that we need something like GHGSat, and the community driving at that specific goal can determine that.
Matt Hanson:
I know.
Matt Hanson:
Yeah. And they'll need dedicated satellites to do that, right? With shared satellites serving all these different use cases, there's just not enough tasking capacity. The power is in time series, and you're maybe lucky to get an image every other month. You really need a dedicated satellite for the purpose, I think.
Jed Sundwall:
Hmm.
Jed Sundwall:
Interesting.
Okay, I don't have strong opinions about this. I've kind of always thought there's likely latent capacity in the satellites that we do have up, capacity that people just can't get access to, right? So, huge fan of Common Space, for example. Well, it's an example worth debating. I mean, we were fiscal sponsors of Common Space, you know,
Matt Hanson:
Yeah, I mean there might be, but yeah, Common Space, right? This is a great example.
Jed Sundwall:
If Bill is listening in here: a glorious initiative. But I think there's still plenty of debate to be had, which is, does Common Space need its own satellite? Or is there actually just a legal, financial, or policy hack that could make existing sensors useful for the humanitarian realm? It might be easier just to launch your own satellite at this point,
which is why I'm glad they're trying to do it. But I think it's a worthwhile debate.
Matt Hanson:
Yeah, I think it is. Especially if you want full control over it and you want to revisit the same areas over and over again. Even for a disaster, right? We focus on
imagery after there's some disaster, but ideally you'd want to continue to look at that same area for some months afterwards to see about the recovery efforts. Or if there's flooding, how long does it take for the flood waters to recede? I just don't see how you could get that much data unless you're actually controlling the satellite and have the ability to look at the same areas over and over again. Same thing with infrastructure, right?
Companies that own and operate global infrastructure: it totally makes sense for them to just own their own satellites, pointed at the exact same areas day after day.
Jed Sundwall:
Yeah. Huh. I wonder, is Munich Re going to fly its own satellites soon? It seems like it, yeah.
Matt Hanson:
I mean, it's getting more and more cost effective, right? We're seeing companies pivot towards, you know what, we're not actually going to sell pixels anymore, we're going to build satellites. And I think for big companies, and lots of countries in the world too, this seems like where the business is heading: smaller, cheaper, purpose-built satellites.
Jed Sundwall:
Yeah. All right, we're in agreement. This is again what I was saying at this National Academy of Sciences thing last week. There were plenty of people who were like, oh no, only the government can do this, everyone knows that. And I'm like, I don't think that's true. I think we're going to see more satellites being flown by more actors. Linda's chiming in on LinkedIn saying she agrees with the need for dedicated, purpose-built satellites. And Bill's open for the debate. But yeah,
Matt Hanson:
Yeah.
Matt Hanson:
Nice.
Jed Sundwall:
I find this compelling. I want to get to a question Tim Bailey asked earlier, about the issue of human inspection to validate interpretation. He says, I work in the forest wildfire resilience field, where there's a stampede of new data products that are not great data products. So yeah, we're going back to the error rate issue, and kind of the issue of
models and accuracy, and actually informing decision support systems; he posted this a while ago when we were talking about that. I'll also bring up something relevant: I think Bloomberg published a story that's been going around on LinkedIn this week about Zillow removing climate risk information from its listings.
Zillow and Redfin used to show flood risk and fire risk, data that comes from First Street Foundation, and they took it off. The issue being that people are increasingly encountering decision support information, like fire risk for your house, or for the house you're thinking about buying, that's coming from entities people aren't sure whether they can trust.
And I think First Street, kudos to them, have demonstrably produced models that are better than FEMA's models, or anything the government's been able to produce. But still, validating that sort of information is difficult. I'm perceiving a need for, I always say new data institutions, but really arbiters that can actually help validate this stuff. Anyway.
Matt Hanson:
Yeah.
Jed Sundwall:
Over to you, in case you have anything; I want to go back there.
Matt Hanson:
Yeah. So, I think Tim has a great idea for a name there. You can start another podcast called Not Great Data Products, where you evaluate really crappy datasets. Like, this is the worst, you know?
Jed Sundwall:
We should do special episodes every now and then where we just talk trash.
Matt Hanson:
Yeah, not great. This is the worst. So, I feel like I just keep ranting on this podcast, but I think we have a real problem with startup companies especially. I don't know, maybe this is a worldwide problem, but I see it
Jed Sundwall:
Do it! That’s the whole thing.
Matt Hanson:
really prevalent in the US here: startup companies doing really questionable science. Because they're at odds, right? The business model they have is completely at odds with the science. We've seen this in very high profile cases outside of the geospatial industry, but we see it in the geospatial industry as well.
And people making promises for things that really just aren't practical, over-promising and under-delivering. So, yeah, I'm not surprised that Tim has come across a lot of really not great data products. I don't know about the source of those, but I've seen that quite a bit.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Well, look, it's constant. And I'll say, this is why I'm at Radiant Earth, right? It's not why I left Amazon; Amazon is great, and I had a very good eight years there. But what I realized was, we do need to have institutions that understand how to provide data but that aren't owned by investors, so they don't have
Matt Hanson:
It’s just constant.
Matt Hanson:
Yes.
Jed Sundwall:
the same sort of forever-growth incentive. Which is not to say... I should say, some of my best friends are investors. Okay, that might not be true, but I have plenty of friends who are investors, and we are funded by investors. I don't think investors have inherently malicious intent. What I would say is that investor-owned or investor-governed companies that are united by the need to grow constantly
are not always going to be the best stewards of data. And I would say in almost all cases, they almost can't be. The pressure to enshittify is unavoidable. And also the competitive need precludes them from being truly open about their models and how they operate, right? It has to be secret sauce. Which, I think, if you're saying,
Matt Hanson:
Mm-hmm.
Jed Sundwall:
if you're going out there and saying, hey, we have the data that is going to be used to regulate the environment and the real estate market, and to assess risks to human life on Earth, then you need to be held to a higher standard than, it's good, trust us, it's our proprietary secret sauce. So.
Matt Hanson:
Yeah, and we can bring that back to STAC, actually, because there's been an effort coordinating STAC with CEOS. Matthias Mohr has done a bunch of this work, and I was involved. So CEOS, which is an international committee of space agencies,
Jed Sundwall:
Yes. Yeah. Bring it home.
Jed Sundwall:
Well, yeah, and he does that under the umbrella of Radiant Earth.
Matt Hanson:
has a thing called CEOS-ARD, analysis-ready data. Matthias has been doing work on mapping their requirements for ARD back to STAC. When I was involved with this a bit some years ago, in the early days, we were identifying what fields really need to be included
for a product to get the ARD certification, or whatever it is, from CEOS. And the immediate problem I saw was that you really need to require radiometric and geometric accuracy to be published in that metadata. And I don't think, I could be wrong, but I don't think there are a ton of commercial
satellite companies that are really willing to do that.
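To make the point above concrete, here is a minimal sketch of what publishing accuracy metadata on a STAC Item could look like. The property names below (`ard:geometric_accuracy_m`, `ard:radiometric_uncertainty_pct`) and the thresholds are invented placeholders, not the actual CEOS-ARD or STAC extension fields; the idea is just that accuracy becomes machine-readable metadata that a certification check, or a user, can query.

```python
# Hypothetical sketch: a STAC Item carrying per-scene accuracy metadata.
# The "ard:" property names are invented for illustration; real CEOS-ARD /
# STAC extension field names differ.

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "scene-2025-06-01",
    "properties": {
        "datetime": "2025-06-01T10:32:00Z",
        # Invented accuracy fields a provider could choose to publish:
        "ard:geometric_accuracy_m": 12.5,       # positional error, meters
        "ard:radiometric_uncertainty_pct": 5.0, # reflectance uncertainty
    },
    "assets": {},
    "links": [],
}

def meets_ard_threshold(item, max_geo_m=15.0, max_rad_pct=6.0):
    """Return True if the item publishes accuracy metadata within bounds."""
    props = item["properties"]
    geo = props.get("ard:geometric_accuracy_m")
    rad = props.get("ard:radiometric_uncertainty_pct")
    if geo is None or rad is None:
        return False  # unpublished accuracy fails the check outright
    return geo <= max_geo_m and rad <= max_rad_pct

print(meets_ard_threshold(item))  # True for the sample item above
```

The point of the `None` check is the one Matt raises: a provider that simply omits the accuracy fields cannot pass, so certification pressure falls on publishing the numbers at all.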
Jed Sundwall:
Interesting. Because of the proprietary nature of what they do?
Matt Hanson:
Because their satellites suck, for the most part. They're CubeSats; they're low-cost, cheap things. Now, maybe I'll get a whole bunch of people mad at me, which somebody recently told me means you're doing something right. But I don't want to make a blanket statement about all of them. I love satellite companies, right? Some of my best friends are satellite companies.
Jed Sundwall:
Okay.
Jed Sundwall:
Amazing.
Jed Sundwall:
Some of my best friends are satellite companies. Yeah.
Matt Hanson:
But the realistic assessment is that these are lower-cost, cheaper satellites, and the radiometric accuracy is not going to be up to snuff compared to giant school-bus-size satellites like Landsat.
Jed Sundwall:
Yeah. Well, interesting. But this is a solvable problem; I think you're just highlighting that it needs to be solved. If we're talking about a future in which more people are deploying sensors, with more low-cost sensors going up, a lot of those are going to be CubeSats. But again, I guess the requirements are going to be bespoke in the case of every sensor,
Matt Hanson:
Mm-hmm.
Jed Sundwall:
to determine, okay, does this meet our needs? You know, I'm a reinsurer, I need to have control of my own satellite that I can task, and this is what I need. Interesting.
Okay, I put links in the chat to a blog post that Matthias wrote about the cloud-native approach to ARD. And shout out and thanks to NASA for funding us to be able to do that work with Matthias, because it's been great. Okay. Well,
we've covered a lot of ground here. I love talking to you; this has been, secretly, one of the great things about doing this, because I don't know the last time I had an hour and a half or so to just talk to you about stuff. So it's been a real treat for me. Is there anything else you want to mention? We didn't talk about my white paper. Did you read my white paper? Okay.
Matt Hanson:
Yeah, I did, yesterday. A lot of great alignment with a lot of things. There's one thing, though; perhaps we can talk about this credibility issue. Because, as I told you yesterday, I wrote this blog post, and afterwards a colleague of mine
Jed Sundwall:
Yeah, yeah.
Matt Hanson:
was like, well, there's something missing from this post; there's something else that's required here that you didn't mention. And I think that thing is this credibility issue. What I mean by that is, if some random person, and this happens a lot, right, creates a really cool thing and then goes out there saying, hey, help me with this thing, I want to create a standard, they just might not get a whole lot of
traction from that. With STAC, we had some credibility because of Chris Holmes. Chris started it, he had a good reputation, and he'd been involved with OSGeo. He knew a lot of people, and he brought that credibility to it. And we see companies, like you mentioned the New York Times with RSS, or Google and Meta,
Jed Sundwall:
Yeah.
Jed Sundwall:
Yep, yeah, that’s right.
Matt Hanson:
come out with standards all the time, because they have this credibility. They're not guerrilla standards, right? They don't actually build them in a community, but they have enough credibility and weight behind them that they can accomplish a similar thing: this is a standard, use it, and people start using it.
Jed Sundwall:
Yeah. For some reason I feel very compelled to share, and I'll also put it in the chat, a link to "You Just Haven't Earned It Yet, Baby" by The Smiths. Morrissey at his finest. It's a harsh truth that you will confront throughout your life, whenever you're trying to do anything: you do need to earn that credibility. Right? And so,
my white paper is called Emergent Standards, and basically it's an exploration of how standards emerge without an authority coming in and saying, thou shalt do this. Linda just commented on LinkedIn: H3 and Uber is another great example, where Uber clearly knows what they're doing and H3 was obviously good
Matt Hanson:
Yes. Yep.
Matt Hanson:
Mm-hmm.
Jed Sundwall:
for what it does. They opened it up, it's great, and now we talk about H3 a lot. So it's this interesting sweet spot. We have many examples of institutions that are powerful and have sway in a lot of ways trying to decree standards that just don't work, because
they are not actually aligned with what practitioners want. So practitioners can come up with their own thing, but you still have to have a Chris Holmes in the group. You have to have somebody who has the convening power or the credibility or something to actually get people to pay attention. Which is a drag, because it's like, well, how do you do that? And I actually don't know.
Matt Hanson:
Yeah, exactly.
Jed Sundwall:
It feels like a historical accident when stuff like that works out. And that's probably true. Most of history is a series of accidents. Yeah.
Matt Hanson:
I think that's true. There's been a lot of research into this; you're probably more familiar with it than I am. If you look at the path of Bill Gates and other folks who have become billionaires or founded big companies, it's a lot of being in the right place at the right time. It's a lot of happenstance, a lot of luck. It's not just because he was brilliant and he just did stuff.
Jed Sundwall:
Yeah.
Matt Hanson:
If Bill Gates lived in another time, or Elon lived in another time, they wouldn't be the billionaires they are today. Our whole lives are pretty much dictated by luck.
Jed Sundwall:
Oh yeah. Actually, one final bit of self-promotion that I'm allowed to do here: on the latest episode of Texts on Texts, my other podcast, about literature, we talk about a short story called "Anxiety Is the Dizziness of Freedom" by Ted Chiang, which is awesome. It's totally relevant to what you just said.
Matt Hanson:
Okay, cool.
Jed Sundwall:
It describes a device where you flip a switch and it creates a parallel universe that you can communicate with; you can communicate with what's called a paraself, a parallel version of yourself. And it drives people crazy. It causes all sorts of issues for people. There's a guy who says, my parallel self has a girlfriend and I don't, so what's wrong with me? It basically reveals to people how much of their lives are
pretty much out of their control. Anyway, we've gone very far afield. But to bring it back to STAC and how this stuff gets created: you do just have to try to do this sort of stuff. You've got to try. And I think what STAC has demonstrated is that it is possible. And I do think that there are
Matt Hanson:
You gotta try.
Jed Sundwall:
parts of this playbook that can be documented and repeated. But part of that includes, as you said before, building with the community and finding champions. And you have to do that on purpose. So.
Matt Hanson:
Yeah, you do. And even just being engaged with the community helps; even if you are building stuff internally, I do feel like the more you are engaged with the community, the better that thing is going to be. So if you're working with STAC, even if it's for internal use, come to the STAC community meetings, let people know what you're up to, and maybe you'll get some good feedback.
Jed Sundwall:
Yes.
Matt Hanson:
You're going to be better off, I think. You're going to be in a better position the more you work with a larger, diverse group of people.
Jed Sundwall:
Absolutely. Well, where should we point people? We can send people to stacspec.org, where you can learn everything you need to know. As far as getting involved in the community meetings, where do we point people?
Matt Hanson:
There’s a Google group that should be, is it on the webpage?
Jed Sundwall:
I'm looking around, and I'm noticing the stacspec.org site is directing people to our Discourse, which we don't support anymore. So.
Matt Hanson:
Okay, yeah, so there are some more things that we need to do. The STAC Steering Committee actually has a meeting, I think, in the next week; maybe it's tomorrow. We're trying to clean up some of these things. So, yeah.
Jed Sundwall:
All right, well, stay tuned then: stacspec.org. Matt, you're easy to connect with on LinkedIn and places like that. You can also join the Cloud-Native Geospatial Forum; there are plenty of people in our Slack, though we do ask people to pay to join that. It's not a lot of money. But yeah, there are lots of places to get involved. Stay tuned, and look at stacspec.org and see what you can find there.
Matt Hanson:
Yeah, yeah.
Jed Sundwall:
All right, this has been awesome. Thanks, Matt, for coming on. I predict that we’ll have you on again, because we’ll be doing this forever. And thanks for everything you’ve done for the community.
Matt Hanson:
Yeah, thanks for doing this, Jed. This has been fun. I love the chat, so, you know, anytime.
Jed Sundwall:
Any time. All right. Well, happy holidays. All right. Bye.
Matt Hanson:
All right, you too. Bye bye.
Jed Sundwall:
Okay, stay in
Video also available on LinkedIn

Show notes
Jed talks with Jack Cushman, director of the Harvard Law School Library Innovation Lab, about how libraries are adapting to technological change while preserving their mission to collect, preserve, and share knowledge. From the printing press to the internet to artificial intelligence, libraries have continuously evolved their methods. The Lab focuses on bridging traditional library principles with cutting-edge technology to empower individuals with better access to information.
The conversation explores the Data.gov Archive project, which aims to preserve approximately 17 terabytes of federal datasets - not just the metadata from Data.gov, but the actual underlying datasets that are at risk of being lost. Jack explains the challenges of collecting these datasets, particularly the limitations of web crawling technology that often fails to retrieve underlying data. The team successfully collected more than 311,000 datasets, with particular attention to smaller datasets that might otherwise disappear, demonstrating their commitment to knowledge stability in an era where governmental data can be fragile.
Jack discusses how they use BagIt - a Library of Congress standard for packaging digital content - to ensure long-term preservation through comprehensive metadata, checksums for verification, and cryptographic signatures for authenticity. This approach addresses data provenance and integrity, creating complete packages that can be cited and verified decades from now. The discussion also covers their innovative client-side viewer that runs entirely in the browser without server-side software, making 17.9 TB of datasets searchable while reducing infrastructure dependencies. They explore the importance of user-centric design, the role of well-supported tools like DuckDB, the “one copy problem” that highlights data fragility in the digital age, and collaboration with institutions like the Smithsonian. The episode also touches on Perma.cc, another Lab project that addresses link rot in legal documents by creating permanent links to online resources.
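As a rough sketch of what BagIt packaging involves, here is a minimal bag written by hand: a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-sha256.txt` of checksums. This is a simplified illustration of the layout, not the Lab's actual pipeline; real bags also carry `bag-info.txt` metadata, tag manifests, and signatures, and the filenames here are invented.

```python
# Minimal BagIt-style packaging sketch (simplified; real bags add
# bag-info.txt metadata, tag manifests, and cryptographic signatures).
import hashlib
from pathlib import Path

def make_bag(bag_dir: Path, files: dict) -> None:
    """Write a minimal bag: declaration, payload, and SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # 1. Declaration file identifying the bag format version.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )

    # 2. Payload files go under data/.
    manifest_lines = []
    for name, content in files.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # 3. The checksum manifest lets anyone verify the payload decades later.
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")

def verify_bag(bag_dir: Path) -> bool:
    """Recompute every checksum in the manifest against the payload."""
    for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
        expected, relpath = line.split("  ", 1)
        actual = hashlib.sha256((bag_dir / relpath).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True

# Example: bag a tiny hypothetical dataset and verify it round-trips.
bag = Path("example_bag")
make_bag(bag, {"dataset.csv": b"state,population\nVT,647464\n"})
print(verify_bag(bag))  # True
```

The verification step is the payoff: because the manifest travels inside the bag, any future holder of a copy can confirm, file by file, that the payload is exactly what was originally archived.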
Links and Resources
Key takeaways
- Libraries evolve while preserving their mission - From the printing press to AI, libraries continuously adapt their methods for collecting and sharing knowledge while staying true to their core purpose of preserving information for future generations.
- Small datasets matter as much as big ones - The Data.gov Archive project prioritizes preserving smaller governmental datasets that might otherwise disappear, recognizing that knowledge stability depends on capturing everything, not just the high-profile datasets.
- Web crawling alone isn’t enough - Traditional web crawling technology often fails to retrieve the actual data files linked from catalog pages, requiring more sophisticated approaches to truly preserve datasets rather than just their metadata.
- Client-side viewers reduce infrastructure dependencies - Running search and visualization entirely in the browser without server-side software makes 17.9 TB of datasets accessible while eliminating the fragility and cost of maintaining server infrastructure.
- The one copy problem threatens data persistence - Data in the digital age is more fragile than physical artifacts; without robust systems and collaboration across institutions, valuable datasets can disappear when a single server or organization goes away.
- BagIt enables verifiable long-term preservation - Using Library of Congress standards for packaging data with checksums, metadata, and cryptographic signatures creates complete packages that can be cited, verified, and trusted decades from now.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
Okay, all right. Well, thanks, Jack, for joining us here for the third episode. Lucky number three. And I want to point out, this is kind of an exciting moment, because historically Radiant Earth has really dealt in geospatial data. That's our wheelhouse. That's our
Jack Cushman:
Good. Thank you so much for having me. I really appreciate it.
Jed Sundwall:
Our origin story at Radiant Earth was an effort to make satellite and drone imagery easier to work with. One of the things I did about three years ago, when I came in as executive director, was realize that a lot of what we had figured out with the geospatial community was broadly useful, in terms of adopting object storage and things like that. Anyway, this is all to say, I'm excited to have you on because you're not a geospatial person.
Our first two guests have been geospatial. Yeah. Okay, good. So this is going to be a great conversation, to learn a little more about how we've been working together on Source Cooperative, your background as a librarian, and your perspective on these things. Before we get into it, though, I do want to point out to everybody, and I'll figure out how to put this in the chat, that you are currently tuned into
Jack Cushman:
Absolutely, I would never pretend.
Jed Sundwall:
Great Data Products, the livestream webinar podcast thing, as we call it. There's also Great Data Products, the blog post, now. I gave a talk about a month ago at the Chan Zuckerberg Initiative's open science meeting, and the name of the talk was Great Data Products. Then we published a blog post called Great Data Products. So this is an exercise in brand confusion; perhaps this podcast could sue Radiant Earth for taking the title
for the blog post. But in any event, the name of the game these days is Great Data Products, and we've got a great blog post about it. I'm very happy with it, and I'll put that in the chat in case people haven't seen it. And with that, let's hand it over to you. How do you introduce yourself to people?
Jack Cushman:
Hi, everyone. I’m Jack Cushman. I direct the Library Innovation Lab. I’m really happy to be on the livestream webinar podcast thing. I love working with you, Jed, on Source Co-op. The lab I direct, the Library Innovation Lab, is a research and development lab, a software lab that’s built into one of the world’s largest law libraries. So we’re doing novel things in a very traditional place and drawing on the best of both of those worlds.
Personally, I'm a lawyer; I've worked as an appellate lawyer. And I'm a computer programmer; I've been programming computers since I was 12 years old, so very many years. I'm more of a newcomer to libraries, but I've been here for about 10 years. You asked how I introduce myself, which is always a challenge for me on the tax form: what are you supposed to write in for your job? I've come to say information scientist. Really, I'm a person who thinks about how we consume information and how we turn it into knowledge.
And how do we help our society over time have better and better access to knowledge? And that’s why the Library Innovation Lab has become such a great fit. Because our mission is to bring library principles to technological frontiers, which means to understand where people are actually getting their knowledge. How is that really happening, which often is outside of the walls of a library? And how can we take the things that we’ve learned in libraries over many centuries and help new technologies to go better? So really core things like libraries are here to…
collect information, preserve it, and share it to empower people. And we’ve been doing that since before the printing press. But when you invent the printing press, you have to change how you collect and share information. Now you need like a written list of the books you have, because there’s enough that you can’t remember them all. When we invented databases, we needed new ways of thinking about libraries. When you invent the internet and data that is digital first, government’s publishing data that is only online and never on paper, you need new ways again to think about information.
Jed Sundwall:
Right.
Jack Cushman:
And now in this AI era, we need yet again new ways to think about what it means to collect and preserve and share knowledge.
Jed Sundwall:
Amazing. So this is interesting. I didn’t realize you were a lawyer. I mean, I guess it makes sense. You’re at the law school.
Jack Cushman:
Clearly I hide it. I’m a recovering lawyer. You know, I have not practiced law since probably 2014, 2015. And happy to leave that to the experts.
Jed Sundwall:
Okay.
Yeah, it's interesting. We have a kinship here, because I studied foreign policy and thought I was going to be a diplomat or something like that. I would never call myself a programmer, but I was making websites in like 1994, on Mosaic; I was enamored with the web from the very beginning. That was always kind of a hobby
for me. Anyway, I think we've ended up in similar places, interested in sharing data and things like that, so it's cool to hear your story. Can you say a little more about the Library Innovation Lab and what you all are thinking about these days? Everything you just hinted at was great, like pointing out that we had libraries before books. What are you thinking about in 2025, as we go into 2026?
Jack Cushman:
Absolutely. And I’ll say, you know, we need 100 library innovation labs. Anything that we pick to focus on is one of many things that we could have. And I hope that all of those flowers will bloom. But for the direction that we go in, the core organizing principle is your society needs knowledge to plan and to direct itself. If we have poor short-term memory or long-term memory as an individual, it’s very hard to navigate your life. If we have poor short-term and long-term memory,
as a culture, as a community, as a government, whatever layer you want to look at, it becomes very hard to navigate. And all the projects that we look at address that in different ways. We build Perma.cc, for example, which fixes link rot in law decisions, in published cases, and in law journal articles, and it's used by law firms. It makes documents reliable in the long term instead of the short term. When you cite a URL in a document,
you include a permalink, and that permalink is on file as a copy of the web page with the Harvard Law Library. That means the link is going to work in perpetuity. It goes from kind of an Etch A Sketch memory, where you can have a case and a month later the domain doesn't resolve and you don't know what they meant, to having permanent memory again. So what that means for LIL is we're looking at how you preserve knowledge for the long term and how you interpret it. On the preservation side, we're working on projects like Perma.
We’re working on projects like we’re going to talk about our public data project, which is how do we make sure we don’t lose the public data we all create together? And then we’re also looking at the access and interpretation side. We have a research program looking at law and artificial intelligence, because law is such a wonderful playground for understanding how AI changes our ways of knowing. The law is kind of done by words. I think of how I want to say it. You think of how you want to say it. The judge picks something, and those words become meaning in the real world.
Jed Sundwall:
Yeah.
Jack Cushman:
which means that systems that can interpret and juggle and shuffle words to make meaning all of a sudden have this real practical impact in our field. And it lets us study things like: how are we going to help law students actually learn in a world where the tools can do much of the reading for them? How are we going to evaluate how good tools are at the fine-grained thing you're trying to do, benchmarking the thing you actually care about instead of abstract benchmarks of other things? And how are we going to navigate a field where employment is rapidly changing?
Law employment used to be very pyramid-shaped: you hire a bunch of people down at the bottom to read through piles of paper in a box, and now the need for reading through piles of paper in a box is really changing. We have to reinterpret what it means to be a junior lawyer who works their way up. So we're doing a bunch of things that are about how to make sense of the data once we have it. And you're seeing both sides of that in the work we're doing with you: how do we responsibly collect things, and then how do we responsibly share them so that people can really find what they need?
Jed Sundwall:
Yeah. Well, let’s talk about the data.gov archive and how that came about. Because I think the conversation started about a year ago, when we thought maybe it would be a good idea to start backing up data.gov. But I’ll confess I don’t have a clean answer to what’s in this collection. How do you describe it to people?
Jack Cushman:
Yeah, yeah, great question. So what’s the point of the data.gov archive? It did start because we wanted to do some broad reaching collection of federal data sets. And you mentioned, like, you know, there’s a geopolitical context where you might say, it’s important right now to save data. And at the same time, our law library has been saving data for the federal government since the early 1800s. I don’t know quite when Harvard’s relationship started, but.
The first act where Congress started asking organizations like ours to preserve documents was around 1813 — the federal depository library act, I’m going to get the name wrong — but it’s been over 200 years that Congress has been saying, please help us collectively preserve the stuff that matters. And with data.gov, we were saying, well, what does that mean for 2024, 2025?
We already knew that the End of Term Archive, which we’re part of, was doing a wonderful job of collecting the web pages of the federal web, including anything under .gov, but also including their Twitter pages and their YouTube and anywhere that the federal government had a footprint, getting a snapshot before and after the transition so you could understand what changed. And End of Term Archive has been doing that since 2008. It’s not a kind of this year or that year thing. As a citizen, you should be able to see what your government was and what it’s become. And you should be able to see that repeatedly as the government evolves.
So we knew that was happening. Then we said, well, what’s not happening? And the real risk that we saw is you can easily end up, if you do a web crawl, getting the manual for the data but not getting the data itself. Because the way web preservation will work is you have a browser, like any of us would use, and it clicks from link to link. And it tries to click all the links on the page, and it clicks all the links on the pages it finds, and then it clicks all the links of the pages it found there. But it can’t do things like interact with a form. It can’t do things like if you need to send an email to get data or
If you need to script an API — it’s only going to get the stuff that you can get by clicking, which is wonderful, but might mean that you end up with a submerged layer of, wish we had the actual data that this report was based on, and that is just gone if it disappears. There was a data rescue community that emerged around that time, a bunch of different groups working on wonderful projects. The part that we worked on was to see if we could save the underlying data behind the data.gov website.
Jack Cushman:
Data.gov itself is an index. It lists datasets across the federal government and also some states. But it doesn’t store the data. It just says, you can go here to read this, you can go here to read that. They do have an API. So what we did is script that API, get a list of all 300,000 datasets in there, and then find everything they link to and call that the collection. So, you know, dataset number 2,104, which is a dataset of…
you know, traffic congestion in medium-sized cities, or whatever part of measuring our society — it’s going to link out to this CSV and this Excel file and this PDF and this zip file. And that list of objects becomes what we want to put in a collection. Then the goal is to accurately collect each of those things: grab the metadata from the API, grab all of the URLs that link out from it, and package those up as one of 300,000 objects that we were making in a new
Jed Sundwall:
Got it.
Jack Cushman:
collection of collections.
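The harvesting step Jack describes — walk the index API, then collect everything each record links to — can be sketched roughly like this. Data.gov’s catalog exposes a CKAN-style `package_search` endpoint; the paging details are simplified here, and the sample record with its example.gov URLs is invented for illustration:

```python
import json
import urllib.request

def resource_urls(package: dict) -> list[str]:
    """Collect the downloadable file URLs listed for one dataset record."""
    return [r["url"] for r in package.get("resources", []) if r.get("url")]

def fetch_page(start: int, rows: int = 100) -> list[dict]:
    """Fetch one page of dataset records from the data.gov CKAN index."""
    url = ("https://catalog.data.gov/api/3/action/package_search"
           f"?rows={rows}&start={start}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["result"]["results"]

# A trimmed-down record in the shape package_search returns (invented data):
sample = {
    "name": "traffic-congestion-2023",
    "resources": [
        {"url": "https://example.gov/congestion.csv", "format": "CSV"},
        {"url": "https://example.gov/congestion.zip", "format": "ZIP"},
    ],
}
print(resource_urls(sample))
```

Looping `fetch_page` over all ~300,000 records and downloading each `resource_urls` result is the “one hop deep” crawl described below.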
Jed Sundwall:
Okay, but then obviously, you know, in our world — going back to the geospatial world — we deal with federally produced data sets that are petabytes in scale: weather data and model outputs and satellite imagery, things like that. You don’t have that stuff. So this is just what’s linked to, I guess. My question is, how many layers deep did you go?
Jack Cushman:
Yeah, great question. So we went one hop deep. You have the listing on data.gov. It links to a set of files, and it says, these are the files in this data set. And we grabbed those files. I think what that meant is we ended up collecting the smaller data sets. Because for the smaller ones, it would be linking right to an object, a file that was the data of that collection. And for the larger ones, yeah, it had the problem that those links would go to a landing page that said, for this petabyte-scale collection, here are the steps you go through to get it, which are very individual to that collection.
For those, we would only get the landing page. We wouldn’t get the actual data. And what that meant is we added up to about 17 terabytes of data, which is a bunch of small data sets and then a bunch of landing pages for large data sets. I think the size kind of tells you both what it succeeds at and what it fails at. Because it tells you on the one hand, no, we didn’t get the massive uncompressed image collections or that kind of thing. It also tells you we didn’t just get landing pages. Like 300,000 landing pages is not 17 terabytes by any means.
Jed Sundwall:
Right. Right.
Jack Cushman:
We got a ton of the smaller data sets. And I kind of liked that as a first pass. We just wanted to do something to stabilize what exists now, not be losing things. And I think it gets you a very broad reach: small, significant data sets are going to be in there and are going to be preserved. And then it sets up the question of, well, what else got missed? And you know what? That was true at every level. There was one piece we knew: of the things in data.gov, we’re going to get some of them and we’re going to miss some —
that’s necessary at this scale. We were also told going into it that data.gov itself is a partial listing of the federal government. I talked to technical folks working in the government at that time to get an idea of, where’s the list? What would I download if I wanted to download the data sets of the federal government? First I asked, do you know where that list is? And then, who could you ask? And they said, no, I don’t have a list. And second, no, I can’t even think of a group of people I could ask who would collectively know what it is.
What we have is a sort of sprawling, overlapping set of independent agencies and groups just making data. And if you look at data.gov, it’s like, here’s a cool snapshot: 300,000 out of X, out of we don’t know how many.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah, man. You’re taking me back. You know, many years ago I worked for USA.gov. I was at GSA as a contractor when data.gov was launched, so I had a front-row seat to all of that. And I have a similar story. At USA.gov, I was leading the social media strategy. And to give you a sense of what this meant, I started before Obama was elected — like the end of
W’s second term — and Facebook and Twitter were already becoming a thing. And it was like, we need to learn how to use this. How do we do it? At some point somebody was like, we need to keep track of every federal social media account. And it was like, well, what are you gonna do? Open Excel, create a spreadsheet, and just add them as you find them? And we’re like, that’s obviously not gonna work. This is too big now. And so we created a thing that I’m pretty sure
Jack Cushman:
Mm-hmm.
Jed Sundwall:
I don’t know, it might still exist in some form. It may have been deprecated, but we called it the USA.gov social media registry. Basically, we let anybody with a .gov email address submit a social media account that they managed. And then we would send them an email, because we’re like, okay, you’ve got a .gov email address. We also asked them to put in their phone number just to scare them, just to be like, this is serious, don’t spam this thing. But basically you would get an email with a token in it.
You’d click on that so that we would know that you actually owned the .gov email address that you put in. And we’d say, okay, this does look like the Twitter account for the embassy in Myanmar, or whatever it was. And it worked really well. We called it fed-sourcing — we’re going to kind of crowdsource all this stuff. But one of the things we needed for the form was the list of government agencies, which I know that you’ve dealt with.
Jack Cushman:
Not sure. That seems like that list would exist.
Jed Sundwall:
Yeah, well, it’s actually something I was going to ask you about, because you guys have built — and this is also a segue into the viewer that you all produced — this awesome data.gov archive search. I’ll let you talk about it. But one thing I want to get out right away is that you have things listed by organizations, publishers, and bureaus. And I’m curious to know whether you all had the same conversation where you’re like,
what are the government agencies? Because as far as I know, that list still doesn’t really exist anywhere. We had to make one up based on a Wikipedia article. That was the best source we could find.
Jack Cushman:
I love that story. Well, before we get into our archive, I think that question of what is the denominator — what is the set of data that’s out there that we wish we could save — really helped me appreciate the goals we have behind this thing. Because I started to picture where this data is coming from. And rather than, I don’t know, there’s the DOJ, these objects out there doing things like a giant unit — what we’re really talking about is federal employees.
Jed Sundwall:
Yeah.
Jack Cushman:
you might know the number better than me, maybe 2 million federal employees who are out there doing things for us, making things for us, like go to work and in some way facilitate the functioning of the country. And in the course of their business, making data, making data sets, whether it’s how are the crops growing or how’s the water in the aquifers or what’s going on in this little section of the economy or what’s going on in this little section of education or whatever it is, people going about their day and along the way recording things that help us understand what’s happening.
Jed Sundwall:
Yeah.
Jack Cushman:
And it helps to understand why there’s not a central list — of course those two million people would be generating millions of Excel files, things that are just like, here’s some stuff you should know, here’s something I learned in the course of my day that is worth writing down. Many of them very deliberate and collective, across a group of people. But in many ways, as people who live here, as people invested in our society, we would want all of that. We would have this kind of relationship that is not a citizen and a government, but a person and a person.
that those people should be able to publish the things they learn that will help us. And we collectively should be able to access those and use them. And at that level, the mission starts to feel much more palpable and meaningful to me. That’s like, how do we help those people who are learning things or trying to help us record the things that they’re learning so that they are permanent? And so they’re findable. And if we can have the right taxonomies, let’s do it. If we can have processes, let’s do it. But at the end of the day,
Jed Sundwall:
Yeah.
Jack Cushman:
let’s just have the stuff that we paid for — the people we employed to help us, able to share the things they learn, and us able to preserve those. And then let’s back into how we would get that list. How would we index it? How would we organize it? One thing I’m really curious about — I think there’s a project out there; I don’t know if this is a you-and-me project or who should do this — but I would love to use the Common Crawl and the End of Term Archive to try to just make the list.
Like, what if you went through every web page we know about, maybe ask an LLM — do some automation in there — and ask, what clues does this page give you about a data set that exists? And then see if we can find all of that, aggregate it, combine it, deduplicate, and come out with the world’s first denominator of what data the federal government has published. How many data sets would that be on top of the 300,000 we know about? The number you got would be
Jed Sundwall:
Yeah.
Jack Cushman:
barely related to reality, but it’d be the first time someone has planted a stake for, I think this might be the list. This might be our inheritance as people who live here, with people trying to share data with us. This could be what ought to exist. Because I’d love to be able to see that. I’d love to be able to see that constellation and look up and say, yes, that is the thing that we have built.
Jed Sundwall:
Yeah, so you’re reminding me of two things. One is — are you familiar with this story? I don’t even know what you would call it. It’s an essay by Jorge Luis Borges called “The Analytical Language of John Wilkins.” I imagine this has to be right up your alley. I’m putting it in the chat — librarians should love this. A lot of computer scientists love this story. Because it’s a story about an effort at creating
Jack Cushman:
don’t know that one.
Jed Sundwall:
an actual language. It’s an attempt at sort of taxonomizing the universe, and it doesn’t really work out very well. And Borges points out, the reason we can’t do this is because we don’t know what kind of thing the universe is. We don’t have a handle on it. And to your point about the government being perceived as a monolith, as something that is in DC — that’s just obviously not true. And that’s the other…
Jack Cushman:
Yes.
Jed Sundwall:
The other thing you remind me of is another essay. I didn’t know who wrote it off the top of my head — it’s just something a guy wrote on the internet — but the title says it all: “Reality Has a Surprising Amount of Detail,” by a guy named John Salvatier. I’m not sure how to say his name. Both fantastic little essays.
We’ve both lived through this, where you can see in open data policies the attitude of, the government produces data, the government should make the data open. And those of us who then start looking hard at it are like, man, this is not an easy task.
Jack Cushman:
I absolutely love this duality. There’s an abstraction we wish we could have — the perfect data that exists in the abstract — and then there’s the reality that what we’re talking about is the subjective views of a bunch of human beings. And this comes up very practically in the kind of work that we do, both you and I, when you’re trying to do archiving work. Reality kind of doesn’t wanna fit your taxonomy, and you have to make a lot of choices. When we were doing the Caselaw Access Project, where we scanned
the collective case law of the United States from historical times up to 2018, we found cases that came from imaginary dates. Courts would just publish a case, and there in the book it would say, February 29, 1911 — a date that doesn’t exist. And we were trying to put it in a database, and Postgres was like, that’s not a real date. I can’t save that in my database. And we’re like, okay, but it’s a real case. It really has that date on it. It is precedential. It’s part of the law that you and I are supposed to know and follow.
Jed Sundwall:
Sure. Yeah.
Jed Sundwall:
Wow. Yeah.
Jack Cushman:
We just have to infer, well, from what date did it become part of the law? I guess maybe midnight on February 28th — it existed in this magic hour. And I love that example because there’s this thing that we’re trying to do — why do all of this? — which is, we’re owed ground truth. And the ground truth is both subjective and objective. We all live on a planet made of atoms. And how much water is in the aquifer is just how much there is. You can’t change that by describing it differently.
But we’re all kind of observing and touching reality with different means and levers. And what we’ve come away with, our measurements are all different and subjective. They add a layer of subjectivity. And if you’re the collector of collections of collections at the end of it, which is kind of where we’re trying to be, you end up with both of those at once. We have an objective reality that we’re measuring and we have a subjective attempt to measure it that we’re trying to make sense of. And I just love that game. I love that work that we get to do of like,
help to see the world for what it is and also help to see people for what they are, which is, you know, very imperfect observers of everything we see.
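The impossible-date problem Jack mentions is easy to reproduce: any calendar-aware date type rejects February 29, 1911, just as Postgres does. A minimal sketch, including one pragmatic workaround an archivist might choose (clamping to the last real day of the month — an editorial choice, not what the Caselaw Access Project necessarily did):

```python
import calendar
from datetime import date

# A court really printed this date, but 1911 was not a leap year,
# so the calendar says the day never existed:
try:
    date(1911, 2, 29)
except ValueError as err:
    print(err)  # same class of complaint Postgres raises for an invalid DATE

# Hypothetical workaround: clamp to the last real day of that month.
last_day = calendar.monthrange(1911, 2)[1]
print(date(1911, 2, min(29, last_day)))
```

The point of the example is that the archive must record *something* storable while staying honest that the source said otherwise — which usually means also keeping the original string as metadata.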
Jed Sundwall:
Yeah, I love it. I mean, this is a reminder — everything we do is part of Radiant Earth, which is a nonprofit, right? But our mission is to increase shared understanding of our world by making data easier to access and use, for exactly that reason. I always refer to the blind men and the elephant. I always use this framing that we’re feeling our way in the dark.
We’re increasingly adding new capabilities for measuring reality and trying to understand it. And I’m like, well, let’s make sure we do that together. And I’ll say, what I love about my job and the approach we’re taking is that it gives us so much freedom to be happy anytime anybody takes a swing. We’re like, yeah, go for it. Yeah, exactly. And people are like, I wanna try some weird new file format. And people are like, well, that’s not
Jack Cushman:
Yes, get that up there too.
Jed Sundwall:
that’s not the one that we use. And I’m like, it doesn’t matter — let them try. So that’s a segue. We should talk about the archive search, but I want to talk about BagIt first. How do you describe BagIt to people, and why do you use it?
Jack Cushman:
Sure. BagIt is a sort of collective product from the library community writ large, but it was strongly endorsed by the Library of Congress, so it really got some traction there. I think that was around the 2010s; I don’t remember the exact date. The notion was to have a data transfer format that is as simple as it can possibly be, where every moving part has been stripped away,
so that you can implement it reliably and make readers that can reliably pass things around regardless of what’s inside. Because part of the issue is you end up with, well, here’s how you encode a web archive, and here’s how you encode an image or an image collection, and here’s how you encode a novel. You get a proliferation of formats, and you get things that fall in between them, and you have this same taxonomy question we were just discussing. So what if you had something that can correctly encode anything, in a very loose way? So: a bag is a folder.
Jed Sundwall:
Yeah.
Jack Cushman:
And the folder has inside it another folder, which is the data folder. And whatever is in there is the thing that you bagged. And then it has a little bit of metadata. It has an index that says, here’s the hash of everything that is in me as data that I’m recording. And here’s the date I was made and some things like that. And beyond that, it’s up to the implementer to decide what substantive metadata to record. So it becomes a lowest common denominator way to pass around data in the library and archives community. And certainly,
Jed Sundwall:
Okay.
Jack Cushman:
you want to specialize from there. You want to have image collections with a bunch of image-specific things that they standardize on. But you don’t want to be stuck with that. You want to also be able to step down to a lowest common denominator to do interchange. We reached for BagIt with data.gov because it looked exactly like that kind of problem: a very heterogeneous collection, 300,000 data sets, you don’t know what’s in them. You want to get them all, and get them correctly, regardless of whether there are new file formats you don’t know about. So something that was like, take the files you care about, put them in this folder,
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah.
Jack Cushman:
was a really nice place to start. And then we had to build a bunch of stuff on top of it.
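Jack’s description — a folder, a `data/` folder inside it, a declaration file, and a checksum index — maps directly onto the BagIt layout. A minimal sketch of writing a bag with only the standard library (the `report.csv` payload is invented; real tools like the Library of Congress `bagit-python` package add more tag files and validation):

```python
import hashlib
import tempfile
from pathlib import Path

def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write a minimal BagIt bag: a data/ folder plus a checksum manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    # The declaration file every bag starts with.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    # One manifest line per payload file: "<sha256>  data/<name>"
    lines = []
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")

bag = Path(tempfile.mkdtemp()) / "example-bag"
make_bag(bag, {"report.csv": b"year,value\n2024,1\n"})
print(sorted(p.name for p in bag.iterdir()))
```

Because the manifest records a hash for every payload file, a reader can later verify nothing was corrupted in transfer — which is most of what the format exists to guarantee.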
Jed Sundwall:
Yeah. Okay. But the idea, though, is the folder is an object? It’s a binary that gets uploaded to S3, a single BagIt file?
Jack Cushman:
Yeah, so if you’re passing it around — I think we zip them. We put them in a format where they’re compressed, but also, with an index, you can pull out individual files from the compressed thing. And this is kind of an elaboration on top of BagIt itself. BagIt doesn’t specify a single-file expression of itself. The bag is actually the unzipped folder: it has this file, it has this file, it has this file. And if you have a folder that complies with that, then it’s a BagIt object.
Jed Sundwall:
Okay.
Jed Sundwall:
okay.
Jed Sundwall:
Interesting.
Jack Cushman:
But we don’t actually share folders on the internet. You always have to turn it into a single file one way or another. So when we share them, the way that we did it is to zip them and index the zips. And if you do that right, then you can get a set of ranges where like, do you want this CSV out of the bag? Just fetch this range directly from the file, and it’ll give you that CSV. And that’s kind of the best of both worlds for serving in terms of it’s small, but it’s also accessible.
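The range trick Jack describes relies on the zip format keeping a central directory that records where each member’s bytes live. A sketch of computing the byte range of an uncompressed (stored) member — which is what would let a client issue a single HTTP Range request for just that CSV. This is an illustration of the general zip mechanics, not Harvard’s actual index format:

```python
import io
import struct
import zipfile

def member_range(zip_bytes: bytes, name: str) -> tuple[int, int]:
    """Locate the raw bytes of a STORED zip member, so a client could
    fetch just that slice with an HTTP Range request."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(name)
        assert info.compress_type == zipfile.ZIP_STORED
        # Local file header: 30 fixed bytes, then filename and extra field.
        fixed = zip_bytes[info.header_offset:info.header_offset + 30]
        name_len, extra_len = struct.unpack("<HH", fixed[26:30])
        start = info.header_offset + 30 + name_len + extra_len
        return start, info.file_size

# Build a tiny zipped "bag" with one stored CSV and pull it back out by range.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("data/report.csv", "year,value\n2024,1\n")
raw = buf.getvalue()
start, length = member_range(raw, "data/report.csv")
print(raw[start:start + length].decode())
```

Compressed members work the same way if the index also records the compression method, so the client can inflate the slice after fetching it.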
Jed Sundwall:
Right.
Jed Sundwall:
Yeah, it’s interesting. We have to think through this on source. we’re just, as far as like features go, is that the way source works is you’re just navigating an object store. for those who know, you’re not clicking through folders. You’re navigating prefixes and then enlisting what’s in there. But then when you get to an individual object, we want to tell you and show you everything we can about that object. And something we need to do for baguettes and zips and tars is
show you that index. so it’ll be a kind of like, it’s just a new view that we have to think through a little bit where it’s like, yes, you’ve landed on an individual object, but also you should think about it as still part of this kind of directory structure. Yeah.
Jack Cushman:
That’s right. And your podcast listeners may know this, but I think I should plug the mission that you’re describing, which is — I like to say we collect collections of collections of collections. And I think you then collect collections of collections of collections of collections. So you end up with this very meta, here is a thing: Harvard made this collection of data.gov objects. But you don’t want it to just be bits that people have to download and have a local viewer for.
Jed Sundwall:
Yeah.
Jack Cushman:
What I’ve heard from you is we really should help people understand what it is they’re getting. A little try-before-you-buy: what would be in there if I pulled it down? That’s easy for a few standard things — showing the beginning of a CSV or an Excel file is very straightforward. And you’ve done things with mapping, which I think is also wonderful. But what do you do when you have a zip file? Are there ways that we can start to show that? I love this vision that our community can do that together. We can start to say, I’d love to be able to try-before-you-buy this kind of object too. There’s a bunch of these and I’m curious what they are.
And then just contribute that viewer and have that happen too. I think that vision is so key to this. One thing that you and I’ve talked about a bit is that some of it is really very specific to one collection. We have a custom viewer for data.gov — I actually think you probably want a custom viewer. Because you don’t want a BagIt viewer in general. BagIt is a very general format, so it’s hard to expose much detail there. You want a Jack Cushman-flavored BagIt viewer —
Jed Sundwall:
Yeah.
That’s right.
Jack Cushman:
a viewer that will tell you what’s specifically in these ones, that with a little bit of elbow grease on our side, you can have it actually be able to see what’s in there very specifically. And I think this game is like, how much can we use standard formats and how much do you end up with a bunch of viewers?
Jed Sundwall:
Yeah. So first of all, it’s very nice to hear you repeat back what we’re trying to do, and you nailed it. Well, it’s better to have you toot the horn for me. So that’s great. I love it. Fancy Harvard guy agrees that what we’re doing is a good idea. I just put in the chat the archive search viewer, because absolutely —
Jack Cushman:
You skipped past tooting your own horn, but I think it’s such a good strategy.
Jed Sundwall:
So this is a callback to the Great Data Products blog post, where I finally published again what I call the sweet spot graph, which is something I’d come up with when I was working at AWS. I still have more work to do on this idea — we’re gonna write another paper about it — but the notion is that you don’t wanna over-determine how data is interpreted. It’s everything you were saying before.
But you do still want to give people some assistance in seeing the data, right? So you have to find the sweet spot between, here’s the raw data, we refuse to interpret it in any way, let the universe decide what it’s good for — but also, let’s be honest: if you download a hundred-thousand-row CSV, you can’t just open it in Excel. And if you’re
properly nerdy, you’re gonna do a `head` in the terminal and just look at the first few rows. We can do that in the browser now, trivially, so we should. And that’s something we want to build in. But then also, to your other point with the viewer that you built: if you have a handle on your collection of collections that you’ve put together, you should also, in the browser, be able to show people around. Give them, like, Jack’s tour, which is great.
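The browser version of `head` that Jed wants can lean on HTTP Range requests: fetch only the first chunk of the file and parse whatever complete lines arrived. A sketch — the `fetch_prefix` helper and the 64 KB cutoff are illustrative choices, and the sample bytes are invented:

```python
import csv
import io
import urllib.request

def head_rows(prefix: bytes, n: int) -> list[list[str]]:
    """Parse up to n rows from the first bytes of a CSV, dropping the
    final (possibly truncated) line."""
    text = prefix.decode("utf-8", errors="replace")
    complete = text[: text.rfind("\n") + 1]  # keep only whole lines
    return list(csv.reader(io.StringIO(complete)))[:n]

def fetch_prefix(url: str, nbytes: int = 65536) -> bytes:
    """Ask the server for just the first nbytes — the HTTP equivalent
    of `head` in the terminal (requires the server to honor Range)."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Simulated first chunk of a large CSV, cut off mid-line:
sample = b"year,value\n2024,1\n2025,2\n2026,3 this row is cut of"
print(head_rows(sample, 2))
```

In a real viewer the same logic would run in JavaScript against a `fetch()` with a `Range` header, but the shape of the problem — grab a prefix, discard the torn last line, render the rest — is identical.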
Jack Cushman:
Yeah, very much. There’s this semi-opinionated — because I’m not opinionated about the details, but I’m opinionated about, what’s the most sensible way to explore this? One place where I think that’s getting more urgent: as a data rescue community, as an archival community, we have a real challenge with preserving the interfaces to things. One thing you’ll get those 2 million employees doing is, well, here’s some data — and I think you actually might want to see it on a map, combined with this other data, so you can understand how
Jed Sundwall:
Yeah.
Jed Sundwall:
Yes.
Jack Cushman:
your housing choice relates to your school choice, relates to your hospital choice, whatever the things are. There are all these semi-opinionated viewers that just combine two sources that are helpful to see in a shared visualization. And those we mostly lose because when you move from saving the underlying data to saving the software, you’re moving from the business of data preservation to the business of software preservation, which is its own field that is just much more complicated. You have to understand.
Is the source open? Is there a way to host it? Is there a way it will be patched in the future? How does it need to evolve? Software preservation is just a much more challenging, one-at-a-time kind of business. So the point is, we’re losing a ton of our viewers if we disinvest in publishing data. And that means we need to ask — because the archival community cannot replace that; we’re not two million people who can come build things — can we
make more general-purpose viewers that help people actually see the part of it they need? And so the question of what would be the sweet spot of a general-purpose viewer that helps any given person understand what they’re looking at becomes so important, I think.
Jed Sundwall:
Yeah, yeah. Well, I guess I’ll say to everybody, stay tuned. This is something we’ll definitely be doing a lot more of. And what’s actually kind of funny — people tend to think this is funny, at least a lot of the people I hang out with, because they’re climate model nerds — but I’m like, we really need to make it easier for people to see CSVs on the web. They’re like, what? I’m like, trust me.
Jack Cushman:
Sure.
Jack Cushman:
Yeah. I think that user feedback is so important. One piece of feedback we got for our Caselaw Access Project is, we were publishing JSON Lines files — one line of JSON per record. And that was really useful for Python programmers; there are great tools for reading that. It was very confusing for R programmers, if I’m remembering right. In R, it was a lot easier to read a CSV than a JSON Lines file. And I just got this feedback: can you make it CSVs? That works better in my environment.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
And it was these little things — if you can get past that friction, then people are able to use the thing.
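The JSONL-to-CSV conversion Jack’s users asked for is a small flattening step, assuming reasonably uniform records. A sketch — the case records below are invented, and real tabular exports need a policy for nested fields and missing keys:

```python
import csv
import io
import json

def jsonl_to_csv(jsonl: str) -> str:
    """Flatten a JSON Lines stream into CSV, using the first record's
    keys as the header row."""
    records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

cases = '{"id": 1, "court": "Mass."}\n{"id": 2, "court": "N.Y."}\n'
print(jsonl_to_csv(cases))
```

Publishing both formats side by side is often the cheapest way to serve the Python crowd and the R crowd at once.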
Jed Sundwall:
That’s right. And also, I think the story you just told highlights something we feel really strongly about, which is that you really have to focus on the practitioner community. This goes back to the sweet spot concept of not over-determining how data gets presented. If you go too far the other way — well, people just want a dashboard, or you just want a visualization for an executive — you’re cutting out a whole user community that
could really surprise you and do interesting things with the data. But could you say a little bit more about your viewer — how it was built?
Jack Cushman:
Absolutely. Very practically, if you go to this link, you can browse our collection. And the way we’re structuring it is sort of a tasteful use of the metadata that came with data.gov. So this owes a lot of DNA and credit to data.gov for structuring the data — offering metadata for how to shuffle these 300,000 data sets. We’re really just replicating that. Going back to your question of whether we have a separate list of US agencies:
Jed Sundwall:
Okay.
Jack Cushman:
We really just have the list that came with the data of what metadata entries that they have. And we let you search by data set title, organization, and so on. And then we let you narrow down by categories what we saw as the most useful chunks, metadata fields that were in our raw data to let you browse. The really important thing about this, what makes it little more interesting than a million other pages you’ve seen that let you browse a large data set and narrow it down.
is that it’s running entirely in your browser. There’s no server-side component to it. And for folks who might be on the less techie side of things: in a typical website, you have your own browser that runs on your computer, and it fetches HTML and JavaScript and so on from a server. The server is also running custom software. And when you send in your request for “just give me the ones that came from the US Geological Survey,” the server filters out all the others, narrows it down, bundles up exactly what you need, and sends it down to you. Which means the person who’s providing this to you
is doing sort of ongoing work for you. They’re keeping this software up to date and running and paid for. And so you’re dependent on them still existing. If you want to come back tomorrow or next year and still be able to narrow things down to just US Geological Survey, you’re depending on the person who’s really providing a service for you, still being there to narrow it down for you and hand it to you live when you need it. And that creates a lot of precarity in the digital humanities space. And there’s a…
We now have enough decades of experience making digital humanities projects and putting them online and then running out of money for them and having them crash again. You can study this. You can look at 100 projects and what made them live or what didn’t. And that server-side software load really becomes an issue because it’s the first thing that’s going to kill your project. It’s a huge difference between print books and libraries and digital books. And I love this contrast. Given some climate control, given a roof that doesn’t leak,
books are pretty happy to be left alone for a year. If you’re like, you know what, we just don’t have staff to open up this part of the library for the next year, we’re going to close the door, set the thermostat to the right level, and you’re probably just going to find them in better condition in a year than they would be if people had been looking at them. With digital, it’s not like that. If you’re like, we just don’t have the people to maintain this for the next year, there’s a good chance it’s gone and unrecoverable when you come back for it. You didn’t pay some server bill, and something got deleted, and no one’s around who knows how to put it back together, and it’s just gone.
Jack Cushman:
So this viewer, the really exciting thing about it is that it’s really not subject to that kind of rot, because it’s client-side only. When we give you the data, we hand you the entire software to view it, right alongside. And the idea is, if you’re making a copy of this, you get the original and you get the software too. Your copy becomes just as good as the original. And you can see right now, it’s kind of clunky. When you click around it, it’s slower to load than it would be if we had a powerful server running it.
We’re kind of pushing the edges of what’s possible to do in the client. I think we can push those edges a lot further. I think a lot of the clunkiness can be fixed by more indexes and more optimization here and there. But what you’re really having to do is think through: if all we could do is write static files, what static files would we need to make the experience I want very efficient? And just like you have seen,
geo data that is structured very carefully so that you can fetch the parts you need from the server without needing server-side software. We can use DuckDB and write out custom Parquet files that have the indexes you need to serve this experience, with the data you most need right at the top. And the better we have that structure, the faster the thing can run. A cool thing about that is it ends up being the same skill that you need to make fast server-side software. So, like,
If your data is poorly indexed and you’re sending a bunch of queries to the server that require it to do a bunch of work, the server is going to crash if a bunch of people use it. So you try to use indexes where the server has to do very little work. If you get those really right and really pristine, you don’t even need the server. You can just fetch the index data directly. That’s the plan. We should talk about cryptography too, because I think that’s a necessary piece of this vision. But let me know if we should jump to that now or stick to the client side.
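A minimal sketch of the “static files only” idea Jack describes, with an invented catalog and file layout. The real project uses Parquet files and DuckDB rather than JSON shards; everything here is illustrative:

```python
import json

# A toy catalog standing in for data.gov metadata records.
datasets = [
    {"title": "Streamflow", "organization": "USGS"},
    {"title": "Census Blocks", "organization": "Census Bureau"},
    {"title": "Earthquakes", "organization": "USGS"},
]

# Publish step (run once, offline): group records into one static
# file per organization, plus a manifest listing what exists.
shards = {}
for row in datasets:
    shards.setdefault(row["organization"], []).append(row)
static_site = {f"by_org/{org}.json": json.dumps(rows) for org, rows in shards.items()}
static_site["manifest.json"] = json.dumps(sorted(shards))

# Client step: "fetching" a static file is just a dict lookup here;
# in production it would be an HTTP GET against object storage. No
# code runs server-side to answer the query.
hits = json.loads(static_site["by_org/USGS.json"])
print([d["title"] for d in hits])  # → ['Streamflow', 'Earthquakes']
```

The publish step is where all the “server” work happens, once, before anyone queries anything.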
Jed Sundwall:
No, let’s just, I mean, let me just linger on that a little bit. So yeah, when you open the search, you have a little spinner there. And I assume what’s happening there is that, is it WebAssembly loading? Do you know?
Jack Cushman:
I think it’s DuckDB loading. There are about five megabytes in the current client that have to load just for raw DuckDB. And this was a technical choice we had to make early on. Do we use a well-supported, off-the-shelf library that does make you load a few megabytes? Or, the core work that we’re doing could be done with a lot less software to send down, but you’d have to do a lot more custom work.
Jed Sundwall:
Okay.
Jack Cushman:
We ended up deciding to go with the off-the-shelf thing with DuckDB, because it makes us part of a larger community, and we think it’ll feed back and forth with the open source community better that way. But it was a tough decision. I think the state of this technique right now is that it’s still pretty bleeding edge. You find a bunch of libraries that are like, someone made it and thought it was cool but stopped supporting the GitHub repo, or it was a one-maintainer project and now they’re gone. Or it’s a large project that’s planning to implement it, but they haven’t got around to it yet, and you have to find a branch where it kind of works.
So working this way ends up kind of pushing you into some creative coding. And part of what you’re seeing is loading that DuckDB software, for now. I think DuckDB itself could be a lot smaller, and that’s one direction I’d love to see it grow. The other thing you’re seeing is loading the data. So in addition to fetching DuckDB, at some points, as you click around through here, it’s going to say, to answer that query, I would need to have loaded this index that I know exists.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
And so it’ll go back to the server and say, can you please send me 500K or a megabyte of this index? It’ll help me show the answer to this. And as you click around, you’ll see less of that, because you’ll be loading into your browser the parts that you need to see the experience that you want to have.
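The incremental index fetching Jack describes can be pictured as byte-range reads against one static file. The page size and layout here are made up for illustration:

```python
# Simulate byte-range reads against a single static index file, the
# way a browser client reads Parquet from object storage.
PAGE = 1024
index_file = b"".join(bytes([i]) * PAGE for i in range(10))  # ten 1 KiB pages

def fetch_range(start, end):
    # Stands in for an HTTP GET with a "Range: bytes=start-end" header.
    return index_file[start:end + 1]

# The client already knows (from a small footer or manifest) that
# page 3 answers its query, so it pulls one kilobyte, not the file.
page = fetch_range(3 * PAGE, 4 * PAGE - 1)
print(len(page), page[0])  # → 1024 3
```

Object storage serves range requests natively, which is why no custom server code is needed.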
Jed Sundwall:
Okay, yeah. I just want to put in the chat also that, you know, Hacker News picked it up. They thought it was pretty interesting, what you’d done. And the only other thing I’ll say is, in the last episode, Brandon Liu, who created Protomaps, which is an amazing vector tile file format and serving tool. I feel terrible, I don’t know exactly how to characterize how awesome Protomaps is as a project. But he’s like, look,
it’s very, very simpatico with what you’re saying. He’s like, you should also be able to put Protomaps data onto an SD card and walk into a forest, you know, and give it to somebody on a laptop and visualize it there. Now, you still need to run a browser. So everything you just said hints at these decisions that you have to think about when you’re trying to find that sweet spot, which is like, okay, we’re going to use a very widely adopted
platform or tool, DuckDB, because there’s a community there for it. And obviously we’re using object storage and browsers because they’re very distributed technologies that people have access to. These are the kinds of decisions and thinking that I think, well, whatever, I’m preaching to the choir here. Yeah.
Jack Cushman:
It’s exactly right. I think of David Rosenthal, who founded LOCKSS. The way he likes to say this is: no one’s ever going to make hardware specifically for the archiving community. We are too small. So when you’re designing a system, you figure out what you can do with off-the-shelf parts that are designed for other communities. That was how it led him in the early 2000s to say, we need to figure out how to make this work on commodity hard drives. Because we can’t be buying special custom media for ourselves. We’re
way too small for that to ever be as good. We need to figure out what’s the media that other people use, and use it. And I see that repeat in all kinds of ways, you know, communities and structures.
Jed Sundwall:
Man, yeah. I’m gonna be maybe a gadfly here. I don’t know, I don’t think anybody’s listening to this right now, but I’ve had conversations with big funders that want to do big stuff for climate, and they’re like, we need really gnarly hardware. And I’m like, do not do that. Please don’t go down this path. I mean, they’re talking about building their own data centers, and I’m just like, stop, please stop.
Jack Cushman:
Mm-hmm.
Jed Sundwall:
What you’re doing is very important, supremely important. I’m glad you want to put money towards this, but you should be focusing on the commodity layer. Anyway.
Jack Cushman:
That’s right. I feel like we build strong, robust community layers, and then we identify specific technical weak spots where a real technical breakthrough will make a difference. So most of the work is kind of building the community that’s going to pass things around. And then we recognize something like: if we can make this client-side, if we can make this cryptographically signed, we can have a breakthrough here. So let’s put some tech into that, but spend it very carefully.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. All right. Well, now let’s talk about cryptography.
Jack Cushman:
Absolutely. So here’s the philosophy. Every copy should be as good as the original. If I make an archive at Harvard and you grab a copy of it and put it on your desktop, your copy should be just as good as mine for posterity. And that’s because lots of copies keep stuff safe. Those copies all have to be valid. And we really, philosophically, we don’t want to be planning for any one institution to exist in perpetuity.
whether it’s the US government or Harvard or if you shot it into space, it doesn’t matter. You shouldn’t assume that any one of them is still going to be there. And then it becomes really critical to focus on how to make copies, because the history that we’ve seen on the internet is that copies tend to disappear. If you try to maintain two copies of something, say we’re going to have two copies of the census data, then pretty soon you’re like, well, one of these isn’t being used. The internet’s very reliable, so we’re all going to one of them. And the other one just kind of gets cut off eventually.
It gets deprioritized, defunded, disappears. So we have to make copies robust and easy. So it becomes a two-part strategy. When I ship something with the data.gov archive, it’s going to come with a viewer. And it comes with signatures, so that you don’t need me to be around to make sure that it’s real, to understand what its provenance is in library terms. We just talked about the viewer prong of that, which helps make sure that your copy is as good as mine. The signature prong is: when you get data from me, you should be able to tell
who says this is authentic? When did they say it? And what do they say is in it? It’s something that I love. Starling Lab compares this to an evidence bag in court. If you imagine, you know, I don’t know, let’s pick something nice, like a beach ball is found at a crime scene. Then it’ll be put in a bag. Most of the examples are not great, but it’ll…
Jed Sundwall:
Yeah, yeah, yeah, yeah, yeah, that’s true. Yeah.
Jack Cushman:
It’ll be put in a bag. The person who picks it up will sign it and say, I picked this up and put it in this bag on this date. And then when they hand it to someone, they’ll say, then I handed it to so-and-so. And they’re like, yeah, I picked this up, and it was handed to me by them, and I brought it to the evidence locker, and I put it in this locker and locked it. And then the person who takes it out to bring it to court, they’ll say, I took this out, and I held it from here to court. So when you’re admitting that beach ball before a jury of your peers,
you can say these are the people who would have to testify in sequence, every one of them to say hand to hand how it got from that crime scene to you touching it today. And most of the time those people don’t come testify because most of the time that process is reliable and the fact that we have that record means that we can rely on it. Sometimes it’s not. Sometimes we say like, that one person who was working the evidence locker that day turned out to be really sketchy and put things in the wrong places. Let’s figure out which ones they touched and we can revisit that.
So that provenance chain becomes so vital. If you think about proving things in court, it’s really clear. But actually, in libraries, we care about that all the time. Anytime I say, here’s a list of companies: if you could say, well, this is a list of companies that Edward Snowden said were cooperating with the government, and I can prove it, then it’s a really important list. If it’s, this is a list that Jack found on Wikipedia, it’s a meaningless list. The provenance matters. So we need to be able to attach provenance to the things we pass around. And we need it to last longer than we do. Those are the design constraints.
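The chain-of-custody record Jack describes can be sketched as a hash-linked log. This is a toy illustration, not the format the lab actually uses:

```python
import hashlib
import json

# A toy hash-linked custody log: each entry commits to the previous
# entry's hash, so editing or reordering history breaks later links.
def entry_hash(entry):
    body = {k: entry[k] for k in ("actor", "action", "prev")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append(chain, actor, action):
    prev = chain[-1]["hash"] if chain else ""
    e = {"actor": actor, "action": action, "prev": prev}
    e["hash"] = entry_hash(e)
    chain.append(e)

def verify(chain):
    prev = ""
    for e in chain:
        if e["prev"] != prev or e["hash"] != entry_hash(e):
            return False
        prev = e["hash"]
    return True

chain = []
append(chain, "collector", "bagged item at scene")
append(chain, "clerk", "stored in evidence locker 12")
append(chain, "officer", "delivered to court")
print(verify(chain))  # → True

# Tampering with the middle entry is detectable.
chain[1]["action"] = "stored in locker 13"
print(verify(chain))  # → False
```

Real systems add signatures so you also know who wrote each entry, not just that the sequence is intact.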
Jed Sundwall:
Yeah. Yeah.
Jack Cushman:
Cryptography is how we do that. And we attach a signature to it. The signature says, I, Jack, say this is what this is. And you can be convinced that it was Jack who said that. And we attach a timestamp to it that says all of this stuff you’re looking at existed as of this date. No later than this date this came into being. And if you put a signature on it and then you put a timestamp on it, then you can later say reliably, Jack swore in 2024 that this was real and this is what he said it was.
And that existed in 2024; it didn’t happen later. And that doesn’t mean it’s real. It still could all be fake. I could have lied about it. I could have been lied to by the web. A bunch of things could happen. But let’s imagine we go out to 2028, and two people are arguing about water rights in Nebraska. And one of them says, look at this government record from 2010 that proves that these are my water rights. And it’s gone now. It’s no longer on the website. The only copy is in the Harvard archive.
And the other one says, that’s a lie. You just made that up. That’s not a real document. It’s not on the federal web. Then what you get to argue about is: is it plausible that Jack in 2024 wrote down some lies about water rights that would mean that I win this thing in 2028? You greatly narrow down the ways that the lie could happen. And most of the time you’ll say, OK, no, that doesn’t make sense. Jack wouldn’t have known to do that. This must be real. So that’s what we need to do. We need to attach a signature. We need to attach a timestamp. Getting into the technical weeds: we were moving pretty quickly, and we wanted to
ship something, and we wanted it to be reliable for the long term at the same time. So the plan was to use very well-understood, basic, standard, off-the-shelf crypto. I think if you were designing this from scratch, you would use more modern algorithms, but what we reached for is OpenSSL and some standard ways of using OpenSSL to sign and timestamp things. And so we added a little extension to the BagIt format, which you can find in my tool, bag-nabit, which I got to name.
Jed Sundwall:
Hahaha
Jack Cushman:
It’ll put in a signature file that covers all of the hashes of the stuff that’s actually in this thing. I’m going to sign that file of hashes, and I’m going to say, Jack swears that this is real. And it actually chains back to control of our email address at the Library Innovation Lab. So someone who was in control of that email address at this time signed this thing saying it was real. And then we timestamp it, just going out to a timestamp server, like DigiCert, and saying,
someone else out in the world, with no reason to lie, who timestamps a bunch of stuff, says it existed at this time. And that signature plus timestamp can give you a lot of confidence. I also built it so that it can support multiple chains. So you could say, Jack swore this was real and timestamped it, and then someone else swore it was real and timestamped it, and then someone else did it. And you can start to build a collection of people for whom it’s implausible that they all made the thing up. So technically, it’s trying to make a really simple, hard-to-mess-up,
convincing proof that this thing is what it says it is. And if you poke around the bag-nabit source that you just linked, you can see how we made those choices. And the goal was to have a cryptographer not say, how brilliant, you did some really clever things here, but probably to say, you did what we thought was amazing 10 years ago and is now fine, and you didn’t do it wrong. Because that was really the goal: don’t have any kind of big implementation mistakes in the cryptography,
which is kind of the level of cryptographer that I am. I think I understand the tools we’ve been given and how to use them. And I understand that almost all the time, what goes wrong is not some break in the cryptography itself, but a screw-up along the way in how you use the tools. So I’m going to try to use them in a very straightforward, obvious way. And that’s what this tool offers: just the most obvious, straightforward way to use a very standard tool to verify where something came from and when it appeared. And…
I don’t know, I like to imagine sometime five years from now, 10 years from now, 50 years from now, people saying like, is this real or is this made up to suit our moment? And being able to say, yes, I can trace it back through Jack’s software and say, very implausible that this was made up because you would have had to do a bunch of things that didn’t happen.
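A sketch of the hash-manifest half of this scheme, using only hashing. bag-nabit’s real format goes further: it signs the manifest with an OpenSSL key and attaches a third-party timestamp, which the Python standard library alone can’t do:

```python
import hashlib
import json

# A toy "bag": path -> file bytes. Contents are invented.
bag = {
    "data/datasets.csv": b"id,title\n1,Streamflow\n",
    "data/metadata.json": b'{"source": "data.gov"}',
}

# One content hash per file, in the spirit of BagIt's manifest-sha256.txt.
manifest = {p: hashlib.sha256(blob).hexdigest() for p, blob in bag.items()}

def verify(bag, manifest):
    # A copy is "as good as the original" only if every hash matches.
    return set(bag) == set(manifest) and all(
        hashlib.sha256(blob).hexdigest() == manifest[p] for p, blob in bag.items())

print(verify(bag, manifest))  # → True
tampered = dict(bag, **{"data/datasets.csv": b"id,title\n1,Changed\n"})
print(verify(tampered, manifest))  # → False
```

The signature then needs to cover only the small manifest file, not every byte of the data, which is what makes signing large archives cheap.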
Jed Sundwall:
Yeah, okay. Well, this is great. I mean, we’ve talked about this before, but I’ve never gotten this full spiel from you. This also has to be built into Source. I’ve said this for a long time; it’s an aspiration for Source. I’m glad you’re excited. I want people to be able to use Source to win court cases. Because people are like, we’ll have open data for impact, and I’m like, well, how does that impact actually happen? Because there’s
always this thing I call the data delusion. I got this from Jessica Seddon. She’s on our board, a great co-conspirator forever. She talks about imaginary decision makers, which is this thing in our circles, especially those of us who work in environmental data, where we’re like, well, once we have the data, then the people who are in power will know what to do, and then they’ll do the right thing. And it’s like, no, that won’t happen.
Maybe sometimes that happens, but it’s pretty rare. How do you get people to change their behaviors? I mean, one good way to do it is by suing them, and winning. And so I’m like, all right, well, then what do we need to do to make data actually suitable to be presented as evidence? And you just told that story perfectly. The funny thing, the beach ball thing, is hilarious, because a beach ball is so benign, and then
Jack Cushman:
Yeah, you’re trying to offer a theory of change.
Jed Sundwall:
And then I’m like, how would you commit a crime with a beach ball? You know, then I’m just…
Jack Cushman:
Let’s not. I worked as a lawyer for a while, and I worked some upsetting criminal law cases. My favorites, though, were torts. So I think if you’re looking for the fun, study torts. It’s the law of how you can get paid back after someone accidentally or intentionally hurts you. How do you just go to court and say, well, something bad happened, you should pay me until we’re even? How do you prove what even would be? And when you read a torts book, every case starts with something that feels like the start of a horror movie.
Jed Sundwall:
Mmm.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
It’s like, you know, two brothers were riding a train. The train had no doors. The train was on a high bridge. The train went around a corner, and you’re like, no! And they go on. But I like it when they’re 100-year-old cases and you can kind of have some distance from them.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Well, actually, another thing: you mentioned water rights. And I love that, because again, we work with a lot of environmental data, so water rights always come up. I often like to refer to the Cippus Perusinus. This is text that has been preserved on a stone tablet, or I don’t know what you would call this thing, from
the second or third century BC or something like that. And of course it’s about water rights. It’s basically like, this is our water. So anyway, talk about archival evidence. Okay. Yeah. You go ahead.
Jack Cushman:
Totally.
Jack Cushman:
Yeah, just to plus-one the thing you said: you were saying, well, shouldn’t Source be signing things? And I think that figuring out the technical details of that is such an important thing to sort out. And we’ll have a lot of fun, nitty-gritty design choices in it. But it goes back to this core thing: whenever we pass something from one hand to another, we should write down what was passed. Because it tells you, here’s the chain of people who would have to explain what this is for you to make sense of it. And that can be in court, it can be in research, it can be just,
what is this object and where did it come from? But you have such a wonderful leverage point, because you’re collecting a bunch of stuff where, if you standardize how we get this into a provenance chain now, then from here on it’s going to have a clear record of where it came from. It’s just a wonderful way to be a witness to what has happened, and to start to make it possible for the community to know things more specifically and reliably.
Jed Sundwall:
Yeah. No, I think we’re in a good position to do this. It’s the kind of thing that, when I was building the open data program at AWS, we would not have been able to do. It would be, I think, basically impossible to get Amazon to say, yeah, we’ll validate all this sort of stuff. I think, for good reason, Amazon’s lawyers would be like, that’s not a role that we’re going to play. And then of course my opinion also is quite strong that we should have
differently governed entities do that kind of thing. You don’t want an investor-owned entity to do that, because it’s just not core to the business.
Jack Cushman:
You know, for people who want to get involved in that, right now I think the C2PA coalition is really where that action is. And I was just noticing Amazon is one of the members of that. I think Adobe is really the driver of it. Their vision is: if you take a picture with a camera, pass it to an editor, pass it to a newspaper, at every step of the way, as a photo is handed from one place to another, including through Photoshop, you should get a reliable record of what that person did to it, which is a perfect example of how we use provenance chains.
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah.
Jack Cushman:
They’re making a standard that is right on the cusp of being useful for everyone else too. It’s working with images as its motivating use case, and you can see some parts of it that are really shaped by that. But then you can also see it overlapping almost completely with a general standard for having a provenance chain that gets passed around with a piece of data, where whenever someone touches it, they add on what they did to it, and then they pass it on. And I think if we can get there, it’ll just unlock, like,
a correct answer for how we’re all supposed to be doing this thing. We’ve made our own standard for how to attach provenance to web pages, the WACZ signing standard. We have our own way that we did it with BagIt here. But if we can get this thing to be a generally applicable “here’s the right way you pass things around,” it’ll be so powerful. And I throw that out here because, as you said, incentives can be weird for large corporations. And if one is driving it especially, it can end up kind of
Jed Sundwall:
Mm-hmm.
Jack Cushman:
overly shaped by the ways that they can see it helping, and under-theorized in others. This is just such a good time for people to pile in and help it be useful to everyone. I think OpenAI and the AI platforms have gotten interested in this as a way to say: if you want to prove where this came from, if you’re not trying to hide that it came from AI but trying to document it, here’s how you would document it. And that’s a good sign, because it’s such a different use case, but I’d love to see more of that in there.
Jed Sundwall:
Yeah, yeah. Okay, well then we will. Just making plans for 2026. All right, so there are two other things I wanted to touch on. We still have time. Again, as you said before we started streaming, if we really wanna make it to the top of Spotify, these things need to be three to four hours long, but we’re not there yet. You mentioned once this idea that the internet has created this kind of…
Jack Cushman:
Totally.
Jed Sundwall:
I would call it just sort of a distortion, or it creates this illusion that data is safe when it’s not. It kind of directs everybody into having just one copy somewhere. Could you expound on that? Or did I represent that right?
Jack Cushman:
Yeah, absolutely. That’s exactly right. We’re calling it the one copy problem. And the summary of the one copy problem is that all of the data that we rely on is very fragile. That’s the urgent thing. But the why it’s fragile gets really interesting. And it really comes from the economics of having the internet be very reliable, counter-intuitively. When you’re studying the internet, there’s this network diagram that gets passed around a lot. There are layers. There’s an hourglass where you have like
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
IP and TCP and the DNS system and browsers and applications as a bunch of layers that each take care of their business and let the layers above and below them take care of theirs, so the whole system works. So if you picture our data preservation system as layers, a layer that works incredibly well is the ability to reach out and contact a website. Cloudflare was down yesterday, and everyone is talking about it. It makes headlines that there are some websites you can’t reach within a second right now.
But we’re used to, almost all the time, almost all websites anywhere in the world, you can get in under a second. It’s incredibly robust. If you looked at it in terms of how often is it online and how reliable is it, the system is designed very well and it works very well, and it gets you things immediately with no complaints. There are exceptions to that, but it’s a reasonable way to think about what the internet is and how it works. And that reliability ends up creating fragility in other places in the stack.
Because when you have two versions of something, it’s equally easy for the entire world to all go to one. There’s no kind of incentive to be like, well, this one’s down sometimes, this one’s down other times, this one’s closer to me, this one’s farther away. No. If you have, like, CDC data one and CDC data two, a crowd is going to kind of pick one. And then that one is going to gain momentum, and they’ll tell each other about it. And pretty soon, 100% of people will be going to CDC data one, and no one’s going to CDC data two.
And after a year or two, someone’s going to say, why are we still paying for this thing that no one uses? And it’s going to come off the budget. And that’s true for governments. It’s true for nonprofits and public interest preservation. It’s true for corporations and redundancy, whether we’re storing our archive as the New York Times or Amazon or whoever. Because of the reliability of the networking, we do this economic process of
putting 100% of our reliance on one copy, 0% on the other, and then deleting copy two. And it means that our memory becomes really fragile. There was a story just a few weeks ago of a fire in South Korea that destroyed 800,000 federal workers’ data. And you’re kind of like, oh, what idiots. If I was the sysadmin, I would never have forgotten to do whatever. But no, actually, all of our data is that fragile, where a systemic shock like a fire really could destroy it.
Jack Cushman:
Some is very well backed up, but most of it is subject to one or more correlated failure modes. So you’re not necessarily picturing that they only had one hard drive and should have had two hard drives. You have to picture: they only had it behind one administrator password, and if someone stole that, it could be deleted, and it should have been behind multiple. Or they only had it in one geographic region. It was all in Amazon’s data centers in Virginia, or it was all in California, and when there was a large-scale disaster, it got lost. Or they only had it on one brand of hard drive. And when that
Jed Sundwall:
Yeah.
Jack Cushman:
brand failed, it failed. Or it was all paid for by one source, and when that source changed its priorities or changed its policy, it got deleted. There’s a paper from the early 2000s from LOCKSS that lists their threat model. And they list, I think, about 14 of these kinds of correlated failures. Only one government. Whatever you can think of that is a failure mode. You could even go to, well, it’s only on one planet, and start to think about how to fix that. But for now, even on one planet, there’s a lot of correlated failure.
Jed Sundwall:
Right.
Jed Sundwall:
Right.
Jed Sundwall:
Yep, right.
Jack Cushman:
And so the problem becomes like, how do you beat economics? Like, how do you beat market incentives to have only one copy that is subject to correlated failures for stuff that matters mostly to posterity? We have a public data project, and I’ve thought a lot about what that means, public data. And really the way that I think about it is public data is data that is mostly valuable to people outside of the data custodians. Like, if you’re a company and you collect, you know,
Jed Sundwall:
Yeah.
Jed Sundwall:
Mm-hmm.
Jed Sundwall:
Interesting.
Jack Cushman:
Internet visitor statistics so that you can model traffic and make ads better. That’s private data. You’re collecting it. You’re using it. You’re paying for it. If you delete it, like, you’ll be the one who’s sad about it. If you’re a government and you’re collecting, you know, what have been our tariffs over time, what have been our school crowding over time, you’re doing that primarily for the benefit of people besides you, the person making the spreadsheet, or even your department, but for the world to be able to navigate properly. And so there’s a kind of incentive misalignment.
the people who most value it are not in the room or able to advocate for themselves necessarily. And if you start thinking about what are all the kinds of data where people besides the ones holding the checkbook might care, it’s certainly things like government data sets, but it’s also things like the New York Times archive, all of the archives of news that are behind paywalls. Even like, I don’t know, YouTube. YouTube in many ways is the most important record of a bunch of things that have happened in the last 10 or 15 years. And like,
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
There's Google's interest in preserving that, and then there's society's interest in preserving it, and that's very hard to theorize. So public data becomes this kind of misalignment problem: we need to invest in something where the people who care are not here to advocate for it. And that's what I think of as the one-copy problem. Where do you intervene in the economics of this thing so that we can start to have durable memory of the stuff we most care about?
Jed Sundwall:
Right.
Jed Sundwall:
That's fascinating. I mean, you know, we've talked about this: I'm very interested in raising an endowment. That's going to be a huge area of focus for us, because, going back to this discussion of focusing on the commodity layer, the very good thing about the tech sector that we have right now is that there is competition in it and there's plenty of downward pressure on pricing. And I think we can forecast costs well enough to endow the long-term preservation of data.
And what that could open up is you could say, look, we've endowed this dataset, or, I should use my own terminology, we've endowed this data product to be available via these URLs for 50 years. Would you like to endow a copy of it? And we are at the point where, if it's a terabyte of data, that's thousands of dollars. I mean, don't get me wrong, it's a real thing, but it's a one-time check that
Jack Cushman:
Yes.
Jed Sundwall:
a philanthropist can write, you know, it’s not, yeah.
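The arithmetic Jed is gesturing at can be sketched like this. All of the inputs below are hypothetical assumptions for illustration, not figures quoted in the episode: a present-day object storage price, two replicas, and a steady annual price decline.

```python
# Back-of-envelope sketch of "endowing a terabyte." Assumed inputs:
# $0.023/GB-month object storage today, two independent replicas,
# and storage prices falling ~10% per year.
def endowment_cost(tb=1.0, price_per_gb_month=0.023, replicas=2,
                   annual_price_decline=0.10, years=50):
    total = 0.0
    monthly = tb * 1024 * price_per_gb_month * replicas
    for _ in range(years):
        total += monthly * 12
        monthly *= 1 - annual_price_decline  # storage keeps getting cheaper
    return total

# One-time check, on the order of thousands of dollars under these assumptions.
print(f"${endowment_cost():,.0f}")
```

Under these (made-up) assumptions the 50-year total lands in the low thousands of dollars, which matches the "one-time check a philanthropist can write" intuition; the real number depends entirely on the pricing and replication choices.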
Jack Cushman:
I think it's such an important provocation or design goal. Why can't you endow a terabyte? If you're like, this terabyte should exist for the next 50 or 100 years for humanity, why can't any of us make that choice and say, yes, I'm going to invest in making that possible? I don't know what apparatus you would use for that now. Actually, if you're a Harvard professor, I do know: I would tell you to use the DRS, the Digital Repository Service, that was founded about 20 years ago and is going through a whole reinvention right now.
I think some big institutions have learned how to think about this for themselves. But how do we make it something that is available, not just at Harvard, but across the world, if you have something you care about, how do you endow it? I love that question. I think it’s such a good approach to it to start to realign those incentives, to say that someone now, today, can make an investment in something to pass it to posterity. And then the other thing I love about it is it makes you start to think about
Jed Sundwall:
Yeah.
Jack Cushman:
What does it mean to last for 50 years? What steps should you take with that money, the money that you're handed when someone endows a terabyte? And how do you defend against all of those correlated failure modes that LOCKSS laid out? I think the gnarly thing, the tricky thing at the end of that thought process, is you probably actually need multiple mutually independent institutions to be involved. Because
Jed Sundwall:
That’s right.
Jack Cushman:
you, Jed, become a single point of failure, like, well, if I can buy you, I can end this thing. And that can't be how it works either. So there's a bunch of strategies, but how do we make it so that there is no one of us who can disappear and have the thing disappear?
Jed Sundwall:
That’s right.
Jed Sundwall:
That's right. Yeah. Oh man. Okay. Well, one last point I want to bring up: let's talk about the Smithsonian really quickly, because again, it's very relevant to everything we just said. What are your plans with the Smithsonian? I mean, what are our plans with the Smithsonian? You can say.
Jack Cushman:
Absolutely. So the Smithsonian is our second major data collection after data.gov. And this is something that came up in the data preservation community: whether the Smithsonian's public, out-of-copyright dataset as a whole could be preserved, which is over 700 terabytes stored on Amazon.
Jed Sundwall:
Okay.
Jack Cushman:
And over 700 terabytes becomes enough that most projects are kind of, we can't take that on, that's too big a goal for us. And our public data project felt like we were able to do that, able to make a first collection. And then we talked with you, and very fortunately, you felt like you were able to take it on with us and move it to Source. So we start with this kind of giant blob of 700 terabytes that is really quite an undertaking for
our kind of community. It might not be a huge undertaking if we were Google, but for who we are, it's a big thing. And now we have it. And what we have right now is just a straight copy: get a copy from here, move it to here. I think the first thing we'll do is sign it, just like we talked about with the other thing. Just say: this is the copy I made, and I made it on this date. And from then on, you won't need me to be around to know that this is exactly what the Smithsonian had. But beyond that, we have to start thinking about access.
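The sign-the-copy idea Jack describes can be sketched roughly like this. The files, date, and manifest structure below are invented for illustration; the episode does not specify the real workflow or key handling, and the actual cryptographic signing of the final digest is only indicated, not shown.

```python
import hashlib
import json

# Record a content hash for every file plus the date the copy was made,
# then hash the whole manifest. Signing that final digest with a private
# key (not shown) is what lets anyone verify the copy later without
# needing to trust, or even contact, the copier.
def manifest_entry(path, data):
    return {"path": path, "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data)}

# Hypothetical two-file collection standing in for the real 700 TB archive.
files = {"item1.tif": b"...image bytes...", "item2.json": b"{}"}
manifest = {
    "copied_on": "2025-01-01",
    "entries": [manifest_entry(p, d) for p, d in sorted(files.items())],
}
digest = hashlib.sha256(
    json.dumps(manifest, sort_keys=True).encode()).hexdigest()
print(digest)  # the single value you would sign and publish
```

Sorting the entries and serializing with `sort_keys=True` keeps the digest deterministic, so two people hashing the same copy get the same value.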
And how can people actually benefit from using that thing? One of the things I'm really excited about is whether we can make a kind of access copy that is much smaller and that you could just have for yourself. It's very common with these kinds of preservation datasets that you have a preservation version that is, say, uncompressed full-color images, which can be very large. And that's one of the sources of your 700 terabytes.
But if you accepted a small amount of compression, even visually indistinguishable compression, you could get down to 10% of the size. So I think exploring that: is there an access copy that is more like 70 terabytes instead of 700, that you could just have on your desk? 70 terabytes is still a lot, but you could get an enclosure that you could just plug into your laptop and say, the Smithsonian collection is here on my laptop to talk to. So I love that aspect of it. And then the other piece is we have to figure out discovery. What do you do when you just have
a collection that size land in front of you and you don't understand what's in it? There's one approach that is like, when you click a file, you should be able to try before you buy and see what's in there. But the other approach is, at a millions-of-files level, how do you get a view of, in general, what's in here? What am I going to find if I start sifting through this? It's what people call exploratory data analysis, but I think we have to democratize that and not have it sound like something that only data scientists do.
Jack Cushman:
Or law firms do it too. Here’s the hard drives of your client or the opposing client and just figure out on the hard drive. That’s called forensic analysis. And I think both forensic analysis and exploratory data analysis, we have to move past that to what can I click to understand what I’m looking at? How can we make this more something that everyone can get their hands on?
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Well, actually, that's crazy, because you just teed up next month's episode: the live stream webinar podcast thing will be with Matt Hanson to talk about the SpatioTemporal Asset Catalog, STAC. So this is a metadata spec that has been very rapidly adopted within the geospatial world, and it solves that collection-level problem that you described, which is basically: I have a collection of spatiotemporal assets. The
most common example you would think of is a collection of satellite imagery or drone imagery or something like that. And what it is, is you give people a JSON file at the root that says: here be spatiotemporal assets, collected between these times and covering this spatial extent. So immediately you can kind of tell, is this a timeframe or an area of the planet that I'm interested in or not? And you can move on, right? And those can be indexed, so you can search them.
Jack Cushman:
Yeah, that makes perfect sense.
Jed Sundwall:
That notion of figuring out how to distill a collection at that high level, so that at least you've standardized: here's a bag, any kind of metaphorical bag of a collection. What are you going to say? Like, this is the universe it contains. Do you care or not? And move on. So this is a perennial issue. Yeah.
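The root-descriptor idea Jed describes can be sketched like this. The field names loosely follow the STAC Collection spec's extent object; the dataset id, coordinates, and dates are made up for illustration.

```python
# One JSON document at the root stating what the "bag" contains, so a
# reader can decide whether to care without downloading anything.
collection = {
    "id": "example-imagery",
    "extent": {
        "spatial": {"bbox": [[-125.0, 32.0, -114.0, 42.0]]},  # W, S, E, N
        "temporal": {"interval": [["2018-01-01", "2023-12-31"]]},
    },
}

def overlaps_bbox(bbox, query):
    # Axis-aligned bounding-box overlap: do I care about this bag or not?
    w, s, e, n = bbox
    qw, qs, qe, qn = query
    return not (qe < w or qw > e or qn < s or qs > n)

# Does the collection cover a query area around Los Angeles?
print(overlaps_bbox(collection["extent"]["spatial"]["bbox"][0],
                    [-119.0, 33.0, -117.0, 35.0]))
```

The same one-line check against the temporal interval answers "is this a timeframe I'm interested in," which is exactly the care-or-move-on decision described above.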
Jack Cushman:
Yeah, if I could connect it, trying to wrap it up a bit, I think geodata is out ahead here because geodata has always had this problem. You go to Google Maps, and you can zoom out until you see the whole world. And then you can zoom in until you see just one block. And structuring the data to allow that, to be able to jump in and out and see the right level of detail when it’s all the same data set.
Jed Sundwall:
Yeah.
Jack Cushman:
has meant that geodata has to be very thoughtful about how the data is stored and indexed so that it's discoverable by the software that needs it, efficiently, which is just what we were talking about with how we index our data.gov viewer so that it can be fetched efficiently. We need to start thinking that way about that very clever structuring of data across the board for making things available, and kind of picture it like: we want to enable for everyone that Google Maps experience,
Jed Sundwall:
Right.
Jack Cushman:
that if you want to, you can zoom out and see the world of the 700 terabytes, and if you want to, you can zoom in and see the block. And you should be able to do both of those, and do them very cleanly, which for that community is completely obvious, has been true the whole time, and has wonderful technology behind it. How do we take that technology and make it work for any dataset, I think, is a great challenge. I'll also say, I'm always kind of looking for where the bigger industry is headed, and I think AI is a huge industry that blows us in a direction.
One thing that we're going to find as data people is that indexing is critical to AI research and AI practice. From a library perspective, using an information tool, there's a question of, is the model smart enough? There's a question of, does the ground truth even exist? Is it possible to fetch it? But in between those two is: do you have an index that can get the correct answer, instead of the wrong answer, into your model's context when you need it? And if you can do that, if you have those indexes, then you can make
Jed Sundwall:
Right.
Jack Cushman:
data tools that actually empower individuals, which is what we think about at the library. And if your indexes are bad, then you're going to get the wrong answer in context, and it's going to hallucinate or tell you the wrong thing, and it's going to disempower people. It's going to hurt them. Which means we have this weird position, and I think it's a surprise for me as a library person and maybe a surprise to other folks, that all of a sudden, indexing is cool. How you index your data is going to really matter. And I think it's such an opportunity for us, because we've been thinking about indexing forever. And now that it's cool, let's figure out what we know about it that is cool, that we can share.
Jed Sundwall:
Yeah, your day has finally come. So as we wrap up, your mention of how the geo community is ahead here is, I'm sure, flattering to those of our community who are listening in. But we did get one comment on LinkedIn from Linda Stevens, who we've worked with in the past, and she's worked in the geospatial space for a really long time. She made the comment that you have to certify a map at different layers: you have to track and certify all the layers that make it up.
It underscores the point that you made, that maps are these confections of data that we've been figuring out how to create. I mean, it's such a rich field. Cartography is just amazing, because we've been trying to figure out how to downscale so many things we know about the world into something that's legible for humans, and then assert that in a way that's credible. It's a huge challenge.
Yeah, I would say my theory for why the geo community is out ahead is that most of us gave up on getting super rich a long time ago, as opposed to, say, the life sciences community, where I think there's real gold in those hills. People think they're going to cure cancer and make a ton of money, which is great. I want them to try to do that. But the geo community is just generally much more open, and I think it just has such a long history of sharing information. I mean, it's
Jack Cushman:
Mm-hmm.
Jed Sundwall:
core to what we do.
Jack Cushman:
Maybe try checking your maps for any hills that have gold in them. It’s probably worth a shot.
Jed Sundwall:
We already did that, you know, that's the point. Like, yeah, we already found those. Yeah, I mean, don't get me wrong, in recent years it's been lithium, you know, there's always going to be something else. There's money in understanding spatial data for sure, but the mad rushes are over, and there's a huge community that's just, I think, very generous. And so, yeah.
Jack Cushman:
We found those already.
Jack Cushman:
You know, I love Linda’s point, too, that you do have to certify at every level. I’ve seen some of the work that goes into designing a product like Google or Apple Maps, where things have to appear or disappear as you go in and out. It has to be the right things. That has to be the things I care about at each level. And sometimes it’s better, sometimes it’s worse, as they’re kind of iterating on what should I show you. And it’s such a wonderful little example or crucible for how we do data in general, because you have a bunch of ground truth. People went out and wrote things down.
Jed Sundwall:
Yeah.
Jack Cushman:
It was maybe accurate at the time somebody saw it; it's maybe not anymore. You're integrating a bunch of different views of the world. There's a bunch of research just going into how you tell whether two data points are one store or two stores, all of that integrating of views of the world into one. And then once you've integrated into one view of the world, there's: how do I express this to you so it's not a lie? I could show you a map of your neighborhood where I'm showing you the gas pipes, and you're just confused. I could show you one where I'm showing you the benches and the things that you care about.
And am I meeting you where you are, so that what I'm showing you empowers you instead of disempowering you? And am I doing that without oversimplifying it so much that in fact I'm lying to you, and I'm disempowering you that way? And so it's this perfect combination of seeing the world and getting ground truth, integrating it and deciding which things you're going to believe and which you're not, and then debating, well, how are we going to show this to people so that we're empowering them? What do we share with them? How do we lead them?
Let them get more expertise as they want it. I just love all of the parts of that design problem. And then it's kind of like, now welcome to all the rest of it. What if it was a pile of zip files and some PDFs and some instructions, like the mess of the world? It's something that the data community has thought about for ages: how do you make those wonderful interfaces so that people can find the stuff they need outside of maps, too? I think there's so much more room for us to improve on that, and that'll be really exciting work to do.
Jed Sundwall:
Yeah, well, let's do it. I mean, I think we're very aligned, and we want to create the conditions to let lots of people run those experiments and make that possible. So yeah, let's go. Well, thanks so much, Jack. I think this has been awesome. An hour and 20 minutes, not bad. Yeah. Yeah.
Jack Cushman:
Thank you, Jed. I really appreciate it. Thanks for giving us a chance to talk about this stuff, and thanks to folks for listening. I think we'd love to keep debating more: what are we meant to do, what are we meant to save, how do we save it, and how do we pass on to humanity what we should? I just really appreciate the chance to talk about it with you.
Jed Sundwall:
Okay. Well, we'll keep talking. Thanks, Jack. All right, so we're going to stop, and then…
Jack Cushman:
All right. Take care.
Video also available on LinkedIn

Show notes
Jed talks with Brandon Liu about building maps for the web with Protomaps and PMTiles. We cover why new formats won’t work without a compelling application, how a single-file base map functions as a reusable data product, designing simple specs for long-term usability, and how object storage-based approaches can replace server-based stacks while staying fast and easy to integrate. Many thanks to our listeners from Norway and Egypt who stayed up very late for the live stream!
Links and Resources
Key takeaways
- Ship a killer app if you want a new format to gain traction — The Protomaps base map is the product that makes the PMTiles format matter.
- Single-file, object storage first — PMTiles runs from a bucket or an SD card, with a browser-based viewer for offline use.
- Design simple, future‑proof specifications — Keep formats small and reimplementable with minimal dependencies; simplicity preserves longevity and portability.
- Prioritize the developer experience — Single-binary installs, easy local preview, and eliminating incidental complexity drive adoption more than raw capability.
- Build the right pipeline for the job — Separate visualization-optimized packaging from analysis-ready data; don’t force one format to do everything.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
So I'm going to start it. First of all, happy Halloween, Brandon. Welcome to a special edition of Goth Data Products, in case anyone's wondering why we're both red, if they're watching; those listening to the audio only don't get the benefit of seeing us in this kind of spooky, spooky color scheme. But welcome.
Brandon Liu:
Thanks.
Brandon Liu:
Yeah, so thanks for having me on the podcast. I'm excited to talk about, you know, Protomaps, data, Source Cooperative. So I'm here to answer questions, I guess. Yeah.
Jed Sundwall:
Yeah, no, likewise. Oh, by the way, sorry, I did chicken out, so I've changed the lighting, so I'm not red anymore. But yeah, no, it's great to have you. I mean, when we started this thing, you were really top of mind as somebody who's been very thoughtful about what I would call, what we call, the ergonomics of data: figuring out how to make a lot of data accessible for people. So if you don't mind, let's just start there. Like, can you just
Brandon Liu:
Right.
Jed Sundwall:
How do you describe yourself and what you do?
Brandon Liu:
So the way I describe myself is, I started a project called Protomaps six or seven years ago, and the impetus for this was making it easy to make a map. And the direction that came from was very much just, think about a web developer that is making a website. So for example, they're making a site to look up different cafes in their neighborhood.
They might use something like Google Maps, but that is a proprietary SaaS that they buy. So I really wanted a home-cooked way to make a map, because there are so many things you can publish on the web. You're able to publish videos, you're able to publish pictures or Markdown or HTML, but being able to publish an interactive map has never been that way. So really, the way I approach this is from
the idea of making it accessible for anyone to publish a map.
Jed Sundwall:
Got it. Okay. Amazing. And you've done it. And you've reminded me, so one thing that we were going to be doing, I mean, I'm just going to say these things out loud, which is kind of funny: part of the reason for doing this podcast is that we're doing so much stuff at Radiant Earth and we need more channels to be able to talk about it. So just last week, we put out a white paper, and this will be in the show notes and I'll put it in the chat, but it's called Emergent Standards. What you said is just very relevant to this, which is that in the paper, I argue that the web has turned out to be an engine that helps people come up with new data standards. And if you look at it through that lens, you have HTML, which is, let's share a document in hypertext, you know, hyperlinked documents with one another.
And then you end up, you're like, well, what if I don't want to load up a webpage, but I want a feed of updates? And so RSS emerged out of that. GTFS emerged out of the need for standardized transit information. And I would say what you're doing, specifically with PMTiles, is a way to do this for vector tiles.
Brandon Liu:
Yeah, I have a lot of, I guess, thoughts about the idea of standards in general, both on the web and also for geo. A lot of the web we think about as standards; for example, HTML evolved very early. And maybe on the early web there was more of a design phase, where people would collaborate on creating some spec and that became a standard.
Nowadays, what you see is more like, one of the big companies that makes browsers, like Google or Microsoft, makes everyone adopt a standard because it's in their incentive to do so. If Google can convince everyone to use, what is it, like JPEG 2000 instead of plain JPEG, then they can reduce the amount of bandwidth on the internet by 20%. And all that tech
around things like serving video, serving audio and images is very mature, to where you don't really see a lot of emerging standards being adopted organically. It's more like, there's this committee at these huge companies that all collaborate on a standard. There are some examples of smaller-scale solutions that became adopted, though, and that's really how I see PMTiles fitting in with them.
It's like, I don't want it to be top-down. I don't want people to make their organizations adopt PMTiles. I want people to use it because it solves the problem for them. There's a really cool format for images that I like. It's called QOI. I think it literally stands for the Quite OK Image format. It's very modest, like its name. But I think it was just one guy who came up with
Jed Sundwall:
Right.
Jed Sundwall:
Okay.
Brandon Liu:
a way to do lossless compression of images that is a lot simpler than PNG and is good enough. It's not more optimized, but it's way faster to decode on a CPU thread. And that is one good example of, not a standard from a standards body, but something that had a simple design that became popular. And it was not adopted because it's like,
Jed Sundwall:
How popular? I’ve never heard of it.
Brandon Liu:
I think it's used, the original motivation was for games: if you have game assets and you need to be able to decompress them and move them around in just raw RGB formats, then QOI is supported by some of those engines. But actually, another one you mentioned is GTFS. So GTFS is more geo-adjacent, and that also came out of, I think, Google's requirement to have some
systematic way of storing transit routes. But it wasn't some sort of consortium of transit agencies that came together to design this CSV format. It just became a widely adopted solution because it happened to be good enough.
Jed Sundwall:
Right.
Yeah, well,
Brandon Liu:
And that’s really how I see PMTiles. Yeah.
Jed Sundwall:
Yeah. Well, so yeah, I mean, it happened to be good enough. Also, Google had this cruise ship that everybody wanted to get on. I don't know who first described it that way, but every transit agency in the world was like, we want our data to be in Google Maps, and so they had an incentive to do that. That's a concept we explore in the white paper, which is that you do need this mix of good-enoughness, because that is usually where things land: you have something that's
good enough for a lot of people to adopt. They're like, this is fine. I mean, sorry, what's the acronym for the image format? Like adequate, what is it? Quite okay. I love it. That's usually where things land. Like RSS: the story of RSS is a bunch of people fighting and a bunch of attempts at top-down approaches to syndication, until people kind of threw their hands up. But, tellingly, then the New York Times adopted it
Brandon Liu:
quite okay. Yeah.
Jed Sundwall:
and started publishing RSS feeds and everyone’s like, okay, this is what we’re doing now. So it’s fascinating to see, do you have any sense for the traction of PM tiles as being like this? Like who’s using it?
Brandon Liu:
So I have a couple of proxy ways. I don't actually know how many people are using it, because by nature, I can't track it. I can't add a tracking pixel each time someone looks at a map. The one thing I can track is the number of npm downloads. So npm is the package manager for JavaScript, and that is, I think, the most popular client for reading PMTiles. And it's something that I've
Jed Sundwall:
Yeah.
Brandon Liu:
it's something that I maintain, and that crossed 100,000 downloads, per month or per week, I can't remember, this year. So you can see a growth curve of people using this library. Now, I don't actually know if that means anything, because it could just be an automated CI script on GitHub Actions that is downloading it a thousand times. But it has some correlation with usage. So the only way that I can kind of
see if people are using PMTiles, or if it's being adopted, is through this proxy metric like npm downloads. Or people show me a site that is built using it. So actually, probably the biggest one: I think the New York Times had a visualization on their homepage about space debris falling to Earth, and that used a map dataset that was served from PMTiles.
Jed Sundwall:
Okay.
Brandon Liu:
So probably a dataset that's being served on the New York Times front page in PMTiles format is the most high-traffic use of it.
Jed Sundwall:
Okay.
Jed Sundwall:
Here we are again with the New York Times. It's kind of interesting. I mean, you think about the legacy of the New York Times: the story about them sort of crowning RSS the standard for syndication, that's true, they did that. And they do have the imprimatur to do that kind of thing, which is awesome. That's great. That is a sign that you've made it. Shout out to Tim Wallace, who probably had something to do with
Brandon Liu:
Okay.
Jed Sundwall:
with the New York Times using PMTiles. That's awesome. Okay. Well, so one thing I can say, though, is on Source, you know, we host a lot of PMTiles files, and you can correct me on all of this: there are some kind of base map objects, I think, that are in there or something like that. But if I search GitHub, it's one of my favorite things to do, is to search GitHub for
references to the Source data proxy, which is data.source.coop. And as of earlier today, it's like 612 results that pop up when I search for it. But a lot of them are to PMTiles files. Do you know anything about, do you have any insight into, that?
Brandon Liu:
Right. So the project I run is sort of an umbrella project called Protomaps, and PMTiles is just one part. And that was by design, because I never thought it would be good enough to just design a format. If you design a format, then you also have to have some killer app that makes people actually care. Because just having a spec with some implementations, people are like, that's cool, but
Jed Sundwall:
Or aware of that? Yeah.
Jed Sundwall:
Yes, yeah. Okay.
Jed Sundwall:
All
Brandon Liu:
I can't immediately take advantage of it. So the way I approached it was to have a killer app, which is a base map, or what people think of when they think of a map: you look at it and there are city names and there's water and roads and stuff. That's based on OSM. So the actual data product that is open source and free by default in the PMTiles format is this base map
that's from OSM. And I think a lot of the links to Source are to that, because, going back to what I started with, people just want some solution for showing a map on their site, you know, an open source replacement to Google that they can run themselves, that they can copy, that they can move around, that they can download, as if it was a video or an image. But I imagine a lot of the links are to that just because it's designed to be something that's immediately useful.
Jed Sundwall:
Yeah, that’s what I was guessing.
Brandon Liu:
Now, I think with Source, the CORS policy is quite open. So if there are other datasets, like a scientific dataset that is in PMTiles format, people could link to that. And hopefully people do that more, or they download from Source and mirror to their own buckets and use that.
Jed Sundwall:
Yeah. Yeah. Yeah. So I mean, this is something we're going to have to do our own analysis on at some point, which is: what is the cost of us hosting those objects? Because yeah, our CORS policy is wide open, so people can do that. And we can do the math on this, but, I mean, you know, shout out to AWS. Thank you to the AWS Open Data Program, which still exists
after yesterday. Anyway, it was a tough day for a lot of people at Amazon yesterday. There were a lot of layoffs, but the Open Data Program is alive and kicking. And they subsidize all of our storage and bandwidth for Source. But we do want to get serious about this at some point and have an understanding of, how much should it really cost to do something like this, and at what scale? We have
Brandon Liu:
Yeah.
Jed Sundwall:
all the analytics we need, we just haven't sifted through the data yet to figure out which of those objects are being hit the most, and how much, and what's the throughput that's going out. Because I know you've done analysis on the costs of doing these things. I imagine you have some data on how much it costs to deploy PMTiles. We also have a lot of this data, but we just haven't shared it yet. So.
Brandon Liu:
Right. So going back to that for a moment, though: I wonder, if you think about that idea of searching GitHub for all the links to Source, for people that are hotlinking to it, in some sense I think it's not directly correlated to success, just the number of people that are consuming Source. If people are copying the data they get from Source to their own bucket and then using that,
Jed Sundwall:
Yeah. Yeah.
Jed Sundwall:
Of course not, yeah.
Brandon Liu:
that is still using the platform as intended. By design, I don't know if Source is designed to be an intermediary platform. For example, Airbnb: you go to the site and you look up listings, but they will stop you from trying to go off the platform to make an arrangement with your host, because that's exactly against their business model. Right? That's like,
Jed Sundwall:
Yes.
Jed Sundwall:
Yeah.
Brandon Liu:
So for Airbnb, the entire point is that they're an intermediary between you, your desire for a room, and the host. Now, I don't think Source is designed as a data platform to be an intermediary for all data. There are a lot of open data platforms in the past that have worked that way, where they make it very difficult for you to consume the data outside of the platform. But it feels like, with the cloud-native focus, part of the idea is that you're able to
Jed Sundwall:
Right, right.
Brandon Liu:
you know, just package up data and take it to go, or access it just in chunks, instead of having to be locked into just using Source. So if there was some way to promote that as a first-class way to consume Source, instead of just linking to assets, then maybe that would help alleviate some of these ideas around cost sharing for bandwidth.
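The chunked access Brandon describes is just plain HTTP Range requests against an object store. A minimal sketch follows; the URL and the `fetch_range` helper are hypothetical, invented for illustration, but any S3-compatible store honors the standard `Range` header.

```python
# Illustrative sketch only: the URL and helper names are made up, but the
# Range header itself is standard HTTP and works on any S3-compatible store.
import urllib.request

def range_header(offset: int, length: int) -> str:
    # HTTP byte ranges are inclusive: bytes=0-126 asks for 127 bytes.
    return f"bytes={offset}-{offset + length - 1}"

def fetch_range(url: str, offset: int, length: int) -> bytes:
    req = urllib.request.Request(url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(req) as resp:  # server replies 206 Partial Content
        return resp.read()

# e.g. read just a 127-byte header instead of downloading a whole archive:
# header = fetch_range("https://example.com/data/archive.pmtiles", 0, 127)
```

This is the whole trick behind cloud-optimized formats: the client asks for exactly the bytes it needs, so no server-side application logic is required.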
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah, well, let me address this, and then I want to acknowledge we have a viewer, Sigtil (I'm not exactly sure who they are, but they're Sigtil on YouTube), who is joining us from Norway. So we were like, let's do this at 4 p.m. Pacific. Sorry, everybody in Europe, but we're doing it for Asia-Pacific. And it's, what, 7 a.m. where you are? So we're kind of in a
weird time zone right now. But we had somebody from Norway tuning in to ask what's in the future for PMTiles, and which changes you would like to see in the format itself, or new tools that use the format. Anyway, Sigtil, just don't go to sleep yet; we'll answer your question. The vision of Source is not so much to be an intermediary. Source, by design, doesn't really do much other than provide reliable access to objects.
So we call it a data publishing utility. It's not an analytics tool. I want people to build stuff on top of Source, so yes, I do want people to link to it. However, this is kind of my point in saying we have to do this analysis on our usage: how much is that really going to cost us if we do that? And are there ways for us to
get a handle on bandwidth and usage so that we're not abused? Or rather, abuse isn't the right term, but so that we can afford to do it in a way that's reasonable. And to say: look, if you don't want to host your own objects somewhere, and tons of people don't, that's fine. A core tenet of the product design is that we just know a lot of people don't want to host their own stuff. They don't want to run their own servers. They don't want to think about infrastructure at all. If we can
let them just link to reliable assets that are available, that's great. But we have to figure out a way to do that that could scale to the usage of something like Google Maps without bankrupting us, you know? And that means we have to figure out, for example, with the open CORS policies, do we have to have some sort of way to say: no, you have to be put onto an allow list
Jed Sundwall:
to be able to link to this, or something like that. We're going to have to figure that out. So you're right that I don't want to be an intermediary. We're not really trying to lock people into Source, but we do want to provide a service that allows people to access data without having to download and re-serve their own copies if they don't want to do that.
Brandon Liu:
Right. I mean, on the other hand, I feel like part of the messaging is that having object storage is a commodity. And in my experience, talking to developers that use PMTiles or other cloud data formats, a lot of people find using S3 very accessible, and it's not a huge lift to ask them to go put this thing in their bucket. And that's even among non-experts.
Like, I would say you could just be a front-end developer, someone that spends all their time doing TypeScript programming and knows nothing about servers, and you can figure out object storage. So part of the point I'm trying to make is exactly that: that audience, of people for whom it's too much of a lift to host something like a server, is extremely large.
Jed Sundwall:
That’s my story. Yeah.
Brandon Liu:
But just putting a thing in a bucket is actually like a very good experience. It’s very simple, it has a nice abstraction. And if you can sort of encourage the world to be more object storage-y, that’s the way I think about it. And that’s a big part of why I think PMTiles as a format has succeeded is because that audience is so large.
Jed Sundwall:
Yeah, totally. I mean, so yes, I agree. I'll just tell a bit of history. I've told this story a million times, and I'll probably tell it a lot as we keep doing this podcast, but the origin of the Cloud-Optimized GeoTIFF and all this was when I found myself at AWS building this open data program, and I figured out this one weird trick: I could just get the company to give out free S3.
But I had no engineers. I was embedded within a sales organization, so, due to HR practices, the idea of hiring engineers to build software or tools or anything was out of the question. And so I'm like, what can we get away with if we can only use S3? And I, being kind of, I guess I would say, a front-end guy, although I've never been officially hired as an engineer, loved S3.
It's a very intuitive product, super powerful, very capable. I wasn't afraid of it. And so I'll say this: you, a very talented, smart person, know how to use S3, aren't afraid of it, and neither are your friends. But there are tons of people out there that are afraid of S3. Which brings me to Source, and actually I've got to shout this out: we've been working with Development Seed on Source, and Anthony Lukash, shout out to Anthony at Development Seed, has been
just cranking out new features on Source. Today we pushed out the ability to upload stuff into S3 through the browser, through Source. So for Source users now (you still have to be invited to be a Source user), you don't even have to use the CLI. You don't have to look at the AWS console. I'm just here to tell you there's a whole universe of people out there that are like:
no, I am scared of S3. I'm scared of AWS. I don't want to look at that console. And I saw some tweet somewhere, in reference to Vercel or something like that, that was sort of like: it's amazing how big of a business you can build just by building an abstraction layer on top of the AWS console. And that's really what we're trying to do. And in fact, I do hope there will be people in the future… I mean, we already have a…
Jed Sundwall:
a bunch of other organizations that are hosting their own PMTiles on Source; they would rather put them on Source than host their own S3 bucket, or rather, manage their own AWS account. So I'll leave it at that. Let me make sure… I'm hoping Sigtil is still awake in Norway. Do you want to take this question? What's in the future for PMTiles?
Brandon Liu:
What's in the future? I would say the current version of the spec, version three, is done. There aren't any plans for a version four right now, and I think I kind of got lucky in that sense. Someone at a conference last month in Japan asked me: do you have any regrets about the format design right now? And I thought about it, and, not really. It's not perfect; the design overall has very specific trade-offs, you know?
Jed Sundwall:
Okay.
Jed Sundwall:
Okay.
Brandon Liu:
It's almost stupidly simple in some sense, and I didn't want it to get too carried away. I didn't want to embed CRS information and that kind of thing. I would say the lowest-hanging fruit for PMTiles is better compression methods, but that's blocked on browser implementations. Browsers only support gzip for the DecompressionStream API. If that supported something like Zstandard,
that would be great, but that is blocked on Apple, Microsoft, and Google implementing Zstandard support. What changes would I like to see in the format itself? The format itself is, right now, good enough for static data. I would really like to see another format emerge for dynamic data that is still S3-optimized,
one that handles rapidly changing data. Because right now, if you edited some geodata and created a PMTiles, you'd have to replace the whole file on object storage, and that is a huge trade-off. Thankfully, a lot of the data out there is static enough: you can generate this buildings dataset once, and maybe once a month you run a new job and it generates a new one. Each time, you're replacing it.
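On the compression point above: gzip is the one codec you can count on essentially everywhere, in Python's standard library as below, and in browsers via `DecompressionStream("gzip")`, which is why it is the safe default for tile archives today. A small round-trip sketch:

```python
# gzip ships in the standard library; Zstandard would need a third-party
# package (and, in browsers, vendor support that doesn't exist yet).
import gzip

tile = b'{"type":"FeatureCollection","features":[]}' * 100
packed = gzip.compress(tile)

assert gzip.decompress(packed) == tile
assert len(packed) < len(tile)  # repetitive data compresses well
```

Zstandard would typically compress better and decompress faster, but a format that assumes it cannot be read in the browser until Apple, Microsoft, and Google ship support.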
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Yes.
Brandon Liu:
What I really want to see is a cloud-native storage engine for real-time data. That would be a totally different design than PMTiles, but I think it's still possible to do a cloud-native thing on S3, for example, where maybe you have data in chunks, those chunks are addressed by a hash, and you have a header that is just a reference to hashes. Then, as you upload new data or data changes, you create new chunks, reference those,
and garbage-collect the old ones. So I would like to see some other new formats, separate from PMTiles, that address real-time data. In terms of new tools for the format, along this line, one experimental tool I have for PMTiles is a way to do deltas. You have to replace a PMTiles on S3 each time, but I was thinking about a way to rsync the data.
Like, if you have a 200-gigabyte PMTiles in the cloud, and you have one on your desktop, and they're mostly the same but one part has changed, you can use an algorithm like rsync to just fetch the parts that have changed. So that's one way, from the cloud to your computer, not the other way around. But I would like to see some use cases for that, because I sort of built it as an idea.
But there's not really a strong, compelling use case right now. So those are a lot of my ideas for the PMTiles ecosystem right now.
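The chunk-and-hash design sketched here can be shown in a few lines. This is a toy for illustration, not a real format: the chunk size, the in-memory "store," and the header structure are all invented, but they capture the idea that an update only writes the chunks whose hashes changed, and an rsync-style sync only fetches those.

```python
# Toy content-addressed chunk store: all names and structures are invented
# for illustration; this is not PMTiles or any real format.
import hashlib

CHUNK = 4  # tiny chunk size so the example is easy to trace

def chunk_hashes(data):
    """Split `data` into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def upload(store, data):
    """Write chunks into the store keyed by hash; return the 'header' (hash list)."""
    header = chunk_hashes(data)
    for h, i in zip(header, range(0, len(data), CHUNK)):
        store[h] = data[i:i + CHUNK]  # unchanged chunks rewrite identical bytes
    return header

def delta(old_header, new_header):
    """rsync-style sync: only chunks missing from the old version need fetching."""
    return set(new_header) - set(old_header)

store = {}
v1 = upload(store, b"AAAABBBBCCCC")  # three chunks
v2 = upload(store, b"AAAAXXXXCCCC")  # one chunk edited
# delta(v1, v2) contains exactly one hash: the edited chunk.
# Chunks referenced by no live header could then be garbage-collected.
```

In a real system the store would be an object bucket, the header would itself be a small object, and readers would fetch the header first and then only the chunks they need.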
Jed Sundwall:
Okay, I love that.
You're unearthing some feelings about Source. So we want Source to be kind of a one-to-one proxy for S3, but the idea being that we can create durable URLs that are undergirded by
as many object stores as we want. So if you have an object, you should be able to mirror it in lots of different regions and across clouds. And if you have your own S3-compatible object store, we should be able to point to it, and stuff like that. But a really interesting thing happened. If you go to (you'll have to look around for this) the data.source.coop repo on GitHub, which is the repo for our data proxy: this guy Sylvain Lesage, who we've been working with on viewers,
you've encountered him on GitHub. He's like, it's weird: Hugging Face can stream CSVs, but S3 can't. And he looked into it, and it had something to do with some header stuff I don't remember the details of. But it was an easy add to the proxy: it basically just passes some more information in the headers when you're calling the CSV, and then you can stream the CSV. And so
we've crossed that line. We're going to do something the S3 API doesn't do. And I can see us going down a path where we are
Jed Sundwall:
more than just a very simple abstraction on top of S3; we're extending what object stores can do. So we should keep talking about that.
Brandon Liu:
Right. And also, going back to the idea of a top-down versus a bottom-up standard: S3 has become a de facto standard, a largely undocumented standard, where every other vendor sort of only implements the features they need to be S3-compatible. And if something is wrong or broken, they're like, well, that's how S3 works, you know? So it's become this odd thing, where this quirky design that Amazon came up with
Jed Sundwall:
Yeah. That’s right. Yep.
Jed Sundwall:
Right. Right.
Brandon Liu:
is now what everyone has to do, de facto, because all the tooling is built with the assumption that this XML API exists. They're trying to do new things, though. There's S3 Express One Zone, which works differently. And there's, I think, a new way to do partial uploads: you can define an upload as being copied from a different object, and that's accelerated.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
But yeah, it would be cool if some other company came up with an actual, maybe more featureful, spec for S3. But again, probably why it succeeded to the point it has is because it's so simple. It's dumb, you know? There are no fancy semantics around content hashes and stuff. If you look at how Google Cloud Storage works, it does seem like they had some…
Jed Sundwall:
Yeah?
Jed Sundwall:
Well, right.
Brandon Liu:
whatever, level-seven engineers sit in a basement for months and come up with some cooler design that is more correct or more scalable. So there are platforms like Google Cloud Storage that seem to have more sophistication than S3, but they don't have the adoption of S3 in terms of the API: not the specific Amazon platform, but the API, the interface. And I think that is a fundamental thing, which is there's always going to be this trade-off between
Jed Sundwall:
Yeah. Yeah.
Brandon Liu:
the simpler and dumber you make it, the more likely it is to thrive organically, in terms of people being able to write their own implementations and write tools. That, I think, is also the trade-off between something like PMTiles, which, like I keep saying, is simple and dumb, versus something more full-fledged, like a server application that serves WMS tiles, for example.
Jed Sundwall:
Right, yeah. I mean, we just have to be very careful with how we go about this. I imagine you're familiar with the concept of pace layering, or pace layers. Have you heard of this? Yeah, so I'm just going to be putting stuff in the chat. It's an idea I think Stewart Brand came up with, which is basically the notion
Brandon Liu:
I don’t think so.
Jed Sundwall:
that society, our experience as humans moving through the world, is based on all these things that are moving at different rates. Nature undergirds everything; on top of that we have all kinds of different life forms, and then humans have developed culture and governance and law, and language itself. But these are all layers that evolve at faster and faster rates.
The funny thing is that the top layer of the pace layer diagram is always fashion, which is all over the place. Fashion is this kind of unpredictable, crazy thing that humans do, but it's based on these other, more foundational things like markets and law and language and so on. And so that's how I think about it. I mean, I was at Amazon for eight years, and I totally bought into
the philosophy of AWS, which is to provide primitives: to provide primitive services that are reliable and extremely durable. We had an AWS outage quite recently; things go wrong, but it's a pretty remarkably stable service considering how complex it is and how much stuff it supports. And the way they do that is by being very primitive.
I would say, to your point, there's obviously room to extend that. And I think the right way to go about it is to extend on top of the primitives, but to go slowly. You want to add layers very carefully on top.
Jed Sundwall:
All right, let's see here. Let me make sure that we're… I'm figuring out this chat stream thing. I can see it here in Riverside. Sorry, everybody out there; we're still figuring out how to do this. So I'm curious: when did you realize you could just do a really huge file? Just one gigantic file.
Brandon Liu:
So I started Protomaps, the project, before I created PMTiles. And the original plan was to have a server process that served tiles out of a database. So the original design was not cloud-native or cloud-optimized at all. It did not use range requests. It was still one file
Jed Sundwall:
Yeah.
Brandon Liu:
that you stored on a server, and you had to run this program to be able to serve it over HTTP. And then I eventually figured out that I could cut out that entire part just by making it something you could put onto S3 as a static file. So that actually came probably one or two years into the project.
And in a lot of cases, that idea of being locked into using the server process to serve the tiles is sort of a feature. For most businesses, if you have to run it on a server, that creates lock-in, you know, and you can monetize that. You can add a paywall. You can say: hey, if you want to access this thing, it goes through the server; just get this API key. Once you go over, like, 10,000
Jed Sundwall:
Exactly, yes.
Brandon Liu:
requests, then you pay a subscription, pay as you go. So it's a feature to have it be a file behind a server versus just a single static object. But then my thinking became: okay, what is the long-term way this project succeeds? And I'm like, isn't it more interesting to have it just be this single object
that you can copy around, as if it were a video? So right, the original motivation for the project was being able to create custom maps and host them yourself. Just the nature of how that was hosted evolved, from being a traditional SaaS-y server thing to being this object-storage-focused thing later on.
Jed Sundwall:
Okay, fascinating. Yeah. I mean, there's
this notion that if you control the server, if you have to be this intermediary, you get to control the data flows and also the users. I was thinking that studying Netflix is a really interesting thing to do if you think about a data business. Netflix is a data business: they sell subscriptions to data. And the way they're able to do that is by controlling the entire interface, the entire chain. And so you have to go through them, pay their subscription, and
have the Netflix experience, which is good. The fact is they provide, and there's a huge audience for, that kind of data, which is videos people like to watch. They've just nailed the experience, and people are happy to pay for it. Whereas there are certainly people out there that are like: nope, you have to have your own DVDs, or I'm going to run my own local NAS with a bunch of my own video files, because I want to have control. But most people are like: whatever, I don't want to have to think about this.
So all I'm saying, I'm underscoring the point, is that there is a business in providing that kind of service to people, but the market for maps is way too small to justify that kind of thing. That's why I think so many geospatial SaaS companies have had such a hard time: they might be able to provide a great experience to get some vectors and rasters and stuff delivered over their interface, but
the market for it is just way too small to justify it. So anyway, I'm a fan of your approach, for obvious reasons. And sorry, let me just keep going, because Rachel Googler on LinkedIn asked something relevant to this. She said: with the AWS outage last week and the Azure issues today (which I didn't know about), we've seen how reliant we are as a society on centralized cloud infrastructure. How can cloud-native formats be used in temporary local-area or
Jed Sundwall:
peer-to-peer networks when that centralized connectivity is gone, such as during natural disasters? I think you kind of answered her question already, but do you want to address that idea directly? How do you think about this?
Brandon Liu:
So I think of the Protomaps project as something that works on a server, or works on S3, but also as something that works on an SD card. If you can put a map, or a dataset from Source, like a scientific dataset, onto an SD card and carry it into the forest, then that is…
that's good enough, right? That's how most technology should work. That's how videos work. That's how Word documents work. So I think once you've built the primitives, it addresses a lot of these questions about portability and being resilient against certain failures of networks, for example. There are some interesting things around peer-to-peer. I know one of the contributors to PMTiles was
playing around with IPFS, which is this distributed storage system where everything is addressed by hash. I think it's cool. I don't know a lot about it, but I'm happy to hear that just designing a simple single-file format means it can be directly applied to, or just works with, these things like IPFS. And…
Jed Sundwall:
Yeah. Yeah.
Brandon Liu:
I haven't seen a lot of adoption for that specific peer-to-peer system outside of some more niche use cases. But in theory, you could build a really resilient network of storage for any kind of data, as long as what you're trying to serve is just these simple files.
Jed Sundwall:
Yeah, yeah. Well, I mean, again, I think the Netflix example is a good one to explain this, to highlight Rachel's point about these single points of failure that can occur. If you are relying on one system to deliver content in a very specific way, and that system is brittle and goes down for any reason, you're hosed. But this is
core to the file-based approach to data architectures, or, as I would say specifically, the object-based approach, because I like object storage: resilience in the face of a system going down. To your point, you can put it on an SD card and take it into a forest. That's perfect; that's a great way to think about it. There's kind of no way of getting around the power and effectiveness of sneakernet. However, this opens the
door to a question that I've had about PMTiles, which is that you've created PMTiles as this format, but if I show up with a PMTiles file on an SD card and give it to a random person, they will not be able to open it. They're going to double-click on it and be like: what is this? How do you get away with that? I mean, yeah.
Brandon Liu:
Yeah.
Brandon Liu:
Yeah. I think it's tough, because it sort of depends on the observer, right? The person opening it: are they opening it on Android? Are they opening it on Windows? Can I go talk to Apple and ask them to put a PMTiles viewer into macOS or something? I think my solution is this web viewer. There's a website called pmtiles.io that I maintain where you can just drag and drop
Jed Sundwall:
Right.
Brandon Liu:
a local PMTiles file, or a URL of a PMTiles in the cloud. So the intention was that the viewer emerged at the very beginning. There has to be, essentially, a file preview for these things that works locally too. You shouldn't have to spin up a web server to be able to look at something. The thing about data is people want to look at it. People don't believe it exists until they can see it.
It's just this inherent bias. The machine can read it, but people don't trust it until they can look at it. And that is a lot of why people care about PMTiles overall: they might have geodata in some format, but if they want to visualize it, they have to turn it into some more visualizable format. And that's really what PMTiles is: making visualization easy. So the answer is, as long as they have a copy of that
web viewer, which is open source, on the USB stick, then they should be able to open it offline in a browser and just open up that PMTiles file. That viewer is built using pretty standard web stuff: it uses MapLibre and some browser APIs.
Jed Sundwall:
Right. But is that all… can that viewer be… This is a very naive question: could you just have an HTML file on that stick that contains the entire viewer?
Brandon Liu:
And a JavaScript bundle, yeah. There is a static build of it, because it's hosted on GitHub Pages, actually, and GitHub Pages is just static files. So you could just clone down a copy of that HTML/JavaScript/CSS bundle and have it offline, and that should work. There is this interesting question, though, of, okay, there are certain formats for archiving. I think it's the Library of Congress: they have standards about…
Jed Sundwall:
Yeah, okay.
Jed Sundwall:
Okay. right. Yeah.
Jed Sundwall:
Yeah. Yeah.
Brandon Liu:
they recommend JPEG as a format based on the likelihood that, in 50 years… There are library-science people that are like: we have these historical scans of restaurant menus, but how do we open them? Because there's this image format that was popular back in the 2000s, and now nobody can read it. So there's this open question of:
Jed Sundwall:
Right.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Brandon Liu:
is PMTiles a resilient format by that standard of measure? And I think, the way the format is designed, it could fit on one page. I know people that have written an implementation in a different language, like Rust or Swift or something, and they can do it in a day, because the format is intentionally as simple as possible. It goes back to
Jed Sundwall:
Yeah.
Right.
Brandon Liu:
that QOI format: it needs to fit on one PDF page. It can't be a white paper, a 200-page book, to be able to write a reader. So my hope is that even if GitHub is blasted into the sun and we lose all the code, and you have to write a reader for PMTiles from scratch and all you have is the spec, I don't think it's that hard. It should be doable.
So even if you didn't have that web viewer, or a thing on a USB stick, you could figure it out.
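As a taste of how small a from-scratch reader can be, here is a sketch that parses the start of a PMTiles v3 header. The field layout (a 7-byte "PMTiles" magic string, a one-byte spec version, then little-endian uint64 offset/length pairs) follows my reading of the spec; verify against the published specification before relying on it.

```python
# Sketch of a minimal PMTiles v3 header parse. Field offsets follow my
# reading of the spec; check the published spec before relying on this.
import struct

def parse_header(buf: bytes) -> dict:
    assert buf[0:7] == b"PMTiles", "not a PMTiles archive"
    version = buf[7]
    # Four little-endian uint64s: root dir offset/length, metadata offset/length.
    root_off, root_len, meta_off, meta_len = struct.unpack_from("<4Q", buf, 8)
    return {"version": version,
            "root_dir": (root_off, root_len),
            "metadata": (meta_off, meta_len)}

# Round-trip a synthetic 127-byte header to show the shape of the data:
fake = b"PMTiles" + bytes([3]) + struct.pack("<4Q", 127, 100, 227, 50) + b"\x00" * 87
h = parse_header(fake)
```

Combined with a single HTTP Range request for the first 127 bytes, this is already enough to start walking a real archive's directories.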
Jed Sundwall:
Yeah. Amazing. This is great. We'll be announcing this right away, but the next episode of Great Data Products is, we're pretty sure, going to be with the Harvard Law School Library Innovation Lab. That's where I found my kind of librarians, you know, people who understand the benefits of object storage and these primitive, commoditized layers of storage, but who have a lot of thoughts about this.
We'll be talking about many different types of content, but I hope they hear this, because your thoughtfulness on this, I think, is really great. I mean, the tagline of this podcast is "the ergonomics and craft of data," and you're thinking so far ahead: what are the ergonomics of finding a PMTiles file in the rubble left after the nuclear winter, and people being like, actually, I can figure this out?
What a great experience you're thinking of for the future archaeologists. Yeah.
Brandon Liu:
Right. So, just as a comparison point, and it's probably fine to sort of pick on Esri stuff here (or it's not picking on it): even a file geodatabase, which is the FGDB format. There are city governments that publish FGDBs and expect you to open them, and most developers that are not in the Esri ecosystem cannot open these files.
Jed Sundwall:
Yeah.
Brandon Liu:
I think it might have been New York City; they distribute their road network as an FGDB. And, you know, that format was maybe designed 15 years ago, and even then, most people I talked to are like: what do I do with this file? I have no idea what to do with it. So that's an extreme example where it's not even a question of
being able to open the file in 50 years; it's a question of whether, even five years after you publish it, anyone can deal with this thing. And it's like, well, not really. I think it's kind of proprietary, or maybe there is some spec. But even things like shapefile: shapefile was proprietary from the very beginning, right? And then people sort of reverse-engineered readers for shapefile.
Jed Sundwall:
Right?
Brandon Liu:
And even then, there are undocumented extensions for doing indexing and stuff on top of shapefile. All of those things, I think, sort of fail that library test: are people going to adopt this if they're trying to preserve things for the future?
Jed Sundwall:
Yeah, absolutely. I mean, you're thinking the right way, you know. And what's interesting, Jackson says GeoPackage. Yeah, there's an answer there. I mean, what's remarkable
Brandon Liu:
GeoPackage, yeah.
Jed Sundwall:
about this: just how short the history of the internet and computing really is, you know? It's fun to think about what things will be like a hundred years from now, or whatever. But we went through a blip, I would say, where people were like: oh yeah, the way to control the market is by controlling the standards. Microsoft did that very effectively and developed incredible network effects through the DOC and XLS formats,
which have since been effectively opened, but who cares? By this time the damage is already done; everybody uses Word and Excel. Which, I should also say, I'm not mad about. I think they're great, obviously powerful tools that everyone uses. It's technology that's well distributed, so I'm not mad about that. But in the future, we have to think more about exactly what you're saying, which is: how durable is this going to be, really?
And that means being very thoughtful about how you design the spec, and it's usually going to be something simple. The only other thing I'll say here is that I don't want to seem like I'm picking on PMTiles, because if I double-click on a PMTiles file, nothing will happen. The same is true for Parquet, right? And Parquet is all the rage; so much data on Hugging Face right now is in Parquet. We love having tons of Parquet data on Source.
And I was showing a guy earlier today, who's not really familiar with it, but I opened it up on Source, and these are my favorite demos. My PMTiles demo is the best demo of Source, because we've got a great viewer built in and you can just look at it; it's easy for people. Thank you for the viewer that you created. And then Sylvain also built this Parquet viewer, and it's great: as of today, somebody can drag and drop a Parquet file into Source,
and they can look at it in the browser right away. And I showed this guy: here's a Parquet file, it's 800,000 rows, and it's just streaming through really easily. And we're already at a point where there's so much data out there, and so many formats being adopted, that no one's even bothering to develop desktop viewers for them. It's all being done in the browser. The expectation is that it's all going to be done over the internet, which is amazing.
Jed Sundwall:
we got some comments coming through. Yusuf from Egypt, hello. I don’t know, who knows what time it is over there. He says new versions of GDAL can open up FGDB now.
GDAL for the win.
Brandon Liu:
I think I saw that. Yeah. I think like my standard workflow now is like, I downloaded the FGDB of, it’s like New York City road centerlines. And then I do an ogr2ogr and just get it into a GeoJSON or something. But yeah, I believe there is a solution now. I remember, I think there was one time, like a decade ago, where I downloaded the ArcGIS Pro trial and activated the trial just to be able to open
the FGDB and then save it out as something else. But I think that the status quo is better now. Yeah, for sure.
Jed Sundwall:
Yeah. Yeah.
Yeah, I mean, GDAL, it just…
Shout out to Evan. A few more comments on YouTube. Jackson, hello Jackson. He says he’s in the midst of writing an implementation of GeoPackage in Julia. Good luck. Let us know. If you want to write about that on the CNG blog, we have a process for submitting stuff to the blogs. That’d be cool. It’s 2:52 in the morning where Yusuf is. Brandon, you are very popular. People are like, this is incredible.
Sun never sets on the Brandon Liu Protomaps empire. And then we’ve got Sigtil again from Norway, staying awake. I love this late night energy we’re getting. Asking, how do you see the new kid in town, GeoParquet, versus PMTiles? They have some of the same properties and some differences also. As you said, there’s a lot of nuance. Yeah, so yeah, we have Zarr, COG, FlatGeobuf.
Brandon Liu:
Cheers.
Jed Sundwall:
You’ve explained this to me before, sort of the nuance between what PMTiles does as opposed to what GeoParquet does. I mean, I have my own guesses about this, because GeoParquet is like more about data than PMTiles, which is more about viewing. Is that how you would describe it? Or what’s your response there?
Brandon Liu:
That’s how I see it. Yeah. So I make the distinction between a format that is for analysis versus a format that’s for visualization. And I think that’s maybe not intuitive, because in some cases, those are the same. Like for a COG, viewing it and analyzing it are sort of the same, because analyzing it means, what is the value at this pixel? And viewing it is like, show me the raster, you know, colored in some way.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
For PMTiles, a lot of the use cases right now for PMTiles are vector-based. And for vector, you sort of need to split out the analysis and visualization into separate things. Because if you wanted an overview for a vector dataset, you can’t really show everything. It would be too noisy. So PMTiles is inherently generalized. Like it has like an overview pyramid.
Jed Sundwall:
Yeah.
Brandon Liu:
So you can load it at any scale and it looks correct. But what you actually see at that level is not everything. You have to do some filtering down of the data. Sort of like for COGs, you have to build overviews that are smaller and smaller downsampled resolution images of the full thing. So GeoParquet does not have a lot of use case overlap with PMTiles, because GeoParquet is
an analytical format. It’s just all the raw data, and then only one version of each data point. While PMTiles will have copies of a single data point, because it has to build those overviews. Now, there are approaches to using GeoParquet and visualizing it directly. Like, for example, there’s a project called Lonboard that lets you just show
Jed Sundwall:
Right, right, right.
Brandon Liu:
GeoParquet on a map. Whether or not that’s practical to use on the web really depends, because if you want to be able to download an entire GeoParquet dataset of a city to visualize, that might be 200 megabytes, which is more than people usually expect for a single web page. I mean, it’s possible that in 10 years, bandwidth will be so fast and cheap
that downloading 200 megs for a single webpage might not matter. And maybe at that point, we don’t actually need a visualization format. We can just be downloading raw data everywhere. But I expect some sort of strategy around being able to visualize data with overviews is always going to be necessary, just because some datasets are just really big. Like, there’s building datasets on Source that are maybe half a terabyte, like the open buildings datasets.
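The overview pyramid Brandon describes — smaller and smaller downsampled copies of the full image, as in a COG — can be sketched in a few lines. This is purely illustrative Python, not code from any of the tools discussed:

```python
def pyramid_levels(width, height, tile=256):
    """Successively halved overview levels for an image,
    stopping once the whole image fits in a single tile."""
    levels = [(width, height)]
    w, h = width, height
    while w > tile or h > tile:
        # round up so odd dimensions still cover the full extent
        w, h = (w + 1) // 2, (h + 1) // 2
        levels.append((w, h))
    return levels
```

Each level here is half the resolution of the one before it, which is exactly why any zoom level of the pyramid can be served without touching the full-resolution data.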
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah, the VIDA datasets, those are my favorite demos. They’re like 300 gigs or 230 gigs or something like that. It’s like, yeah, it’s only going to be streamed.
Brandon Liu:
Yeah.
Jed Sundwall:
My assumption is that storage will keep getting cheaper. There’s still plenty of room to progress in terms of the cost of storage itself, but bandwidth, networking, has actual physical limits in terms of the speed of light. The movement of bytes across space is really hard.
One, actually Qiusheng Wu, awesome to have Qiusheng on here, says that DuckDB supports serving vector tiles through Parquet, so that’s on LinkedIn. So, cool. It’s great. And then we have another, I wanna talk to you about the Hilbert curve. We’re at about an hour, so we can maybe start wrapping it up. But then Alex Kovac asks, and I’m gonna test this out, I’m still figuring out how to do this. You can see it, okay, so.
Brandon Liu:
Nice.
Brandon Liu:
I see it.
Jed Sundwall:
How did, I think the people on LinkedIn can’t see this. So this is tooling on PMTiles. And also for the purposes of the people listening after the fact, Alex says tooling around PMTiles, such as the viewer, CLI, Tippecanoe, the basemaps package, et cetera, is super convenient. How did that evolve? And do you think there’s anything big missing? Yeah.
Brandon Liu:
Yes. I think the part I put the most thought into was the overall developer experience of using PMTiles. And from the beginning, it had to be like a single binary you could just download. I did not want you to have to homebrew install or npm install or Python package install a package, just because that’s going to fail for a lot of people.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
If you’ve ever been to a workshop where people use Python, like a scientific workshop where people are like, we’re providing the material as a Jupyter notebook. And then someone’s like, I’m on Windows. And then you’re like, just use Conda. And then you’re trying to fiddle with this Conda setup. And I’m like, I just don’t, like, I feel like it pushes people away. Like, I understand that that tooling is mature, but for me, I think the best developer experience for any sort of data tooling is like…
Jed Sundwall:
You’re right.
Jed Sundwall:
Yeah.
Brandon Liu:
just download a single binary. Those are the tools I see having the most adoption and least problems in terms of the installation. So the installation has to be super simple, like a single download. The viewer we talked about, the web viewer, is for browsing PMTiles files. I would say if there’s something big missing, I think Tippecanoe is great,
and PMTiles support for that is built in, thanks to Felt. But I would say it’s still too hard to install. A lot of people that want to build PMTiles, they get stuck on, how do you install the vector tile generator? I would say that is the biggest missing piece, which is to have a single binary download vector tile engine.
Jed Sundwall:
Okay.
Brandon Liu:
Like a lot of the limitation for that is because the libraries you need to do geometry processing are generally only in a couple of languages, like C++, Java. And right now the CLI is a Go program, and there’s no good libraries for that in Go. Even Rust doesn’t have that great of support. You probably need to bring in GEOS via C++ bindings. So the biggest missing part is still some…
Jed Sundwall:
Yeah.
Brandon Liu:
easy-to-install, large-scale generator for vector tiles. It’s something I do want to work on, but right now I think the Tippecanoe solution is good enough. But it’s the major pain point for using PMTiles.
Jed Sundwall:
Yeah. I mean, talk about ergonomics of data. The way you think about this is so great. Everyone, learn from Brandon. You’re so thoughtful. See, this is you helping level up the species, just by thinking through things this way. Because yeah, it’s so goofy. I mean, I’ve been in all these hackathons, in these rooms where people are like, yeah, like,
you end up spending half the time debugging people’s Python installations. It’s just like, no, there’s got to be a better way. Yeah.
Brandon Liu:
Right. There’s also this idea of different kinds of complexity. There’s like inherent complexity versus incidental complexity. And I think a lot of solving these pain points is around solving incidental complexity, which is just complexity that happens to be there as an artifact that is not related to the actual problem we’re solving. Like maybe you’re trying to solve some route optimization problem. And that is it’s…
is inherently an interesting computer science problem. But then the incidental part is like, I need to install these packages with Conda, and Conda doesn’t like the wrong version of something on my machine. And all that stuff is just the part that, like, we have to eliminate in order to actually get to working on the hard problems.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah, exactly. What’s the line? It’s sort of, you make the hard stuff easy and the impossible stuff possible, or something. There’s some axiom around, you know, guiding software development along these lines, which is like, we should be continually progressing in that direction. But you’re asking all these great questions, or framing it in the right way, which is just sort of like, you imagine somebody who’s coming to a hackathon,
how quickly can you get them up and running? If you’re gonna take an SD card into the forest, what can you actually do with that, realistically? And I often think in terms of, this is what I was saying before about Excel and Word being very successful, is that they are sufficiently distributed technologies. The whole idea that the future’s already here, it’s just not evenly distributed. There are some that are evenly distributed, like spreadsheet software.
Like everyone can open a CSV. That’s awesome. CSVs are great because of that. But you know, as we’re getting better at producing more complex forms of data, we need to think about the ergonomics in that way. Like, what are the experiences of people being introduced to this? So, Yusuf says that Tippecanoe on Windows is a nightmare, by the way. So FYI.
Brandon Liu:
I’ve heard that as well. Yeah. Yeah, I’m aware.
Jed Sundwall:
So I remember years ago I asked you if you’d ever seen the movie Tar.
Brandon Liu:
which I still haven’t, but I need to now that you’ve mentioned it twice.
Jed Sundwall:
Okay, well, I’m just like, it’s a, Tár is a weird, Tár fans come out and tell me if you’ve watched the movie Tár. It’s Tár with an accent on the á. It’s a Todd Field movie, in which David Hilbert is a character of sorts. Like he just shows up in the background, and I think there are references in the movie to the Hilbert curve.
Tell me about the Hilbert curve. Let’s close on this. Why the Hilbert curve and how did you get into space filling curves? I love this stuff.
Brandon Liu:
I kind of ripped it off of S2. So S2 is Google’s geospatial indexing library, and they use the Hilbert curve there. It has some nice properties that make it work well for geodata. And the motivation behind this is even in Cloud-Optimized GeoTIFF,
Jed Sundwall:
Okay. Yeah.
Jed Sundwall:
Yeah.
Jed Sundwall:
Okay.
Brandon Liu:
People argue about, like, so we’re making a cloud-optimized format, but how big should the blocks be? You know, you’re fetching blocks. If you have small blocks, those are good for certain use cases. If you have big blocks, those are good for more bulk downloading use cases; it’s more efficient. And there’s some trade-off between small blocks and large blocks. But the Hilbert curve is, it’s like a lazy way to get around that argument,
because it’s both small blocks and big blocks in the same format. You can actually have any size block, as long as it’s a power of two. And the reason this is good for PMTiles is because one of the operations on PMTiles is extracting one part of the world from a larger file. And the imagined use case for this is, so I host my OpenStreetMap dataset on the cloud.
Jed Sundwall:
Yeah.
Brandon Liu:
But maybe you only care about Seattle. You don’t want to have a copy of 100 gigs of the whole world. You only want Seattle. Or maybe you only want Capitol Hill. So the block size in the archive should be small if you only care about a neighborhood. But if somebody else wants all of Canada instead, then they want to be able to have a format that has big blocks so they can download Canada in one chunk.
So the Hilbert curve is useful because it encompasses both of those use cases without having to make a trade off. Because if you did small blocks, it would be good for Capitol Hill, it would be bad for Canada. If you did big blocks, it’d be good for Canada, it’d be bad for Capitol Hill. So because the Hilbert curve is sort of scale-free, it has the same self-similar structure at every power of two.
you sort of get the best of both worlds in one thing. And that’s really the motivation for why the Hilbert curve was useful for this design. I would say it’s not fundamentally essential. You could build a pretty good format just using other space filling curves, like a Z-order curve. There are some drawbacks in terms of it’s more computationally expensive to decode the Hilbert curve versus other ones.
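For the curious, the curve mapping Brandon is describing can be written compactly. This is the standard textbook Hilbert transform (adapted from the well-known C version on Wikipedia), shown for illustration — it is not Protomaps’ actual implementation, which works on tile IDs:

```python
def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so sub-curves connect end to end."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Map (x, y) on an n x n grid (n a power of two) to its
    distance d along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d

def d2xy(n, d):
    """Inverse: distance d along the curve back to (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y
```

The useful property is visible in small examples: consecutive distances along the curve are always grid neighbors, at every power-of-two scale, which is the self-similarity Brandon calls “scale-free.”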
Jed Sundwall:
Yeah.
Jed Sundwall:
Okay.
Brandon Liu:
For example, there are these Bing quadkey tile indexes that are much faster to compute than the Hilbert curve. For most use cases, though, the cost of decoding and encoding the Hilbert curve is trivial compared to the network. If it spends two milliseconds converting a bunch of tile coordinates on the Hilbert curve, then you’re spending 50 milliseconds fetching something over the network.
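The quadkey scheme Brandon contrasts it with really is much cheaper to compute: each zoom level contributes one base-4 digit, built from one bit of x and one bit of y. A sketch following the published Bing Maps tile system convention:

```python
def quadkey(x, y, z):
    """XYZ tile coordinates -> Bing Maps quadkey string.
    Each digit encodes one zoom level: bit 0 from x, bit 1 from y."""
    digits = []
    for i in range(z, 0, -1):
        mask = 1 << (i - 1)
        digits.append(str((1 if x & mask else 0) + (2 if y & mask else 0)))
    return "".join(digits)
```

Unlike the Hilbert curve, this is just bit interleaving, so there is no rotation bookkeeping — but consecutive keys are not guaranteed to be spatial neighbors.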
Jed Sundwall:
interesting.
Jed Sundwall:
Okay.
Brandon Liu:
So like overall, like holistically, the price you pay for using the Hilbert curve is not that much relative to other things going on in like in some actual use case. But that’s like kind of the whole story as to why we use this like weird thing that is apparently in a movie as well.
Jed Sundwall:
Yeah, I mean, just the movie. I turned the light red again, just because it’s kind of a spooky movie. Let me, there’s BV on YouTube asked a question, if H3 grids are similarly useful. But one thing I want to clarify about the Hilbert curve, to make sure I understand it, which I’m pretty sure I don’t, which is that the idea is that you can map two dimensions along one dimension.
Brandon Liu:
Yeah.
Jed Sundwall:
Right? Like, you just have one string that can be extended into two dimensions, effectively anywhere at any resolution you want. If I’m loading up the Canada tile, am I just loading up one band? Like, how does it work? Or is it making multiple requests to do that? Can you explain that even? It sounds like the kind of thing you would need a whiteboard to describe, but.
Brandon Liu:
Yeah, you’re opening up multiple like, so if the entire world is on one length of string, then Canada is multiple segments of that range of string. Now, where you can adjust is how finely traced the borders of Canada are because
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
If you’re working in a networked environment, you can do some optimizations. You can say, I’m going to grab a little bit more data than I need, but have fewer ranges. I can represent Canada using fewer segments of string, even though I get a little bit of America on the side.
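That optimization — accepting a little over-fetch in exchange for fewer HTTP range requests — is a generic trick for any range-request reader. A rough sketch of the idea in Python (illustrative, not Protomaps’ actual code):

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (start, end) byte ranges whose gaps are at most max_gap,
    trading some wasted bytes for fewer HTTP range requests."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # close enough: extend the previous range, fetching the gap too
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged
```

Tuning `max_gap` is exactly the trade-off in the conversation: zero means no wasted bytes but many requests; a large value means few requests but some “America on the side.”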
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jed Sundwall:
Right.
Brandon Liu:
Pretty much that, like there isn’t really one Canada tile, but you can sort of trace out a contiguous segment of the file that is all next to each other, that is all inside of Canada. And then maybe grab a little bit on the sides for like different outline areas. But the interior of Canada, as long as it’s like an area, you know, like most countries in the world or most regions are not like Chile where it’s just like one long thing.
most of them are like kind of rectangular-ish, you know, they have like an interior and then like a border. So this sort of space filling curve is well suited to how people usually think about areas as having like an internal volume and then being able to slice that into just parts of this space filling curve without having to, you know, like use an excess of
Jed Sundwall:
Yeah.
Jed Sundwall:
Okay.
Got it. And then one follow up question on that from the chat is that, is there a benefit here that also these requests are close to each other? Meaning like, you want to look at the full Canada tile and then like the Vancouver tile, should they be near each other? My intuition though is that that shouldn’t matter with object storage and range requests, because it’s not like you’re.
He’s saying like, it’s similar to like how you defragment an old spinning hard drive, but like, that’s not how object storage works. I mean, we’re not assuming that we’re using spinning disk. We might be, but do you have any insight there? Yeah.
Brandon Liu:
Right, so it matters a lot on HDDs because it’s like on those old spinning hard drives, it’s like you have to move the needle more if they’re not by each other. But I think most storage now is solid state and there’s not a huge difference in the seek time for like a far away chunk versus a near chunk. But yeah, there is also benefits to certain operations. Just having parts that are close in space also be close in the file.
Jed Sundwall:
You have a head. That’s right. That’s right.
Brandon Liu:
that is taken advantage of in some parts of the tool.
Jed Sundwall:
Okay. And then let’s, do you have opinions about H3? I mean, so BV is asking, are H3 grids similarly useful? I see it as probably not, but I don’t know how H3 content is. H3 is more of an indexing concept, you know.
Brandon Liu:
H3 is really useful for visualization. Yeah, I think it’s like, so with H3, you’re usually storing a value in each cell. And it’s really great for making really good looking visualizations of data with hexagons. There are some trade-offs. Like in H3, one hexagon does not perfectly nest
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Brandon Liu:
into its child hexagons, while in tiles there is a perfect nesting. But for certain use cases, like showing aggregate statistics, it doesn’t matter. So I would say H3 grids are the perfect match for certain use cases around visualization that are separate from doing tiling.
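The “perfect nesting” of square tiles that Brandon contrasts with H3’s approximate hexagon hierarchy is easy to state in code: every web-mercator tile at zoom z splits into exactly four children at zoom z+1, and each child has exactly one parent. A quick sketch:

```python
def children(z, x, y):
    """The four child tiles of a web-mercator tile at the next zoom.
    Together they cover the parent exactly, with no overlap."""
    return [(z + 1, 2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]

def parent(z, x, y):
    """Inverse: every tile has exactly one parent at zoom z - 1."""
    return (z - 1, x // 2, y // 2)
```

H3 has no such exact relationship — a hexagon’s seven children only approximately cover it — which is fine for aggregate statistics but rules it out as a tiling scheme of this kind.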
Jed Sundwall:
Right, right.
Jed Sundwall:
Right.
Yeah, exactly. Yeah, that’s sort of my understanding. And it is especially good for visualization, but then also statistics. So if you’re doing analysis on, I mean, you just think about the origins of it, with Uber wanting to measure demand and activity in very certain areas at different grains. It’s like perfect for that. So, okay. Well, look, we’ve been going for an hour and 15 minutes. This is incredible. We’ve got…
people staying up to all sorts of crazy hours. Guys, go to bed. Again, there’s a podcast. This audio will go out, so you can listen to it whenever. But really, we have been honored. People are honoring us with their time. I hope this has been interesting for them. Brandon, I love talking to you. I obviously love what you’re doing. We’re very proud to have you as a Radiant Earth Fellow, and have had you as a fellow for a long time.
Oh man, are you serious? Sigtil in Norway won’t let up. He’s got to go to bed, but he’s asking more questions. Are there some geometries that are not supported, or more difficult? For instance, polygons with, boy, with holes, and holes made of curves, et cetera. What was the most difficult geometry to work with across tiles? This is too hard of a question. Are you seeing this comment? Go for it.
Brandon Liu:
No, I’m able to address this. Yeah, I mean, this is a good, like, deep question, but it goes back to what I was saying, which is that there are certain geometries that are hard to deal with. And a lot of it is, you have to have a geometry library that is very robust against certain numerical precision errors. And the only libraries right now that get it totally right are basically GEOS, which is part of
Jed Sundwall:
All right, do it and then we’ll wrap it up. Okay.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Brandon Liu:
part of PostGIS, and JTS, which is a Java library that is related to GEOS. And then a couple other ones, like there’s one that Mapbox made. But yeah, that difficult geometry is the limitation in being able to write an easy-to-install vector tile generator. So I’m happy to follow up over email or something if you wanna know more about geometry processing, cause it’s like a really deep
Jed Sundwall:
Yeah.
Brandon Liu:
subject that sort of is a stealth hard problem. People don’t realize how hard that problem is until they find some weird geometry that’s broken. But yeah, that is a good question. And again, I’m happy to talk about it more.
Jed Sundwall:
Okay, and then, so to contact you, I’m gonna just put in the chat, protomaps.com, go to protomaps.com, there’s info down at the bottom with how to reach you. So, you’re easy to reach. Obviously, everyone listening to this knows how thoughtful you are. So, anyway, I mean, thanks so much for what you’ve given to our community.
Can’t thank you enough. Anything else you want to talk about? we missed?
Brandon Liu:
I just wanted to say thanks for having me on the podcast. I am also on the CNG Slack and the Source Cooperative Slack. Which one do you want people to use? If people are CNG members, then they can join.
Jed Sundwall:
That’s right, yeah.
Well, yeah, so CNG members, you got to be a member. For both, you kind of have to be a member. So membership to CNG is pretty cheap. We say it’s a symbolic fee. These memberships don’t really add up to pay many bills, but we ask people to pay to join CNG just to make sure that we know that people are there on purpose. They really want to be there. So join CNG if you’re not, and Brandon’s in the Slack there. Source is still invite-only.
But Source, so yeah, the best point of entry right now is the CNG, the Cloud-Native Geo Slack. You can go to cloudnativegeo.org slash join and learn about it there. I’ll put that in the chat as well. But yeah, thank you. Yeah, it would be great to see people interacting with Brandon on any of our Slacks, but he’s easy to find otherwise.
All right. And then it’s, what is it, 8:17 in the morning there now?
Brandon Liu:
It is, yeah. It’s bright and early.
Jed Sundwall:
You got a whole day ahead of you. All right, well, happy Thursday. Thanks again for doing this. I bet we’ll do it again.
Brandon Liu:
Awesome, yeah, I’m looking forward to the next episodes.
Show notes
Jed Sundwall and Drew Breunig explore why LLM progress is getting harder by examining the foundational data products that powered AI breakthroughs. They discuss how we’ve consumed the “low-hanging fruit” of internet data and graphics innovations, and what this means for the future of AI development.
The conversation traces three datasets that shaped AI: MNIST (1994), the handwritten digits dataset that became machine learning’s “Hello World”; ImageNet (2008), Fei-Fei Li’s image dataset that launched deep learning through AlexNet’s 2012 breakthrough; and Common Crawl (2007), Gil Elbaz’s web crawling project that fueled 60% of GPT-3’s training data. Drew argues that great data products create ecosystems around themselves, using the Enron email dataset as an example of how a single data release can generate thousands of research papers and enable countless startups. The episode concludes with a discussion of benchmarks as modern data products and the challenge of creating sustainable data infrastructure for the next generation of AI systems.
Links and Resources
Key Takeaways
- Great data products create ecosystems - They don’t just provide data, they enable entire communities and industries to flourish
- Benchmarks are data products with intent - They encode values and shape the direction of AI development
- We’ve consumed the easy wins - The internet and graphics innovations that powered early AI breakthroughs are largely exhausted
- The future is specialized - Progress will come from domain-specific datasets, benchmarks, and applications rather than general models
- Data markets need new models - Traditional approaches to data sharing may not work in the AI era
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
All right, well, Drew, welcome to Great Data Products, episode one. Thanks for doing this with us.
Drew Breunig:
Not a problem.
Jed Sundwall:
yeah, as I said, I’m going to ask you to introduce yourself in a second, but before I do, I just want to explain a little bit why we started this podcast, which is that we believe that
understanding what makes a good data product is just very understudied. We’ve been doing it as a species for a while now, every now and then sharing data. There have been laws on the books saying, you know, thou shalt open your data, or policies from research funders saying that researchers need to open their data. Sometimes it goes well and sometimes nothing really happens with it. And I think we have enough experience under our belt now that we can see there are a handful of data products that have come out
that have had a huge impact on research. And we’re at the point where we’ve got to figure out, like, why? Why those? What made them good? Elinor Ostrom said this somewhat famously, at least for me, I’m a big fan of hers, but you know, she’d spent all of her life working on trying to understand how people share common resources that are limited, like a fishery or a forest or, you know, grazing fields and stuff like that.
And she’s like, look, we know this happens. Humans have figured out how to do this. We know it works in practice. Now we have to figure out how it works in theory. And I love that. So that’s what we’re doing, is trying to figure out. We know that some data products are really great. We want to tease out some theories to explain why. So, for reasons that are obvious to me, but might not be obvious to everybody tuning in or listening, you were one of the first people I’d ever want to talk to about this. So why don’t you explain a little bit about your
background and what you do.
Drew Breunig:
yeah, first I want to put a pin in that quote you said, cause I think one of the things that’s crazy about that is, like, a fishery is a zero-sum game. That is an exhaustible resource. Data products have entirely different dynamics. Like, you can go full old-school Boing Boing, Cory Doctorow, data wants to be free, it’s not theft if you can reproduce it. But at the same time, it grants you this immense advantage
that then allows you to create more data in a way that isn’t free. It’s kind of, anyway. So yeah, my name is Drew Breunig. I write a bunch on AI and data. I’ve been working in data. I helped run, or ran, data science and products at a company called PlaceIQ for about a decade. And then led strategy at Precisely when it came to the data and
Jed Sundwall:
Right. Yeah.
Drew Breunig:
intelligence business. I see data as a really interesting space because it’s an intersection between humans and compute, essentially. Because you’re essentially converting humans, or work of humans, or observations made by humans, into something that is programmatically readable, so you can build products upon it. I also think the other thing that’s interesting about that is that’s not a one way street.
It’s a two-way street. So you are converting humans into data, but at the same time you’re preparing data and figuring out how it can be leveraged to inform those humans. So kind of making data human, making humans data. And that is an active negotiation of borderlands as it were, rather than just one way that comes in and goes out.
Jed Sundwall:
Oh man, fantastic. All right, see, this is a rich well to draw on. Um, yeah. And what you just said about Cory Doctorow and sort of the economics of this, I think, and I’ll just keep saying it out loud over and over again, this is the Nobel Prize challenge: can we figure out how data functions as a market good? Because it’s weird, right? Like, to your point about
Drew Breunig:
Yeah, we can go, but.
Jed Sundwall:
What Ostrom was studying was limited resources, which she called common-pool resources, with the assumption that they were limited and you needed governance to manage access to them. And just to, yeah, just a quick primer on Ostrom for a lot of people, and I’m not like a full Ostrom scholar, but a lot of what made that work was the fact that you had to live with the people that you shared the resource with. And so if you were a jerk about it, like, you would get punched. Like, that’s just part of it. And yeah.
Drew Breunig:
Yeah.
Drew Breunig:
Yeah, mean, guess you can kind of say that exists when it comes to licenses, which is a whole different messy world, which is like, god, please don’t. So much of my beef with licenses is that it’s the will of people when the data wants to be free. And the real way that you can kind of put your fingerprint on the market is you actually put the data out there in the shape and the form that is
Jed Sundwall:
yeah, we’re going to talk about licenses.
Drew Breunig:
what you want that makes what you want in the world to happen. But the idea of releasing it and then gating it is just insane to me. It doesn’t make any sense. It’s backwards. You kind of want the option, but you want to control how people use it, which is just like, why are you even bothering in the first place? But yeah, and I think that’s like, now you’re getting into the familiar terrain of like the data is the new oil claims and other things like that. And I feel like that’s a quote even that we debated and wandered around.
and talked about for decades. And part of the reason we talked about it is because it made people who work in data feel important. It made them feel like, this justifies my paycheck, my job title, my power within the organization. But I don’t really feel we got to the point where data is the new oil became somewhat true until LLMs and post-ChatGPT
Jed Sundwall:
yeah. yeah!
Drew Breunig:
Specifically, those were the engines needed. You can create oil, but if no one owns an engine, no one has anything they can do with it. That's kind of the era we were in while we were figuring it out. We could drill it, and we knew there was potential energy there, but what do we actually turn it into? And prior to that, there was one thing you turned it into, which was ad products. That was the one thing. That was the way to monetize. And now we're turning it into large language models and other things like that.
Jed Sundwall:
Right.
Drew Breunig:
Figuring out the economics of it, I believe, is hard. Because, I don't know, I think one of the things is you can find so many different metaphors for this, because it's a complex thing, and there's no single bucket that kind of reins it in. But I do think one of the king metaphors for data is that it's the platypus. Because, well, what is a platypus, Jed?
Jed Sundwall:
Go? Go on.
It’s all sorts of crazy stuff.
Drew Breunig:
Yeah, it’s got a bill. It’s poisonous, lays eggs, mammal, it’s got fur. Yeah, it’s like that’s that’s data. Like it sometimes you can, you can make it like oil. Other times you can make it like a lighthouse, which is like a public good that makes it so ships don’t crash. And you can put it at the right
Jed Sundwall:
Yeah. Lactates in a really weird way. Yeah. Sure.
Jed Sundwall:
Mm-hmm.
Drew Breunig:
point that encourages very specific trade routes to occur and economic activity to occur. And so you influence the world by putting it out there. And because it's a public good that can't be gated, that became something governments did. And you could make the same argument, and you can find countless other metaphors for data as you run into them. But I do think when you put a data product in the world,
getting towards the definition of what this podcast is, a great data product creates an ecosystem around itself, I think is the way I would say it. And perhaps this can happen intentionally, and it can also happen accidentally. And so by way of kicking this off, I almost want to pose to you
Jed Sundwall:
Yes, yeah.
Drew Breunig:
what I think is the best data product ever created, or one of them, the Enron email data set. Are you familiar with this one?
Jed Sundwall:
Let’s go.
Jed Sundwall:
Ah, I am. Because, just a flashback here: when I joined AWS in 2014 to build the open data program there, AWS already had this thing called the Public Data Sets program, which sort of preceded me. AWS had already been dabbling in sharing open data, but there was no kind of program around it, and the program was somewhat abandoned. But.
Drew Breunig:
Yes.
Jed Sundwall:
How it had been set up was using Elastic Block Storage volumes. So this is data that, to access it, you had to turn on EC2. You had to turn on a server and then attach one of these volumes to that server. Then you could access it. We had all these EBS snapshots, these volumes of data that you could load up. One of them was the Enron email database, but there were some other funny ones: there was a cannabis genome,
maybe the Marvel Cinematic Universe, something like a graph database of Marvel characters, and some Japanese census data that someone found. It was kind of this fascinating snapshot. I'm sure there are plenty of Internet Archive screenshots of the site. It was just sort of, here's some random data that engineers at AWS found circa 2012. But yeah, the Enron database was in there. So go on, let's talk about it.
Drew Breunig:
Yeah.
Drew Breunig:
Well, I just think the Enron email database — so for those of you who aren't familiar with it: Enron was a company that blew up in spectacular fashion. When did it blow up? Like 2001, 2002?
Jed Sundwall:
And you’re talking blow up pejoratively, like it was catastrophic.
Drew Breunig:
Yes. Yes, it was not a physical, literal blow-up. It was just a mountain of fraud. And when the case broke, there was a ton of public anger. A lot of people lost their pensions, a lot of people lost their stock, and effectively it went to zero and got taken over. And it was a big company. In 2003, as part of the court proceedings,
Jed Sundwall:
Yeah. Yeah.
Drew Breunig:
I think it was the Federal Energy Regulatory Commission released the emails of about 150 senior Enron executives. So this is about 1.6 million emails that get released. And this is 2003, keep in mind. That is an amount of email that would be out of reach for most people, because it was just incredibly hard to download. Though putting it in AWS, I'm sure, made it very popular. If you search "Enron email dataset MapReduce," you will find hundreds and hundreds and hundreds of tutorials. And so it became this incredibly popular dataset that people wrote papers about, about the internal dynamics of workplace culture and language. I think at one point there were like 30,000 papers a year
that were citing this. And when I checked Google Scholar, it maxed out; it was over 100k. Then you start to look at the companies that were booted up around it. I know multiple startups who started building email software or enterprise SaaS software with the Enron email dataset. You would start with it to build your products around, because there was no other email dataset. Even today, you see it used in AI evals and pipelines.
Jed Sundwall:
Interesting.
Drew Breunig:
is just this, it’s the only large email data set that is friendly license free to use. And it is generated an immense, I think it would be a very fun study for someone to do would be to calculate what the economic benefits of this email release from this absolutely failed company and how much it generated from this. so like, to me, that has the qualities of a great data product, which is it provides data that wasn’t existed anywhere else. It doesn’t, so you
There was no competing offering, and any competing offering was just a minuscule, minuscule amount. Two, it has legs. We are, I want to say, 22 years since the release and it remains as relevant as ever. It was freely available, accessible, and easy to work with despite its size. It was a very common MapReduce demo, as I said, which would be the first step you would take in dealing with it.
And it created an ecosystem around it, which I think is the biggest test for good data: do things grow out of it? I was at the Monterey Bay Aquarium this weekend and they had an exhibit on whale falls: when a whale dies, it sinks to the bottom and starts to decompose, and all the critters come to eat it, and it's this feasting moment. The Enron email dataset was the equivalent of a whale fall.
Jed Sundwall:
Yeah.
Very juicy. I mean, yeah, so much material in there. No, I love this. You're making me think about this white paper that will come out eventually; I've been working on it for way too long. I may have mentioned it to you, but it's called "Emergent Standards," where I make the case that the web is an engine for people to come up with new standards. And so, basically, the server-client dynamic of the web is that
if you have a server and a client that can talk to each other in a way that makes sense to one another, it works. It's worked with HTML, and then we've figured out other ways to send more and more complex things over it. What I talk about in the paper is RSS, because we wanted to figure out how to syndicate stuff to one another. STAC catalogs. And what's the other one, GTFS, the General Transit Feed Specification. And basically,
Drew Breunig:
Yeah.
Jed Sundwall:
what people don't understand, or what a lot of people in policy don't understand, is that this is an emergent thing that happens as communities grow around types of data. So I'm agreeing with you, but one conclusion I try to land on with this white paper is that this is effectively like language. If it's useful to people, it will be adopted, right? And so to your point about this collection of emails,
Drew Breunig:
Yeah.
Jed Sundwall:
it's practically useful to a lot of people, so people have adopted it and it's become a thing.
Drew Breunig:
Yeah, and I think the other thing too is that it's so much easier to create that standard, or have a successful dataset, if you're operating in the white space where it doesn't exist. So I work on the Overture Maps Foundation, as you know, and that's a little bit of hard mode, because you're trying to establish a standard where standards already exist to some degree.
Like, OpenStreetMap is really built more to be a map rather than a dataset, so it doesn't have great data standards for easy data usage. It's starting to adopt a lot of the moves that we've made at Overture, but at the same time it exists, it provides an alternative, and so it means we have to be that much better. Whereas with the Enron dataset, there's still no replacement for it. I was just looking at the Pile. The Pile is a big dataset that's about,
what is it, about 900 gigabytes? It was used to train Llama; it's used to train lots of open models, and we can assume it's being used to train closed models as well. Again, it's what, 900 gigabytes, and the Enron emails are still in there. They're still one of like 25 sources. There is no better
Jed Sundwall:
Amazing.
Drew Breunig:
email dataset. So operating in the white space means you get more rein to create those standards as you go.
Jed Sundwall:
Right. Interesting. Give me one interlude here. We've got some technical difficulties; we've got to make sure the YouTube live stream is working, or the chat is working. It's apparently disabled, so I'm going to do a thing. Hey everybody, there are people on YouTube. I'm going to click on something and I don't know what's going to happen.
Jed Sundwall:
Now I’m like delayed. I’m watching myself on YouTube with the delay.
Jed Sundwall:
Okay, I think it works.
Drew Breunig:
You got it?
Jed Sundwall:
I think so. All right. Now how do I get out of here?
Drew Breunig:
I mean, look, you got your first episode here.
Jed Sundwall:
All right, we did it. No, we're good, we're good. I can see, it's like all my friends. This is so great. This is like Romper Room. I don't know if you ever watched that. It was a show when I was really little, and it's like, I see Alex and Camilla and Linda. This is good. Okay, so we're good. So hold on, I do want to talk more about the white space, though.
Drew Breunig:
Nice.
Drew Breunig:
Ha ha ha
Drew Breunig:
Yeah, you can wave goodbye to them and you can’t hear us back.
Jed Sundwall:
Define it more. Are you saying creating an entirely new kind of data product, or working in an entirely new domain?
Drew Breunig:
Yeah, well, I mean, I just think there are some things where, and I think you see this a lot in culture and technology too, if you're the first to come out, you have a longer shelf life than the technically best thing, which may come out later. And so you have more ability to shape the standard, which is hard and a lot of pressure, because you can sit there and think about it forever, or you can just release it
and then evolve it quickly as things come. But it's hard when it's a dataset, because you release it and then it's fixed. It's the whale fall moment. You don't get to go back and rebuild the whale and then drop it again.
Jed Sundwall:
Yeah. No, well, and this goes back to what I was saying about the Nobel Prize challenge of what are the economics of data. And I think you know this from working at Overture: it is expensive to produce good data. I cut my teeth.
Drew Breunig:
Very expensive. It's expensive to maintain good data too. I think one of the things that allows for the longevity of these datasets is when you don't need timeliness. It's okay that the people in the Enron email dataset are not still emailing and we aren't still capturing those emails for the last 23 years, because that's not the function of that dataset. It is a demonstration of how people use email.
And there's been no competitor. Whereas if someone came out and said, I'm going to make a business of selling select emails so people can see them — well, we aren't going to see that there, but we do see it in other spaces.
Jed Sundwall:
Yeah. Well, yeah, let's talk about this for a little bit, the shape of a data product. They can take on many shapes, right? So, my first job out of grad school. My life story is I studied foreign policy. I got a master's in foreign policy, thought I was going to work for the State Department. I wanted to be a diplomat. And I grew up in DC.
It had no appeal to me; it had no luster. So I was like, actually, I just want to work on the internet. I had what I've called a coming-out process in 2006 where I was like, I care about the internet, and I don't care who knows; this is just who I am. So I took a job as a marketing enthusiast at eventful.com, which was a Web 2.0 company.
Drew Breunig:
Well, I mean, look, that’s a title that comes in the Web 2.0 era. Marketing enthusiasts. That was a special time for titles.
Jed Sundwall:
True. Yeah. Not a ninja, definitely more of an amateur. Just an enthusiast. But it was my foot in the door, and ultimately I think it was a very good decision. What Eventful did: they gathered all the world's events data they could find by scraping websites and getting access to feeds, then standardizing it and making it available via an API.
Drew Breunig:
Not a rock star, not a ninja. Yeah. Just an enthusiast.
Jed Sundwall:
And what we learned very painfully was that this is very expensive, and the bulk of our database becomes useless every day. Yeah.
Drew Breunig:
Yeah, no, exactly. That's the opposite: it's event data, it's just gone. It's done. And I think you see other people who have to struggle with this as well and try to figure it out. Satellite imagery providers — you and I know many cases where there are several satellite imagery companies trying to figure out how to build a product that makes their old data valuable,
Jed Sundwall:
Mm-hmm. Yeah.
Drew Breunig:
because right now most satellite imagery providers' stuff is valuable because it gives you that snapshot of what's going on right now. But they want to figure out everything else. And you're not going to crack that at Eventful. You're not going to crack that at anything that is temporal in nature.
Jed Sundwall:
Yeah. Yeah. Well, this is actually very timely. Antoine in the chat, I love this, is asking, what about Freebase? This is the issue. It's like, what about Eventful? Eventful never pretended to be an open data resource. It was doing the hard work of taking a lot of open data, or data that was small enough that it didn't feel like we were just ripping people off, because also we were
Drew Breunig:
Yeah.
Drew Breunig:
Yeah.
Jed Sundwall:
highlighting events that people wanted to highlight, but then assembling it into a white-pages-like product, a huge compendium where the product is: we have everything in one place, and then we sell access to it. Long story short, I don't think Eventful exists anymore. The problem it solved has been solved in other ways, and anyway, events are still kind of a difficult space to aggregate. But Freebase: awesome example from around the same era. I think it was started in
Drew Breunig:
No, it doesn’t.
Jed Sundwall:
hang on, I just looked up the Wikipedia page. It was launched in 2007. 2007, for what it's worth, was the year that AWS announced its first service and the year that the iPhone was announced. It's a very consequential year. Very heady days of Web 2.0, seeing what the internet could become. And so…
Drew Breunig:
Yeah.
Succeeded, according to Wikipedia, by Wikidata.
Jed Sundwall:
Yeah. So, I mean, there's room for this. Wikidata — I think people like it. It seems good in some ways. I've never really relied on it very much, but…
Drew Breunig:
Well, Wikidata is a good example of the importance of data UX. One of the things that was so nice about Freebase, and it's kind of what Overture tries to do with its GERS identifiers, is that for everything there would be an entity that you could then walk. Like, here's an entity for Jed; now we can find everything linked to it. And yeah, I think Wikidata is sneakily one of the best
crosswalks on the web. I think they track over 800 different crosswalk identifiers: Apple Maps ID, Google Maps ID, a lot of federal IDs, and everything else. And it is fairly successful. Its API, I think there is a little learning curve for it. When trying to build products off it, it's incredibly good for crosswalking data, though oftentimes you have to jump through a few hurdles to get the data down for that crosswalk. But again, that's like a whale fall. It's the same thing: once Google walked away, it was nice because it allowed Wikidata to exist in a way, and to utilize the Freebase data as its core. But then it had to supply the revenue, or at least the donation model, to keep it going.
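[Editor's note: the crosswalk idea is easy to see in Wikidata's entity JSON. The fragment below is hand-written in the shape of a real API response (the actual data comes from wikidata.org's Special:EntityData endpoint); P646 is Wikidata's "Freebase ID" property, though the specific IDs here should be treated as illustrative.]

```python
# A hand-written fragment shaped like Wikidata's entity JSON. In real
# responses, property P646 holds the Freebase ID, which is how Wikidata
# still crosswalks to the dataset Google walked away from.
entity = {
    "id": "Q95",
    "labels": {"en": {"value": "Google"}},
    "claims": {
        "P646": [  # Freebase ID
            {"mainsnak": {"datavalue": {"value": "/m/045c7b"}}}
        ],
    },
}

def external_ids(entity: dict, prop: str) -> list:
    """Collect every value of one external-ID property from an entity."""
    return [
        claim["mainsnak"]["datavalue"]["value"]
        for claim in entity["claims"].get(prop, [])
    ]

print(external_ids(entity, "P646"))  # ['/m/045c7b']
print(external_ids(entity, "P9999"))  # [] -- property not present
```

The same walk works for any of the hundreds of identifier properties Drew mentions, which is what makes the entity graph usable as a crosswalk.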
Jed Sundwall:
Right. And it all goes back to the fact that this is expensive and hard. These days, the year 2025, there's a lot of concern about sources of data that we had long thought were kind of unimpeachable and were going to be reliably provided by governments. And that's just no longer, you know, a safe assumption to make.
Drew Breunig:
Yes.
Jed Sundwall:
And I've actually been a voice, you know, shouting into the void for years: this was never a safe assumption to make, and we need to think a lot harder about this kind of infrastructure. Because it's hard. It's expensive to produce. And if we could figure out the economics of it and have better markets for data, I think we would have more data. But one of the hard things to grapple with here is that nothing is free, and
what you were saying before about the difference between a fishery and a dataset: there's this phenomenon that I chalk up to what's called nanoeconomics, the economics of individual, very small transactions. So if you examine voting behavior, people are like, my vote, how could it possibly count? It doesn't matter. But votes do matter, right? And like,
Drew Breunig:
Yeah.
Jed Sundwall:
we don't perceive the emissions that we create by living our lives, but they obviously add up. And so, same thing with Wikipedia: it feels free to open up an article, kind of to all involved. Wikipedia itself doesn't really register one page load, and it certainly seems free to you. But Jimmy Wales is going to ask you, he's going to nag you to donate, because they need money. Yeah.
Drew Breunig:
Yeah. And I think there's also the flip side to that, which is something that we see. When the advertising ecosystem was the way you monetized data, I'm sure many people talked to you about the dream everybody wanted to figure out: I've solved the privacy problem in advertising. I'm going to create a system where people can opt in to share their data and they get paid for it.
I know countless companies or people who dreamed of trying to figure this out, because they're like, look, people could get real value if they sell their data. The advertising ecosystem is incredibly huge. The problem is that your data on its own is worth nothing, absolutely nothing. It's worth something in aggregate, but
Jed Sundwall:
Nothing. Yeah, exactly.
Drew Breunig:
nothing by itself. And so people would make runs at this: we're a co-op, we band together, and you try to get some economic innovation, like, okay, you have a longer timeline, take advantage of compound interest, all these other things. But it's kind of the same thing: your usage of Wikipedia is a rounding error, but it's expensive; and the value of the data you create
is a rounding error. We saw this during the ad era, and we're seeing it again. What's the mobile phone network that launched a couple of days ago where it's like, we get training data on all your calls, and so you get cheaper voicemail or cheaper phone service?
Jed Sundwall:
Whoa.
How about that one? Fascinating. Tons of people are going to sign up.
Drew Breunig:
Yes. But again, I haven't looked at the cost. It can't be high. Like, how much of a discount can it actually apply? I'm looking it up because I want to see; I just saw it. Because it's way easier for someone like Meta or Google to just give you the service, where the service is predicated on sharing data. We will just never see that go away.
Jed Sundwall:
Yeah, right.
Jed Sundwall:
No, no, because in aggregate it's just too powerful, too seductive, and they provide really good services. Yeah.
Drew Breunig:
And now we're seeing the flip side of this, which is the Anthropic case right now. How much per book was that settlement? It was like $3,000 per book. If you're an author, $3,000 for a book — for a lot of authors that's going to be a lot; for a lot of authors it is not going to be. But it is more than you would expect. And they're going back to the well, because the judge took away the settlement.
Jed Sundwall:
Yeah, yeah.
Drew Breunig:
And so we'll see where that does net out. I do think trying to figure out the cost in training is hard. And the idea of opting into training — I think you're going to get applications that rise up too quickly and just take your training data. ChatGPT, Anthropic just asked everybody to re-opt-in, to change their privacy settings, because they're going to be training on that. Meta always has, always will. And so,
Jed Sundwall:
interesting.
Drew Breunig:
how are you going to create an ecosystem to pay people within that? They’re just going to go use these services and kind of knock it out. So, I don’t know.
Jed Sundwall:
Amazing. Okay. Well, let's shift to your blog post now then, because let's talk about large language models, speaking of Anthropic, and the basis of these things. Your blog post, which I highly recommend — we linked to it when people registered for this thing, and you can put it in the chat — is a great overview of these three data products. And again, this is another chance for us to talk about what a data product is. So let's start at the beginning and talk about MNIST. Yeah.
Drew Breunig:
Yeah, so one of the reasons I think large language models and AI in general are the fulfillment of "data is the new oil" is because previously, if you wanted to make a computer program, you really had to worry about two things: your software and your hardware. That's it. Just write my software, run it on hardware, I'm done.
With machine learning, deep learning, and now what we call AI and all those subsets of it, you have to have software, hardware, and then data. The data bit is non-negotiable. You need the data because the way machine learning and deep learning works is rather than having the programmer write the rules for what the program does,
You take a sufficient volume of data and present it to a computer program for making machine learning or deep learning models. You give it instructions and ask it to interpret the patterns in the data and figure them out for itself. And within deep learning there's even another layer on top of that: it figures things out without you even telling it what to pay attention to. You aren't labeling it. You aren't telling it. It's just, here's a pile of data, go find the patterns.
Now, in the early days, there wasn't a lot of data, because think about it this way: if you were an early adopter of computers, let's say in 1994, you would go to the computer store, buy your computer, bring it home, and plug it in. And that was that. If you got any data into your computer, it was because you typed it out or you inserted a floppy disk
that you got in the mail or picked up at the store, maybe a CD-ROM if you were real fancy. That's it. There was no internet connection. There was no downloading. So acquiring data was an incredible exercise. As a result, could you build machine learning systems? Not really. You had to have access to data that you just weren't going to get. So people didn't do that. And so
Drew Breunig:
it wasn't a field. It wasn't a thing. Now, people are going to say neural networks were around back in the seventies, and it's true, but there weren't many who could play with them, because access to the data was so limited. And then what we found, and this gets back to the white space, is that really any data that was delivered to your door was brand-new data. There was no competition for it.
I don't know about you, but what would you get for data? Maybe a CD-ROM in your magazine. I think the only things you would have are maybe some Project Gutenberg floppy disks you would pass around, maybe some Encyclopaedia Britannica CD-ROMs you would pull out. There wasn't a world of data. And in this environment comes the first dataset we're going to talk about, because we're going to explain the history of AI
in three datasets. And the first is the MNIST dataset, M-N-I-S-T. Now, today it's on Hugging Face; you can install the Hugging Face datasets pip library and download it. And it's also bundled with almost every machine learning library. So if you install TensorFlow
or Keras or whatever the backend, and then you say install MNIST, it's almost certainly there, because it is the dataset that is the effective hello world of machine learning, going back to the nineties. So what is MNIST? MNIST is a collection of 28-by-28-pixel images, and they are handwritten
digits. Actually, not even letters, just digits, just numbers. They collected these from two sources: one of them from, I think, census employees, and the other one was from a high school class. So this is a classic case of someone having access to two groups of people who were writing down numbers,
Jed Sundwall:
It’s just digits. Yeah. It’s just numbers. Yeah.
Drew Breunig:
filling out forms, filling out tests, and someone in the right position said, this could be useful, or, we're scanning these anyway. We don't really know exactly how it happened, but they basically realized, hey, let's make a dataset of handwritten digits. They didn't put a lot of thought into it or how it might be used for machine learning. One of the issues is, when you're building machine learning systems, you have a test and a train
subset, and you should never mix the data. Your train set is what you build your model on; you learn from it, and then you test the quality of the model on your test dataset. In the initial distribution, one of those datasets was the high schoolers and the other was the census people, which is a terrible way to do it. You should have it all mixed up and scrambled, because you can assume the census people may have different handwriting than a bunch of teenagers
who have had no training. Later they improved this. But again, they put no thought into it at first, and they decided to distribute it. Distributing was literally burning CD-ROMs; you'd have to order it and you would get it in the mail. And this was the NIST dataset, the first one.
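[Editor's note: the split problem Drew describes, training on one population and testing on another, can be sketched in a few lines. The records below are made-up stand-ins for the two handwriting sources, not real NIST data.]

```python
import random

# Stand-ins for the two sources: each record is (source, sample_id).
census = [("census", i) for i in range(100)]
students = [("students", i) for i in range(100)]

# The original NIST split: one population for training, the other for
# testing. Any accuracy you measure confounds the model's quality with
# the difference between the two groups' handwriting.
bad_train, bad_test = census, students

# The MNIST-style fix: pool both sources, shuffle, then split, so train
# and test are drawn from the same mixed distribution.
pool = census + students
random.seed(0)  # reproducible shuffle for the demo
random.shuffle(pool)
cut = int(0.8 * len(pool))
train, test = pool[:cut], pool[cut:]

# Both populations now appear on both sides of the split.
print(sorted({src for src, _ in train}))
print(sorted({src for src, _ in test}))
```

The point of the "modified" in MNIST is exactly this shuffle: the evaluation only measures the model once both populations are present in both subsets.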
Jed Sundwall:
Yeah. And again, I think we need to tell people what NIST is. It's the National something-something. Yeah. So, the government agency.
Drew Breunig:
The National Institute of Standards and Technology. So, the type of people who would be looking at pictures of numbers, and the type of people who think there's something here. Did you ever watch the movie Ed Wood? One of my favorite movies, great movie. You should watch Ed Wood. There's a scene in the beginning where he's on the studio lot. So Ed Wood is famous as the worst movie director of all time. And he's walking the studio lot and he
Jed Sundwall:
No, I really should.
Drew Breunig:
walks into someone's office where they're reviewing the stock film they just shot, which they keep in the studio library to insert into movies later. And he's just watching disconnected random scenes, and he's like, man, you could make a whole movie out of this. Just highlighting how bad his taste is. But at the same time, looking at pictures of numbers and saying "we have something here" is something you expect from the National Institute of Standards and Technology. So they put it on CD-ROMs and mailed them out. And one of the people they mailed them to was a computer programmer at Bell Labs, back when Bell Labs was still the institutional research standard. And the guy who got it there was Yann LeCun, who is one of the godfathers of neural networks and one of the AI leaders at Meta.
Jed Sundwall:
Amazing.
Drew Breunig:
He led Llama and other things, and just released a world model last week. He's a godfather of this stuff. And he had been working on the problem of trying to recognize numbers, because he worked at Bell Labs, and this is something they would want to do: they had to automate reading mail, reading zip codes. That was all it was trying to do: can we look through a camera at zip codes and automate the entire thing? And so, using MNIST, he
trained a neural network, one of the first practical neural networks, and basically delivered a watershed moment in accuracy; the error rate got down to 0.8 percent. He modified NIST, mixing up the sample sets so it wasn't just high schoolers versus census workers, and it became the hello world. And at its peak, AT&T was using this original neural network software to read more than 10% of all the checks deposited in the US,
which was software that got sold by Bell Labs. You will find this in almost every machine learning textbook, every deep learning textbook. And part of it was staged: once Yann got it, he reformatted the data, and this touches on a question someone just asked, specifically for his task of training neural networks,
which is why this data set is so valuable and why it’s become this hello world is that you can do a one line install for MNIST data and it’s ready for you to use. It’s segmented into the different data sets. It’s all standardized. The levels of contrast and anti-aliasing, the flipping reversals, all of those things are all ready for it to be used. And it has kind of survived this test of time and enabled the foundation of the very first neural networks. Again,
This is a data set that was distributed on CD-ROM. It was sneaker net. It was mail. And it, would argue birthed what would later become our deep learning ecosystem that would lead to AI.
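The one-line-install convenience Drew describes is a big part of why MNIST became the Hello World, but under the hood the files are a tiny binary format (IDX) simple enough to parse by hand. A minimal stdlib sketch, assuming the raw `train-images-idx3-ubyte`-style bytes are already in memory (in practice you’d just load it through a library like torchvision or Keras):

```python
import struct

def parse_idx_images(buf: bytes) -> list[bytes]:
    """Parse an IDX3 image file: a big-endian header (magic number,
    image count, rows, cols) followed by raw uint8 pixel data."""
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 0x00000803:
        raise ValueError("not an IDX3 image file")
    size = rows * cols
    pixels = buf[16:]
    # One flat bytes object of rows*cols pixels per image.
    return [pixels[i * size:(i + 1) * size] for i in range(n)]
```

That the whole dataset reduces to "read a header, slice bytes" is part of why it survived the jump from CD-ROM to the internet era unchanged.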
Jed Sundwall:
Yeah. I think, no, I mean, and this guy, I’m trying to pull it up because I think his name is right, Donoho, this guy at Harvard or sorry, at Stanford, David Donoho. He wrote this paper that I still have not finished. It’s very long. I’m putting it in the chat, but look, Donoho is a smart guy, but the title is a little clickbaity for my tastes. It’s “Data Science at the Singularity.” Not a terrible title though. I mean, I think he makes the case that there’s something going on here.
Drew Breunig:
on how.
Jed Sundwall:
but he credits LeCun as the godfather. He would agree completely with what you just said. And the gist of what Donoho says in this paper is that machine learning has made the enormous strides it has because its community has adopted a practice of frictionless reproducibility. So one of these fantastic phrases,
similar to undifferentiated heavy lifting. It’s like impossible to say, but very useful. But this idea of frictionless reproducibility within the machine learning space, where people have been able to share these great data products, compete around them to, going back to your point about a great data product has a community around it, have leaderboards, and it’s just been like to the moon. And this sort of tees up Alex’s question in the chat, you know, like
How would we get, for example, environmental data to be seen by AI models? How do we do that? My answer would be, and this is defending everything that we do with this podcast and also with Source Cooperative, is we would improve access to great data products. Like we would then work hard at that. Yeah. Sure. Yeah.
Drew Breunig:
Well, I think there’s two steps, which is cheating ahead. But there’s a couple things that come in, which is this idea of reproducibility, though. That was great in machine learning and deep learning. It’s really hard now. I mean, Mira Murati, she left OpenAI and founded Thinking Machines Lab, her own lab, one of the many OpenAI people who have left to found one. And right now, they’re focused on reproducibility, because it’s near impossible because of the probabilistic software and
the way inference works at test time. And so it’s almost impossible now, and they’re innovating on that sense. But the other thing I would say, is we’ll get to this. But I think the other interesting thing is benchmarks, which is you don’t just need to put the data out there. You need to define the problem and provide the means for testing against it. And so if you want to say, it’s not enough to get seen by AI model,
Jed Sundwall:
Yes.
Drew Breunig:
because guess what? They don’t care. They’re just gonna go suck everything else. What you need to worry about is that the people building them have a benchmark to build against. It’s the, what’s the, a measure becomes a target or something, it becomes the, exactly. And that’s what it is, which is like, and this gets back to, I would even say, my funding. It’s not enough to just be there. You have to.
Jed Sundwall:
Metrics become targets. I mean, yeah.
Drew Breunig:
you challenge these things and provide a mechanism for measuring success. If you don’t do that, no one’s going to care about it. But yeah, so that’s Yann LeCun. He’s doing his thing with CD-ROMs, sending it out. And it’s crazy to think part of what the internet has done and broadband is it speeds everything up because it makes exchange so much easier. And yes, the test benchmarks need to be actually relevant to the use cases. Yes.
The thing about benchmarks is that they are shipped by people who care about specific things. If you’re shipping a benchmark and you don’t have an understanding of why it’s important and why you care about it, and you have some stake in what that is, you’re wasting your time. Like, why are you shipping a benchmark in the first place? The point of putting the benchmark out there is to challenge people to perform against the thing that you care about. And there’s lots of
great examples of that.
Jed Sundwall:
Actually, can you help educate me on something I’m like very naive about, and this is embarrassing, but I’m just going to be vulnerable on this podcast. So to Tyler’s point, there’s been, my understanding, a lot of discussion about benchmarking with like earth observation AI models and stuff like that. And a gripe is that you can benchmark these things based on some sort of like, you can create like a technical benchmark or something like that, but it is divorced from reality, like from what’s actually happening on the ground.
And it’s basically, like you can test, you can run a model and then test it to see if it’s performed in a certain way that like indicates that it’s a good model, but that does not indicate if it’s actually useful. Can you explain this to me a little bit more?
Drew Breunig:
Yes.
Yes.
Well, I disagree with that. I think there’s lots of ways you can game benchmarks, but here’s the best way to think about benchmarks in my opinion, is that they are an encapsulation of knowledge with an opinion that allows you to test your performance against that encapsulation of knowledge. Yeah, we’ll talk about overfitting in a second, Joey. That’s very much a thing.
But the thing that I have that’s a problem is like a lot of people in earth sciences or sciences in general is they go to like big private companies and say, my thing is really important. You need to build against it. And that is first off, you have to get them to believe your thing is important. And then B,
They have to get up and running and understand that space really, really, really, really well. And then they have to build against it and follow it to create their own benchmark. So when you create a benchmark, you are doing that work for them. And when you do that work for them, you get to encode the things you care about.
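Drew’s “encapsulation of knowledge with an opinion” can be made concrete: a benchmark is frozen cases plus a scoring rule. A minimal sketch in Python, where the case set and exact-match scoring are illustrative choices, not any real benchmark’s API:

```python
def score(model, cases) -> float:
    """Score a model against frozen (prompt, expected) cases.
    The opinion lives in two places: which cases you froze,
    and what counts as a match (exact match, here)."""
    hits = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return hits / len(cases)

# The benchmark author encodes what they care about by choosing cases.
CASES = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
]
```

A model that answers the arithmetic case but not the geography case scores 0.5 here; publishing the cases and the scorer together is what lets everyone compete on the same terms.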
Drew Breunig:
It comes back to the like, there’s a I think it’s Louis Pasteur quote, which is, give me a laboratory and I will move the world. And he was talking about it in the case of like being able to freeze benchmarks and maintain science or freeze a variable and maintain science. And so if you can create a benchmark, you are creating the eval reality that you are asking that model to be held against. And this happens for lots of things. And so I think right now,
The two most successful benchmarks are the ARC-AGI benchmark, which François Chollet built, which is, again, he basically said, everybody’s talking about AGI, but it’s not reasoning. It’s really just fact memorization and repetition. He has a different thing, which is like all about pattern recognition. It should be incredibly easy for a human to do, but incredibly hard for a model to do. And so that has been…
kind of the thing. He has been in the deep learning space for over a decade. He is a leading voice. He created this, and all of a sudden it became the thing that everyone starts to brag about when they get this, because it’s really hard. When o1, OpenAI’s o1, was the first one to do it even somewhat passably, it was a really big deal. And ever since, like, we’re still kind of chasing it. So like both
his design, his leadership, his brand helped set that as this big thing. The other more tangible example, of like, you don’t have to be a leader in the space, you just found the white space, is a benchmark called Terminal-Bench. So Terminal-Bench is testing a model’s ability to use the terminal, use tools in the terminal. So with coding agents, this is so important.
Jed Sundwall:
Mmm. Yeah.
Drew Breunig:
Why do I care about having MCPs? Why do I care about having all these crazy tool sets? Just teach the model how to use the terminal and all the problems are solved. And this was put out by a really great team, and they designed it in a specific way to basically get the agents they want. They spent a lot of time on this. This is out of Stanford and funded by Laude. And this has now become
the thing that people gauge against. Like Anthropic, if you look at when their models come out, they will always put the Terminal-Bench benchmark as like their top thing. When they bumped Claude Opus from 4 to 4.1, the main thing they cited was their Terminal-Bench improvement. So that’s a good example of like, I’m creating the package of the reality I want from this. So someone in the chat replied to your…
Jed Sundwall:
Yeah.
Drew Breunig:
Earth observation benchmark, which is like, all right, benchmarks are great, but my gripe is that most Earth observation benchmarks, so it’s looking at satellite imagery, they’re focused on object detection. Very few are focused on temporal signatures of change. Well, what that says to me, Tyler, is that’s an opportunity for you to create a benchmark, or for someone to create a benchmark, to measure this capability that you want to build into this model. A benchmark is a data product.
It is honed, and I think it’s kind of the current way that data products are released, or one of the main form factors they can take in this moment. Yes, you have a worry about overfitting. SWE-bench is the software engineering bench. It was, again, one of the biggest, first to market, which is: can a model take GitHub issues and submit changes?
Jed Sundwall:
Hmm.
Drew Breunig:
and submit PRs that pass. And it was adopted quickly as the main thing people were building against. I talk to AI researchers at foundation model companies and they’re like, I’m just trying to get another point in SWE-bench. That is what keeps me up every single day. But again, it has its own shortcomings. Like 50% of SWE-bench is just the Python Django library. So it’s really good at building the Django library, but maybe not very good at some…
Rust, or maybe not very good at, you know, some data pipeline you’re building. So again, these things shape the outcomes, and communities grow up and private companies grow up. And so that’s kind of why I think benchmarks are kind of a modern data product.
Jed Sundwall:
Interesting. Okay. There’s a lot to think about. I’m looking at the clock. I want to, boy, where are we going? I want to talk about Common Crawl, but also, like, we did not specify an end time for this, because like good podcasts just go off the rails. Do you have a hard stop at the top of the hour? Oh, then okay.
Drew Breunig:
Yeah.
Drew Breunig:
I have a hard stop, but yes, not at the top of the hour, at one.
Jed Sundwall:
Okay.
Drew Breunig:
Yeah, we got an hour and 10.
Jed Sundwall:
Yeah. So we booked our own time. For those listening, we’ve blocked our calendar for two hours so we can go this long. We’re going to go for as long as we want, but no further than 1 PM Pacific.
Drew Breunig:
There you go.
Yes.
But I think this, the benchmark thing, that’s how we talk about it today. But transitioning into, from MNIST, we went to ImageNet. And that is something that Fei-Fei Li created when starting at Princeton, because she built it as that challenge. She saw that there was a WordNet,
Jed Sundwall:
Yes.
Jed Sundwall:
Yes.
Drew Breunig:
which was out there, which was essentially a natural language processing training data set. And she said, well, I want this for images because I want people to build better image recognition models. And so to do that, she realized they needed a way to test it and train it. And it became not only a thing you could train models on to improve the software, it also became like the foundation of the improvements of kind of deep learning in general, which is, again, you put out your challenge
and you make people go to it. It was a benchmark as much as a data set.
Jed Sundwall:
Well, right. And like built in somehow, you know, I don’t know how she did this in terms of like funding and her stature at Stanford or whatever, like challenges. It was just sort of like, this is a data product, you know, that we’re putting out there, and we’re going to run challenges. And this is one of these overnight successes that took something like six years or something like that. I don’t know like when, it was a long time before AlexNet came out.
Drew Breunig:
Mm-hmm.
Drew Breunig:
It was a long time. And they, also, yeah. And I think the other thing too is like they had to create it the only way they were able. So MNIST came out on a CD-ROM pre-internet. ImageNet could only have been created after the internet existed, because they leveraged Mechanical Turk. They leveraged Google image search. They basically were just paying to label images, at a price that just would not have been possible before.
Jed Sundwall:
Mm-hmm.
Drew Breunig:
So I think there was a couple years that, because ImageNet was pretty, it was after Common Crawl first launched, but its breakout moment came before Common Crawl’s breakout moment occurred. And so AlexNet was 2012, whereas ImageNet was like, I think 2008, 2007. But yeah, and so go ahead.
Jed Sundwall:
Okay. Yeah.
Jed Sundwall:
Well, ImageNet’s another interesting example though of, well, this is when I said I wanted to talk about licenses because when I was at AWS, people were like, hey, you should host ImageNet in the open data program. And I’m like, I mean, sure. Like I think that would be cool if we did. Also like people can get it. Like you don’t need, you didn’t need S3 necessarily to get ImageNet. Like people, you could download it. Like it wasn’t like so huge.
Drew Breunig:
Yeah.
Jed Sundwall:
that it mattered so much. But I was also like, look, my lawyers aren’t going to like this. Like if we’re going to host these images, we don’t know. Yeah, they’re just like random licenses all over the place. And, but it just reveals just how like, how brittle this sort of like licensing regime is for this sort of stuff where it’s like, look, who’s, who’s going to sue you honestly, because you’re using some like, like 120 by 120 pixels square picture of like a dog.
Drew Breunig:
peeled off of Google search, like…
Jed Sundwall:
Like, you know, like.
Drew Breunig:
Yeah, I mean, it is that weird thing where it’s like, it’s fine to bootstrap it. But if like you’re really successful, someone comes knocking. It’s kind of like, you know, Google looks the other way on people using Street View images, even though they know, they know that they are being crawled in some way or another.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Yeah. No, I mean, or, you know, come for Anthropic once they’ve raised enormous amounts of money and they’ll be like, sure. Great. Actually, like we’re, it’s an honor to pay this because we know that no one can come up behind us now. It’s like, you know, cause we have got the cash.
Drew Breunig:
There you go.
Drew Breunig:
Yeah. And that’s what you’re paying for. I mean, some would argue that’s why Google bought YouTube, purely to buy the court case, or one of the main reasons. So yeah, so ImageNet was basically a database of, I think, about 1.4 million images that were labeled. A thousand categories. 1.4 million. And then just said, hey, every year we’re going to hold a contest
Jed Sundwall:
Interesting. Yeah. Okay.
Jed Sundwall:
Something like that.
Drew Breunig:
to see who can get the best one. Now the idea of waiting every year is a positively quaint notion. People just download and run the benchmarks every single day. You had to upload it, the whole thing. But I do think ImageNet was every year. And so that went around for a while. Side note. Oh, go ahead.
Jed Sundwall:
Yeah.
Jed Sundwall:
There you go. Do this.
Drew Breunig:
Side note, I was thinking about this last night. So in a former life, I was a media strategist at a large media buying company. And in 2009, I was writing media strategy for Nvidia. And I was thinking about this last night because Nvidia had a new technology that they were very excited about called CUDA. And they…
I remember going down to the briefing and they’re like, here’s what we’re going to show at the floor at our next big conference. Here’s all the demos for CUDA. CUDA is this idea of we can use GPUs for generic computing and we can use it for immense parallel processing. We think this is going to be really big. And we would ask, all right, well, what are people going to use it for?
And they had like eight demos. None of them were machine or deep learning. There was a couple of biotech ones about like protein folding or what have you. There was a lot of cloth simulators. like, hey, we can sell this to fashion designers to simulate how a cloth is going to drape over someone. They just had like tons of different things. And they had no idea what they were going to use CUDA for. They just knew it was going to be this big thing.
Jed Sundwall:
Yeah.
Drew Breunig:
but they had no idea. And so like, and you could make a very strong argument that like CUDA is the reason Nvidia is in the position it is today, being, you know, one of the most valuable companies in the world. And they had no idea what it was for. And so, CUDA came out 2008. I worked on it in 2009. And it was still this thing, like no one knew. You would go around and they just would look at you and they’re like, well, we can do these things. And you’re like, that’s kind of interesting. And it wasn’t until 2012
Jed Sundwall:
yeah.
Drew Breunig:
where they kind of had the first glimpse. So we’re hitting the big names here. Geoffrey Hinton, who did win a Nobel Prize for deep learning and machine learning, Ilya Sutskever, who later would co-found OpenAI, Alex Krizhevsky, I can never pronounce his last name, Krizhevsky. They built AlexNet, which basically performed against ImageNet
with a score of 84.7. And you have to understand this was a 10-point-plus difference compared to any previous competitor that year or before. And it was the first time that they had used deep learning accelerated by a GPU. And they were just using two consumer GPU cards. They were using basically what you would buy to game at that time. And that basically started deep learning.
Deep learning was like how we talked about AI before AI. And so that was kind of what set it off. And I think the big step change here is that, again, this comes back to the benchmark thing, which is Fei-Fei Li created this space essentially out of a benchmark.
which is deep learning became a thing because its value was proven because someone built a data set and then people gamed to see how well they could perform against it. A benchmark is essentially a data set with intent. And when you ship that out into the world, you get people to do things against it if you make it exciting, if you make it collaborative, and if you’re operating in the white space.
Jed Sundwall:
Yeah. Yeah. This is, I mean, I’m going back. Linda said something like, wait, yeah. She said data is typically purpose-built. Understanding this will force us to examine our data more rigorously, creating a significant demand for data repurposing, especially with AI. What I’m hearing, or where this is coming together for me, is that you can produce a data product, and we’re going to talk about Common Crawl next. I mean, we should. And then
you have to produce benchmarks attendant to that data set or that data product, which are basically just any number of arbitrary goalposts that you want to set. Maybe like, because Common Crawl is so rich, obviously it can be used for so many things. So you just need a benchmark for each of those things, you know, and just say like, well, can you do this? Can you do that? Yeah.
Drew Breunig:
Yeah.
Drew Breunig:
Yeah, and I think some of the most interesting things out there are benchmarks. So we talked about Terminal Bench is one of my favorite examples. The other is the Berkeley Function Calling Leaderboard, which is just testing how well LLMs can use tools that are given to them for agentic purposes. And it’s really, really interesting. And then
What’s the other one that I really like? It’s not empathy bench. What is it?
There’s another one. Sorry, John, here it is: Sam Paech has a great benchmark, EQ-Bench, one of my favorites. And he maintains this himself. Just some dude, love his stuff. He’s like, I’m interested in having LLMs become better writers. And again, it’s like one of those things that’s really hard to quantify, how to make it a better writer. So again, he just, he…
He’s like, here’s one metric. Here’s another metric. One metric is like, how often do you reuse the same phrases? OK, great. That’s great. We can do this. But two, long-form writing. It’s like all of these. And it’s a really interesting thing. And he admits, he’s like, this isn’t perfect. But again, you start to see people building against it. And it does start to influence and shape the arc of development.
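One of the metrics Drew mentions, phrase reuse, is easy to sketch: count how often the same short phrases recur in a piece of writing. A toy illustration in Python, not the actual scoring any real benchmark uses:

```python
from collections import Counter

def repeated_trigram_rate(text: str) -> float:
    """Fraction of 3-word phrases in the text that appear more than once.
    Higher values suggest more repetitive writing."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeats = sum(c for c in counts.values() if c > 1)
    return repeats / len(trigrams)
```

Any single number like this is gameable, which is why such benchmarks combine many imperfect metrics rather than trusting one.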
Jed Sundwall:
Yeah. Well, let’s talk about Common Crawl a bit more in depth, but I’ve got to shout out Sam at Common Crawl. She’s like, hey, we have an event coming up. So for people who are in the Bay Area, at Stanford on October 22nd there’s an event called Preserving Humanity’s Knowledge and Making It Accessible,
Drew Breunig:
yes.
Jed Sundwall:
addressing challenges of public web data. This is the kind of thing I would love to go to. I’m unfortunately booked at another event. Chan Zuckerberg Initiative, I think, whatever CZI stands for. I’m going to be at one of their open science events. But man, if I weren’t with CZI, I would definitely be trying to go to this thing. So, and you can watch online. So I put the link in the chat and we’ll share this. I think we should share this podcast before October 22nd for sure. So.
Drew Breunig:
ooo
Jed Sundwall:
Shout out to Common Crawl. Drew, tell me all of your deepest thoughts and feelings about Common Crawl. It’s a great story.
Drew Breunig:
mean, Common Crawl is novel for how early it started and that it wasn’t really built with machine learning or AI in mind. It was, so to give you some perspective, the Common Crawl project, is essentially, it’s like the idea is that, hey, we’re going to scrape the internet and put it in one data file ready for people to use. So you don’t have to go scrape it.
Jed Sundwall:
Yeah.
Drew Breunig:
because again, we believe that lots of people can build things if all of this is accessible, and so the net value out of it would be tremendous. It began in 2007, the same year that Fei-Fei Li launched ImageNet. And so Gil Elbaz, yeah, good old Gil, he started it
Jed Sundwall:
Yeah.
Drew Breunig:
and he formed the Common Crawl Foundation. It’s funny, he founded it as he left Google, so it kind of tells you, you know, what his motivations were, which is like, I want to build, essentially, I don’t want Google to get a lock on the internet. I want to kind of expose the thing that’s really expensive to bootstrap and start up, especially in 2007, which is crawling and preparing all of the files. And now it’s a single data set, essentially, with
250 billion web pages collected over nearly 18 years. And about three to five billion pages are added a month, though, sadly, Common Crawl is getting shaped a little differently because its crawlers are getting blocked. And the reason its crawlers are getting blocked is because of AI-driven crawling. In a weird twist of fate, Common Crawl became one of the foundational things that early language models would train on.
It would become a critical ingredient in The Pile, Google’s C4 data set, basically subsequent data sets, kind of child data sets, which is like, hey, we’re not going to include every single forum, or we’re not going to include, you know, duplicative data, we’re going to filter all this stuff down to the high-quality stuff. But then once you start building it, and this is where it gets into the data-is-like-oil thing: let’s say I use that to build my model that later becomes ChatGPT. I have so much,
Jed Sundwall:
Right.
Drew Breunig:
I’m not going to rely on common crawl anymore. I’m going to start building my own crawlers and go out to the things that I care about and do it with a much greater frequency so that I can improve my model. You get enough of these, which you do. There are a lot of people out there hosting websites right now that are having to think about how to gate their content to prevent legitimate and gray market crawlers that are just hammering their sites. And so now, like,
Common Crawl created this thing, but now we’re kind of having a tragedy of the commons, which is everyone who grew up around it now sees running their own crawler as a competitive differentiation. And they’re going out there and kind of doing that themselves. All the while, Common Crawl is still going, but its surface area is starting to shrink a little bit, because different web pages are shutting off access to crawlers because of this mess that it has created. So I do think it’s the closest thing that we have in data to a tragedy of the commons.
But yeah, I’ll pause right there before I talk about why the text is so important.
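Common Crawl’s “scrape the internet into one data file” product is concretely a set of WARC files: each record is a block of plain-text headers, a blank line, then the captured payload. A minimal stdlib sketch of splitting one record (real crawls are gzipped and enormous; in practice you’d iterate them with a library like warcio):

```python
def parse_warc_record(raw: bytes):
    """Split one WARC record into its header dict and payload bytes.
    Headers end at the first blank line; Content-Length bounds the body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    # lines[0] is the version line, e.g. "WARC/1.0".
    headers = dict(line.split(": ", 1) for line in lines[1:])
    length = int(headers["Content-Length"])
    return headers, body[:length]
```

The format’s simplicity is part of the point Drew makes: once someone has done the expensive crawling, anyone can process the result with a few lines of code.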
Jed Sundwall:
Yeah, no, I mean, that’s an amazing story. Gil has told me, he’s like, I’m pretty sure Common Crawl is like the most impactful nonprofit ever. There’s definitely a case to be made there. I don’t know exactly how you’d quantify that, but holy cow. Yeah. Yeah.
Drew Breunig:
Yeah, mean, because everything grew up around it. even you’ll look and people will say, so-and-so didn’t use common crawl. But then you look at the data sets they did use and they were derived from common crawl. So it basically fueled the entire first wave of large language models, which is what percentage of our GDP at this moment?
Jed Sundwall:
I think it’s 140% of our GDP. Yeah. Yeah, none of the math makes sense when you’re hearing what people are talking about large language models now.
Drew Breunig:
Yeah, 100 % it is. We don’t know how that’s possible, but it is. That’s what we’re
Yeah, and this is like one of those weird things that like when he built this, like, one of the weird things about large language models is that everyone was kind of surprised when the first large language model like worked, like, like attention is all you need. Because like, it’s this thing where like previously, you would have to put structured data into these deep learning models, and then they would have to figure out the relationships. No one at the time like when when people thought of structured data,
Jed Sundwall:
Right.
Drew Breunig:
they thought of the work that Fei-Fei Li put together with ImageNet, which is here’s an image and here’s some labels. And so the big gate for deep learning is like, anyone who wants to build on deep learning, they’d say, all right, well, where am I going to get that labeled data? Where am I going to get that structured data? With large language models, the thing that was shocking to everybody is like, wait, language is structured,
because we can see the order of the words. Some words come before each other, and some come after, in all of these assemblages. And we don’t need to label language because it’s already organized and structured. We just have to have enough of it. That was the thing. The magic thing was that you built something big enough that it would display spooky, intelligent qualities. And that was what Common Crawl enabled. Because if you didn’t have that, you couldn’t test that
Jed Sundwall:
It’s wild. Yeah.
Drew Breunig:
randomly, because you would have had to stand up your own crawlers before that. So like the fact that it just existed allowed for that discovery to be made, which is why I think I wouldn’t argue with Gil’s claim.
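Drew’s “language labels itself” point can be shown in a few lines: every position in raw text yields a training example whose label is simply the next token, no human annotation required. A sketch where whitespace splitting stands in for a real tokenizer:

```python
def next_token_pairs(text: str):
    """Turn raw text into (context, next_token) training pairs.
    No labeling needed: the label is just whatever token comes next."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
```

This is why a giant unlabeled corpus like Common Crawl was enough: the supervision signal comes free with the text.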
Jed Sundwall:
Yeah. No, it’s incredible. I have another apocryphal story. I mean, we hosted Common Crawl, my program at AWS is the home of Common Crawl. I have stories that I probably shouldn’t tell, so I won’t. Like, it’s phenomenal and kind of insane. And I was joking about this last week at an event at Climate Week,
because I was in a room with a bunch of organizations, I’ll just say very large corporations, not a government in sight, talking about sustainability data for global supply chains. I won’t go into much more detail than that. But I said, you’ve got to understand, there’s this story about this guy, this one dude, granted a billionaire, who’s just like, here’s a thing I’m gonna do, and does it. And it has this huge impact.
And I’m like, this heartwarming story of the impact that one billionaire can have on the world. But the point also being that like, it is possible to create a data product that has a very consequential impact. And if you feel like there’s something there, there might be something there. In Gil’s case, I mean, my story, at least from what I recall, him explaining this to me is that he creates AdSense.
Drew Breunig:
Yeah.
Jed Sundwall:
it’s acquired by Google, he spends his time at Google, and he’s like, there’s gotta be some kind of fail-safe for this kind of thing, where we can’t have one company that, you know, owns all of the world’s information. There’s some irony in the fact of like what Anthropic and OpenAI are becoming is just sort of the next version of that sort of thing. But you know, I’m not mad about it. Like, yeah.
Drew Breunig:
But I mean, I think about that a lot. I think it’s interesting now we’ve gone from the crawl being the thing that’s valuable to the interaction data. So like when they were talking about breaking up Google, one of the things that they were talking about was making the ranking data, like making the index open, which isn’t just the data. It’s also the relationships that exist in the data. But again, one of the things that I’m shocked about with LLMs, which I
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Drew Breunig:
find to be really interesting, is that no one’s running away with it. Sonnet 4.5 came out and said, hey, this is the best model this week, the best coding model. But the thing is, the difference between Sonnet 4.5, GPT-5, even the open models, the larger Qwen coding models, they might not be perfect, but they’re a lot closer than you’d think.
And it’s to the point where like everybody jumps on whatever the newest thing is, but you could just be like sitting on, you could have been sitting on GPT-4o for a year and you would have been fine. And I do think what’s wild is that the floor is coming up faster than the ceiling. The ability of 7 billion parameter models to effectively, you know, double in quality every year is just absolutely insane. And so like,
You will get some things from like throughput and other things like that. But like, I think the weird thing is that even if these guys win, you may end up having like free access to something running on your device. That will be, it’s bizarre and it’s really weird to think about.
Jed Sundwall:
That’s incredible. Yeah. Yeah. It is. Well, let me, I want to go back to the data-is-oil thing and how LLMs change this sort of stuff. And Alex left another comment about, you know, people trying to use robots.txt, or there’s like llms.txt, to try to influence how the bots can navigate the web. So I have this theory, I’ll just bounce it off of you. I don’t know if it’s a theory, but this idea that like,
Drew Breunig:
Yeah.
Jed Sundwall:
So the internet has been full of really amazing data for a very long time. And a lot of us who've worked in open data have just been scratching our heads about it: well, why doesn't it get used? You know, there's all these open data portals that don't get used. And one of my answers to that is that humans don't know how to use data, by and large. If you just take a sample of, like, a million humans, you're going to get a very small percentage that actually know how to do stuff with data. And...
And also have time. I mean, this was always kind of the funny thing, an early realization for me when I was working in civic tech: there's people that are like, yeah, we'll just open up our city's data, and then people will just do cool stuff with it. And I'm like, hey, if someone knows how to do anything with your data (which is not that good, it's kind of a pain to work with), they have a job. You have a narrow window of, like, college kids and civic tech activist types
who, before they... exactly, have kids, I was just gonna say get a wife or a husband and have a job... they're willing to do that sort of stuff. And that's it, and they just kind of go away after a while. But LLMs, 24/7, can do stuff with data. And so we are at the point where I think that we might have created a market for data, if we can get... and here's my crazy idea.
Drew Breunig:
have children and full-time jobs. Yeah.
Drew Breunig:
Yeah.
Jed Sundwall:
Tell me if I'm crazy. Also, I think this is already happening: like, OpenAI and Anthropic should pay for data. They should just, hey, they come to some data portal thing where it's like, hey, we maintain this data. If you're a bot, we're gonna charge you a ten-thousandth of a penny per request here, so that you can... you know, it's basically your research budget. Yeah, I think it's a good idea. I don't think Cloudflare... I think Source Cooperative should do it.
Drew Breunig:
Well, I mean, that’s what Cloudflare is trying to do.
Jed Sundwall:
because we’re not owned by anyone, but anyway.
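The per-request metering Jed floats here can be sketched in a few lines. Everything below (the bot user-agent markers, the operator names, the fee) is invented for illustration; this is a toy, not a real billing system:

```python
# Hypothetical sketch of metered bot access: recognize known AI crawlers by
# user-agent and charge their operator a ten-thousandth of a penny per request.
from collections import defaultdict

PER_REQUEST_FEE = 0.01 / 10_000  # a ten-thousandth of a penny, in dollars

# Illustrative user-agent markers only; a real system would verify bots properly.
KNOWN_BOTS = {"GPTBot": "OpenAI", "ClaudeBot": "Anthropic", "CCBot": "Common Crawl"}

def meter_request(user_agent: str, ledger: dict) -> bool:
    """Bill the operator's ledger if the request comes from a known bot.

    Humans are served for free; bots are served too, but billed."""
    for marker, operator in KNOWN_BOTS.items():
        if marker in user_agent:
            ledger[operator] += PER_REQUEST_FEE
            return True
    return True  # human traffic: serve without charge

ledger = defaultdict(float)
for _ in range(1_000_000):  # a million bot requests
    meter_request("Mozilla/5.0 (compatible; GPTBot/1.0)", ledger)
print(f"OpenAI owes ${ledger['OpenAI']:.2f}")
```

The point of the arithmetic: at a ten-thousandth of a penny, the money only becomes real at crawler scale, roughly a dollar per million requests.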
Drew Breunig:
Yeah, no, I think it's an interesting one. And the incentives are absolutely crazy to think about.
Drew Breunig:
I mean…
Jed Sundwall:
Don’t loan your mind.
Drew Breunig:
I'm thinking about what angle to approach that from. What do you optimize for? Also, do you mind if I take a quick break, a one-minute break, and be back while I think about this? We'll handle it in the edit. One second. Someone's knocking.
Jed Sundwall:
Sure.
Okay. Okay. All right. For those of you who are watching the live stream, someone knocked on Drew's door and he had to get it. I'm going to use this chance, because I don't know when this is going to end, but we still have some people on here. For those of you who don't know about the Cloud-Native Geospatial Forum, we did an event in Utah this year, at the end of April, early May, at Snowbird. It was fantastic. Everyone loved it. We polled everybody at the end of it and
we got, like, five stars. I don't know, 97% of people said they would come back, and they loved it. So we're doing it again. So you're hearing it here first. We just lost a follower, but anyway, we're gonna be doing the Cloud-Native Geospatial Forum conference again, October 6th to 9th. Not next week, but October 6th to 9th, 2026. So we're gonna do it in the fall next year, but we're gonna do it again. We'll have a landing page up before too long,
and, you know, we'll have links to share out. But anyhow, it's very exciting. Alex left another comment. Yeah, so exactly. Alex leaves a comment saying, you know, a lot of what journalism orgs, Reddit, and orgs like Wikimedia are doing with their enterprise APIs is locking them down. I think this is fine. People were coming out of the Web 2.0 era, and I think a lot of the excitement around having open APIs, like,
Drew Breunig:
So.
Jed Sundwall:
is understandable, but now we're realizing... we have about a decade of knowledge now to understand that this has a cost. Yeah.
Drew Breunig:
Well, I mean, the other thing that's crazy about it too is that a lot of the Web 2.0 dream is being enabled by LLMs, but now you go to the meme: like, "not like that." We dreamed of and loved the idea of a semantic web, where you could ask questions and just access things. And it has been delivered to us, and it has been delivered not as an open force but as an intermediating force. And now we're having lots of second
Jed Sundwall:
Yeah. Yeah.
Drew Breunig:
questions about that.
Jed Sundwall:
Yeah. So, I mean, yeah, we're going to have to figure it out. But what I would want to say is that it's fine. I think we should just be sort of sober about this and say, if we want to have reliable access to data in these ways, someone should pay for it. And what's interesting about ChatGPT is that people pay for ChatGPT. Like, I pay for ChatGPT. It should have a research budget. Like,
some fraction of those pennies could go towards maintaining accurate, up-to-date data about school enrollment in America, or whatever it is, whatever kind of research I wanna do. Then there's actually money flowing, because that kind of stuff was never gonna be supported by an ad model. Yeah.
Drew Breunig:
Yeah.
Drew Breunig:
Yeah. I mean, I don’t know. It’s going to be supported by an ad model eventually. Don’t worry. It’ll come. I don’t know if you’ve seen the announcements OpenAI has made over the last couple of days. They’re very much ad model friendly. They’re selling stuff. They want to give you a morning report where they browse the web for you and go find all the things you should be looking at. And that’s going to have an ad in there. I mean.
Jed Sundwall:
Yeah.
Well, either way, they're selling stuff through ChatGPT.
Jed Sundwall:
man.
Drew Breunig:
I mean, well, and I'm gonna play devil's advocate to you here, because if there's one thing I get frustrated about in the open space, it's people saying, well, we should be paid. This idea of, like, you're making money off of my library, you should be paying us, you are freeloading. And it bothers me because,
I agree with it. In a perfect world, I want every open project that gets usage funded. But your argument cannot be "we should be paid, you're making money off of it." It needs to be a realistic, practical, pragmatic exchange for how you deliver that. And so I do think there is, like, there could be a mechanism for the way information gets distributed and accessed.
And I think it’s going to get really fraught right now because like…
The whole ad model is going to go crazy, not just because it's going to get intermediated. The ad model is based on attention. And if we have these agents out there making decisions for us as proxies, that attention is now theoretically infinite. How do we govern that relationship, and how does it get re-monetized? So.
Jed Sundwall:
Yeah. Yeah. So I'm with you. I'm putting it in the chat, just to flog my own blog: the gazelles blog post from, you know, a year and a half ago. One thing I haven't been explicit about, and there's going to be a follow-up blog post at some point, is that the idea of a gazelle is that we should have entities that are, I would say, non-owned, not owned by investors exclusively,
Drew Breunig:
Sure.
Jed Sundwall:
that provide some sort of... usually they're providing data, but they are accountable to the market. And so I'm with you in that the conversation needs to go way beyond "we should be paid." There's so much entitlement in the open community, it drives me insane. It's like, you should give me data for free, it's a public good. I'm like...
Drew Breunig:
Yeah.
"And everybody should be giving data away for free." Like, I want people to think about their monetization policies, because it gives them control over their own future. And that is me clarifying why I get frustrated when I hear open people begging for money, because that's what it is. They don't have the leverage. They've never thought about it before, and now we finally have to come back to it. And so I encourage everyone to think about money before you think you need to, because it's going to help you control
Jed Sundwall:
Yeah.
Jed Sundwall:
That’s right.
Jed Sundwall:
That’s right.
Drew Breunig:
your future and your destiny and not end up being beholden to something else.
Jed Sundwall:
That's exactly right. And it's hard. I'd say I'm fighting on two fronts with my notions of gazelles and new public sector organizations. One is the easy one, where it's like, these billionaires have too much power, and some of these tech companies are out of control with too much power. People are like, yeah, blah, blah, blah, we all kind of agree. The harder battle to fight, though, is for me to go to my colleagues in the open world and say, hey, we should maybe put a price on what we do,
and think about the value of what we're doing, and see if the market supports that. And they're like, what? I believe a huge part of this, the cultural legacy of the philanthropic world, comes from, like, European aristocracy, where they're like, we do not touch money. I don't work for a living. It's, like, leisure-class stuff.
Drew Breunig:
Yes.
Drew Breunig:
Or the money... I think it goes back to, like, Stallman and others, the cathedral-and-the-bazaar type thing, which is: we should have this free exchange, everything is better with exchange, everything is better with open. But then we often get issues.
Jed Sundwall:
Well, you get steamrolled by people who actually have market power.
Drew Breunig:
I see the Ruby community right now. I don’t know if you’ve been following that, but that’s a good example.
Jed Sundwall:
A little bit. I saw they created a foundation. Tell me more.
Drew Breunig:
No, there's just a governance argument right now about who has control over what, what org has control over what, and how much power Shopify has as the big player bankrolling everything in this example. And so you have all of these things that stack together until you get into these uncomfortable scenarios, when the incentives are not aligned, or not aligned the way you expect them to be aligned,
Jed Sundwall:
Interesting.
Drew Breunig:
which is almost just as dangerous. And so like, I do think there is a market for data, but like you have to provide the utility of it. And I do think like it comes back to data discovery and data democratization, but like, we’re not going to create these things just because we want them. We have to create them and build the structures around them.
Jed Sundwall:
That's right. That's right. And so that's what I need to figure out: can we create some sort of mechanism whereby... look, I'll just talk about Source, the vision for Source. This is the Source Cooperative podcast. We have this notion of data products. A data product, in our opinion, is a collection of files, of objects, that have been shared by an organization or a person, and you know who they are. That's fundamental to Source: this is a data product that came from Planet,
Drew Breunig:
Yeah.
Jed Sundwall:
you know, the satellite company, for example, and it is up to the user, the beholder, to determine whether or not that data is worth their time. And what is interesting to figure out is how we could communicate that to an LLM. Could somebody say, hey, ChatGPT, I wanna know this information, but I only wanna get data directly from Planet, or from NASA, or the Census Bureau, or whatever it is?
And then it's up to OpenAI to determine, yeah, sure, we're willing to throw a few shekels over to Planet to get access to this data and return it. Because, you know, my assumption is that OpenAI is just gonna hoover up whatever they can get. Is the credibility and provenance of data actually important to consumers? Maybe sometimes, but who...
Drew Breunig:
Yeah.
Jed Sundwall:
It's weird, because who's making that determination? Many times it's not going to be the user. They're just going to be asking an idle question.
Drew Breunig:
Yeah, I also think it matters in the domain. And that's where you're seeing a lot of random startups. Like, I was just talking to someone who's starting a company based on medical spending records, looking at Medicare receipts and Medicaid receipts, and it's a highly regulated industry. You can't have hallucinations. You have to have
provenance figured in when you start to build different products with this. And they've had to build their own custom pipeline. And this gets into the question Alex just asked: I wonder how RAG changes the play. Look, even if you build your own custom pipeline... they were doing text-to-SQL, which kind of predates RAG; text-to-SQL was the first use case. But then they're having to figure out, right, well, how do we go validate and subsequently confirm? And so,
getting back to what you're saying, with RAG it's like self-subscribing confirmation, and that's kind of where the messiness comes in. The thing is, they're working on one specific domain. Their surface area is a lot tighter, both in terms of the questions being asked and the data that can answer them. So their needle-in-a-haystack exercise is different. And you're gonna see the same types of companies come up in law. Like, how do I cite legal cases that actually exist, so I don't get chewed out by a judge and told to, like,
you know, go f off. And you're going to see that in each little domain where there's regulation, where there's penalties, and where you can sell that higher quality. I think the challenge that Anthropic and OpenAI and all these guys have is that there are really two markets right now: their chatbot market and their coding market. And so they'll care about citation in coding stuff.
The rest, they're just like, all right, how do I drive down hallucination in citation? They do have citation benchmarks. There are benchmarks and evals for people to go judge their ability to correctly name things without hallucinating. But coming back to what you're saying, I think the challenge here too is that, with LLMs, you also have to worry about multiple stages in the pipeline.
Drew Breunig:
So what I mean by that is, there are different stages when you build the pipeline. You have pre-training, which is when you train on the super-messy, Common Crawl-type data that builds up your base English capabilities, or base language capabilities, and establishes your knowledge base. Then you have post-training. Post-training is when you teach the model how to talk with an interface. That's when you train it to reason, that's when you train it to chat and go back and forth.
That's when you train it to use tools. And then after that, people might fine-tune it, or they might put further tools on top of that: data, RAG, other similar things. And so what you're talking about is providing function from basically post-training all the way through to fine-tuning, to tool deployment, to the framework around it, to the actual application. It's this wide spectrum of applicability
that also has different pricing terms as you start in. And the problem I have with paying for it is, I worry about... it's one thing if you're Reddit and you cover everything. It's another thing if you're a really, really, really narrow niche, because, again, you're selling into a model that does everything. So how do they value that use case to justify your acquisition?
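The validate-and-confirm step Drew described earlier for the regulated text-to-SQL pipeline can be approximated with a dry-run check: parse model-generated SQL against the real schema before trusting it, so hallucinated tables or columns fail fast. The schema and queries below are invented for illustration:

```python
# Toy "validate before you trust" gate for model-generated SQL.
# SQLite's EXPLAIN parses and plans a query without running it, so a
# hallucinated table or column is rejected before execution.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (provider TEXT, amount REAL, year INTEGER)")
conn.executemany("INSERT INTO claims VALUES (?, ?, ?)",
                 [("A", 120.0, 2023), ("B", 80.0, 2023), ("A", 50.0, 2024)])

def validate_and_run(sql: str):
    """Dry-run the query with EXPLAIN; only execute if it parses cleanly."""
    try:
        conn.execute("EXPLAIN " + sql)
    except sqlite3.Error as e:
        return None, f"rejected: {e}"
    return conn.execute(sql).fetchall(), None

rows, err = validate_and_run(
    "SELECT provider, SUM(amount) FROM claims GROUP BY provider")
bad, err2 = validate_and_run("SELECT npi FROM providers")  # hallucinated table
```

This only catches structural hallucinations; checking that the answer is semantically right is the harder "subsequently confirm" half of the loop.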
Jed Sundwall:
Yeah, well, I mean, so this is where we're...
Drew Breunig:
as I drink from a water bottle with an Anthropic sticker on it.
Jed Sundwall:
Cool, man. I wish I had an Anthropic sticker. I've still got Cloud-Native Geo stickers here. Nice, nice. Okay. No, I mean, so Alex brings up another really interesting point here that's very important. You're mentioning that if you're working in a very, very narrow space, the applicability of, you know, whatever you're putting out there is very broad.
Drew Breunig:
I have one of those, I just read it.
Jed Sundwall:
I am a hundred percent... my perspective is that the best gazelles create far more value than they capture, right? They should be the kind of thing that's only putting something out there that's quite small and simple, and you can vouch for it. And then what people can do with it: go nuts. If people can become billionaires off of it, that's great. With climate stuff, this is just what we have to acknowledge head-on. We actually talked about this right before we started rolling: we are at this point
Drew Breunig:
Yes.
Jed Sundwall:
where we are actually talking about making interventions to perturb the environment in order to protect the world as it suits humans, roughly however many of us there are right now, basically to cool it off. We're like, we're gonna make this decision now. We're gonna gather up a bunch of climate data, a bunch of information about the planet, and it will be used for us to manipulate the environment in a way that is much more deliberate than we have done in the past.
As we discussed, we've been messing with the environment quite a bit, not deliberately, but now we're gonna do this sort of stuff on purpose. This has huge, huge repercussions for global governance. And we do have to figure out models that can allow us to make huge volumes of data available reliably. And I would say they absolutely should be available to AIs. But who pays for that?
Somebody's gotta pay for it. And I'm with you: the answer should not be, well, we should get paid to do it.
Drew Breunig:
Well, I thought you were going to say the answer is not communism, or something similar.
Jed Sundwall:
No, but I do think... I mean, that's the other thing: we don't have the luxury of being too idealistic now. Ideally, it wouldn't be shaking down billionaires, but there are enough billionaires around that we should be shaking them down. I think philanthropy has a role to play here. I'm very interested in endowments for, you know, guaranteeing access to data over time. So there's something to be done here, but it will be...
Drew Breunig:
Mm-hmm.
Jed Sundwall:
This is a huge challenge. It's an exciting challenge, though. Yeah.
Drew Breunig:
I mean, I think that comes down to discovery. And I think that's one of the big challenges, which is... I mean...
So I shared a paper with Jed yesterday, which is brand new. A PhD student just came out with it yesterday. I'm gonna link it; wait, let me find the link I sent you. And it's about teaching LLMs to search for data and assess data.
Jed Sundwall:
yeah.
Drew Breunig:
And I think of it as a natural extension. You know, one of the first things that happened when ChatGPT came out, in that first year, is there were a lot of text-to-SQL applications. I think this is a further extension of layers upon that, which is: I'm going to understand a data source, build a representation artifact that is queryable,
so that then we can kind of query on top of that. And so I think we're starting to see these systems. And the good news is, here's the thing that I do think is incredibly valuable: you look at
this application and you can see why a company would fund it, because you can say, all right, would Databricks fund this? Would AWS fund this? Would Microsoft fund this? Would Tableau fund this? A hundred percent they would, because they want people to find more data, and the right data. Because if you find more data, and the right data, and it's valuable to you, you have to generate the compute to actually utilize it. And so I do think that we're going to see
things that are aligned with these functionalities when it comes to data discovery, because there is a huge market opportunity for it. And I do think maybe that's where the value gets put: not on access to the data, but on the discovery of the data, and the service of finding it. That, to me, would be a huge problem to solve for tons of enterprises that I've talked to.
Jed Sundwall:
Yeah.
Jed Sundwall:
Hmm. Okay. Well, very relevant to what I want to do, so I'm going to read it. So Camilla asks an important question: does the possibility of some sort of royalty model disappear with the complexity and lack of explainability of how inputs are ultimately used in these models at the end of the day? So yeah, basically, it's like, OpenAI, ChatGPT says, yeah, we just got the coolest data from
the Gates Foundation, here's our answer. You know, and it's like...
Drew Breunig:
Yeah, I mean…
Jed Sundwall:
A lot of people are gonna be like, okay, I trust your interpretation of this.
Drew Breunig:
Yeah, let me tell you a story based off that. It's one of the best ways I've found to learn about new companies, especially new models. And this is something... So at PlaceIQ, we cared about privacy a lot. We embraced new privacy mechanisms and regulations, we designed our systems with privacy in mind, and so I learned a lot about privacy during those eras.
Jed Sundwall:
Okay.
Drew Breunig:
OpenAI came out with ChatGPT, and they launched ChatGPT, and they launched the model. I knew something about how the model was made. And so the first thing I thought was: there are a lot of privacy issues inherent in this, especially because once you train the model, going in and selecting out the data it learned from your private data is basically impossible. You can only kind of add to it. You can't go in and surgically remove it.
So, just for fun, because I'm weird, I filed a CCPA request with OpenAI. The CCPA is a California privacy regulation that allows you to contact any company that has your data and say, hey, do you have my PII, my private, personally identifiable data? What is it? And I also have the right to correct it or delete it if I require. So,
Jed Sundwall:
Hmm.
Drew Breunig:
you read their privacy policy, and it was all about the accounts you create, when you create an account. It wasn't about the model or the training data they used for the model. They seem to have deliberately skirted that question, because it would be a really big question. But at the same time, it's still PII and they still have it. And I know for a fact that they have my website, because I know my website's in Common Crawl. And so I filed the request, and
this was in the first year after ChatGPT, and the person on the other end had no idea what to do with it. They're like, well, here's your email. And I'm like, no, no, no, I want to know about the training data. And they're like, I don't know. So I would go through periods of very quiet, and then it would get elevated, and then very quiet, and then elevated. And finally, they're just like, well, your email's not in our training data. We have processes for removing your email.
Jed Sundwall:
Hmm.
Drew Breunig:
So I used a prompt exploit to get my email out of ChatGPT. You can use all sorts of tricks to get around its alignment and safety protocols, and I did that. And I got it to say, Drew Breunig's email is... what Drew Breunig's email is. I'm not going to say it here. So I emailed them back and said, here is proof that you have my email. Somewhere in your data banks, it exists.
Jed Sundwall:
Yeah.
Drew Breunig:
And they're like, can you share the prompt? And it got elevated, elevated, elevated. And finally, they closed the issue, because they said, well, your email is actually something that could be really easily guessed. We could have learned it from other things and then inferred the naming pattern. And so that's how it came out. But this is the crazy thing: it's still my email.
Jed Sundwall:
Hmm.
Jed Sundwall:
man.
Drew Breunig:
So from a privacy perspective, it still happened. The email existed. Whether it guessed it or not is kind of immaterial. Especially if it guessed it, then it falls afoul of the CAN-SPAM Act, which prohibits using software to automate the guessing, the brute-forcing, of emails. If it didn't guess it, then it has my PII in its training data, which it almost certainly does. And I'm not gonna lawyer up and go fight this fight, but it's a good example of how even they
Jed Sundwall:
Yeah.
Drew Breunig:
can't tell what the model was trained on. And so, to Camilla's question: the royalty model does kind of disappear, because there are different scenarios you can plan for. Did the model just hallucinate it? Did the model figure it out based on the fact that it has seen previous patterns that are similar, and your question combined with the weights managed to evoke your email? Or is it recalling it from where it's buried deep in the depths of the weights?
And so there will be ways where they try to do this. Anthropic, just a couple months ago, had a big thing: hey, we can explain what's actually happening inside these models. And they could, but they had to train a special model just for explainability. And then they had to train a different model, a model that was the equivalent of, like, Claude 2, and then they had another one that looked at it.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Drew Breunig:
And then it would have to go through the output, and it was like two expert researchers would have to spend two months of their time just unbundling all the traces to figure out what actually happened for one query. So it's not a scalable mechanism, and it doesn't even work on the largest models. So yes, no one knows where the data is coming from. In fact, a lot of people say that's why reasoning models are a net good, because you can kind of see the logic of how they arrive at their conclusion. But...
Jed Sundwall:
pray.
Drew Breunig:
I think it is. Yeah, it’s a challenge.
Jed Sundwall:
Yeah. Yeah. I mean, well, this is, again, sort of going to the philosophy of Source, which is that you should be able to view source. If the model can't be explained, then whenever possible there should be some sort of auditable layer of data. That's not always going to happen, but... I'm going back to Alex's point about climate data. If we're talking about environmental data that's
Drew Breunig:
Yeah.
Jed Sundwall:
deliberately being shared so that we can impact the environment, it's an impact on everyone. There are layers of the internet that have to be auditable. And yes, the large companies are gonna wanna have plenty of secret sauce in their models, but there's some stuff that can't be secret. We should fight for it being auditable.
Drew Breunig:
But like, so then like.
I don’t know. I don’t think you can make the model auditable. I think we’re past that term.
Jed Sundwall:
Yeah, I agree. I don’t think we’re making the models auditable, but at least you should be able to say, we know where some of this data came from and you can do your own research if you want to.
Drew Breunig:
Yeah, it's funny, because I do think a lot of labs would like that. The problem is that they see their competition as actively stealing their stuff. And so, how do you enforce that internationally is the big question that comes in. And then also the desire to not fall behind other countries, I think, is the other issue. You start to get into the politics of the thing.
Jed Sundwall:
Yeah. Yeah. Yeah.
Drew Breunig:
So, I don't know. I do think... getting into the goal of training data products so that LLMs can understand them: is that what you're angling at? And if that's the case, I think... Brian Bischoff talks about the map versus the terrain when it comes to creating data systems that LLMs can query. You do have to create that thing that fits within the context well and allows them to kind of
navigate it, negotiate with it.
Jed Sundwall:
Right. And that's what I'm saying, that's what I was trying to say before: we could create a great catalog at Source Cooperative, and talk to our friends (I need to make friends at Anthropic; at OpenAI I've got a few) and be like, do you want to use this catalog? And if you use this catalog, are you willing to pay to access stuff from it? How do you train a model to know what data is worth paying for versus not paying for?
Drew Breunig:
Mm.
Jed Sundwall:
I don't know. I mean, I don't know if it could just be sort of a brute-force thing, which is to say OpenAI agrees... I'm going to use the Gates Foundation again, you know, the Gates Foundation maintains a lot of useful data. Actually, a better example is USAFacts, from another Microsoft guy, Ballmer, who created USAFacts, his nonprofit that shares statistical data. Yeah. Fantastic. And they say, like,
Drew Breunig:
Great outlet.
Jed Sundwall:
OpenAI is like, Ballmer can't afford to keep this thing running himself, so we're going to pay. This is where the argument falls apart with both Gates and Ballmer: these are groups that do not need to be charging for access to this data. But still, we want to have a market for data, to make sure that it's continually being produced. Yeah.
Drew Breunig:
I do think one of the things there is getting into the provenance stack. If you're merging datasets, you're going to have a ranked stack order for which ones you trust more than others. And so I think that's the service that may be the thing: validating and normalizing the data so that it can be referenced confidently.
That, to me, is the service to provide. Because I love the question of, when does an LLM know when to pay for data? Or when does it present that option? And, like...
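Drew's "ranking stack" for merged datasets could look like this minimal sketch. The source names, trust scores, and record values are all invented for illustration:

```python
# Toy provenance-ranked merge: when two sources disagree on the same record,
# keep the value from the more trusted source, and remember where it came from.
TRUST = {"census": 3, "usafacts": 2, "web_scrape": 1}  # invented ranking

def merge(records):
    """records: iterable of (key, value, source). Per key, keep the value
    from the highest-trust source, along with its provenance."""
    best = {}
    for key, value, source in records:
        rank = TRUST.get(source, 0)  # unknown sources rank lowest
        if key not in best or rank > best[key][1]:
            best[key] = (value, rank, source)
    return {k: {"value": v, "source": s} for k, (v, _, s) in best.items()}

merged = merge([
    ("school_enrollment_2023", 49_500_000, "web_scrape"),
    ("school_enrollment_2023", 49_600_000, "census"),
])
```

The useful by-product is that every merged value carries a source label, which is exactly what an LLM answer would need to cite confidently.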
Drew Breunig:
What do you think goes into that question? Like, what do you think are the inputs that you can think about in that one?
Jed Sundwall:
Well, right. So, I mean, again, what I've said about Source is that what you find at Source are files that have been put there by people or organizations who you may or may not trust, but you at least know who put them there. And so then the question is how... we're building a UI for Source where we want people to be able to tell at first sight whether or not the data they find there is worth their time. We then have to answer the same question for a model.
Drew Breunig:
Yeah.
Jed Sundwall:
It's like, is this worth my time? Is this worth me spending some of my research budget on? And I think part of that just has to be brute-forced through partnerships, to say that OpenAI recognizes this as a useful data source. Does it make sense to charge for the data at that point, based on some metering thing, you know, at fractions of a penny? Or is it just a partnership where they pay to go in and out and get as much as they want? I don't know.
Drew Breunig:
Yeah, and that's the thing: you are the one vouching for the data. I think that's the service being provided. But then you'd need to be a quality clearinghouse.
Jed Sundwall:
That's right. Okay, so here we go, and then we've got to start wrapping this up. But there's a bunch of stuff we can do once we have these files and we know who produced them. We can also have DOIs, right? So bear with me if you cringe at the notion of DOIs, as I sometimes do, but we can say this data actually gets cited a lot.
We could track how many citations the data has gotten. We also have metrics we want to share about how much the data gets used. Hugging Face is great at this on their datasets product, which I love; it's kind of the object of my envy. There's so much signal when you get to a Hugging Face dataset landing page, a lot of signal for you to be able to tell whether it's being used or not. And that's one way of motivating it. It's the way you shop on Amazon: it says this is a best seller, so you're like, okay, if the whole market agrees this is a good thing to buy, then it's probably good enough for me to buy too. But it's a matter of communicating that both to humans and to agents.
Drew Breunig:
I think, I mean, maybe you need to build a benchmark. Maybe you need to build a benchmark on quality retrieval from Source datasets, which is: can you correctly augment and cite without hallucination? Because that's the challenge: you may get the right pull, but then you don't adhere to the prompt and you rely on something in your weights.
Jed Sundwall:
Yeah.
Drew Breunig:
So it's kind of like recall on a moving-target dataset, which I think is a really interesting idea.
Jed Sundwall:
Hmm. Okay. Well, I’m going to have to talk to you about that another time.
Drew Breunig:
Because, I mean, that's the challenge. You have a bunch of data, you want to check against it, and then validate that it actually repeats back what it should. Because I think that's the thing: having high-quality data isn't enough. You need high-quality data, and then you need to ship the yardstick for measuring that quality when the data is actually put to use.
Jed Sundwall:
Yeah, right.
Jed Sundwall:
All right. Well, the lesson from this conversation is benchmarks. We've got to talk about benchmark design, not just designing great data products.
Drew Breunig:
Well, I think benchmarks are, because this comes back to Common Crawl. Common Crawl didn't do anything to its data; it just made it easier to access, didn't make any choices or anything like that. But I do think it's a really good exercise: all right, if Gil launched Common Crawl to build a better Google, or to build better information recall, or to not have Google monopolize it,
the benchmark he should ship is: you're building a search engine, so here are the queries and here are the record IDs you should be finding to answer each query, and you can start to rank against it. And even if you don't ship a benchmark, performing the exercise of what benchmark you would ship for the data product you're looking to ship is a good exercise, because it forces you to say, well, what do I want people to be able to do with this?
And then it focuses the way you package it up.
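The queries-plus-expected-IDs benchmark Drew sketches could look something like this. The queries, record IDs, and `retrieve` interface are all hypothetical stand-ins, just to show the shape of the exercise:

```python
# Minimal sketch of a retrieval benchmark: fixed queries paired with the
# record IDs a system should surface, scored by average recall@k.
BENCHMARK = [
    {"query": "coffee shops in Portland", "expected_ids": {"rec-101", "rec-205"}},
    {"query": "bridges over the Willamette", "expected_ids": {"rec-310"}},
]

def score_retriever(retrieve, k=10):
    """Average recall@k across benchmark queries.
    `retrieve(query, k)` should return a list of record IDs."""
    recalls = []
    for case in BENCHMARK:
        returned = set(retrieve(case["query"], k))
        hits = returned & case["expected_ids"]
        recalls.append(len(hits) / len(case["expected_ids"]))
    return sum(recalls) / len(recalls)

# A toy retriever that always returns the same two records:
toy = lambda query, k: ["rec-101", "rec-310"]
score = score_retriever(toy)  # finds half of query 1's records, all of query 2's
```

Even this toy version forces the question Drew raises: writing down the expected IDs means deciding in advance what you want people to be able to do with the data.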
Jed Sundwall:
Yeah, I mean, so many consequential decisions come from that. So, okay. Well.
Drew Breunig:
So who’s building the temporal benchmark? Who do we assign that to?
Jed Sundwall:
Tyler Erickson, that's who will build it. No, that's useful feedback from Tyler; we actually have a bit of funding right now to work on some GeoAI benchmarking work. So, yeah. Anyway, this has been awesome. I couldn't be happier with our first episode. We'll get this out there. We've got a link; Michelle cooked up a website for CNG Conference 2026. So mark your calendars. It's official. It's on, and there's a URL.
Drew Breunig:
Thought that was my voice, Tyler. Yeah.
Jed Sundwall:
So, yeah, Drew, I announced this while you were answering the door. We're doing CNG 2026. Same location, but in the fall, the sixth to ninth of October. No snow, which some people…
Drew Breunig:
Ooh, so no snow. That’s a big plus for me.
Jed Sundwall:
See, thank you. I'm glad you said that, because people are like, no, I like skiing. But only a few people have the time and energy to ski. I think most people wanted to get out on the mountain and just couldn't because there was too much snow. Anyway, thanks so much, man. We are going to do this again; I predict you'll be a many-times repeat guest. And thank you everyone for tuning in. This has been a lot of fun to do with a live chat.
Drew Breunig:
Yeah, I know.
Jed Sundwall:
Really appreciate everybody who chimed in. Anything else? Do you have anything to plug? Okay.
Drew Breunig:
Awesome. Well, no. I'll be at the Spatial Data Science Conference talking about GERS, talking about standards. And yeah, now I'm just thinking about evals, man.
Jed Sundwall:
Oh, good.
Jed Sundwall:
All right, well, stay tuned. We’ve got some work going on evals too. So, all right. Bye everybody. Thanks. Bye. Bye.
Drew Breunig:
Talk to you later, Jed. Always a pleasure. Bye.