Open Data
Video also available on LinkedIn
Show notes
Jed talks with Kwin Keuter and Brad Andrick, geospatial software engineers at Earth Genome, about the Storm Events Database Explorer. This collaborative project between Earth Genome, The Commons, and the Internet of Water Coalition provides access to over 1.9 million U.S. severe weather events spanning 70+ years of NOAA’s National Centers for Environmental Information (NCEI) storm records, including tornadoes, floods, hail, and hurricanes.
The conversation explores how Earth Genome approached transforming decades of federal storm data into an exploration-ready dataset with multiple access modes. Kwin and Brad discuss the data quality challenges they encountered, from changing event types over time (at the outset, the dataset only recorded tornadoes) to inconsistent location data across different event types. They explain their design process, which started with user surveys targeting meteorologists, emergency managers, insurance professionals, and local government planners to understand pain points and workflows.
Throughout the discussion, they emphasize Earth Genome’s philosophy of creating as many modes of access as possible: a visual Explorer interface for non-technical users, downloadable CSVs for traditional workflows, a programmatic API for developers, and cloud-optimized Parquet files on Source Cooperative for data scientists. The conversation touches on broader themes about marketing data products, the economics of sustaining open data tools, and the role of government in producing core datasets while enabling external innovation.
Links and Resources
Key takeaways
- Multiple access modes serve different users — Earth Genome built a visual Explorer for planners, CSV downloads for traditional workflows, an API for developers, and Parquet files for data scientists, recognizing that different users need different interfaces to the same underlying data.
- Historical datasets require careful handling — The Storm Events Database only recorded tornadoes from 1950 to 1996 before expanding to 55 event types. Working with evolving data structures across 75 years requires thoughtful design to present both historical and modern data in semi-standardized ways.
- Data quality issues are inevitable — From FIPS codes that don’t match known counties to varying location representations (points for tornadoes, polygons for heat waves), real-world datasets contain inconsistencies that must be addressed through ingestion pipelines and documented for users.
- Marketing data products requires ongoing effort — Building the tool is one thing; driving awareness and usage requires conference presentations, blog posts, webinars, and community engagement. The team emphasized that making data easy to use means more than just posting it—you have to actively get it in front of people.
- Government should focus on core data collection — Brad and Kwin discussed the value of federal agencies prioritizing primary data collection and publishing over building every possible user interface, allowing external organizations to innovate on top of stable, open datasets.
- Feedback loops remain missing — Despite building a valuable tool on top of NOAA data, Earth Genome has limited direct engagement with NCEI. Creating channels for data users to communicate with data stewards would improve data quality and help agencies understand the value their datasets provide.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
It’s 10 AM and we’re live. And thank you so much. All right, Brad and Kwin and everyone out there. Welcome to Great Data Products. This is our live stream webinar podcast thing about the craft and ergonomics of data. It’s brought to you by Source Cooperative. We’ll be talking about Source Cooperative a bit today, which is exciting. One quick bit of housekeeping before we start. Another thing that we do at Radiant Earth is the CNG forum. And just yesterday, we published a blog post about the upcoming event that’s happening in October. Early bird tickets are still available and there are opportunities to sponsor.
submit a talk, do a workshop. Brad and Kwin, I believe, both spoke at the first one last year, so they can attest it’s a great event. But anyway, I’ll put a link in the chat and in the show notes. It’s a great event. And, you know, you don’t want to miss it. I think it is very quickly becoming the conference of choice for data professionals. So today we’re joined by Brad Andrick and Kwin… Coyter? I should have figured this out beforehand. Cuter?
Kwin Keuter:
It’s [pronounced] Kiter.
Jed Sundwall:
Wow, two strikes. We’re going to talk about the Storm Events Database Explorer. Before we get into it, do you mind just introducing yourselves? I don’t care which order you go in.
Kwin Keuter:
There’s no other way.
Brad Andrick:
Kwin, you want me to? All right. We talked about this ahead of time, that we were just going to shift back and forth, but I’ll run with it. Thanks, Jed, for having us. My name is Brad Andrick. I am a software engineer at Earth Genome, where Kwin and I both work. I’ve been in the game for a little over 12 years. My focus is in digital cartography, geospatial software engineering,
Jed Sundwall:
Yeah.
Kwin Keuter:
You feel that?
Brad Andrick:
been public sector, private sector, now in the nonprofit world. And I love it. I can give more of an Earth Genome thing in a minute, but I’ll let Kwin give his response.
Kwin Keuter:
Yeah. Yeah. Hey, I’m Kwin Keuter, also an engineer at Earth Genome. My title there is Geospatial Software and Data Engineer, which is the most verbose title I’ve had, but it is accurate. Basically, I build data pipelines and APIs. And I also dabble in DevOps a bit. So, yeah.
Jed Sundwall:
All right. Well, thanks for coming on. This is an exciting one for me because we’d worked a bit. I mean, we actually didn’t work that closely on this project or anything like that, but I remember talking to Mikel about it months ago and sharing some of my thoughts on this project of an Explorer, but also exposing the underlying data in Source Cooperative
so that people can do other stuff with it. That’s what we’re doing. That’s always sort of the dream for me: to have an interface that’s accessible by a lot of people and really beautiful and elegant, but also making sure that people can get under the hood and get the data. So can you talk a little bit more about the project and the genesis of it and who the audience is and things like that?
Brad Andrick:
Sure. So I’ll give the genesis and Kwin can tell me what I get wrong and what I get right. So the Storm Events Database exists as a product of NOAA. And specifically, NCEI over in Asheville, the National Centers for Environmental Information, I think it is, puts out this Storm Events Database.
And in short, the Storm Events Database captures storm events that occurred from 1950 to today. And it includes some unique things inside of it. So one of those will be an event type. It could be a hurricane. It could be a tornado. Wildfires are in there. There’s also a data narrative, which is pretty interesting. So it’s about the particular event. And then…
that exists and is put out as a CSV sort of product, as a lot of federal things kind of end up being at some point. And there is a user interface where you can search through the database. In addition to those fields about the storm event itself, there’s also the impact that it had. So that can be property damage, and deaths and injuries are also recorded inside of that data set. So that’s what that data set really is
today if you go to NOAA’s site itself. Now that’s a little bit different than the Storm Events Database Explorer. So the Explorer is kind of our designation on top of that and the project itself. So what that ended up being is, there were some challenges that come up when you have CSVs that you need to download, or those user interfaces where you’ve got a couple of filters and then you’re getting pieces of the data here and there out of it.
But a lot of this data is spatial, so there wasn’t really a great visualization to capture that. And in general, we thought that there were some value adds. And I should mention the genesis of this project is not Earth Genome. Like, we didn’t decide, hey, we should make this, it’s great. It actually was a coalition to start with, the Internet of Water Coalition, and they had an idea for this project, working with Duke’s
Brad Andrick:
Nicholas School of the Environment, and then also The Commons, which is another nonprofit, that all kind of started working on this project idea and needed some of the technical guns to come in and kind of facilitate actually creating it. And that’s where Earth Genome came into the mix to actually build out the tool. I kind of stopped short of saying what more the Explorer part of it got into, but Kwin, I’ll let you go a bit.
Kwin Keuter:
Yeah, yeah, we’ll get to that. But yeah, I think that this NOAA data set is really unique in that it brings together all these different types of, broadly speaking, weather events. Brad mentioned a few of them. I think there’s 55 different event types that are tracked. And it’s really more about the sort of narrative elements of these events rather than like the
Jed Sundwall:
Yeah, sure.
Kwin Keuter:
hyper-detailed gridded data that you might see, you know, in like NetCDF format, for example. It’s not that; it’s more about what happened with this event, you know, this tornado or hurricane, either in a very specific place or across a broad region. What happened? So the narrative texts, like Brad mentioned, are really useful. But also, how do you take
everything from a tropical storm to a wildfire to a hailstorm, and how do you make them all sort of have a similar format? And that’s what NOAA and NCEI have been able to do. And I really think that the best parts of that are, yeah, what were the economic impacts and how were people affected, you know, in terms of safety, with fatalities and injuries?
I’m not really sure where else you would go for all of that. So, yeah, like Brad said, you know, clearly this data set has a huge amount of potential, or it’s just really valuable. But if it’s in a CSV, or, oh, you can query it but you can only see like a thousand events at a time or whatever, you’re not really going to get all the potential out of it, I don’t think. So that’s what we tried to do here.
Jed Sundwall:
Yeah, that’s great. No, I mean, it’s a perfect example of what’s possible now. You know, just over the past not that many years, it’s become so easy to see lots of things in the browser and interrogate data that way. And I don’t blame anybody for using CSVs. I love CSVs, CSVs are great. But it’s really interesting. I’m sure you have this very lived experience now, having worked with this data, to see.
This data goes back to, you said, 1950? Amazing. So you just have to imagine sort of the constraints that people were under in terms of how they thought about what was even possible to share.
Brad Andrick:
1950.
Brad Andrick:
And we might get into this. We will get into this a bit more, I’m sure, but that’s one of the interesting things about this data set and, like, the data discovery portion of this project: in 1950, it only recorded tornadoes. And so, yes. Um, Kwin, was it 1995? Or 96, uh, was the transition over to a bunch more event types. And there’s even more now, the 55 that
Jed Sundwall:
really? Okay. Yeah.
Kwin Keuter:
Yeah, 96, I’d say, I think.
Brad Andrick:
Kwin mentioned. So working with a data set that for so many years was just tornadoes, and then you’ve got all of these other event types that come in, making sure that you can visualize both of those different viewpoints over time in a semi-standardized way was a really interesting challenge to go through.
Kwin Keuter:
Yeah, and just to echo the 75 years’ worth of data: there’s actually some great sort of storytelling on NOAA’s website about the just incredible journey that the data went through, you know, going from basically typing out documents and filing them in whatever paper filing archives they had, to
whatever was, uh, the database of choice in 1980, and then just the rapid acceleration of different developments in our space. And now here we are. That’s, I think, nothing short of heroic, to shepherd a dataset through all that time. So we’re just, yeah, we’re lucky that we get to
Jed Sundwall:
Absolutely.
Kwin Keuter:
to be on the tail at this point and to get to explore this data and try to make it come to life. Yeah, that was really fun and that’s what we’re here to talk about.
Jed Sundwall:
Yeah, no, it’s really cool. I mean, you know, for me, just because, you know, I have some biases here, I’m very excited to see the data in Source, and in particular that you all created a Parquet file. And I’m going to pull up, I have the link here. I’ll also put it in the chat. But, you know, you can now.
I mean, that’s kind of a janky URL, I guess, to share in the chat. I don’t know if people could go to that. But, you know, that loads up really quickly, over 2 million rows of data that you can sort. I mean, for people who don’t know how Parquet works, the sorting experience might be a little bit strange because it streams in data. So the different columns will come in. But one of the first things I did when you put up this
Parquet file was, like, just sort by number of deaths. It’s like, what is the most kind of morbid, you know, intriguing statistic there? And it’s, you know, it’s Katrina. Katrina shows up and there’s a narrative about it. And it’s fascinating to think that we can just, you know, I mean, I don’t remember anymore the limit that Excel has, 60-something thousand rows, where when I was in grad school I’d always be frustrated. Like, well, you get a
CSV or some sort of file and you just can’t open it. And now we’re at this point where it’s like, you just send somebody a URL and they can see all this data. If you think about it, imagine 2 million records in 1950, how they would have thought about that. A 2-million-page book, you know, a 2-million-card card catalog or something like that.
Kwin Keuter:
Yeah, and the fact that the queries that you can do on a Parquet file like this are basically only limited by what are the columns there and then how much SQL do you want to learn. Now you probably don’t have to really learn hardly any SQL at all. You could just have an LLM write it for you. But yeah, and then the queries are really fast.
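The kind of query Kwin describes can be sketched with a tiny, self-contained example. In practice you would point DuckDB at the hosted Parquet file and run the same SQL directly against it; here the table, column names, and rows are all illustrative stand-ins, and stdlib SQLite is used only so the SQL itself runs:

```python
import sqlite3

# In practice this SQL would run via DuckDB directly against the Parquet
# file, something like:
#   SELECT event_type, state, deaths_direct
#   FROM 'storm_events.parquet'
#   ORDER BY deaths_direct DESC LIMIT 5;
# Here we build a tiny in-memory table with made-up rows so the query runs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_type TEXT, state TEXT, deaths_direct INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("Tornado", "OKLAHOMA", 36),
        ("Hurricane (Typhoon)", "LOUISIANA", 520),
        ("Hail", "TEXAS", 0),
    ],
)

# The "sort by number of deaths" exploration Jed mentioned:
deadliest = conn.execute(
    "SELECT event_type, deaths_direct FROM events "
    "ORDER BY deaths_direct DESC LIMIT 1"
).fetchone()
print(deadliest)  # the made-up event with the most direct deaths
```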
Yeah, I’m glad that you’re excited about that, Jed. It is just one sort of mode of access that we’ve built into this product. I think that this product is sort of exemplary of a lot of what we do at Earth Genome, where we try to create as many modes of access as possible because…
Naturally, someone’s first impression of the data is probably going to be going to the first URL that you shared, the Storm Events, Internet of Water app. They want to see what it looks like. But then they might really love CSVs; we have those. They might want to access an API programmatically; they can do that. Or I’m hoping that they’ll be like you and they’ll be excited to try out these Parquet files.
Jed Sundwall:
Yeah. Well, so yeah, let’s, let me ask though, like who are these sort of personas? you, I’m curious to know like what sort of thought goes into creating a product like the Explorer and who it’s for and how it’s going to be put in front of people and who those, yeah, who those people might be.
Brad Andrick:
Yeah, so part of the process that we go through for most of our projects, and in some ways this process I think will be changing, maybe a bigger discussion on what that looks like for the industry, is we always engage with our users as much as we can. So we started with a user survey. There was some initial work before we actually got handed the project
to think about people who are in the space already, to think about, okay, what would make this better? And after reviewing that, we also went out to the wider team at the Internet of Water and said, hey, we want to talk, I’m going to send out a survey basically to more people. And so we did. We created a survey with 20 different questions: user backgrounds, their current workflow through the existing Storm Events Database, their pain points and challenges,
preferences for new features, and then any final overarching thoughts. And the coverage, with the people it got sent out to, had meteorology and climatology wrapped up in there by a few different people, emergency management, kind of disaster preparedness folks. There was insurance in there as well, real estate.
And that’s where some energy policy, I think, might have been in the mix as well. So when we kind of think about those users, that’s roughly the mix of people. So it spans scientists to someone in the insurance industry thinking about risk, or a local government that’s planning for, this area had this much impact from a disaster and this was the cost for it, right? We need to understand that risk, plan for finances, and all those sorts of things.
Kwin Keuter:
Yeah. Brad, did you mention local government planners? I think that was one on our list of target users. And I think that represents a segment for whom this would be an especially useful product: people who need this type of data, but maybe they don’t have like an army of
Brad Andrick:
But… Yeah.
Kwin Keuter:
you know, data scientists, like an insurance company might have, to build some detailed model. But if you’re a mid-sized city or a small city and you just need to quickly access weather event narrative data, I’m hoping that people will hear about this so that they could be like, yeah, now I just go here and it’s just that easy. So I think that especially the people who don’t have
resources or expertise, you know, I’m hoping it gets to them.
Jed Sundwall:
Right. Yeah, I mean, that kind of persona of somebody who doesn’t, they don’t know how to write a SQL query, they don’t know what SQL is. And this is also a fascinating dataset because it’s the kind of thing that I think a lot of people imagine does exist or should exist, you know? Like, yeah, surely somebody has a dataset of all the storms. But to find that, and then to be able to, as you all were saying before,
query it in any sort of useful way by geography, or interact with it, was just too much of a lift for a lot of people. So this Explorer, hopefully people will find it. I am curious to know, though, and this might not be part of your job description, but do you know of, like, are there plans to get it into classrooms or train people on it or make sure people know about it?
Brad Andrick:
I don’t know of anything like that. I would love for that to happen. So we did the last step of the project cycle, and right now we’re kind of moving to maintenance mode. We run updates, Kwin runs updates on the project, to keep the data up to date. But that’s kind of where things live right now. The last step, though, before we kind of moved into that mode was communication, in a way, which was mostly
Jed Sundwall:
Okay. All right.
Jed Sundwall:
Okay.
Brad Andrick:
blog posts, so the Commons case study. We did a blog post outlining our process. Then there was a webinar run by the Internet of Water Coalition that Kwin and I were on, and we gave a walkthrough of the tool. 20 or 30 people, I think, were on that at the high point. And then finally, there also was a talk at FOSS4G North America this year up in Reston
that kind of outlined a little bit more of the technical details, but still brought up the project. But that’s kind of where that project term ended right now.
Jed Sundwall:
Yeah. Well, you know, this is interesting. I mean, I don’t want to put anybody on the spot, but that’s the tricky thing with a lot of these things: we’re at this point where it’s kind of cheap to create something like this.
But then it’s a living thing, presumably, that needs to be stewarded for a long time. So anyway, this is a plea to the funders listening in: you should continue to fund these kinds of things to make sure that we can drive usage of them. I’m curious to know, what’s NCEI’s involvement here, other than just kind of being where the data comes from? Are they at the table?
Brad Andrick:
Not really, which is interesting. I’ve reflected on that myself. I’m like, that’s interesting. So I know with the Duke school, they had some people that they worked with directly at NOAA to get some opinions and consult on, like, the categories. So one of the things that’s different between the original dataset and what you’ll find in this product is a categorization. And so there was a lot of time
Jed Sundwall:
Yeah. Yeah.
Brad Andrick:
that went into figuring out where those breaks should be, how we should group things together. And so that had some NOAA consult back and forth. But as far as the NCEI Storm Events Database and anybody that’s currently staffed on that project, we didn’t really have much of a relationship with them. That said, we’ve been following along with what they’re working on. So that includes the data releases. But more than that, there has been an initiative to
redo the user interface that currently exists. And it’s interesting to follow, because the last update that I had heard was from a presentation last summer, midsummer, that talked about a September release of their beta. So if you go to their website, you have to dig a little bit, but you can find a beta version. Interestingly, it does not have the map
visualization. It has a filter-by-map feature now, which is very helpful. But it doesn’t have a visualization; it’s not putting all of the points on a map or anything like that. But that beta version hasn’t been released yet. And who knows why. It could be funding, right? We don’t know. That’s an assumption. It’s a weird time. But there was a parallel project that, at least our awareness of it started after
Jed Sundwall:
Yeah, well, it’s a weird time to be Noah. Yeah.
Brad Andrick:
our project was already underway. And we finished our development work last September. And yeah, I’ve been waiting to see where that beta would go. Before this call, I was like, let me follow up on LinkedIn with the product owner that I saw from the webinar, just to kind of see. And they were just somewhere giving a presentation about something related to the Storm Events beta that’s still underway.
The deadline has passed, but it’s still moving somewhere. So that’s at least a good sign.
Jed Sundwall:
Yeah. Okay. Well, that’s good to know. Yeah. It’s such an interesting data product that NOAA has here. And as I was looking at it, you know, thanks to your Explorer and also the Parquet file, it’s fun to be able to go back and look at this old data. And then also to understand that, like, there’s certainly,
there are events where there’s clearly bad data in there, I think. In the sense that, like, I’m pretty sure Katrina didn’t, sorry, I’m so morbid, I’m like, I only care about deaths. There are the direct deaths and then there are the indirect, you know, fatalities or whatever. And I think they’re at zero for Katrina, which is almost certainly not true. But I guess, did you have to deal with this as you were going through?
Certainly fields were added over time, you know, for which there’s just no data for historical events. How much of your work was caught up in that kind of stuff?
Kwin Keuter:
It was a few months of work for me, I mean, alongside other projects that I was working on separate from this. But yeah, I spent quite a bit of time digging into the data, you know, as it’s presented by NOAA and NCEI, and trying to make sense of where some of those gremlins might be hiding. Yeah, so I don’t have an answer for
why indirect deaths might be undercounted in certain events. On the other side, the dataset also has property damage and crop damage. I spent a while looking at that because, one, the way that those damage numbers are presented, it’s not just numerical. It’s not just like “a million dollars.” It’s like a number and then a capital letter.
You know, you have to parse these text fields, and, you know, so I got that working and that’s all fine. But then in about April or May, Mikel Maron, our boss here at Earth Genome, he mentioned NOAA’s billion-dollar disasters dataset. Because we were wondering, like, okay, well, that’s got, you know,
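The damage-field parsing Kwin describes can be sketched like this. The Storm Events CSVs encode damage as a number plus a suffix letter (e.g. "10.00K", "2.5M"); the function name and edge-case handling here are illustrative, not Earth Genome’s actual pipeline code:

```python
def parse_damage(text):
    """Parse a damage string like '10.00K' or '2.5M' into dollars.

    Sketch only: the K/M/B suffixes match the convention used in the
    Storm Events CSVs, but the real data has more edge cases than this.
    """
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}
    text = (text or "").strip().upper()
    if not text:
        return 0.0  # empty field: treat as no reported damage
    if text[-1] in multipliers:
        number = text[:-1]
        # A bare suffix like "K" is read as 1 of that unit.
        return (float(number) if number else 1.0) * multipliers[text[-1]]
    return float(text)

print(parse_damage("10.00K"))  # 10000.0
print(parse_damage("2.5M"))    # 2500000.0
print(parse_damage(""))        # 0.0
```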
Jed Sundwall:
wow.
Jed Sundwall:
yeah? Yeah.
Kwin Keuter:
property damage information, like the Storm Events Database has. How similar are they? Are the numbers exactly the same? So I spent a little bit of time comparing those numbers. And the billion-dollar dataset, I’m assuming that because it’s been focused on figuring out the overall economic impact of a large-scale event, they probably dialed in the methodology for counting that damage
more precisely than perhaps this Storm Events dataset has. So typically the billion-dollar dataset numbers are higher, probably because they’ve done a more complete sort of evaluation. But then it was interesting, because then in May, NOAA announced that they were going to retire that billion-dollar dataset. So that caused some…
Kwin Keuter:
I wouldn’t say alarm on our part, because everything looked fine for the continuity of the Storm Events Database from now on, but we just had to wonder, like, is this going to be next? But so far it’s been kept up to date every month, except for during the shutdown.
Jed Sundwall:
Right. Yeah, gosh. I mean, it’s so, you know, it’s inevitable that if you look at any kind of data set like this, you’re going to get into all these really interesting questions where you see a number there and it’s like, this is how much damage there was. And you’re like, well, how do you know? Like, what model did you use? And to your point, you could use whatever model the economist who runs the billion-dollar disaster thing uses. Who, by the way, the guy’s name is Adam Smith. He’s an economist at NOAA, or used to be.
Jed Sundwall:
because of presumably Doge or something like that. Not there anymore. anyway, it’s just funny. He’s an economist named Adam Smith. I don’t know him, but anyway, so he has a model to figure out what counts as a billion dollar disaster. And the point being that whatever number is in those cells is like, there’s a lot of thinking that has to go behind that, that’s never documented. It’s very, very rarely documented.
Kwin Keuter:
Yeah.
Jed Sundwall:
Anyway, I just want to make sure that it’s clear that I’m not picking on you all if there are issues with the data. Sorry, but I want to hear from you. But that’s kind of why I’m curious to know if NCEI is involved, because I think one of the great things about this Explorer is that it should allow for a feedback loop to NCEI. But my guess is that there’s no one on the other end of the line.
Kwin Keuter:
Yeah, I would love to have that feedback loop too. In a previous job, I spent three years as a contractor at the US Geological Survey, working on the National Map and delivering those data products. I have found myself on the other end of the line. It was like, how do we talk to users? I would love to hear from
the users of these data products, like, what feedback they have for us. And maybe that was just not in my scope of responsibilities, but regardless, I would love to see more of that feedback loop, like you mentioned. Yeah, one specific sort of data quality issue
that I wrestled with, and I think where we ended up actually works pretty well, is the location data. So in the Storm Events dataset, again, you have 55 different event types. How do you represent where an event happened? For some event types, like a tornado, the database has an actual point location, or even a series of points where that tornado happened.
But how would you do that for a heat wave, which is another event type? You can’t really use a point location for a heat wave. So there, the location would be represented by the county or forecast zone that experienced that heat wave, or whatever the event type was. So right there, you’ve got two different things: points, and a boundary, a polygon.
The way that we merge those together is just by taking the centroid of that boundary. So that’s fine. But I also found that the way that those boundaries are assigned isn’t always consistent. There are some cases where they would say it happened in this state, and here’s the FIPS code for the county where it happened.
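The collapse-a-boundary-to-a-point step Kwin describes can be sketched with the standard shoelace centroid formula. This is a minimal illustration, not the project’s actual code; a real pipeline would likely use a geometry library and handle multi-part or self-intersecting boundaries:

```python
def polygon_centroid(ring):
    """Centroid of a simple polygon given as a list of (x, y) vertices.

    Uses the shoelace formula; assumes a non-self-intersecting ring
    (the last vertex implicitly connects back to the first).
    """
    area2 = cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross          # accumulates 2 * signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3 * area2), cy / (3 * area2))

# A square county-like boundary collapses to its center point:
print(polygon_centroid([(0, 0), (2, 0), (2, 2), (0, 2)]))  # (1.0, 1.0)
```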
Kwin Keuter:
And we’ve got a data set of, you know, county FIPS codes. And sometimes the FIPS code doesn’t line up with any known FIPS code in that state. So what do you do then? Well, there’s also the name of the county. So I have a little bit of logic in the data ingestion pipeline where it’s like, okay, if the FIPS code didn’t match up, we’ll try the name. But
despite that, there’s still about 1% of the events where, try as we might, we could not figure out exactly which county or forecast zone this event was in. All we really know is what state it was in. So that’s something where I’d be like, oh, I’d love to talk to someone at NCEI and say, here are the events that I flagged where I don’t have this location. Could you try to fix that?
And maybe they can’t, maybe that event was 30 years ago and there’s just no way to know, but it’d be nice to have that conversation.
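The FIPS-then-name fallback Kwin describes might look something like the sketch below. All function and variable names, and the lookup structures, are hypothetical, invented for illustration; the real ingestion pipeline presumably differs:

```python
def resolve_county(state, fips, county_name, fips_index, name_index):
    """Resolve an event's county: try the FIPS code, fall back to the name.

    fips_index maps (state, fips) -> county record; name_index maps
    (state, normalized county name) -> county record. Both are
    hypothetical structures for this sketch. Returns None when neither
    matches (the ~1% of events that end up flagged as state-only).
    """
    record = fips_index.get((state, fips))
    if record is not None:
        return record
    return name_index.get((state, county_name.strip().upper()))

# Tiny illustrative indexes:
fips_index = {("TX", "48201"): "Harris County"}
name_index = {("TX", "HARRIS"): "Harris County"}

print(resolve_county("TX", "48201", "harris", fips_index, name_index))    # FIPS match
print(resolve_county("TX", "99999", " Harris ", fips_index, name_index))  # name fallback
print(resolve_county("TX", "99999", "Unknown", fips_index, name_index))   # None: state-only
```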
Jed Sundwall:
Yeah, things are lost to time. I don’t know if you saw, I mean, this is a plug for a talk from CNG last year, where Drew Breunig, it was the last day, but Drew gave this talk about sort of the origin of railroad time, and how for a really long time, I mean, for most of human history, we didn’t have standardized time. The idea of time zones and being set to,
you know, Greenwich Mean Time was not a thing. And so if you showed up in another city, you’d have to find the town clock and be like, okay, this is how they keep the time here. And, you know, just what an unlock it was for us to get a standardized way of referring to time, how that enabled commerce and travel and things like that. And I’m just remaking his point for him. He’s like, we need that for place, you know. There is no
common way to refer to a place. There’s a FIPS code, you know, there are county boundaries and stuff like that, county names, but it can all fall apart really easily. And so standardizing that is really important.
Kwin Keuter:
Yeah. Well, another thing that I took away from Drew’s talk: he said, you know, it wasn’t just about railroad time. One of the points he made was that it’s not enough to just post data, you know, make the data open and just stop there. You have to actually make it easy to use. And so that’s where, again, it’s like, we had an open data set.
But here we tried to make it really easy for as many people who would care to use the data, just give them as many options as possible to easily access it. So yeah, again, a good plug for the CNG conference in Snowbird next October.
Jed Sundwall:
Yeah. I didn’t pay anyone to do this, so thank you, Kwin. Actually, I’ll put a link to Drew’s talk. The funny thing about it is that we failed to record it at the conference, so we had him redo it as a webinar and recorded that, and it’s up on YouTube. Well, I’m curious to get your take, then, on your ideal user, or how you’d like to see this data being used in the future, if you’ve given thought to it. You discussed before the personas for the Explorer, but also, for the data on Source: do you have any picture of who might be doing stuff with the Parquet file or all the CSV files that are there on Source?
Kwin Keuter:
Yeah, I mean, I would love to see the LinkedIn crowd that goes, hey, I had an idea, so I spent two hours on Saturday digging into this data set I heard about. I would love to see people in our community, the cloud-native geospatial community, just engage with this data. I have a bit of a hard time telling people how to do their jobs, because it’s not my job. I’m not a researcher, I’m not an insurance underwriter. I don’t know that I want to tell them, don’t do it that way, use this data set instead. But.
Yeah, so Brad, who’s your ideal user?
Brad Andrick:
It’s a great question to think about, because I think that’s honestly one of the problems with this sort of project we engaged in. It’s a great idea, right? And then we have some follow-on ideas too. We also have other projects that we need to work on, right? So how does that get picked back up? For us, it generally would be: someone has an idea, and they cobble together the funding to support it, and that turns into one of the many projects that might come back up underneath Kwin or me or somebody else. I think the communication part of that is something we need to do more with. ESIP comes to mind; it’s a great community to get out and talk with. They just had their conference, the virtual one, and I think the next one’s out in July.
So that’s probably the direction to engage more directly, because those are one-off events we can hit to try to spread the word. But we are, like many organizations, bound by funding, resources, other projects, and priorities. So figuring out where that follow-through is and how it happens is maybe a good question to think about.
Kwin Keuter:
Yeah, well, here’s a little anecdote, for whoever out there is listening, about how you could use the data. I live in Colorado, and in January, when much of the US, especially the Eastern states, was experiencing severe winter weather, in Colorado we had just a few inches of snow here in Denver, hardly anything, and daily highs in the 50s and 60s. So I was wondering, what are the historical winter weather trends in Colorado and in the US? Using the Parquet file and some queries against it, you can find out very quickly: January is the top calendar month for winter weather events in the US.
And that’s in terms of the number of events, the fatalities and injuries, and the economic damage. But in Colorado, with just a different query, it’s actually March and April that have the highest economic impacts from winter weather. And to me that was no surprise, because typically we’re patient about waiting for large snow storms; we know they’re not going to come until March or April. And as a consequence, that’s apparently when they wreak the most havoc in terms of economic damage. That took me maybe 15 minutes to figure out. But I think there is a sort of open question here, and I’m glad we’re talking to you about it, Jed.
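A query like the one Kwin describes can be run against the Parquet file with pandas (or DuckDB). The sketch below uses a tiny stand-in DataFrame with illustrative column names rather than the dataset’s actual schema; in practice you would start from `pd.read_parquet` on the Source Cooperative file.

```python
import pandas as pd

# In practice: events = pd.read_parquet("storm-events.parquet")
# Here, a small stand-in frame with illustrative column names.
events = pd.DataFrame({
    "event_type": ["Winter Storm", "Winter Storm", "Blizzard", "Winter Storm"],
    "state": ["COLORADO", "COLORADO", "NEW YORK", "COLORADO"],
    "month": [3, 4, 1, 1],
    "damage_property": [2_000_000, 5_000_000, 1_000_000, 250_000],
})

# Keep only winter weather events.
winter = events[events["event_type"].isin(["Winter Storm", "Blizzard"])]

# Which calendar month has the highest economic damage in Colorado?
by_month = (
    winter[winter["state"] == "COLORADO"]
    .groupby("month")["damage_property"]
    .sum()
    .sort_values(ascending=False)
)
print(by_month.index[0])  # top month by property damage
```

The same groupby, without the state filter, answers the nationwide version of the question.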
We all three, I think, love the idea of data products. We inherently see, and have a visceral feeling of, the value of these data. But it really depends on someone else getting that feeling too. So how do you market a data product? What have you learned from trying to do that?
Jed Sundwall:
Sure. Well, that is the whole point of this podcast: to explore these questions in public. One thing is having a name. Even a name like “the National Storm Events Database” is fine. Landsat is one I always use, because it has a brand name. And my hilarious joke about Landsat, and this is true, is that I’d been working with Landsat for at least five years before I realized it stood for “land satellite.” I was like, oh, land satellite. That was not obvious to me. So just thinking of it in terms of a product is really important. Yes, you have to give it a name that makes sense to people. And then, we have a comment from Akis on YouTube asking: how do you market data products?
It’s kind of like any other product. You have to figure out how to explain its value proposition, make sure people know about it, build awareness, talk about it a lot. And it’s the kind of thing that our world of nonprofit people tends to ignore entirely, which is: it is sales. You’ve got to find channels, which is what salespeople like to talk about: how are you going to get your message out to a lot of people at once, and make sure they’re the people you want? And I think our community, again, this is why we started the podcast, to try to elevate this conversation, doesn’t talk about this kind of stuff enough. There’s another interesting dimension, though, in 2026, and there has been for the past few years, which is that a lot of users of data aren’t going to be people; they’ll be agents and things like that. So I’m curious, do you all think about that at Earth Genome now? Making data AI-ready or whatever?
Kwin Keuter:
Well, I have another anecdote that’s relevant here. For the Parquet, I really relied on a project of Chris Holmes’s, and I know he has collaborators on it, a tool called geoparquet.io. When I was trying to figure out, how do I write these Parquet files? What’s the best way to do this in January 2026? The LLM I was talking to referred me to Chris Holmes’s geoparquet.io project. And when I went to look at it on GitHub, it was like, this has only really been live for about a week. This was in January. Yeah.
Jed Sundwall:
nice.
Jed Sundwall:
Yeah, it’s like brand new. Yeah.
Kwin Keuter:
So somehow Chris Holmes has figured out how to market his stuff to AI so that it can be marketed back to us. I thought that was fun. I think he leaned into using Claude’s skill definitions, which are just markdown files that explain to an agent,
Jed Sundwall:
That is crazy.
Kwin Keuter:
here’s how you could use this library. So maybe that’s what got that to work.
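For readers unfamiliar with the mechanism Kwin describes: an agent skill of this kind is essentially a `SKILL.md` file, YAML frontmatter plus markdown instructions an agent loads when the skill looks relevant. A hypothetical sketch follows, not the actual geoparquet.io skill; the numbered steps and names are invented for illustration.

```markdown
---
name: geoparquet-io
description: Helpers for reading and writing GeoParquet files.
  Use when the user asks about GeoParquet conversion or validation.
---

# Using this library

<!-- Hypothetical instructions; the real skill's contents will differ. -->
1. Inspect the input file's geometry column and CRS first.
2. Use the library's convert helper (see its README for the exact
   command) rather than hand-rolling Parquet writes.
3. Validate the output against the GeoParquet spec before finishing.
```

The frontmatter `description` is what lets a model decide, from a catalog of skills, that this one applies to the task at hand.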
Jed Sundwall:
Maybe, yeah. Okay, I have an email in my inbox from Chris that I need to get back to; I’m going to ask him. I’ve been aware of this project, so I’m very surprised to hear you mention it. There’s another guy named Nism Libovitz, I think, I hope I’m pronouncing his name right, who’s also been leading on this. It is brand new. That’s awesome, that’s crazy. That’s a really interesting anecdote about how AI, or Claude, I guess you were using Claude, was aware of it. Or was it?
Kwin Keuter:
I think I was using, I don’t know, Gemini maybe or something like that.
Jed Sundwall:
Okay, interesting. But regardless, right away the models have been able to figure out other stuff people have been working on that would be relevant to you. It’s a model helping matchmake among humans, which is crazy. That’s super cool. Everybody check out geoparquet.io. Brad, I’m going to put you on the spot: have you been thinking about the rise of agents as data users
Jed Sundwall:
as well.
Brad Andrick:
A bit, but if I could, I’d go back to the marketing question for a second, because I think the why-it-matters is really important. I think about why this data set from NOAA matters. We’re talking about who our persona is, and that’s important, but also: what is the overall impact potential of this data set? And to me, it’s
Jed Sundwall:
Sure, please.
Brad Andrick:
potentially where billions in resilience funding and infrastructure investment must go. It can dictate that level of direction. So how many people with that level of influence are looking at this data set, or aware of it, or using a report that’s been rolled up five levels with no actual tie back to the data set? And it also speaks to how a lot of open data gets lost out there. I think of when I spent a year working for a local government. We had a brilliant open data portal. And it wasn’t just an Esri open data portal, although there are plenty of those out there in local government, and that’s great for local resources. But then how do you search 15,000 local government open data portals to find what you need? There’s that next step beyond everything else. And then the marketing ties in there too: who’s
responsible for the marketing? Because a lot of the people building the open data stuff, they’re not marketing, they’re not sales, and there might not be any budget at all for that. So either you do kind of what we do: we go to a conference and talk about the work, great, we got it out there; we go on a webinar and talk about the work, all right, here it is, there you go. But where is that gap, and how does it get filled? I don’t have a great answer to that.
Jed Sundwall:
Yeah. And this is our message also to policymakers, people who are writing open data policy or funding open data programs: we’ve got to get ourselves out of the idea that you just open up the data and it’s fine and good and you’re done. It’s like, well, no: what are you doing this for? Who are you doing it for? Are you sure you’re reaching them? Do you have any way of knowing whether or not it’s useful? And to your point, Brad, none of that stuff is funded. You just kind of throw some CSVs up on the internet and say, all right, our job is done here. No other media production operation behaves that way, just putting stuff out there to see what happens. Anyhow.
It is a gap, and that’s why we’re working this stuff out, like I said before, in public. I want to point out that Akis, again, said he was very interested in knowing more about the user stories. I don’t know if you all have already shared any information about the background of the project.
I shared the Commons blog post, but I don’t know what else is out there.
Brad Andrick:
Yeah, so the Commons blog post is a great one, because it’s framed as a case study and gives a good amount of the background. There’s maybe a little bit more in the discovery-and-process one, which has some screenshots that talk about some of those categories we targeted. That’s on earthgenome.org slash about or slash blog, and then you’ll find the one
Jed Sundwall:
Okay. Okay.
Brad Andrick:
for the storm events. I think it’s still the first one, because it’s the latest one, maybe, and it has a few more details in it. As far as process goes, I can speak to that a little more generally. Sometimes we do all the work to build a persona, the traditional design exercise where we have a persona mocked up: these are their skill sets, this is their description, these are their pain points, and so on, to get one or two generalized personas built out. We did not do that specific exercise for this project. There was some more general work when we got on the project, and then there was the user survey. From that, we landed on a requirements list that went back and forth between our team, the Internet of Water team, and the Duke school,
to refine it down: hey, this is how much budget there is; these are all the requirements we’d love to have in there; these are the most important ones, ranked by the user survey we got back and analyzed. And then we decided, okay, this is where the project needs to live.
Jed Sundwall:
Got it. Okay.
Fascinating. But then, once again, there’s a project that needs to live. I think that word “live” is doing a lot of heavy lifting. For this thing to, sorry, this will be a little cheesy, come to life, it needs to live in people’s minds. People need to be aware of it. I just left another comment on YouTube: if people can’t find it, it doesn’t exist. It’s irrelevant if it’s just sitting out there and no one knows about it. Yeah.
Brad Andrick:
A hundred percent. And when we joined this project, we did that kind of discovery phase, one part to talk to users, but we also Googled “NOAA storm events database” to see what popped up. And there was an ArcGIS dashboard, not a story map, but a dashboard kind of interface, that existed at some point, up to a moment in time, and then the data was never updated.
Yeah, that just went away.
Jed Sundwall:
Yeah, yeah, here we are. Well, this is the other thing, I think. Talking about NOAA providing a better interface to this is interesting. It’s possible that this beta comes out and then it just moves into production, and what you’ve created is made obsolete by it. And that’s fine; that is probably a perfectly fine outcome. But it also reveals what I think is an interesting challenge, especially in the context of data products: you have the data product, and our concept of that on Source is that it’s a collection of objects. In fact, the story you’ve told about this data product explains why that’s really important to us, why we take an object-based, or what some people might call a file-based, approach.
If you did have a bunch of paper records of storms on a ledger or cards or something like that, that’s kind of timeless. They’re not locked up in anything; they’re just paper records. Of course, those are expensive to preserve and maintain. So as we move into digital things, for a long time we’ve been, I would say, beset by database people, whom I love; some of my best friends are database people.
But they’re like, no, this needs to be put into a database for all these obvious benefits. And we’re like, well, yeah, sort of, except it does start to impinge on accessibility and portability, or it can, depending on what kind of database you use. And we’d rather keep a file-based approach, just because it feels more timeless, right? And it can live anywhere. So we think of products as a collection of objects. And in the case of what you publish on Source,
it’s all the CSVs and this Parquet file. I would say if NOAA could only do one thing, it should just do that and then get out of the way, because it allows people like you to build tools like your Explorer much more easily. But very few government agencies can get away with that. They can’t just say, here’s a Parquet file, good luck. There’s an executive somewhere who wants a dashboard, and that’s fine.
All of those dashboards and visualizations are expensive. But go ahead, Kwin, sorry.
Kwin Keuter:
They have a need to market themselves and market their data, and if it’s just a CSV file,
yeah, they’re going to want to go a step beyond that. But in your conversation with Denice Ross, I think she made the point that if a data set can only be produced by a government, the government needs to make the care and feeding of that data set its priority. It needs to protect,
whatever it takes, the ability to keep collecting it, because that continuity is more valuable than a user interface. As long as the data is open, anyone outside the government, like us, can come along and build and innovate on that open data. So I was glad you said that, because I hadn’t heard anyone articulate it before.
Kwin Keuter:
Yeah, I think we need to say that more often, rather than just being passive consumers of public data.
Jed Sundwall:
Yeah. I mean, everyone knows Denice is very good. That conversation was really great; I need to go back and re-listen to it, because there are some amazing gems in there. And we’ve gotten feedback on that one. I haven’t said it explicitly, but this whole conversation I’ve been thinking about what she and I talked about: the lack of feedback loops. This is why I was asking, is NCEI at the table? Is there anybody there
Jed Sundwall:
at the table. And Brad, like you said, that kind of consumer or customer engagement, front-facing type of work, is very, very rarely funded. And it’s a problem, because you almost have no choice but to be a passive consumer, since in many cases there’s just no way to actually interact with the provider of the data.
Go ahead, Kwin, were you going to say something? Okay. Sorry. Okay. Well, we’re reaching the end of the hour, so I’m going to say this: I want people to submit talks for CNG 2026. One bit of feedback we got from last year is that we had too many sessions that were very good, and people just couldn’t choose. So we’re going to have fewer sessions; we’re going to be a bit more picky.
Jed Sundwall:
So I encourage you all at Earth Genome, either of you, to submit stuff. You don’t have to say what you’re going to submit about. And anybody listening as well.
Yeah, I’d be curious. Oh, yeah.
Brad Andrick:
Kwin and I have already talked. Yeah, we’ve got ideas; it’s kind of like, which one? Since it seems like a more focused effort this time: which idea do we pull the trigger on and submit?
Jed Sundwall:
Okay, here’s a clue: we want end users, we want impact. This is sort of a problem our community has: we’re very geeky, and we’ll geek out talking about file formats and technical specs, but we want to talk about actual use, end-user impact.
Kwin Keuter:
Yeah, yeah. And Jed, I appreciate your encouragement, and the comments on YouTube, and the implicit encouragement there to keep talking about these valuable data sets, not just post-it-and-forget-it, but to continue those conversations.
Because I think that’s how their value can actually be fully realized. So yeah, that’s one of the goals for this year: just keep talking about it.
Jed Sundwall:
Yeah, yeah. I’ve got another email in my inbox about this, and Akis has just left another comment: how would this work and get funded at the open data product level? That’s kind of the point: you have to think in terms of products to defend these things. The other email in my inbox is about, hey, how do we prioritize which government data sets are at risk?
This conversation has been going on for well over a year now, ever since Trump came back into office and things started going a little haywire. People are asking, what are our most important data assets? And people kind of don’t know. There’s just data here and there. But because the data hasn’t ever been conceptualized as a product that has users, with an understanding of how much it costs to maintain, these conversations are super hard. So, yeah.
My response to Akis is that that’s the whole point of referring to data products as products: you have to fund them discretely, or at least think about defending them at an individual product level. The way it should be is that everyone has a portfolio of products that they’re stewards of. And they might have to confront a time when it’s like, you know what, no one uses this product, nobody cares about it, it’s time to let it go.
Kwin Keuter:
Yeah. Well, just to be clear for anyone who wants to know: our plan for the Storm Events Database Explorer is to sync the data with NOAA’s every month, through at least October this year, probably longer. It’s actually very easy to do; we did it last time, no big deal. So this will be a resource.
Jed Sundwall:
Yeah. Yeah.
Kwin Keuter:
It’s not like we just did this once and we’re stopping. It’s going to continue to be updated as long as NOAA keeps updating their data.
Jed Sundwall:
Great. Well, I actually failed to ask this before, but what is the application itself? What kind of application is it? I imagine it’s a pretty modern, lightweight web app.
Brad Andrick:
Yeah, it’s a React front end, with Mapbox as the map rendering library. It’s fast, it works on mobile, it’s responsive; it’s pretty lightweight overall. I don’t know if the FOSS4G talks from this year are recorded, or if they’ll be out at some point, but there was one that went into more detail on some of the fun behind-the-scenes database work that
Kwin helped with: using pg_tileserv, then deciding not to use pg_tileserv and going back to GeoJSON for the data aggregation component. I could talk way too long here about the decision to use a dot grid as the data aggregation type over H3 or something else. It’s the place I love to live, digital cartography land. But it’s pretty lightweight, and
Jed Sundwall:
Okay.
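The dot-grid aggregation Brad contrasts with H3 can be sketched in a few lines: snap each event’s coordinates to a fixed-size cell and count events per cell. The 0.5-degree cell size and the bare (lat, lon) inputs are illustrative assumptions, not the Explorer’s actual implementation.

```python
from collections import Counter

def grid_cell(lat: float, lon: float, cell_deg: float = 0.5) -> tuple:
    """Snap a coordinate to the lower-left corner of its grid cell."""
    def snap(v: float) -> float:
        # Floor-divide to the cell boundary, then scale back to degrees.
        return round((v // cell_deg) * cell_deg, 6)
    return (snap(lat), snap(lon))

def aggregate(events, cell_deg: float = 0.5) -> Counter:
    """Count events per grid cell; events are (lat, lon) pairs."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in events)

# Three events; the first two fall in the same 0.5-degree cell.
counts = aggregate([(39.74, -104.99), (39.90, -104.80), (40.76, -111.89)])
print(counts)  # {(39.5, -105.0): 2, (40.5, -112.0): 1}
```

Unlike H3’s hexagons, a rectangular dot grid like this is trivial to compute and to render, at the cost of cells that vary in ground area with latitude.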
Brad Andrick:
To kind of circle back to the agent question that I avoided, in a way: you made a comment that I think is very true, that we’ve had a bit of a sea change where it is a lot easier to get things done faster, at least to the prototype. We talk about this a lot internally right now: what does the design process look like for us? Are we iterating in code now? Because it used to be that code’s expensive, you do that last.
Jed Sundwall:
yeah, sure.
Brad Andrick:
Well, that’s kind of changing in the landscape, at least for the prototype stage. But as far as NOAA putting the data out there and just saying, here’s the endpoint, and scripting down those CSVs: Kwin did a lot of work to do that, but these days it’s not too wild to script those and pull them down. The cleaning-up portion, that’s where a lot of the thought work went. The act of pulling down all of the data and throwing it into a database, there’s work to get that to work at scale. We’re at over 2 million events in the database now. So it’s not big data, but it’s big enough to be annoying, such that you need to consider aggregation methods and all those things: multiple filters being combined, dynamic querying, and all of that.
But anyway, I think there is maybe a world ahead where that sort of interface is something you can spin up much more quickly, make tweaks here and there, and it’s much more open, collaborative, and easy for people to access that way.
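The scripted pull Brad describes might start by scraping NCEI’s bulk-CSV directory listing for the per-year gzipped files. A sketch, using a stand-in HTML snippet in place of the real listing (the filenames shown are invented for illustration) and omitting the actual download step, which would use `urllib` or `requests`:

```python
import re

# NCEI's public directory of bulk StormEvents CSVs.
BASE = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

def extract_csv_links(listing_html: str) -> list[str]:
    """Pull StormEvents .csv.gz filenames out of a directory listing page."""
    names = re.findall(r'href="(StormEvents_[^"]+\.csv\.gz)"', listing_html)
    return [BASE + name for name in sorted(set(names))]

# Stand-in for the HTML you would get by fetching BASE:
sample = '''
<a href="StormEvents_details-ftp_v1.0_d1950_c20250401.csv.gz">...</a>
<a href="StormEvents_details-ftp_v1.0_d2024_c20250601.csv.gz">...</a>
'''
for url in extract_csv_links(sample):
    print(url)
```

From there, each URL would be fetched and gunzipped before the much larger job Brad points to: cleaning and loading the rows into a database.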
Jed Sundwall:
Yeah, absolutely. And I think this is the great future we live in. I’m a broken record on this: it gets cheaper to do this stuff every day. Every day it’s cheaper and easier to build a tool like this. We live in a future where I can send somebody a URL and they can interrogate 2 million rows of data in a browser, anywhere in the world. That’s awesome. To me, the lesson from that is that
we need to be investing in making sure the data itself is really high quality and keeps being produced at the core, so that anybody can build on top of it. All right, well, any last words before we wrap up? Anything you want our audience to know about or look at? We’ll share links to all of your work in the show notes, of course, but anything else?
Brad Andrick:
For anybody who follows our work on Earth Index, the work we’ve been doing with searching embeddings, we’ve got some interesting things coming out. We went through a Google.org generative AI accelerator, which was an interesting learning experience, and we have some new features beyond that coming out later this year. So there will be some interesting things to watch. And then I’ll just plug CNG again; everybody should go. It’s,
Jed Sundwall:
Yes.
Brad Andrick:
as I said to Jed before, this is the conference. When I’m out at places, I tell people, go to this conference. It’s like you took all the technical people, and hopefully this won’t offend you, Jed, because you said you’re trying to shift this up, but you took all the technical people out of the FOSS4G world and put them in a room, and it’s just the geospatial nerds, who are also talking about the bigger, wider picture. There were some great sessions on the business side of things last year:
how do we do open data sustainably long term, funding models, those things. It was great, I love it, and I’m going to be back this year. Everybody should be.
Jed Sundwall:
Awesome, I’m so glad to hear that. Yeah, look, you said it better than I could, and nobody needs to hear it from me; I agree that our conference is awesome, but listen to Brad. All right, well, thank you both.
Show notes
Jed talks with Denice Ross, Senior Fellow at the Federation of American Scientists and former U.S. Chief Data Scientist, about federal data’s role in American life and what happens when government data tools sunset. Denice led efforts to use disaggregated data to drive better outcomes for all Americans during her time as Deputy U.S. Chief Technology Officer, and now works on building a Federal Data Use Case Repository documenting how federal datasets affect everyday decisions.
The conversation explores why open data initiatives have evolved over the years and how administrative priorities shape public data tool availability. Denice emphasizes that federal data underpins economic growth, public health decisions, and governance at every level. She describes how data users can engage with data stewards to create feedback loops that improve data quality, and why nonprofits and civil society organizations play an essential role in both data collection and advocacy.
Throughout the discussion, Denice and Jed examine the balance between official government data products and innovative tools built by external organizations. They discuss creative solutions for filling data gaps, the importance of identifying tools as “powered by federal data” to preserve datasets, and strategies for protecting federal data accessibility for the long term.
Links and Resources
Takeaways
- Federal data underpins daily life — From public health decisions to economic planning, federal datasets inform choices that affect Americans whether they realize it or not.
- Data tools require active protection — When administrative priorities shift, public data tools can disappear. Building awareness of data dependencies helps preserve access.
- Feedback loops improve data quality — Data users should engage directly with data stewards. Public participation in the data lifecycle leads to better, more relevant datasets.
- Civil society fills critical gaps — Nonprofits and external organizations can collect data and advocate for data resources in ways government cannot.
- Disaggregated data drives equity — Breaking down aggregate statistics reveals disparities and enables targeted interventions that benefit underserved communities.
- External innovation complements government stability — A healthy ecosystem keeps federal data stable while enabling community-driven tools to evolve and serve specific needs.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
Yes. Hello, Denice. Welcome to the Great Data Products. Thanks for joining us from Virginia. Okay, that’s right, okay, I want to make sure. No, but happy 2026; it’s a really, really interesting time to be talking about these things. Just a bit of housekeeping as we get started: this is what I like to call a
Denice Ross:
Good to be here.
Denice Ross:
Northern Virginia.
Jed Sundwall:
livestream-webinar-podcast thing, where we talk about the craft and ergonomics of data, and talk to professionals who’ve worked in the production and distribution of data about what works, what doesn’t work, and what we’re working on. You are currently at the Federation of American Scientists as a, how do you describe yourself? Senior advisor? Former Chief Data Scientist of the United States? How else do you describe yourself?
Denice Ross:
Senior advisor.
Denice Ross:
That’s a good question. You know, I really like the title “former”; “the former Chief Data Scientist of the United States” is serving me well. I always wondered why my predecessor, DJ Patil, used that after he left his position. He went by “the former,” and I see now it’s a good title.
Jed Sundwall:
Hahaha
Jed Sundwall:
Good, yeah.
Jed Sundwall:
Yeah, it is a good title. Well, we share a lot of interests, but I think one thing we have in common is that you were a leader in open data in New Orleans back in the day, and I created a thing called Open San Diego back in the day. Can you share a little bit about your experience in New Orleans and how that got started?
Denice Ross:
Yep.
Denice Ross:
Yeah, absolutely. So I moved to New Orleans in 2001. It was the first time the internet was really a thing while decennial census data were being released, and there was this idea that we could democratize the data: instead of decisions being made about communities behind closed doors, by people in power with the resources to access and analyze the data,
Jed Sundwall:
wow.
Denice Ross:
neighborhoods and community organizations could have access to that data to advocate on their own behalf. So I think when the civic tech movement arrived, in sort of the 2005-to-2010 window, New Orleans was very primed to be a leader in that space, as was San Diego.
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah, yeah. Well, I mean, you were early. I think Open San Diego came 10 years after that, so you were way ahead of the game there. That’s fascinating. Okay, so as we discussed when planning for this, I’m curious to know what you’re looking forward to this year, both what you’re working on and, more broadly, where you see things going.
Denice Ross:
Yeah, absolutely. So 2025 was tumultuous. I think we can all agree on that from the data perspective. And as we head into 2026, what we have, though, is a pretty activated and informed citizenry around the role that the federal government plays in our everyday lives and our economy.
Jed Sundwall:
He’s really polite, yeah.
Denice Ross:
and just running a modern society, and also the role that data play, that federal data play. I think there’s less of a tendency now to take for granted data like the weather and data on jobs and the economy. So that to me feels like a good foundation to start building out a plan for what we want for the future of federal data, and at the same time also really protect the core of the federal data
that we depend on that we may not really be paying attention to yet and perhaps have been taking for granted.
Jed Sundwall:
Yeah, actually, I mean, the idea of taking things for granted is, I think, really worth dwelling on: there is so much that we take for granted that we don’t notice it until it’s gone or until it’s disrupted. You know, my dad worked in public health his whole career. And when COVID hit, the pandemic suddenly put
notions of public health and response and interventions and hard decisions into people’s minds, and everyone starts freaking out. They’re like, why is the government telling me what to do? And he realized, and I think this is pretty insightful, that public health sort of had become a victim of its own success, in that everyone just takes for granted the fact that everyone learns to wash their hands
Denice Ross:
You
Jed Sundwall:
growing up, you know, like there’s sort of the basic cultural norms around hygiene and behavior and things like that. It actually took a ton of work to figure out how to get that out into the world and to train everybody on that. And that was done by public servants for the most part. And you don’t want to do a rug pull on those sorts of things, cause we just take them all for granted. But I’m curious to get your take: what do you consider when you talk about core data?
Denice Ross:
Mm-hmm.
Jed Sundwall:
Are there specific data products that you have in mind or categories of data or what?
Denice Ross:
There are. And you know, actually, though, I wanted to, as you’re talking about this idea of taking for granted, I’m reminded of early in my career, I worked in lunar and planetary sciences. And I talked to this real old school planetary scientist. And his take was that the American space program had suffered because of science fiction, because Americans thought we could do so much more.
in terms of, like, you know, exploring space than we actually could. And after Katrina, we used to joke, because people just assumed that we would have the information on, like, who’s moving back and what do they need and what are their characteristics and how many households have access to a vehicle and how many sexually active teenagers of this particular demographic live in Marrero. You know, like, people thought that we had this really detailed data.
Jed Sundwall:
Right.
Denice Ross:
And we used to joke that they thought that maybe the Star Trek Enterprise could just scan the planet and get us the data that we need. And so there’s, I think, two things. We take for granted the data that are flowing. And we also just assume that we have access to data that are really important. As you know, it takes…
It takes a lot of effort and resources and coordination to create a data collection, a lot of intentionality. It doesn’t happen accidentally. And so as we think about the future, it’s not just what are the core data that we are currently collecting, but also what data should we be collecting moving forward.
Jed Sundwall:
yeah, no exactly.
Jed Sundwall:
Right. Well, yeah. So I guess I am curious to know, well, there’s a lot of threads to pull on here. I mean, you’ve been outspoken talking about the need for federal data. So maybe we can start there and just kind of ask, like, what is that category? And before I let you answer, I’ll just make one point. We grappled with this when I created Open San Diego, because we’re like, well, whose data are we talking about?
What are we advocating for? And what we landed on was data about San Diego, because there’s a San Diego County, there’s a City of San Diego. There’s also, like, the most heavily trafficked border crossing in the world, I think, maybe, at San Ysidro. So there’s Mexico data and trade data. We realized, like, there’s a lot of data about San Diego that’s independent of the city government or the county government.
Denice Ross:
Right. Mm.
Jed Sundwall:
So when you talk about federal data as a category, what are you talking about?
Denice Ross:
Yeah, and that’s a really good distinction. So federal data are data that are produced by the federal government or with funding from the federal government. A lot of scientific data, health, climate, and environment are created through relationships with universities and whatnot. But I would call all of that federal data.
There’s two ways to think about what is core to me. One is thinking about the primary collection of the data. What types of data sets need the scale and real comprehensiveness, so we’re not leaving any places or people behind, that only the federal government can provide? So that’s sort of the horizontal of the core data. And then there’s the vertical. And that is, maybe the federal government
collects the data, but then they also create different ways of accessing the data through lookup tools and maps and various APIs and resources. And that’s always a tension within federal government is how much do you build out those derivative works so that you can meet the needs of specific populations of Americans who need to make decisions or navigate some process.
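The “vertical” Denice describes, one primary collection exposed through several interfaces, can be sketched minimally. This is an illustrative toy, not any agency’s actual system: the record fields and values below are made up, loosely echoing storm-event data, and the two functions stand in for a CSV download and a JSON API response over the same rows.

```python
import csv
import io
import json

# One primary collection (toy records; the fields and values are made up),
# exposed through two "vertical" access modes over the same data.
RECORDS = [
    {"event_id": "E1", "type": "tornado", "state": "OK", "year": 1999},
    {"event_id": "E2", "type": "hail", "state": "TX", "year": 2003},
]

def as_csv(records):
    """CSV download mode: serves traditional spreadsheet workflows."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def as_json(records):
    """JSON API mode: the same rows, packaged for programmatic access."""
    return json.dumps({"count": len(records), "results": records})

csv_text = as_csv(RECORDS)
json_text = as_json(RECORDS)
```

The point of the sketch is that the derivative interfaces are thin and swappable; the expensive, irreplaceable part is the primary collection itself, which is Denice’s argument about where government effort belongs.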
Jed Sundwall:
Yeah. I mean, I’m very curious to get your take on this. The naming of this podcast comes from this one weird trick that we do at Radiant Earth, where we just really yammer on about this: we have to talk about data products. I think one of the challenges that people like you and I have faced over our years working on this sort of stuff is that it’s very easy and fun and
apparently you can just talk about data in the abstract for as long as you want. But that might not get you very far. We find it’s really useful to talk about products. So what you’re describing in this vertical thing, which is like APIs, maps, other tools, and things like that, those are products and they have users in mind. And I’m curious to know who are the users that, throughout your experience, you’ve engaged with most in that?
Denice Ross:
Mm-hmm.
Jed Sundwall:
Yeah, like who are these people? Because it’s not like average citizens, I think, in most cases.
Denice Ross:
Well, interestingly, so I’ll just mention a few recent examples of federal data that I’ve seen in the wild. I was getting money out of the ATM the other day, and I bank with USAA. They serve the military community. And the screen, when I was going to get my money, talked about firearm safety and suicide prevention.
Jed Sundwall:
You can correct me on that.
Denice Ross:
The reason that that campaign has been so successful is because it was based on evidence that came from the National Death Index, which found that veteran suicide rates went down when veterans were locking up their firearms. And so that federal data spurred this very successful social media campaign that then made it to my ATM.
Another example is, you know, we go camping with the Scouts a lot. And when you get to a campsite, you know, there’s that old school wooden sign that tells you what the fire danger is. Well, that’s an official federal data set that’s informing which wooden sign gets hung on the hooks. And another example is when you go to the pharmacy and, you know, you might be prescribed a generic equivalent.
there’s an official data set out of the FDA, it’s called the Orange Book, that determines the generic drug equivalency for brand name drugs. And so those are just a few touch points where, you know, every day we’re interacting with federal data that has made it into the real world.
Jed Sundwall:
wow. Yeah.
Jed Sundwall:
Yeah, I love that. I mean, this is this reminder that that wooden sign at a campsite is a data visualization. That’s a user interface, right? You never think of it that way, but that’s actually what it is. Actually, yeah, this is a good segue into this other thing I wanted to ask you about. When we publish this as a podcast episode, we’ll put it in the show notes, but you were on Marketplace last year.
Denice Ross:
Great.
Jed Sundwall:
which I’m jealous of because I love Marketplace. But you said in that segment how you’ve felt like a lot of those tools and interfaces that the federal government has provided are maybe like almost like demos that should inspire others to build on top of. I think the USAA example is a really interesting one where that’s taking data to this weird endpoint, which is an ATM screen, but it’s actually a good channel to get the data out there.
I’m just curious if you could say more about how you see that playing out or how you’d like to see more of, I don’t want to say private sector, but like other actors taking federal data and building on top.
Denice Ross:
Yeah.
Denice Ross:
Yeah, you know, my thinking on this really solidified in the years after Hurricane Katrina, because I was on the outside of federal government working for a data intermediary. And federal data couldn’t keep up with the rapid changes, you know, both the exodus when 80% of the city flooded and then the…
people rapidly coming back and also sort of different types of people as we were rebuilding. And we desperately needed information from local government in order to track those changes and to be able to have some community participation so that the recovery was complete and equitable. And I remember going to City Hall and asking…
I think it was like a parcel layer, a list of childcare centers or something like that. And the contractor who was running the data at that time tried to set up sort of a quid pro quo. Like, well, I’ll give you this data if you give me this data that you have. I’m like, but that’s your job, like you guys are the ones who produce this data. You’re the primary data producer and you’re the only ones who can give this data to the citizenry.
And although they were making the data available in maps, they weren’t making the raw data available, which you remember was an issue in the early days of the open data movement. And so at that point, I became pretty fixed in my sense that if a data set can only be produced by government, then that should absolutely be their priority. Like as resources come and go, protect the core of that collection because as long as it’s made open, then
others can build on it and innovate on it. But if the federal government or the local government’s not doing their job with that primary data collection and the publishing of it, then everything sort of falls apart and you have to get creative with inadequate proxies. so just given the limited resources that governments have, I really do focus on that primary role of collection and publishing.
Jed Sundwall:
Yeah.
Denice Ross:
and maintaining the high data quality, and then comparability and the continuity across time and space. That said, as you start to think about the different uses for any specific data set, there’s so many. Think about the American Community Survey or Landsat data, for example. Both of them have such broad uses across very different domains that it’s unreasonable to expect that the federal government would build tools to meet all of those use cases. And, you know, we’ve all interacted with government websites, right? The government doesn’t generally do a good job at creating websites and tools. Maybe they do a good job once, but then it starts to age and isn’t sustained in the way that a more modern product life cycle outside of government might sustain it.
Jed Sundwall:
Alright.
Yeah, yeah.
Jed Sundwall:
Right. Yeah. I mean, we don’t need to pick on government people too, too hard, but it’s easy to fall into that. We can talk about procurement issues and why the government’s not that great at managing digital services or improving them over time. But I totally agree. I’ve felt this way for a long time, and a lot of this came from our work. When I was at AWS, we worked a lot with NOAA on publishing their data. And now that I think about it, it was sort of a funny relationship, in that we all sort of agreed. NOAA was like, look, we can produce the data, but we really need you to get it out to more people. And we’re like, okay, that makes sense. But then also, and I can talk about my former employer, AWS doesn’t make great user interfaces either. I mean, as far as infrastructure as a service goes, AWS is hard to beat, you know, they’ve done very well.
Denice Ross:
Mm-hmm.
Denice Ross:
Right.
Jed Sundwall:
But when it comes to producing consumer-facing end user interfaces that can reach a lot of people, constitutionally the company doesn’t seem that great at it. It’s just not really what AWS is built to do. Other people build those interfaces on top of AWS, and that’s how we did it. But I’m just agreeing with you pretty violently that it’s okay to have the government stop at some point and let other actors take over to get things
Denice Ross:
Right.
Jed Sundwall:
the last mile.
Denice Ross:
Yeah, I think it’s how we build resilience into the system, frankly, like, you know, let the federal government focus on the core. What is missing, though, to make this really work are the feedback loops so that federal data stewards have a really good sense of both how are the data being used, how could the data be improved to better meet the use cases, and then what untapped
possibilities are there for the data better serving the American people if the federal data collection adjusts to changing conditions or data needs. And those feedback loops, when I was in the Biden administration, we did talk about how we might infuse more public participation and community engagement around federal data.
And because it’s tough. Like, right now, the main avenue for giving feedback on a given data set really only applies to data sets that are collected through forms and surveys and subject to the Paperwork Reduction Act, which triggers this sort of public notice and comment period. And then you have to be watching the Federal Register to know that a comment period just opened. For example, just…
Jed Sundwall:
Right? Right.
Denice Ross:
So I’m working on two projects right now, which I should mention. The first is dataindex.us. It’s a collective of federal data watchers. We started with that Paperwork Reduction Act data on changes to forms and surveys and are expanding to scientific and health and environmental and other types of data.
But we’re monitoring changes to the federal data and looking for opportunities for public input because when those policy windows open, those are going to be the times when public input is going to have the biggest difference. And so tomorrow we’ve got a webinar about the Pregnancy Risk Assessment Monitoring System, I believe it’s called. But it’s basically the only way that we understand maternal and infant mortality.
in America. And that collection, interestingly, if you think about how those data are collected, has to come from local public health institutions. Then it reports up into the states and then the CDC. And recently, the CDC stepped back
on aggregating the data at the national level. So now researchers, if you want to study maternal and infant health, you have to go to every state individually and ask for the data, which introduces so much friction into the system, right?
Jed Sundwall:
Yeah. Oh man. I mean, I dealt with this all the time when I was running the open data program at AWS. It was almost clockwork, at least once a month, at this pretty regular cycle: some people were like, hey, wouldn’t it be cool if we had all of the X data about cities in the country? Like crime. I mean, the crime one came up a lot. It was like, wouldn’t it be cool if we had a data set of all of the crime in different cities in America? And I’m like, that would be cool. Who does that?
Like, who would do it? It’s a very expensive process to carry out. And I agree it would be cool, but we have to find somebody who actually is intended to do that. The CDC has a very clear, obvious mission here that’s, you know, historically been funded to do this sort of thing. Um, so I’ll just go ahead and say it, you know, although it’s already 2026, like
We can talk about core data and these sorts of things, but then what happens when the arbiter of the core data might not be seen as trustworthy?
Denice Ross:
Right, or just drops the ball, as is happening with PRAMS now. Or if you think about what happened in the first year of COVID, where civil society, The COVID Tracking Project and Hopkins and others, filled that role of harvesting the data from state and local health departments. And then it took about a year until the federal government really was on the ball with that.
Jed Sundwall:
Right.
Denice Ross:
There’s another example, though, recently, speaking of crime. So historically, the FBI has released their crime data once a year. The year closes out at the end of December, and then it takes nine months to process the data. It’s the official statistics, and so quality and continuity and all these things are really important. So it takes nine months, and then they’re published. But that’s not timely enough for really understanding, for example,
you know, is carjacking becoming a problem, or what are the trends that we’re seeing in murder, and informing the national dialogue and local policies. So last September, Jeff Asher and his colleagues created the Real-Time Crime Index, where they are hoovering up data directly from the nation’s law enforcement agencies and then creating a monthly estimate.
And I was in the White House the first month that that monthly estimate dropped. And it was amazing. Immediately, every policymaker who was working on violence, especially gun violence in America, changed the way they consume their data about crime in America. And so they go to this Real-Time Crime Index for the monthly updates. But then it’s still essential to…
Jed Sundwall:
interesting.
Denice Ross:
benchmark that to the official data coming out of the FBI. And what I really love is the resilience that that builds into the system, like, we need both. We need the official, slower, but really comprehensive and high quality data coming out of federal agencies. Data that, you know, the FBI director can go before Congress and talk about with confidence. So we need that. And then we also need
some of the scrappier sort of civil society best guesses of how things are going. They don’t have to go testify before Congress to talk about the quality of the data, right? They can have their methodology, it might be a little black boxy. And there might even be competitors in the space giving slightly different perspectives on what’s happening. We see that happen with flood risk, for example, where there’s different models that consume a lot of federal data that tell you how at risk your particular property is.
Jed Sundwall:
Right. Yeah.
Denice Ross:
And I think that that combination of the official data plus the innovative data that might trade a little bit of quality for timeliness is important given how fast things are changing in America around crime and climate and society.
Jed Sundwall:
Yeah. Well, I mean, I also think it’s super useful to acknowledge that it’s always a, I don’t want to say a negotiation, but, like, I think, you know, all models are wrong, but some are useful. That idea is to understand that authoritative data is useful in the sense that there’s a methodology, you might
Denice Ross:
Mm-hmm.
Jed Sundwall:
be more comfortable about how it’s governed and produced. But it doesn’t always mean that it’s the end-all, be-all absolute truth, you know. It might be data that you’re required for some regulatory reason to rely on. It might be the safest data to use, so if you are hauled in front of Congress, you can say where you got your numbers from. But like, I think it’s worthwhile to
Denice Ross:
Yep.
Jed Sundwall:
engage with that idea that, like, okay, it is useful to have authoritative data for some reasons, but we shouldn’t just sort of rest on our laurels and say, oh, that’s the data from the government, so it must be true, you know? Yeah.
Denice Ross:
Right. Yeah, absolutely. And the other nice thing about having authoritative data then plus the innovation happening in civil society is, for example, with the crime data, the FBI sets the standards for that data. And then every software vendor in America serving law enforcement agencies conforms to those standards. So that gives you the comparability on the basics.
But then often law enforcement agencies need more details. So, for example, some innovations were happening over the last few years because jurisdictions realized that they needed data on non-fatal shootings, not just the fatal ones. And the FBI standards didn’t include that. And so Philadelphia and other cities started collecting data on non-fatal shootings to inform
their policing practices and community engagement. And so that innovation started to happen at the local level. And then the slower process of incorporating that into the official government standards was happening at the same time. And then in the last few months, that became an official part of the new standard, which would then be propagated across all of the nation’s law enforcement agencies. So there’s a really nice interplay between
Jed Sundwall:
Interesting to see it.
Denice Ross:
the slow building of standards and the sort of field expedient data collections that communities need in order to answer the questions that are before them.
Jed Sundwall:
That’s a great story. Have I ever shared with you this white paper that we published last year called Emergent Standards? I’ll send it to you and put it in the show notes. I tell the story of RSS, which is what’s used to publish blogs and podcasts and things like that, and GTFS, the General Transit Feed Specification, which is how transit authorities share data,
Denice Ross:
Mm-mm.
Jed Sundwall:
largely with, like, Google Maps and big mapping apps like Apple Maps. But it tells stories similar to what you’re just saying, which is that you do have to have large institutions that can give the imprimatur or set standards or define requirements in a way. But they should negotiate and engage with the data practitioners and learn from one another. And the web is actually really good at enabling that kind of
negotiation. And then after a while, people are like, okay, yeah, this is the standard. This is how we describe this data. This is what counts as a shooting, like in your case. But that’s a negotiation among a bunch of different actors and data users that has to happen. And it’s never as simple as saying the standard that some government agency set is the one and everyone agrees. I think you’ve probably lived, many times,
Denice Ross:
Mm-hmm. Yeah.
Jed Sundwall:
why that’s not true. Okay. Well, actually, hold on, before I go on: you said you were working on two projects. You mentioned dataindex.us. What’s the other thing? You should brag about what you’re doing.
Denice Ross:
Mm-hmm.
Yeah.
Denice Ross:
In the first Trump administration, there were concerns about data, especially around climate and environment, disappearing, and also concerns about the decennial census that took place in 2020. It became clear to me that we as data users and stakeholders and advocates had not done a good job of telling the story about why data matter.
And so that’s been some serious unfinished business for me. And as I saw things unfold almost a year ago with the pulldown of so many data sets to remove elements that were not compatible with administration priorities like DEI and gender and climate,
I saw the narrative in the media about how researchers were going to be harmed by the disappearing data. And I was like, no, actually, all Americans are going to be harmed by the degradation of federal data capacity. I realized as I started to look at how we generally think about data use cases, we center the user of the data and what task they need to accomplish.
for some outcome that they’re trying to reach. And I thought, well, what if we flip the script a little bit and focus on the beneficiary of the data rather than the user of the data? So for example, a cancer patient can find a clinical trial that’s a good fit for them because the clinicaltrials.gov data set
is easily available and they can sort by the condition that they have. Or a football coach knows to move practice inside when it gets too hot so his players don’t get heat stroke, because the National Weather Service publishes the heat index. So what we’ve done with a website called essentialdata.us is we’ve been crowdsourcing and building up
Denice Ross:
these little one-sentence love letters about how specific federal data sets benefit everyday Americans and their livelihoods. And we’re almost at 100 data sets about nine months in. And it’s just been such a delight, but I’ll tell you, it takes about 20 to 30 minutes talking with a data user to shift their perspective from centering the users of the data
Jed Sundwall:
Nice.
Denice Ross:
to centering those who benefit from the data. Sometimes I had these doubts at the beginning. I was like, this is just too obvious. But it’s actually a big mindset shift. And it’s something that anyone who cares about data, I think we all need to undergo that shift so that we can talk about how data benefits people in their everyday lives.
Jed Sundwall:
Interesting.
Jed Sundwall:
Yeah. Oh man, I have so many thoughts about this issue. A weird one, though, comes from a book I read years ago called Entangled Life, which is about fungi. It’s an awesome book, actually a great book, but there’s just one insight in it where the author points out: we are humans that live on the surface of the earth, and we see things above the soil. So
we look at a plant or a tree and we’re like, yeah, that’s a tree. There it is, I’m looking at it. And he’s like, well, you don’t see all of the fungal activity in the soil that’s transferring nutrients and, we’ve actually learned, information from that tree to other plants and other life forms around it through the soil. So there’s all this stuff going on underneath that we just cannot see. We never consider it at all. And we think of a tree as a tree, and it’s like, yeah, sure, it’s a tree, but it’s a part of so much else.
Denice Ross:
Right.
Jed Sundwall:
And this is going back to the whole taking things for granted thing. We live on this substrate that, just, no one thinks about at all. And we’re the beneficiaries of all of it, but it’s totally invisible to people. Yeah.
Denice Ross:
Yeah, I love that metaphor. And it reminds me of digital tools and how they consume federal data, for example, all the real estate apps like Zillow and Redfin and whatnot. They consume data from the Department of Education about school performance. But it takes actually a lot of work to figure out that that data is federal data.
Jed Sundwall:
Right. Yeah.
Denice Ross:
And that’s one of the tricky things about these digital tools that we build is that we make it look like the data are all there and we sort of hide where it’s coming from and how it might be at risk. I remember…
Jed Sundwall:
Right.
Denice Ross:
I remember a survey question around attitudes around the decennial census data. And people were asked, the census decennial data, is it unique? Like, is it something that only the federal government can produce? And a common answer was, no, no, you can get that data from Google.
Jed Sundwall:
Yeah.
Jed Sundwall:
Wow, amazing. Yeah. Yeah.
Denice Ross:
Right? like, yeah, you can, but Google wouldn’t have the data if the census didn’t exist. And we’ve had some rough patches, right? Like with the economic data, for example, with the shutdown, where the private sector was able to sort of fill the gaps. But you have to have that federal benchmark to snap to, or the private sector data is going to veer further and further from reality.
Jed Sundwall:
Oh yeah. Well, I mean, this is going back to this feedback loop thing, which, you know, we don’t have great feedback loops, right? Like, federal data providers, or a lot of government data providers, really don’t have many ways to know how their data is being used and how valuable it is. And this is where I’m approaching a third rail, cause I’m going to talk about data markets and pricing and things like that, but like,
There’s another, this Google example is kind of funny, because Landsat has a sort of similar story. Landsat had been around for a long time and was very widely used. So for those who don’t know, I think most people listening to this podcast are familiar with Landsat: satellite data, earth observation data provided by USGS. But then Google Earth Engine is created.
I won’t go into the whole history of how it was created, but Google suddenly has this thing called Google Earth Engine that is an incredibly powerful tool that makes Landsat so much more accessible to people and just leads to an explosion in usage of Landsat. I should also take credit: at AWS, we subsequently did something similar, putting Landsat data into AWS. But I do know that there was some consternation at USGS that Google Earth Engine was getting all this credit for Landsat.
Denice Ross:
Right.
Jed Sundwall:
Which is fair, you know. It’s like, well, hang on, we’ve been doing this forever. Google didn’t fly the satellite or take the risk in the seventies of developing this program and keep it going for decades. But this is where we get into the third rail territory, which is just sort of like: Google Earth Engine was able to do what they did, and I was able to do what I did at AWS, because the data was free and open. And because of that…
Denice Ross:
Yeah.
Jed Sundwall:
There’s some recent study from USGS showing the value of Landsat is, like, billions of dollars for the economy. And I’m like, well, if that’s true, why can’t you defend yourself? How are you not able to capture any of that value to make sure that you continue to exist? And I guess I’ll just leave that there for you to respond to, because I do think those of us who are open data enthusiasts have divorced ourselves from getting useful signal from markets. And I don’t know if that’s worth re-examining.
Denice Ross:
It’s a really good time for the private sector to step up and advocate for the continued flow of the data that they depend on.
Jed Sundwall:
Agree.
Denice Ross:
We haven’t seen a lot of that, frankly. I mean, if you think about data advocacy, it tends to be more nonprofits and academics. I think Steve Ballmer, the former Microsoft leader, with USA Facts, is one of the few private sector folks who’s been really advocating for the continued flow of federal data.
One thing to keep in mind, and I know there’s concern about appearing to be anti-administration, but there’s nothing inherently political about wanting data to keep flowing. And in fact, the Evidence Act was signed by President Trump in his first term.
And it has a section in there that requires federal data stewards to engage with the public so that they can better understand how the data are used and how the data can be improved. So that type of public engagement is baked into the law that President Trump signed in 2019. In the federal government, we just haven’t done a great job of creating those feedback loops.
And that’s why, with the work we’re doing at dataindex.us, we’re trying to bridge that gap, so that people who care about data don’t need to monitor the Federal Register on their own or keep an eagle eye on LinkedIn to see if their favorite data set is at risk. We centralize the heavy lifting of that. And then, when there’s an opportunity where public input can be really useful, we mobilize folks
to submit their public comments.
Jed Sundwall:
Yeah, great. Well, what I’ll add to that, though, is that there’s also just basic analytics that we should be better at doing. It’s crazy to me how hard it is to count data usage. In fact, I had a text exchange about this earlier. On Source Cooperative, we host three petabytes of data now, and we’re logging over 150 million requests a month. And I was saying to,
Denice Ross:
my gosh, so true. Yeah.
Denice Ross:
Right.
Jed Sundwall:
shout out to Avery Cohen, earlier today, I was saying it gets really annoying when you’re counting tens of millions of requests, then filtering through those and figuring out which data sets are being accessed. Do we know anything about who’s accessing them? What is this data even telling us? But in any event, at a minimum we should be able to know. And this is also a hard conversation that’s starting to happen more and more often, which is that
some data just never gets used, and maybe we should let it go. I think the term I’ve heard a lot in 2025 is “joyous funeral,” where for some data products we can say, okay, we can let these ones go. It’s okay, you know.
Denice Ross:
No, I like that. I like the concept of a joyous funeral. But I have enough humility now, having been in the field of data for 20 years, to know that I don’t know what all the use cases are. And you just never know. So I’ll mention one of my favorite data sets: the North American Bat Monitoring Database. Yeah, it’s this geospatial data set out of USGS. And there are 400
Jed Sundwall:
Ooh.
Denice Ross:
organizations around the country that contribute to it: information on bat species, their locations, what they’re doing. And you might think, well, why is the federal government collecting data on bats? Well, it turns out that bats provide billions of dollars of free services every year to America’s farmers. And if you want to protect that free service, you have to protect the bats. And if you want to protect bats, you need to know where they are. And if you’re building a wind farm, or expanding a mining operation, or renovating a highway overpass, that all requires permitting that will require you to make sure you’re not harming bats. So every one of those developers, if the bat database didn’t exist, they’d have to, what? I don’t know, count the bats themselves to figure out what the impact would be.
And so this streamlines permitting and makes it easier for development to happen in a responsible way. And then there’s also some research that shows that in areas where there have been precipitous declines in bat populations, due to disease for example, infant mortality in agricultural areas goes up. Which is strange, right? But the hypothesis here is that if the bats aren’t providing that free service of insect removal, then farmers need to use more pesticides,
Jed Sundwall:
Yeah, okay.
Denice Ross:
which gets into the bloodstream of pregnant women. So for an infant’s death, you wouldn’t say, well, that’s attributable to the fact that the North American Bat Monitoring Database went away. But you have to be really careful about what data we say are not important anymore. And frankly, that’s one of the blind spots we have: who’s using this data? They’re probably quietly in their basement,
Jed Sundwall:
Interesting. That’s its own issue. Wow.
Jed Sundwall:
Right.
Denice Ross:
you know, deep in some building, using this data, and it could have some super high impact application that just isn’t that public.
Jed Sundwall:
Yeah. No, I mean, it’s kind of inevitable that I bring this up at some point. I’ve never talked about this on the podcast, but there’s a famous xkcd comic about open source dependencies. Hang on, I’ll put it in the chat, but I guarantee there are people I know who’ve memorized the URL for this. It’s xkcd 2347. It’s the comic where you have this huge towering, complex bit of digital infrastructure, and it’s all running off of just one random thing that some guy in a basement is maintaining, or, you know, a bat database that a very dedicated and continually abused public servant has been heroically maintaining forever.
And this is why I say I’m always very cautious, and get nervous, when I talk about market signal to support data: there are data that are maybe very valuable, but for which the market signal is going to be extremely weak. The market won’t tell us that they’re valuable. And this is where, I think you’ll agree with me, the government’s role is so important, because there’s all sorts of stuff that there’s no market signal for, but that we should probably be doing. And it’s the government’s responsibility to make those things happen.
Denice Ross:
Yeah, and that’s one thing. So, having served in both the Obama administration and the Biden administration: under Obama, the focus was on open government, which was exciting and sent shockwaves, really good shockwaves, through the nation and through state and local governments. And then the…
The first Trump administration was so focused on building evidence and data capacity, and they installed a chief data officer in every major agency. So when I came back in the Biden administration, there was so much more data capacity in federal agencies. And what Biden really leaned into, and what my role as the chief data scientist was, was how can we build the data backbone across agencies so that
we’re delivering better outcomes for all Americans. If you want to do that, you need to disaggregate the data in ways that the market may not be interested in. You need to understand veteran status, caregivers, survivors; you need to understand rural versus urban, the role of sexual orientation and gender identity in outcomes, race, ethnicity, gender, primary language spoken at home, whether you have access to a vehicle. There are just so many ways to slice and dice the data to see which populations or areas are being overburdened or left behind, and then adjust our policies and our programs so that we’re benefiting all Americans. And if you don’t…
If you don’t disaggregate the data to identify those disparities, it’s really easy to look at a number like “we’re serving 99% of America” and declare mission accomplished. But if you look at that 1%, it’s almost never evenly distributed. If you look at it geographically, the places you see left behind are Appalachia, the Southern Black Belt,
Jed Sundwall:
yeah.
Denice Ross:
tribal communities, the border with Mexico, rural America. The same places and the same groups of people are left behind repeatedly. And market forces aren’t going to raise those data to consciousness.
Jed Sundwall:
Absolutely, yeah. I’ll agree with you a hundred percent. Well, okay, I’m going to shift gears a little bit, because I’m leading you into talking about a dataset and a story that I think is really interesting.
Historically, if we go back far enough, for a while there it was only the federal government that even had a computer. So we’ve historically looked to the government to gather and store data just because you needed the most powerful nation state in the world to even be able to do it in the first place. Those days are long gone. There’s all sorts of data that can be produced by non-government actors, call them commercial actors or other groups. I mean,
Denice Ross:
Hahaha
Jed Sundwall:
the Environmental Defense Fund famously launched their own satellite, which was lost, which is sad, but they did it. They launched a satellite that produced data. So we’re well past the point where we necessarily need the federal government to do all this sort of stuff. Do you have any thoughts on when it’s okay for other organizations to take over, or to step in
Denice Ross:
Hmm.
Jed Sundwall:
to support this kind of work and how do we know when that’s appropriate or not?
Denice Ross:
Yeah, I have a few thoughts. Maybe three examples come to mind. The first goes back to that idea of primary data production and the unique role that the federal government has in producing core primary data, and then the data products that can be built with those data. A recent example is the billion-dollar weather and climate disasters data set.
It was terminated in 2025, but it’s a NOAA data product, and Climate Central hired the NOAA researcher behind that data set. They’re using similar methodology as was used inside of government, but improving upon it; they’re talking about reducing the threshold so that they can track million-dollar disasters.
So maybe that’s the best place for the billion-dollar disaster data set, as long as the federal data that feed it keep flowing.
Jed Sundwall:
Yeah, yeah, yeah, right.
Denice Ross:
So that’s the big if there, right? So that’s one thing. But then if you talk about something like the Framingham Heart Study, that’s a federally funded study that completely transformed our understanding of heart disease.
Jed Sundwall:
Yes, this is the one I was…
Denice Ross:
It was a federal program that was initiated after World War II. Our president had recently died of heart disease, and I think 40-plus percent of American men had heart disease at the time, so heart disease was very much in the national consciousness. This was a priority. Congress funded the study for 20 years. At the end of that 20-year span, the National Heart Institute announced that it was going to phase out the study the next year.
So the researchers, similar to what’s happening right now with climate and health and other federally funded research that’s been producing essential data, started looking for other funding sources, and they ended up raising money to keep this collection alive from unlikely groups, including the Tobacco Research Council and Oscar Mayer meat processing.
So they went to the private sector to fund the collection during the in-between years. But then the really cool part of this story: it’s one thing to find a way to keep the collection going, to maintain that continuity, because that’s what turns science into knowledge and into action, the continuity across time and space. But you also have to have a policy game there, because the federal government
Jed Sundwall:
Yeah.
Denice Ross:
really should be the steward of these really critical data collections. And it turned out that President Nixon’s personal physician was a real stakeholder in this heart study, and he talked Nixon into advocating to get the funding turned back on for the Framingham Heart Study. So it was this DC-style interaction between the president’s doctor and the president
Jed Sundwall:
Interesting.
Denice Ross:
that then got the funding back on track. And it came back stronger than ever when it was funded again. They recruited the children of the original volunteers, and now that study is three generations long. And as the demographics of Framingham, Massachusetts changed, they widened the sample beyond those initial families so that they could be more representative of the demographics of the US.
Jed Sundwall:
wow.
Denice Ross:
So I think that’s interesting, and there are some parallels for where we are right now, where we might be seeing some gaps in federal support. Maybe we think about this as: let’s create sort of a heart-lung bypass machine for our data, right? Keep it alive, keep the continuity there, but then figure out what the long-term policy plays are to make sure that the data we need as a nation continue to flow, and come back stronger.
Jed Sundwall:
Fascinating. Yeah.
Jed Sundwall:
Right.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. I mean, this is where I’ll advocate for something I talk about a lot at Radiant Earth, which is new institutions, new data institutions. Which is to say, I won’t say I disagree, but maybe the federal government isn’t always the right steward; they are, though, a very important stakeholder, right? So framinghamheartstudy.org, I assume (I just found the website), is
some kind of independent nonprofit or entity in which the federal government is a large stakeholder, as is Oscar Mayer. I don’t know if Oscar Mayer is still involved, or Altria, or whatever Philip Morris is now called. But the point is, it is actually an independent entity that is able to receive resources from
Denice Ross:
Hahaha
Denice Ross:
Right.
Jed Sundwall:
a lot of different stakeholders. And yes, I would agree that this should be a national priority, to understand these things. Yeah.
Denice Ross:
No, and I agree. And I think those types of more creative arrangements that you often see in the sciences can build resilience into the system. Some data sets don’t have that luxury. For example, with the Federal Employee Viewpoint Survey that OPM runs every year: during the greatest disruption ever to the federal workforce, there won’t be any data collected on
Jed Sundwall:
Yeah, great example.
Denice Ross:
how employees feel about it. And so the Partnership for Public Service stepped in, and they’re running a lighter-weight version of the survey, but they don’t have the Rolodex to reach out to every federal employee. I’m grateful that the Partnership for Public Service is running it, but it’s not a replacement for what the Office of Personnel Management should be doing.
Jed Sundwall:
Yeah. Well, then we can start landing this plane, but with a pretty big question: knowing what we know now, how would we protect a data product like that survey? Do you have any ideas?
Denice Ross:
I do. I do. If I could just go back for a second, though. I talked about the billion-dollar disaster data set and the heart study. And the third example is data that I think really do belong in the private sector but have a really important public use.
Jed Sundwall:
please.
Jed Sundwall:
Yeah, you said three examples. I wasn’t sure if that was all of them.
Denice Ross:
And this is when there’s a disaster, one of the important pieces for response and recovery is knowing which gas stations are open.
Jed Sundwall:
Okay.
Jed Sundwall:
makes sense.
Denice Ross:
And so right after Superstorm Sandy, the Energy Information Administration was literally calling gas stations to see if they were open and if they had gas. And I don’t know if you remember the news coverage from that time, but gas was in short supply and tempers were flaring and there were lines of cars at gas stations just trying to get fuel so they could evacuate or go wherever they needed to go.
Jed Sundwall:
Amazing.
Denice Ross:
And so you can imagine how well received the phone call from the federal government was for that poor gas station owner, trying to get a sense of whether the station was open or closed. And then the data were so volatile that who knows what the actual status was. It turns out that a company like GasBuddy, a crowdsourcing tool that’s used especially by truckers, rideshare drivers, and taxi drivers, does this differently.
The way it works is that you go get gas and type in the amount you paid, and then you get rewards that you can spend in the little shop at the gas station. So there’s this whole incentive structure built in. And GasBuddy, it turns out, actually has the best data in the country on gas station status. Yeah. And I know from my friends on the National Security Council that it causes them much consternation to have to cite GasBuddy
Jed Sundwall:
Okay.
Jed Sundwall:
Wow!
Denice Ross:
when they’re reporting up to their superiors on the status of our fuel supply in a disaster impacted area, but GasBuddy actually is the best data set for that. So the question there is how might the federal government create some sort of agreement with GasBuddy so that those data can be reliably available to serve the public good when needed?
Jed Sundwall:
Yeah, interesting. Okay. Well, this is kind of going back to the whole “wouldn’t it be cool if we had all this crime data,” and I’m like, well, who’s going to do that? So many of these just end up being collective action problems, right? That gas station data, you can just imagine what an incredibly vast and complex data product that would be to create.
Denice Ross:
Right.
Jed Sundwall:
And also it’s the perfect sort of thing where a nerd would be like, well, why isn’t there just an API that every gas station reports its prices into? Anyway, it’s like that.
Denice Ross:
Right.
That would be nice, but we don’t even have that for power outages. The Department of Energy has to scrape power outage data from the public websites of the electric service providers.
Jed Sundwall:
No, that’s it. Yeah. Yeah.
Jed Sundwall:
Yeah, I’m not surprised. And again, collective action problems. But it’s a bummer, because people like us who work in this know this is not a hard technological problem anymore. The tech required to do it isn’t hard; it’s the coordination that’s hard. Okay, well then, what was my question? My other question. Yeah. So how would we make things, well, especially these things, I mean, look,
Denice Ross:
Right.
Denice Ross:
less vulnerable.
Jed Sundwall:
I want to be charitable. You’ve said you’ve worked in both the Obama and Biden administrations. I live in Seattle; I run a nonprofit. I think people can guess how we feel about things politically. But the truth is that, for better or worse, half the country seems to be pretty mad at the president no matter who’s in office.
Anyway, I’m not going to start talking about popular vote versus electoral college stuff. But regardless, we live in a country where people disagree with each other, and actually, I think it’s a great feature of America that we’re very skeptical of our leaders. Right? So we’re lucky to have decades of precedent behind us, a functional bureaucracy
that has produced data accurately and reliably for a long time. In the past year or so, though, we’ve started to see data getting taken down, data that really appears to be actively distorted in some ways. We’ve now crossed that threshold. Is there a way back from this, or do you have thoughts on how to protect federal data in the future?
Denice Ross:
Yeah, I think the most important thing we can do comes back to the idea of not taking the data for granted: making visible and explicit the role that federal data play in our everyday lives. There are probably three levels of intervention for that. We start with the people who use data, including the private sector entities that are using federal data,
and making it easier for them to mobilize, to share with federal data stewards and policymakers the ways that they use data, the way they depend on the federal data and why it’s really important for the economy, for example, that these data keep flowing. So my contention there is that anyone who’s a data user should also be a data advocate. And that is completely independent of who’s in office.
Jed Sundwall:
Yeah. Yeah. Okay.
Denice Ross:
And then the second audience for this is policymakers and the federal data stewards themselves, because they often aren’t aware of the deep impact that these data sets have. So, for example, we’ve heard stories of federal data stewards who are able to collect
use cases about why their data collections matter to industries that this administration prioritizes. And that can have a real protective effect on the flow of data that can be used by a whole bunch of different domains. And then, more broadly, just raising awareness with the general public about things like the “no campfires” sign
at a national park and how that also comes from federal data so that we stand behind the investment in these essential data resources.
Jed Sundwall:
Yeah, that’s a great answer. I mean, the policy guy in me is nerding out a little bit, but a government’s job is effectively just to understand what’s going on within its borders, for a bunch of reasons. It’s a pretty easy story to tell. As you pointed out, the OPEN Government Data Act, the Evidence Act, this is bipartisan legislation.
This shouldn’t be that hard. And, it maybe sounds a little bit cynical but I’m okay with it: every administration cares about businesses and economic growth in the country, and data is vital to that. But this is always the tricky thing. I think there’s an obvious, easy case to be made for a lot of data to be produced. Weather data is a good one, where the economy would grind to a halt without it.
Denice Ross:
Right.
Jed Sundwall:
Maybe not a halt, but it would be really bad if we didn’t have weather data. But then there’s this other universe of data where there might not be great market signal, but it’s just really important for governance, for public health or wellbeing or scientific research. I don’t know, it doesn’t seem like this should be that hard to advocate for. Anyway. Okay.
Denice Ross:
Yep. Well, in this interview, you mentioned you’re a policy person. I think I was in this field for 15 years before I realized I did data policy. And if you think about it, there’s not really a pipeline of data policy wonks, right? We’ve got data users who just use the data and assume it will keep flowing. They often use the data as is; they complain about its shortcomings. But they don’t…
Jed Sundwall:
Yeah.
Jed Sundwall:
No!
Denice Ross:
go back to the data steward and say, hey, can you improve this? Because those feedback loops haven’t been put in place. And so I think we have a real opportunity to build the field of data policy, so that anyone who’s a data user, especially of public data, also has a little bit of policy understanding, and recognizes that this is their data infrastructure to co-create as members of American society.
Jed Sundwall:
Yeah, no, that’s beautiful. And actually, you’re helping me realize what I was just trying to say. I think we could be much more forceful: it’s a core function of government to understand what’s happening within its boundaries, and that’s done with data, you know? So yes, there are dozens of us data policy nerds, but we should be more powerful. I think we can all agree. Yeah. Well, this has been awesome.
Denice Ross:
Hahaha.
Denice Ross:
So true.
Jed Sundwall:
I just checked in on the live stream. Apparently we weren’t live streaming on LinkedIn, which we’ll have to look into, but that’s okay, because this will still go out afterward. And no comments or questions from YouTube, so we’re in the clear. We don’t have to answer any hard questions, only softballs from me. Anything else you want to share about your work, or what people should be thinking about, before we go?
Denice Ross:
Hahaha.
Denice Ross:
Yeah, I would say: think about your favorite federal data set, the one you might be taking for granted, the one you wish were a little bit better but couldn’t live without, and start practicing talking to people about why it matters, so that you build your skills on that, because it’ll be useful. It will definitely be useful in the coming year. And if you come up with a good story about why these data matter,
let us know at essentialdata.us, because many of the use cases up there came from people who have deep expertise in a specific data set, and we were able to turn it into a one-sentence love story for that data set.
Jed Sundwall:
All right. Yeah. We’ll point people to essentialdata.us. Thanks for setting it up, and thanks for everything you do. Thanks for coming on. It’s been great. This conversation will continue, so we’ll do it again sometime. Thank you. All right. Okay. So.
Denice Ross:
Thank you, Jed.
Video also available on LinkedIn
Show notes
Jed talks with Jack Cushman, director of the Harvard Law School Library Innovation Lab, about how libraries are adapting to technological change while preserving their mission to collect, preserve, and share knowledge. From the printing press to the internet to artificial intelligence, libraries have continuously evolved their methods. The Lab focuses on bridging traditional library principles with cutting-edge technology to empower individuals with better access to information.
The conversation explores the Data.gov Archive project, which aims to preserve approximately 17 terabytes of federal datasets - not just the metadata from Data.gov, but the actual underlying datasets that are at risk of being lost. Jack explains the challenges of collecting these datasets, particularly the limitations of web crawling technology that often fails to retrieve underlying data. The team successfully collected more than 311,000 datasets, with particular attention to smaller datasets that might otherwise disappear, demonstrating their commitment to knowledge stability in an era where governmental data can be fragile.
Jack discusses how they use BagIt - a Library of Congress standard for packaging digital content - to ensure long-term preservation through comprehensive metadata, checksums for verification, and cryptographic signatures for authenticity. This approach addresses data provenance and integrity, creating complete packages that can be cited and verified decades from now. The discussion also covers their innovative client-side viewer that runs entirely in the browser without server-side software, making 17.9 TB of datasets searchable while reducing infrastructure dependencies. They explore the importance of user-centric design, the role of well-supported tools like DuckDB, the “one copy problem” that highlights data fragility in the digital age, and collaboration with institutions like the Smithsonian. The episode also touches on Perma.cc, another Lab project that addresses link rot in legal documents by creating permanent links to online resources.
Links and Resources
Key takeaways
- Libraries evolve while preserving their mission - From the printing press to AI, libraries continuously adapt their methods for collecting and sharing knowledge while staying true to their core purpose of preserving information for future generations.
- Small datasets matter as much as big ones - The Data.gov Archive project prioritizes preserving smaller governmental datasets that might otherwise disappear, recognizing that knowledge stability depends on capturing everything, not just the high-profile datasets.
- Web crawling alone isn’t enough - Traditional web crawling technology often fails to retrieve the actual data files linked from catalog pages, requiring more sophisticated approaches to truly preserve datasets rather than just their metadata.
- Client-side viewers reduce infrastructure dependencies - Running search and visualization entirely in the browser without server-side software makes 17.9 TB of datasets accessible while eliminating the fragility and cost of maintaining server infrastructure.
- The one copy problem threatens data persistence - Data in the digital age is more fragile than physical artifacts; without robust systems and collaboration across institutions, valuable datasets can disappear when a single server or organization goes away.
- BagIt enables verifiable long-term preservation - Using Library of Congress standards for packaging data with checksums, metadata, and cryptographic signatures creates complete packages that can be cited, verified, and trusted decades from now.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall:
Okay. All right. Well, thanks, Jack. Thanks for joining us here for the third episode. Yeah, lucky number three. And I want to point out, this is kind of an exciting moment, because historically Radiant Earth has really dabbled in geospatial data. That’s our wheelhouse. That’s our
Jack Cushman:
Good. Thank you so much for having me. I really appreciate it.
Jed Sundwall:
Our origin story at Radiant Earth was an effort to make satellite and drone imagery easier to work with. And one of the things I did about three years ago, when I came in as executive director, was realize that a lot of the work we had figured out with the geospatial community was really broadly useful, in terms of adopting object storage and things like that. So anyway, this is all to say, I’m excited to have you on because you’re not a geospatial person.
You know, our first two guests have been geospatial. Yeah. Okay, good. Anyway, this is going to be a great conversation to learn a little bit more about how we’ve been working together on Source Cooperative, your background as a librarian, and your perspective on these things. Before we get into it, though, I do want to point out to everybody, and I’ll figure out how to put this in the chat, that you are currently tuned into
Jack Cushman:
Absolutely, would never pretend.
Jed Sundwall:
Great Data Products, the live stream webinar podcast thing, as we call it. There’s also Great Data Products, the blog post, now. I gave a talk about a month ago at the Chan Zuckerberg Initiative’s open science meeting, and the name of the talk was Great Data Products. And then we published a blog post called Great Data Products. So this is an exercise in brand confusion. Perhaps Radiant Earth could sue this podcast, or this podcast could sue Radiant Earth, for taking the title
for the blog post. But it’s a little bit confusing. In any event, the name of the game these days is Great Data Products, and we’ve got a great blog post about it. I’m very happy with it. I’ll put that in the chat, so in case people haven’t seen it, you should see it. And then, yeah, with that, let’s hand it over to you. How do you introduce yourself to people?
Jack Cushman:
Hi, everyone. I’m Jack Cushman. I direct the Library Innovation Lab. I’m really happy to be on the livestream webinar podcast thing. I love working with you, Jed, on Source Co-op. The lab I direct, the Library Innovation Lab, is a research and development lab, a software lab that’s built into one of the world’s largest law libraries. So we’re doing novel things in a very traditional place and drawing on the best of both of those worlds.
Personally, I’m a lawyer. I’ve worked as an appellate lawyer. And I’m a computer programmer. I’ve been programming computers since I was 12 years old, so very many years. I’m more of a newcomer to libraries, but I’ve been here for about 10 years. You asked how do you introduce yourself, which is always a challenge for me on the tax form. What are you supposed to write in for what your job is? And I’ve come to say information scientist. Really, I’m a person who thinks about how do we consume information? How do we turn it into knowledge?
And how do we help our society over time have better and better access to knowledge? And that’s why the Library Innovation Lab has become such a great fit. Because our mission is to bring library principles to technological frontiers, which means to understand where people are actually getting their knowledge. How is that really happening, which often is outside of the walls of a library? And how can we take the things that we’ve learned in libraries over many centuries and help new technologies to go better? So really core things like libraries are here to…
collect information, preserve it, and share it to empower people. And we’ve been doing that since before the printing press. But when you invent the printing press, you have to change how you collect and share information. Now you need like a written list of the books you have, because there’s enough that you can’t remember them all. When we invented databases, we needed new ways of thinking about libraries. When you invent the internet and data that is digital first, government’s publishing data that is only online and never on paper, you need new ways again to think about information.
Jed Sundwall:
Right.
Jack Cushman:
And now in this AI era, we need yet again new ways to think about what it means to collect and preserve and share knowledge.
Jed Sundwall:
Amazing. So this is interesting. I didn’t realize you were a lawyer. I mean, I guess it makes sense. You’re at the law school.
Jack Cushman:
Clearly I hide it. I’m a recovering lawyer. You know, I have not practiced law since probably 2014, 2015. And happy to leave that to the experts.
Jed Sundwall:
Okay.
Yeah, it’s interesting. I mean, we have a kinship here, because I studied foreign policy and thought I was going to be a diplomat or something like that. I would never call myself a programmer, but I was making websites in like 1994, on Mosaic, you know. I was enamored with the web from the very beginning, and that was just always kind of a hobby
for me. But anyway, I think we’ve ended up in similar places, interested in sharing data and stuff like that. So it’s cool to hear your story here. Can you say a little bit more about the Library Innovation Lab and what you all think about these days? Cause everything you just hinted at was great, you know, pointing out that we had libraries before books. What are you thinking about in 2025, you know, as we go into 2026?
Jack Cushman:
Absolutely. And I’ll say, you know, we need 100 library innovation labs. Anything that we pick to focus on is one of many things that we could have picked, and I hope that all of those flowers will bloom. But for the direction that we go in, the core organizing principle is: your society needs knowledge to plan and to direct itself. If you have poor short-term or long-term memory as an individual, it’s very hard to navigate your life. If we have poor short-term and long-term memory
as a culture, as a community, as a government, whatever layer you want to look at, it becomes very hard to navigate. And all the projects that we look at address that in different ways. We built Perma.cc, for example, which fixes link rot in published cases and in law journal articles and is used by law firms. It makes documents reliable in the long term instead of the short term. When you cite a URL in a document,
you include a permalink, and that permalink is on file as a copy of the web page with the Harvard Law Library. And that means that link is going to work in perpetuity. And it goes from kind of having this Etch A Sketch memory, where you can have a case, and a month later the domain doesn’t resolve and you don’t know what they meant, to having kind of permanent memory again. So what that means for LIL is we’re looking at how do you preserve knowledge for the long term and how do you interpret it. On the preservation side, we’re working on projects like Perma.
We’re working on projects like we’re going to talk about our public data project, which is how do we make sure we don’t lose the public data we all create together? And then we’re also looking at the access and interpretation side. We have a research program looking at law and artificial intelligence, because law is such a wonderful playground for understanding how AI changes our ways of knowing. The law is kind of done by words. I think of how I want to say it. You think of how you want to say it. The judge picks something, and those words become meaning in the real world.
Jed Sundwall:
Yeah.
Jack Cushman:
which means that systems that can interpret and juggle and shuffle words to make meaning all of a sudden have this real practical impact in our field. And it lets us study things like, how are we going to help law students actually learn in a world where the tools can do much of the reading for them? How are we going to evaluate how good tools are at the fine-grained thing that you’re trying to do? How do we do benchmarking of the thing you actually care about instead of abstract benchmarks of other things? And how are we going to navigate a field where employment is rapidly changing?
Like, law employment used to be very pyramid-shaped. You hire a bunch of people down at the bottom to read through piles of paper in a box. And now the need for reading through piles of paper in a box is really changing. We have to reinterpret what it means to be a junior lawyer who works their way up. So we’re doing a bunch of things that are about how to make sense of the data once we have it. And you’re seeing both sides of that in the work we’re doing with you. There’s how do we responsibly collect things, and then how do we responsibly share them so that people can really find what they need.
Jed Sundwall:
Yeah. Well, yeah. So let’s talk about the data.gov archive and how that came about. Cause I think the conversation started about a year ago, when we thought maybe it would be a good idea to start backing up data.gov. But I will confess I don’t have the clean answer to what is in this collection. How do you describe it to people?
Jack Cushman:
Yeah, yeah, great question. So what’s the point of the data.gov archive? It did start because we wanted to do some broad reaching collection of federal data sets. And you mentioned, like, you know, there’s a geopolitical context where you might say, it’s important right now to save data. And at the same time, our law library has been saving data for the federal government since the early 1800s. I don’t know quite when Harvard’s relationship started, but.
The first act where Congress started asking organizations like ours to preserve documents was in like 1813, in the federal library depository act (I’m going to get the name wrong), but it’s been over 200 years that Congress has been saying, please help us collectively preserve the stuff that matters. And with data.gov, we were saying, well, what does that mean for 2024, 2025? And
We already knew that the End of Term Archive, which we’re part of, was doing a wonderful job of collecting the web pages of the federal web, including anything under .gov, but also including their Twitter pages and their YouTube and anywhere that the federal government had a footprint, getting a snapshot before and after the transition so you could understand what changed. And End of Term Archive has been doing that since 2008. It’s not a kind of this year or that year thing. As a citizen, you should be able to see what your government was and what it’s become. And you should be able to see that repeatedly as the government evolves.
So we knew that was happening. Then we said, well, what’s not happening? And the real risk that we saw is you can easily end up, if you do a web crawl, getting the manual for the data but not getting the data itself. Because the way web preservation will work is you have a browser, like any of us would use, and it clicks from link to link. And it tries to click all the links on the page, and it clicks all the links on the pages it finds, and then it clicks all the links of the pages it found there. But it can’t do things like interact with a form. It can’t do things like if you need to send an email to get data or
If you need to script an API, it’s only going to get the stuff that you can get by clicking, which is wonderful, but might mean that you end up with a submerged layer of, wish we had the actual data that this report was based on, and that is just gone if it disappears. There was a data rescue community that emerged around that time, a bunch of different groups working on wonderful projects. The part that we worked on was, see if we can save the underlying data behind the data.gov website.
Jack Cushman:
Data.gov itself is an index. It lists datasets across the federal government and also some states. But it doesn’t store the data. It just says, you can go here to read this, you can go here to read that. They do have an API. So what we did is say: let’s script this API, get a list of all 300,000 datasets in there, and then find everything they link to and call that the collection. So, you know, dataset number 2,104, which is a dataset of…
you know, traffic congestion in medium-sized cities or whatever part of measuring our society is going to link out to this CSV and this Excel file and this PDF and this zip file. And that list of objects becomes what we want to put in a collection. And then the goal is to have, you know, accurately collect each of those things. So grab the metadata from the API, grab all of the URLs that link out to it and package those up as one of 300,000 objects that we were making in a new
Jed Sundwall:
Got it.
Jack Cushman:
collection of collections.
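The harvesting approach Jack describes, scripting the index’s API to list every dataset and its linked files, can be sketched roughly as follows. This assumes data.gov’s public CKAN `package_search` action endpoint; the helper names are illustrative, not the archive team’s actual code.

```python
import json
import urllib.request

# Public CKAN action API behind data.gov's dataset listing.
API = "https://catalog.data.gov/api/3/action/package_search"

def page_url(start: int, rows: int = 1000) -> str:
    """Build the query URL for one page of search results."""
    return f"{API}?rows={rows}&start={start}"

def list_datasets(rows: int = 1000):
    """Yield (dataset id, [resource URLs]) for every listed dataset."""
    start = 0
    while True:
        with urllib.request.urlopen(page_url(start, rows)) as resp:
            result = json.load(resp)["result"]
        for pkg in result["results"]:
            # Each package links out to its files: CSVs, Excel, PDFs, zips.
            yield pkg["id"], [r["url"] for r in pkg.get("resources", []) if r.get("url")]
        start += rows
        if start >= result["count"]:  # the count was roughly 300,000 at archiving time
            break
```

Each yielded pair corresponds to one of the 300,000 objects Jack mentions: the metadata plus the list of URLs to fetch and package.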
Jed Sundwall:
Okay, but then obviously, you know, our world again, going back to the geospatial world: we deal with federally produced data sets that are petabytes in scale, you know, weather data and model outputs and satellite imagery, things like that. You don’t have that stuff. So this is just what’s linked to, I guess. I guess my question is, how many layers deep did you go?
Jack Cushman:
Yeah, great question. So we went one hop deep. So you have the listing on data.gov. It links to a set of files. And it says, these are the files in this data set. And we grabbed those files. And I think what that meant is we ended up collecting the smaller data sets. Because for the smaller ones, it would be linking right to an object, a file that was the data of that collection. And for the larger ones, yeah, it had the problem that those links would go to a landing page that said, for this petabyte-scale collection, here are the steps you go through to get it, which are very individual to that collection.
For those, we would only get the landing page. We wouldn’t get the actual data. And what that meant is we added up to about 17 terabytes of data, which is a bunch of small data sets and then a bunch of landing pages for large data sets. I think the size kind of tells you both what it succeeds at and what it fails at. Because it tells you on the one hand, no, we didn’t get the massive uncompressed image collections or that kind of thing. It also tells you we didn’t just get landing pages. Like 300,000 landing pages is not 17 terabytes by any means.
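Jack’s distinction between small datasets (direct file links) and large ones (landing pages) suggests a simple classifier based on the URL and the server’s Content-Type. This heuristic is purely illustrative, my assumption rather than anything the archive actually ran:

```python
# File extensions that usually mean the link is the data itself.
DATA_SUFFIXES = (".csv", ".json", ".xlsx", ".xls", ".zip", ".xml", ".pdf")

def looks_like_data(url: str, content_type: str = "") -> bool:
    """True if a resource link probably points at the data file itself."""
    if "text/html" in content_type.lower():
        return False  # an HTML response is likely a "here's how to get it" page
    # Ignore query strings when checking the extension.
    return url.lower().split("?")[0].endswith(DATA_SUFFIXES)
```

In practice the content type would come from a HEAD request to each resource URL before deciding whether to download it or just keep the landing page.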
Jed Sundwall:
Right. Right.
Jack Cushman:
We got a ton of the smaller data sets. And I kind of liked that as a first pass. We just want to do something to stabilize what exists now, not be losing things. And I think it gets you a very broad reaching, small significant data sets are going to be in there and are going to be preserved. And then it sets up for this question of, well, what else got missed? And you know what? It was true at every level. So there was one piece we knew, is the things in data.gov, we’re going to get some of them. We’re going to miss some.
that’s necessary at this scale. We were also told going into it that data.gov itself is a partial listing of the federal government. I talked to technical folks working in the government at that time to get an idea of, where’s the list? What would I download if I wanted to download the data sets of the federal government? First I asked, do you know where that list is? And then I asked, who could you ask? And they said, first of all, no, I don’t have a list. Second of all, no, I can’t even think of a group of people I could ask who could collectively know what it is.
What we have is a sort of sprawling, overlapping set of independent agencies and groups just making data. And if you look at data.gov, it’s like, here’s a cool snapshot: 300,000 out of X, out of we don’t know how many.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah, man. You’re taking me back. You know, many years ago I worked for USA.gov. I was at GSA as a contractor when data.gov was launched, so I had a front-row seat to all of that. And I have a similar story, which is, at USA.gov at some point, cause I was leading the social media strategy for USA.gov. And I mean, to give you a sense of what this meant, I started before Obama was elected. Like I started toward the end of
W’s second term, and Facebook and Twitter were already becoming a thing. And it was like, we need to learn how to use this. How do we do it? And at some point somebody was like, we need to keep track of every federal social media account. And it was like, well, what are you gonna do? Open Excel, create a spreadsheet, and just add them as you find them? And we’re like, that’s obviously not gonna work. This is too big now. And so we created a thing that I’m pretty sure
Jack Cushman:
Mm-hmm.
Jed Sundwall:
I don’t know, it might still exist in some form. It may have been deprecated, but we called it the USA.gov social media registry. Basically, what we did is we let anybody with a .gov email address submit a social media account that they managed. And then we would send them an email, because we’re like, okay, you’ve got a .gov email address. We also asked them to put in their phone number just to scare them, just to be like, this is serious, don’t spam this thing. But basically you would get an email with a token in it.
You’d click on that so that we would know that you actually owned the .gov email address that you put in. And we’d say, okay, this does look like the Twitter account for the embassy in Myanmar or something, whatever it was. And it worked really well. We called it fed-sourcing. Like we’re going to kind of crowdsource all this sort of stuff. But one of the things we wanted to do for the form was like, we need the list of the government agencies, which I know that you’ve dealt with.
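The fed-sourcing flow Jed describes (submit an account, receive an emailed token, click it to prove you own the .gov address) can be sketched with a signed, expiring token. Everything here, the names and the HMAC scheme included, is an illustrative guess at one way to build it, not the registry’s real implementation:

```python
import hashlib
import hmac
import secrets
import time

# Server-side signing key (illustrative; a real service would persist this).
SECRET = secrets.token_bytes(32)

def make_token(email: str, ttl: int = 86400) -> str:
    """Create the token emailed to the .gov address; clicking it proves ownership."""
    expires = str(int(time.time()) + ttl)
    sig = hmac.new(SECRET, f"{email}|{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{expires}.{sig}"

def verify_token(email: str, token: str) -> bool:
    """Check the signature and expiry when the link is clicked."""
    expires, sig = token.split(".")
    if int(expires) < time.time():
        return False  # verification link expired
    expected = hmac.new(SECRET, f"{email}|{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Because the signature covers the email address, a token mailed to one .gov inbox cannot be replayed to verify a different address.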
Jack Cushman:
Not sure. That seems like that list would exist.
Jed Sundwall:
Yeah, well, it’s actually something I was going to ask you about, because you guys have built, and this is also a segue into the viewer that you all produced, this awesome data.gov archive search. I’ll let you talk about this. But one thing I just sort of want to get out right away is that you have things listed into organizations, publishers, and bureaus. And I’m curious to know if you all had the same conversation where you’re like,
what are the government agencies? Because as far as I know, that list still doesn’t really exist anywhere. We had to make one up based on a Wikipedia article. Like that was the best source we could find. So.
Jack Cushman:
I love that story. Well, before we get into our archive, I think that question of what is the denominator, what is the set of data that’s out there that we wish we could save, really helped me appreciate the goals that we have behind this thing. Because I started to picture where this data is coming from. And rather than, like, there’s the DOJ, these objects out there that are doing things as a giant unit, what we’re really talking about is federal employees.
Jed Sundwall:
Yeah.
Jack Cushman:
you might know the number better than me, maybe 2 million federal employees who are out there doing things for us, making things for us, who go to work and in some way facilitate the functioning of the country. And in the course of their business they’re making data, making data sets, whether it’s how are the crops growing, or how’s the water in the aquifers, or what’s going on in this little section of the economy or this little section of education, or whatever it is: people going about their day and along the way recording things that help us understand what’s happening.
Jed Sundwall:
Yeah.
Jack Cushman:
And it helps to understand why there’s not a central list, that of course those two million people would be generating millions of Excel files, things that are just like, here’s some stuff you should know, here’s something I learned in the course of my day that is worth writing down. Many of them very deliberate and collective, across a group of people. But in many ways, as people who live here, as people invested in our society, we would want all of that. We would have this kind of relationship that is not a citizen and a government, but a person and a person,
that those people should be able to publish the things they learn that will help us. And we collectively should be able to access those and use them. And at that level, the mission starts to feel much more palpable and meaningful to me. That’s like, how do we help those people who are learning things or trying to help us record the things that they’re learning so that they are permanent? And so they’re findable. And if we can have the right taxonomies, let’s do it. If we can have processes, let’s do it. But at the end of the day,
Jed Sundwall:
Yeah.
Jack Cushman:
let’s just have the stuff that we paid for. Let the people we employed to help us be able to share the things that they learn, and be able to preserve those. And then let’s back into how we would get that list, how we would index it, how we would organize it. One thing I’ve gotten really curious about, I think there’s a project out there. I don’t know if this is a you-and-me project or who should do this. But I would love to use the Common Crawl and the End of Term Archive to try to just make the list.
Like what if you went through every web page we know about, maybe ask an LLM, you know, do some automation in there, and ask: what clues does this give you about a data set that exists? And then see if we can find all of that and, you know, aggregate it and combine it, deduplicate, and come out with the world’s first denominator of what’s the data the federal government has published. And how many data sets would that be, on top of the 300,000 we know about? It would be so wrong. Like the number you got would be
Jed Sundwall:
Yeah.
Jack Cushman:
barely related to reality, but it’d be the first time someone has planted a stake for, I think this might be the list. I think this might be just our inheritance as people who live here and people who are trying to share data with us. Like, this could be it, what ought to exist. Cause I’d love to be able to see that. I’d love to be able to see that constellation and look up and say, yes, that is the thing that we have built.
Jed Sundwall:
Yeah, so you’re reminding me of two things. One is, are you familiar with this story? I don’t even know what you would call it. It’s an essay written by Jorge Luis Borges called The Analytical Language of John Wilkins. I imagine this has to be right up your alley. I’m putting it in the chat. Librarians should love this. A lot of computer scientists love this story. Cause it’s a story about an effort at creating
Jack Cushman:
don’t know that one.
Jed Sundwall:
an actual language. It’s similarly an attempt at sort of taxonomizing the universe, and it doesn’t really work out very well. And Borges points out, he’s like, the reason we can’t do this is because we don’t know what thing the universe is. We don’t have a handle on it. And to your point about the government being perceived as a monolith, being perceived as something that is in DC or something like that: it’s just obviously not true. And that’s the other…
Jack Cushman:
Yes.
Jed Sundwall:
The other thing you remind me of is another essay. I actually don’t know who wrote it off the top of my head, just some guy who wrote on the internet, but I’ll find out now as I Google it. The title of the essay says it all: Reality Has a Surprising Amount of Detail. It’s by a guy named John Salvatier, I’m not sure how to say his name. Both are fantastic little essays. So yeah.
We’ve both lived through this, where you can see it in open data policies that are like, the government produces data, the government should make the data open. And those of us who then start looking hard at it were like, man, this is not an easy task.
Jack Cushman:
I absolutely, I love this duality, this like, well, there’s an abstraction that we wish we could have, the perfect data that exists in the abstract. And then there’s this reality that what we’re talking about is the subjective views of a bunch of human beings. And this comes up very practically in the kind of work that we do, both you and I, when you’re trying to do archiving work. Reality kind of doesn’t wanna fit your taxonomy, and you have to make a lot of choices. When we were doing the Caselaw Access Project, where we scanned
the collective case law of the United States from historical times up to 2018, we found cases that came with imaginary dates. Courts would just publish a case, and there in the book it would say, oh yeah, February 29, 1911, just a date that doesn’t exist. And we were trying to put it in a database, and Postgres was like, that’s not a real date. I can’t save that in my database. And we’re like, okay, but it’s a real case. It really has that date on it. It is precedential. It’s part of the law that you and I are supposed to know and follow.
Jed Sundwall:
Sure. Yeah.
Jed Sundwall:
Wow. Yeah.
Jack Cushman:
We just have to infer, well, from what date did it become part of the law? I guess maybe midnight on February 28th; it existed in this magic hour. And I love that example because there’s this thing that we’re trying to do, the reason to do all of this, which is: we’re owed ground truth. And the ground truth is both subjective and objective. We all live on a planet made of atoms. And how much water is in the aquifer is just how much there is. You can’t change that by describing it differently.
But we’re all kind of observing and touching reality with different means and levers. And what we’ve come away with, our measurements are all different and subjective. They add a layer of subjectivity. And if you’re the collector of collections of collections at the end of it, which is kind of where we’re trying to be, you end up with both of those at once. We have an objective reality that we’re measuring and we have a subjective attempt to measure it that we’re trying to make sense of. And I just love that game. I love that work that we get to do of like,
help to see the world for what it is and also help to see people for what they are, which is, you know, very imperfect observers of everything we see.
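The impossible-date problem Jack mentions can be handled by keeping the printed string as ground truth and treating the parsed date as best-effort. A minimal sketch; the function name and date format are assumptions for illustration:

```python
import datetime

def parse_court_date(raw: str):
    """Try to parse a printed case date; keep impossible dates verbatim.

    Returns (parsed date or None, original string). "February 29, 1911" is not
    a real calendar date, but it is what the reporter volume printed, so an
    archive keeps the string and stores only a best-effort normalized date.
    """
    try:
        return datetime.datetime.strptime(raw, "%B %d, %Y").date(), raw
    except ValueError:
        # Postgres would reject this as a DATE; store the string as-is and
        # leave the normalized column null (or infer one, as Jack describes).
        return None, raw
```

The subjective record (what the book says) and the objective one (what the calendar allows) are stored side by side instead of forcing one to win.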
Jed Sundwall:
Yeah, I love it. I mean, this is a reminder: everything we do is part of Radiant Earth, which is a nonprofit, right? And our mission is to increase shared understanding of our world by making data easier to access and use. For that reason, I always refer to the blind men and the elephant. I always use this framing that we’re feeling our way in the dark.
We’re increasingly adding new capabilities for measuring reality and trying to understand it. And I’m like, well, let’s make sure we do that together. And I’ll say, what I love about my job and the approach we’re taking is that it gives us so much freedom to be happy anytime anybody takes a swing at it. We’re like, yeah, go for it. Yeah, exactly. And people are like, you know, I wanna try some weird new file format. And other people are like, well, that’s not…
Jack Cushman:
Yes, get that up there too.
Jed Sundwall:
that’s not the one that we use. And I’m like, it doesn’t matter, let them try. So that’s a segue. We should talk about the search, the archive search, but I want to talk about BagIt first. How do you describe BagIt to people, and why do you use it?
Jack Cushman:
Sure. BagIt is a sort of collective product from the library community writ large, but it was strongly endorsed by the Library of Congress, so it really got some traction there. And I think that was around the 2010s; I don’t remember the exact date. The notion was to have a data transfer format that is as simple as it can possibly be, where every moving part has been stripped away,
so that you can do it reliably and make readers that can reliably pass things around regardless of what’s inside. Because part of the issue is you end up with, well, here’s how you encode a web archive, and here’s how you encode an image or an image collection, and here’s how you encode a novel. And you have this proliferation of formats, and you get things that fall in between them, and you have the kind of taxonomy question we were just having. So what if you had something that can just correctly encode anything in a very loose way? So a bag is a folder.
Jed Sundwall:
Yeah.
Jack Cushman:
And the folder has inside it another folder, which is the data folder. And whatever is in there is the thing that you bagged. And then it has a little bit of metadata. It has an index that says, here’s the hash of everything that is in me as data that I’m recording. And here’s the date I was made and some things like that. And beyond that, it’s up to the implementer to decide what substantive metadata to record. So it becomes a lowest common denominator way to pass around data in the library and archives community. And certainly,
Jed Sundwall:
Okay.
Jack Cushman:
you want to specialize from there. You want to have image collections and have a bunch of image-specific things that they standardize on. But you don’t want to be stuck with that. You want to also be able to step down to a lowest common denominator to do interchange. We reached for BagIt with data.gov because it looked exactly like that kind of problem: a very heterogeneous collection. 300,000 data sets, you don’t know what’s in them. You want to just get them all and get them correctly, regardless of whether there are new file formats you don’t know about. So something that was like, take the files you care about, put them in this folder,
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah.
Jack Cushman:
was a really nice place to start. And then we had to build a bunch of stuff on top of it.
Jed Sundwall:
Yeah. Okay. Okay. But the idea, though, is the folder is an object. It’s a binary that gets uploaded to S3, and it’s a BagIt file.
Jack Cushman:
Yeah, so if you’re passing it around, I think we zip them. We put them in a format where they’re compressed, but also, with an index, you can pull out individual files from the compressed thing. And this is kind of an elaboration on top of BagIt itself. So BagIt doesn’t specify a single-file expression of itself. The bag is actually the unzipped zip. So it’s like a folder. It has this file, it has this file, it has this file. And if you have a folder that complies with that, then it’s a BagIt object. It’s like a folder.
Jed Sundwall:
Okay.
Jed Sundwall:
okay.
Jed Sundwall:
Interesting.
Jack Cushman:
But we don’t actually share folders on the internet. You always have to turn it into a single file one way or another. So when we share them, the way that we did it is to zip them and index the zips. And if you do that right, then you can get a set of ranges where like, do you want this CSV out of the bag? Just fetch this range directly from the file, and it’ll give you that CSV. And that’s kind of the best of both worlds for serving in terms of it’s small, but it’s also accessible.
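The zip-plus-index trick Jack describes works because a zip’s central directory records where each member starts, so a reader can compute the byte range that holds one file and fetch just that range over HTTP instead of downloading the whole archive. A sketch of computing those ranges; the function name is illustrative:

```python
import io
import zipfile

def member_ranges(zip_bytes: bytes) -> dict[str, tuple[int, int]]:
    """Map each zip member to the (start, end) byte range that contains it.

    Range = local file header (30 fixed bytes + name + extra field) followed
    by the compressed data. A server supporting HTTP Range requests can then
    hand back exactly one CSV out of a zipped bag.
    """
    ranges = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # Read the local header to get its true name/extra lengths,
            # which can differ from the central-directory copy.
            header = zip_bytes[info.header_offset : info.header_offset + 30]
            name_len = int.from_bytes(header[26:28], "little")
            extra_len = int.from_bytes(header[28:30], "little")
            start = info.header_offset
            end = start + 30 + name_len + extra_len + info.compress_size
            ranges[info.filename] = (start, end)
    return ranges
```

With the ranges published alongside the zip, a client needs two small fetches (the index, then one range) rather than the whole bag.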
Jed Sundwall:
Right.
Jed Sundwall:
Yeah, it’s interesting. We have to think through this on Source. As far as features go, the way Source works is you’re just navigating an object store. For those who know, you’re not clicking through folders; you’re navigating prefixes and then listing what’s in there. But then when you get to an individual object, we want to tell you and show you everything we can about that object. And something we need to do for bags and zips and tars is
show you that index. So it’ll be kind of a new view that we have to think through a little bit, where it’s like, yes, you’ve landed on an individual object, but also you should think about it as still part of this kind of directory structure. Yeah.
Jack Cushman:
That’s right. And your podcast listeners may know this, but I think I should plug the mission that you’re describing, which is, I like to say we collect collections of collections of collections. And I think you then collect collections of collections of collections of collections. So you end up with this very meta, here is a thing: Harvard made this collection of data.gov objects. But you don’t want it to just be bits that people have to download and have a local viewer for.
Jed Sundwall:
Yeah.
Jack Cushman:
What I’ve heard from you is we really should help people understand what it is they’re getting. Just a little like try before you buy of like, what would be in there if I pulled it down? And that becomes easy for like a few standard things to show the beginning of a CSV or an Excel file. It’s very straightforward. And you’ve done things with mapping, which I think is also wonderful. But yeah, what do you do when you have a zip file? Are there ways that we can start to show that? I love this vision that our community can do that together. We can start to say, I’d love to be able to try before you buy this kind of object too. There’s a bunch of these and I’m curious what they are.
And then just contribute that viewer and have that happen too. I think that vision is so key to this. One thing that you and I have talked about a bit is that some of it is really very specific to one collection. Like, we have a custom viewer for data.gov. I actually think you probably want a custom viewer, because you don’t want a BagIt viewer in general. BagIt is a very general format, so it’s hard to expose much detail there. You want a, you know, Jack Cushman-flavored BagIt viewer. Like, you know, the…
Jed Sundwall:
Yeah.
That’s right.
Jack Cushman:
a viewer that will tell you what’s specifically in these ones, so that with a little bit of elbow grease on our side, you can actually see what’s in there very specifically. And I think the game is: how much can we use standard formats, and how much do you end up with a bunch of viewers?
Jed Sundwall:
Yeah. I mean, first of all, it’s very nice to hear you repeat back what we’re trying to do, and you nailed it. Yeah. Well, it’s better to have you toot the horn for me. So that’s great. I love it. Fancy Harvard guy agrees that what we’re doing is a good idea. Well, I just put in the chat the archive search viewer, because absolutely, I mean, our…
Jack Cushman:
You skipped past tooting your own horn, but I think it’s such a good strategy.
Jed Sundwall:
So this is a callback to the Great Data Products blog post, where I finally published again what I call the sweet spot graph, which is something that I’d come up with when I was working at AWS. I still have more work to do on this idea, and we’re gonna write another paper about it, but the notion is that you don’t wanna over-determine how data is interpreted. It’s everything you were saying before.
But you do still want to give people some assistance in seeing the data, right? And so you have to find the sweet spot between, like, here’s the raw data, we refuse to interpret it in any way, let the universe decide what it’s good for. But also, let’s be honest: if you download a hundred-thousand-row CSV, you can’t open it in Excel. And then if you’re
properly nerdy, you’re gonna do a head in the terminal and just sort of look at the first few rows. We can do that in the browser now, trivially, you know, so we should. And so that’s something we want to build in. But then also, to your other point with the viewer that you built: if you have a handle on your collection of collections that you’ve put together, you should also, in the browser, be able to show people around. Yeah, give them the Jack’s tour, which is great.
Jack Cushman:
Yeah, very much. There’s this semi-opinionated, because I’m not opinionated about the details, but I’m opinionated about, like, what’s the most sensible way to explore this? You know, one place where I think that’s getting more urgent is, as a data rescue community, as an archival community, we have a real challenge with preserving the interfaces to things. So one thing that you’ll get those two million employees doing is, like, well, here’s some data, and I think you actually might want to see it on a map combined with this other data so you can understand how, like,
Jed Sundwall:
Yeah.
Jed Sundwall:
Yes.
Jack Cushman:
your housing choice relates to your school choice, relates to your hospital choice, whatever the things are. There are all these semi-opinionated viewers that just combine two sources that are helpful to see in a shared visualization. And those we mostly lose because when you move from saving the underlying data to saving the software, you’re moving from the business of data preservation to the business of software preservation, which is its own field that is just much more complicated. You have to understand.
Is the source open? Is there a way to host it? Is there a way that it will be patched in the future? How does it need to evolve? And software preservation is just a much more challenging, one-at-a-time kind of business. So the point is, we’re losing a ton of our viewers if we disinvest in publishing data. And that means we need to ask, because I think the archival community cannot replace that, we’re not two million people who can come build things, we need to ask: can we
make more general-purpose viewers that help people actually see the part of it they need? And so the undertaking of what would be the sweet spot of a general-purpose viewer that helps any given person understand what they’re looking at, I think, becomes so important.
Jed Sundwall:
Yeah, yeah, well, I guess I’ll say to everybody, stay tuned. I mean, this is something that we’ll definitely be doing a lot more of. And what’s actually kind of funny, I mean, people tend to think this is funny, at least a lot of the people I hang out with, because they’re climate model nerds, but I’m like, we really need to make it easier for people to see CSVs on the web. They’re like, what? I’m like, trust me.
Jack Cushman:
Sure.
Jack Cushman:
Yeah. Yeah. I think that user feedback is so important. One piece of feedback we got for our Caselaw Access Project is, we were publishing JSON Lines files, one line of JSON per record. And that was really useful for Python programmers; there are great tools for reading that. It was very confusing for our R programmers, if I’m remembering right. In R, it was a lot easier to read a CSV than a JSON Lines file. And I just got this feedback, like, can you make it CSVs? That works better in my environment.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
And it was these little things, like, if you can get past that friction, then people are able to use the thing.
Jed Sundwall:
That’s right. Well, and also, I mean, I think the story you just told highlights something that we feel really strongly about, which is that you really have to focus on the practitioner community. This goes back to the sweet spot concept of over-determining how data gets presented. If you go too far the other way and you’re like, well, people just want a dashboard, or you just want a visualization for an executive, you’re cutting out a whole user community that…
could really surprise you and do interesting things with the data. Well, could you say a little bit more about your viewer, though? Like how it was built, and yeah.
Jack Cushman:
Absolutely. Yeah, very practically, if you go to this link, you can go browse our collection. And the way that we’re structuring this is sort of some tasteful use of the metadata that came with data.gov. So this owes a lot of DNA and credit to data.gov for structuring the data, offering metadata for how to just shuffle these 300,000 datasets. And we’re really just replicating that. Going back to your question of whether we have a separate list of US agencies:
Jed Sundwall:
Okay.
Jack Cushman:
We really just have the list that came with the data, of what metadata entries they have. And we let you search by dataset title, organization, and so on. And then we let you narrow down by categories, the metadata fields we saw as the most useful chunks in our raw data to let you browse. The really important thing about this, what makes it a little more interesting than a million other pages you’ve seen that let you browse a large dataset and narrow it down,
is that it’s running entirely in your browser. There’s no server-side component to that. And for folks who might be on the less techie side of things: in a typical website, you have your browser that runs on your computer, and it fetches HTML and JavaScript and so on from a server. The server is also running custom software. And when you send in your request for, just give me the ones that came from the US Geological Survey, the server filters out all the others, narrows it down, bundles up exactly what you need, and sends it down to you. Which means the person who’s providing this to you
is doing ongoing work for you. They’re keeping this software up to date and running and paid for. And so you’re dependent on them still existing. If you want to come back tomorrow or next year and still be able to narrow things down to just the US Geological Survey, you’re depending on the person who’s really providing a service for you still being there, to narrow it down for you and hand it to you live when you need it. And that creates a lot of precarity in the digital humanities space. And there’s a…
We now have enough decades of experience making digital humanities projects, putting them online, and then running out of money for them and having them crash, that you can study this. You can look at a hundred projects and what made them live or not. And that server-side software load really becomes an issue, because it’s the first thing that’s going to kill your project. It’s a huge difference between print books in libraries and digital ones, and I love this contrast. Given some climate control, given a roof that doesn’t leak,
books are pretty happy to be left alone for a year. If you’re like, you know what, we just don’t have staff to open up this part of the library for the next year, we’re going to close the door, set the thermostat to the right level, you’re probably just going to find them in better condition in a year than they would be if people had been looking at them. With digital, it’s not like that. If you’re like, we just don’t have the people to maintain this for the next year, there’s a good chance it’s gone and unrecoverable when you come back for it. You didn’t pay some server bill, something got deleted, no one’s around who knows how to put it back together, and it’s just gone.
Jack Cushman:
So this viewer, the really exciting thing about it is that it’s really not subject to that kind of rot because it’s client-side only. Because when we give you the data, we hand the entire software to view it to you right alongside. And the idea is if you’re making a copy of this, you get the original, you get the software too. Your copy becomes just as good as the original. And you can see right now, it’s kind of clunky. When you click around it, it’s slower to load than it would be if we had a powerful server running.
we’re kind of pushing the edges of what’s possible to do in the client. I think we can push those edges a lot further. I think a lot of the clunkiness can be fixed by more indexes and more optimization here and there. But what you’re really having to do is think through if all we could do is write static files, what static files would we need to make the experience I want very efficient? And just like you have seen,
geo data that is structured very carefully so that you can fetch the parts you need from the server without needing server-side software, we can use DuckDB and write custom Parquet files out that have the indexes you need to serve this experience, with the data you most need right at the top. And the better we have that structure, the faster the thing can run. A cool thing about that is it ends up being the same skill that you need to make fast server-side software. So like,
if your data is poorly indexed and you’re sending a bunch of queries to the server that require it to do a bunch of work, the server is going to crash if a bunch of people use it. So you try to use indexes where the server has to do very little work. If you get those really right and really pristine, you don’t even need the server. You can just fetch the index data directly. That’s the plan. We should talk about cryptography too, because I think that’s a necessary piece of this vision. But let me know if we should jump to that now or stick to the client side.
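The static-file idea Jack describes can be sketched in miniature with the Python standard library. The real viewer uses DuckDB-WASM reading carefully laid-out Parquet over HTTP range requests; this toy (with invented record fields and file layout, purely for illustration) shows the same principle: sort the data, precompute a tiny index of byte offsets, and then a client can fetch only the slice it needs from a dumb static file, with no server-side software at all.

```python
# Toy sketch: records sorted by organization, plus a small index of byte
# ranges, so a "client" can read just the slice it needs. This plays the
# role an HTTP Range request plays against a well-structured Parquet file.
import json

def build_static_files(records):
    """Sort records by org; record each org's contiguous byte range."""
    records = sorted(records, key=lambda r: r["org"])
    body, index, offset = b"", {}, 0
    for rec in records:
        line = (json.dumps(rec) + "\n").encode()
        start, end = offset, offset + len(line)
        index.setdefault(rec["org"], [start, end])
        index[rec["org"]][1] = end  # extend the range for repeated orgs
        body += line
        offset = end
    return body, index

def client_fetch(body, index, org):
    """What the browser does: consult the index, then read one byte range."""
    start, end = index[org]
    return [json.loads(line) for line in body[start:end].splitlines()]

data = [
    {"org": "USGS", "title": "Streamflow"},
    {"org": "NOAA", "title": "Storm Events"},
    {"org": "USGS", "title": "Earthquakes"},
]
body, index = build_static_files(data)
print(client_fetch(body, index, "USGS"))
```

The point of the sketch is the trade Jack names: all the indexing work happens once, at publish time, so serving is just handing out bytes, and any copy of the files is as functional as the original.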
Jed Sundwall:
No, let’s linger on that a little bit. So yeah, when you open the search, you have a little spinner there. And I assume what’s happening there is, is it WebAssembly loading? Do you know?
Jack Cushman:
I think it’s DuckDB loading. There’s about five megabytes in the current client that have to load just for raw DuckDB. And this was a technical choice we had to make early on: do we use a well-supported, off-the-shelf library that does make you load a few megabytes? Or, the core work that we’re doing could be done in a lot less software to send down, but you’d have to do a lot more custom work.
Jed Sundwall:
Okay.
Jack Cushman:
We ended up deciding to go with the off-the-shelf thing with DuckDB, because it makes us part of a larger community and, we think, it’ll feed back and forth with the open source community better that way. But it was a tough decision. I think the state of this technique right now is that it’s still pretty bleeding edge. You find a bunch of libraries where someone made it and thought it was cool but stopped supporting the GitHub repo, or it was one maintainer and now they’re gone, or it’s a large project that’s planning to implement it but hasn’t gotten around to it yet, and you have to find a branch where it kind of works.
So working this way ends up kind of pushing you into some creative coding. So part of what you’re seeing is loading that DuckDB software, for now. And I’m hoping, I think DuckDB itself could be a lot smaller, and that’s one direction I’d love to see it grow. The other thing you’re seeing is loading the data. So in addition to fetching DuckDB, at some points as you click around through here, it’s going to say: to answer that query, I would need to have loaded this index that I know exists.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
And so it’ll go back to the server and say, can you please send me 500K or a megabyte of this index? It’ll help me show the answer to this. And as you click around, you’ll see less of that, because you’ll be loading into your browser the parts that you need to see the experience that you want to have.
Jed Sundwall:
Okay, yeah. I just put in the chat also that, you know, Hacker News picked it up. They thought it was pretty interesting, what you’d done. And the only other thing I’ll say is, in the last episode, Brandon Liu, who created Protomaps, which is an amazing vector tile file format and serving tool. I feel terrible, I don’t know exactly how to characterize how awesome Protomaps is as a project, but he’s like, look,
it’s very, very simpatico with what you’re saying. He’s like, you should also be able to put Protomaps data onto an SD card and walk into a forest, you know, and give it to somebody on a laptop and visualize it there. Now, you still need to run a browser. But everything you just said hints at these decisions that you have to think about when you’re trying to find that sweet spot, which is like, okay, we’re going to use a very widely adopted
platform or tool, DuckDB, because there’s a community there for it. And obviously we’re using object stores and browsers because they’re very distributed technologies that people have access to. These are the kinds of decisions and thinking that, well, whatever, I’m preaching to the choir here. Yeah.
Jack Cushman:
That’s exactly right. I think David Rosenthal, who founded LOCKSS, the way he likes to say this is: no one’s ever going to make hardware specifically for the archiving community. We are too small. So when you’re designing a system, you figure out what you can do with off-the-shelf parts that are designed for other communities. That was what led him, in the early 2000s, to say, we need to figure out how to make this work on commodity hard drives. Because we can’t be buying special custom media for ourselves; we’re
way too small for that to ever be as good. We need to figure out what’s the media that other people use, and use it. And I see that repeat in all kinds of ways, you know, communities and structures.
Jed Sundwall:
Oh man, yeah. Maybe I’m a gadfly, I don’t know. I don’t think anybody’s listening right now to this, but I’ve had conversations with big funders that want to do big stuff for climate, and they’re like, we need really gnarly hardware. And I’m like, do not do that. Please don’t go down this path. I mean, they’re talking about building their own data centers, and I’m just like, stop, please stop.
Jack Cushman:
Mm-hmm.
Jed Sundwall:
What you’re doing is very important, supremely important. I’m glad you want to put money towards this, but you should be focusing on the commodity layer. Anyway.
Jack Cushman:
That’s right. I feel like we build strong, robust community layers, and then we identify specific technical weak spots where a real technical breakthrough will make a difference. So most of the work is kind of building the community that’s going to pass things around. And then we recognize something like: if we can make this client-side, if we can make this cryptographically signed, we can have a breakthrough here. So let’s put some tech into that, but spend that very carefully.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. All right. Well, now let’s talk about cryptography.
Jack Cushman:
Absolutely. So here’s the philosophy. Every copy should be as good as the original. If I make an archive at Harvard and you grab a copy of it and put it on your desktop, your copy should be just as good as mine for posterity. And that’s because lots of copies keep stuff safe. Those copies all have to be valid. And we really, philosophically, we don’t want to be planning for any one institution to exist in perpetuity.
whether it’s the US government or Harvard, or if you shot it into space, it doesn’t matter. You shouldn’t assume that any one is still going to be there. And then it becomes really critical to focus on how to make copies, because the history that we’ve seen on the internet is that copies tend to disappear. If you try to maintain two copies of something, we’re going to have two copies of the census data, then pretty soon you’re like, well, one of these isn’t being used, the internet’s very reliable, so we’re all going to one of them. And the other one just kind of gets cut off eventually.
It gets deprioritized, defunded, disappears. So we have to make copies robust and easy. So it becomes a two-part strategy. When I ship something with the data.gov archive, it’s going to come with a viewer, and it comes with signatures, so that you don’t need me to be around to make sure that it’s real, to understand what its provenance is, in library terms. And we just talked about the viewer prong of that, which helps make sure that your copy is as good as mine. The signature prong is: when you get data from me, you should be able to tell
who says this is authentic? When did they say it? And what do they say is in it? It’s something that I love: Starling Lab compares this to an evidence bag in court. If you imagine, you know, let’s pick something nice, like a beach ball is found at a crime scene. Then it’ll be put in a bag. Most of the examples are not great, but it’ll…
Jed Sundwall:
Yeah, yeah, yeah, yeah, yeah, that’s true. Yeah.
Jack Cushman:
It’ll be put in a bag. The person who picks it up will sign it and say, I picked this up and put it in this bag on this date. And then when they hand it to someone, they’ll say, then I handed it to so-and-so. And that person’s like, yeah, I picked this up, it was handed to me by them, and I brought it to the evidence locker, and I put it in this locker and locked it. And then the person who takes it out to bring it to court, they’ll say, I took this out and I held it from here to court. So when you’re admitting that beach ball before a jury of your peers,
you can say, these are the people who would have to testify, in sequence, every one of them, to say hand to hand how it got from that crime scene to you touching it today. And most of the time those people don’t come testify, because most of the time that process is reliable, and the fact that we have that record means that we can rely on it. Sometimes it’s not. Sometimes we say, that one person who was working the evidence locker that day turned out to be really sketchy and put things in the wrong places; let’s figure out which ones they touched, and we can revisit that.
So that provenance chain becomes so vital. If you think about proving things in court, it’s really clear. But actually, in libraries, we care about that all the time. Anytime I say, here’s a list of companies: if you could say, well, this is a list of companies that Edward Snowden said were cooperating with the government, and I can prove it, then it’s a really important list. If it’s, this is a list that Jack found on Wikipedia, it’s a meaningless list. The provenance matters. So we need to be able to attach provenance to the things we pass around. And we need it to last longer than we do. Those are the design constraints.
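The evidence-bag idea has a simple computational core: each custodian appends an entry that commits to everything before it, so altering or reordering any link breaks all the links after it. Here is a toy hash-chained custody log in Python; real systems like bag-nabit and C2PA add public-key signatures and trusted timestamps on top of this, and the field names here are invented for illustration.

```python
# Toy custody chain: each entry embeds the hash of the previous entry,
# so tampering with any link invalidates the rest of the chain.
import hashlib
import json

def add_entry(chain, custodian, action):
    """Append a custody record that commits to the previous record's hash."""
    prev = chain[-1]["hash"] if chain else "genesis"
    entry = {"custodian": custodian, "action": action, "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    chain.append(entry)
    return chain

def verify(chain):
    """Recompute every hash and check each link points at its predecessor."""
    prev = "genesis"
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

chain = []
add_entry(chain, "officer", "collected beach ball at scene")
add_entry(chain, "clerk", "stored in evidence locker 12")
print(verify(chain))  # intact chain verifies
chain[0]["action"] = "collected frisbee"
print(verify(chain))  # tampering breaks verification
```

What the hash chain alone cannot tell you is who wrote each entry or when, which is exactly why the signature and timestamp discussion that follows matters.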
Jed Sundwall:
Yeah. Yeah.
Jack Cushman:
Cryptography is how we do that. And we attach a signature to it. The signature says, I, Jack, say this is what this is. And you can be convinced that it was Jack who said that. And we attach a timestamp to it that says all of this stuff you’re looking at existed as of this date. No later than this date this came into being. And if you put a signature on it and then you put a timestamp on it, then you can later say reliably, Jack swore in 2024 that this was real and this is what he said it was.
And that existed in 2024. It didn’t happen later. And that doesn’t mean it’s real. Like, it still could all be fake. I could have lied about it. I could have been lied to by the web. A bunch of things could happen. But let’s imagine we go out to 2028 and two people are arguing about water rights in Nebraska. And one of them says, look at this government record from 2010 that proves that these are my water rights. And it’s gone now. It’s no longer on the website. All it is is in the Harvard archive.
And the other one says, that’s a lie, you just made that up, that’s not a real document, it’s not on the federal web. Then what you get to argue about is: is it plausible that Jack in 2024 wrote down some lies about water rights that would mean that I win this thing in 2028? You greatly narrow down the ways that the lie could happen. And most of the time you’ll say, okay, no, that doesn’t make sense, Jack wouldn’t have known to do that, this must be real. So that’s what we need to do. We need to attach a signature. We need to attach a timestamp. Getting into the technical weeds: we were moving pretty quickly, and we wanted to
ship something, and we wanted it to be reliable for the long term at the same time. So the plan was to use very well-understood, basic, standard, off-the-shelf crypto. I think if you were designing this from scratch, you would use more modern algorithms, but what we reached for is OpenSSL and some standard ways to use OpenSSL to sign and timestamp things. And so we added a little extension to the BagIt format, which you can find in my tool, bag-nabit, which I got to name.
Jed Sundwall:
Hahaha
Jack Cushman:
It’ll put in a signature file that says: all of those hashes of the stuff that’s actually in this thing, I’m going to sign that file of hashes, and I’m going to say, Jack swears that this is real. And it actually chains back to control of our email address at the Library Innovation Lab. So: someone who was in control of this email address at this time signed this thing saying it was real. And then we timestamp it, just going out to a timestamp server, like DigiCert, and saying,
someone else out in the world, with no reason to lie, who timestamps a bunch of stuff, says it existed at this time. And that signature plus timestamp can give you a lot of confidence. I also built it so that it can support multiple chains. So you could say, Jack swore this was real and timestamped it, and then someone else swore it was real and timestamped it, and then someone else did. And you can start to make a collection of people for whom it’s implausible that they all made the thing up. So technically, it’s trying to make a really simple, hard-to-mess-up,
convincing proof that this thing is what it says it is. And if you poke around the bag-nabit source that you just linked, you can see how we made those choices. And the goal was to have a cryptographer not say, how brilliant, you did some really clever things here, but probably to say, you did what we thought was amazing 10 years ago and is now fine, and you didn’t do it wrong. Because that was really the goal: don’t have any big implementation mistakes in cryptography.
Which is kind of the level of cryptographer that I am. I think I understand the tools we’ve been given and how to use them. And I understand that almost all the time, what goes wrong is not some break in the cryptography itself but a screw-up along the way in how you use the tools. So I’m going to try to use them in a very straightforward, obvious way. And that’s what this tool offers: just the most obvious, straightforward way to use a very standard tool to verify where something came from and when it appeared. And…
I don’t know, I like to imagine sometime five years from now, 10 years from now, 50 years from now, people saying like, is this real or is this made up to suit our moment? And being able to say, yes, I can trace it back through Jack’s software and say, very implausible that this was made up because you would have had to do a bunch of things that didn’t happen.
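The flow Jack describes has three steps: hash the payload files into a BagIt-style manifest, sign that manifest, then get the signature timestamped by a third party. Only the hashing half runs below (standard library); the commented commands are generic OpenSSL invocations for CMS signing and RFC 3161 timestamping, shown for orientation rather than copied from bag-nabit, whose exact file layout may differ.

```python
# Step 1 of the sign-and-timestamp flow: a BagIt-style SHA-256 manifest,
# one "digest  path" line per payload file, sorted for reproducibility.
import hashlib

def manifest_sha256(files):
    """files: {path: bytes}. Returns BagIt-style manifest text."""
    lines = []
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        lines.append(f"{digest}  {path}")
    return "\n".join(lines) + "\n"

manifest = manifest_sha256({
    "data/records.csv": b"id,rights\n1,water\n",
    "data/readme.txt": b"archived 2024",
})
print(manifest)

# Steps 2 and 3 happen outside Python, roughly like this (illustrative
# OpenSSL shapes; cert.pem/key.pem and the TSA URL are placeholders):
#   openssl cms -sign -in manifest-sha256.txt -signer cert.pem \
#       -inkey key.pem -out signature.p7s -outform PEM
#   openssl ts -query -data signature.p7s -sha256 -out request.tsq
#   # send request.tsq to a timestamp authority (e.g. DigiCert's) and
#   # store the .tsr response alongside the bag
```

Because the signature covers the manifest, and the manifest covers every payload file, verifying one signature plus one timestamp vouches for the whole bag as of a known date.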
Jed Sundwall:
Yeah. Okay. Well, this is great. I mean, we’ve talked about this before, but I’ve never gotten this full spiel from you. This also has to be built into Source. I’ve said this for a long time; it’s an aspiration for Source. I’m glad you’re excited. I want people to be able to use Source to win court cases. Because people are like, we’ll have open data for impact, and I’m like, well, how does that impact actually happen? Because there’s
always this thing, what I call the data delusion. I got this from Jessica Seddon. She’s on our board, a great co-conspirator forever. She talks about imaginary decision makers, which is sort of: in our circles, especially those of us who work in environmental data, we’re like, well, once we have the data, then the people who are in power will know what to do, and then they’ll do the right thing. And we’re like, no, that won’t happen.
I mean, maybe sometimes that happens, but it’s pretty rare. The way that you get people to change their behaviors, I mean, one good way to do it is by suing them. And winning. And so I’m like, all right, well, then what do we need to do to make data actually suitable to be presented as evidence? And you just told the story perfectly. The funny thing, the beach ball thing, is hilarious, because a beach ball is so benign, and then
Jack Cushman:
Yeah, you’re trying to offer a theory of change.
Jed Sundwall:
And then I’m like, how would you commit a crime with a beach ball? You know, then I’m just…
Jack Cushman:
Let’s not. I worked as a lawyer for a while, and I worked some upsetting criminal law cases. My favorite, though, were torts. So I think if you’re looking for the fun, study torts: it’s the law of how you can get paid back after someone accidentally or intentionally hurts you. How do you just go to court and say, well, something bad happened, you should pay me until we’re even? How do you prove what even would be? And when you read a torts book, every case starts with what feels like the start of a horror movie.
Jed Sundwall:
Mmm.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
It’s like, you know, two brothers were riding a train. The train had no doors. The train was on a high bridge. The train went around a corner, and you’re like, no! And they go on. But I like it when they’re hundred-year-old cases and you can have some distance from them.
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Well, actually, you just mentioned another thing: water rights. And I love, you know, again, we work with a lot of environmental data, so water rights always come up. And I often like to refer to the Cippus Perusinus. This is text that has been preserved on a stone tablet, or I don’t know what you would call this thing, from
the second or third century BC or something like that. And of course it’s about water rights. It’s basically like, this is our water. So anyway, talk about archival evidence. Okay. Yeah. You go ahead.
Jack Cushman:
Totally.
Jack Cushman:
Yeah, just to plus-one the thing you said: you were saying, shouldn’t Source be signing things? And I think that figuring out the technical details of that is such an important thing to sort out. And we’ll have a lot of fun, little nitty-gritty design choices to it. But it goes back to this core thing: whenever we pass something from one hand to another, we should write down what was passed. Because it tells you, here’s the chain of people who would have to explain what this is for you to make sense of it. And that can be in court, it can be in research, it can be just,
what is this object and where did it come from? But you have such a wonderful leverage point, because you’re collecting a bunch of stuff, and if you standardize, here’s how we get this into a provenance chain now, then from here on it’s going to have a clear record of where it came from. It’s just a wonderful way to be a witness to what has happened, and to start to make it possible for the community to know things more specifically and reliably.
Jed Sundwall:
Yeah. No, I think we’re in a good position to do this. It’s the kind of thing that, when I was building the open data program at AWS, we would not have been able to do. I think it would be basically impossible to get Amazon to say, yeah, we’ll validate all this sort of stuff. For good reason, I think Amazon’s lawyers would be like, that’s not a role that we’re going to play. And then of course my opinion also is quite strong that we should have
differently governed entities to do that kind of thing. You don’t want an investor-owned entity to do that, because it’s just not core to the business.
Jack Cushman:
You know, for people who want to get involved in that, right now I think the C2PA coalition is really where that action is. And I was just noticing Amazon is one of the members of that. I think Adobe is really the driver of it. Their vision is: if you take a picture with a camera, pass it to an editor, pass it to a newspaper, at every step of the way, as a photo is handed from one place to another, including through Photoshop, you should get a reliable record of what that person did to it. Which is a perfect example of how we use provenance chains.
Jed Sundwall:
Okay.
Jed Sundwall:
Yeah.
Jack Cushman:
They’re making a standard that is right on the cusp of being useful for everyone else too. It’s working with images as its motivating use case, and you can see some parts of it that are really shaped by that. But then you can also see it overlapping almost completely with a general standard for having a provenance chain that gets passed around with a piece of data, where whenever someone touches it, they add on what they did to it, and then they pass it on. And I think if we can get there, it’ll just unlock, like,
a correct answer for how we’re all supposed to be doing this thing. We’ve made our own standard for how to attach provenance to web pages, the WACZ-auth standard. We have our own way that we did it with BagIt here. But if we can get this thing to be a generally applicable, here’s-the-right-way-you-pass-things-around, it’ll be so powerful. And I throw that out here because, as you said, incentives can be weird for large corporations, and if one is driving it especially, it can end up kind of
Jed Sundwall:
Mm-hmm.
Jack Cushman:
overly shaped in the ways that they can see it helping, and under-theorized in others. This is just such a good time for people to pile in and help it be useful to everyone. I think OpenAI and the AI platforms have gotten interested in this as a way to say: if you want to prove where this came from, if you’re not trying to hide that it came from AI but trying to document it, here’s how you would document it using this. And that’s a good sign because it’s such a different use case, but I’d love to see more of that in there.
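The chain Jack describes, where each handler appends a record of what they did and passes the asset on, can be sketched in a few lines. To be clear, this is a toy illustration of the idea only, not the real C2PA manifest format (which uses signed binary structures), and every name in it is made up.

```python
import hashlib
import json

# A toy, hash-linked provenance chain: every actor who touches an
# asset appends a record of what they did, linked to the hash of
# the previous record. Tampering with any earlier record breaks
# every later link.

def record_hash(record: dict) -> str:
    """Stable hash of a provenance record."""
    return hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()

def append_action(chain: list, actor: str, action: str) -> list:
    """Add an action to the chain, linked to the previous entry."""
    prev = record_hash(chain[-1]) if chain else None
    chain.append({"actor": actor, "action": action, "prev": prev})
    return chain

def verify(chain: list) -> bool:
    """Check that every link points at the hash of its predecessor."""
    for i in range(1, len(chain)):
        if chain[i]["prev"] != record_hash(chain[i - 1]):
            return False
    return True

chain = []
append_action(chain, "camera", "captured image")
append_action(chain, "photoshop", "cropped and color-corrected")
append_action(chain, "newspaper", "published")

assert verify(chain)
chain[1]["action"] = "tampered"  # any edit breaks the downstream links
assert not verify(chain)
```

Real provenance standards add signatures on top of the hash links, so you also know who wrote each record, not just that the records are consistent.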
Jed Sundwall:
Yeah, yeah. Okay, well then we will. Just making plans for 2026. All right, so there’s two other things I wanted to touch on. We still have time; again, as you said before we started streaming, I think if we really wanna make it to the top of Spotify, these things need to be three to four hours long, but we’re not there yet. You mentioned once this idea that sort of the internet has created this kind of like…
Jack Cushman:
Totally.
Jed Sundwall:
I would call this just sort of a distortion, or it creates this illusion that data is safe when it’s not. It kind of directs everybody into having just one copy somewhere. Can you expound on that? Or did I represent that right?
Jack Cushman:
Yeah, absolutely. That’s exactly right. We’re calling it the one copy problem. And the summary of the one copy problem is that all of the data that we rely on is very fragile. That’s the urgent thing. But the why it’s fragile gets really interesting. And it really comes from the economics of having the internet be very reliable, counter-intuitively. When you’re studying the internet, there’s this network diagram that gets passed around a lot. There are layers. There’s an hourglass where you have like
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
IP and TCP and the DNS system and browsers and applications as a bunch of layers that each take care of their business and let the layers above and below them take care of theirs, so the whole system works. So if you picture our data preservation system as layers, a layer that works incredibly well is the ability to reach out and contact a website. Cloudflare was down yesterday. Everyone is talking about it, and it makes headlines that there are some websites you can’t reach within a second right now.
But we’re used to, almost all the time, almost all websites anywhere in the world, you can get in under a second. It’s incredibly robust. If you looked at it in terms of how often is it online and how reliable is it, the system is designed very well and it works very well and it gets you things immediately with no complaints. There are exceptions to that, but it’s a reasonable way to think about what the internet is and how it works. And that reliability ends up creating fragility in other places in the stack.
Because when you have two versions of something, it’s equally easy for the entire world to all go to one. There’s no kind of incentive to be like, well, this one’s down sometimes, this one’s down other times, this one’s closer to me, this one’s farther away. No. If you have, like, CDC data one and CDC data two, a crowd is going to kind of pick one. And then that one is going to gain momentum, and they’ll tell each other about it, and pretty soon 100% of people will be going to CDC data one. No one’s going to CDC data two.
And after a year or two, someone’s going to say, why are we still paying for this thing that no one uses? And it’s going to come off the budget. And that’s true for governments. It’s true for nonprofits and public interest preservation. It’s true for corporations and redundancy: are we storing our archive, as the New York Times or Amazon or whoever. Because of the reliability of the networking, we do this economic process of
putting 100% of our reliance on one copy, 0% on the other, and then deleting copy two. And it means that our memory becomes really fragile. There was a story just a few weeks ago of a fire in South Korea that destroyed 800,000 federal workers’ data. And you’re kind of like, oh, what idiots. If I was the sysadmin, I would never have forgotten to do whatever. But no, actually, all of our data is that fragile, where a systemic shock like a fire really could destroy it.
Jack Cushman:
Some is very well backed up, but most of it is subject to one or more correlated failure modes. So you’re not necessarily picturing, they only had one hard drive, they should have had two hard drives. You have to picture: they only had it behind one administrator password, and if someone stole that, it could be deleted, and it should have been behind multiple. Or they only had it in one geographic region. It was all in Amazon’s data centers in Virginia, or it was all in California, and when there was a large-scale disaster, it got lost. Or they only had it on one brand of hard drive. And when that
Jed Sundwall:
Yeah.
Jack Cushman:
brand failed, it failed. Or it was all paid for by one source, and when that source changed its priorities or changed its policy, it got deleted. There’s a paper from the early 2000s from LOCKSS that lists their threat model. And they list, I think, about 14 of these kinds of correlated failures. Only one government; whatever you can think of that is a failure. You could even go to, well, it’s only on one planet, and start to think about how to fix that. But for now, even on one planet, there’s a lot of correlated failure.
Jed Sundwall:
Right.
Jed Sundwall:
Right.
Jed Sundwall:
Yep, right.
Jack Cushman:
And so the problem becomes: how do you beat economics? How do you beat market incentives to have only one copy that is subject to correlated failures, for stuff that matters mostly to posterity? We have a public data project, and I’ve thought a lot about what that means, public data. And really the way that I think about it is: public data is data that is mostly valuable to people outside of the data custodians. Like, if you’re a company and you collect, you know,
Jed Sundwall:
Yeah.
Jed Sundwall:
Mm-hmm.
Jed Sundwall:
Interesting.
Jack Cushman:
internet visitor statistics so that you can model traffic and make ads better, that’s private data. You’re collecting it, you’re using it, you’re paying for it. If you delete it, you’ll be the one who’s sad about it. If you’re a government and you’re collecting, you know, what have our tariffs been over time, what has our school crowding been over time, you’re doing that primarily for the benefit of people besides you, the person making the spreadsheet, or even your department. You’re doing it for the world to be able to navigate properly. And so there’s a kind of incentive misalignment:
the people who most value it are not in the room, or able to advocate for themselves necessarily. And if you start thinking about what are all the kinds of data where people besides the ones holding the checkbook might care, it’s certainly things like government data sets, but it’s also things like the New York Times archive, all of the archives of news that are behind paywalls. Even, I don’t know, YouTube. YouTube in many ways is the most important record of a bunch of things that have happened in the last 10 or 15 years. And, like,
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah.
Jack Cushman:
there’s Google’s interest in preserving that, and then there’s society’s interest in preserving it, and that’s very hard to theorize. So public data becomes this kind of misalignment problem: we need to invest in something where the people who care are not here to advocate for it. And that’s what I think of as the one copy problem. Where do you intervene in the economics of this thing so that we can start to have durable memory of the stuff we most care about?
Jed Sundwall:
Right.
Jed Sundwall:
That’s fascinating. I mean, you know, we’ve talked about this: I’m very interested in raising an endowment. That’s going to be a huge area of focus for us, because, going back to this discussion of focusing on the commodity layer, the very good thing about the tech sector we have right now is that there is competition in it and there’s plenty of downward pressure on pricing. And I think we can forecast costs well enough to endow the long-term preservation of data.
And what that could open up is you could say: look, we’ve endowed this data set, or, I should use my own terminology, we’ve endowed this data product to be available via these URLs for 50 years. Would you like to endow a copy of it? And we are at the point where, if it’s a terabyte of data, that is thousands of dollars. I mean, don’t get me wrong, it’s a real thing, but it’s a one-time check that
Jack Cushman:
Yes.
Jed Sundwall:
a philanthropist can write, you know, it’s not, yeah.
Jack Cushman:
I think it’s such an important provocation or design goal. Why can’t you endow a terabyte? If you’re like, this terabyte should exist for the next 50 or 100 years for humanity, why can’t any of us make that choice and say, yes, I’m going to invest in making that possible? I don’t know what apparatus you would use for that now. Actually, if you’re a Harvard professor, I do know: I would tell you to use the DRS, the Digital Repository Service, that was founded about 20 years ago. It’s going through a whole reinvention right now.
I think some big institutions have learned how to think about this for themselves. But how do we make it something that is available not just at Harvard but across the world? If you have something you care about, how do you endow it? I love that question. I think it’s such a good approach, to start to realign those incentives, to say that someone now, today, can make an investment in something to pass it to posterity. And then the other thing I love about it is it makes you start to think about:
Jed Sundwall:
Yeah.
Jack Cushman:
What does it mean to last for 50 years? What steps should you take with that money, the money that you’re handed when you endow a terabyte? And how do you defend against all of those correlated failure modes that LOCKSS laid out? I think the gnarly thing, the tricky thing at the end of that thought process, is you probably actually need multiple mutually independent institutions to be involved. Because
Jed Sundwall:
That’s right.
Jack Cushman:
you, Jed, become a single point of failure: well, if I can buy you, I can end this thing. And that can’t be how it works either. So there’s a bunch of strategies, but how do we make it so that there is no one of us who can disappear and have the thing disappear?
Jed Sundwall:
That’s right.
Jed Sundwall:
That’s right. Yeah. Oh man. Okay. Well, one last point I want to bring up: let’s talk about the Smithsonian really quickly, because again, it’s very relevant to everything we just said. What are your plans with the Smithsonian? I mean, what are our plans with the Smithsonian, you can say.
Jack Cushman:
Absolutely. So the Smithsonian is our second major data collection after data.gov. And this is something that came up in the data preservation community: whether the Smithsonian’s public, out-of-copyright data set as a whole could be preserved, which is over 700 terabytes stored on Amazon.
Jed Sundwall:
Okay.
Jack Cushman:
And then over 700 terabytes becomes enough that most projects are kind of, we can’t take that on, that’s too big a goal for us. And our public data project felt like we were able to do that, able to make a first collection. And then we talked with you, and very fortunately you felt like you were able to take it on with us and move it to Source. So we start with this kind of giant blob of 700 terabytes that is really quite an undertaking for
our kind of community. It might not be a huge undertaking if we were Google, but for who we are, it’s a big thing. And now we have it. And what we have right now is just a straight copy: let’s get a copy from here, move it to here. I think the first thing we’ll do is sign it, just like we talked about with the other thing. Just say: I attest that this is the copy I made, and I made it on this date. And from now on, you won’t need me to be around to know that this is exactly what the Smithsonian had. But beyond that, we have to start thinking about access.
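As a sketch of what that signing step could mean in practice: compute a fixity manifest of per-file hashes, which anyone can later re-check against the bytes. The layout and file names below are invented for illustration; a real pipeline would likely use BagIt manifests or a detached cryptographic signature over a manifest like this.

```python
import hashlib
import pathlib
import tempfile

# A minimal fixity manifest: map each file's relative path to its
# SHA-256 digest, so that anyone holding the manifest can verify
# that a copy is bit-for-bit what was originally captured, without
# trusting the copier to still be around.

def fixity_manifest(root: pathlib.Path) -> dict:
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest

def verify_copy(root: pathlib.Path, manifest: dict) -> bool:
    """True only if the directory matches the manifest exactly."""
    return fixity_manifest(root) == manifest

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "object.txt").write_text("smithsonian object record")
    manifest = fixity_manifest(root)
    assert verify_copy(root, manifest)
    (root / "object.txt").write_text("silently altered")
    assert not verify_copy(root, manifest)
```

Signing the manifest file itself (with any standard signature scheme) then binds the whole snapshot to a date and an identity.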
And how can people actually benefit from using that thing? One of the things I’m really excited about is whether we can make a kind of access copy that is much smaller, that you could just have for yourself. It’s very common with these kinds of preservation data sets that you have a preservation version that is, say, uncompressed full-color images, which can be very large. And that’s one of the sources of your 700 terabytes.
But if you accepted a small amount of compression, even visually indistinguishable compression, you could get down to 10% of the size. So I think exploring that: is there an access copy that is more like 70 terabytes instead of 700? And you could just have it on your desk. 70 terabytes is still a lot, but you could get an enclosure that you could just plug into your laptop and say, the Smithsonian collection is here on my laptop to talk to. So I love that aspect of it. And then the other piece is we have to figure out discovery. What do you do when you just have
a collection that size that lands in front of you and you don’t understand what’s in it? There’s one approach that is like, when you click a file, you should be able to try before you buy and see what’s in there. But the other approach is: at a millions-of-files level, how do you get a view of what’s in here in general? What am I going to find if I start sifting through this? It’s what people call exploratory data analysis, but I think we have to democratize that and not have it sound like something that only data scientists do.
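One concrete version of that "what’s even in here?" question, sketched under the assumption that the collection lives on a filesystem: before anyone opens individual files, summarize the whole tree by file type and size so its shape is visible at a glance. The paths below are invented for the demo.

```python
import collections
import pathlib
import tempfile

# Profile a collection: count files and total bytes per extension.
# counts answers "what kinds of things are in here"; sizes answers
# "where do the terabytes actually live".

def profile_collection(root: pathlib.Path):
    counts: collections.Counter = collections.Counter()
    sizes: collections.Counter = collections.Counter()
    for path in root.rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return counts, sizes

# Tiny demo on a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "scan.tif").write_bytes(b"x" * 1000)
    (root / "record.xml").write_bytes(b"y" * 10)
    counts, sizes = profile_collection(root)
    assert counts[".tif"] == 1 and sizes[".tif"] == 1000
```

Real exploratory analysis would also sample file contents, but even this extension-level view tells you whether you are looking at mostly images, mostly metadata, or a mess of zip files.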
Jack Cushman:
Or law firms do it too: here’s the hard drives of your client or the opposing client, just figure out what’s on the hard drive. That’s called forensic analysis. And I think both forensic analysis and exploratory data analysis, we have to move past those to: what can I click to understand what I’m looking at? How can we make this something that everyone can get their hands on?
Jed Sundwall:
Yeah.
Jed Sundwall:
Yeah. Well, that’s crazy, because you just teed up next month’s episode of the live stream webinar podcast thing, which will be with Matt Hanson, to talk about the SpatioTemporal Asset Catalog, STAC. This is a metadata spec that has been very rapidly adopted within the geospatial world, and it solves that collection-level problem that you described, which is basically: I have a collection of spatio-temporal assets. The
most common example you would think of is a collection of satellite imagery or drone imagery or something like that. What it is is you give people a JSON file at the root that says: here be spatio-temporal assets, collected between these times and covering this spatial extent. So immediately you can kind of tell, is this a timeframe or an area of the planet that I’m interested in or not? And you can move on, right? And those can be indexed, so you can search them.
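A minimal illustration of the root JSON Jed describes. The field names loosely follow the STAC Collection spec (`extent`, `bbox`, `interval`), but this is a sketch, not a validated STAC document, and the id, bounding box, and dates are made up.

```python
# An illustrative STAC-style collection: one small document that
# declares the spatial and temporal extent of everything inside,
# so a reader or an index can decide at a glance whether to look
# further.

collection = {
    "type": "Collection",
    "id": "example-imagery",
    "description": "Illustrative collection of satellite imagery",
    "extent": {
        "spatial": {"bbox": [[-125.0, 24.0, -66.0, 50.0]]},
        "temporal": {
            "interval": [["2015-01-01T00:00:00Z", "2025-01-01T00:00:00Z"]]
        },
    },
    "links": [],
}

def overlaps_time(collection: dict, start: str, end: str) -> bool:
    """Cheap pre-filter: does the collection's interval overlap [start, end]?
    ISO 8601 UTC strings in the same format compare correctly as text."""
    c_start, c_end = collection["extent"]["temporal"]["interval"][0]
    return c_start <= end and start <= c_end

# Is mid-2020 inside this collection's coverage? Yes.
assert overlaps_time(collection, "2020-06-01T00:00:00Z", "2020-07-01T00:00:00Z")
```

The point of the design is exactly the "do I care or not" triage Jed mentions: one tiny document answers the coarse question before anyone touches the assets themselves.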
Jack Cushman:
Yeah, that makes perfect sense.
Jed Sundwall:
That notion of figuring out the way to distill a collection into something at that high level, so that at least you’ve standardized it. Here’s a bag, we can use any kind of metaphorical bag of a collection. What are you going to say? This is the universe it contains. Do you care or not? And move on. So this is a perennial issue. Yeah.
Jack Cushman:
Yeah, if I could connect it, trying to wrap it up a bit: I think geodata is out ahead here, because geodata has always had this problem. You go to Google Maps, and you can zoom out until you see the whole world. And then you can zoom in until you see just one block. And structuring the data to allow that, to be able to jump in and out and see the right level of detail when it’s all the same data set,
Jed Sundwall:
Yeah.
Jack Cushman:
has meant that geodata has to be very thoughtful about how the data is stored and indexed so that it’s discoverable, efficiently, by the software that needs it, which is just what we were talking about with how we index our data.gov viewer so that it can be fetched efficiently. We need to start thinking that way, about that very clever structuring of data, across the board for making things available. We want to enable for everyone that Google Maps experience.
Jed Sundwall:
Right.
Jack Cushman:
The experience where, if you want to, you can zoom out and see the world of the 700 terabytes, and if you want to, you can zoom in and see the block. And you should be able to do both of those, and you should be able to do them very cleanly, which for that community is completely obvious and has been true the whole time; there’s wonderful technology for it. How do we take that technology and make it work for any data set? That, I think, is a great challenge. I’ll also say, I’m always kind of looking for where the bigger industry is headed. And I think AI is kind of a huge industry that blows us in a direction.
One thing that we’re going to find as data people is that indexing is critical to AI research and AI practice. From a library perspective, using an information tool, there’s a question of, is the model smart enough? There’s a question of, does the ground truth even exist, and is it possible to fetch it? But in between those two is: do you have an index that can get the correct answer, instead of the wrong answer, into your model’s context when you need it? And if you can do that, if you have those indexes, then you can make
Jed Sundwall:
Right.
Jack Cushman:
data tools that actually empower individuals, which is what we think about at the library. And if your indexes are bad, then you’re going to get the wrong answer in context, and it’s going to hallucinate or tell you the wrong thing, and it’s going to disempower people. It’s going to hurt them. Which means we have this weird position, which I think is a surprise for me as a library person and maybe a surprise to other folks, that all of a sudden, indexing is cool. How you index your data is going to really matter. And I think it’s such an opportunity for us, because we’ve been thinking about indexing forever. And now that it’s cool, let’s figure out what we know about it that we can share.
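Jack’s point, that retrieval quality decides what lands in a model’s context, can be made concrete with the oldest tool in the library toolbox, an inverted index. This is a toy sketch with invented documents; real retrieval systems layer ranking (BM25, embeddings) on top of a structure like this.

```python
import collections
import re

# A toy inverted index: map each term to the set of documents that
# contain it, so a query can find candidate documents without
# scanning every document's text.

def tokenize(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs: dict) -> dict:
    index = collections.defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Return ids of docs containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for t in terms[1:]:
        result &= index.get(t, set())
    return result

docs = {
    "a": "storm events database with tornado records",
    "b": "smithsonian open access images",
    "c": "tornado and hail records from NOAA",
}
index = build_index(docs)
assert search(index, "tornado records") == {"a", "c"}
```

Whether the "correct answer" reaches the model’s context depends entirely on this layer: a missing or wrong entry here, and downstream the model answers from the wrong document.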
Jed Sundwall:
Yeah, your day has finally come. So as we wrap up, I want to say: your mention of how the geo community is ahead here is, I’m sure, flattering to those of our community who are listening in. We did get one comment on LinkedIn from Linda Stevens, who we’ve worked with in the past, and she’s worked in the geospatial space for a really long time. She made the comment that you have to certify a map at different layers. You have to track and certify all the layers that make it up.
It underscores the point that you made, which is that maps are these confections of data that we’ve been figuring out how to create. It’s such a rich field. I mean, cartography is just amazing, because we’ve been trying to figure out how to downscale so many things we know about the world into something that’s legible for humans, and then assert that in a way that’s credible. It’s a huge challenge.
Yeah, I would say my theory for why the geo community is out ahead is that most of us gave up on getting super rich a long time ago, as opposed to, like, the life sciences community, where I think there’s real gold in those hills. People think they’re going to cure cancer and make a ton of money, which is great; I want them to try to do that. But the geo community is just generally much more open, and I think it just has such a long history of sharing information. I mean, it’s
Jack Cushman:
Mm-hmm.
Jed Sundwall:
core to what we do.
Jack Cushman:
Maybe try checking your maps for any hills that have gold in them. It’s probably worth a shot.
Jed Sundwall:
We already found those, you know, that’s the point. Yeah, I mean, don’t get me wrong, in recent years it’s been lithium; there’s always going to be something else. There’s money in understanding spatial data, for sure, but the mad rushes are over, and there’s a huge community that’s just, I think, very generous. And so, yeah.
Jack Cushman:
We found those already.
Jack Cushman:
You know, I love Linda’s point, too, that you do have to certify at every level. I’ve seen some of the work that goes into designing a product like Google or Apple Maps, where things have to appear or disappear as you go in and out. It has to be the right things; it has to be the things I care about at each level. And sometimes it’s better, sometimes it’s worse, as they iterate on what they should show you. And it’s such a wonderful little example, or crucible, for how we do data in general, because you have a bunch of ground truth. People went out and wrote things down.
Jed Sundwall:
Yeah.
Jack Cushman:
It was maybe accurate at the time I saw it; it’s maybe not. You’re integrating a bunch of different views of the world. There’s a bunch of research going into just how you tell whether two data points are one store or two stores, all of that kind of integrating views of the world into one. And then once you’ve integrated into one view of the world, there’s: how do I express this to you so it’s not a lie? I could show you a map of your neighborhood where I’m showing you the gas pipes and you’re just confused. I could show you one where I’m showing you the benches and the things that you care about.
And am I meeting you where you are, so that what I’m showing you empowers you instead of disempowering you? And am I doing that without oversimplifying it so much that in fact I’m lying to you and disempowering you that way? So it’s this perfect combination of seeing the world and getting ground truth, integrating it and deciding which things you’re going to believe and which you’re not, and then debating, well, how are we going to show this to people so that we’re empowering them? What do we share with them? How do we lead them?
How do we let them get more expertise when they want it? I just love all of the parts of that design problem. And then it’s kind of like, now welcome to all the rest of it. What if it was a pile of zip files and some PDFs and some instructions, like the mess of the world? And I’m not saying that we haven’t thought about this; it’s something that the data community has thought about for ages. How do you make those wonderful interfaces so that people can find the stuff they need outside of Maps too? I think there’s so much more room for us to improve on that, and that’ll be really exciting work to do.
Jed Sundwall:
Yeah, well, let’s do it. I mean, I think we’re very aligned, and we want to create the conditions to let lots of people run those experiments and make that possible. So yeah, let’s go. Well, thanks so much, Jack. I think this has been awesome. An hour and 20 minutes, not bad. Yeah.
Jack Cushman:
Thank you, Jed. I really appreciate it. Thanks for giving us a chance to talk about this stuff, and thanks to folks for listening. I think we’d love to keep debating more: what are we meant to do, what are we meant to save, how do we save it, and how do we pass on to humanity what we should? I just really appreciate the chance to talk about it with you.
Jed Sundwall:
Okay. Well, we’ll keep talking. Thanks, Jack. All right, so we’re going to stop, and then…
Jack Cushman:
All right. Take care.