Episode 3: Inside Harvard's data.gov Archive
Video also available on LinkedIn
Show notes
Jed talks with Jack Cushman, director of the Harvard Law School Library Innovation Lab, about how libraries are adapting to technological change while preserving their mission to collect, preserve, and share knowledge. From the printing press to the internet to artificial intelligence, libraries have continuously evolved their methods. The Lab focuses on bridging traditional library principles with cutting-edge technology to empower individuals with better access to information.
The conversation explores the Data.gov Archive project, which aims to preserve approximately 17 terabytes of federal datasets - not just the metadata from Data.gov, but the actual underlying datasets that are at risk of being lost. Jack explains the challenges of collecting these datasets, particularly the limitations of web crawling technology that often fails to retrieve underlying data. The team successfully collected more than 311,000 datasets, with particular attention to smaller datasets that might otherwise disappear, demonstrating their commitment to knowledge stability in an era where governmental data can be fragile.
Jack discusses how they use BagIt - a Library of Congress standard for packaging digital content - to ensure long-term preservation through comprehensive metadata, checksums for verification, and cryptographic signatures for authenticity. This approach addresses data provenance and integrity, creating complete packages that can be cited and verified decades from now. The discussion also covers their innovative client-side viewer that runs entirely in the browser without server-side software, making 17.9 TB of datasets searchable while reducing infrastructure dependencies. They explore the importance of user-centric design, the role of well-supported tools like DuckDB, the “one copy problem” that highlights data fragility in the digital age, and collaboration with institutions like the Smithsonian. The episode also touches on Perma.cc, another Lab project that addresses link rot in legal documents by creating permanent links to online resources.
Links and Resources
- Harvard Library Innovation Lab
- Harvard Library Innovation Lab Data.gov Archive - the archive on Source Cooperative
- Data.gov Archive Search Viewer - search and explore the archive in your browser
- Data.gov Archive Search Blog Post
- Data.gov Archive Search on Hacker News
- Great Data Products (blog post edition) - Jed’s call for the data community to pursue greatness
- BagIt Specification - Library of Congress standard for packaging digital content
- bag-nabit - a tool for downloading and attaching provenance information to public datasets
- LOCKSS (Lots Of Copies Keep Stuff Safe) - Distributed digital preservation architecture
- C2PA (Coalition for Content Provenance and Authenticity) - Standards for content authenticity and provenance
- Perma.cc - A tool for creating permanent records of web pages
- The Analytical Language of John Wilkins by Jorge Luis Borges
- Reality Has a Surprising Amount of Detail - John Salvatier on the complexity hidden in seemingly simple tasks
- Cippus Perusinus - Ancient Etruscan inscription showing that humans have been concerned about water rights for a very long time and that preservation of written language is foundational to the law
Key takeaways
- Libraries evolve while preserving their mission - From the printing press to AI, libraries continuously adapt their methods for collecting and sharing knowledge while staying true to their core purpose of preserving information for future generations.
- Small datasets matter as much as big ones - The Data.gov Archive project prioritizes preserving smaller governmental datasets that might otherwise disappear, recognizing that knowledge stability depends on capturing everything, not just the high-profile datasets.
- Web crawling alone isn’t enough - Traditional web crawling technology often fails to retrieve the actual data files linked from catalog pages, requiring more sophisticated approaches to truly preserve datasets rather than just their metadata.
- Client-side viewers reduce infrastructure dependencies - Running search and visualization entirely in the browser without server-side software makes 17.9 TB of datasets accessible while eliminating the fragility and cost of maintaining server infrastructure.
- The one copy problem threatens data persistence - Data in the digital age is more fragile than physical artifacts; without robust systems and collaboration across institutions, valuable datasets can disappear when a single server or organization goes away.
- BagIt enables verifiable long-term preservation - Using Library of Congress standards for packaging data with checksums, metadata, and cryptographic signatures creates complete packages that can be cited, verified, and trusted decades from now.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall (01:10.992) Okay. All right. Well, thanks, Jack. Thanks for joining us here for the third episode. Yeah, lucky number three. And I want to point out, this is kind of an exciting moment because historically, Radiant Earth has really dabbled in geospatial data. Like that’s our wheelhouse. That’s our
Jack Cushman (01:18.222) Good. Thank you so much for having me. I really appreciate it.
Jed Sundwall (01:40.452) Our origin story of Radiant Earth was an effort to make satellite and drone imagery easier to work with. And one of the things that I did about three years ago when I came in as executive director was realize that a lot of the work that we had figured out with the geospatial community was really broadly useful in terms of adopting object storage and things like that. so we were, anyway, this is all to say, I’m excited to have you on because you’re not a geospatial person.
You know, our first two guests have been geospatial. Yeah. Okay. Good. And anyway, so this is going to be a great conversation to learn a little bit more about how we’ve been working together on Source Cooperative, and your background as a librarian and your perspective on these things. Before we get into it though, I do want to point out to everybody, and I’ll figure out how to put this in the chat, that you are currently tuned into
Jack Cushman (02:10.4) Absolutely, would never pretend.
Jed Sundwall (02:39.408) Great Data Products, the live stream webinar podcast thing, as we call it. There’s also Great Data Products, the blog post, now. So I gave a talk about a month ago at the Chan Zuckerberg Initiative’s open science meeting, and the name of the talk was Great Data Products. And then we published a blog post called Great Data Products. So this is an exercise in brand confusion. Perhaps this podcast could sue Radiant Earth for taking the title
for the blog post, but it’s a little bit confusing. But in any event, the name of the game these days is Great Data Products, and we’ve got a great blog post about it. I’m very happy with it. I’ll put that in the chat. So in case people haven’t seen that, you should see it. And then, yeah, with that, let’s turn it over to you. How do you introduce yourself to people?
Jack Cushman (03:27.962) Hi, everyone. I’m Jack Cushman. I direct the Library Innovation Lab. I’m really happy to be on the livestream webinar podcast thing. I love working with you, Jed, on Source Co-op. The lab I direct, the Library Innovation Lab, is a research and development lab, a software lab that’s built into one of the world’s largest law libraries. So we’re doing novel things in a very traditional place and drawing on the best of both of those worlds.
Personally, I’m a lawyer. I’ve worked as an appellate lawyer. And I’m a computer programmer. I’ve been programming computers since I was 12 years old, so very many years. I’m more of a newcomer to libraries, but I’ve been here for about 10 years. You asked how do you introduce yourself, which is always a challenge for me on the tax form. What are you supposed to write in for what your job is? And I’ve come to say information scientist, that really I’m a person who thinks about how do we consume information? How do we turn it into knowledge?
And how do we help our society over time have better and better access to knowledge? And that’s why the Library Innovation Lab has become such a great fit. Because our mission is to bring library principles to technological frontiers, which means to understand where people are actually getting their knowledge. How is that really happening, which often is outside of the walls of a library? And how can we take the things that we’ve learned in libraries over many centuries and help new technologies to go better? So really core things like libraries are here to…
collect information, preserve it, and share it to empower people. And we’ve been doing that since before the printing press. But when you invent the printing press, you have to change how you collect and share information. Now you need like a written list of the books you have, because there’s enough that you can’t remember them all. When we invented databases, we needed new ways of thinking about libraries. When you invent the internet and data that is digital first, governments publishing data that is only online and never on paper, you need new ways again to think about information.
Jed Sundwall (05:10.909) Right.
Jack Cushman (05:25.932) And now in this AI era, we need yet again new ways to think about what it means to collect and preserve and share knowledge.
Jed Sundwall (05:32.424) Amazing. So this is interesting. I didn’t realize you were a lawyer. I mean, I guess it makes sense. You’re at the law school.
Jack Cushman (05:39.086) Clearly I hide it. I’m a recovering lawyer. You know, I have not practiced law since probably 2014, 2015. And happy to leave that to the experts.
Jed Sundwall (05:45.639) Okay.
Yeah, it’s interesting. I mean, we have a kinship here because I studied foreign policy and thought I was going to be a diplomat or something like that. And I was never a, I would never call myself a programmer, but I was making websites in like 1994, like on Mosaic, you know. I was very enamored with the web from the very beginning. And that was just always kind of like a hobby
for me. But anyway, so I think we’ve ended up in similar places, interested in sharing data and stuff like that. So it’s cool to hear your story here. So, can you say a little bit more about the Library Innovation Lab and what you all think about these days? Cause everything you just hinted at was great, you know, pointing out that we had libraries before books. What are you thinking about in 2025, you know, as we go into 2026?
Jack Cushman (06:43.118) Absolutely. And I’ll say, you know, we need 100 library innovation labs. Anything that we pick to focus on is one of many things that we could have picked. And I hope that all of those flowers will bloom. But for the direction that we go in, the core organizing principle is your society needs knowledge to plan and to direct itself. If you have poor short-term memory or long-term memory as an individual, it’s very hard to navigate your life. If we have poor short-term and long-term memory,
as a culture, as a community, as a government, whatever layer you want to look at, it becomes very hard to navigate. And all the projects that we look at address that in different ways. We built Perma.cc, for example, which fixes link rot in law decisions, in published cases, and in law journal articles, and it’s used by law firms. And it makes documents reliable in the long term instead of the short term. When you cite to a URL in a document,
You include a permalink, and that permalink is on file as a copy of the web page with the Harvard Law Library. And that means that link is going to work in perpetuity. And it goes from kind of having this etch-a-sketch memory, where you can have a case, and a month later the domain doesn’t resolve and you don’t know what they mean, to having kind of permanent memory again. So what that means for LIL is we’re looking at how do you preserve knowledge for the long term and how do you interpret it. On the preservation side, we’re working on projects like PERMA.
We’re working on projects like we’re going to talk about our public data project, which is how do we make sure we don’t lose the public data we all create together? And then we’re also looking at the access and interpretation side. We have a research program looking at law and artificial intelligence, because law is such a wonderful playground for understanding how AI changes our ways of knowing. The law is kind of done by words. I think of how I want to say it. You think of how you want to say it. The judge picks something, and those words become meaning in the real world.
Jed Sundwall (08:26.93) Yeah.
Jack Cushman (08:35.394) which means that systems that can interpret and juggle and shuffle words to make meaning all of a sudden have this real practical impact in our field. And it lets us study things like, how are we going to help law students actually learn in a world where the tools can do much of the reading for them? How are we going to evaluate how good tools are at the fine-grained thing that you’re trying to do? How do we do benchmarking of the thing you actually care about instead of abstract benchmarks of other things? And how are we going to navigate a field where law employment is rapidly changing?
Like, law employment used to be very pyramid shaped. You hire a bunch of people down at the bottom to read through piles of paper in a box. And now the need for reading through piles of paper in a box is really changing. We have to reinterpret what it means to be a junior lawyer who works their way up. So we’re doing a bunch of things that are about how to make sense of the data once we have it. And you’re seeing both sides of that in the work we’re doing with you. There’s how do we responsibly collect things and then how do we responsibly share them so that people can really find what they need.
Jed Sundwall (09:31.312) Yeah. Well, yeah. So let’s talk about the data.gov archive and how that came about. Cause I mean, I think the conversation started about a year ago, when we thought maybe this would be a good idea to start backing up data.gov. But I will confess I don’t have the clean answer to what is in this collection. How do you describe it to people?
Jack Cushman (09:52.791) Yeah, yeah, great question. So what’s the point of the data.gov archive? It did start because we wanted to do some broad reaching collection of federal data sets. And you mentioned, like, you know, there’s a geopolitical context where you might say, it’s important right now to save data. And at the same time, our law library has been saving data for the federal government since the early 1800s. I don’t know quite when Harvard’s relationship started, but.
The first act where Congress started asking organizations like ours to preserve documents was in like 1813 in the federal library depository act. I’m going to get the name wrong, but it’s been over 200 years that Congress was saying, please help us collectively preserve the stuff that matters. And with data.gov, we were saying, well, what does that mean for 2024, 2025? And
We already knew that the End of Term Archive, which we’re part of, was doing a wonderful job of collecting the web pages of the federal web, including anything under .gov, but also including their Twitter pages and their YouTube and anywhere that the federal government had a footprint, getting a snapshot before and after the transition so you could understand what changed. And End of Term Archive has been doing that since 2008. It’s not a kind of this year or that year thing. As a citizen, you should be able to see what your government was and what it’s become. And you should be able to see that repeatedly as the government evolves.
So we knew that was happening. Then we said, well, what’s not happening? And the real risk that we saw is you can easily end up, if you do a web crawl, getting the manual for the data but not getting the data itself. Because the way web preservation will work is you have a browser, like any of us would use, and it clicks from link to link. And it tries to click all the links on the page, and it clicks all the links on the pages it finds, and then it clicks all the links of the pages it found there. But it can’t do things like interact with a form. It can’t do things like if you need to send an email to get data or
If you need to script an API, it’s only going to get the stuff that you can get by clicking, which is wonderful, but might mean that you end up with a submerged layer of, wish we had the actual data that this report was based on, and that is just gone if it disappears. There was a data rescue community that emerged around that time, a bunch of different groups working on wonderful projects. The part that we worked on was, see if we can save the underlying data behind the data.gov website.
Jack Cushman (12:13.39) Data.gov itself is an index. It lists datasets across the federal government and also some states. But it doesn’t store the data. It just says, you can go here to read this, you can go here to read that, you can go here to read that. They do have an API. So what we did is let’s script this API, get a list of all 300,000 datasets in there, and then find everything they link to and call that the collection. So, you know, dataset number 2,104, which is a dataset of…
you know, traffic congestion in medium-sized cities or whatever part of measuring our society, is going to link out to this CSV and this Excel file and this PDF and this zip file. And that list of objects becomes what we want to put in a collection. And then the goal is to, you know, accurately collect each of those things. So grab the metadata from the API, grab all of the URLs that link out to it, and package those up as one of 300,000 objects that we were making in a new
Jed Sundwall (12:57.458) Got it.
Jack Cushman (13:13.102) collection of collections.
Jed Sundwall (13:15.528) Okay, but then obviously, you know, our world again, going back to the geospatial world, we deal with federally produced data sets that are like petabyte scale, you know, weather data and model outputs and satellite imagery, things like that. You don’t have that stuff. So this is just what’s linked to, I guess. I guess my question is, how many layers deep did you go?
Jack Cushman (13:39.541) Yeah, great question. So we went one hop deep. So you have the listing on data.gov. It links to a set of files. And it says, these are the files in this data set. And we grabbed those files. And I think what that meant is we ended up collecting the smaller data sets. Because for the smaller ones, it would be linking right to an object, a file that was the data of that collection. And for the larger ones, yeah, it had the problem that those links would go to a landing page that said, for this petabyte-scale collection, here’s the steps you go through to get it, that are very individual to that collection.
For those, we would only get the landing page. We wouldn’t get the actual data. And what that meant is we added up to about 17 terabytes of data, which is a bunch of small data sets and then a bunch of landing pages for large data sets. I think the size kind of tells you both what it succeeds at and what it fails at. Because it tells you on the one hand, no, we didn’t get the massive uncompressed image collections or that kind of thing. It also tells you we didn’t just get landing pages. Like 300,000 landing pages is not 17 terabytes by any means.
Jed Sundwall (14:36.412) Right. Right.
Jack Cushman (14:37.74) We got a ton of the smaller data sets. And I kind of liked that as a first pass. We just want to do something to stabilize what exists now, not be losing things. And I think it gets you a very broad reach; small significant data sets are going to be in there and are going to be preserved. And then it sets up for this question of, well, what else got missed? And you know what? It was true at every level. So there was one piece we knew, which is the things in data.gov, we’re going to get some of them. We’re going to miss some.
That’s necessary at this scale. We were also told going into it, data.gov itself is a partial listing of the federal government. I talked to technical folks working in the government at that time to get an idea of, where’s the list? What would I download if I wanted to download the data sets of the federal government? And first I asked, do you know where that list is? And then I asked, who could you ask? And they said, no, I don’t have a list. And second, no, I can’t even think of someone, a group of people I could ask, who could collectively know what it is.
that what we have is a sort of sprawling, overlapping set of independent agencies and groups just making data. And if you look at data.gov, it’s like, here’s a cool snapshot, 300,000 out of X, out of we don’t know how many.
Jed Sundwall (15:44.028) Yeah.
Jed Sundwall (15:49.673) Yeah, man, you’re taking me back. You know, many years ago I worked for USA.gov. So I was at GSA as a contractor when data.gov was launched. And so I had a front row seat to all of that. And I have a similar story, which is, at USA.gov at some point, cause I was leading the social media strategy for USA.gov. And I mean, to give you a sense of what this meant, I started before Obama was elected. Like I started sort of the end of
W’s second term, and Facebook and Twitter were already becoming a thing. And it was like, we need to learn how to use this. How do we do it? And at some point somebody was like, we need to keep track of every federal social media account. And it was like, well, what are you gonna do? Open Excel, create a spreadsheet, and just add them as you find them? And we’re like, that’s obviously not gonna work. This is too big now. And so we created a thing that I’m pretty sure
Jack Cushman (16:38.84) Mm-hmm.
Jed Sundwall (16:50.434) I don’t know, it might still exist in some form. It may have been deprecated, but we called it the USA.gov social media registry. And it was basically, what we did is we let anybody with a .gov email address, submit a social media account that they managed. And then we would send them an email, because we’re like, okay, you’ve got a .gov email address. We also asked them to put in their phone number just to scare them, just to be like, this is serious, like don’t spam this thing. But basically you would get an email with a token in it.
You’d click on that so that we would know that you actually owned the .gov email address that you put in. And we’d say, okay, this does look like the Twitter account for the embassy in Myanmar or something, whatever it was. And it worked really well. We called it fed sourcing, like we’re going to kind of crowdsource all this sort of stuff. But one of the things we wanted to do for the form was, we need the list of the government agencies, which I know that you’ve dealt with.
Jack Cushman (17:44.296) Not sure. That seems like that list would exist.
Jed Sundwall (17:46.993) Yeah, well, it’s actually something I was going to ask you about, because you guys have built, and this is, I mean, it’s also like a segue into the viewer that you all produced, but you have this awesome data.gov archive search that you’ve built. I’ll let you talk about this. But one thing I just sort of want to get out right away is that you have things listed into organizations, publishers, and bureaus. And I’m curious to know, like, if you all had the same conversation where you’re like,
what are the government agencies? Because as far as I know, that list still doesn’t really exist anywhere. We had to make one up based on a Wikipedia article. Like that was the best source we could find. So.
Jack Cushman (18:26.978) I love that story. Well, before we get into our archive, I think that question of what is the denominator, what is the set of data that’s out there that we wish we could save, really helped me appreciate the goals that we have behind this thing. Because I started to picture where this data is coming from. And rather than like, I don’t know, there’s the DOJ, there’s just these objects out there doing things like a giant unit, what we’re really talking about is federal employees.
Jed Sundwall (18:40.327) Yeah.
Jack Cushman (18:56.334) you might know the number better than me, maybe 2 million federal employees who are out there doing things for us, making things for us, like go to work and in some way facilitate the functioning of the country. And in the course of their business, making data, making data sets, whether it’s how are the crops growing or how’s the water in the aquifers or what’s going on in this little section of the economy or what’s going on in this little section of education or whatever it is, people going about their day and along the way recording things that help us understand what’s happening.
Jed Sundwall (19:02.738) Yeah.
Jack Cushman (19:27.006) And it helps to understand why there’s not a central list, that of course those two million people would be generating millions of Excel files, things that just like, here’s some stuff you should know, here’s something I learned in the course of my day that is worth writing down. Many of them very deliberate and collective and across a group of people. But in many ways, as people who live here, as people invest in our society, we would want all of that. We would have this kind of relationship that is not kind of a citizen and a government, but a person and a person.
that those people should be able to publish the things they learn that will help us. And we collectively should be able to access those and use them. And at that level, the mission starts to feel much more palpable and meaningful to me. That’s like, how do we help those people who are learning things or trying to help us record the things that they’re learning so that they are permanent? And so they’re findable. And if we can have the right taxonomies, let’s do it. If we can have processes, let’s do it. But at the end of the day,
Jed Sundwall (20:08.946) Yeah.
Jack Cushman (20:25.068) let’s just have the stuff that we paid for, the people who we employed to help us be able to share the things that they learn and be able to preserve those. And then let’s back into how would we get that list? How would we index it? How would we organize it? One thing that’s made me really curious about, I think there’s a project out there. I don’t know if this is a you and me project or who should do this. But I would love to use the common crawl in the end of term archive to try to just make the list.
Like, what if you went through every web page we know about and maybe asked an LLM, you know, do some automation in there, and asked, what clues does this give you about a data set that exists? And then see if we can find all of that and, you know, aggregate it and combine it and deduplicate, and come out with the world’s first denominator of, what’s the data the federal government has published, and how many data sets would that be on top of the 300,000 we know about? It would be so wrong. Like the number you got would be
Jed Sundwall (21:15.366) Yeah.
Jack Cushman (21:21.966) barely related to reality, but it’d the first time someone has planted a stake for like, I think this might be the list. I think this might be just our inheritance as people who live here and people are trying to share data with us. Like this could be it, what ought to exist. Cause I’d love to able to see that. I’d love to be able to see that constellation and look up and say, yes, that is like the thing that we have built.
Jed Sundwall (21:43.239) Yeah, so you’re reminding me of two things. One is, are you familiar with this story? I don’t even know what you would call it. It’s an essay written by Jorge Luis Borges called The Analytical Language of John Wilkins. I imagine this has to be right up your alley. I’m putting it in the chat. Like, librarians should love this. A lot of computer scientists love this story. Cause it’s a story about an effort at creating
Jack Cushman (21:57.71) I don’t know that one.
Jed Sundwall (22:14.576) an actual language, it’s similarly like an attempt at sort of like taxonomizing the universe and it doesn’t really work out very well. And Borges points out, he’s like, the reason we can’t do this is because we don’t know what thing the universe is. We don’t have a handle on it. And to your point about like, you know, the government as being perceived as a monolith, as being perceived as, you know, something that is in DC or something like that is just obviously not true. And that’s the other…
Jack Cushman (22:28.449) Yes.
Jed Sundwall (22:42.088) The other thing you remind me of is another essay. And I actually don’t know who wrote this off the top of my head. It’s just like some guy wrote on the internet, but like, I’ll find out now as I Google it, but it’s the title of the essay says it all. It’s reality contains a surprising amount of detail or reality has a surprising amount of detail. got a guy named John Salvatier, I’m not sure how to say his name, but both fantastic little essays here. yeah.
We’ve both lived through this, where you can see it in something like open data policies that are like, the government produces data, the government should make the data open. And those of us who then start looking hard at it were like, man, this is not an easy task.
Jack Cushman (23:29.282) Absolutely, I love this duality, this like, well, there’s an abstraction that we wish we could have, that is the perfect data that exists in the abstract. And then there’s this reality that what we’re talking about is the subjective views of a bunch of human beings. And this comes up very practically in the kind of work that we do, both you and I, when you’re trying to do archiving work. Reality kind of doesn’t wanna fit your taxonomy, and you have to make a lot of choices. When we were doing the Case Law Access Project, where we scanned
like the collective case law of the United States from historical times up to 2018, we found cases that came from imaginary dates. Courts would just publish a case, and there in the book it would say, oh yeah, February 29, 1911, just a date that doesn’t exist. And we were trying to put it in a database, and Postgres was like, that’s not a real date. I can’t save that in my database. And we’re like, OK, but it’s a real case. It really has that date on it. It is precedential. It’s part of the law that you and I are supposed to know and follow.
Jed Sundwall (24:11.176) Sure. Yeah.
Jed Sundwall (24:24.68) Wow. Yeah.
Jack Cushman (24:26.658) We just have to now infer, well, from what date did it become part of the law? I guess maybe midnight on February 28th; it existed in this magic hour. And I love that example because there’s this thing that we’re trying to do. Why do all of this? That is, we’re owed ground truth. And the ground truth is both subjective and objective. We all live on a planet made of atoms. And it’s important to know that how much water is in the aquifer is just how much there is. You can’t change that by describing it differently.
But we’re all kind of observing and touching reality with different means and levers. And what we’ve come away with, our measurements are all different and subjective. They add a layer of subjectivity. And if you’re the collector of collections of collections at the end of it, which is kind of where we’re trying to be, you end up with both of those at once. We have an objective reality that we’re measuring and we have a subjective attempt to measure it that we’re trying to make sense of. And I just love that game. I love that work that we get to do of like,
help to see the world for what it is and also help to see people for what they are, which is, you know, very imperfect observers of everything we see.
Jed Sundwall (25:33.277) Yeah, I love it. I mean, this is a reminder, like, everything we do is part of Radiant Earth, which is a nonprofit, right? But our mission is to increase shared understanding of our world by making data easier to access and use, for that reason. Which is basically, we are all, I always refer to the blind men and the elephant, I always use this framing that we’re feeling our way in the dark.
we’re increasingly adding new capabilities of measuring reality and trying to understand it. I’m like, well, we should just, let’s make sure we do that together. And I’ll say what I love about my job and the approach we’re taking is just that like, that gives us so much freedom to be happy anytime anybody takes a swing at bat. We’re like, yeah, go for it. I’m not, yeah, exactly. And people are like, you know, like I wanna try some weird new file format. And people are like, well, that’s not.
Jack Cushman (26:19.862) Yes, get that up there too.
Jed Sundwall (26:28.412) that’s not the one that we use. And I’m like, it doesn’t matter, let them try. So that’s a segue. We should talk about the search, the archive search, but I want to talk about Baggit first. How do you describe Baggit to people and why do you use it?
Jack Cushman (26:42.712) Sure. BagIt is a sort of collective product from the library community writ large, but it was strongly endorsed by the Library of Congress. So it really got some traction there. And I think that was around the 2010s. I don’t remember the exact date. The notion was to have a data transfer format that is as simple as it can possibly be, where every moving part has been stripped away,
so that you can do it reliably and make readers that can reliably pass around things regardless of what’s inside. Because part of the issue is you end up with like a, well, here’s how you encode a web archive and here’s how you encode an image or an image collection and here’s how you encode a novel. And you have the proliferation of formats and you get things that fall in between them and have this kind of taxonomy question we were having. So what if you had something that can just like correctly encode anything in a very loose way? So a bag is a folder.
Jed Sundwall (27:21.895) Yeah.
Jack Cushman (27:35.769) And the folder has inside it another folder, which is the data folder. And whatever is in there is the thing that you bagged. And then it has a little bit of metadata. It has an index that says, here’s the hash of everything that is in me as data that I’m recording. And here’s the date I was made and some things like that. And beyond that, it’s up to the implementer to decide what substantive metadata to record. So it becomes a lowest common denominator way to pass around data in the library and archives community. And certainly,
Jed Sundwall (27:41.243) Okay.
Jack Cushman (28:05.356) you want to specialize from there. You want to have image collections and have a bunch of image-specific things that they standardize on. But you don’t want to be stuck with that. You want to also be able to step down to a lowest common denominator to do interchange. We reached for BagIt with Data.gov because it looked exactly like that kind of problem, a very heterogeneous collection. 300,000 datasets, you don’t know what’s in them. You want to just get them all and get them correctly, regardless of whether there are new file formats you don’t know about. So something that was like, take the files you care about, put them in this folder,
Jed Sundwall (28:17.585) Okay.
Jed Sundwall (28:23.911) Yeah.
Jack Cushman (28:34.998) was a really nice place to start. And then we had to build a bunch of stuff on top of it.
Jed Sundwall (28:35.43) Yeah. Okay. Okay. But then, but the idea though is the folder is an object. It’s a binary that gets uploaded to S3 and it’s a BagIt file.
Jack Cushman (28:46.08) Yeah, so if you’re passing it around, I think we zip them. We put them in a format where they’re compressed, but also, with an index, you can pull out individual files from the compressed thing. And this is kind of an elaboration on top of BagIt itself. So BagIt doesn’t specify a single-file expression of itself. The bag is actually the unzipped zip. So it’s like a folder. It has this file. It has this file. It has this file. And if you have a folder that complies with that, then it’s a BagIt object. It’s like a folder.
Jed Sundwall (28:55.602) Okay.
Jed Sundwall (29:02.994) okay.
Jed Sundwall (29:07.718) Interesting.
Jack Cushman (29:12.802) But we don’t actually share folders on the internet. You always have to turn it into a single file one way or another. So when we share them, the way that we did it is to zip them and index the zips. And if you do that right, then you can get a set of ranges where like, do you want this CSV out of the bag? Just fetch this range directly from the file, and it’ll give you that CSV. And that’s kind of the best of both worlds for serving in terms of it’s small, but it’s also accessible.
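The mechanics of "zip it, but keep an index so one member can be pulled out by byte range" can be simulated with Python's standard library. The archive's actual indexing scheme may differ; this sketch just stands in for the HTTP Range request by slicing an in-memory blob.

```python
import io
import struct
import zipfile
import zlib

# Build a zip that stands in for a compressed bag containing several files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data/big_other_file.bin", b"x" * 100_000)
    zf.writestr("data/wanted.csv", b"id,value\n1,42\n")
blob = buf.getvalue()

# The zip's central directory is the "index": it records each member's
# local-header offset and compressed size, so a client can fetch exactly
# one member by byte range without downloading the rest.
info = zipfile.ZipFile(io.BytesIO(blob)).getinfo("data/wanted.csv")
name_len, extra_len = struct.unpack(
    "<HH", blob[info.header_offset + 26 : info.header_offset + 30]
)
start = info.header_offset + 30 + name_len + extra_len  # skip the local header
ranged = blob[start : start + info.compress_size]  # what a Range GET would return

csv_bytes = zlib.decompress(ranged, -15)  # raw deflate stream, no zlib header
print(csv_bytes.decode())
```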
Jed Sundwall (29:17.618) Right.
Jed Sundwall (29:42.493) Yeah, it’s interesting. We have to think through this on Source. As far as features go, the way Source works is you’re just navigating an object store. For those who know, you’re not clicking through folders. You’re navigating prefixes and then listing what’s in there. But then when you get to an individual object, we want to tell you and show you everything we can about that object. And something we need to do for BagIt files and zips and tars is
show you that index. And so it’s just a new view that we have to think through a little bit, where it’s like, yes, you’ve landed on an individual object, but you should also think about it as still part of this kind of directory structure. Yeah.
Jack Cushman (30:25.806) That’s right. And your podcast listeners may know this, but I think I should plug the mission that you’re describing, which is, I like to say we collect collections of collections. And I think you then collect collections of collections of collections. So you end up with this very meta, like, here is a thing: Harvard made this collection of Data.gov objects. But you don’t want it to just be bits that people have to download and have a local viewer for.
Jed Sundwall (30:47.314) Yeah.
Jack Cushman (30:53.784) What I’ve heard from you is we really should help people understand what it is they’re getting. Just a little like try before you buy of like, what would be in there if I pulled it down? And that becomes easy for like a few standard things to show the beginning of a CSV or an Excel file. It’s very straightforward. And you’ve done things with mapping, which I think is also wonderful. But yeah, what do you do when you have a zip file? Are there ways that we can start to show that? I love this vision that our community can do that together. We can start to say, I’d love to be able to try before you buy this kind of object too. There’s a bunch of these and I’m curious what they are.
And then just contribute that viewer and have that happen too. I think that vision is so key to this. One thing that you and I have talked about a bit is, some of it I think is really very specific to one collection. Like, we have a custom viewer for data.gov. I actually think you probably want a custom viewer, because you don’t want a BagIt viewer in general. BagIt is a very general format, so it’s hard to expose much detail there. You want a, you know, Jack Cushman-flavored BagIt viewer. Like, you know, the
Jed Sundwall (31:36.818) Yeah.
That’s right.
Jack Cushman (31:50.839) a viewer that will tell you what’s specifically in these ones, that with a little bit of elbow grease on our side, you can have it actually be able to see what’s in there very specifically. And I think this game is like, how much can we use standard formats and how much do you end up with a bunch of viewers?
Jed Sundwall (32:06.022) Yeah, it is, you know, I mean, so first of all, it’s very nice to hear you repeat back what we’re trying to do, and you nailed it. That’s, yeah. Yeah. Well, it’s better to have you toot the horn for me. So that’s great. I love it. Fancy Harvard guy agrees that what we’re doing is a good idea. Well, I just put in the chat the archive search viewer, because absolutely, I mean, our,
Jack Cushman (32:14.594) You skipped past tooting your own horn, but I think it’s such a good strategy.
Jed Sundwall (32:36.264) So this is a callback to the Great Data Products blog post where I finally posted, I finally published again what I call the sweet spot graph, and which is something that I’d come up with when I was working at AWS, which was this notion that I still have more work to do on this idea. We’re gonna write another paper about it, but like that you don’t wanna over determine how data is interpreted. It’s everything you were saying before.
but you do still want to give people some assistance in seeing the data, right? And so you have to find the sweet spot between, like, here’s the raw data, we refuse to interpret it in any way, let the universe decide what it’s good for. But also, like, let’s be honest. If you download a hundred-thousand-row CSV, you can’t open it in Excel. You have to, you know, and then if you’re
properly nerdy, you’re gonna do like a head in the terminal and just sort of look at the first few rows. Like, we can do that in the browser now, like, trivially, you know, so we should. And so that’s something we want to build in. But then also, to your other point, with the viewer that you built, if you have a handle on your collection of collections that you’ve put together, you should also, in the browser, be able to show people around. Yeah, give them, like, Jack’s tour, which is great.
Jack Cushman (34:00.941) Yeah, very much. There’s this semi-opinionated, because I’m not opinionated about the details, but I’m opinionated about, like, what’s the most sensible way to explore this? You know, one thing where I think that’s getting more urgent is, as a data rescue community, as an archival community, we have a real challenge with preserving the interfaces to things. So one thing that you’ll get those 2 million employees doing is like, well, here’s some data, and I think you actually might want to see it on a map combined with this other data so you can understand how, like,
Jed Sundwall (34:08.412) Yeah.
Jed Sundwall (34:21.831) Yes.
Jack Cushman (34:30.156) your housing choice relates to your school choice, relates to your hospital choice, whatever the things are. There are all these semi-opinionated viewers that just combine two sources that are helpful to see in a shared visualization. And those we mostly lose because when you move from saving the underlying data to saving the software, you’re moving from the business of data preservation to the business of software preservation, which is its own field that is just much more complicated. You have to understand.
Is the source open? Is there a way to host it? Is there a way that it will be patched in the future? How does it need to evolve? And software preservation is just a much more challenging, one-at-a-time kind of business. So the point is, we’re losing a ton of our viewers if we disinvest in publishing data. And that means we need to ask, because I think the archival community cannot replace that, we’re not two million people who can come build things, we need to ask: can we
make more general-purpose viewers that help people actually see the part of it they need? And so the undertaking of what would be the sweet spot of a general-purpose viewer that helps any given person understand what they’re looking at, I think, becomes so important.
Jed Sundwall (35:39.197) Yeah, yeah, well, I guess I’ll say to everybody, stay tuned. I mean, this is something that we’ll definitely be doing a lot more of. And what’s actually kind of funny, I mean, people tend to think this is funny, at least a lot of the people I hang out with, because they’re climate model nerds, but I’m like, we really need to make it easier for people to see CSVs on the web. They’re like, what? I’m like, trust me.
Jack Cushman (35:57.678) Sure.
Jack Cushman (36:02.262) Yeah. Yeah. I think that user feedback is so important. One piece of feedback we got for our Case Law Access Project is, we were publishing JSON Lines files, one line of JSON per record. And that was really useful for Python programmers; there are great tools for reading that. It was very confusing for our R programmers, if I’m remembering right. In R, it was a lot easier to read a CSV than a JSON Lines file. And I just got this feedback, like, can you make it CSVs? Like, that works better in my environment.
Jed Sundwall (36:19.176) Yeah.
Jed Sundwall (36:22.952) Yeah.
Jack Cushman (36:32.14) And it was like these little things, like, if you can get past that friction, then people were able to use the thing.
Jed Sundwall (36:37.82) That’s right. Well, and also, I mean, I think the story you just told highlights something that we feel really strongly about, which is that you really have to focus on the practitioner community. This goes back to the sweet spot concept of over-determining how data gets presented. If you go too far the other way and you’re like, well, yeah, people just want a dashboard, or you just want a visualization for an executive, then you’re cutting out a whole user community that
could really surprise you and do interesting things with the data. But, well, could you say a little bit more about your viewer, though, like how it was built? And yeah.
Jack Cushman (37:13.396) Absolutely. Yeah, very practically, if you go to this link, you can go browse our collection. And the way that we’re structuring this is sort of some tasteful use of the metadata that came with data.gov. So this owes a lot of DNA and credit to data.gov for structuring the data, offering metadata for how to just shuffle these 300,000 data sets. And we’re really just replicating that. Going back to your question of do we have a separate list of US agencies?
Jed Sundwall (37:39.442) Okay.
Jack Cushman (37:42.287) We really just have the list that came with the data, of what metadata entries they have. And we let you search by dataset title, organization, and so on. And then we let you narrow down by categories, what we saw as the most useful chunks, metadata fields that were in our raw data, to let you browse. The really important thing about this, what makes it a little more interesting than a million other pages you’ve seen that let you browse a large dataset and narrow it down,
is that it’s running entirely in your browser. There’s no server-side component to that. And for folks who might be on the less techie side of things, we’re talking about: in a typical website, you have your own browser that runs on your computer, and it fetches HTML and JavaScript and so on from a server. The server is also running custom software. And when you send in your request for, just give me the ones that came from the US Geological Survey, on the server, it filters out all the others, narrows it down to that, bundles up exactly what you need, and sends it down to you, which means the person who’s providing this to you
is doing sort of ongoing work for you. They’re keeping this software up to date and running and paid for. And so you’re dependent on them still existing. If you want to come back tomorrow or next year and still be able to narrow things down to just US Geological Survey, you’re depending on the person who’s really providing a service for you, still being there to narrow it down for you and hand it to you live when you need it. And that creates a lot of precarity in the digital humanities space. And there’s a…
We now have enough decades of experience making digital humanities projects and putting them online and then running out of money for them and having them crash again. You can study this. You can look at 100 projects and what made them live or what didn’t. And that server-side software load really becomes an issue because it’s the first thing that’s going to kill your project. It’s a huge difference between print books and libraries and digital books. And I love this contrast. Given some climate control, given a roof that doesn’t leak,
Books are pretty happy to be left alone for a year. If you’re like, you know what, we just don’t have staff to open up this part of the library for the next year, we’re going to close the door, set the thermostat to the right level, then you’re probably just going to find them in better condition in a year than they would be if people had been looking at them. With digital, it’s not like that. If you’re like, we just don’t have the people to maintain this for the next year, there’s a good chance it’s gone and unrecoverable when you come back for it. You didn’t pay some server bill and something got deleted and no one’s around who knows how to put it back together and it’s just gone.
Jack Cushman (40:03.47) So this viewer, the really exciting thing about it is that it’s really not subject to that kind of rot because it’s client-side only. Because when we give you the data, we hand the entire software to view it to you right alongside. And the idea is if you’re making a copy of this, you get the original, you get the software too. Your copy becomes just as good as the original. And you can see right now, it’s kind of clunky. When you click around it, it’s slower to load than it would be if we had a powerful server running.
we’re kind of pushing the edges of what’s possible to do in the client. I think we can push those edges a lot further. I think a lot of the clunkiness can be fixed by more indexes and more optimization here and there. But what you’re really having to do is think through if all we could do is write static files, what static files would we need to make the experience I want very efficient? And just like you have seen,
geo data that is structured very carefully so that you can fetch the parts you need from the server without needing server-side software. We can use DuckDB and write custom parquet files out that have these are the indexes that you need to serve this experience with the data you most need right at the top. And the better we have that structure, the faster the thing can run. A cool thing about that is it ends up being the same skill that you need to make a fast server-side software. So like,
If your data is poorly indexed and you’re sending a bunch of queries to the server that require it to do a bunch of work, the server is going to crash if a bunch of people use it. So you try to use indexes where the server has to do very little work. If you get those really right and really pristine, you don’t even need the server. You can just fetch the index data directly. That’s the plan. We should talk about cryptography too, because I think that’s a necessary piece of this vision. But let me know if we should jump to that now or stick to the client side.
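The "if all we could do is write static files" exercise can be shown with a toy version using plain JSON shards. The real viewer uses DuckDB over carefully written Parquet files; the dataset names and fields here are invented for illustration.

```python
import json
from pathlib import Path

datasets = [
    {"id": 1, "title": "Streamflow gauges", "org": "US Geological Survey"},
    {"id": 2, "title": "Aquifer levels", "org": "US Geological Survey"},
    {"id": 3, "title": "Census blocks", "org": "Census Bureau"},
]

# "Publisher" side: write one static index file per organization, once.
out = Path("indexes")
out.mkdir(exist_ok=True)
by_org = {}
for d in datasets:
    by_org.setdefault(d["org"], []).append(d)
for org, rows in by_org.items():
    shard = org.lower().replace(" ", "_") + ".json"
    (out / shard).write_text(json.dumps(rows))

# "Client" side: to filter by organization, fetch only that one small shard.
# No server-side query engine needed -- just static file hosting.
wanted = json.loads((out / "us_geological_survey.json").read_text())
print([d["title"] for d in wanted])  # ['Streamflow gauges', 'Aquifer levels']
```

The better the precomputed indexes match the queries users actually run, the less the client has to fetch, which is the same discipline that makes server-side databases fast.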
Jed Sundwall (41:48.425) No, let’s just, I mean, let me just linger on that a little bit. So yeah, when you open the search, you have a little spinner there. And I assume what’s happening there is that, is it WebAssembly loading? Do you know?
Jack Cushman (41:59.279) I think it’s DuckDB loading. There are about five megabytes in the current client that have to load just for raw DuckDB. And this was a technical choice we had to make early on. Like, do we use a well-supported, off-the-shelf library that does make you load a few megabytes? Or, the core work that we’re doing could be done in a lot less software to send down, but you’d have to do a lot more custom work.
Jed Sundwall (42:07.569) Okay.
Jack Cushman (42:25.71) We ended up deciding to go with the off-the-shelf thing with DuckDB because it makes us part of a larger community, and we think it’ll feed back and forth with the open source community better that way. But it was a tough decision. I think the state of this technique right now is that it’s still pretty bleeding edge. You find a bunch of libraries that are like, someone made it and thought it was cool but stopped supporting the GitHub repo, or this was a one-maintainer project and now they’re gone. Or this is a large project that’s planning to implement it, but they haven’t gotten around to it yet, and you have to find a branch where it kind of works.
So working this way ends up kind of pushing you into some creative coding. And so part of what you’re seeing is loading that DuckDB software, for now. And what I’m hoping, I think DuckDB itself could be a lot smaller, and that’s one direction I’d love to see that grow. The other thing you’re seeing is loading the data. So in addition to fetching DuckDB, at some points, as you click around through here, it’s going to say, to answer that query, I would need to have loaded this index that I know exists.
Jed Sundwall (42:55.473) Yeah.
Jed Sundwall (43:07.505) Yeah.
Jack Cushman (43:23.788) And so it’ll go back to the server and say, can you please send me 500k or megabyte of this index? It’ll help me show the answer to this. And as you click around, you’ll see less of that because you’ll be loading into your browser the parts that you need to see the experience that you want to have.
Jed Sundwall (43:38.299) Okay. Yeah. I just put in the chat also that, you know, Hacker News picked it up. They thought it was pretty interesting, what you’d done. And the only other thing I’ll say is, in the last episode, Brandon Liu, who created Protomaps, which is amazing, a vector tile file format and serving tool. I feel terrible, I don’t know exactly how to characterize how awesome Protomaps is as a project, but he’s like, look,
I want to, it’s very, very simpatico with what you’re saying. He’s like, you should also be able to put Protomaps data onto an SD card and walk into a forest, you know, and give it to somebody on a laptop and visualize it there. But now you still need to run a browser. So everything you just said hints at these decisions that you have to think about when you’re trying to find that sweet spot, which is like, okay, we’re going to use a very commonly, you know, widely adopted
platform or tool, DuckDB, because there’s a community there for it. And obviously we’re using object stores and browsers because they’re very distributed technologies that people have access to. And these are the kind of decisions and thinking that I think, well, whatever. I’m preaching to the choir here. Yeah.
Jack Cushman (44:51.31) It’s exactly right. I think David Rosenthal, who founded LOCKSS, the way he likes to say this is: no one’s ever going to make hardware specifically for the archiving community. We are too small. So when you’re designing a system, you figure out what you can do with off-the-shelf parts that are designed for other communities. That was how it led him in the early 2000s to say, we need to figure out how to make this work on commodity hard drives, because we can’t be buying special custom media for us. We’re
way too small for that to ever be as good. We need to figure out what’s the media that other people use, and use it. And I see that repeat in all kinds of ways, you know, communities and structures.
Jed Sundwall (45:26.31) man, yeah. Just, I’m gonna, maybe a gadfly. don’t know, I don’t think anybody’s listening right now to this, but like, I’ve had conversations with big funders that want to do big stuff for climate and they’re like, we need really gnarly hardware. And I’m like, do not do that. Like, please don’t go down this path. I mean, they’re talking about building their own data centers and I’m just like, stop, please stop. you’re not, you.
Jack Cushman (45:44.683) Mm-hmm.
Jed Sundwall (45:52.989) What you’re doing is very important, supremely important. I’m glad you want to put money towards this, but like you should be focusing on the commodity layer. Anyway.
Jack Cushman (45:59.983) That’s right. I feel like there’s a, we build strong, robust community layers and then we identify specific technical weak spots where a real technical breakthrough will make a difference. So most of the work is kind of building the community that’s going to pass things around. And then we recognize something like, if we can make this client side, if we can make this cryptographically signed, we can have a breakthrough here. So let’s put some tech into that, but spend that very carefully.
Jed Sundwall (46:19.655) Yeah.
Jed Sundwall (46:23.324) Yeah. All right. Well, yeah. So let’s, now let’s talk about cryptography. yeah.
Jack Cushman (46:28.778) Absolutely. So here’s the philosophy. Every copy should be as good as the original. If I make an archive at Harvard and you grab a copy of it and put it on your desktop, your copy should be just as good as mine for posterity. And that’s because lots of copies keep stuff safe. Those copies all have to be valid. And we really, philosophically, we don’t want to be planning for any one institution to exist in perpetuity.
whether it’s the US government or Harvard or something you shot into space, it doesn’t matter. You shouldn’t assume that one is still going to be there. And then it becomes really critical to focus on how to make copies, because the history that we’ve seen on the internet is copies tend to disappear. If you try to maintain two copies of something, we’re going to have two copies of the census data, then pretty soon you’re like, well, one of these isn’t being used. The internet’s very reliable, so we’re all going to one of them. And the other one just kind of gets cut off eventually.
It gets deprioritized, defunded, disappears. So we have to make copies robust and easy. So it becomes a two-part strategy. When I ship something from the Data.gov Archive, it’s going to come with a viewer, and it comes with signatures, so that you don’t need me to be around to make sure that it’s real, to understand what its provenance is, in library terms. And we just talked about the viewer prong of that, that helps make sure that your copy is as good as mine. The signature prong of that is, when you get data from me, you should be able to tell:
Who says this is authentic? When did they say it? And what do they say is in it? It’s something that I love: Starling Lab compares this to an evidence bag in court. If you imagine, you know, I don’t know, let’s pick something nice, like a beach ball is found at a crime scene. Then it’ll be put in a bag. Most of the examples are not great, but it’ll…
Jed Sundwall (48:17.778) Yeah, yeah, yeah, yeah, yeah, that’s true. Yeah.
Jack Cushman (48:20.93) It’ll be put in a bag. The person who picks it up will sign it and say, I picked this up and put it in this bag on this date. And then when they hand it to someone, they’ll say, then I handed it to so-and-so. And they’re like, yeah, I picked this up and it was handed to me by them. And I brought it to the evidence locker. And I put it in this locker and locked it. And then the person who takes it out to bring it to court, they’ll say, I took this out and I held it from here to court. So when you’re admitting that beach ball before a jury of your peers,
you can say these are the people who would have to testify in sequence, every one of them to say hand to hand how it got from that crime scene to you touching it today. And most of the time those people don’t come testify because most of the time that process is reliable and the fact that we have that record means that we can rely on it. Sometimes it’s not. Sometimes we say like, that one person who was working the evidence locker that day turned out to be really sketchy and put things in the wrong places. Let’s figure out which ones they touched and we can revisit that.
So that provenance chain becomes so vital. If you think about proving things in court, it’s really clear. But actually, in libraries, we care about that all the time. Anytime I say, here’s a list of companies, if you could say, well, this is a list of companies that Edward Snowden said were cooperating with the government, and I can prove it, then it’s a really important list. If it’s like, this is a list that Jack found on Wikipedia, it’s a meaningless list. The provenance matters. So we need to be able to attach provenance to the things we pass around. And we need it to last longer than we do. Those are the design constraints.
Jed Sundwall (49:30.022) Yeah. Yeah.
Jack Cushman (49:43.122) Cryptography is how we do that. And we attach a signature to it. The signature says, I, Jack, say this is what this is. And you can be convinced that it was Jack who said that. And we attach a timestamp to it that says all of this stuff you’re looking at existed as of this date. No later than this date this came into being. And if you put a signature on it and then you put a timestamp on it, then you can later say reliably, Jack swore in 2024 that this was real and this is what he said it was.
And that existed in 2024. It didn’t happen later. And that doesn’t mean it’s real. Like, it still could all be fake. I could have lied about it. I could have been lied to by the web. A bunch of things could happen. But let’s imagine we go out to 2028 and two people are arguing about water rights in Nebraska. And one of them says, look at this government record from 2010 that proves that these are my water rights. And it’s gone now. It’s no longer on the website. The only copy is in the Harvard archive.
And the other one says, that’s a lie. You just made that up. That’s not a real document. It’s not on the federal web. Then what you get to argue about is: is it plausible that Jack in 2024 wrote down some lies about water rights that would mean that I win this thing in 2028? You greatly narrow down the ways that the lie could happen. And most of the time you’ll say, OK, no, that doesn’t make sense. Jack wouldn’t have known to do that. This must be real. So that’s what we need to do. We need to attach a signature. We need to attach a timestamp. Getting into the technical weeds: we were moving pretty quickly, and we wanted to
ship something, and we wanted it to be reliable for the long term at the same time. So the plan was to use very well-understood, basic, standard, off-the-shelf crypto. I think if you were designing this from scratch, you would use more modern algorithms, but what we reached for is OpenSSL and some standard ways of using OpenSSL to sign and timestamp things. And so we added a little extension to the BagIt format, which you can find in my tool, bag-nabit, which I got to name,
Jed Sundwall (51:36.515) Hahaha
Jack Cushman (51:37.677) that it’ll put in a signature file that says, all of those hashes of the stuff that’s actually in this thing, I’m going to sign that file of hashes, and I’m going to say, Jack swears that this is real. And it actually goes back to control of our email address at the Library Innovation Lab. So someone who was in control of the email address at this time signed this thing, saying it was real. And then we timestamp it, just going out to a timestamp server, like DigiCert, and say,
someone else out in the world, with no reason to lie, who timestamps a bunch of stuff, says it existed at this time. And that signature plus timestamp can give you a lot of confidence. I also built it so that it can support multiple chains. So you could say, Jack swore this was real and timestamped it, and then someone else swore it was real and timestamped it, and then someone else did it. And you can start to make a collection of people for whom it’s implausible that they all made the thing up. So technically, it’s trying to make a really simple, hard-to-mess-up,
convincing proof that this thing is what it says it is. And if you poke around the bag-nabit source that you just linked, you can see how we made those choices. And the goal was to have a cryptographer not say, how brilliant, you did some really clever things here, but probably to say, you did what we thought was amazing 10 years ago and is now fine, and you didn’t do it wrong. Because that was really the goal: don’t have any kind of big implementation mistakes in cryptography,
which is kind of the level of cryptographer that I am. Like, I think I understand the tools we’ve been given and how to use them. And I understand that almost all the time what goes wrong is not some break in the cryptography itself, but like a screw up along the way in how you use the tools. So I’m going to try to use them in a very straightforward, obvious way. And that’s what this tool offers is like, here’s just like the most obvious, straightforward way to use a very standard tool to verify where something came from and when it appeared. And…
I don’t know, I like to imagine sometime five years from now, 10 years from now, 50 years from now, people saying like, is this real or is this made up to suit our moment? And being able to say, yes, I can trace it back through Jack’s software and say, very implausible that this was made up because you would have had to do a bunch of things that didn’t happen.
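The scheme Jack describes has three layers: checksums of the payload, a signature over the file of checksums, and a third-party timestamp on the signature. The checksum layer can be sketched with nothing but the Python standard library. This is an illustrative sketch, not bag-nabit's actual code; in the real tool the signing and timestamping layers go through OpenSSL and an RFC 3161 timestamp authority, which are not reproduced here.

```python
import hashlib
import tempfile
from pathlib import Path

def make_manifest(payload_dir: Path) -> str:
    """Hash every payload file, BagIt-style: one '<sha256>  <path>' line each."""
    lines = []
    for f in sorted(payload_dir.rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f.relative_to(payload_dir)}")
    return "\n".join(lines) + "\n"

def verify_manifest(payload_dir: Path, manifest: str) -> bool:
    """Recompute hashes and confirm nothing in the bag has changed."""
    return manifest == make_manifest(payload_dir)

# Demo with a throwaway payload directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "dataset.csv").write_text("a,b\n1,2\n")
    manifest = make_manifest(data)
    assert verify_manifest(data, manifest)           # intact bag verifies
    (data / "dataset.csv").write_text("a,b\n1,3\n")  # tamper with the payload
    assert not verify_manifest(data, manifest)       # verification now fails
```

What bag-nabit adds on top of this is exactly the part Jack emphasizes: signing the manifest, so a keyholder vouches for it, and timestamping the signature, so a disinterested third party vouches for when it existed.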
Jed Sundwall (53:47.527) Yeah. Okay. Well, this is, I mean, it's great. We've talked about this before, but I've never gotten this full spiel from you. And this has to be built into Source. I've said this for a long time, it's an aspiration for Source, so I'm glad you're excited. I want people to be able to use Source to win court cases. Cause people are like, we're going to have open data for impact, and I'm like, well, how does that impact actually happen? Cause there's
There's always kind of, what I call the data delusion. I got this from Jessica Seddon. She's on our board, a great co-conspirator forever. She talks about imaginary decision makers, which is sort of like, in our circles, especially those of us who work in environmental data, we're like, well, once we have the data, then the people who are in power will know what to do and then they'll do the right thing. And we're like, no, that won't happen. Like,
I don't know, maybe sometimes that happens, but it's pretty rare. The way that you get people to change their behaviors... I mean, one good way to do it is by suing them and winning. And so I'm like, all right, well, then what do we need to do to make data actually suitable to be presented as evidence? And you just told that story perfectly. The funny thing, the beach ball thing, is hilarious, because a beach ball is so benign, and then
Jack Cushman (54:56.086) Yeah, you’re trying to offer a theory of change.
Jed Sundwall (55:15.782) And then I'm like, how would you commit a crime with a beach ball? You know, then I'm just…
Jack Cushman (55:17.774) Let's not. I worked as a lawyer for a while, and I worked some upsetting criminal law cases. My favorite, though, were torts. So I think that's where the fun is. If you study torts, it's the law of how you can get paid back after someone accidentally or intentionally hurts you. How do you just go to court and say, well, something bad happened, you should pay me until we're even? How do you prove what even would be? And when you read a torts book, every case starts out feeling like you're reading the start of a horror movie.
Jed Sundwall (55:35.314) Mmm.
Jed Sundwall (55:39.112) Yeah.
Jed Sundwall (55:46.887) Yeah.
Jack Cushman (55:47.489) It's like, you know, two brothers were riding a train. The train had no doors. The train was on a high bridge. The train went around a corner and you're like, no! And they go on. But I like it when they're 100-year-old cases and you can kind of have some distance from them.
Jed Sundwall (55:56.146) Yeah.
Jed Sundwall (56:03.016) Yeah. Well, actually, you just mentioned another thing: water rights. And I love, you know, again, we work with a lot of environmental data, so water rights always come up. And I often like to refer to the Cippus Perusinus. This is text that has been preserved on a stone tablet, or I don't know what you would call this thing, but from
the second or third century BC or something like that. And of course it's about water rights. It's basically like, this is our water. So anyway, talk about archival evidence. Okay. Yeah. You go ahead.
Jack Cushman (56:40.366) Totally.
Jack Cushman (56:46.476) Yeah, just to plus-one the thing you said: you were saying, well, shouldn't Source be signing things? And I think that figuring out the technical details of that is such an important thing to sort out. And it'll have a lot of fun, nitty-gritty design choices to it. But it goes back to this core thing that whenever we pass something from one hand to another, we should write down what was passed. Because it tells you, here's the chain of people who would have to explain what this is for you to make sense of it. And that can be in court, it can be in research, it can be just,
What is this object and where did it come from? But you have such a wonderful leverage point because you’re collecting a bunch of stuff that if you standardize, here’s how we get this into a provenance chain now. And from here on, it’s going to have a clear record of where it came from. It’s just a wonderful way to be a witness to what has happened and to start to make it possible for the community to know things more specifically and reliably.
Jed Sundwall (57:38.951) Yeah. No, I think we're in a good position to do this. It's the kind of thing that, when I was building the open data program at AWS, we would not have been able to do. I think it would be basically impossible to get Amazon to say, yeah, we'll validate all this sort of stuff. And for good reason, I think Amazon's lawyers would be like, that's not a role that we're going to play. And then of course my opinion also is quite strong, but like, we should have
differently governed entities to do that kind of thing. You don't want an investor-owned entity to do that, because it's just not core to the business.
Jack Cushman (58:15.758) You know, for people who want to get involved in that, right now I think the C2PA coalition is really where that action is. And I was just noticing Amazon is one of the members of that. I think Adobe is really the driver of it. Their vision is, if you take a picture with a camera, pass it to an editor, pass it to a newspaper, at every step of the way, as a photo is handed from one place to another, including through Photoshop, you should get a reliable record of what each person did to it, which is a perfect example of how we use provenance chains.
Jed Sundwall (58:24.657) Okay.
Jed Sundwall (58:37.351) Yeah.
Jack Cushman (58:45.112) They're making a standard that is right on the cusp of being useful for everyone else too. It's working with images as its motivating use case, and you can see some parts of it that are really shaped by that. But then you can also see it overlapping almost completely: this is just a general standard for having a provenance chain that gets passed around with a piece of data. And whenever someone touches it, they add on what they did to it, and then they pass it on. And I think if we can get there, it'll just unlock
a correct answer for how we're all supposed to be doing this thing. We've made our own standard for how to attach provenance to web pages, the WACZ Auth standard. We have our own way that we did it with BagIt here. But if we can get this thing to be a generally applicable, here's-the-right-way-you-pass-things-around standard, it'll be so powerful. And I throw that out here because, as you said, incentives can be weird for large corporations. And if one is driving it especially, it can end up kind of
Jed Sundwall (59:22.706) Mm-hmm.
Jack Cushman (59:40.599) overly shaped in the ways that they can see it helping, and under-theorized in others. This is just such a good time for people to pile in and help it be useful to everyone. I think OpenAI and the AI platforms have gotten interested in this as a way to say: if you want to prove where this came from, if you're not trying to hide that it came from AI but trying to document it, here's how you would document it. And that's a good sign, because it's such a different use case, but I'd love to see more of that in there.
Jed Sundwall (01:00:04.264) Yeah, yeah. Okay, well then we will. Just making plans for 2026. All right, so there's two other things I wanted to touch on. We still have time; as you said before we started streaming, if we really wanna make it to the top of Spotify, these things need to be three to four hours long, but we're not there yet. So, you mentioned once this idea that sort of the internet has created this kind of…
Jack Cushman (01:00:25.166) Totally.
Jed Sundwall (01:00:32.84) I would call this just sort of like a distortion or creates this illusion that data is safe when it’s not. it kind of directs everybody into having just one copy somewhere. Do you expound on that? Or did I represent that right?
Jack Cushman (01:00:48.832) Yeah, absolutely. That’s exactly right. We’re calling it the one copy problem. And the summary of the one copy problem is that all of the data that we rely on is very fragile. That’s the urgent thing. But the why it’s fragile gets really interesting. And it really comes from the economics of having the internet be very reliable, counter-intuitively. When you’re studying the internet, there’s this network diagram that gets passed around a lot. There are layers. There’s an hourglass where you have like
Jed Sundwall (01:01:01.416) Yeah.
Jed Sundwall (01:01:10.088) Yeah.
Jack Cushman (01:01:17.634) IP and TCP and the DNS system and browsers and applications as a bunch of layers that each take care of their business and let the layers above and below them take care of theirs, so the whole system works. So if you picture our data preservation system as layers, a layer that works incredibly well is the ability to reach out and contact a website. Cloudflare was down yesterday, and everyone is talking about it. It makes headlines that there are some websites you can't reach within a second right now.
But we're used to, almost all the time, almost all websites anywhere in the world, you can get in under a second. It's incredibly robust. If you looked at it in terms of how often it's online and how reliable it is, the system is designed very well and it works very well, and it gets you things immediately with no complaints. There are exceptions to that, but it's a reasonable way to think about what the internet is and how it works. And that reliability ends up creating fragility in other places in the stack.
Because when you have two versions of something, it's equally easy for the entire world to all go to one. There's no kind of incentive to be like, well, this one's down sometimes, this one's down other times, this one's closer to me, this one's farther away. No. If you have CDC data one and CDC data two, a crowd is going to kind of pick one. And then that one is going to gain momentum, and they'll tell each other about it, and pretty soon 100% of people will be going to CDC data one, and no one's going to CDC data two.
And after a year or two, someone's going to say, why are we still paying for this thing that no one uses? And it's going to come off the budget. And that's true for governments. It's true for nonprofits and public-interest preservation. It's true for corporations and redundancy: are we storing our archive, as the New York Times or Amazon or whoever. Because of the reliability of the networking, we do this economic process of
putting 100% of our reliance on one copy, 0% on the other, and then deleting copy two. And it means that our memory becomes really fragile. There was a story just a few weeks ago of a fire in South Korea that destroyed 800,000 federal workers' data. And you're kind of like, oh, what idiots. If I was the sysadmin, I would never have forgotten to do whatever. But no, actually, all of our data is that fragile, where a systemic shock like a fire really could destroy it.
Jack Cushman (01:03:38.031) Some is very well backed up, but most of it is subject to one or more correlated failure modes. So you're not necessarily picturing they only had one hard drive and should have had two hard drives. But you have to picture: they only had it behind one administrator password, and if someone stole that, it could be deleted, and it should have been behind multiple. Or they only had it in one geographic region. It was all in Amazon's data centers in Virginia, or it was all in California, and when there was a large-scale disaster, it got lost. Or they only had it on one brand of hard drive, and when that
Jed Sundwall (01:03:53.831) Yeah.
Jack Cushman (01:04:06.476) brand failed, it failed. Or it was all paid for by one source, and when that source either changed its priorities or changed its policy, it got deleted. There's a paper from the early 2000s from LOCKSS that lists their threat model. And they list, I think, about 14 of these kinds of correlated failures. Only one government. Whatever you can think of that is a failure. You could even go to, well, it's only on one planet, and start to think about how to fix that. But for now, even on one planet, there's a lot of correlated failure.
Jed Sundwall (01:04:12.806) Right.
Jed Sundwall (01:04:26.247) Right.
Jed Sundwall (01:04:31.292) Yep, right.
Jack Cushman (01:04:35.276) And so the problem becomes like, how do you beat economics? Like, how do you beat market incentives to have only one copy that is subject to correlated failures for stuff that matters mostly to posterity? We have a public data project, and I’ve thought a lot about what that means, public data. And really the way that I think about it is public data is data that is mostly valuable to people outside of the data custodians. Like, if you’re a company and you collect, you know,
Jed Sundwall (01:04:48.21) Yeah.
Jed Sundwall (01:04:52.701) Mm-hmm.
Jed Sundwall (01:05:01.852) Interesting.
Jack Cushman (01:05:05.174) internet visitor statistics so that you can model traffic and make ads better, that's private data. You're collecting it, you're using it, you're paying for it. If you delete it, you'll be the one who's sad about it. If you're a government and you're collecting, you know, what have our tariffs been over time, what has our school crowding been over time, you're doing that primarily for the benefit of people besides you, the person making the spreadsheet, or even your department, but for the world to be able to navigate properly. And so there's a kind of incentive misalignment:
the people who most value it are not in the room or able to advocate for themselves necessarily. And if you start thinking about what are all the kinds of data where people besides the ones holding the checkbook might care, it’s certainly things like government data sets, but it’s also things like the New York Times archive, all of the archives of news that are behind paywalls. Even like, I don’t know, YouTube. YouTube in many ways is the most important record of a bunch of things that have happened in the last 10 or 15 years. And like,
Jed Sundwall (01:05:39.165) Yeah.
Jed Sundwall (01:06:02.226) Yeah.
Jack Cushman (01:06:04.632) There's Google's interest in preserving that, and then there's society's interest in preserving it, and that's very hard to theorize. So public data becomes this kind of misalignment problem: we need to invest in something where the people who most care are not here to advocate for it. And that's what I think of as the one copy problem. Where do you intervene in the economics of this thing so that we can start to have durable memory of the stuff we most care about?
Jed Sundwall (01:06:22.406) Right.
Jed Sundwall (01:06:26.78) That's fascinating. I mean, you know, we've talked about this: I'm very interested in raising an endowment. That's going to be a huge area of focus for us, because I do think, going back to this discussion of focusing on the commodity layer, the very good thing about the tech sector that we have right now is that there is competition in it and there's plenty of downward pressure on pricing. And I think we can forecast costs well enough to endow the long-term preservation of data.
And what that could open up is you could say, look, we've endowed this data set, or, I should use my own terminology, we've endowed this data product to be available via these URLs for 50 years. Would you like to endow a copy of it? You know, and we are at the point where, if it's a terabyte of data, that is thousands of dollars. I mean, don't get me wrong, it's a real thing, but it's a one-time check that
Jack Cushman (01:07:14.733) Yes.
Jed Sundwall (01:07:24.644) a philanthropist can write, you know, it’s not, yeah.
Jack Cushman (01:07:27.606) I think it's such an important provocation, or design goal. Why can't you endow a terabyte? If you're like, this terabyte should exist for the next 50 or 100 years for humanity, why can't any of us make that choice and say, yes, I'm going to invest in making that possible? I don't know what apparatus you would use for that now. Actually, if you're a Harvard professor, I do know: I would tell you to use the DRS, the Digital Repository Service, which was founded about 20 years ago and is going through a whole reinvention right now.
I think some big institutions have learned how to think about this for themselves. But how do we make it something that is available, not just at Harvard, but across the world, if you have something you care about, how do you endow it? I love that question. I think it’s such a good approach to it to start to realign those incentives, to say that someone now, today, can make an investment in something to pass it to posterity. And then the other thing I love about it is it makes you start to think about
Jed Sundwall (01:07:57.725) Yeah.
Jack Cushman (01:08:23.372) What does it mean to last for 50 years? What steps should you take with that money, the money that you're handed when you endow a terabyte? And how do you defend against all of those correlated failure modes that LOCKSS laid out? I think the gnarly thing, the tricky thing that's at the end of that thought process, is you probably actually need multiple mutually independent institutions to be involved. Because
Jed Sundwall (01:08:45.97) That’s right.
Jack Cushman (01:08:47.124) you, Jed, become a single point of failure: well, if I can buy you, I can buy the thing you endowed. And that can't be how it works either. So there's a bunch of strategies, but how do we make it so that there is no one of us who can disappear and have the thing disappear?
Jed Sundwall (01:08:52.296) That’s right.
Jed Sundwall (01:09:01.02) That's right. Yeah. Oh man. Okay. Well, one last point I want to bring up: let's talk about the Smithsonian really quickly. Cause again, it's very relevant to everything we just said. What are your plans with the Smithsonian? I mean, what are our plans with the Smithsonian? You can say.
Jack Cushman (01:09:17.87) Absolutely. So the Smithsonian is our second major data collection after Data.gov. And this is something that came up in the data preservation community: whether the Smithsonian's public, out-of-copyright data set as a whole could be preserved, which is over 700 terabytes stored on Amazon.
Jed Sundwall (01:09:42.012) Okay.
Jack Cushman (01:09:43.277) And over 700 terabytes becomes enough that most projects are kind of, we can't take that on, that's too big a goal for us. But our public data project felt like we were able to do that, able to make a first collection. And then we talked with you and, very fortunately, you felt like you were able to take it on with us and move it to Source. So we start with this kind of giant blob of 700 terabytes that is really quite an undertaking for
our kind of community. It might not be a huge undertaking if we were Google, but for who we are, it's a big thing. And now we have it. And what we have right now is just a straight copy: get a copy from here, move it to there. I think the first thing we'll do is sign it, just like we talked about with the other thing. Just say, I aver that this is the copy I made, and I made it on this date. And from now on, you won't need me to be around to know that this is exactly what the Smithsonian had. But beyond that, we have to start thinking about access.
And how can people actually benefit from using that thing? One of the things I'm really excited about is whether we can make a kind of access copy that is much smaller and that you could just have for yourself. It's very common with these kinds of preservation data sets that you have a preservation version that is, say, uncompressed full-color images, which can be very large. And that's one of the sources of your 700 terabytes.
But if you accepted a small amount of compression, even visually indistinguishable compression, you could get down to 10% of the size. So I think exploring that: is there an access copy that is more like 70 terabytes instead of 700, that you could just have on your desk? 70 terabytes is still a lot, but you could get an enclosure that you could just plug into your laptop and say, the Smithsonian collection is here on my laptop to talk to. So I love that aspect of it. And then the other piece is we have to figure out discovery. What do you do when you just have
a collection that size land in front of you and you don't understand what's in it? There's one approach that is, when you click a file, you should be able to try before you buy and see what's in there. But the other approach is, at a millions-of-files level, how do you get a view of what's in here in general? What am I going to find if I start sifting through this? It's what people call exploratory data analysis, but I think we have to democratize that and not have it sound like something that only data scientists do.
Jack Cushman (01:12:04.65) Or law firms do it too: here are the hard drives of your client or the opposing party, just figure out what's on the hard drive. That's called forensic analysis. And I think both forensic analysis and exploratory data analysis, we have to move past them to: what can I click to understand what I'm looking at? How can we make this more something that everyone can get their hands on?
Jed Sundwall (01:12:09.416) Yeah.
Jed Sundwall (01:12:24.924) Yeah. Well, so actually that was crazy, because you just teed up next month's episode. The livestream webinar podcast thing will be with Matt Hanson, to talk about the SpatioTemporal Asset Catalog, STAC. So this is a metadata spec that has been very rapidly adopted within the geospatial world, and it solves that collection-level problem that you described, which is basically, I have a collection of spatiotemporal assets. So the
most common example you would think of is a collection of satellite imagery or drone imagery or something like that. What it is is you give people a JSON file at the root that says, here be spatiotemporal assets, collected between these times and covering this spatial extent. So immediately you can kind of tell, is this a timeframe or an area of the planet that I'm interested in or not? And you can move on, right? And those can be indexed, so you can search them.
Jack Cushman (01:13:21.943) Yeah, that makes perfect sense.
Jed Sundwall (01:13:25.232) That notion of figuring out the way to kind of distill a collection into something like at that high level so that you at least you’ve standardized. Here’s a bag. We can use any kind of metaphoric bag of collection. What are you gonna say? Like this is the universe it contains. Do you care or not? And move on. So this is a perennial issue. Yeah.
Jack Cushman (01:13:48.183) Yeah, if I could connect it, trying to wrap it up a bit: I think geodata is out ahead here, because geodata has always had this problem. You go to Google Maps, and you can zoom out until you see the whole world, and then you can zoom in until you see just one block. And structuring the data to allow that, to be able to jump in and out and see the right level of detail when it's all the same data set,
Jed Sundwall (01:14:01.469) Yeah.
Jack Cushman (01:14:12.046) has meant that geodata has to be very thoughtful about how the data is stored and indexed so that it's efficiently discoverable by the software that needs it, which is just what we were talking about with how we index our Data.gov viewer so that it can be fetched efficiently. We need to start thinking that way, about that very clever structuring of data, across the board for making things available. And kind of picture it like: we want to enable for everyone that Google Maps experience,
Jed Sundwall (01:14:25.821) Right.
Jack Cushman (01:14:38.926) that if you want to, you can zoom out and see the world of the 700 terabytes, and if you want to, you can zoom in and see the block. And you should be able to do both of those, and do them very cleanly, which for that community is completely obvious and has been true the whole time, with wonderful technology for it. How do we take that technology and make it work for any data set, I think, is a great challenge. I'll also say, I'm always kind of looking for where the bigger industry is headed. And I think AI is kind of a huge industry that blows us in a direction.
One thing that we're going to find as data people is that indexing is critical to AI research and AI practice. From a library perspective, using an information tool, there's a question of, is the model smart enough? There's a question of, does the ground truth even exist, and is it possible to fetch it? But in between those two is: do you have an index that can get the correct answer, instead of the wrong answer, into your model's context when you need it? And if you can do that, if you have those indexes, then you can make
Jed Sundwall (01:15:24.775) Right.
Jack Cushman (01:15:35.459) data tools that actually empower individuals, which is what we think about at the library. And if your indexes are bad, then you’re going to get the wrong answer in context, and it’s going to hallucinate or tell you the wrong thing, and it’s going to disempower people. It’s going to hurt them, which means we have this weird position that I think is a surprise for me as a library person and maybe a surprise to other folks that all of a sudden, indexing is cool. How you index your data is going to really matter. And I think it’s such an opportunity for us because we’ve been thinking about indexing forever. And now that it’s cool, let’s figure out
what we know about it that we can share.
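Jack's point that retrieval quality hinges on the index can be illustrated with the simplest index there is: an inverted index mapping terms to documents. This is a toy sketch (the dataset names are made up for illustration); real retrieval stacks layer ranking, embeddings, and much more on top:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def lookup(index, query):
    """Return ids of documents containing every query term."""
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

# Hypothetical dataset descriptions standing in for a catalog.
docs = {
    "cdc-1": "CDC mortality statistics by state",
    "noaa-3": "NOAA sea surface temperature grids",
    "usgs-7": "USGS earthquake catalog by state",
}
index = build_index(docs)
print(sorted(lookup(index, "by state")))  # ['cdc-1', 'usgs-7']
```

A lookup like this is what decides which documents ever reach a model's context: a bad index surfaces the wrong candidates, and no amount of model intelligence can recover from that.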
Jed Sundwall (01:16:06.994) Yeah, your day has finally come. So as we wrap up, I want to say: your mention of how the geo community is ahead here is, I'm sure, flattering to those of our community who are listening in. But we did get one comment on LinkedIn from Linda Stevens, who we've worked with in the past, and she's worked in the geospatial space for a really long time. She made the comment that you have to certify a map at different layers. You have to track and certify all the layers that make it up.
It underscores the point that you made, which is that maps are these confections of data that we've been figuring out how to create. I mean, it's such a rich field. Cartography is amazing, because we've been trying to figure out how to downscale so many things we know about the world into something that's legible for humans, and then assert that in a way that's credible. It's a huge challenge.
Yeah, I would say my theory for why the geo community is out ahead is that most of us gave up on getting super rich a long time ago, as opposed to, say, the life sciences community, where I think, you know, there's real gold in those hills. People think they're going to cure cancer and make a ton of money, which is great. I want them to try to do that. But the geo community is just generally much more open, and I think it just has such a long history of sharing information. I mean, it's
Jack Cushman (01:17:14.189) Mm-hmm.
Jed Sundwall (01:17:34.77) core to what we do that.
Jack Cushman (01:17:36.246) Maybe try checking your maps for any hills that have gold in them. It’s probably worth a shot.
Jed Sundwall (01:17:40.361) We already did that, you know, that's the point. Like, yeah, we already found those. Yeah, I mean, don't get me wrong, in recent years it's been lithium, you know, there's always going to be something else. There's money in understanding spatial data for sure, but the mad rushes are over, and there's a huge community that's just, I think, very generous. And so, yeah.
Jack Cushman (01:17:42.946) We found those already.
Jack Cushman (01:18:03.362) You know, I love Linda’s point, too, that you do have to certify at every level. I’ve seen some of the work that goes into designing a product like Google or Apple Maps, where things have to appear or disappear as you go in and out. It has to be the right things. That has to be the things I care about at each level. And sometimes it’s better, sometimes it’s worse, as they’re kind of iterating on what should I show you. And it’s such a wonderful little example or crucible for how we do data in general, because you have a bunch of ground truth. People went out and wrote things down.
Jed Sundwall (01:18:15.782) Yeah.
Jack Cushman (01:18:31.63) It was maybe accurate at the time I saw it. It’s maybe not. You’re integrating a bunch of different views of the world. There’s a bunch of research just going into how do you tell if two data points are one store or two stores, all of that kind of integrating views of the world into one. And then once you’ve integrated into one view of the world, then there’s how do I express this to you so it’s not a lie? I could show you a map of your neighborhood so that I’m showing you the gas pipes and you’re just confused. I could show you one so that I’m showing you the benches and the things that you care about.
And am I meeting you where you are, so that what I'm showing you empowers you instead of disempowering you? And am I doing that without oversimplifying it so much that in fact I'm lying to you and disempowering you that way? And so it's this perfect combination of seeing the world and getting ground truth, integrating it and deciding which things you're going to believe and which you're not, and then debating, well, how are we going to show this to people so that we're empowering them? What do we share with them? How do we lead them?
Let them get more expertise when they get it. I just love all of the parts of that design problem. And then it's kind of like, now welcome to all the rest of it. What if it was a pile of zip files and some PDFs and some instructions, and the mess of the world? And I'm not saying that we haven't thought about this; it's something that the data community has thought about for ages: how do you make those wonderful interfaces so that people can find the stuff they need outside of Maps, too? I think there's so much more room for us to improve on that, and that'll be really exciting work to do.
Jed Sundwall (01:19:56.06) Yeah, well, let's do it. I mean, I think we're very aligned, and we want to create the conditions to let lots of people run those experiments and make that possible. So yeah, let's go. Well, thanks so much, Jack. I think this has been awesome. Hour and 20 minutes, not bad. Yeah. Yeah.
Jack Cushman (01:20:11.566) Thank you, Jed. I really appreciate it. Thanks for giving us a chance to talk about this stuff, and thanks to folks for listening. I think we'd love to keep debating more: what are we meant to do, and what are we meant to save, and how do we save it, and how do we pass on to humanity what we should? I just really appreciate the chance to talk about it with you.
Jed Sundwall (01:20:29.831) Okay. Well, we'll keep talking. Thanks, Jack. All right. So we're going to stop, and then…
Jack Cushman (01:20:32.622) All right. Take care.