→ Episode 7: Demographic Data and the Perfect Census
Show notes
Jed talks with Christopher Dick, founder of Demographic Analytics Advisors and former Census Bureau researcher in the Population Division, about what it takes to count 330 million people, why that process is harder than it looks, and how demographic data moves from raw survey response to useful data product.
The conversation covers the mechanics of the decennial census and the annual population estimates — two distinct products with very different methodologies — along with the legal framework of Title 13, which governs how Census data can and cannot be used. They discuss the cookbook problem: individual data points may be public knowledge, but compiled datasets cross a line and can create liabilities, and the Census Bureau’s privacy obligations reflect that tension. He also discusses the long-term challenge of declining institutional trust and what it means for census participation rates.
Chris describes his current work helping school districts model enrollment trends and helping states build their own population estimates. That ground-level work surfaces just how hard it is to get clean, consistent data from local governments — attendance zone shapefiles that don’t exist, parcel data behind paywalls, planning files locked in PDFs. The conversation closes on sustainable data models: who should own public data, what vendors are actually selling, and why clients who want precise 10-year forecasts are a warning sign.
Links and Resources
- Demographic Analytics Advisors — Chris’s consulting firm
- dataindex.us — Tool for monitoring federal data infrastructure and tracking risk of datasets degrading or disappearing
- Census Bureau API — Programmatic access to Census datasets
- tidycensus by Kyle Walker — R package for retrieving Census Bureau data in tidyverse and spatial-analysis-ready formats
- Hannah Recht — Data journalist at the AP who built one of the go-to R packages for the Census API
- Feist Publications, Inc. v. Rural Telephone Service Co. — The 1991 Supreme Court case establishing that raw information without original creativity cannot be copyrighted (the “cookbook problem”)
- Reality Has a Surprising Amount of Detail — John Salvatier’s essay on why complex systems are harder than they look
- Nissim Lebovits — Software engineer and urbanist building open-source geospatial tools for cities
- Ian Dees — Developer and one of the key contributors behind OpenAddresses
- OpenAddresses — Free and open global address collection providing street names, house numbers, and coordinates as open-licensed infrastructure
- OpenTopography — NSF-funded platform for high-resolution topography data including lidar and photogrammetry
- Japan Earth Observer — Robert Cheatham’s newsletter covering space, Earth observation, and the geospatial industry in Japan
Key takeaways
- Self-response remains irreplaceable — Administrative data and commercial sources can supplement census counts, but population characteristics like race, age, and household structure require direct self-identification. Commercial data that includes race is typically modeled off census data, so if census data degrades, everything built on top of it degrades too.
- Title 13 shapes what’s possible — Census Bureau data is legally protected from use outside its stated purpose. Compiled data like the master address file cannot be released even when individual addresses are technically public. The tradeoffs are deliberate, but they’re rarely explained to data users.
- Declining trust makes counting harder — Response rates fall as trust in institutions falls. There’s no purely technical solution; it requires both civic leadership and more creative outreach methods, and the two have to move together.
- Purpose-built tools beat general-purpose portals — The Census Bureau’s LEHD program, which built several focused tools rather than one comprehensive interface, is a better model than a single portal trying to serve every possible use case.
- Data ownership matters at procurement time — School districts and local governments should own the data they commission — shapefiles, forecasts, all of it. Vendors who retain ownership create lock-in that costs more over time and leaves public institutions without access to their own records.
- Forecasting requires honesty about uncertainty — A 10-year enrollment forecast cannot give a precise number. Any vendor claiming otherwise is overselling. The value is in understanding the range and planning accordingly — clients who demand certainty are a signal to walk away.
Transcript
(this is an auto-generated transcript and may contain errors)
Jed Sundwall: Chris, thank you for joining Great Data Products. Everyone out there, thank you for coming along to listen. I’m Jed Sundwall. I’m executive director of Radiant Earth. This is a podcast we started to — it’s a live stream webinar podcast thing that we started to just elevate conversations about how hard it is to produce great data products and talk to the heroes that are making the dream work. And so Chris Dick is here to talk about demographic data, census data, and his career. He bears a lot of scars from producing, from holding a high bar, I would say, for producing really good data about our country. Before we get into it, I just want to say we are still planning — we’re going to do the CNG Forum in Snowbird, Utah, October 6th to 9th.
Early bird tickets are available until May. We’ve got a long ways until October, but you should get tickets. We also have a call for proposals there. It’s 2026.cloudnativegeo.org or if you go to cloudnativegeo.org, there’s a link to it. So do that. Also, this is news to the world — we haven’t announced this super publicly yet, but we will be doing CNG London, near Kings Cross station, in a building called the Jellicoe, on June 23rd. With London Climate Action Week, we’ll be doing a cloud native geo event there. And then one quick update on Source — we’re going to soon have a new landing page for Source, which we’re excited about because to date it’s just been a list of data products that is confusing to most people. We are now hosting over four petabytes of data on Source, which is incredible. We’re doing about 300 million requests for data every month.
Very grateful to everyone who’s publishing data on Source. It’s still invite only, but you can reach out to us at hello at source.coop if you want to talk to us about it. So that’s the housekeeping. Chris, please introduce yourself.
Christopher Dick: Of course. Thanks for having me on. Hey everyone, my name’s Chris Dick. I run a company called Demographic Analytics Advisors. We do a couple different things. We advocate for the importance of, especially federal, but really all public data. I have a deep background in census, especially at the Census Bureau, which I’ll talk about here in a moment. But I also do a lot of demographic work for K-12 educational institutions. So a lot of public schools throughout the country — a lot of enrollment forecasting as well as helping to draw attendance zone boundaries. So I started my career actually at the Census Bureau and I was not one of those folks who actually did the decennial census, though I had a lot of knowledge of things that were going on there. I actually did the thing that we do in every year that doesn’t end in zero, which is the population estimates. So basically, every 10 years we have the actual go-out-and-count-the-people. But that’s so expensive and such a huge thing that we can’t do it every year. So in all the other years, we have to create these population estimates. And there’s a whole process we do that by, using a bunch of different types of data to create those population estimates. And then I did a few other things in between that maybe we can get into today, but those are kind of right now the bookends of my career that are really always pushing me into the demographic data side, but I’m also just a general lover of data and especially public data.
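The annual estimates Chris describes are conventionally built around the demographic balancing equation: roll last year’s population forward by births, deaths, and net migration. A toy sketch in Python (all figures here are illustrative, not real estimates):

```python
def estimate_population(base: int, births: int, deaths: int,
                        net_migration: int) -> int:
    """Demographic balancing equation: next year's estimate is the
    previous population plus births, minus deaths, plus net migration."""
    return base + births - deaths + net_migration

# Purely illustrative numbers for one year of a hypothetical state
pop_next = estimate_population(base=1_000_000, births=12_000,
                               deaths=9_500, net_migration=3_000)
# pop_next == 1_005_500
```

The real process layers many administrative sources (vital statistics, tax records, Medicare enrollment) onto this simple identity, but the identity itself is the backbone.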
Jed Sundwall: Okay, wonderful. And I mean, I have a saying — this comes up a lot — like, in fact, just right now when I was saying we host all this data on Source and we’re logging hundreds of millions of requests every month. I was just talking to the team yesterday about this — weird things happen when you start to count millions of things. And I’m curious, so we have hundreds of millions of people in this country. How do you describe to mere mortals how we count people in a country, like a population?
Christopher Dick: I mean, it’s a huge undertaking. It is the largest non-military thing that our government does, right? So every single decade, we have to hire hundreds of thousands of people to go out and actually help count. And each decade, it’s kind of this new process to go out and actually do the census. And there’s always technological innovation each time. 2020 was really the first online-first decennial census. So that was really cool. But really, at the end of the day, we’re going out, we’re trying to get people to respond themselves. And so there’s a ton of marketing that happens around that to really get people to respond. But at the end of the day, we have to have boots on the ground. We have to have people actually going and knocking on people’s doors and saying, hey, you haven’t answered your census yet. Can you tell us how many people live in this housing unit? And can you give me a little bit of information about them? So it’s an enormous, enormous thing. And then merge that together with the fact that in 2020, they were doing this really at the height of the start of the COVID-19 pandemic. It was just a huge, huge thing for the Bureau to be able to do.
Jed Sundwall: Yeah, I mean, well, and this is interesting, though. The idea of surveying the country on foot and going to talk to people is clearly the way you had to do it before. There are many alternative ways to count people now. And I’m curious to know your thoughts on that. Do you think we should be heading in different directions? Do you have any insight into the evolution of how we count people beyond the boots on the ground approach? Can we ever get away from that, or should we?
Christopher Dick: I think the answer is different depending on your time horizon of when you’re talking. I think personally for 2030, which will be our next decennial census, it would be very hard for us to move away from self-response. Self-response is the highest quality response. It also allows people to self-identify things, right? It allows people to self-identify things like race, Hispanic origin. It allows them to self-identify their household structure. And so that’s all very important for us to be able to do. But the Census Bureau is always doing a ton of research about how we can use some of these supplementary — they often talk about them as administrative — data sources. These are data that are collected for another reason, but then can be used to help count things. And these may be things that are collected by other government agencies or even other surveys from the Census Bureau, like the American Community Survey. They could be commercial products as well. The big thing is when we’re talking about the characteristics of the population, that has to come from somewhere. Right? And so the real question becomes: where are we getting that information? And if people are answering a form or we have commercial data that has some of those people-level characteristics on them and those data are being collected for a different reason, are they going to be of as high quality as something that is being collected for the sole purpose of enumerating our population? And I think the answer right now is no — it is a definitive no, is not as good. But I don’t think that has to always be the case. So it’s something that I think the Census Bureau and those of us outside the Census Bureau who care about good data need to be constantly thinking and pushing ourselves on.
Jed Sundwall: Yeah, I’d agree with that. I mean, I think like you have to — something like the Census Bureau or like some entity needs to maintain a high bar for what matters and then figure out how do we maintain this high bar while doing things as efficiently as possible at the same time.
Christopher Dick: Yeah, it’s hard.
Jed Sundwall: Well, it’s really hard. Things are moving so fast. So this is a wacky thought experiment. We all now have cameras. We all walk around with cameras, video cameras all the time that are really good. And we can imagine a time when the quality of the sensors on our cameras is so high that you could detect micro fluctuations in people’s skin — meaning the sensor could detect health conditions or something like that in a way that we can’t with our own eyes. And there are concepts of public knowledge, which is — I think legally, it’s not privileged information where you live because people can see you coming in and out of your house every day. But then what happens when there are ubiquitous sensors that can see things about people that humans couldn’t see? And suddenly we’re violating everyone’s health privacy or something like that.
Christopher Dick: Yeah, I mean, I am always, by training, a data analyst, data scientist, whatever you want to call the thing that we do — looking at data and trying to use it to solve problems for organizations. I always want to be like, what’s the highest utility data that I can have? But at the same time, data can be used for bad. We all know this. It’s something we really need to be thinking about. And it’s the same thing when we’re talking about the Census Bureau getting data — what, where are they getting the data from? If we’re merging all of these data sources together into some giant database, what could they be used for? What safeguards can we apply? And the Census Bureau already has some pretty decent safeguards, though they were written quite a while ago. So they might not be keeping up with technology. Things like Title 13. And these things provide a lot of privacy protection, but we also see them potentially being abused by folks. So we always need to be thinking about the fact that, yes, privacy has to be part of this conversation. I will never say that I am a privacy expert or privacy scholar, but I am very interested in data privacy because I just think it’s important for society. And I think we need to at least be able to know which data we’re giving up and what it’s being used for.
I still think even with a lot of the data we have, even with a lot of the advances that we have, so many of those models are built off of public data, right? Like they have public data that are helping ground truth them. I remember going in and looking at different commercial data sources about people — people-level commercial data sources. And a lot of them have race and ethnicity information on them. You start digging into their methodology. How do they get there? And it’s usually some combination of census data and using surnames, things like that, trying to build a model of what the likely race of this person is. And what happens if that census data goes away? Maybe for 10, 15 years, it’d be okay. But over time, that’s gonna degrade. So you need to have that high quality response, even if we’re getting a lot of these sensor data or other data sources.
Jed Sundwall: Right, right. Do you mind, can you expand upon what Title 13 is? Because I don’t know.
Christopher Dick: Yeah, I’m not a lawyer. I’m not going to say this exactly right. But it’s basically this idea that the Census Bureau has the right — or is actually required by the Constitution — to collect certain data sources. But when they collect those data sources, they are only to use them for certain things. So the Census Bureau — like you were talking about address-level data not being protected data. Well, once those data go into the Census Bureau, like the master address file that the Census Bureau has, which is how they run their decennial census — the read of lawyers within the Census Bureau and government lawyers in general is: once those data go into the Census Bureau and they have been compiled to create the census, they’re covered by Title 13. We can’t just release that address file back out. So this is the general legal protection of: the data shall be used to do these things, to create estimates of the population, survey, understand the population. They shall not be used for anything that is not that. And then the Census Bureau uses other administrative data sources, like things from the IRS, that are then covered by other laws. Title 26 of the Code is basically even more protective — just the fact of filing is protected. You can’t even say that a person filed their taxes.
Jed Sundwall: Right. Interesting. Reminder to everybody to do their taxes.
Christopher Dick: Yes, it’s timely.
Jed Sundwall: Yeah, it’s the worst. Okay, thank you for that. I mean, you’ve revealed so much that doesn’t get talked about enough. The fact that taking address information — my address as a single data point, I might be in the white pages somewhere, but interesting things happen when you have a lot of addresses. I call this the cookbook problem. I think there is actually some Supreme Court case about white pages where somebody copied the white pages and republished them somewhere. And I think it went to the Supreme Court and they said, no, you can’t protect this — this is not protected by copyright. But I call it the cookbook problem because there’s a similar kind of understanding of fair use: I can share a few recipes from a cookbook with you, but if I share all of the recipes, I’ve crossed some sort of line. And so it’s an interesting thing about data products — atomized single data points aren’t interesting on their own, but compiled, they become something very interesting and different.
Christopher Dick: Yeah, and to be very clear, I don’t think it’s that the Census Bureau doesn’t want to share them. I think they feel like, to protect privacy and based on the law, they shall not share them. And that’s — I think in a lot of ways, they would actually view it as a public good to have that out there and for folks to be able to use it. But I think it’s more of a utility versus privacy discussion, as well as a letter-of-the-law type discussion.
Jed Sundwall: Yeah. Well, I mean, it makes obvious sense to me. And yeah, so I worked a long time ago for gobiernousa.gov, which is the Spanish language version of usa.gov. That was a huge issue for GSA and for the communities we worked with at the time. It was around the 2010 census — we’re like, you can be included in the census, we want you to be included in the census, regardless of your immigration status. Certainly today as well, there are a lot of people in the country who think, I don’t trust the government, I don’t want to raise my hand and make myself known to the government. There’s no way I’m gonna share my data with you because I don’t want to get ratted out. Yeah, it’s gotta be a constant challenge for Census to be like, please trust us, we’re not gonna rat you out.
Christopher Dick: Yeah, and a lot of the messaging really leans on that idea of privacy protections and that they are required by law to protect their data. But to be completely honest, I don’t think people necessarily believe that. Some people believe that. I think probably more people don’t believe that now. We’ve generally seen, from the different polls, the different survey results talking about trust in institutions and how it’s gone down over time. This is a longer term pattern. Getting people to respond to a census or any survey just gets harder and harder every year.
There’s a confluence of factors there causing that. But I think it drives two things. Number one, at some point, we have to be able to find a way as a country to be able to drive trust in institutions again. And that has to probably come from both directions — from leadership of the country, as well as from us as citizens. And I think also we have to be thinking about new ways to reach people and to survey people. But I truly don’t think you can, in the longer term, science your way out of this. If people really truly don’t trust, we get into a situation where, even if you come up with the most creative thing in the world, you’re just not gonna have enough data to build it off of.
Jed Sundwall: Right, interesting. So tell me, tell the audience — what do you do? What do people pay you to do in your firm?
Christopher Dick: Yeah, so it’s a couple things. I get interested by a lot of different things and I have some different backgrounds that help me do a variety of projects. But I would say my real core of work really does kind of drive back onto that demography side of things.
So I work for a couple states helping them create their population estimates and forecasts — much like for the country overall, we need population forecasts for states as well. And for a host of reasons, some states create their own estimates — either because they have data that the Census Bureau doesn’t necessarily use in their methodology and therefore they build their own and then compare them to the census and decide which one is going to be their official, or because they need lower-level geographies that the Census Bureau doesn’t build. Maybe they need city or town level data by demographic characteristics, and the Census Bureau only goes down to the county level by demographic characteristics on the yearly estimates. So I do those type of projects.
I also work a lot — I would say half of my work is working with public school districts, trying to understand how they’re going to grow, decline, or mix of both. And what that means for where kids are gonna go to school, where new schools might need to be built, where perhaps some buildings are now excess and need to be closed. And I find great joy in that type of work because, number one, I can use my skills to help drive decisions, but it’s also so much closer to the community than I was at the Census Bureau. I loved being a data producer at the Census Bureau. I loved creating something that so many people used, but I often felt really removed from the people who were actually using the data or the data were impacting. And a lot of the school district work I do, I feel very close to the people that I’m doing the work for.
And then the other side — I always like to talk about how I started my company. I have a young family and I was ready to just kind of go out on my own. And Denise Ross, who I know you’ve had on the podcast, was actually my first client. She basically helped me feel confident enough to really launch a company. And that also really got me into the world of census and federal data advocacy, which I always cared a lot about, but I was really outside of that core group. More recently, starting in 2025, we started building something called dataindex.us, which is a tool that actually helps us monitor the federal data infrastructure and understand the risk of certain datasets — either quality degrading, going away, or being impacted by RIFs or budget cuts — and really understanding how that might impact the data that we can use in the future.
Jed Sundwall: That’s amazing. In your work, what are the attributes that are sort of most important and meaningful and impactful? This is a pretty loaded question, so I defer to you on how you want to answer it.
Christopher Dick: Yeah, well, it totally depends on the work that you’re doing, right? Like, why you’re doing the thing matters on what you’re measuring. And that’s a lot — a lot of the things that you want to measure are like measurable parts of people, like their age. Some of these things are more like socially constructed, and so it depends more on how someone views themselves and how society views them. And then some of these things are just more like harder measures — like, how much money did you make last year? How many cars do you own? Things that are very planning-based.
Obviously, when it comes to the decennial census, there’s just a handful of questions asked. It’s talking about a housing unit, and then you’re looking at the relationships within that household. And then you’re looking at a very small number of characteristics about the people: how they’re related to one another, age, sex, race, Hispanic origin. But when I start doing a project on something like enrollment, you’re starting to want to dig into different variables that give you information both about the people that live in a place as well as the built environment of a place. And so I see myself doing a lot of: okay, what do I know about the students here? How many students are in different programs? Are there students who are on individual education plans, IEPs? Can we look at free and reduced lunch? Can we understand the distribution of that? Can we understand the places that these kids are living? Can we understand the transient status of some of these students?
And sometimes you just have to be creative about what different data sources you have. Can you think of places that have looked like that in the past? Can you build a model that tells you what that should look like moving forward? So it’s — I hate to give a total rambling non-answer, but it totally depends on what you’re trying to do. There’s some core things that you always want to be looking at, but beyond that, there’s a lot of things that you often need to add on to answer the question correctly.
Jed Sundwall: That’s great. No, it’s perfect actually, because it’s so practical. I want to talk about file formats though. This is all tabular data, right? When at rest, where does this data live? Is this in databases? Are they tables?
Christopher Dick: So I mean, to the Census Bureau’s credit, they try very hard to make these data accessible and easy to understand, but they have a very hard job. The Census Bureau collects a lot of data and it ain’t just the census. In general for the census, you have a few different ways that you can get those data. You have the actual census API. And for those of us who are open-sourcey type people, there’s some great packages out there. I always love to give shoutouts to my two favorites in R, which is my language of choice — tidycensus by Kyle Walker, and then censusapi from Hannah Recht, who is a data journalist.
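The API Chris mentions returns JSON as an array of arrays, where the first row is the column names and each remaining row is one geography. A minimal sketch of parsing that shape in Python (the request URL and variable code are shown for orientation; treat specific variable names as examples to verify against the API documentation):

```python
import json

# A request typically looks like:
#   https://api.census.gov/data/2020/dec/pl?get=NAME,P1_001N&for=state:*
# and returns a JSON array of arrays: header row first, then data rows.

def parse_census_response(body: str) -> list[dict]:
    """Turn the Census API's array-of-arrays JSON into a list of dicts."""
    rows = json.loads(body)
    header, *records = rows
    return [dict(zip(header, rec)) for rec in records]

# Illustrative response body (one state row)
sample = '[["NAME","P1_001N","state"],["California","39538223","06"]]'
rows = parse_census_response(sample)
# rows[0] == {"NAME": "California", "P1_001N": "39538223", "state": "06"}
```

The R packages he names (tidycensus, censusapi) wrap this same pattern and hand back tidy data frames.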
You can also download — if you’re trying to get a full download of the whole dataset — they just have an FTP. So you just go and you can pull the data from the FTP server. That, if you’re doing bulk downloads, is obviously your fastest way to go. And the geography division of the Census Bureau has some very good shapefiles and other formats as well.
That’s census data. Census is doing a ton of work to make this relatively easy. In a lot of my work, I have to pull data from other places too, and it’s not always that easy. If I’m lucky, like let’s talk about development data — I’m often pulling that from planning departments. If I’m lucky, they have an open geospatial platform where I can just pull their parcel data, pull their development data, match those things together. No problem. But sometimes there’s just no data at all. It’s just in PDFs. And if it’s a small amount, you got some manual entry to do. I’ve even had clients where their shapefiles for their attendance zones — from whoever they worked with prior, they never got their shapefile, they just got the maps. So I’ve literally had a school district where I asked them, could you send me your shapefile or equivalent? And their first response was, “What’s a shapefile?” I’m like, great. Do you have a PDF of a map? No. “But we do have a map in our bus barn. Can we take a picture of that and send it to you?” I’m like, yeah, if that’s what you have.
Jed Sundwall: Great. Yeah, better than nothing.
Christopher Dick: Yeah. I always think of the worst case scenario — before I started doing this school district work, I always thought worst case scenario, I’m gonna have to deal with something that’s in a PDF that someone has tried to do something fancy table-looking that is not gonna OCR very well. And it’s like, no, it can get worse than that.
Jed Sundwall: Right. We all know there’s just — I can’t remember the guy who said it — but like 15, 16 years ago, when I created this thing called Open San Diego to advocate for open data about San Diego, there’s some guy who came to one of our meetups and he’s like, “Open data is great until you open your first zip file.” And it’s just a great line. You’re like, this is so exciting, this data is out here. And then you get your hands on it and you’re like, oh man, I have a lot of work to do.
Christopher Dick: Yeah. And I mean, the census data — when they actually put out the FTP files of especially the redistricting data, cause they get it out very quickly — even those later demographic characteristics files — I mean, these are old school flat files where you need to have a program to read in what the headers are supposed to be because it’s all by column position. So they’re just trying to make them as small as possible. And for people who aren’t used to that, it’s kind of like if you’ve never programmed on the command line — you’re like, what am I doing right now? For those of us who are nerdy like me, I was like, I get this. But for a lot of folks — demographers who tend to be social sciency, not as big of programmers — they’re like, wait, what is this file structure?
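Reading those column-position files is mechanical once you have the layout. A minimal sketch in Python (this field layout is hypothetical; the real layouts ship in the technical documentation for each release):

```python
# Hypothetical fixed-width layout: field name -> (start, end),
# 0-indexed, end-exclusive. Real layouts come from the file's docs.
LAYOUT = {
    "state":  (0, 2),    # state FIPS code
    "county": (2, 5),    # county FIPS code
    "pop":    (5, 14),   # zero-padded population count
}

def parse_record(line: str) -> dict:
    """Slice one fixed-width record into named fields."""
    return {name: line[a:b].strip() for name, (a, b) in LAYOUT.items()}

record = parse_record("06037010014009")
# record == {"state": "06", "county": "037", "pop": "010014009"}
```

This is exactly the “program that reads in what the headers are supposed to be” that Chris describes: without the layout document, the file is just an undifferentiated stream of digits.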
Jed Sundwall: Well, I mean, just — and we’re all about that use case with everything we do with Source Cooperative. Our people are the command line people. They want to use DuckDB. They want to just reference files. Flat files are awesome. I mean, I have very strong opinions about this. They are much more durable than databases that you have to maintain. And so we really push people in that direction. And also in 2026, we have pretty ready access to the greatest power user of all time, which is some large language model that can do all this stuff too.
Christopher Dick: I truly think that’s just a big thing for a lot of these different organizations because they know if they publish the documentation with it, people are gonna be able to get there or people are gonna write helper libraries to get there. And I mean, at the time, especially when they were first publishing these things in machine-readable formats, that was — like, you really needed to make sure, even though these files don’t look big to us anymore. But at the time they were pretty big files. They would have to separate them into multiple files, and it was a big deal. You gotta meet your users where they’re at too.
Jed Sundwall: I remember back when I was working at AWS, I said on stage, “Spreadsheets are good actually.” Because we were building the open data program at AWS, which hosts petabyte-scale data, these huge things. And I’m like, spreadsheets are actually awesome — they’re an example of a very widely distributed technology. There’s the quote from the science fiction writer William Gibson: “The future is already here, it’s just not evenly distributed.” Anyway, I’m like, spreadsheets — everyone can use them. A CSV is phenomenal. You can open them anywhere. After that event, somebody came up to me in the hallway and they like yelled at me. They’re like, “How dare you? The future is exabyte-scale, trillions of records. You’re not gonna be able to do that in CSVs.” Which is fair — sort of — but still. Tables will never die. And here we are now, in the world of Parquet, where we are really scaling up tables in really interesting ways.
Christopher Dick: Yeah. And I mean, the Census Bureau changed — I can’t remember which one it was — but one of the tables changed quite a lot between 2010 and 2020. And a lot of people, myself included, were like, what are you doing? You just broke one of my programs. And they’re like, well, there was good reason X, Y, and Z. And I’m like, okay, yeah, those are good reasons. But I’m still annoyed. Because I just had to update my program and this just took me an extra 20 minutes instead of being super fast. It’s me being annoying — I shouldn’t be that annoyed about it. But at the same time, I remember the story from the Census Bureau around 2010, where certain tables needed to be run and the person who needed to run them was no longer around. And they were literally written in Fortran and there was no one else at the Census Bureau who knew how to rewrite these things in Fortran. So they quickly had to hire either a contractor or bring someone back who had that knowledge so they could get the table out on time. So it is totally — the older technology, and just slow to change because they’re only doing it every 10 years.
Jed Sundwall: Well, right, yeah, that’s a tough one. I have this old headline I always think about from when I was a government social media expert — I’d read this blog called Inside Facebook, and they posted an article that was like “Facebook redesigns newsfeed again, amid unrest.” It was this perfect headline for that very move-fast-and-break-things moment. And just to designate that as unrest is hilarious. But it’s true — you will get unrest when you change the format.
Christopher Dick: Yeah. And you do see it with the Census Bureau — a lot of things don’t change like that file format of the fixed width files. They try not to change them, they know people get mad if they do. But even just over the last 15 years, the number of iterations of like data tools that have been the main data tool for the Census Bureau — it’s changed multiple times. There was something before American FactFinder, and then it was American FactFinder, and now we have data.census.gov. And I still see some data users talking about, I just wish I could have American FactFinder again. People get used to the thing. But then obviously a lot of people had complaints about it as well, and that’s why they upgraded to something new.
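The fixed-width files Chris mentions are parsed purely by column position, which is part of why their formats are so sticky. As a hedged illustration — the field names and positions below are invented for the example, not an actual Census file layout (real summary files publish their exact positions in the technical documentation) — parsing one record looks like:

```python
# Hypothetical fixed-width layout: (field name, start, end) as 0-based slices.
# These positions are made up for illustration, not from a real Census file.
LAYOUT = [
    ("state_fips", 0, 2),
    ("county_fips", 2, 5),
    ("population", 5, 14),
]

def parse_record(line):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}

print(parse_record("06075   874961"))
# {'state_fips': '06', 'county_fips': '075', 'population': '874961'}
```

The upside of this format is stability; the downside, as in the Fortran story, is that the parsing logic lives in code somebody has to maintain for decades.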
Jed Sundwall: Yeah. Well, this is actually a big reason why I have for many years advocated against creating interfaces and APIs to data. Because in doing that, you’re baking in opinions and you’re creating expense for yourself and debt and these expectations that are just very annoying, if not expensive. And if you can focus on the core data itself at the file or object level — if you get that right, you open up all this flexibility for people to do whatever they want. I would be very interested in — and we were talking before the call — I am in the position now where I do have money to fund a few little projects. I’ve thought about doing this for 990 data, the IRS tax filing data for nonprofits. Getting as much of it as we can and trying to create a harmonized store of Parquet files that would just make it dead simple for people to query. But I’d love to do that for census data too. If there’s somebody out there who wants to do an experiment with using object storage rather than an FTP-based approach to making a ton of data available, we should test that out.
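One hedged sketch of what “object storage rather than FTP” could look like in practice — the dataset name and partition fields here are hypothetical, not an actual bucket layout — is a predictable, hive-style key scheme that query engines can prune without any API layer in front:

```python
def object_key(dataset, year, state_fips, filename="part-0.parquet"):
    """Build a hive-partitioned object key (key=value path segments).

    Engines such as DuckDB, Athena, or Spark can skip whole partitions
    based on these path segments, so basic filtering needs no API layer;
    the naming convention does the work. All names here are invented.
    """
    return f"{dataset}/year={year}/state={state_fips}/{filename}"

print(object_key("irs990", 2022, "06"))
# irs990/year=2022/state=06/part-0.parquet
```

The design point is that the “interface” is just a stable naming convention over flat files, which is exactly the file-or-object-level focus being argued for.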
Christopher Dick: Yeah. And I will say I both agree with you and disagree with you on making data available through tools. Because I think where we run into issues is if we build tools that we think are going to be general purpose and are going to solve all problems — this is just a general tool that’s going to let you pull the data — and trying to make that usable for every single use case ever. Whereas I think building data tools to solve a specific problem — that’s a super useful thing. The underlying data store, and then building the tool to actually solve a particular problem.
I think groups like the Census Bureau are in a really hard position because a lot of people want very easy access to the data. And that can mean a lot of things to different people. Do you need to build a ton of different tools that are very specific to different datasets? And how do you scale that? But you start looking at particular datasets where it’s been done, I think, at least somewhat successfully — if you look at commuting data, the LEHD program. The tool itself looks a little outdated, but what it is doing is helpful. And then you can pull that out of it and do even more with it. Instead of purpose-building one thing where you can get data from all the different parts of LEHD, they purpose-built several different things. And to me, that’s a very interesting take on the way to do it. I don’t know that it’s fully scalable across all of the Census Bureau, to be completely honest. But I think it works really well for that group, and I think it would work well for a lot of different data assets.
Jed Sundwall: Yeah, I mean, I have so many thoughts on this. I’d say an organization like the Census Bureau absolutely should build a tool for people, but with all the humility of understanding — this is not gonna serve everybody’s needs. It’s not gonna be able to do everything. And that’s the magic of data — with a great, robust, rich dataset, there are so many ways to interpret it. And so what I’m always warning against is: don’t present your tool as the interface to the data, because you’re certainly going to overdetermine what the data is good for. So why not both — you should do both. And also just the fact that with LLMs right now, a hundred people can build a hundred tools ad hoc, and it’s awesome.
Christopher Dick: And so quickly. It’s insane.
Jed Sundwall: So quick. One last thing as we land the plane — I’m intrigued to get your take on: you were talking before about the variety of data formats and products and quality that you get from different parts of the country. Do you have a vision for how it’s possible to create a nationwide state-data product, where the portals that are out there right now — it’s great that a lot of states have adopted different portals — but they’re certainly not coherent. I did the experiment in Washington state looking at five county portals for parcels, all of which use different column names for what a parcel is. No one’s tried to solve this and certainly the market hasn’t fixed it. Is this a solvable problem worth working on?
Christopher Dick: Is it solvable in the very short term? No. Is it solvable in the more medium to long term? I think and hope it is. Because my constant frustration — and I know I’m not the only one who feels this — is just like, I need X, Y, and Z data asset from a county, and bringing it all together is really hard. Some states have fantastic groups at the state level who are doing this work. I’ve done a good amount of work in both Massachusetts and North Carolina. North Carolina has NC OneMap, which is pretty good, and Massachusetts has their state-level parcel dataset, which is again not perfect but pretty good. So some states do have this figured out. It’s just a question of whether it’s been something they’ve really wanted to focus on. But then across state lines, other than companies who sell those data, I personally haven’t seen a lot of that.
One of the hardest things — and I’ve run into this in some places — is you go and ask for a particular planning data asset, maybe parcels, and the county says, “Yeah, that’ll be a hundred dollars.” Then it becomes a question of why it costs that — I think usually it’s vendor lock-in that makes it cost money — and whether you can get those data at all. And then harmonizing those data would be incredibly, incredibly useful.
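The harmonization step Chris describes can start as something as small as a column-alias map, like the five Washington parcel portals with five names for the same field. A minimal sketch — the alias lists below are invented examples, not real county schemas:

```python
# Map each canonical field to column-name variants seen across hypothetical
# county parcel exports (keys are matched after normalizing to lowercase
# with spaces replaced by underscores).
CANONICAL = {
    "parcel_id": {"parcel_id", "pin", "apn", "parcel_number"},
    "owner": {"owner", "owner_name", "taxpayer"},
    "acreage": {"acreage", "acres", "gis_acres"},
}

def harmonize(record):
    """Rename one county's columns onto the shared schema."""
    out = {}
    for key, value in record.items():
        normalized = key.lower().replace(" ", "_")
        for canonical, aliases in CANONICAL.items():
            if normalized in aliases:
                out[canonical] = value
    return out

print(harmonize({"APN": "042-113-07", "Owner Name": "City of Example", "GIS_Acres": "1.9"}))
# {'parcel_id': '042-113-07', 'owner': 'City of Example', 'acreage': '1.9'}
```

The code is the easy part; the hard part is populating and maintaining the alias map across thousands of counties, which is exactly the work nobody has funded.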
Jed Sundwall: Yeah. And I think these are models that we need to figure out. Like how can we make this sustainable for somebody with your expertise and your young family to be able to keep doing it? And can we do it in a way that’s affordable to states? So I do think a lot of it has to do with education around procurement and also education for public sector leaders to understand the implications of the business models of their vendors.
Christopher Dick: Yeah. And I mean, it is easy to say, but it’s also — when folks are having to procure way outside of their expertise, that’s a hard thing to ask of them. And so that’s obviously why groups have come up to help with procurement and with that knowledge. But I mean, it’s one of those yes-but type situations.
And there are some things that are just crazy to me — like getting one national level file of not just all school districts, cause that exists, but all attendance zones. So actually having that — the National Center for Education Statistics used to have a program that did that. And it’s not funded anymore. Stopped being funded over 10 years ago. So it’s a project where I still use that very old data sometimes when I’m just trying to get a first cut on what I think an attendance zone looks like if I don’t already have the data from the school district. It’s just crazy to me that we don’t have that file existing and funded in some way. Those types of things would be incredibly, incredibly useful.
And for me, I would be willing to pay for them — both with my taxpayer dollars, but also as a person who wants to use them. But it has to be a model that works for everyone. And school districts should own those data too. I am a firm believer that if you are consulting with a school district, you should not be able to own their shapefile that you provide to them. They should own it. That should be their property.
Jed Sundwall: Amen. Absolutely. Yeah, and we need to be doing this for procurement. Somebody needs to be figuring this out — look, we need our kinds of people to be educating these school boards, cities, states, to understand how to write procurement language around ownership of data. I’m sure we’ll get comments saying there are a bunch of organizations trying to solve these issues, but they’re mostly commercial firms.
Christopher Dick: That’s true. And I mean, I say this 100% as a person who runs a private consulting business that works for the public sector — I am oftentimes a bad capitalist, because I feel like these things should be closer to public goods. I do want people to recoup their costs. But it’s the gouging thing that can get on my nerves. And so, yeah — the open data examples, OpenAddresses, OpenTopography — those are both amazing. But yeah, if I’m working for a small school district, can they afford that? Maybe not.
Jed Sundwall: Yeah. Yeah, exactly. And in my pitches to especially school districts —
Christopher Dick: Yeah, one of the big things I tell them is: I am not gonna come in here and tell you that I’m gonna give you the exact number of students you’re gonna have in 10 years. If someone is telling you that they can do that with any level of extreme precision, they’re lying to you. Here’s what it is — it’s science-based, it’s better with local knowledge, it’s a good planning tool, but you need to think about the error and all of that. And for me, it’s twofold. Number one, I always feel like I’m a nerd and I want to teach people about what I do because I love it. But also, I feel like it helps me understand which clients I want to work with. Because if they understand what I’m saying and they’re like, yes, I agree with this guy — those are the people I want to work with. The folks that come in and they’re very confident that someone is going to tell them exactly what’s going to happen — man, you’re overpromising. It might go well in the short term, but long term it might not.
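The conversation doesn’t name Chris’s method, but a common, openly documented baseline for school enrollment projection is the cohort-survival approach: advance each grade’s cohort by a historically observed ratio. A hedged sketch with invented numbers, truncated to four grades:

```python
GRADES = ["K", "1", "2", "3"]  # truncated grade ladder for the example

def project_next_year(enrollment, survival, k_intake):
    """One-year cohort-survival projection (all inputs invented here).

    enrollment: current head count per grade.
    survival: ratio of grade g-1 students who appear in grade g next
              year (captures transfers, moves, retention).
    k_intake: an externally estimated kindergarten entry count.
    """
    projected = {"K": k_intake}
    for lower, upper in zip(GRADES, GRADES[1:]):
        projected[upper] = round(enrollment[lower] * survival[upper])
    return projected

print(project_next_year(
    {"K": 100, "1": 98, "2": 95, "3": 96},
    {"1": 0.97, "2": 0.99, "3": 0.98},
    k_intake=102,
))
# {'K': 102, '1': 97, '2': 97, '3': 93}
```

Each projected year compounds the error in those ratios and in the kindergarten intake estimate, which is exactly why a precise 10-year headcount is an overpromise.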
Jed Sundwall: Yeah, the fakes will get exposed. All right, this has been awesome. I really appreciate you coming on. We’ve been going solidly for about an hour and 15 minutes. It’s a pleasure, it’s a privilege to know you and I’m just really, really glad you came on and I hope our audience enjoys this. And yeah, let’s keep hacking on these things. I think we’ve got a moment now.
Christopher Dick: Yeah, absolutely. Thanks for having me, Jed. And yeah, pleasure being here. Glad that we met through a lot of this dataindex.us work and the work that you all are doing. Thanks for having me. You know I could talk forever.
Jed Sundwall: Yeah, okay. Well, maybe we’ll have you on again. All right. See ya.