Great Data Products

A podcast about the ergonomics and craft of data. Brought to you by Source Cooperative.


→ Episode 2: Protomaps and PMTiles


Video also available on LinkedIn

Show notes

Jed talks with Brandon Liu about building maps for the web with Protomaps and PMTiles. We cover why new formats won’t work without a compelling application, how a single-file base map functions as a reusable data product, designing simple specs for long-term usability, and how object storage-based approaches can replace server-based stacks while staying fast and easy to integrate. Many thanks to our listeners from Norway and Egypt who stayed up very late for the live stream!

Key takeaways

  1. Ship a killer app if you want a new format to gain traction — The Protomaps base map is the product that makes the PMTiles format matter.
  2. Single-file, object storage first — PMTiles runs from a bucket or an SD card, with a browser-based viewer for offline use.
  3. Design simple, future‑proof specifications — Keep formats small and reimplementable with minimal dependencies; simplicity preserves longevity and portability.
  4. Prioritize the developer experience — Single-binary installs, easy local preview, and eliminating incidental complexity drive adoption more than raw capability.
  5. Build the right pipeline for the job — Separate visualization-optimized packaging from analysis-ready data; don’t force one format to do everything.

Transcript

(this is an auto-generated transcript and may contain errors)

Jed Sundwall (02:37.51) So I’m going to start it. First of all, happy Halloween, Brandon. Welcome to a special edition of Goth Data Products, in case anyone’s wondering why we’re both red, if they’re watching. The ones listening to the audio only don’t get the benefit of seeing us in this kind of, like, spooky color scheme. But welcome.

Brandon Liu (02:50.606) Thanks.

Brandon Liu (03:16.364) Yeah, so thanks for having me on the podcast. I’m excited to talk about, you know, Protomaps, data, Source Cooperative. So I’m here to answer questions, I guess. Yeah.

Jed Sundwall (03:25.753) Yeah, no, likewise. By the way, sorry, I did chicken out. So I’ve changed the lighting, so I’m not red anymore. Yeah, no, it’s great to have you. I mean, when we started this thing, you were sort of really top of mind as somebody who’s been very thoughtful about what we call the ergonomics of data, like figuring out how to make a lot of data accessible for people. So if you don’t mind, let’s just start there. Like, can you just

Brandon Liu (03:47.97) Right.

Jed Sundwall (03:55.085) How do you describe yourself and what you do?

Brandon Liu (03:58.459) So the way I describe myself is, I started a project called Protomaps six or seven years ago, and the impetus for this was making it easy to make a map. And the direction that came from was very much just like, you think about a web developer that is making a website, so for example, they’re making a site to look up different cafes in their neighborhood.

They might use something like Google Maps, but that is a proprietary SaaS that they buy. So I really wanted a way to have, like, a home-cooked way to make a map, because there’s so many things you can publish on the web. You’re able to publish videos, you’re able to publish pictures or Markdown or HTML, but being able to publish an interactive map has never been that way. So really the way I approach this is from the idea of making it accessible for anyone to publish a map.

Jed Sundwall (05:06.149) Got it. Okay. And so amazing. And you’ve done it. And so you’ve reminded me. So one thing that we were going to be doing, I mean, I’m just gonna say these things out loud, which is kind of funny, is like part of the reason for doing this podcast is we’re doing so much stuff at Radiant Earth and we need more channels to be able to talk about it. And so just last week, we put out a white paper, and this will be in the show notes and I’ll put it in the chat, but it’s called Emergent Standards. So what you said is very relevant to this, which is that in the paper, I argue that the web has turned out to be, really, an engine that helps people come up with new data standards. And so if you look at it through that lens, you have HTML, which is like, let’s share a document in hypertext, you know, like hyperlinked documents with one another.

And then you end up, you’re like, well, what if I don’t want to load up a webpage, but I want a feed of updates. And so RSS emerged out of that. GTFS emerged out of the need for standardized transit information. And I would say what you’re doing, I guess specifically with PMTiles, is a way to do this for vector tiles.

Brandon Liu (06:24.022) Yeah, I have a lot of, I guess, thoughts about the idea of standards in general, both on the web and also for geo. I think with a lot of the web, we think about them as standards. Like, for example, HTML evolved very early. And maybe on the early web, it was a lot more in the design phase, where people would collaborate on creating some spec and that became a standard.

Nowadays, what you see is more like, if one of the big companies that makes browsers, like Google or Microsoft, they make everyone adopt a standard because it’s in their incentive to do so. If Google can convince everyone to use, what is it, like JPEG 2000 instead of plain JPEG, then they can reduce the amount of bandwidth on the internet by 20%. And all that tech around things like serving video, serving audio and images is very mature, to where you don’t really see a lot of emerging standards being adopted organically. It’s more like, there’s this committee at these huge companies that all collaborate on a standard. There are some examples of more small-scale solutions that became adopted. And that’s really how I see PMTiles fitting in with them.

Like, I don’t want it to be top down. I don’t want people to make their organizations adopt PMTiles. I want people to use it because it solves the problem for them. There is a really cool format for images that I like. It’s called QOI. I think it stands for literally the “Quite OK Image” format. It’s very modest, like its name. But I think it is just like one guy came up with

Jed Sundwall (07:58.543) Right.

Jed Sundwall (08:14.437) Okay.

Brandon Liu (08:19.38) a way to do lossless compression of images that is a lot simpler than PNG and is good enough. It’s not more optimized, but it’s way faster to decode on a CPU thread. And that is one good example of a, not a standard from a standards body, but of something that had a simple design that became popular. And it was not adopted because it’s like,

Jed Sundwall (08:44.699) How popular? I’ve never heard of it.

Brandon Liu (08:48.428) I think the original motivation was for games. Like if you have game assets and you need to be able to decompress them and move them around in just raw RGB formats, then QOI is supported by some of those engines. But actually another one you mentioned is GTFS. So GTFS is more geo-adjacent. And that also came out of, I think, Google’s requirement to have some systematic way of storing transit routes. But it wasn’t like some sort of consortium of transit agencies that came together to design this CSV format. It just became a widely adopted solution because it happened to be good enough.

Jed Sundwall (09:35.441) Right.

Yeah, well,

Brandon Liu (09:38.252) And that’s really how I see PMTiles. Yeah.

Jed Sundwall (09:41.423) Yeah. Well, so yeah, I mean, it happened to be good enough. Also Google had this cruise ship that everybody wanted to get on. I don’t know who first described it that way, but every transit agency in the world was like, we want our data to be in Google Maps. And so they had an incentive to do that. And so that’s a concept we explore in the white paper, which is like, you do need this mix of good-enoughness, because that is usually where things land: you have something that’s good enough for a lot of people to adopt. They’re like, this is fine. I mean, sorry, what’s the acronym for the image format? Like adequate, what is it? Quite okay. I love it. That’s usually where things land. Like the story of RSS is a bunch of people fighting and a bunch of attempts at top-down approaches to syndication until people kind of threw their hands up. But tellingly, then the New York Times adopted it

Brandon Liu (10:21.102) quite okay. Yeah.

Jed Sundwall (10:41.059) and started publishing RSS feeds and everyone’s like, okay, this is what we’re doing now. So it’s fascinating to see. Do you have any sense for the traction of PMTiles as being like this? Like who’s using it?

Brandon Liu (10:56.27) So I have a couple of proxy ways. I don’t actually know how many people are using it because by nature, I can’t track. I can’t add a tracking pixel each time someone looks at a map. The one thing I can track is the number of npm downloads. So npm is the package manager for JavaScript. And that is, I think, the most popular client for reading PMTiles. And it’s something that I’ve

Jed Sundwall (11:07.61) Yeah.

Brandon Liu (11:24.896) it’s something that I maintain, and that crossed like 100,000 downloads per month, or it’s either per month or per week, I can’t remember, this year. So you can see a growth curve of people using this library. Now I don’t actually know if that means anything because it could just be an automated CI script, like on GitHub Actions, that is downloading it a thousand times. But it has some correlation with usage. So the only way that I can kind of

see if people are using PMTiles or if it’s being adopted is through this proxy metric like npm downloads. Or people show me a site that is built using it. So actually probably the biggest one is, I think the New York Times had a visualization on their homepage that was about space debris falling to earth. And that used a map data set that was served from PMTiles.

Jed Sundwall (12:15.558) Okay.

Brandon Liu (12:20.392) So probably like a dataset that’s being served on the New York Times like front page in PMTiles format is like probably like the most high traffic use of it.

Jed Sundwall (12:21.284) Okay.

Jed Sundwall (12:32.209) Here we are again with the New York Times. It’s really kind of interesting. I mean, you think about the legacy of the New York Times, the story about them sort of crowning RSS the standard for syndication, like that’s true. They did that. And they do have the imprimatur to do that kind of thing, which is awesome. That’s great. That is a sign that you’ve made it. Shout out to Tim Wallace, who probably had something to do with

Brandon Liu (13:01.816) Okay.

Jed Sundwall (13:02.053) with the New York Times using PMTiles. That’s awesome. Okay. Well, so one thing I can say though is, on Source, you know, we host a lot of PMTiles files, and you can correct me on all of this. There are some kind of base map objects, I think, that are in there or something like that. But one of my favorite things to do is to search GitHub for references to the Source data proxy, which is data.source.coop. As of earlier today, 612 results pop up when I search for it. But a lot of them are to PMTiles files. Do you have any insight into that?

Brandon Liu (13:47.362) Right. So the project I run is sort of an umbrella project called Protomaps, and PMTiles is just one part. And that was by design, because I never thought it would be good enough to just design a format. Because if you design a format, then you also have to have some killer app that makes people actually care. Because just having a spec with some implementations, people are like, that’s cool. But like,

Jed Sundwall (13:48.889) Or aware of that? Yeah.

Jed Sundwall (13:55.471) Yes, yeah. Okay.

Jed Sundwall (14:11.057) All

Brandon Liu (14:16.182) I can’t immediately take advantage of it. So the way I approached it was to have a killer app, which is a base map, or what people think of when they think of a map, which is like, you look at it and there’s city names and there’s water and roads and stuff. That’s based on OSM. So the actual data product that is open source and free by default in the PMTiles format is this base map that’s from OSM. And I think a lot of the links to Source are to that because, going back to what I started with, it’s like if people just want some solution for showing a map on their site, you know, as an open source replacement to Google that they can run themselves, that they can copy and move around, they can download as if it was a video or an image. I imagine a lot of the links are to that just because it’s designed to be something that’s immediately useful.

Jed Sundwall (14:47.663) Yeah, that’s what I was guessing.

Brandon Liu (15:12.686) Now with Source, I think the CORS policy is quite open. So if there’s other data sets, like a scientific data set that is in PMTiles format, people could link to that. And hopefully people do that more, or they download from Source and mirror to their own buckets and use that.

Jed Sundwall (15:30.969) Yeah. Yeah. Yeah. So I mean, we’re going to have to do our own analysis on this at some point, which is like, what is the cost of us hosting those objects? Because yeah, our CORS policy is wide open so people can do that. And we can do the math on this, but I mean, you know, shout out to AWS. Thank you to the AWS Open Data Program that still exists after yesterday. Anyway, it was a tough day for a lot of people at Amazon yesterday. There were a lot of layoffs, but the Open Data Program is alive and kicking. And so they subsidize all of our storage and bandwidth for Source. But we do want to get serious about this at some point and have an understanding of, how much should it really cost to do something like this at what scale? We have…

Brandon Liu (16:00.493) Yeah.

Jed Sundwall (16:28.057) All the analytics we need, we just haven’t sifted through the data yet to figure out which of those objects are being hit the most and how much and what’s the throughput that’s going out. Because I know you’ve done analysis on the costs of doing these things. I imagine you have some data on how much it costs to deploy PMTiles, but we also have a lot of this data, we just haven’t shared it yet. So.

Brandon Liu (16:52.994) Right. So going back to that for a moment though, I wonder if you think about that idea of being able to search GitHub for all the links to Source, for people that are hotlinking to it. In some sense, I think it’s not directly correlated to success, just the number of people that are consuming Source. If people are making a copy of the data, if people are copying the data they get from Source to their own bucket and then using that,

Jed Sundwall (17:03.257) Yeah. Yeah.

Jed Sundwall (17:15.588) Of course not, yeah.

Brandon Liu (17:22.284) That is still like using the platform as intended. Like there isn’t really like by design, I don’t know if source is designed to be like an intermediary platform. Like for example, like Airbnb. So for Airbnb is like you go to the site and you look up like bookings, like listings, but they will stop you from trying to go off the platform to like make an arrangement with like your host because that’s like, that’s, that’s exactly against their business model. Right. That’s like.

Jed Sundwall (17:22.779) Yes.

Jed Sundwall (17:41.115) Yeah.

Brandon Liu (17:51.887) So it’s for Airbnb, the entire point is like, they’re an intermediary between you, like your desire for a room and the host. Now, so I don’t think source is by design as like a data platform to be an intermediary for all data. There is a lot of like open data platforms in the past that have worked that way, where they make it very difficult for you to be able to consume the data outside of the platform. But it feels like with the sort of cloud native focus, part of the idea is that you’re able to

Jed Sundwall (18:01.861) Right, right.

Brandon Liu (18:21.602) you know, just like package up data and take it to go or access it just in chunks instead of having to be locked in to just using source. So if there was some way to maybe promote that as like a first-class way to consume source instead of just linking to assets, then maybe that would help alleviate some of these ideas around like cost sharing for bandwidth.

Jed Sundwall (18:39.632) Yeah.

Jed Sundwall (18:45.667) Yeah, well, no, I mean, let me address this and then I want to acknowledge we have a viewer, Sig Till, I’m not exactly sure who they are, but they’re Sig Till on YouTube, who is joining us from Norway. So we were like, let’s do this at 4 p.m. Pacific. Sorry, everybody in Europe, but we’re doing it Asia Pacific, or at least, I mean, it’s what, it’s 7 a.m. where you are. So we’re kind of in a…

weird time zone right now, but we had somebody from Norway tuning in to ask what’s in the future for PMTiles and which changes would you like to see in the format itself or new tools that use the format? But anyway, Sigtil, just don’t go to sleep just yet. We’ll answer your question. The vision of Source is not so much to be an intermediary. Source is, by design, it doesn’t really do much other than provide reliable access to objects.

So we call it, it’s a data publishing utility. It’s not an analytic tool. I’m happy to have, I want people to build stuff on top of source. So yes, I do want people to link to it. However, this is math that we, this is kind of my point in saying we have to do this analysis on our usage is to say, well, how much is that really gonna cost us if we do that? And are there ways for us to…

get a handle on bandwidth and usage so that we’re not abused, you know, or rather, abuse isn’t the right term, but just so that we can afford to do that in a way that’s reasonable. And to say, look, if you don’t want to host your own object somewhere, which tons of people don’t, I mean, sort of a core tenet of the product design is that we just know that a lot of people don’t want to host their own stuff. They don’t want to run their own servers. They don’t want to think about infrastructure at all. If we can let them just link to reliable assets that are available, that’s great. But we have to figure out a way to do that in a way that could scale to the usage of something like Google Maps without bankrupting us, you know? Then that means we have to figure out, for example, with the open CORS policies, do we have to have some sort of way to say, no, no, you have to be put onto an allow list

Jed Sundwall (21:07.099) to be able to link to this or something like that. We’re gonna have to figure that out. So you’re right that I don’t want to be an intermediary. We’re not really trying to log people into source, but we do wanna provide a service that allows people to access data without having to download and re-serve their own copies if they don’t wanna do that.
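
The allow-list idea Jed raises can be sketched in a few lines. This is a minimal illustration only, not how Source actually works: the function name `cors_headers` and the origins in `ALLOWED_ORIGINS` are hypothetical placeholders. Passing `allowed=None` models the wide-open policy discussed above; passing a set restricts which sites a browser will let read the object.

```python
from typing import Dict, Optional, Set

# Hypothetical allow-list; placeholder origins, not real Source configuration.
ALLOWED_ORIGINS: Set[str] = {"https://example.org", "https://maps.example.com"}


def cors_headers(origin: Optional[str], allowed: Optional[Set[str]] = None) -> Dict[str, str]:
    """Return CORS response headers for a request's Origin header.

    allowed=None models a wide-open policy: any site may read the object.
    Otherwise only allow-listed origins get Access-Control-Allow-Origin.
    """
    if allowed is None:
        return {"Access-Control-Allow-Origin": "*"}
    if origin is not None and origin in allowed:
        # Vary: Origin keeps shared caches from reusing a response across origins.
        return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    # No Access-Control-Allow-Origin header: the browser blocks the cross-origin read.
    return {"Vary": "Origin"}
```

Note that CORS only governs what browsers allow pages to read; it does not stop direct downloads, so it limits hotlinking from web pages rather than bandwidth use in general.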

Brandon Liu (21:25.612) Right. I mean, on the other hand, I feel like part of the messaging is that just having object storage is a commodity. And in my experience, talking to developers that use PMTiles or that use other cloud data formats, a lot of people find using S3 very accessible, and it’s not a huge lift to ask them to be like, hey, go put this thing in your bucket. And it’s even among non-

Like I would say you could just be a front end developer. You could be someone that spends all their time doing TypeScript programming and knows nothing about servers, and you can figure out object storage. So I think part of the selling point I’m trying to make is like, exactly. Yeah. That audience, I think, is extremely large, of people for whom it’s too much of a lift to host something like a server.

Jed Sundwall (22:06.693) That’s my story. Yeah.

Brandon Liu (22:20.472) But just putting a thing in a bucket is actually like a very good experience. It’s very simple, it has a nice abstraction. And if you can sort of encourage the world to be more object storage-y, that’s the way I think about it. And that’s a big part of why I think PMTiles as a format has succeeded is because that audience is so large.

Jed Sundwall (22:42.191) Yeah, totally. I mean, so yes, I agree. I’ll just tell a bit of history. I’ve told this story a million times. I’ll probably tell it a lot as we keep doing this podcast, but the story of the origin of the cloud-optimized GeoTIFF and all this was when I found myself at AWS building this open data program and I figured out this one weird trick that I could just get the company to give out free S3,

but I had no engineers. I was embedded within a sales organization. So due to HR practices, the idea of hiring engineers to build software or tools or anything was out of the question. And so I’m like, what can we get away with if we can only use S3? And I also, being kind of, I guess I would say, a front end guy, although I’ve never been officially hired as an engineer, loved S3.

It’s like a very intuitive product, super powerful, very capable. I wasn’t afraid of it. And so I’ll say this: you, very talented, smart person, know how to use S3, aren’t afraid of it, and neither are your friends. There’s tons of people out there that are afraid of S3. Like Source, and actually I got to shout this out. We’ve been working with Development Seed on Source. Anthony Lukach, shout out to Anthony at Development Seed, has been

just cranking out new features on source. Today, we pushed out like you can upload stuff into S3 through the browser through source. for source users now, which you still have to be invited to be a source user, you don’t even have to use the CLI. You don’t have to, you don’t have to look at the AWS console. Like I’m just here to tell you there’s a whole universe of people out there that they’re like.

No, I am scared of S3. I’m scared of AWS. I don’t want to look at that console. And I saw somewhere some tweet that was in reference to Vercel or something like that, but it was just sort of like, it’s amazing how big of a business you can build just by building an abstraction layer on top of the AWS console. And so that’s really what we’re trying to do. And in fact, I do hope there will be people in the future. I mean, we already have a…

Jed Sundwall (25:01.777) a bunch of other organizations that are hosting their own PMTiles on source, they would rather put it on source than host their own S3 server. So, or rather like manage your own AWS account. So, I’ll leave it at that. Let me make sure, I’m hoping Sig is still awake in Norway. Do you want to take this question? What’s in the future for PMTiles?

Brandon Liu (25:25.858) What’s in the future? I would say the current version of the spec version three is done. There aren’t any plans for a version four right now. And I think I kind of got lucky in that sense that there was nothing like someone at a conference last month in Japan, they asked me is like, do you have any regrets about like the format design right now? And I’m like, I thought about it. I’m like, not really. It’s not perfect. Like the design overall has very specific trade-offs, you know?

Jed Sundwall (25:38.566) Okay.

Jed Sundwall (25:49.125) Okay.

Brandon Liu (25:55.212) Like it’s almost stupidly simple in some sense. And I didn’t want to get too carried away. I didn’t want to embed CRS information and that kind of thing. I would say the lowest-hanging fruit for PMTiles is better compression methods, but that’s blocked on browser implementations. So browsers only support gzip for the DecompressionStream API. If that supported something like Zstandard,

That would be great, but that is blocked on Apple, Microsoft, Google implementing Zstandard support. What changes would I like to see in the format itself? The format itself, right now, is good enough for static data. I would really like to see another format emerge that is for dynamic data that is still, like, S3 optimized,

that handles rapidly changing data. Because right now, if you edited some geodata and created a PMTiles, you’d have to replace the whole file on object storage. And that is a huge trade-off. Thankfully, a lot of the data out there is you can generate this building data set once. And maybe once a month, you run a new job and it generates a new one. Each time you are replacing it.

Jed Sundwall (26:52.987) Yeah.

Jed Sundwall (27:05.711) Yeah. Yes.

Brandon Liu (27:21.496) What I really want to see is a cloud native storage engine for real-time data. That would be a totally different design than PMTiles, but I think it’s still possible to do a cloud native thing on S3, for example, where maybe you have data in chunks, and then those chunks are addressed by a hash. And then you have a header that is just a reference to hashes. And then as you upload new data or data changes, you create new chunks and reference those.

and then garbage collect them. So I would like to see some other new format, separate from PMTiles, that addresses real-time data. In terms of new tools for the format, sort of along this line, one experimental tool I have for PMTiles is a way to do deltas. So you have to replace a PMTiles file on S3 each time. But I was thinking about a way to rsync data.
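
The chunk-and-hash design Brandon describes for a future real-time format can be sketched as follows. This is a minimal illustration under stated assumptions, not an existing format or tool: `ChunkStore`, `write_version`, and `garbage_collect` are hypothetical names, and an in-memory dict stands in for a bucket.

```python
import hashlib
from typing import Dict, List


class ChunkStore:
    """In-memory stand-in for a bucket: chunk objects keyed by their SHA-256 hash."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.objects[key] = data  # identical content always maps to the same key
        return key


def write_version(store: ChunkStore, chunks: List[bytes]) -> List[str]:
    """Upload chunks and return the header: an ordered list of chunk hashes."""
    return [store.put(chunk) for chunk in chunks]


def garbage_collect(store: ChunkStore, live_headers: List[List[str]]) -> int:
    """Delete chunks that no live header references; return how many were removed."""
    live = {key for header in live_headers for key in header}
    dead = [key for key in store.objects if key not in live]
    for key in dead:
        del store.objects[key]
    return len(dead)
```

With this layout, publishing a new version that changes one chunk uploads only that chunk plus a new header; unchanged chunks are shared between versions, and chunks referenced only by retired headers can be collected later.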

Like if you have a 200 gigabyte PMTiles file on the cloud, and then you have 200 gigabytes on your desktop and they’re mostly the same, but one part has changed, you can use an algorithm like rsync, basically, to just fetch the parts that have changed. So that’s one way, from the cloud to your computer, not the other way around. But I would like to see some use cases for that because I sort of built it as an idea.

But there’s not really a strong compelling use case right now. So those are a lot of my ideas for the PMTiles ecosystem right now.
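
The delta idea, fetching only the parts of a remote file that differ from a local copy, can be illustrated with a simplified sketch. This is not Brandon's experimental tool and not the real rsync algorithm: it compares fixed-offset block hashes, whereas rsync uses rolling checksums so it can also match data that has shifted position. The names `block_hashes` and `ranges_to_fetch` are hypothetical, and the sketch assumes the remote side publishes a list of per-block hashes.

```python
import hashlib
from typing import List, Tuple

BLOCK_SIZE = 4  # tiny for illustration; a real tool would use much larger blocks


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each fixed-size block of a file."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def ranges_to_fetch(
    local: bytes,
    remote_hashes: List[str],
    remote_size: int,
    block_size: int = BLOCK_SIZE,
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges where the remote block differs from the local copy."""
    local_hashes = block_hashes(local, block_size)
    ranges: List[Tuple[int, int]] = []
    for i, remote_hash in enumerate(remote_hashes):
        if i < len(local_hashes) and local_hashes[i] == remote_hash:
            continue  # block already present locally, nothing to download
        start = i * block_size
        end = min(start + block_size, remote_size) - 1
        ranges.append((start, end))
    return ranges
```

Each returned pair maps directly onto an HTTP `Range: bytes=start-end` request, so only the changed blocks travel from the cloud to the desktop.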

Jed Sundwall (28:56.657) Okay, I love that.

You’re unearthing some feelings about Source. You know, we want Source to be kind of this one-to-one proxy for S3, but the idea being that we can create durable URLs that are undergirded by as many object stores as we want. So if you have an object, you should be able to mirror it in lots of different regions and across clouds. And if you have your own S3-compatible object store, we should be able to point to it and stuff like that. But a really interesting thing happened. You’ll have to look around on this, but in the data.source.coop repo on GitHub, which is the repo for our data proxy, this guy Sylvain Lesage, who we’ve been working with on viewers,

You’ve encountered him on GitHub. He’s like, it’s weird. Hugging Face can stream CSVs, but S3 can’t. And he looked into it and it had something to do with some header stuff that I don’t remember the details of. But it was an easy add to the proxy that was basically just, it would pass some more information in the header when you’re calling the CSV, and you can stream the CSV. And so like,

We’ve crossed that line. It’s sort of like, we’re going to do something the S3 API doesn’t do. And I can see us going down a path where we are

Jed Sundwall (30:27.937) more than just like a very simple abstraction on top of S3, but we’re extending what object stores can do. So we should keep talking about that.

Brandon Liu (30:38.23) Right. And also like, so going back to the idea of like a top down versus a bottom up standard. So S3 has become like a de facto standard, like a totally undocumented standard where every other vendor like sort of only implements the features they need to be S3 compatible. And if something is like wrong or like broken, they’re like, well, that’s how S3 works, you know? So it’s sort of become this, this odd thing where this quirky design that Amazon came up with.

Jed Sundwall (30:44.527) Yeah. That’s right. Yep.

Jed Sundwall (30:58.821) Right. Right.

Brandon Liu (31:07.07) is now what everyone has to do de facto, because all the tooling is built with those assumptions that this API, this XML API, exists. They’re trying to do new things though. Like, there’s that S3 Express One Zone that works differently. There is, I think, a new way to do partial uploads. Like you can define an upload as being copied from a different object and that’s accelerated.

Jed Sundwall (31:07.12) Yeah.

Jed Sundwall (31:25.115) Yeah.

Brandon Liu (31:36.536) But yeah, like it would be cool if some other company came up with like an actual, like maybe a more, like a more featureful spec for S3. But again, probably why it succeeded to the point it has is because it’s so simple. It’s like dumb, you know, there’s no really fancy, there’s no fancy semantics around like content hashes and stuff. Like if you look at how Google storage works, you know, it does seem like they had some, you know, some…

Jed Sundwall (31:46.768) Yeah?

Jed Sundwall (31:51.344) Well, right.

Brandon Liu (32:06.04) whatever like level seven engineers sit in a basement for months and like come up with some cooler design that is like more correct or that is more scalable. So there is platforms like Google storage that seem to have more sophistication than S3, but they don’t have the adoption of S3 in terms of the API, not the specific Amazon platform, but like the API, the interface. And I think that is like a fundamental thing, which is there’s always gonna be this trade-off between like,

Jed Sundwall (32:13.915) Yeah. Yeah.

Brandon Liu (32:36.606) the simpler and dumber you make it, the more likely it is to thrive, you know, like thrive organically, in terms of people being able to write their own implementation, people writing tools. That, I think, is also the trade-off between something like PMTiles, which is, you know, like I keep saying, simple and dumb, versus something that is more full-fledged, like a server application that serves WMS tiles, for example.

Jed Sundwall (33:04.111) Right, yeah, I mean, so we just have to be very careful with how we go about this. I imagine you’re familiar with the concept of pace layering or pace layers. Have you heard of this? Yeah, so I’m putting another, I’m just gonna be putting stuff in the chat. It’s an idea I think Stewart Brand came up with, which is basically the notion that, like, you,

Brandon Liu (33:15.63) I don’t think so.

Jed Sundwall (33:32.271) that society, like the world, like society is our experience as humans moving through the world. It’s based on all these things that are moving at different rates. like nature undergirds everything. And on top of that, we have all kinds of different life forms and then humans have developed culture and governance and law, language itself. But these are all layers that like they evolve at faster and faster rates.

The funny thing is, sort of the top layer of the pace layer diagram is always fashion, which is all over the place. Fashion is this kind of unpredictable crazy thing that humans do, but that’s based on these other more foundational things like markets and law and language and blah, blah. And so that’s how I, so, I mean, I was at Amazon for eight years. And I totally bought into

the philosophy of AWS, which is to provide primitives, to provide primitive services that are reliable and are effectively, extremely durable. We had an AWS crash quite recently. things go wrong, but it’s pretty remarkably stable service in terms of like how complex it is and how much stuff it supports. But the way they do that is by being very primitive.

I would say there’s, to your point, there’s obviously room to extend that. And I think the right way to go about it or to think about it is to extend on top of the primitive. But to go slowly, you wanna add layers very carefully on top.

Jed Sundwall (35:23.897) All right, let’s see here. I make sure that we’re… I’m figuring out this chat stream thing. I can see it here in Riverside. Sorry, everybody out there, but we’re still figuring out how to do this thing. So I’m curious to get your… I mean, when did you realize you could just do a really huge file? Just like one gigantic file.

Brandon Liu (35:48.29) So I started Protomaps the project before I created PMTiles, and the original plan was to have a server, like a server process, that would serve tiles out of a database. So the original design was not cloud native or cloud optimized at all. It did not use range requests. It was still one file

Jed Sundwall (36:01.595) Yeah.

Brandon Liu (36:17.134) that you like stored on a server and you had to like run this program to be able to like serve it over HTTP. And then I like, I eventually figured out that I could sort of cut out that entire part just by making it something that you could put onto like on S3 as a static file. So that actually came in probably like one or two years into the project.

For me, it’s so in a lot of cases, like that idea of being locked into using the server process to like serve the tiles, that is sort of like a feature. Like for most businesses, like, like if you have to run it on server, that creates like lock-in, you know, and you can monetize that. You can add like, you can add a paywall. You can say, Hey, like, so if you want to like be able to access this thing, it goes to the server. Just like get this API key, you know, once you go, once you go over like 10,000.

Jed Sundwall (37:03.129) Exactly, yes.

Brandon Liu (37:15.33) request, then you can pay like a subscription, like pay as you go. So that’s like a feature is to be able to like have it be a, a file like on a server versus just a single static object. but then like, once my, my thinking around like, okay, well like, you know, what is the long-term way this project succeeds? I’m like, you know, isn’t it more interesting to have it just be this like single object?

that you can copy around, like as if it was a video. So right, the original like motivation for the project was coming from like being able to create custom maps and host them yourself. Just the nature of how that was hosted evolved from being a traditional like sort of SaaS-y server thing to being this like object storage focused thing later on.
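The shift described here, from a server process to a static object read with range requests, can be sketched in a few lines. This is only an illustration under assumptions: the helper names `range_header` and `read_range` are hypothetical, and an in-memory buffer stands in for a file on S3 or an SD card.

```python
import io

def range_header(offset: int, length: int) -> dict:
    """Build the HTTP Range header for `length` bytes starting at `offset`."""
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

def read_range(f, offset: int, length: int) -> bytes:
    """Local equivalent of a range request: seek to the offset, then read."""
    f.seek(offset)
    return f.read(length)

# A toy "archive": a 6-byte header followed by two 6-byte tiles.
archive = io.BytesIO(b"HEADER" + b"tile-a" + b"tile-b")

print(range_header(6, 6))         # header a client would send for the first tile
print(read_range(archive, 6, 6))  # the same bytes read from a local copy
```

The same offset and length work against a bucket or a local file, which is why no dedicated tile server is needed in between.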

Jed Sundwall (38:11.503) Okay, fascinating. Yeah, I mean, the…

this notion that if you control the server, if you have to be this intermediary, you get to control the data flows and also the users. I was thinking like studying Netflix is a really interesting thing to do if you think about like a data business. Netflix is a data business. They sell subscriptions to data. And the way they’re able to do that is by controlling the entire interface, like the entire chain. And so you have to go through them and pay their subscription and…

experience, know, have the Netflix experience, which is good. You know, the fact is like they provide, there’s a huge audience for that kind of data, which is videos that people like to watch. And they’ve just nailed the experience and people are happy to pay for that. know, whereas like, there are certainly people out there that are like, nope, like you have to have your own DVDs or I’m going to run my own local NAS with a bunch of my own video files because I want to have control. But most people are like, whatever, I don’t want to have to think about this. And so.

So all I’m saying is like, I’m underscoring the point that like, there is a business in providing that kind of service to people, but the market for maps is way too small to justify that kind of thing. That’s why I think so many geospatial like SaaS companies have had such a hard time because they might be able to provide a great, great experience to get some vectors and rasters and stuff delivered over their interface, but like,

the market for it’s just way too small to justify it. Anyway, I’m a fan of your approach for obvious reasons. And I’m sorry, let me just keep going because Rachel Googler on LinkedIn asked, this is relevant to this. She asked, she said, with the AWS outage last week and Azure issues today, which I didn’t know about, we’ve seen how reliant we are as a society on centralized cloud infrastructure. How can cloud native formats be used in temporary local area or

Jed Sundwall (40:14.447) or peer-to-peer networks when that centralized connectivity is gone, such as during natural disasters. I think you kind of answered her question right away, but do want to address that kind of idea directly? Like how you think about this?

Brandon Liu (40:29.326) So I think of the Protomaps project as something that works on a server or works on S3, but also as something that works on an SD card. It’s like, if you can put a map or you can put a dataset from Source, like a scientific dataset, onto an SD card and carry it into the forest, then that is like…

That’s good enough, right? That’s how most technology should work. That’s how videos work. That’s how Word documents work. So I think once you’ve built the primitives, it addresses a lot of these questions about like portability and being able to be resilient against like certain failures of networks, for example. There is some interesting things around peer-to-peer. I know one of the contributors to PMTiles was like,

playing around with IPFS, which is like this distributed storage system, like where everything is like addressed by hash. think it’s cool. don’t know a lot about it, but I’m happy to hear that just designing like a simple single file format can be directly applied or like it just works with these things like IPFS. And…

Jed Sundwall (41:32.517) Yeah. Yeah.

Brandon Liu (41:53.696) I haven’t seen a lot of adoption for that specific peer-to-peer system outside of some more niche use cases. But in theory, so you could build a really resilient network of storage for any kind of data as long as what you’re trying to serve is just these simple files.

Jed Sundwall (42:18.255) Yeah, yeah, well, I mean, and again, I mean, I think the sort of the Netflix example is a good one to explain this, like to highlight also the sort of Rachel’s point of like these single points of failure that can occur where like if you are relying on one system to be able to deliver content like in a very specific way, if that system is brittle, it goes down for any reason, like you’re hosed, but this is the…

This is core to the file-based approach to data architectures, or what I would say specifically the object-based approach, because I like object storage, is that resilience in the face of a system going down to your point, like you can put on an SD card and take it into a forest, that’s perfect, that’s a great way to think about it. There’s kind of no way of getting around the power and effectiveness of sneaker net. However, this opens up the…

the door to a question that I’ve had about PMTiles is that you’ve created PMTiles as this format. If you give, so if I show up with a PMTiles file on an SD card and give it to a random person, they will not be able to open it. They’re gonna double click on it and be like, what is this? How do you get away with that? I mean, yeah.

Brandon Liu (43:31.619) Yeah.

Brandon Liu (43:36.148) Yeah. I think it’s tough because like, it sort of depends on the observer, right? Or the person opening it, are they opening it on Android? Are they opening it on Windows? Can I go talk to Apple and ask them to put a PMTiles viewer into Mac OS or something? And I think like my solution is this web viewer. There’s a website called PMTiles.io that I maintain where you can just like drag and drop.

Jed Sundwall (43:45.402) Right.

Brandon Liu (44:04.276) a local PMTiles file or a URL of a PMTiles on the cloud. So the sort of intention was that viewer emerged at the very beginning. There has to be essentially a file preview for these things that works locally too. You shouldn’t have to spin up a web server to be able to look at something. So the thing about data is people want to look at it. People don’t believe that it exists until they can see it.

It’s just like this inherent bias. So we know the machine can read it. People don’t trust it until they can look at it. And that is a lot of why people care about PMTiles overall, because they might have geo data in some format, but if they want to visualize it, they have to turn it into some more visualizable format. And that’s really what PMTiles is, is making visualization easy. So the answer for the web viewer is as long as they have a copy of that

web viewer, which is open source, on the USB stick, then they should be able to open that offline in a browser and just like open up that PMTiles file. That viewer is built using all like pretty standard web stuff. It uses MapLibre and some like browser APIs.

Jed Sundwall (45:23.673) Right. But is that all built? Can that viewer be… that all be… This is a very naive question. Could you just have like an HTML file on that stick that contains the entire viewer?

Brandon Liu (45:38.454) And a JavaScript bundle. Yeah. The, there is a static build of it, cause it’s hosted on GitHub pages actually. And GitHub pages is just static files. So you could just like clone down a copy of that HTML JavaScript CSS bundle and have it offline and that should work. there is this like interesting question though of like, okay, like, there’s certain like formats like for archiving that are like, I think it’s like the library of Congress. They have like standards about like.

Jed Sundwall (45:40.345) Yeah, okay.

Jed Sundwall (45:44.783) Okay. right. Yeah.

Jed Sundwall (45:52.569) Yeah. Yeah.

Brandon Liu (46:08.482) they recommend JPEG as a format because it’s like based on the likelihood of like in like 50 years, there’s like some like library science people that are like, like we have these like historical like scans of like restaurant menus, but how do we open them? Because there’s like this, there’s this image format that like was popular, you know, back in the, in the two thousands and now nobody can read it. So there’s like this open question of like, you know, is,

Jed Sundwall (46:11.461) Right.

Jed Sundwall (46:15.034) Right.

Jed Sundwall (46:21.958) Yeah.

Brandon Liu (46:35.414) is PMTiles like a resilient format in, but like by that standard of measure. And I think that the way the format is designed, it could fit on one page. You know, it’s like, like I know people that have written like a implementation in a different language, like Rust or Swift or something, and they can do it in like a day because the format is intentionally like, like as simple as possible, like going back to

Jed Sundwall (46:55.408) Yeah.

Right.

Brandon Liu (47:03.906) that QOI format, just like, it needs to fit on like one PDF page. It can’t be like a white paper, like 200 page book to be able to write a reader. So like my hope is that even if all of, you know, if GitHub, you know, like it’s blasted into the sun and we lose all the code, but you have to like write a reader for PMTiles, like from scratch, and all you have is the spec, I don’t think it’s that hard. It should be doable.

So even if you didn’t have like that web viewer or a thing on a USB stick, you could figure it out.
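To give a sense of how small a from-scratch reader can start, here is a sketch that parses only the first few header fields. The layout used (a 7-byte `PMTiles` magic, a 1-byte version, then little-endian 64-bit offsets and lengths) is an assumption based on one reading of the published v3 spec and should be checked against the spec itself; `parse_header_prefix` is a hypothetical name, and the sample bytes are synthetic.

```python
import struct

def parse_header_prefix(buf: bytes) -> dict:
    """Parse the start of a PMTiles header (field layout assumed, see lead-in)."""
    if buf[:7] != b"PMTiles":
        raise ValueError("not a PMTiles archive")
    version = buf[7]
    # Assumed: four little-endian uint64 fields follow the version byte.
    root_offset, root_length, meta_offset, meta_length = struct.unpack_from("<4Q", buf, 8)
    return {
        "version": version,
        "root_directory": (root_offset, root_length),
        "metadata": (meta_offset, meta_length),
    }

# Synthetic header bytes for illustration only, not taken from a real archive.
sample = b"PMTiles" + bytes([3]) + struct.pack("<4Q", 127, 500, 627, 80)
print(parse_header_prefix(sample))
```

Each (offset, length) pair is exactly what a range request needs, so a reader can grow from here one field at a time.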

Jed Sundwall (47:39.237) Yeah. Amazing. This is, I mean, this is great. We’ll, we’ll be announcing this right away. but the, the next episode of great data products is with, we were pretty sure it’s going to be with the Harvard, library innovation lab. It’s the Harvard law school library innovation lab. So where I found like my kind of librarians, you know, that are thinking a lot about, you know, understand the benefits of object storage and these, you know, primitive commoditized layers of storage, but they have a lot of thoughts about this and.

we’re talking about many different types of content, but I think, I hope I want to make sure they, they hear this because your thoughtfulness on this, I think is like really, really great. I mean, thinking, you know, the tagline of this podcast is the ergonomics and craft of data. And you’re thinking so far ahead, like, what are the ergonomics of like finding a PMTiles file in the like rubble left after the nuclear like winter and people be like, actually I can figure this out.

What, yeah, great experience you’re thinking of for the future archaeologists. Have, yeah.

Brandon Liu (48:46.668) Right. So just as a comparison point, like it’s probably fine to like sort of bash on Esri stuff here, like, I don’t think I’m, or it’s not bashing on it, but even like a file geodatabase, which is like the FGDB format. There are city governments that publish FGDBs and they expect you to open them. And like most developers that are not into the Esri ecosystem cannot open these files.

Jed Sundwall (48:59.696) Yeah.

Brandon Liu (49:14.656) Like I think like it might’ve been like in New York city, like they distribute their like road network as an FGDB. And you know, that format was maybe designed like 15 years ago. And even then most people I talked to are like, what do do with this file? I have no idea what to do with it. So that’s like an extreme example of like, well, you know, it’s not even a question of like 50 years of like…

of being able to open the file like in 50 years, it’s a question of like even five years later after you publish it, can anyone deal with this thing? And it’s like, well, not really. I think it’s like kind of proprietary or maybe there is some spec, but even things like shapefile, like shapefile like was proprietary from the very beginning, right? And then people sort of like kind of made some, made some like reverse engineered like readers for shapefile.

Jed Sundwall (49:54.747) Right?

Brandon Liu (50:08.79) And even then there’s like undocumented extensions for like doing indexing and stuff on top of shapefile. But it’s like all those things are, I think they sort of like fail this question of like that library tests. Like are people going to adopt this if they are thinking about things, like if they’re trying to preserve things like for the future.

Jed Sundwall (50:29.093) Yeah, absolutely. I mean, this is, you’re thinking the right way, you know, and what’s interesting is that like, Jackson says, GeoPackage. That’s, yeah, there’s an answer there. Yeah. I mean, what’s remarkable about,

Brandon Liu (50:42.382) GeoPackage, yeah.

Jed Sundwall (50:49.489) about the, I mean, just thinking about this, like just how short the history of the internet and computing really is, you know? And so it’s fun to think about what things will be like a hundred years from now or whatever. But like we went through a blip, I would say, where people were like, oh yeah, the way to control the market is by controlling the standards. You know, Microsoft did that very effectively and developed incredible network effects through the DOC and, you know, XLS formats

that have since been effectively opened, but who cares? By this time, the damage is already done. Everybody uses Word and Excel, which I should also say, I’m not mad about. I think they’re great, obviously powerful tools that everyone uses. It’s technology that’s well distributed, so I’m not mad about that. But in the future, we have to think more about exactly what you’re saying, which is just sort of like, how durable is this going to be, really?

And that means being very thoughtful about how you design the spec. And it’s usually gonna be something simple. The only other thing I’ll say here is that like, I don’t wanna seem like I’m picking on PMTiles, cause like if I double click on a PMTiles file, nothing will happen. The same is true for Parquet, right? And so Parquet is like all the rage. So much data on Hugging Face right now is in Parquet. We love having tons of Parquet data on Source.

And I was showing a guy earlier today who’s not really familiar with it, but I opened up on Source and these are my favorite demos. My PMTiles demo is the best demo on Source, because we’ve got a great viewer built in and you can just look at it and it’s easy for people. Thank you for that viewer, the viewer that you created. And then Sylvain also built this Parquet viewer and it’s like, great, like now, you know, I mean, as of today, somebody can drag and drop a Parquet file into Source

and they can look at it in the browser right away. And I showed this guy, I’m like, yeah, here’s a Parquet file, it’s 800,000 rows. And it’s just like streaming right through really easily. And we’re already at a point where there’s so much data out there and so many files are being adopted that like, no one’s even bothering developing a desktop viewer for them. It’s all being done in the browser. Like the expectation is that it’s all gonna be done over the internet, which is amazing.

Jed Sundwall (53:11.395) We got some comments coming through. Yousef from Egypt. Hello, I don’t know, who knows what time it is over there. He says new versions of GDAL can open up FGDB now.

GDAL for the win.

Brandon Liu (53:27.308) I think I saw that. Yeah. I think like my standard workflow now is like I download like the FGDB of like, it’s like New York City road center lines. And then I do like an ogr2ogr and just get it into like a GeoJSON or something. But yeah, I believe there is a solution now. I remember, I think there was one time, like a decade ago where I like downloaded like the ArcGIS Pro trial and like activated the trial just to be able to like open

the FGDB and then like save it out as something else. But I think that like the status quo is better now. Yeah, for sure.
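The ogr2ogr step in that workflow could look roughly like the sketch below. The paths `roads.gdb` and `roads.geojson` are hypothetical placeholders, `ogr2ogr_command` is an illustrative helper, and the conversion only runs if GDAL’s `ogr2ogr` is on the PATH and the input actually exists.

```python
import os
import shutil
import subprocess

def ogr2ogr_command(src, dst, layer=None):
    """Build an ogr2ogr call that writes `src` out as GeoJSON at `dst`."""
    # ogr2ogr takes the destination before the source.
    cmd = ["ogr2ogr", "-f", "GeoJSON", dst, src]
    if layer is not None:
        cmd.append(layer)  # optionally restrict the export to one layer
    return cmd

cmd = ogr2ogr_command("roads.gdb", "roads.geojson")  # hypothetical paths
print(" ".join(cmd))

# Only attempt the conversion when the tool and the input are both present.
if shutil.which("ogr2ogr") and os.path.exists("roads.gdb"):
    subprocess.run(cmd, check=True)
```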

Jed Sundwall (53:59.973) Yeah. Yeah.

Yeah, mean, GDAL, it just…

Shout out to Even. A few more comments on YouTube. Jackson, hello Jackson. He says he’s in the midst of writing an implementation of GeoPackage in Julia. Good luck. Let us know. If you want to write about that on the CNG blog, we have a process for submitting stuff to the blogs. That’d be cool. It’s 2:52 in the morning where Yusuf is. Brandon, you are very popular. People are like, this is incredible.

Sun never sets on the Brandon Liu Protomaps empire. And then we’ve got Sigtil again from Norway, staying awake. I love this, this late night energy we’re getting. Asking, how do you see the new kid in town GeoParquet versus PMTiles? They have some of the same properties and some differences also. As you said, there’s a lot of new clay. Yeah, so yeah, I have Zarr, COG, FlatGeobuf.

Brandon Liu (54:43.394) Cheers.

Jed Sundwall (55:07.481) You’ve explained this to me before, sort of the nuance between like what PMTiles does as opposed to what GeoParquet does. I mean, I have my own guesses about this because it’s, GeoParquet is like more about like data than PMTiles, which is more about viewing. Is that how you would describe it? Or what’s your response there?

Brandon Liu (55:29.944) That’s how I see it. Yeah. Like, so I make the distinction between a format that is for analysis versus a format that’s for visualization. And I think that’s like, maybe not intuitive because in some cases, those are the same. Like for a COG, viewing it and analyzing it are sort of the same because analyzing it means like, what is the value at this pixel? And viewing it is like, show me the raster, you know, colored in some way.

Jed Sundwall (55:38.672) Yeah.

Jed Sundwall (55:46.651) Yeah.

Brandon Liu (56:00.014) For PMTiles, a lot of the use cases right now for PMTiles are vector-based. And for vector, you sort of need to split out the analysis and visualization into separate things. Because if you wanted an overview for a vector dataset, you can’t really show everything. It would be too noisy. So PMTiles is inherently generalized. Like it has like an overview pyramid.

Jed Sundwall (56:21.104) Yeah.

Brandon Liu (56:27.36) So you can load it at any scale and it looks correct. But what you actually see at that level is not everything. You have to do some filtering down of the data. Sort of like for COGs, you have to build overviews that are like smaller and smaller downsampled resolution, like images of the full thing. So GeoParquet does not have a lot of use case overlap with PMTiles because GeoParquet is like

an analytical format that is just like all the raw data and then only one version of each data point. While PMTiles will have copies of a single data point because it has to build those overviews. Now there are like approaches to using GeoParquet and visualizing it directly. Like for example, so there’s a project called Lonboard that lets you like just show

Jed Sundwall (57:13.209) Right, right, right.

Brandon Liu (57:25.922) GeoParquet on a map. Whether or not that’s practical to use on the web really depends, because if you want to be able to download an entire GeoParquet dataset of a city to visualize, that might be 200 megabytes, which is more than people usually expect for a single web page. I mean, it’s possible that in 10 years, bandwidth will be so fast and cheap

that downloading 200 megs for a single webpage might not matter. And maybe we like at that point, we don’t actually need like a visualization format. We can just be downloading raw data like everywhere. But I expect like some sort of strategy around being able to visualize data with overviews is always going to be necessary just because like some datasets are just really big. Like there’s building datasets on source that are like, like maybe half a terabyte, like they’re like open buildings datasets.
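The overview pyramid idea can be made concrete with the standard Web Mercator tile math: every zoom level has its own grid, so a single point lands in one tile per level, which is why a tiled archive ends up holding several generalized copies of the same feature. This is a general illustration, not code from PMTiles or Tippecanoe; the function names and the Seattle coordinates are assumptions chosen for the example.

```python
import math

def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple:
    """Return the (x, y) tile containing a point at `zoom` (Web Mercator tiling)."""
    n = 2 ** zoom  # the grid is n x n tiles at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tiles_in_pyramid(max_zoom: int) -> int:
    """Total tile slots in a full pyramid from zoom 0 through max_zoom."""
    return sum(4 ** z for z in range(max_zoom + 1))

# One point (roughly Seattle) appears once per zoom level in the pyramid.
copies = [(z, lonlat_to_tile(-122.33, 47.61, z)) for z in range(15)]
print(len(copies), "tiles contain this point across zooms 0-14")
print(tiles_in_pyramid(14), "tile slots in a full pyramid to zoom 14")
```

An analytical format stores that point once; a visualization pyramid stores some representation of it at each level where it should be visible.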

Jed Sundwall (58:05.007) Yeah.

Jed Sundwall (58:22.935) Yeah, the VIDA datasets, those are my favorite demos. They’re like 300 gigs or 230 gigs or something like that. Like, yeah, it’s like, it’s only going to be streamed.

Brandon Liu (58:29.795) Yeah.

Jed Sundwall (58:34.095) My assumption is that storage will keep getting cheaper. There’s still plenty of room to progress in terms of the cost of storage itself, but bandwidth, networking has actual physical limits in terms of the speed of light that I think are really compressing space like that. The movement of bytes over space or across space is really hard.

One, actually Qiusheng Wu, awesome to have Qiusheng on here, says that DuckDB supports serving vector tiles through Parquet, so they’re on LinkedIn. So, cool. It’s great. And then we have another, I wanna talk to you about the Hilbert curve. We’re getting at about an hour, so we can maybe start wrapping it up. But then Alex Kovac asks, and I’m gonna test this out, I’m still figuring out how to do this. You can see it, okay, so.

Brandon Liu (59:13.635) Nice.

Brandon Liu (59:30.786) I see it.

Jed Sundwall (59:35.621) How did, I think the people on LinkedIn can’t see this. So this is tooling on PMTiles. And also for the purposes of the people listening after the fact. Alex says tooling around PMTiles such as the viewer, CLI, Tippecanoe, basemaps package, et cetera, is super convenient. How did that evolve? And do you think there’s anything big missing? Yeah.

Brandon Liu (59:59.042) Yes. I think the part I put the most thought into was the overall developer experience of using PMTiles. And from the beginning, it had to be like a single binary you could just download. I did not want you to have to homebrew install or npm install or Python package install a package, just because that’s going to fail for a lot of people.

Jed Sundwall (01:00:12.069) Yeah.

Jed Sundwall (01:00:17.168) Yeah.

Brandon Liu (01:00:27.222) If you’ve ever been to a workshop where people like use Python, like a scientific workshop where people are like, we’re providing the material as like a Jupyter notebook. And then someone’s like, I’m on Windows. And then you’re like, just use Conda. And then you’re like, trying to like fiddle with this, this Conda setup. And I’m like, I don’t, I just don’t like, like, I feel like it, like it pushes people away. Like I understand that like that tooling is mature, but for me, it’s like, I think the best developer experience for any sort of data tooling is like.

Jed Sundwall (01:00:43.469) You’re right.

Jed Sundwall (01:00:49.039) Yeah.

Brandon Liu (01:00:56.46) just download a single binary. Like those are the tools I see having the most adoption and least problems in terms of like the installation. So the installation has to be super simple, like a single download. The viewer we talked about, like the web viewer, is the viewer for PMTiles files to just browse them. I would say if there’s something big missing, I think Tippecanoe is great

and PMTiles support for that is built in, thanks to Felt. But I would say it’s still too hard to install. Like a lot of people that want to build PMTiles, they get stuck on like, how do you install the vector tile generator? I would say that is the biggest missing piece, which is to have a single binary download vector tile engine.

Jed Sundwall (01:01:52.752) Okay.

Brandon Liu (01:01:53.838) Like a lot of the limitation for that is because the libraries you need to do geometry, like geometry processing, are generally only in a couple of languages, like C++, Java. And right now the CLI is like a Go program and there’s no good libraries for that in Go. Even Rust doesn’t have that great of support. You probably need to bring in GEOS via C++ bindings. So the biggest missing part is still like some…

Jed Sundwall (01:02:18.726) Yeah.

Brandon Liu (01:02:22.922) easy-to-install generator for vector tiles that can handle a large amount of data. It’s something I do want to work on, but right now I think the Tippecanoe solution is good enough. But it’s the major pain point for using PMTiles.

Jed Sundwall (01:02:40.175) Yeah. I mean, talk about ergonomics of data. The way you think about this is so great. Everyone, learn from Brandon. You’re so thoughtful. This is also just kind of like, the way I see this is you’re helping level up the species just by thinking through things this way. Because yeah, it’s so goofy. I mean, I’ve been in all these hackathons in these rooms where people are like, yeah, like.

you end up spending half the time debugging people’s Python installations. it’s just like, no, there’s got to be a better way. Yeah.

Brandon Liu (01:03:16.364) Right. There’s also this idea of different kinds of complexity. There’s like inherent complexity versus incidental complexity. And I think a lot of solving these pain points is around solving incidental complexity, which is just complexity that happens to be there as an artifact that is not related to the actual problem we’re solving. Like maybe you’re trying to solve some route optimization problem. And that is it’s…

like is inherently an interesting computer science problem. But then the incidental part is like, I need to like install these packages with Conda and Conda, like, doesn’t like the version on my machine or something. And it’s just like, all that stuff is just the part that we have to eliminate in order to actually get to working on the hard problems.

Jed Sundwall (01:03:54.501) Right.

Jed Sundwall (01:04:07.375) Yeah, exactly. There’s what’s the line? It’s sort of you make the hard stuff easy and the like impossible stuff possible or something. There’s some axiom around like, know, guiding software development along these lines, which is like, we should be continually progressing in that direction. But you’re asking all these great questions or like framing it in the right way, which is just sort of like you imagine somebody who’s coming to a hackathon.

how quickly can you get them up and running? If you’re gonna take an SD card into the forest, what can you actually do with that, realistically? And I often think in terms of, this is what I was saying before about Excel and Word being very successful, is that they are sufficiently distributed technologies. The whole idea that the future’s already here, it’s just not evenly distributed. There are some that are evenly distributed, like spreadsheet software.

Like everyone can open a CSV. Like that’s awesome. CSVs are great, like because of that. But you know, as we’re getting better at producing more complex forms of data, we need to think about the ergonomics in that way. Like what are the experiences of people being introduced to this? So, Yusuf says that Tippecanoe on Windows is a nightmare, by the way. So FYI.

Brandon Liu (01:05:29.474) heard that as well. Yeah. Yeah, I’m aware.

Jed Sundwall (01:05:33.881) So I remember years ago I asked you if you’d ever seen the movie Tár.

Brandon Liu (01:05:39.367) which I still haven’t, but I need to now that you’ve mentioned it twice.

Jed Sundwall (01:05:41.027) Okay, well, I’m just like, it’s a, Tár is a weird, Tár fans, come out and tell me if you’ve watched the movie Tár. It’s Tár with an accent on the A, it’s a Todd Field movie, in which David Hilbert is a character of sorts. Like he just shows up in the background and I think there are references in the movie to the Hilbert curve.

Tell me about the Hilbert curve. Let’s close on this. Why the Hilbert curve and how did you get into space filling curves? I love this stuff.

Brandon Liu (01:06:18.574) I kind of ripped it off of S2. So S2 is Google’s geospatial indexing library, and they use the Hilbert curve there. It has some nice properties that make it work well for geodata. And the motivation behind this is even in Cloud-Optimized GeoTIFF,

Jed Sundwall (01:06:26.883) Okay. Yeah.

Jed Sundwall (01:06:32.934) Yeah.

Jed Sundwall (01:06:38.916) Okay.

Brandon Liu (01:06:48.088) People argue about like, like, so we’re making like a cloud, like a cloud optimized format, but like how big should the blocks be? You know, you’re like fetching blocks. If you have small blocks, those are good for certain use cases. If you have big blocks, those are good for like, for more like bulk downloading use cases, it’s more efficient. And there’s some trade-off between small blocks and large blocks. But the Hilbert curve is like a way to like, it’s like a lazy way to get around that argument.

which is because like it’s both small blocks and big blocks in the same, like in the same format. You can actually have any size block as long as it’s a power of two. And the reason this is good for PMTiles is because one of the operations on PMTiles is for extracting one part of the world from a larger file. And the imagined use case for this is, so I host my OpenStreetMap data set on the cloud.

Jed Sundwall (01:07:26.801) Yeah.

Brandon Liu (01:07:46.36) But maybe you only care about Seattle. You don’t want to have a copy of 100 gigs of the whole world. You only want Seattle. Or maybe you only want Capitol Hill. So the block size in the archive should be small if you only care about a neighborhood. But if somebody else wants all of Canada instead, then they want to be able to have a format that has big blocks so they can download Canada in one chunk.

So the Hilbert curve is useful because it encompasses both of those use cases without having to make a trade off. Because if you did small blocks, it would be good for Capitol Hill, it would be bad for Canada. If you did big blocks, it’d be good for Canada, it’d be bad for Capitol Hill. So because the Hilbert curve is sort of scale-free, it has the same self-similar structure at every power of two.

you sort of get the best of both worlds in one thing. And that’s really the motivation for why the Hilbert curve was useful for this design. I would say it’s not fundamentally essential. You could build a pretty good format just using other space-filling curves, like a Z-order curve. There are some drawbacks, in that it’s more computationally expensive to decode the Hilbert curve versus other ones.
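The "small blocks and big blocks in the same format" point can be sketched in a few lines of Python. This is a minimal illustration, not the PMTiles implementation: it uses the textbook Hilbert mapping for a single n-by-n grid of tiles, and the names `hilbert_index` and `block_range` are hypothetical. It checks that every aligned power-of-two block of tiles lands in one contiguous run of curve positions, whatever the block size.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) on an n-by-n grid (n a power of two) to its
    position along the Hilbert curve (textbook algorithm)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def block_range(n, bx, by, size):
    """Lowest and highest curve position inside the aligned
    size-by-size block whose corner is (bx, by)."""
    ids = [hilbert_index(n, x, y)
           for x in range(bx, bx + size)
           for y in range(by, by + size)]
    return min(ids), max(ids)


n = 8  # an 8x8 grid of tiles, i.e. one zoom level
for size in (1, 2, 4, 8):
    for bx in range(0, n, size):
        for by in range(0, n, size):
            lo, hi = block_range(n, bx, by, size)
            # A contiguous run: exactly size*size positions, no gaps.
            assert hi - lo + 1 == size * size
```

A neighborhood-sized request and a country-sized request can therefore both be served as a single run of the same ordering, which is the trade-off being sidestepped here.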

Jed Sundwall (01:08:43.611) Yeah.

Jed Sundwall (01:09:04.571) Okay.

Brandon Liu (01:09:12.59) For example, there are these Bing quadkey tile indexes that are much faster to compute than the Hilbert curve. For most use cases though, the cost of decoding and encoding the Hilbert curve is trivial compared to the network. If it spends two milliseconds doing a bunch of tile coordinates on Hilbert, you’re still spending 50 milliseconds fetching something over the network.
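For comparison, a quadkey is cheap to compute because it only interleaves the bits of the tile's x and y coordinates, with no per-level rotation step. A minimal sketch of the conversion described in the Bing Maps Tile System documentation, with the function name `tile_to_quadkey` chosen here for illustration:

```python
def tile_to_quadkey(x, y, zoom):
    """Interleave the bits of tile x/y into a base-4 string,
    one digit per zoom level, most significant level first."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


print(tile_to_quadkey(3, 5, 3))  # prints 213
```

Sorting tiles by quadkey gives a Z-order layout: aligned blocks are still contiguous, but consecutive positions can jump across the map, which the Hilbert ordering avoids.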

Jed Sundwall (01:09:13.085) interesting.

Jed Sundwall (01:09:22.576) Okay.

Brandon Liu (01:09:42.39) So like overall, like holistically, the price you pay for using the Hilbert curve is not that much relative to other things going on in like in some actual use case. But that’s like kind of the whole story as to why we use this like weird thing that is apparently in a movie as well.

Jed Sundwall (01:10:03.813) Yeah, I mean, just the movie. I turned the light red again, just because it’s kind of a spooky movie. BV on YouTube asked a question about whether H3 grids are similarly useful, but one thing I want to clarify about the Hilbert curve, to make sure I understand it, which I’m pretty sure I don’t, is that the idea is that you can map two dimensions along one dimension.

Brandon Liu (01:10:10.03) Yeah.

Jed Sundwall (01:10:32.581) Right? Like, you just have one string that can be extended into two dimensions, effectively anywhere at any resolution you want. If I’m loading up the Canada tile, am I just loading up one band? Like, how does it work? Or is it making multiple requests to do that? Can you explain that even? It sounds like the kind of thing you would need a whiteboard to describe, but.

Brandon Liu (01:10:59.086) Yeah, you’re opening up multiple like, so if the entire world is on one length of string, then Canada is multiple segments of that range of string. Now, where you can adjust is how finely traced the borders of Canada are because

Jed Sundwall (01:11:07.535) Yeah.

Jed Sundwall (01:11:12.367) Yeah.

Jed Sundwall (01:11:26.608) Yeah.

Brandon Liu (01:11:27.15) If you’re working in a networked environment, you can do some optimizations. You can say, I’m going to grab a little bit more data than I need, but have fewer ranges. I can represent Canada using fewer segments of string, even though I get a little bit of America on the side.
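The "fewer ranges, a little extra data" trade can be sketched as merging byte ranges whose gaps are below some tolerance. This is an illustrative sketch only, not how any Protomaps tool is implemented; `coalesce_ranges` and `max_gap` are hypothetical names, and ranges are half-open `(start, end)` byte offsets.

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (start, end) byte ranges whose gap is at most max_gap bytes.
    A larger max_gap means fewer requests but more unneeded bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it over the gap.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


wanted = [(0, 100), (120, 200), (1000, 1100)]
print(coalesce_ranges(wanted, 0))     # three separate requests
print(coalesce_ranges(wanted, 50))    # two requests, 20 extra bytes fetched
print(coalesce_ranges(wanted, 1000))  # one request, 820 extra bytes fetched
```

Tuning the tolerance is the same choice as how finely to trace the border: a tight tolerance follows the outline closely with many small requests, a loose one grabs some neighboring area in exchange for fewer round trips.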

Jed Sundwall (01:11:31.76) Yeah.

Jed Sundwall (01:11:38.459) Yeah.

Jed Sundwall (01:11:50.235) Right.

Brandon Liu (01:11:55.448) Pretty much that, like there isn’t really one Canada tile, but you can sort of trace out a contiguous segment of the file that is all next to each other, that is all inside of Canada. And then maybe grab a little bit on the sides for like different outline areas. But the interior of Canada, as long as it’s like an area, you know, like most countries in the world or most regions are not like Chile where it’s just like one long thing.

most of them are like kind of rectangular-ish, you know, they have like an interior and then like a border. So this sort of space filling curve is well suited to how people usually think about areas as having like an internal volume and then being able to slice that into just parts of this space filling curve without having to, you know, like use an excess of

Jed Sundwall (01:12:23.024) Yeah.

Jed Sundwall (01:12:47.834) Okay.

Got it. And then one follow up question on that from the chat is that, is there a benefit here that also these requests are close to each other? Meaning like, you want to look at the full Canada tile and then like the Vancouver tile, should they be near each other? My intuition though is that that shouldn’t matter with object storage and range requests, because it’s not like you’re.

He’s saying like, it’s similar to like how you defragment an old spinning hard drive, but like, that’s not how object storage works. I mean, we’re not assuming that we’re using spinning disk. We might be, but do you have any insight there? Yeah.

Brandon Liu (01:13:30.55) Right, so it matters a lot on HDDs because on those old spinning hard drives, you have to move the needle more if they’re not by each other. But I think most storage now is solid state and there’s not a huge difference in the seek time for a far away chunk versus a near chunk. But yeah, there are also benefits to certain operations. Just having parts that are close in space also be close in the file.

Jed Sundwall (01:13:36.453) You have a head. That’s right. That’s right.

Brandon Liu (01:13:59.276) that is taken advantage of in some parts of the tool.

Jed Sundwall (01:14:02.745) Okay. And then, do you have opinions about H3? I mean, so BV is asking, are H3 grids similarly useful? I see it as probably not, but I don’t know how H3 content is. H3 is more of like an indexing concept. know.

Brandon Liu (01:14:20.95) H3 is really useful for visualization. Yeah, I think it’s like, so with H3, you’re usually storing a value in each cell. And I think it’s really great for making really good looking visualizations of data with hexagons. There are some trade-offs, like in H3, one hexagon does not perfectly nest

Jed Sundwall (01:14:34.885) Right.

Jed Sundwall (01:14:41.072) Yeah.

Brandon Liu (01:14:49.578) its child hexagons, while in tiles there is a perfect nesting. But for certain use cases like showing aggregate statistics, it doesn’t matter. So I would say H3 grids are the perfect match for certain use cases around visualization that are separate from doing tiling.
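The "perfect nesting" of tiles is easy to state in code: in the usual z/x/y scheme, each tile splits into exactly four children at the next zoom level, and each child maps back to exactly one parent. A minimal sketch, with `children` and `parent` as illustrative helper names:

```python
def children(z, x, y):
    """The four tiles at zoom z + 1 that exactly cover tile (z, x, y)."""
    return [(z + 1, 2 * x + dx, 2 * y + dy)
            for dy in (0, 1) for dx in (0, 1)]


def parent(z, x, y):
    """The single tile at zoom z - 1 that contains tile (z, x, y)."""
    return (z - 1, x // 2, y // 2)


tile = (3, 5, 2)
# Every child belongs to exactly this parent, with no overlap or spill.
assert all(parent(*child) == tile for child in children(*tile))
```

H3 cells have no equivalent exact relationship, since a parent hexagon only approximately covers its child hexagons.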

Jed Sundwall (01:14:54.938) Right, right.

Jed Sundwall (01:15:08.965) Right.

Yeah, exactly. Yeah, that’s sort of my understanding. And it is especially good for like visualization, but then also like statistics. Like, so if you’re doing like analysis on, I mean, you just think about the origins of it with Uber wanting to measure demand and activity in very, like very certain areas of different grains. It’s like perfect for that. So, okay. Well, look, we’ve been going for an hour and 15 minutes. This is incredible. We’ve got…

people staying up to all sorts of crazy hours. Guys, go to bed. Again, this is a podcast. This audio will go out so you can listen to it whenever. But really, we have been honored. People are honoring us with their time. I hope this has been interesting for them. Brandon, I love talking to you. I obviously love what you’re doing. We’re very proud to have you as a Radiant Earth Fellow and have had you as a fellow for a long time.

Man, are you serious? It’s this, Sigtil in Norway won’t let up. He’s got to go to bed, but he’s asking more questions. Are there some geometries that are not supported or more difficult? For instance, polygons with, boy, with holes and holes made of curves, et cetera. What was the most difficult geometry to work with across tiles? This is too hard of a question. Are you seeing this comment? Go for it.

Brandon Liu (01:16:33.676) No, I’m able to address this. Yeah, I mean, this is a good, deep question, but it goes back to what I was saying, which is that there are certain geometries that are hard to deal with. And a lot of it is you have to have a geometry library that is very robust against certain numerical precision errors. And the only libraries right now that get it totally right are basically like GEOS, which is part of

Jed Sundwall (01:16:37.295) All right, do it and then we’ll wrap it up. Okay.

Jed Sundwall (01:16:46.8) Yeah.

Jed Sundwall (01:16:58.746) Yeah.

Brandon Liu (01:17:03.192) PostGIS, and JTS, which is a Java library that is related to GEOS. And then a couple other ones, like there’s one that Mapbox made. But yeah, that difficult geometry is the limitation in being able to write an easy-to-install vector tile generator. So I’m happy to follow up over email or something if you wanna know more about geometry processing, cause it’s a really deep

Jed Sundwall (01:17:10.566) Yeah.

Brandon Liu (01:17:33.046) subject that sort of is a stealth hard problem. People don’t realize how hard that problem is until they find some weird geometry that’s broken. But yeah, that is a good question. And again, I’m happy to talk about it more.

Jed Sundwall (01:17:50.991) Okay, and then, so to contact you, I’m gonna just put in the chat, protomaps.com, go to protomaps.com, there’s info down at the bottom with how to reach you. So, you’re easy to reach. Obviously, everyone listening to this knows how thoughtful you are. So, anyway, I mean, thanks so much for what you’ve given to our community.

Can’t thank you enough. Anything else you want to talk about that we missed?

Brandon Liu (01:18:27.662) I just wanted to say thanks for having me on the podcast. I am also on the CNG Slack, the Source Cooperative Slack, which one do you want people to use? If people are CNG members, then they can join.

Jed Sundwall (01:18:38.373) That’s right, yeah.

Well, yeah, so CNG members, you got to be a member. For both, you kind of have to be a member. So membership to CNG is pretty cheap. We say it’s a symbolic fee. These memberships don’t really add up to pay many bills, but we ask people to pay to join CNG just to make sure that we know that people are there on purpose. They really want to be there. So join CNG if you’re not a member, and Brandon’s in the Slack there. Source is still invite-only.

But Source, so yeah, the best point of entry right now is the CNG, the Cloud Native Geo Slack. You can go to cloudnativegeo.org slash join and learn about it there. I’ll put that in the chat as well. But yeah, thank you. Yeah, it would be great to see people interacting with Brandon on any of our Slacks, but he’s easy to find otherwise.

All right. And then, what is it? 8:17 in the morning there now.

Brandon Liu (01:19:44.032) It is, yeah. It’s red and early.

Jed Sundwall (01:19:45.435) You got a whole day ahead of you. All right, well, happy Thursday. Thanks again for doing this. I bet we’ll do it again.

Brandon Liu (01:19:54.498) Awesome, yeah, I’m looking forward to the next episodes.