June 1, 2010

Mapping Challenges

I started working on this project about 3 weeks ago in my spare time. I spent the first few days just trying to make some pretty pictures with some basic data: a DEM, some contours, some summits, etc. Once I convinced myself that what I had in mind was possible, I began to work in earnest. It quickly became clear that there were four very different aspects to making and sharing a map on the web:

Data
Tools and Infrastructure
Cartography
Publishing

Data

Finding geodata for a specific area, especially a popular area like the Sierra Nevada, is generally not too difficult. The problem I've had is piecing together a cohesive dataset from many sources. Take hiking trails, for example. I assembled them from three different data sources: a Yosemite trails shapefile from the NPS, a Sequoia/Kings Canyon trails shapefile from the NPS, and a generic trails 'extract' from the Forest Service.

Of course, the schemas for these shapefiles were all completely different. The Yosemite and Sequoia/Kings Canyon data had trail names; the Forest Service data had none. While the Forest Service data had large chunks missing in the National Park regions, there was still considerable overlap that I cleaned up manually. After a few hours of work I have a single shapefile that covers the trails in my region of interest, but that's just the beginning. I want all of these trails (at least the big ones) to be labeled nicely, and I'm going to have to go through and do that manually.
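As a rough illustration of the schema problem (this is not the script I actually used, and the field names below are made up for the example), mapping each source's attributes onto one common schema might look something like this in Python. Records that come out without a name are the ones that still need manual labeling.

```python
# Hypothetical per-source field names; the real NPS and Forest Service
# attribute names are different and would need to be looked up.
FIELD_MAPS = {
    "yosemite": {"TRAILNAME": "name"},
    "seki": {"TRLNAME": "name"},
    "usfs": {},  # this source carries no trail names
}


def normalize(record, source):
    """Map one source record (a dict of attributes) onto a common schema."""
    out = {"name": None, "source": source}
    for src_field, dst_field in FIELD_MAPS[source].items():
        value = record.get(src_field)
        if value:
            out[dst_field] = value.strip()
    return out


def unlabeled(records):
    """Return the normalized records that still have no trail name."""
    return [r for r in records if r["name"] is None]
```

Keeping the source name on every record makes it easier to trace a bad segment back to the file it came from when cleaning up overlaps.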

Similar story for just about every other substantive data layer:

TIGER road data varies by year, with some of the earlier releases having more detail on the dirt/private roads that I care about. I still haven't found a public data source that indicates whether a stretch of road is paved or not.
I found many sources of populated places: some have way too many places, some not enough, some include population, some don't.
Vegetation, hydrology, GNIS -- all have consistency and accuracy problems.

Hopefully I'll get a chance to go into detail on each of the above data layers in separate blog posts. Suffice it to say that data wrangling, processing, converting, munging, and fact-checking have taken a large chunk of time so far and will continue to do so for the foreseeable future.

Tools and Infrastructure

The next challenge is the 'engineering' part, dealing with all of the software to create approximately 220,000 map tiles. The software I'm relying on:

Mapnik -- the backbone of the operation, it generates the map tiles,
GDAL/OGR -- for data munging,
Postgres/PostGIS -- for storing my processed data for Mapnik,
Eclipse with PyDev -- for writing my Mapnik scripts,
GlobalMapper -- it may have a painful UI at times and it's not free but it's a quick way to view data and can convert just about any geodata format.

I feel that, for the most part, the bulk of my work in this department is done. I've got a decent setup where I can test out cartographic ideas on my home machine and do the processing 'in the cloud' (see below).
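To get a feel for where a number like 220,000 tiles comes from, here is a minimal sketch that counts the tiles covering a bounding box over a range of zoom levels, using the standard slippy-map tile formulas. The function names are my own for this example, and any bounding box and zoom range you feed it are assumptions; I'm not claiming these are the exact extents of my map.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Standard slippy-map (spherical Mercator) tile indices for a point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edges stay inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_count(bbox, zooms):
    """Count tiles covering bbox = (west, south, east, north) over zooms."""
    west, south, east, north = bbox
    total = 0
    for z in zooms:
        x0, y0 = lonlat_to_tile(west, north, z)  # top-left tile
        x1, y1 = lonlat_to_tile(east, south, z)  # bottom-right tile
        total += (x1 - x0 + 1) * (y1 - y0 + 1)
    return total
```

Calling something like tile_count((-120.5, 35.5, -117.5, 39.5), range(0, 16)) with a hypothetical Sierra-sized box shows how quickly the count grows, since each extra zoom level roughly quadruples the number of tiles.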

Cartography

I think munging data will be the most time-consuming piece of the process, but that's generally easy work. The hard part is the cartography. A map is only as good as the decisions that were made in creating it. Should the hillshading be darker/lighter? Should I use 'Google Mercator' or something better? Should I include 200' contour lines at this zoom level? At what scale should the river flow stroke get smaller? How much should I accent the SPS summits versus other summits? There are thousands of such decisions to make. I plan on writing about these decisions and why I made them one way or another.
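Many of these zoom-level questions come down to how much ground a pixel covers. As a hedged sketch, assuming 256-pixel tiles in spherical ('Google') Mercator, the ground resolution and the projection's scale exaggeration at a given latitude can be estimated like this (the function names are just for illustration):

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by spherical Mercator
TILE_SIZE_PX = 256          # assumed tile size


def ground_resolution(lat_deg, zoom):
    """Approximate meters per pixel at a latitude and zoom level."""
    equator_m_per_px = 2 * math.pi * EARTH_RADIUS_M / TILE_SIZE_PX
    return equator_m_per_px * math.cos(math.radians(lat_deg)) / (2 ** zoom)


def mercator_scale_factor(lat_deg):
    """How much Mercator stretches distances at a latitude (1.0 at the equator)."""
    return 1.0 / math.cos(math.radians(lat_deg))
```

Comparing ground_resolution at two adjacent zoom levels for a latitude in the Sierra gives a quick sanity check on whether, say, 200' contours would be a few pixels apart or hopelessly crowded on steep terrain.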

Publishing

How do I go about creating my tiles and publishing them? This falls in the 'engineering' category as well and has essentially nothing to do with cartography. Personally, I quite enjoyed learning about all of the potential options, but I can easily imagine would-be cartographers not sharing their work because of the technological burden.

I'll write more about it later, but the bottom line is that I'm using virtual machines on Amazon's EC2 to generate the tiles (about 6 hours on a medium instance) and storing them on S3 (another 6 hours to upload), where they are served via CloudFront. A cheap web host then serves the static OpenLayers-based map viewer. Time will tell how expensive this is...
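For a sense of scale, here is the back-of-the-envelope arithmetic behind those numbers, taking the approximate figures above (about 220,000 tiles, about 6 hours each for rendering and uploading) at face value:

```python
def tiles_per_second(tile_count, hours):
    """Average throughput for a batch of tiles processed over some hours."""
    return tile_count / (hours * 3600.0)


# Approximate figures from this post; both steps took roughly 6 hours.
rate = tiles_per_second(220_000, 6)
seconds_per_tile = 1.0 / rate
```

That works out to roughly ten tiles per second, or about a tenth of a second per tile, for each step.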

So there you have it. I feel like I have the infrastructure and publishing pieces solved, so now I can focus my energy and time on the data and cartography and eventually produce a map that achieves my goals.
