June 26, 2010

Fixing GNIS Data

I've been working quite a bit on adding non-SPS peaks to the map. It's been a mix of learning some new skills and repetitive, tedious work.

This all started with California GNIS data. After filtering down to just the summits, I converted it to a shapefile and threw it on the map expecting it to work out pretty well. Not so much. Here's a good example of what I got:
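For the curious, the "filter down to just the summits" step can be sketched in a few lines of Python. This is not the code I actually ran; it assumes a pipe-delimited GNIS text export with FEATURE_NAME, FEATURE_CLASS, PRIM_LAT_DEC, and PRIM_LONG_DEC columns, and the function name and sample rows are made up for illustration.

```python
import csv
import io


def read_summits(lines):
    """Yield (name, lat, lon) for every row whose feature class is 'Summit'.

    Assumes a pipe-delimited file with a header row; the column names
    below are an assumption about the GNIS export format.
    """
    reader = csv.DictReader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["FEATURE_CLASS"] != "Summit":
            continue
        yield (
            row["FEATURE_NAME"],
            float(row["PRIM_LAT_DEC"]),
            float(row["PRIM_LONG_DEC"]),
        )


# Made-up sample rows, not real GNIS records.
sample = io.StringIO(
    "FEATURE_NAME|FEATURE_CLASS|PRIM_LAT_DEC|PRIM_LONG_DEC\n"
    "Example Peak|Summit|37.1|-118.5\n"
    "Example Lake|Lake|37.2|-118.6\n"
)
summits = list(read_summits(sample))
```

With a real file you would pass an open file object instead of the StringIO sample.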

As you can see, the GNIS summit locations of Columbine Peak and Isosceles Peak are nowhere near where they should be. While this example is particularly bad, I'd say that almost all of the summits were unacceptably displaced from their true locations.

I knew early on that I'd want the ability to manually modify features on the map, so this became the impetus to make that happen. First, I needed a back-end to store my modifiable point data. Borrowing some ideas from OpenStreetMap's schema, I went with a simple two-table design that stores points and key/value data. I'd really like to have something like OSM's revisioning system, but at this point that's overly complex. Here's my simple schema:

mysql> desc points;
+-----------+---------+------+-----+---------+----------------+
| Field     | Type    | Null | Key | Default | Extra          |
+-----------+---------+------+-----+---------+----------------+
| point_id  | int(11) | NO   | PRI | NULL    | auto_increment |
| latitude  | double  | NO   |     | NULL    |                |
| longitude | double  | NO   |     | NULL    |                |
+-----------+---------+------+-----+---------+----------------+

mysql> desc point_tags;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| point_id | int(11)      | NO   | PRI | 0       |       |
| tag      | varchar(255) | NO   | PRI | NULL    |       |
| value    | varchar(255) | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+

Next I coded up some CRUD functions to bridge the gap between Python and MySQL. This allows me to treat objects in my point database as simple Python dictionaries. From here it was pretty trivial to import whatever GNIS data I wanted and stuff it in my database.
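Here's a minimal sketch of what such CRUD functions could look like. It is not my actual code: it uses Python's built-in sqlite3 in place of MySQL so it runs without a server, and the function names and the dictionary shape (a nested "tags" dict) are assumptions made for illustration. The two tables mirror the schema above.

```python
import sqlite3


def connect():
    """Open an in-memory database with the two-table layout (SQLite stand-in)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE points (
            point_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            latitude  REAL NOT NULL,
            longitude REAL NOT NULL
        );
        CREATE TABLE point_tags (
            point_id INTEGER NOT NULL,
            tag      TEXT NOT NULL,
            value    TEXT,
            PRIMARY KEY (point_id, tag)
        );
    """)
    return db


def _write_tags(db, point_id, tags):
    # Replace all tags for the point with the given dict.
    db.execute("DELETE FROM point_tags WHERE point_id = ?", (point_id,))
    db.executemany(
        "INSERT INTO point_tags (point_id, tag, value) VALUES (?, ?, ?)",
        [(point_id, tag, value) for tag, value in tags.items()],
    )


def create_point(db, point):
    """Insert a point dict and return its new point_id."""
    cur = db.execute(
        "INSERT INTO points (latitude, longitude) VALUES (?, ?)",
        (point["latitude"], point["longitude"]),
    )
    _write_tags(db, cur.lastrowid, point.get("tags", {}))
    db.commit()
    return cur.lastrowid


def read_point(db, point_id):
    """Return the point as a dict, or None if it does not exist."""
    row = db.execute(
        "SELECT latitude, longitude FROM points WHERE point_id = ?", (point_id,)
    ).fetchone()
    if row is None:
        return None
    tags = dict(db.execute(
        "SELECT tag, value FROM point_tags WHERE point_id = ?", (point_id,)
    ))
    return {"point_id": point_id, "latitude": row[0],
            "longitude": row[1], "tags": tags}


def update_point(db, point):
    """Overwrite the coordinates and tags of an existing point."""
    db.execute(
        "UPDATE points SET latitude = ?, longitude = ? WHERE point_id = ?",
        (point["latitude"], point["longitude"], point["point_id"]),
    )
    _write_tags(db, point["point_id"], point.get("tags", {}))
    db.commit()


def delete_point(db, point_id):
    """Remove a point and all of its tags."""
    db.execute("DELETE FROM point_tags WHERE point_id = ?", (point_id,))
    db.execute("DELETE FROM points WHERE point_id = ?", (point_id,))
    db.commit()
```

Moving a summit then amounts to read_point, changing the latitude and longitude in the dictionary, and update_point.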

So far OpenLayers had proven to be pretty easy to use, so I decided to give it a go as my editing front-end. The next step was a layer for getting between the front- and back-ends. A friend of mine suggested that writing a WSGI application would be about as simple as it gets, and he was right. After about another 100 lines of code I had a mechanism for loading and storing data from a web page using JSON as the intermediary format.
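To give a flavor of how little a WSGI application needs, here's a stripped-down sketch (in present-day Python 3, not the code I wrote). It assumes GET returns every point as JSON and POST saves one point; the in-memory POINTS dict stands in for the database, and the `call` helper is a hypothetical convenience for exercising the app without a web server.

```python
import io
import json
from wsgiref.util import setup_testing_defaults

# In-memory stand-in for the point database (made-up sample point).
POINTS = {
    1: {"point_id": 1, "latitude": 37.0, "longitude": -118.0,
        "tags": {"name": "Example Peak"}},
}


def _json_response(start_response, status, payload):
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def app(environ, start_response):
    """GET lists all points as JSON; POST stores one point sent as JSON."""
    method = environ["REQUEST_METHOD"]
    if method == "GET":
        return _json_response(start_response, "200 OK", list(POINTS.values()))
    if method == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        point = json.loads(environ["wsgi.input"].read(length))
        POINTS[point["point_id"]] = point
        return _json_response(start_response, "200 OK",
                              {"saved": point["point_id"]})
    return _json_response(start_response, "405 Method Not Allowed",
                          {"error": "method not allowed"})


def call(method, payload=None):
    """Invoke the app directly with a synthetic request (no server needed)."""
    raw = json.dumps(payload).encode("utf-8") if payload is not None else b""
    environ = {"REQUEST_METHOD": method,
               "CONTENT_LENGTH": str(len(raw)),
               "wsgi.input": io.BytesIO(raw)}
    setup_testing_defaults(environ)  # fills in the remaining required keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)
```

To serve it for real, something like `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library would do for local testing.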

After a bit of hacking on OpenLayers I had my end-to-end (albeit a bit fragile) solution. I could select any point, drag it to its correct location (oftentimes referring to the USGS topos), and save it back to the server. Getting to this point was fun as I got to toy around with several new technologies. But now the grunt work...

The problem is that there are about 2000 named summits in the bounds of my map. Ideally I would look at each one and decide where it should be plotted on the map. I started this process and realized that it was going to take a while without help (want to help? send me an email at dan at closed contour dot com). Narrowing my focus to summits in the vicinity of SPS peaks allowed me to actually finish the job without going insane. In the process I realized that my existing SPS peak locations weren't ideal (see below) so I fixed those too. (Random side note: it turns out that there are only 11 SPS peaks that aren't officially named in GNIS.)
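One way to pick out "summits in the vicinity of SPS peaks" automatically is a great-circle distance check. The sketch below is only an illustration of that idea, not a description of what I did: the radius, the function names, and the sample coordinates are all made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def summits_near(summits, peaks, radius_km):
    """Keep summits within radius_km of at least one peak.

    Both arguments are lists of (name, lat, lon) tuples.
    """
    return [
        s for s in summits
        if any(haversine_km(s[1], s[2], p[1], p[2]) <= radius_km for p in peaks)
    ]


# Made-up coordinates for illustration only.
peaks = [("Peak A", 37.0, -118.0)]
candidates = [("Near Summit", 37.01, -118.0), ("Far Summit", 38.0, -118.0)]
nearby = summits_near(candidates, peaks, radius_km=5.0)
```

This brute-force comparison checks every summit against every peak, which is fine for a couple thousand summits.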

With that work behind me I moved on to locating all of the GNIS 'Gap' entries -- passes, saddles, etc. There are only a couple hundred of these guys, so it wasn't too time-consuming. I should have new tiles with these new point features up by the end of the weekend.

At this point I feel like I have a workable system for modifying point-based data on the map. There's a bunch of point features I want to add in the near future: trailheads, campgrounds, and glaciers (all 13 of them). Then, it's on to line data...
