I worked on a web-based descision support system (dss.mahlathini.org) aimed at helping smallholders and facilitators determine what farming practices best suit their physical environment and available resources.
A functional requirement was to automatically get the slope of a given coordinate found on a map UI. Though there are APIs available that can get the altitude of a coordinate, I wanted to have as little dependence on outside sources as possible, as many freely-available sources I found had been shut down.
A restriction on the application was that I wanted to keep server CPU load low. This meant all processing must be done client-side. The only client-server interaction is delivery of files, but the speed of the program client-side is still a concern.
The altitude data used is NASA’s Shuttle Radar Topology Mission (SRTM) 90m dataset. This provides a 90m precision, and is as good as it gets for South Africa – 30m and 10m variants are available for some more developed countries.
I downloaded the tiled data in .asc(ASCII) format. This data totalled 1.84GB, split into 10 different files – each file averaging around 170MB. Note that these values are given without gzip compression, but gzip is enabled on the server, making everything significantly smaller.
Due to the way HTTP works, it is impossible to only download a portion of a file without server-side code that can read that portion of the file and deliver it to the client. This means that any time we would want to find a value in the map, we would have to download 170MB just to find one value – very wasteful. Obviously this solution is relatively simple when a database is available.
To make delivery more optimal, I wrote a simple program on my laptop to partition the tiles. The program takes a glob-syntax string as input for all the map tiles, the output directory, and the number of times to divide each tile (which should be a multiple of the number of tiles).
Partitioned tile lookup
Instead of having 10 files each around 170MB, what if we instead had 10000 files each 170KB? And then if we knew which file corresponded to which region of geographical space, we can simply load that file and parse it.
This requires 3 types of asc files:
- The tile index
- The partition index
- The partitioned tiles themselves
The partitioned tile files are the result of partitioning each of the input tile asc files into smaller parts. There will be n^2 for each tile.
The partition index asc files each contain cells whose values correspond to the path of a partitioned area of the map (the partitioned cell file).
The tile index asc file is the entry point to finding the partitioned tile. Each cell value corresponds to the path of a partition index.
This gives us a simple tree structure that serves as an index for lookup:
- Look in the index for which cell my coordinate (x, y) falls, and get the z-value.
- Look in the tile index I was pointed to in the previous step for which tile my coordinate (x, y) falls and get the z-value.
- Look in the tile that the previous step’s z-index pointed to. I am now in the tile that will contain my value. Look for the cell that my coordinate (x, y) falls into to get the value.
The format of .asc files is quite simple. The first six lines are the metadata of the file, and the rest are a table, each containing the z-value of a cell:
The format is incredibly easy to reason, which made it simple to write an encoder and decoder.
Something I don’t like so much about the format is that it doesn’t provide any means for determining the exact length of the header quickly. This means one has to read the file byte-by-byte until the sixth \n is reached. A solution would be for the first line to contain a number that says how many bytes the header is.
Additionally, each of the key-value pairs in the header are separated by a number of spaces. It would be more optimal if they were separated by \t tabs instead.
Another gripe I have with the format is that each of the cells’ values can be a random length, and they are only separated by a space. This means one cannot simply move the cursor to position (row bytes, col bytes) and expect to be at the right place – instead (only to travel in the x-direction) it is necessary to iterate over the line byte-by-byte, splitting by a space. It would be better if all values were represented in a larger base, and all values were the same length.
Obviously these pitfalls are solved by GeoTIFF files, but they are less intuitive and harder to handle in code.
The original SRTM data is 1.84GB split into 11 tiles, each 6000×6000 pixels:
The tile index references where each of these tiles are in space:
The partition references where each of the partition indexes are in space:
The partition index is used to load actual image data which has been partitioned to fit the size of each of the pixels:
Here, each tile is 353×353 pixels at under 1MB, and, when gzipped, is considerably smaller.
The result is a working client-side lookup of files that requires 1/10000th the bandwidth, and uses significantly less client-side processing time (since there is less data to process).
It also served as an interesting application of my knowledge obtained from networks (encoding and decoding of different formats I did not know before) and data-structures and algorithms (creating a more-efficient index).