How Fit is Your GIS Data?
Seems like a bit of an odd question, but it's a good one, and we need to be asking it more often. At the heart of it is a common concern: what needs and questions can be appropriately answered with a given GIS dataset? The things we map change. A small stream today can be a sizable river in the rainy season. Today's GIS community wants to build and use "live maps" that address these dynamic issues, as opposed to "dead maps". To complicate matters, we are looking to inject Volunteered Geographic Information (VGI), citizen science data, or sensor content from the Internet of Things (IoT) into our enterprise geospatial systems. This requires us to evaluate how good an open source data set is in comparison to our base information before ingesting it into our enterprise GIS. What it comes down to is that we have had no way to quantitatively measure quality, until now.
To understand the fitness of your data set is to understand its quality. Geospatial professionals have thought about this for a long time, but mostly in the context of locational accuracy. As geographers, photogrammetrists, and surveyors, we are good at documenting the spatial accuracy of data. There are several published and widely accepted specifications for doing so. In addition, GIS software is good at managing topological structure to ensure that features share boundaries and networks are fully connected, and at checking that database fields are populated (e.g., no null values).
ISO standards identify 15 different elements that are required to fully understand the quality of a data set. The elements shown in yellow in the image to the left are the most difficult to assess and require visual review. We have no formal methodology for evaluating quality issues such as errors of omission (e.g., missed buildings or utility components on a telephone pole), errors of commission (e.g., catch basins in a storm sewer network that are actually ), and attribute incorrectness. Knowing a data set's absolute and relative positional accuracy is necessary, but what about how complete it is? Without that, we are hard pressed to know what the data is useful for.
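To make the omission and commission terms concrete, here is a minimal sketch in Python. It assumes that features in the data set under review can be matched by ID to a trusted reference inventory (for example, one produced by visual review of a sample area). The function name `completeness_errors` and the feature IDs are hypothetical, made up for illustration. This is not how any particular product computes these rates.

```python
def completeness_errors(dataset_ids, reference_ids):
    """Compare feature IDs in a data set against a trusted reference.

    Assumption (hypothetical setup): both inputs identify the same kind
    of feature, and a shared ID means "the same real-world feature".
    """
    dataset = set(dataset_ids)
    reference = set(reference_ids)

    # Omission: features that exist in the reference but are missing
    # from the data set.
    omitted = reference - dataset
    # Commission: features present in the data set that the reference
    # says should not be there.
    committed = dataset - reference

    omission_rate = len(omitted) / len(reference) if reference else 0.0
    commission_rate = len(committed) / len(dataset) if dataset else 0.0

    return {
        "omitted": omitted,
        "committed": committed,
        "omission_rate": omission_rate,
        "commission_rate": commission_rate,
    }


# Hypothetical building IDs for one sample area.
result = completeness_errors(
    dataset_ids=["b1", "b2", "b3", "b9"],
    reference_ids=["b1", "b2", "b3", "b4", "b5"],
)
print(result["omission_rate"], result["commission_rate"])
```

In practice, matching features between two sources is rarely as simple as comparing IDs, so treat the set comparison as a stand-in for whatever matching step a real review would use.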
At my company, we use the term 'data fitness' for the concept of measuring the quality of a data set and understanding how it measures up to a given business requirement. To address this issue, we've developed an index called the Map Tolerance Percent Defect (MTPD) that provides a statistically relevant measurement of error within a data set, based on the hard-to-measure criteria discussed earlier. An MTPD for a data set might report, for example, a 2% error at 95% confidence. Think of it as a health diagnostic for your data - kind of like your blood pressure or cholesterol level. What you do with the measurement is up to you. With the MTPD you can evaluate multiple data sets against each other and understand where you need to invest.
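For readers wondering what an error percentage "at 95% confidence" can look like in practice, here is a rough sketch in Python. It is not the MTPD calculation, which is not described here. It is a generic, textbook way to estimate a defect rate from a random sample of inspected features, using the Wilson score interval. The function name `defect_rate_interval` and the sample numbers are assumptions made up for illustration.

```python
import math


def defect_rate_interval(defects, sample_size, z=1.96):
    """Wilson score interval for a defect rate estimated from a sample.

    Illustrative only: a generic statistical estimate, not the MTPD
    method. z=1.96 corresponds to a two-sided 95% confidence level.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= defects <= sample_size:
        raise ValueError("defects must be between 0 and sample_size")

    p = defects / sample_size  # observed defect rate in the sample
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2))
        / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical review: 8 defective features found in a random sample of 400.
low, high = defect_rate_interval(defects=8, sample_size=400)
print(f"observed 2.0% defects; 95% interval {low:.1%} to {high:.1%}")
```

The point of the interval is that an observed 2% defect rate from a sample is an estimate of the true rate, and a larger sample narrows the range around it.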
We created a software product called, wait for it... Data Fitness. The product helps you define the MTPD for a given data set. Data Fitness gives you a way to quantitatively measure a data set's quality and then compare quality between data sets in a verifiable, repeatable way. In addition, with Data Fitness you can evaluate a VGI data set and get a metric that tells you whether the new data set is any better than the one you're already using. It will be officially released before spring 2017.