Sunday, October 30, 2016

Regression Analysis to Estimate Missing Values

The data available for any type of analysis are almost always limited, but fortunately regression analysis offers a way to estimate missing values.  One relatively simple method is to fit a regression line, which can be used to answer the question "what is the most likely value for Y, given a value for X?"  The assumption here is that the relationship between X and Y observed in the available data also holds for the period with the missing values.

Year Station B (x) Station A (y)
1931 1005.84 1131.97
1932 1148.08 1269.09
1933 691.39 828.84
1934 1328.25 1442.78
1935 1042.42 1167.23
1936 1502.41 1610.67
1937 1027.18 1152.54
1938 995.93 1122.42
1939 1323.59 1438.29
1940 946.19 1074.47
1941 989.58 1116.30
1942 1124.60 1246.45
1943 955.04 1083.00
1944 1215.64 1334.22
1945 1418.22 1529.50
1946 1323.34 1438.04
1947 1391.75 1503.98
1948 1338.97 1453.11
1949 1204.47 1323.45
  
The above table has rainfall values for two weather stations for the years 1931 through 1949.  The Station A values for these years were estimated from the Station B values, using the slope and Y-intercept of the regression line fitted to the two stations' records from 1949-2004.  The formula for the regression line is y = bx + a, where b is the slope, a is the Y-intercept, x is the value recorded at Station B for a given year, and y is the estimated value at Station A for that year.
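The procedure described above can be sketched in a few lines of Python.  This is a minimal illustration rather than the exact workflow used for the table: it assumes an ordinary least-squares fit, the function names `fit_line` and `estimate_missing` are hypothetical, and the overlap-period values are made up because the 1949-2004 records are not shown here.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b*x + a; returns (b, a)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # Y-intercept
    return b, a


def estimate_missing(x_values, b, a):
    """Apply y = b*x + a to each known x to estimate the missing y."""
    return [b * x + a for x in x_values]


# Made-up overlap-period rainfall (NOT the real station records).
overlap_station_b = [900.0, 1100.0, 1300.0, 1500.0]
overlap_station_a = [1030.0, 1220.0, 1420.0, 1610.0]

b, a = fit_line(overlap_station_b, overlap_station_a)

# Station B values for 1931 and 1932, taken from the table above.
print(estimate_missing([1005.84, 1148.08], b, a))
```

Because the overlap data here are invented, the printed estimates will not match the Station A column in the table; with the real 1949-2004 records the same two steps (fit, then apply) would apply.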

Saturday, October 22, 2016

Interpolation and DEM Accuracy

The accuracy of a continuous raster surface depends on how well the interpolation method estimates values for the areas between sample points.  Accuracy is assessed by comparing values measured directly in the field with the values the interpolation method produces for the same locations; the smaller the difference between the two, the more accurate the interpolation.  Vertical (elevation) accuracy in DEM data can be influenced by a number of factors, most notably ground cover, which can obscure the surface in remotely sensed images and data.  Bias and uncertainty can be introduced rather easily into a data set, and can lead to systematic over- or underestimation.  Summary statistics of the differences between measured and interpolated values for different methods may look something like this:


                        IDW elevation   Spline elevation   Kriging elevation
mean (m)                -1.544          -2.365             -1.967
standard deviation (m)  11.695          10.196             12.024
median (m)              -1.759          -1.838             -2.144

The three interpolation methods across the top, Inverse Distance Weighting (IDW), spline and kriging, were each used to produce elevation values for 63 different points, which were then compared to the actual elevations; the differences between the values are summarized in the table above.  IDW, the method whose mean difference is closest to zero, could be argued to be superior to the others, as it produces values that are, on average, closest to the actual measurements, although spline has the smallest standard deviation, meaning its errors are the most consistent.
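For reference, statistics like those in the table can be computed along the following lines.  This is only a sketch under stated assumptions: the function name `summarize_differences` is hypothetical, differences are taken as interpolated minus measured (the sign convention is not stated in this post), the sample standard deviation is used, and the input numbers are made up.

```python
from statistics import mean, median, stdev


def summarize_differences(measured, interpolated):
    """Summarize (interpolated - measured) differences for paired points."""
    if len(measured) != len(interpolated) or len(measured) < 2:
        raise ValueError("need at least two paired values")
    # Assumed sign convention: negative means the surface underestimates.
    diffs = [est - obs for obs, est in zip(measured, interpolated)]
    return {
        "mean": mean(diffs),
        "standard deviation": stdev(diffs),  # sample standard deviation
        "median": median(diffs),
    }


# Made-up elevations in metres, for illustration only.
field_values = [10, 20, 30]
surface_values = [9, 21, 27]
print(summarize_differences(field_values, surface_values))
```

Running the same function once per interpolated surface gives one column of the table for each method.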

Sunday, October 16, 2016

Estimating Values for Unsampled Areas: Surface Interpolation

Many different phenomena are represented on maps as being continuous across a large surface area, despite the fact that the measurements of whatever is being depicted are necessarily taken at a set of discrete point locations.  A news station weather map of North America, for example, might show areas with more precipitation or higher temperature in different colors across the map.  Estimating the value of a variable at locations that were not directly sampled, based upon the values at nearby sampled points, is called spatial interpolation.  There are many different interpolation methods, which vary in complexity and difficulty of execution, and each has its own strengths and weaknesses.  Thiessen polygons and Inverse Distance Weighting (IDW), for example, are relatively simple interpolation methods, and are somewhat easier to perform.  More sophisticated methods, like splining or kriging (the latter of which uses spatial statistics in its calculations), may be preferable if a higher degree of accuracy is required.


The above map depicts concentrations of biochemical oxygen demand (BOD) in Tampa Bay, and is generated from the sample data recorded at each of the point locations.  IDW is the interpolation method used to generate the above raster surface, which shows estimated concentrations across the bay based on the measurements at the sample points.  IDW is a relatively simple and common method, but it is valuable in a number of spatial analysis scenarios.  This, and most other interpolation methods, rely on the assumption of spatial autocorrelation: that what is nearer is more alike than what is farther away.  Although IDW is an adequate and appropriate method of interpolation in this application, it may not be appropriate for every situation; the choice of method depends on the characteristics of the application.
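The idea behind IDW can be sketched as a weighted average in which each sample's weight shrinks with its distance from the location being estimated.  The snippet below is a bare-bones illustration, not how ArcGIS implements the tool: the function name `idw`, the power of 2, the use of every sample point (no search radius), and the sample values are all assumptions.

```python
import math


def idw(sample_points, target, power=2.0):
    """Estimate a value at target (x, y) from (x, y, value) samples."""
    tx, ty = target
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in sample_points:
        distance = math.hypot(x - tx, y - ty)
        if distance == 0.0:
            return value  # target coincides with a sample point
        weight = 1.0 / distance ** power  # nearer samples count for more
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one sample point is required")
    return weighted_sum / weight_total


# Made-up sample points: (x, y, concentration).
samples = [(0.0, 0.0, 2.0), (10.0, 0.0, 4.0), (0.0, 10.0, 6.0)]
print(idw(samples, (2.0, 2.0)))
```

A larger `power` makes the estimate follow the nearest samples more closely, while a smaller one produces a smoother surface.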

Sunday, October 9, 2016

Elevation data: TIN vs. DEM (the vector/raster dichotomy epitomised)

In the world of geographic information science there exist two primary kinds of data: vector and raster.  The differences between the two could be discussed to no end, and everyone has their preference, but for some applications one can be simply superior to the other.  A rather perfect example of this dyad is the comparison of elevation depicted with a vector TIN data set vs. that which is displayed with a raster DEM.  The triangular irregular network (TIN) is created from a series of elevation measurements called 'mass points,' which are connected into a network of triangles; 'breaklines' can be added to enforce linear features such as ridges or streams, and elevations are interpolated across the triangles.  The digital elevation model (DEM) is a digital representation of topography, which is stored as a regular grid of cells that each hold an elevation value, and is a widely used type of raster data.  Contour lines are a standard 2D representation of elevation, and can be created in ArcGIS from either a TIN or a DEM data set.



Above is an image with two sets of contour lines for the same area, one created from a TIN and one from a DEM.  The most immediately notable difference between the two sets of lines is that the TIN contours are more angular, and don't curve in the same way as those created from the DEM.  This is somewhat logical, given that the TIN consists of discrete triangles, whereas the DEM represents a more continuous surface of values.  Because elevation is a continuous phenomenon (it is not delineated in nature with discrete lines), one might contend that the DEM provides the superior representation.  The TIN data lacks the more nuanced variations in values that allow for the curved contour lines.  Thus it could be concluded that though vector data is invaluable in many situations, a raster image is somewhat preferable in the examination of phenomena like elevation.
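One way to see why the TIN contours come out angular: if elevation inside each triangle is interpolated linearly from its three corners, every triangle is a flat, tilted facet, so a contour can only cross it as a straight segment.  The sketch below shows that linear (barycentric) interpolation for a single triangle; the function name `tin_elevation` and the coordinates are made up, and it assumes linear interpolation rather than any smoothing option a GIS package might offer.

```python
def tin_elevation(triangle, point):
    """Linearly interpolate elevation at point (x, y) inside a triangle.

    triangle is three (x, y, z) vertices; weights are barycentric.
    """
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = triangle
    px, py = point
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("triangle vertices are collinear")
    w1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    w2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    w3 = 1.0 - w1 - w2
    if min(w1, w2, w3) < -1e-12:
        raise ValueError("point lies outside the triangle")
    # Each vertex contributes in proportion to its weight: a flat plane.
    return w1 * z1 + w2 * z2 + w3 * z3


# Made-up triangle: vertices at 10 m, 20 m and 30 m.
facet = ((0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0))
print(tin_elevation(facet, (5.0, 0.0)))
```

Along any straight line across the facet the elevation changes at a constant rate, which is why the contour direction only changes at triangle edges.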

Sunday, October 2, 2016

Location-Allocation Modeling

Anyone who has ever wondered how, exactly, it is decided where new schools, stores or other facilities will be located has considered a location-allocation problem.  These decisions involve analysis of factors such as travel cost, distance, and customer locations, and candidate locations are chosen based on their suitability.



Above is a comparison of market area assignments to distribution centers, both before and after running the location-allocation model.  Each of the 22 distribution centers serves a set of market areas, assigned on the basis of customer locations.  After the location analysis, the market area assignments are improved, as the number of customers within each market area and the distribution center each one uses have been optimized according to our specified criteria.
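The allocation half of the problem can be illustrated with a very small sketch.  It is not the model used for the map above, whose criteria are not listed here: it assumes straight-line distance as the only cost, assigns each market area to its nearest distribution center, and uses a hypothetical function name (`allocate`) and made-up coordinates and customer counts.

```python
import math


def allocate(market_areas, centers):
    """Assign each market area to its nearest distribution center.

    market_areas: {name: (x, y, customers)}; centers: {name: (x, y)}.
    Returns (assignment, total customer-weighted distance).
    """
    if not centers:
        raise ValueError("at least one distribution center is required")
    assignment = {}
    total_cost = 0.0
    for area, (ax, ay, customers) in market_areas.items():
        # Straight-line distance from this area to every center.
        distances = {
            name: math.hypot(cx - ax, cy - ay)
            for name, (cx, cy) in centers.items()
        }
        nearest = min(distances, key=distances.get)
        assignment[area] = nearest
        total_cost += customers * distances[nearest]
    return assignment, total_cost


# Made-up inputs for illustration only.
areas = {"north": (0.0, 9.0, 120), "south": (0.0, -8.0, 80)}
depots = {"depot_1": (0.0, 10.0), "depot_2": (0.0, -10.0)}
print(allocate(areas, depots))
```

Comparing the total customer-weighted distance before and after reassignment is one simple way to quantify how much an allocation has improved; a real analysis would typically use network travel cost and capacity limits as well.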