The Map Is Not the Territory

(But it sure beats staring at a pandas dataframe)

Hsin Hin Lim
The Startup


Photo by Pedro Lastra on Unsplash

The objective of the module 2 project on my Flatiron data science course was to use a multiple linear regression model to predict the sale price of houses in King County, Washington as accurately as possible. The business case I constructed around the project assumed that I was advising a new firm of real estate agents who wanted to start a business in King County but were strangers to its real estate market and geography.

It is not the intention of this blog to run through the project step-by-step; the technical aspects of the project are covered in a Jupyter notebook in the GitHub repository here:

Furthermore, the non-technical aspects of the project, where I expand on the business case I had constructed around it, are covered by my video here:

The grist for this blog comes from the frustration gnawing at my mind throughout the course of the project that I had never been to King County before. Just like my fictional real estate firm, I had no feel for its geography — its roads, its distances and its views. Looking at the numbers in a pandas dataframe was well and good, but short of travelling there yourself, the best way to bring those numbers alive was to plot a map.

Simple folium maps with markers

As part of the exploratory data analysis in the project, I wanted to see whether all the properties that the data described as being near a waterfront were indeed near a body of water:

Sample of locations with Waterfront assigned a value of “1” in the King County dataset

Working with the underlying data, the above map identifies a sample of locations which the data says have views of a waterfront. Even this simple exercise was valuable, because it led me to challenge my expectation of what the data actually meant when it assigned a “1” to “Waterfront”. Assuming the data is correct and looking at the output on a map, it is evident that not all properties that the data describes as being by a waterfront are actually next to a body of water. There could be several explanations for this. Apartments on high floors with a view of a body of water might have been marked as being by the “Waterfront”. It is also more than likely that King County isn’t flat; even properties of modest height located on top of a hill could have a view of a body of water and thus be classified as having a “Waterfront”. In this manner, we are able to critically assess the data we have been given from a perspective that would not be available just by staring at numbers in a dataframe, even if our tool for doing so is a humble folium map with markers.
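For readers who want to try something similar, here is a minimal sketch of a marker map of this kind. It is not the notebook's code: the rows are made-up stand-ins for the dataset, and the keys "waterfront", "lat" and "long" are assumed column names. folium is imported inside the map-building function so that the sampling helper can be used without it.

```python
import random

# Hypothetical rows standing in for the King County data. The keys
# "waterfront", "lat" and "long" are assumed column names.
houses = [
    {"id": 1, "waterfront": 1, "lat": 47.51, "long": -122.39},
    {"id": 2, "waterfront": 0, "lat": 47.61, "long": -122.20},
    {"id": 3, "waterfront": 1, "lat": 47.64, "long": -122.28},
    {"id": 4, "waterfront": 1, "lat": 47.40, "long": -122.46},
]


def sample_waterfront(rows, n, seed=0):
    """Return up to n rows flagged as waterfront, chosen reproducibly."""
    flagged = [r for r in rows if r["waterfront"] == 1]
    rng = random.Random(seed)
    return rng.sample(flagged, min(n, len(flagged)))


def build_marker_map(rows, zoom_start=10):
    """Plot one marker per row on a folium map centred on the mean location."""
    import folium  # third-party; imported lazily so the helper above works without it

    centre = [
        sum(r["lat"] for r in rows) / len(rows),
        sum(r["long"] for r in rows) / len(rows),
    ]
    fmap = folium.Map(location=centre, zoom_start=zoom_start)
    for r in rows:
        folium.Marker(
            location=[r["lat"], r["long"]],
            popup=f"id {r['id']}",
        ).add_to(fmap)
    return fmap


sample = sample_waterfront(houses, n=2)
```

With folium installed, `build_marker_map(sample).save("waterfront_sample.html")` would write a page that can be opened in a browser and compared against the visible bodies of water.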

Choropleth maps

What else could I do with a map that would help me see the data better? I spent much of the project trying to plot a choropleth map which, according to Wikipedia, is “a type of thematic map in which areas are shaded or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic.” I ended up with this:

Explore King County and its area by features

The above is an interactive choropleth map which divides up King County by the zipcodes covered in the dataset and shades each zipcode by the median age of the properties in that zipcode. This gives a stranger to King County a relative feel for where its old neighbourhoods are. By hovering the mouse pointer over defined areas, the user can navigate King County by zip code. For instance, one can see the median living space in square feet for properties in a particular zip code and thus get a feel for how big houses are in that location. This is no substitute for actually walking the ground to see it for oneself, but it is as close as we can get using only the data we have been given.
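The per-zipcode figures behind such a map are simple aggregates. As a rough illustration (not the notebook's code), the sketch below computes a median per zipcode from a handful of invented rows. The keys "zipcode", "yr_built" and "sqft_living" are assumed column names, and the reference year used to turn a build year into an age is also an assumption.

```python
from collections import defaultdict
from statistics import median

REFERENCE_YEAR = 2015  # assumed year against which property age is measured

# Invented rows for illustration only.
houses = [
    {"zipcode": "98112", "yr_built": 1910, "sqft_living": 2400},
    {"zipcode": "98112", "yr_built": 1926, "sqft_living": 1800},
    {"zipcode": "98112", "yr_built": 1950, "sqft_living": 3000},
    {"zipcode": "98001", "yr_built": 1995, "sqft_living": 1900},
    {"zipcode": "98001", "yr_built": 2005, "sqft_living": 2100},
]


def median_by_zipcode(rows, value_fn):
    """Group rows by zipcode and return {zipcode: median of value_fn(row)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["zipcode"]].append(value_fn(row))
    return {zipcode: median(values) for zipcode, values in groups.items()}


median_age = median_by_zipcode(houses, lambda r: REFERENCE_YEAR - r["yr_built"])
median_sqft = median_by_zipcode(houses, lambda r: r["sqft_living"])
```

With the data already in a pandas dataframe, a `groupby` on the zipcode column followed by `.median()` would do the same job in one line.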

Making an interactive choropleth map like the above has been well documented by others in blogs and forums, but by way of overview, producing such a map required:

· Downloading open-source polygons that have been made available by the City of Seattle Open Data portal here:

· Working with the polygon shapes in GeoJSON format such that only polygons that correspond to the zipcodes that are in our dataset are used as per the code snippet below:

· Wrangling the data and linking it to the individual polygons which we have identified as being relevant before using the choropleth method in folium to plot the choropleth map as per the code snippet below:
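The original snippets are not reproduced here. What follows is a minimal sketch of the last two steps under stated assumptions: the polygons are fake squares rather than the real download, the property name "ZIPCODE" is an assumed key that should be checked against the actual GeoJSON file, and the map centre is only a rough guess at King County's location. The helper names are hypothetical.

```python
def filter_features(geojson, zipcodes, key="ZIPCODE"):
    """Keep only the features whose zip code property is in `zipcodes`."""
    wanted = {str(z) for z in zipcodes}
    kept = [
        feature for feature in geojson["features"]
        if str(feature["properties"].get(key)) in wanted
    ]
    return {"type": "FeatureCollection", "features": kept}


def build_choropleth(geojson, values, key="ZIPCODE"):
    """Shade each polygon by its value; `values` maps zip code -> number."""
    import folium        # third-party, imported lazily
    import pandas as pd  # folium documents a DataFrame or Series as input

    table = pd.DataFrame(
        {"zipcode": [str(z) for z in values], "value": list(values.values())}
    )
    fmap = folium.Map(location=[47.5, -122.2], zoom_start=9)  # rough centre (assumed)
    folium.Choropleth(
        geo_data=geojson,
        data=table,
        columns=["zipcode", "value"],        # [key column, value column]
        key_on=f"feature.properties.{key}",  # where the key lives in each feature
        fill_color="YlOrRd",
        legend_name="Median property age (years)",
    ).add_to(fmap)
    return fmap


def square(lon, lat, size=0.1):
    """Return a closed square ring, used only to fake polygon shapes."""
    return [[[lon, lat], [lon + size, lat], [lon + size, lat + size],
             [lon, lat + size], [lon, lat]]]


# Fake polygons; "99999" stands for a zip code absent from the dataset.
all_polygons = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"ZIPCODE": z},
         "geometry": {"type": "Polygon", "coordinates": square(lon, 47.5)}}
        for z, lon in [("98001", -122.3), ("98112", -122.2), ("99999", -122.1)]
    ],
}

median_age = {"98112": 89, "98001": 15}  # invented values
relevant = filter_features(all_polygons, median_age.keys())
```

With folium and pandas installed, `build_choropleth(relevant, median_age).save("choropleth.html")` would render the shaded map. Both the polygon keys and the data keys are converted to strings because the join silently finds no match when one side is an integer and the other is text.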

Having worked out how to make these choropleth maps, I used them extensively in my notebook and in my non-technical presentation. With a focus on learning more about the geographical distribution of the features my multiple linear regression had identified as predictive of house prices, I used these choropleth maps to further my business case to the fictitious real estate agents. In particular, I used them to familiarise the agents with the geography of King County and to advise them on the areas where they should focus their energies, depending on the sales strategy they had chosen.

OSMnx

The thing with every project is that one is never done with it; sometimes, I feel the most interesting things I’ve learnt never make it into the project itself. This is certainly the case with the OSMnx package, which really made geospatial data come alive. I found the package here:

To make things more interesting, OSMnx does not (at the moment) seem to support a simple pip install like previous packages I had encountered. I therefore had to dig deeper into creating a new conda environment from the terminal prompt and did a bit more research into how to run a Jupyter notebook with a kernel that supported OSMnx. Once I got the package up and running, it opened my eyes to the field of geospatial data science, which appeared to be a world unto itself (and one that would take several more weeks of tinkering if I wanted to be conversant with it). As a fundamental premise, OSMnx allows one to download and model street networks and other networked infrastructure. A very simple line of code:

Would create something like this:

Zipcode 98112 King County rendered by OSMnx

The map above captures the nodes from the OpenStreetMap database in a 3 km area around the “centre” of zipcode 98112 (which we had identified earlier from our choropleth map as being the oldest neighbourhood in King County).
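The exact line used for the map is not reproduced here. A minimal sketch of the kind of call described, assuming OSMnx's `graph_from_address` and `plot_graph` functions, is below. The query string, the "drive" network type and the helper name are all assumptions, and the function needs network access because it queries OpenStreetMap.

```python
DIST_METRES = 3000  # the 3 km distance mentioned in the text
PLACE = "98112, King County, Washington, USA"  # assumed geocodable query string


def plot_street_network(place=PLACE, dist=DIST_METRES, network_type="drive"):
    """Download the street network around `place` and draw it.

    Hypothetical helper: it wraps two OSMnx calls and returns the graph
    so that it can be analysed further.
    """
    import osmnx as ox  # third-party; imported lazily, needs network access

    graph = ox.graph_from_address(place, dist=dist, network_type=network_type)
    ox.plot_graph(graph)  # draws the network with matplotlib
    return graph
```

Calling `plot_street_network()` in a notebook with OSMnx installed should produce a street-network figure; whether the zip code resolves as expected depends on the geocoder, so a latitude and longitude pair may be a safer starting point.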

Such a map is but the tip of the iceberg of what the package can achieve. The nodes so captured not only allow us to visualise the street network; the package can also instantiate an object holding data that can be used to calculate and model features of the geography of a designated area (in our case, zipcode 98112). For instance, one could use the package to calculate and visualise the shortest paths between two geographical points to minimise distance or travel time. This would open up new possibilities for the project. Not only would it be possible to further inform ourselves about the physical geography of King County, we would also be able to conduct analysis answering questions such as whether property price is affected by the distance between the property and a major road, or between the property and major infrastructure like an airport or train station. We would be able to zoom in on certain areas for all kinds of geospatial analysis. In particular, we would be able to answer some of the questions we had thrown up earlier about the data; for instance, do some of the properties which claim to have a waterfront view (according to the data) actually have that view because they sit on top of a hill?
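To make the shortest-path idea concrete, here is a small self-contained sketch using Dijkstra's algorithm on a toy street graph. It does not use OSMnx: the node names and edge lengths are invented, and on a real OSMnx graph (a networkx graph whose edges carry a length in metres) a library routine would play the role of the hand-written function.

```python
import heapq

# Toy street network: node -> {neighbour: edge length in metres}.
# All names and lengths are hypothetical.
streets = {
    "house": {"corner_a": 200, "corner_b": 500},
    "corner_a": {"house": 200, "corner_b": 150, "station": 900},
    "corner_b": {"house": 500, "corner_a": 150, "station": 400},
    "station": {"corner_a": 900, "corner_b": 400},
}


def shortest_path(graph, source, target):
    """Dijkstra's algorithm: return (total length, path), or (inf, []) if unreachable."""
    queue = [(0, source, [source])]  # (distance so far, node, path taken)
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, length in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (dist + length, neighbour, path + [neighbour]))
    return float("inf"), []


length, route = shortest_path(streets, "house", "station")
```

The resulting network distance could then be added as a column next to each property and tested as a feature in the regression.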

While exploring the capabilities of the package, I spent quite a lot of time using it simply to plot street networks of European capitals I am familiar with. Like this:

OSMnx rendering of the street networks in Barcelona

I also spent quite a lot of time reading this blog by Geoff Boeing, the author of OSMnx, and the various ways he has deployed the capabilities of the package; I would highly recommend giving it a thorough read. If I were to encounter another dataset with a geographical angle, I would be looking to leverage this package to develop more insights from that dataset.

Conclusion

In conclusion, while this project began as an exercise in multiple linear regression, my frustration at not being able to see the terrain of King County for myself led me further and further into the territory of geospatial analysis and the possibilities afforded by Python. Given the availability of open source resources and the detailed nature of the OpenStreetMap database, using Python as a tool for geospatial analysis is something I am sure to return to in the near future.
