An Intern's Intro to Emu Analytics

Bliss is an MSc Big Data Science (with Industrial Experience) student at Queen Mary University of London, currently on a year-long Data Science internship at Emu Analytics. Before launching into the Big Data space, she worked as a Network Operations Engineer at Etisalat, one of the largest telecommunications providers in the Middle East and Africa. Aside from being a techie, Bliss loves to sing and has been part of various gospel choirs over the years. Here is what she has to say about working at Emu Analytics so far.

The first week was the hardest and the longest for sure! Like every new job, it involved getting to know everyone, learning about the company, seeing how my skills and experience could contribute to the company's goals, and settling into a new environment. However, having kind and thoughtful colleagues is always the cure for first-week unease, and these are the qualities that make up the team at Emu Analytics. The last working day of the week was a chance to see everyone in the team face to face (thanks to flexible working) and head out for an exciting game of pétanque, a delicious dinner and a nice time at Skylight in Wapping. After much effort on the game by everyone, Robin (thoughtfully watching Jon throw in the picture below) emerged as the champion of the day.

Emus at Pétanque - First Outing with the Mob

Now, this is what I found out about Emu in my first week: there can't be another start-up out there with ALL team members this highly skilled and competent. My one-on-ones with everyone in the team left me amazed at their vast wealth of experience and technical skills. I started off with a workshop with Alice Goudie, Senior Location Intelligence Analyst, who happens to be my line manager. Alice introduced me to the world of location intelligence, with technologies such as ESRI ArcGIS, QGIS and Postgres/PostGIS.

For anyone like me who didn't have a clue about location intelligence/GIS (Geographical Information Systems) before coming to Emu, let me give you a quick walk-through of what I have learnt so far.

Location intelligence

According to my favourite Wikipedia, location intelligence is the process of deriving meaningful insights from geospatial data. So, in layman's terms, it is the process of analysing data that has an implicit or explicit association with a location relative to the Earth, to derive meaningful insights that can be used to make decisions or solve a problem. So how do we achieve this practically? There are several GIS software packages, including Google Earth, QGIS, ArcGIS, Carto, PySAL… the list goes on. ArcGIS, QGIS and Carto are the main tools used here at Emu. I started an introductory course on GIS and I can definitely recommend this GIS tutorial to anyone who is interested.

What is GIS all about?

Simply put, GIS is a system designed to capture, store, manipulate, analyse and present data that has geographical coordinates (locations relative to the Earth). The data can be represented as vectors or rasters. Vector data represents features with distinct boundaries, such as houses, rivers or bird sightings (or any other distinct feature), using points, lines or polygons. A point is a single pair of geographical coordinates; a line connects two or more points; and a polygon is a closed sequence of connected points, used to define the precise boundary of a feature on Earth.
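To make this concrete, here is a minimal sketch of the three vector types using the shapely Python library (the coordinates are made up for illustration):

```python
from shapely.geometry import Point, LineString, Polygon

# A point: a single coordinate pair (longitude, latitude)
pub = Point(-0.1276, 51.5074)

# A line: two or more connected points, e.g. a stretch of river
river = LineString([(-0.12, 51.50), (-0.10, 51.51), (-0.08, 51.51)])

# A polygon: a closed ring of points, e.g. a building footprint
house = Polygon([(-0.11, 51.50), (-0.11, 51.51), (-0.10, 51.51), (-0.10, 51.50)])

# Geometric operations then come for free
print(pub.distance(river))
print(house.area)
```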

2017 London Marathon - Mapped

The map above, by Alice and Jon, shows how vector points were used to map out the locations of pubs, toilets and the best places to load up with carbs along the route of the 2017 London Marathon. Polylines connect these points and give a pace guide along the runners' route. View the interactive map.

Not all features have distinct boundaries. Data such as precipitation or the heat of a forest fire form continuous surfaces without a well-defined outline. GIS represents these using a raster data model. A raster is made up of equal-sized cells arranged in rows and columns, with an origin anchored at a real-world location. GIS uses the origin, the relative cell position and the cell size to determine the location of each cell on Earth.
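As a worked example, here is how a GIS recovers a cell's real-world position from the origin and cell size (a toy sketch with made-up values, assuming a projected coordinate system measured in metres):

```python
# The origin is the real-world coordinate of the raster's top-left corner
origin_x, origin_y = 530000.0, 181000.0  # made-up values, in metres
cell_size = 10.0                         # each cell is 10 m x 10 m

def cell_centre(row, col):
    """Real-world coordinates of the centre of cell (row, col)."""
    x = origin_x + (col + 0.5) * cell_size
    y = origin_y - (row + 0.5) * cell_size  # rows count downwards from the top
    return x, y

print(cell_centre(0, 0))  # (530005.0, 180995.0)
print(cell_centre(2, 3))  # (530035.0, 180975.0)
```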

So, is this all GIS is about? Definitely not. The world of GIS and location intelligence opens up endless possibilities, some of which have been explored here at Emu and featured in our blogs. Some of my favourite Emu blogs are Santander Cycles at Christmas by Alice, which analyses London's Cycle Hire scheme and reveals a spike in cycle hire journeys on Christmas Day, and The Heights of England: The Technology Behind the Map by Robin, which explains the technologies used to achieve the visualization of 13,000,000 buildings in England. Have a look and watch out for more!

As part of my GIS tutorials, I created my first map using the dark grey canvas basemap and a simplified world climate zones layer in ArcGIS. This segments the world into climatic zones.

World Climate Zones

It’s not just about GIS

No, it's not! My second workshop was with Emu's Head of Technology, Robin (whose list of talents includes singing, boating (he lives on a boat!), piano and kimchi production…). The workshop was an overview of the stages involved in developing the excellent real-time visualizations in Emu's products, from the back end to the front end, including interfaces with databases (relational and non-relational), email servers, enterprise systems and real-time data. Having a background in computer science meant that part of the framework was familiar. I was also introduced to the spatial capabilities of PostgreSQL (with the PostGIS spatial database extension), which allows location queries to run in SQL.
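As an illustration of the kind of location query PostGIS makes possible, here is a minimal sketch from Python (the table and column names, pubs and geom, are hypothetical, and the connection details are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=gis user=analyst")  # placeholder credentials
cur = conn.cursor()

# Find all pubs within 500 m of a point, casting to geography so that
# ST_DWithin measures distance in metres rather than degrees.
cur.execute("""
    SELECT name
    FROM pubs
    WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
        500
    );
""", (-0.1276, 51.5074))

for (name,) in cur.fetchall():
    print(name)
```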

We also looked at the use of vector tile mapping (VTM), which enables smooth zooming and panning by delivering map data to the web browser as vector representations of the features in each tile. Each tile is a set of drawing instructions interpreted by a rendering engine in the web browser. I discovered MapboxGL, which is the VTM library used at Emu Analytics to deliver our interactive maps; it converts the vector tile drawing instructions into the maps we see on our screens.
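To give a flavour of how tiling works, here is the standard Web Mercator "slippy map" calculation that determines which tile a coordinate falls into at a given zoom level (this is the general scheme, not Emu's own code):

```python
import math

def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Return the (x, y) index of the XYZ map tile containing a point.

    The world is split into 2**zoom by 2**zoom tiles, and each vector
    tile carries only the features that fall inside it.
    """
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# The tile covering central London at zoom level 12
print(lonlat_to_tile(-0.1276, 51.5074, 12))  # (2046, 1362)
```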

On the front end, I discovered the powerful capabilities of AngularJS, which enables component-based development where components can be easily reused and applications can scale to large datasets. I have already signed up for a course, so join me for more on AngularJS.

There will be more workshops with other team members as I discover the wealth of resources and expertise during my year-long internship, but let me tell you a bit about what I was up to before coming over to Emu Analytics.

Big Data, what is all the hype about?

That question seems to be asked by many (5,460,000 results on Google). Well, I decided to figure it out for myself when I started an MSc in Big Data Science at Queen Mary University of London last year. I found out it's definitely worth the hype, and the field is still moving at a fast pace. Maybe that's because, before then, I was stuck with the manual process of analysing call detail records (CDRs) with SQL and Excel at one of the major telecoms operators in the Middle East and Africa (very laborious and annoying). So, here's a quick overview of my time at QMUL, which was broken down into eight modules.

Big Data Processing

As the name implies, the module was an exposition of past and current technologies for large-scale data processing, including Apache Hadoop (HDFS, MapReduce), Apache Spark (Scala, DStream, MLlib, GraphX) and Apache Storm. Module projects included analysing a large distributed dataset of tweets from the 2016 Rio Olympics to determine which countries had the most fans, the average number of tweets per day, and the average length in characters of each tweet. The project compared the performance of MapReduce and Spark across different algorithms, including iterative algorithms (k-means), joins and numeric summarizations. My major takeaway: thanks to scalability, no dataset is too big to be processed quickly.
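To give an idea of what such a job looks like, here is a minimal PySpark sketch of the tweet analysis (the file path and column names are hypothetical; the original project also had a MapReduce implementation):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, length

spark = SparkSession.builder.appName("rio-tweets").getOrCreate()

# Hypothetical path to the distributed tweet dataset
tweets = spark.read.json("hdfs:///data/rio2016/tweets/*.json")

# Which countries had the most fans tweeting about the Olympics?
(tweets
    .groupBy("user_country")
    .count()
    .orderBy("count", ascending=False)
    .show(10))

# Average tweet length in characters
tweets.select(avg(length("text"))).show()
```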

Performance Comparison of MapReduce and Spark

Surprisingly, MapReduce with a combiner outperformed Spark in numerical summarization based on our findings (there is still hope for MapReduce!).

Machine Learning

Now, this is where the real hype is in the industry, but again, in my opinion it's definitely worth it. From supervised learning to unsupervised learning and reinforcement learning, we learnt it all. Basically, with supervised learning there is a target label (or labels) in the dataset. The model learns to predict this label from the features in the training dataset, its performance is evaluated and optimised on a validation dataset, and finally, given the test dataset, it should accurately predict the label(s). Linear regression, logistic regression, k-nearest neighbours (KNN), decision trees, support vector machines (SVM), neural networks (which can also be used for unsupervised learning) and Naïve Bayes are some of the common supervised learning models.
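Here is a minimal scikit-learn sketch of that train/validate/test workflow, using the built-in iris dataset and KNN as an arbitrary example model:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split into training, validation and held-out test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# "Optimise on the validation set": pick the k with the best validation accuracy
best_k = max((1, 3, 5, 7), key=lambda k: accuracy_score(
    y_val, KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_val)))

# Only now touch the test set, to estimate real-world performance
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, accuracy_score(y_test, final.predict(X_test)))
```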

With unsupervised learning, there are no target labels in the dataset. The goal is to understand the structure within the data, find patterns or explain the data. Unsupervised learning algorithms include K-means, Gaussian mixture models, Hierarchical clustering and Hidden Markov models.
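For example, here is k-means in scikit-learn recovering three clusters from unlabelled synthetic points (a toy sketch, not data from the module):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three blobs of points with no labels attached
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=centre, scale=0.5, size=(50, 2))
    for centre in [(0, 0), (5, 5), (0, 5)]
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # close to the three true centres
print(kmeans.labels_[:10])      # the cluster assigned to each point
```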

K-means Clustering Showing Initial Step (a) to Convergence (g)

Data Mining

Data mining has a lot in common with machine learning (it wasn't helpful having the same lecturer for both!), but the major difference was that we analysed the complexity of each model (its time and memory requirements) to determine the relevant applications for each. Here is some of what I found out (there is a quick timing sketch after the list):

  • Decision trees are very efficient for classification because they handle feature interactions for both numeric and discrete features, but they easily overfit the training data (though this can be mitigated with ensembles such as random forests).
  • KNN has essentially zero training cost, but it should not be your first choice for a large dataset, because it stores the entire training dataset in memory and calculates the distance between every training sample and each test sample at prediction time.
  • Naïve Bayes is resilient to missing values and can be updated with online learning (batch or stream), but it assumes independence between features and hence cannot model feature interactions.
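Here is a rough timing sketch of those trade-offs on synthetic data (results will vary by machine, but KNN's near-zero training time and comparatively slow prediction should stand out):

```python
import time
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_test = rng.normal(size=(2000, 20))

for model in (DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()):
    t0 = time.perf_counter(); model.fit(X, y)
    t1 = time.perf_counter(); model.predict(X_test)
    t2 = time.perf_counter()
    print(f"{type(model).__name__}: train {t1 - t0:.3f}s, predict {t2 - t1:.3f}s")
```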

For my project, I participated in the Titanic competition on Kaggle to predict which passengers survived, and earned a reasonable score on the leaderboard (keeping the score to myself). If you are interested in Data Science, I recommend trying out competitions on Kaggle to improve your skills.

Data Analytics

This module gave me all the knowledge I needed for an end-to-end data analysis project, from problem definition, through data extraction and preparation, all the way to preparing the final report for a client. This knowledge was put into practice in my project to determine the factors predictive of employee attrition, and to predict the next employee to leave, using a human resources dataset on Kaggle. It was great to implement the entire project in Python, which improved my skills. What was not so amazing was preparing a 7,000-word report at the end…

Factors Predictive of Employee Attrition

The scores from the mutual information classifier function (from the scikit-learn package in Python, which measures how predictive each feature is) indicate that an employee's satisfaction level, number of projects and working hours are the most predictive of attrition, while the employee's department has no significance in predicting it.
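As a sketch of that feature-scoring step, here is how the mutual information scores could be computed with scikit-learn (assuming the Kaggle HR dataset has been saved as HR.csv; the file and column names are assumptions):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("HR.csv")  # hypothetical local copy of the Kaggle HR dataset
features = ["satisfaction_level", "number_project", "average_montly_hours",
            "time_spend_company", "last_evaluation"]

# Mutual information between each feature and the binary 'left' label
scores = mutual_info_classif(df[features], df["left"], random_state=0)
for name, score in sorted(zip(features, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```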

The best part of this module was statistical machine learning with Bayesian networks. Unlike machine learning models that use a black-box approach to predict outputs from historical data, statistical models such as Bayesian networks use a probabilistic graph that represents the conditional dependencies between variables. They therefore arrive at decisions through visible, auditable reasoning, making them useful for real-life applications where discovering causal relationships is essential.
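To show what that auditable reasoning looks like, here is a toy Bayesian network built with the pgmpy Python library (my own illustrative example, not a model from the module; the structure and probabilities are invented):

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Overwork influences satisfaction, and both influence leaving (0 = no, 1 = yes)
model = BayesianNetwork([("Overworked", "Satisfied"),
                         ("Overworked", "Leaves"),
                         ("Satisfied", "Leaves")])

model.add_cpds(
    TabularCPD("Overworked", 2, [[0.7], [0.3]]),
    TabularCPD("Satisfied", 2, [[0.2, 0.6],    # P(not satisfied | overworked?)
                                [0.8, 0.4]],   # P(satisfied | overworked?)
               evidence=["Overworked"], evidence_card=[2]),
    TabularCPD("Leaves", 2,
               [[0.6, 0.95, 0.2, 0.7],   # P(stays | overworked?, satisfied?)
                [0.4, 0.05, 0.8, 0.3]],  # P(leaves | overworked?, satisfied?)
               evidence=["Overworked", "Satisfied"], evidence_card=[2, 2]),
)
assert model.check_model()

# The reasoning is explicit: query P(Leaves) given that an employee is overworked
inference = VariableElimination(model)
print(inference.query(["Leaves"], evidence={"Overworked": 1}))
```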

Well, I could go on and on and bore you further with the other four modules (Information Retrieval and Natural Language Processing, Cloud Computing, Applied Statistics and Business Technology Strategy), but since I have told you a little about the core data science modules, I will stop here for now.

Finally… how I plan to use this knowledge at Emu

With all that knowledge backed up, watch this space to find out how these skills are going to be applied! Especially on the ongoing Smart Meter project, where I will be leading the data analytics work, focusing on deriving meaningful insights from smart meter datasets to forecast trends. Check out the blog by Alice Goudie to find out more about this project.

I am currently analysing the Energy Performance Certificate (EPC) dataset, with the goal of discovering hidden patterns that could suggest ways to improve energy efficiency and reduce energy wastage. The dataset contains potential energy rating, potential energy efficiency and potential energy consumption, as well as current energy performance and building attributes such as roof type, window type, building type, number of habitable rooms and tenure type. I am investigating which factors in the dataset are predictive of these potential ratings and values, based on the other attributes. This analysis will also feed into our ongoing smart meter project, where we hope to incorporate the information derived from the EPC data.