An Essential Skill for Every Data Scientist

Data visualization: a unique approach with the use of algorithms, and an example through the “What’s on the Menu” dataset

Shaashwat Agrawal
Towards Data Science


If we have data, let's look at data. If all we have are opinions, let’s go with mine. — Jim Barksdale

In the age of big data, any and all conclusions demand proof. Data is omnipresent, and it is essential to extract information from it; that is the job of analysts and data scientists, who extract information and draw conclusions. Data visualization is one such domain of data interpretation: it combines the visual power of our eyes with computation. Simply knowing the tools is not enough; one must also have an intuitive understanding of the data at hand and its related analytical tools.

One’s ability to understand, manipulate, and visualize data defines their role as a data scientist. In this article, we will try to define data visualization in our own way. We will discuss a systematic approach to understanding and visualizing a dataset with the help of algorithms. The dataset “What’s on the Menu” is visualized in Python as an example.

Data Visualization

Data visualization is a discipline that deals with the graphic and pictorial representation of data. It is like looking at a box instead of trying to imagine a cuboid of l × b × h cm. In simple terms, data visualization is taking loads of data and presenting parts of it in a way that removes all language barriers. A good visualization should be as interpretable to any random person as it is to you.

An intuitive approach to data visualization is important for success. The size, parameters, attributes, properties, and complexity of data vary with each dataset. We will move from understanding to analysis and, finally, visualization in five steps.

1. Getting Familiar With The Data

Before cooking a meal, a chef needs to be familiar with the ingredients at hand and what can be made of them. Data visualization is a sort of abstraction of data in which we tend to hide the complexity and show only meaning. To perform this abstraction, one must be familiar with the data they are to use.

The dataset I will be using is named “What’s on the Menu”. Import the libraries and load the CSV file:
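A minimal sketch of this step with pandas, assuming the file is named menu.csv and sits in the working directory:

```python
import pandas as pd

# Load the menu metadata (the file name and path are assumptions)
menu = pd.read_csv("menu.csv")

# A first look: dimensions, attribute names, and a few sample items
print(menu.shape)
print(menu.head())
```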

Start with the headings and move on to the contents. Our data is clearly a food-related dataset, one that probably describes restaurant menus and their dishes. On first analysis, it could be useful to chefs, restaurant owners, and even foodies.

Menu CSV file in the dataset

The CSV file in the image describes the various attributes of a menu, like the number of dishes, location, description, number of pages, etc. By looking at these, we could think of inferring menu popularity, restaurant hotspots, and so on.

**Note:** The whole dataset is a combination of 4 CSV files, namely menu, menu page, menu item, and finally, dish. It also contains images of these menus. In this example, only menu.csv is used. Go through the GitHub link to find more visualizations and examples.

2. Data and Dataset Categorization

The type of data hugely determines the variety of charts and visualization tools that can be used. Time cannot be visualized by maps, and coordinates cannot fully be visualized by line charts. The data in every dataset can be categorized as items, attributes, links, positions, and grids. Generally, a dataset (CSV, XLS files) consists of items and attributes. Attributes are your data fields, like location, name, length, and price, whereas items are their values. If you have links and attributes in your data, trees, treemaps, and adjacency matrices are preferred. Categorical data calls for bar charts or pie charts, whereas sequential data calls for line charts.
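As a quick illustration, inspecting column types and distributions in pandas helps sort attributes into categorical and sequential; a sketch, assuming menu.csv columns such as event and page_count:

```python
# Inspect attribute (column) types to guide the choice of chart
print(menu.dtypes)

# A categorical attribute such as 'event' suits bar or pie charts
print(menu["event"].value_counts().head())

# A numeric attribute such as 'page_count' suits histograms or line charts
print(menu["page_count"].describe())
```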

Image source: https://jenniewblog.wordpress.com/2016/02/04/what-data-abstraction-chapter-2/

Data can also be categorized by the physical quantity it represents, and it is often easier to classify it as such: temporal data for time, geospatial data for coordinates and locations, text data, and so on. Observing the data at hand reveals a great variety in it. In our example, we will take the location attribute for plotting.

3. Brainstorming Visual Tools

Once the data and its type are understood, the exact visual tool must be brainstormed. Since we will be using location, let’s take maps as an example. Even among maps, you must decide between physical, political, and topographical maps, and the exact library needed to complete the task. In Python, you can use Google APIs, GeoPandas, Folium, and many more.
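For instance, a minimal Folium sketch of a base map to build on (the starting coordinates and zoom level here are arbitrary choices):

```python
import folium

# An interactive world map as a starting canvas (Folium wraps Leaflet.js)
base_map = folium.Map(location=[40.7128, -74.0060], zoom_start=3)
base_map.save("base_map.html")  # open the HTML file in a browser to explore
```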

Google Maps

4. Data Analysis (Algorithmic Approach)

My initial implementations of data visualization completely skipped this step. I learned the hard way that visualization and analysis go hand in hand: if I intend to show something, then the message must be clear. A scatter plot can be quite informative in terms of averages and groupings of values, but with a regression line, it tells a much clearer story.

In our dataset, we will try to visualize the locations of restaurants with Folium maps and the K-Means clustering algorithm, to find food hotspot areas and their centroids. Start coding by taking the first 1000 locations and converting them to coordinates:

Encoding and storing locations
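A sketch of the geocoding step, assuming geopy’s free Nominatim service (the exact geocoder is an assumption; Google’s geocoding APIs are another option):

```python
import time
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="menu-visualization")

coordinates = []
# Convert the first 1000 location strings into (latitude, longitude) pairs
for place in menu["location"].dropna()[:1000]:
    result = geolocator.geocode(place)
    if result is not None:  # skip locations the geocoder cannot resolve
        coordinates.append([result.latitude, result.longitude])
    time.sleep(1)  # Nominatim's usage policy asks for about 1 request per second
```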

After the coordinates are stored, we use the scikit-learn library to fit them into 10 clusters with K-Means clustering.
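A sketch of the clustering step, reusing the coordinates list from the previous snippet:

```python
import numpy as np
from sklearn.cluster import KMeans

coords = np.array(coordinates)

# Partition the coordinates into 10 clusters
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10).fit(coords)

labels = kmeans.labels_              # cluster index assigned to each coordinate
centroids = kmeans.cluster_centers_  # (latitude, longitude) centre of each cluster
```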

Why K-Means Clustering?

Image source: https://www.geeksforgeeks.org/ml-k-means-algorithm/

If I visualized only the locations, it would be no better than telling the user to type the coordinates into Google and look up the name. If I had to plan a trip abroad, this map should somehow help me beyond just laying out well-known facts. What a clustering algorithm does is surface locations that are densely populated, in our case, food hotspots. I have used the K-Means clustering algorithm since it is easily implemented and fits the application perfectly. The number of clusters can be defined, and for each cluster its population can be counted, as in the sketch below.
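Counting the population of each cluster is a one-liner; a sketch, reusing labels from the clustering snippet:

```python
# Population of each cluster = number of coordinates assigned to it
cluster_sizes = np.bincount(labels, minlength=10)
print(cluster_sizes)
```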

5. Visualization

After applying the clustering algorithm, we just need to identify the clusters above a certain threshold, populated enough to be called hotspots. After finding these clusters, we take the coordinates of each cluster’s centroid and draw circles around them.

Try to implement all the code in a Jupyter notebook. Folium is used since all the tasks, like drawing markers and circles and naming them, can easily be done with it. Folium also has cluster markers that could be used. A sketch of this final step follows.
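A sketch of the final map, with an assumed population threshold and circle radius:

```python
import folium

# Centre the map roughly on the data (these coordinates are arbitrary)
hotspot_map = folium.Map(location=[40.0, -95.0], zoom_start=4)

THRESHOLD = 50  # minimum cluster population to count as a hotspot (assumed value)

for size, (lat, lon) in zip(cluster_sizes, centroids):
    if size >= THRESHOLD:
        folium.Circle(
            location=[lat, lon],
            radius=50_000,  # metres; chosen purely for visibility
            color="red",
            fill=True,
            popup=f"Hotspot: {size} locations",
        ).add_to(hotspot_map)

hotspot_map.save("hotspots.html")
```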

Output clusters (image by author)

Conclusion

I have followed this approach to data visualization quite a few times, and it works for me. Implementing algorithms side by side with visualization tools makes the result more appealing and interpretable. Apart from the strategy and tools, experience matters: knowing your target audience and the tools that best describe the data is important. This concludes the article; if you face any errors or have any doubts, do comment below. The code for the above map visualization and some others can be found in this colab link. If you want to stay updated, do connect on GitHub and LinkedIn for new projects and articles.

References:
1. “What’s on the Menu” dataset
2. Folium maps tutorial
3. Google Maps

