Recent Trends in Visualization for the Data-Driven World
By Peter V. Henstock, Machine Learning & AI Technical Lead, Pfizer Inc. [NYSE: PFE]
Over the past six years, the business technology community has come to realize the value in collecting data at scale and applying analytical approaches to drive decision-making. This transformation across business, industry, healthcare and science in general has been well documented in CIOReview articles spanning Big Data, artificial intelligence and informatics. The data scientist, dubbed by Harvard Business Review the “sexiest job of the 21st century,” has emerged to meet the challenge of leveraging data from databases, deploying machine learning techniques across computational architectures, and producing actionable insights. Visualization and machine learning are both critically important and codependent for success.
Generating visual summaries is one of many core skills required for data scientists, as part of their communications and storytelling role. Visualizations serve as a common language through which analysts and decision makers exchange interpretations, assumptions, patterns and artifacts found in the data. Our brains have evolved to find patterns and comprehend data visually. Many new open-source tools have recently become available, including interactive web-based tools for Python and R, the standard languages of data science. It is no coincidence that machine learning libraries have similarly proliferated, since data scientists rely on a strong interplay between the visualization and machine learning fields.
Machine learning is currently the driving force behind artificial intelligence. It can identify relationships, make predictions, and extract insights from data. Before choosing a machine learning approach, understanding the data is crucial since each data set has unique nuances. Some approaches handle missing data entries gracefully, whereas others break. Outliers can distort results or add complexity, yet robust methods minimize their effects. Many statistical methods assume a particular data distribution.
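The checks described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function name profile_column, the sample ages, and the 3.5 cutoff on the modified z-score are all assumptions made for the example, and it uses only the standard library.

```python
import math
import statistics

def profile_column(values, threshold=3.5):
    """Summarize one numeric feature: missing entries, center, and outliers."""
    # Assumption: missing entries are encoded as None or NaN.
    present = [v for v in values
               if v is not None and not (isinstance(v, float) and math.isnan(v))]
    missing = len(values) - len(present)

    median = statistics.median(present)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = statistics.median(abs(v - median) for v in present)

    outliers = []
    if mad > 0:
        # 0.6745 rescales the MAD so the score is comparable to a z-score
        # when the data is roughly normal.
        outliers = [v for v in present
                    if abs(0.6745 * (v - median) / mad) > threshold]

    return {
        "missing": missing,
        "mean": statistics.mean(present),
        "median": median,
        "outliers": outliers,
    }

# Hypothetical feature with two missing entries and one implausible value.
ages = [34, 41, None, 38, 36, 39, float("nan"), 37, 40, 310]
report = profile_column(ages)
print(report)
```

For this sample, the single value of 310 pulls the mean up to 71.875 while the median stays at 38.5, which is the kind of gap that signals a non-robust method may be misled.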
Whether using a simple two-parameter regression or deep learning with millions of parameters, analysts must check the model assumptions and the validity of the results to ensure a correct interpretation. Caruana et al. highlighted this point in their 2017 study predicting hospital readmission for pneumonia patients. They demonstrated that applying appropriate analyses to carefully collected data gave completely incorrect results that would have gone undetected without their recommended visual verification. One can therefore argue that the practice of machine learning, in terms of strategy, verification, and communication, depends fully on visualization.
With the well-documented shortage of data scientists, some are turning to automating machine learning. The tuning and optimization required for machine learning algorithms can certainly be automated, particularly for well-defined problems or re-analyses given new data. However, the correct interpretation for a given algorithm still requires an understanding of the data. We have found that fully numeric tools built for a specific analysis are prone to incorrect interpretations, even with user training. A better alternative is providing visualizations with guiding explanations to help users attain the correct interpretation.
Although the formal Big Data movement may have passed, leveraging large and diverse data sets continues to be fundamental to business intelligence. Both the business and science fields recognize that an isolated event has value, and that this value can be increased with the context obtained by integrating additional data sets and applying machine learning techniques. In aggregating data, the number of features (i.e., the dimensionality) increases substantially. However, our standard bar and pie charts can only handle one-dimensional data, allowing an examination of each feature separately. Scatter plots usually show two dimensions, so we can examine all pairs of features, possibly with color or other glyphs, but we cannot easily see the more complex and less obvious patterns. The more advanced parallel coordinates visualization can handle about ten dimensions, but we frequently have data with hundreds or thousands of features.
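A quick calculation shows why examining all pairs of features stops being practical as dimensionality grows. The helper name pairwise_views is hypothetical; the count itself is simply the number of ways to choose two features from n.

```python
from math import comb

def pairwise_views(n_features):
    """Number of distinct two-feature scatter plots for n_features columns."""
    return comb(n_features, 2)

for n in (10, 100, 1000):
    print(n, "features ->", pairwise_views(n), "scatter plots")
# 10 features -> 45 scatter plots
# 100 features -> 4950 scatter plots
# 1000 features -> 499500 scatter plots
```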
Machine learning methods such as principal component analysis (PCA) can map the high-dimensional space into two dimensions for a scatter plot. Newer methods such as t-SNE have emerged that balance the local and global similarities displayed. These approaches not only empower the user to understand the final analysis but also provide a window into the nuances of the raw data.
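To make the PCA mapping concrete, the following is a minimal sketch that projects a small five-feature data set onto its first two principal components, ready for a scatter plot. It is written against the standard library only and finds the components by power iteration with deflation, which is one simple way to do it and is chosen here for readability; in practice an established library such as scikit-learn would normally be used. The function names and the synthetic data are assumptions made for illustration.

```python
import math
import random

def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, v):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def top_component(cov, iterations=500, seed=0):
    """Power iteration: approximate the dominant eigenvector and eigenvalue."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(len(cov))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance left in this matrix
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    pc1, lam1 = top_component(cov)
    d = len(cov)
    # Deflation: remove the variance explained by pc1, then repeat.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2, _ = top_component(deflated, seed=1)
    return [(sum(a * b for a, b in zip(r, pc1)),
             sum(a * b for a, b in zip(r, pc2))) for r in centered]

# Synthetic data: five features driven by two hidden factors plus noise.
rng = random.Random(42)
data = []
for _ in range(50):
    a, b = rng.gauss(0, 5), rng.gauss(0, 2)
    data.append([a, a + rng.gauss(0, 0.1), b, b + rng.gauss(0, 0.1),
                 rng.gauss(0, 0.1)])

points = pca_2d(data)  # one (x, y) pair per row, ready to plot
print(points[:3])
```

Each row of the original data becomes a single point, so the 50 five-dimensional records can be inspected in one scatter plot with the first axis capturing the largest share of the variance.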
The full data science pipeline is a multi-step process that evaluates hypotheses on high-dimensional data and generally conveys the results through visualization. The machine learning components of the pipeline can find the outliers, characterize the importance of each feature, find the key patterns and provide the answers. Still, different analysis pipelines for the same problem may produce results that vary as a function of time, use case, target population, and so on. A further visualization challenge is how to present these combinatorial sets of results to the user under the constraint of Miller’s Law, which states that human capacity for processing information is limited to 7±2 objects at a time. Current research in Human-Computer Interaction (HCI) continues to tackle this problem by exploring how users can navigate and effectively comprehend multiple sets of information. Not surprisingly, machine learning research is also being used to recommend sequences of views and identify interesting regions to flag for the user.
In this data science era, visualizations guide data scientists to select appropriate analysis strategies and help communicate results. Machine learning facilitates multidimensional data visualization and identifies the salient patterns to highlight. The interdisciplinary visualization field is working to combine information-theoretic and human cognitive models into a unified framework. This will facilitate better comprehension of raw data, provide insight into the machine learning techniques and convey the correct interpretations of results.