28 May 2020

Explore & Analyse your data with Apache Zeppelin

To gain insight from the growing amount of data generated both internally and externally by their sources and systems, companies need strong and reliable tools. Apache Zeppelin, an open-source data analytics and visualization platform, can take us a long way toward meeting that goal.


In today’s world, data is being generated at an exponential rate, and it shows no signs of slowing down: analysts predict that global data creation will increase tenfold by 2025. In its Data Age 2025 report for Seagate, IDC forecasts that the global datasphere will reach 175 zettabytes by 2025.

Frightening and fascinating at the same time, the future of big data analytics promises to change the way businesses operate in finance, healthcare, manufacturing, and other industries.

To be sure, the overwhelming size of big data may create additional challenges in the future, including data privacy and security risks, a shortage of data professionals, and difficulties in data storage and processing. However, most experts agree that big data will mean big value. It will give rise to new job categories and even entire departments responsible for data management in large organizations, and most companies will shift from being data-generating to data-powered, making use of actionable data and business insights.

Businesses are now collecting data across every internal system and external source that impacts their company, and with it comes an ever-growing need to analyze the data to gain insight into how it can be used to improve and enhance their business decisions.

Apache Zeppelin — an open-source data analytics and visualization platform — can take us a long way toward meeting that goal.

In this ‘Apache Zeppelin Analysis - Part 1’, we will first give an overview of what Apache Zeppelin is and why it is considered one of the most suitable tools for your Big Data projects. Second, we will look at which browsers are best suited to working with Apache Zeppelin. Finally, we will dive into a real analytics showcase, using Apache Zeppelin across the whole data analysis process, from data preparation to visualization.

What is Apache Zeppelin?

Zeppelin is an interactive, web-based notebook that enables data-driven, interactive data analytics with built-in visualizations and collaborative documents.

It lets you write code in a web page, execute it, and display the results in a table or graph. It does much more besides, as it supports Markdown and JavaScript (Angular): you can write code, hide it from your users, and create beautiful, shareable reports. You can even schedule a note (via cron) to run at regular intervals, and create real-time reports and graphs that are pushed to your users over WebSockets.

Apache Zeppelin presents itself as:

“The one interface for all your Big Data needs”

It is easy to mix languages in the same notebook: for example, some Spark SQL, then Scala, then Markdown to document it all together. You can also easily switch your notebook into a presentation style, for presenting to management or for use in dashboards.

Zeppelin supports many languages and engines, such as Spark, PySpark, SparkR and Spark SQL, with a dependency loader. It also lets you connect seamlessly to any JDBC data source: PostgreSQL, MySQL, MariaDB, Redshift, Apache Hive, and so on.
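As an illustrative sketch (the property names are those of Zeppelin's JDBC interpreter; the connection values below are hypothetical), a PostgreSQL source might be declared in the interpreter settings like this:

default.driver    org.postgresql.Driver
default.url       jdbc:postgresql://localhost:5432/bankdb
default.user      zeppelin_user
default.password  ********

Once configured, any paragraph starting with %jdbc can run ordinary SQL against that source.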

Here is a list of some of the languages, interpreters and features Zeppelin supports:

- Hadoop file operations
- Psql
- HBase
- Linux shell (sh)
- JDBC
- Scala

An Apache Zeppelin notebook allows you to create multiple notes, and each note is a set of paragraphs. In each paragraph, you develop your processing using the languages provided by Zeppelin (SQL, Scala, Shell, Markdown…).

For each paragraph, you can change its width (to align several paragraphs horizontally), add a title, and hide or show the editor.

Finally, once you have finished your development, you can switch to report mode, which hides the editors and shows only the results. You can then share your notes, or share the URL of a single paragraph and embed it in a website. Zeppelin also lets you export your results in CSV or TSV format.

Isn’t it nice?

Which browsers are best suited to Apache Zeppelin

Apache Zeppelin works well with the Google Chrome browser, which is considered the best-suited browser for working with Zeppelin.

Zeppelin also works fine with Mozilla Firefox, with some latency loading its web page during authentication.

Note that there is a limit on the amount of data that can be exported to a CSV file, and Firefox allows a larger dataset to be downloaded: in Chrome the limit is either 1.5 MB or 3,500 rows, while in Firefox it is possible to download up to 100k rows, although the web interface may slow to a crawl.

Internet Explorer is not well suited to working with Zeppelin: you will face several freezes in the Zeppelin UI, whether during authentication or while scrolling the Zeppelin web page. It should be pointed out that the latest versions of Internet Explorer do support Zeppelin (versions 8 and 9 are not supported, due to their lack of native support for WebSockets).

Apache Zeppelin, the best tool for data analysis: preparation, exploration, modeling & data visualization

Apache Zeppelin is an open-source project that simplifies Big Data analytics with web-based notes enabling interactive data analytics. It has multiple language backends and Apache Spark integration, allowing it to address various analytic tasks inside the notebook. Moreover, it supports interactive visualization and collaboration, with insights ready to be drawn from the data. Apache Zeppelin can thus handle data preparation, exploration, modeling and visualization in a single note.

The following demo shows how Zeppelin can cover a whole data discovery flow inside a single Zeppelin note:

A. Preparing Data:

In Zeppelin, we can download data with a simple shell script, as if typing in a terminal. This is done by means of the shell interpreter; the script in Figure 1 downloads our data.

%sh
rm -f ~/bank.zip
rm -rf ~/data
cd ~
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
mkdir data
unzip bank.zip -d data
rm bank.zip

hdfs dfs -put ~/data/bank-full.csv .
hdfs dfs -ls -h bank-full.csv

The first line, %sh, indicates that the shell interpreter is going to be used. Lines 2 and 3 clean up any data left over from a previous run. We then download the data with the ‘wget’ command, unzip it, and put it into HDFS. Our data is now ready for further processing.
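Before loading the file into Spark, it can help to sanity-check its layout. Here is a minimal sketch of that check; the two sample rows below are hypothetical stand-ins that mimic bank-full.csv's ';'-delimited, double-quoted format (only the first six columns are imitated):

```shell
# Build a tiny stand-in for bank-full.csv: a quoted header plus one data row.
printf '%s\n' '"age";"job";"marital";"education";"default";"balance"' \
              '58;"management";"married";"tertiary";"no";2143' > sample.csv

# Every row should report the same field count when split on ';'.
awk -F';' '{print NF}' sample.csv

# Preview the header row, as one would with the real file.
head -n 1 sample.csv
```

On the real dataset, pointing the same two commands at ~/data/bank-full.csv confirms the delimiter and the column count before the HDFS upload.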

B. Loading Data into Spark:

Once our dataset is in HDFS, we can access it and load it into Spark's RDD format. The script below is written in Zeppelin's default Scala Spark interpreter.

The script starts by creating a SQL context on top of the existing Spark context. Next, the input file is read, and a Bank ‘case class’ representing a record from the dataset is defined. Then, following the structure of the Bank class, we load the whole dataset. As you can see, every element of the dataset keeps five fields: age, job, marital status, education and balance. Finally, the resulting table is registered under the name “bank”.

// sc is an existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val bankText = sc.textFile("bank-full.csv")

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// Split each line on ';', drop the header row, strip the double quotes,
// and map the fields onto the Bank case class.
val bank = bankText.map(s => s.split(";"))
  .filter(s => s(0) != "\"age\"")
  .map(s => Bank(s(0).toInt,
                 s(1).replaceAll("\"", ""),
                 s(2).replaceAll("\"", ""),
                 s(3).replaceAll("\"", ""),
                 s(5).replaceAll("\"", "").toInt))
  .toDF() // toDF works only starting from Spark 1.3.0

bank.registerTempTable("bank")


Since Spark 2, we can load the CSV file directly as a DataFrame using Scala Spark in a single expression:

val bank = spark.read.format("csv")
                     .option("header", "true")
                     .option("delimiter", ";")
                     .load("bank-full.csv")

bank.createOrReplaceTempView("bank") // Spark 2 equivalent of registerTempTable

C. Exploration and Modeling Data:

Once we have loaded the dataset into a structured table, we can run various kinds of queries using Zeppelin's SQL interpreter. Suppose we want to get statistics about bank users who are under 30; the SQL query below does exactly that.

%sql
select age, count(1) value
from bank
where age < 30
group by age
order by age

The first line, %sql, indicates that the SQL interpreter is used.


We can reach the same result (the distribution of users under 30) as Query 1, but with better execution-time performance, by using either the %spark2.sql or the %spark2 interpreter provided by Zeppelin, as below:

val distAgeUnderThirty = spark.sql("select age, count(1) as value from bank where age < 30 group by age order by age")
distAgeUnderThirty.show()
// The same query with the DataFrame API:
bank.filter($"age" < 30).groupBy("age").count().orderBy("age").show()

D. Visualizing Data:

The results of “Query 1 – distribution under 30” are shown in the figure below, in graph view.

Results of Query 1

Zeppelin also provides table, histogram and pie chart views, among others. The view system is pluggable, so users can add new types of visualization, and the view can be changed interactively using the settings menu.

After this showcase, it is crystal clear that Apache Zeppelin can carry you through the whole data discovery workflow: we saw Zeppelin addressing every stage of the data discovery process inside a single note. Moreover, Apache Zeppelin supports various back-end interpreters with various front-end visualizations, making it one of the best-adapted Big Data tools for your data analytics projects.

Have a look at our other Big Data articles here!
