Corinium Blog

Cloudera presentation at the Chief Analytics Officer, Fall 2016

Written by Alexis Efstathiou on 11 October 2016

  1. © Cloudera, Inc. All rights reserved. Data Engineering and Data Science: Modern Analytics and Data Processing for the Enterprise
  2. Today, Data is Everything! Instrumentation, Consumerization, Experimentation.
     • Today, everything that can be measured will be measured.
     • Today, data IS the application.
     • Today, becoming data-driven is a business imperative.
  3. “It will soon be technically feasible & affordable to record & store everything…” (New York Times). “Digital technologies will, in the near future, accomplish many tasks once considered uniquely human.” (Second Machine Age). Data is abundant, diverse & shared freely, as is how we store, process and analyze it: Streaming, Machine Learning, BI, ETL, Modeling.
  4. The new analytics paradigm: Understand why it happened; Change what happens next; Determine what happened; Make it happen consistently.
  5. Modern Data Engineering and Data Science require a new approach in order to handle more data, faster, with better access and a simplified architecture.
  6. Apache Hadoop. Apache Hadoop is a platform for data storage and processing that is distributed, scalable, fault tolerant, and open source.
     (Original) Core Hadoop Components
     • Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
     • YARN/MapReduce v2: Distributed Computing Across Physical Servers
     Flexibility
     • A single repository for storing, processing & analyzing any type of data
     • Not bound by a single schema
     • On Premises and in the Cloud
     Scalability + Complex Analysis
     • Scale-out architecture divides workloads across multiple nodes
     • Flexible file system eliminates ETL bottlenecks
     • Real-time analytics
     Low Cost
     • Can be deployed on industry standard hardware
     • Open source platform guards against vendor lock-in
     • 1-2 orders of magnitude less expensive than traditional systems
  7. End to End Lifecycle of Data Science [diagram]. Phases: Data Engineering; Data Science; Production (Data Engineering/App Development). Steps: Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Model Preparation, Batch Scoring, Online Scoring, Serving. Dev Tools: IDEs/Notebooks, Collaboration. Ops Tools: Versioning, Scheduling, Workflow, Publishing. Other labels: Data Governance, Governance, Processing, Acquisition, Model Quality & Performance, Experiments.
  8. Our Goal: Bring More Data Science Users to Hadoop
     • Help more data scientists use the power of Hadoop: use a powerful, familiar environment with direct access to Hadoop data and compute (Data Scientist, Data Engineer)
     • Make it easy and secure to add new users, use cases: offer secure self-service analytics and a faster path to production on common, affordable infrastructure (Enterprise Architect, Hadoop Admin)
  9. Who is Data Engineering for?
     Data Engineer/ETL Engineer
     • Needs projects to scale
     • Cares about performance
     • Cares about SLAs
     • Needs multitenancy, security, and optimized architecture
     Data Scientist/Data Analyst
     • Needs better scale
     • Cares about access to data
     • Wants better collaboration without managing dependencies
     Analytics Leader
     • Cares that his team is productive
     • Cares about enforcing standards
     • Wants results he can share with the business
  10. Requirements of a Data Science Platform
      • Leverage Big Data (Volume, Variety, Velocity) to tackle various use cases
      • Enable real-time use cases
      • Provide sufficient toolset for the Data Analysts
      • Provide sufficient toolset for the Data Scientists + Data Engineers
      • Provide standard data governance capabilities
      • Provide standard security across the stack
      • Provide flexible deployment options
      • Integrate with partner tools
      • Provide management tools that make it easy for IT to deploy/maintain
  11. Cloudera Enterprise, A New Way Forward
  12. Data Engineering and Data Science Workloads
      • Data Ingestion (Kafka, Navigator, Search): Cloudera enables users to build real-time, end-to-end data pipelines in order to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without being presented with gaps in security.
      • Data Processing (Spark, Hive): Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help take processing windows from hours to minutes and enables faster access to data for a variety of users and skillsets.
      • Data Science (Spark MLlib): Cloudera is bringing the most popular data science languages/libraries to our platform for easier collaboration, self-service exploration, and implementation at scale. Cloudera is advancing the state of distributed machine learning at scale. Cloudera enables exploratory data science and the ability to deliver robust data products.
  13. Data Ingestion for Hadoop: Ingest Any Data Type at Any Rate
      [Platform architecture diagram]
      • INTEGRATE: Structured (Sqoop); Unstructured (Kafka, Flume)
      • PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce); Stream (Spark); SQL (Impala); Search (Solr); SDK (Kite)
      • UNIFIED SERVICES: Resource Management (YARN); Security (Sentry, RecordService)
      • STORE: Filesystem (HDFS); Relational (Kudu); NoSQL (HBase)
      Apache Sqoop: SQL to Hadoop
      • Efficiently bulk load data (bidirectional)
      • Easily get started with custom connectors freely available (RDBMS/EDW/NoSQL)
      Apache Flume: Log Aggregation for Hadoop
      • Efficiently move large amounts of streaming/log data
      • Reliable, scalable, manageable, and extensible for production
      • Connector ecosystem for common streaming data sources
      • Easily gather logs from multiple systems
      Apache Kafka: Pub-Sub Messaging for Hadoop
      • Move data from many “producers” to many “consumers”
      • Most flexible to support a wide range of use cases
      • Integrates with Flume, HBase, Spark, etc.
  14. Powerful Data Processing: The Most Apache Spark Experience
      [Platform architecture diagram repeated from slide 13]
      Spark: Data processing and data science for developers and data scientists
      • Easy development
      • Flexible, extensible API
      • Fast batch and stream processing
      Cloudera: Most experience with Spark on Hadoop for instant success
      • First to ship and support
      • Most Spark users trained
      • Most customers running Spark
      • Most engineering resources (committers, contributors, support)
      • Only vendor focused on enterprise Spark
  15. Data Science: A Unified Platform to Accelerate Data Science from Exploration to Production. Data Scientists need to use data to Explore, Model, and Test. The field of data science blends math and statistics knowledge with advanced computer knowledge. “Data Scientist: Person who is better at statistics than any software engineer and better at software engineering than any statistician” (Josh Wills)
  16. Spark MLlib: Collection of mainstream machine learning algorithms built on Spark, including:
      • Classifiers: logistic regression, boosted trees, random forests, etc.
      • Clustering: k-means, Latent Dirichlet Allocation (LDA)
      • Recommender Systems: Alternating Least Squares
      • Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
      • Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc.
      • Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.
  17. Logistic Regression Performance (Data Fits in Memory). [Chart: Running Time (s), 0 to 4000, against # of Iterations (1, 5, 10, 20, 30), MapReduce vs. Spark.] MapReduce: 110 s/iteration. Spark: first iteration = 80 s; further iterations 1 s due to caching.
  18. End to End Lifecycle of Data Science [diagram]. Phases: Data Engineering; Data Science; Production (Data Engineering/App Development). Steps: Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Model Preparation, Batch Scoring, Online Scoring, Serving. Dev Tools: IDEs/Notebooks, Collaboration. Ops Tools: Versioning, Scheduling, Workflow, Publishing. Other labels: Data Governance, Governance, Processing, Acquisition, Model Quality & Performance, Experiments.
  19. Cloudera Data Science Workbench: A unified platform to accelerate data science from exploration to production.
      1. Team Productivity: Cloudera Workbench
      2. Automation: Cloudera Pipelines
      3. Data Products: Cloudera Models
  20. Hadoop as a Data Science Platform (requirement: component)
      • Leverage Big Data: Hadoop
      • Enable real-time use cases: Kafka, Spark Streaming, Kudu
      • Provide sufficient toolset for the Data Analysts: Spark, Hive, Impala, Hue
      • Provide sufficient toolset for the Data Scientists + Data Engineers: Cloudera Data Science Workbench
      • Provide standard data governance capabilities: Navigator + Partners
      • Provide standard security across the stack: Kerberos, Sentry, Record Service, KMS/KTS
      • Provide flexible deployment options: Cloudera Director
      • Integrate with partner tools: Rich Ecosystem
      • Provide management tools that make it easy for IT to deploy/maintain: Cloudera Manager/Director
  21. Three Core Enterprise Applications
      [Platform layers: OPERATIONS; DATA MANAGEMENT; UNIFIED SERVICES; PROCESS, ANALYZE, SERVE; STORE; INTEGRATE]
      • Data Engineering & Science: Process data, develop & serve predictive models
      • Analytic Database: ELT, reporting, exploratory business intelligence
      • Operational Database: Build data-driven applications to deliver real-time insights
  22. DATA-DRIVEN PRODUCTS: Delivering Improved Cash Flow to Healthcare Providers
      • Streamlined transfer of messages between payers and providers
      • Reduced cost per terabyte of storage by 90%
      • Delivered data encryption and security protection for HIPAA compliance
      HEALTHCARE » PRODUCT IMPROVEMENT » PREDICTIVE ANALYTICS » IT COST REDUCTION
  23. Improve Products & Services Efficiency
      • End-to-end view of data is helping save lives by detecting sepsis early enough for successful treatment
      • Has saved 100s of lives already & reduced hospital readmissions
      • Centralized data from many systems available in a secure environment
      • 2PB+ in multi-tenant environment supporting 100s of clients
  24. Thank you. jordan.volz@cloudera.com, linkedin.com/in/jordanvolz
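
The timing figures quoted on slide 17 (logistic regression performance) lend themselves to a quick back-of-the-envelope comparison. The sketch below is an illustration, not something taken from the deck: it assumes the 110 s/iteration annotation belongs to MapReduce, that the 80 s first iteration and roughly 1 s cached iterations belong to Spark, and that these costs stay constant from one iteration to the next. The function names are hypothetical.

```python
def mapreduce_seconds(iterations, per_iteration=110):
    """Total time if every iteration re-reads its input (assumed 110 s each)."""
    return max(iterations, 0) * per_iteration


def spark_seconds(iterations, first=80, cached=1):
    """Total time if the first pass loads and caches the data (assumed 80 s)
    and each later pass works on the in-memory copy (assumed 1 s)."""
    if iterations <= 0:
        return 0
    return first + (iterations - 1) * cached


if __name__ == "__main__":
    # Iteration counts taken from the chart's x-axis.
    for n in (1, 5, 10, 20, 30):
        mr = mapreduce_seconds(n)
        sp = spark_seconds(n)
        print(f"{n:>2} iterations: MapReduce ~{mr} s, Spark ~{sp} s")
```

Under these assumptions, 30 iterations come to about 3,300 s for MapReduce against about 109 s for Spark, which fits the chart's 0 to 4,000 s axis. The gap grows with the iteration count because only the first Spark pass pays the cost of loading the data.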

Topics: Presentation, Data Science, CAO, CDAO, Data Analytics
