Cloud and Big Data

What the hell is that??

Created by Pierre Zemb / @PierreZ

+Me

  • French Engineer student at ISEN Brest
  • CIR student
  • Part-time internship at Crédit Mutuel Arkéa

Just to clarify

I'm just a student!

Why this presentation?

  • I wanted to have a good point of view about Big data and cloud
  • they are unknown at ISEN
  • It's something big obviously

Cloud? Like in the sky?

3 types of cloud

  • Infrastructure as a Service
  • Platform as a Service
  • Software as a Service

Explanations

Examples

  • IaaS(Infrastructure as a service):
    Amazon EC2, Windows Azure, Rackspace

  • PaaS(Platform as a service):
    AWS Elastic Beanstalk, Heroku, Force.com, Google App Engine

  • Saas(Software as a service):
    Google Apps, Microsoft Office 365

Hadoop doesn't belong to this !

4 Reasons Why Development in the Cloud Makes Sense

  • SaaS hosted solution are cool

  • Great for distributed teams

  • Effortlessly scalable

Overview of Openstack 1/2

Overview of Openstack 2/2

Overview of Juju 1/2

Overview of Juju 2/2

What's Big data?

meme

Size does matter 1/2

  • Facebook owns 300 petabytes of data and generates 500 terabytes of information per day

  • The experiments in the Large Hadron Collider produce about 15 petabytes of data per year

  • Steam delivers over 30 petabytes of content monthly

Size does matter 2/2

  • At its 2012 closure of file storage services, Megaupload held ~28 petabytes of user uploaded data

  • The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of the 3D CGI effects

  • Google processed about 24 petabytes of data per day in 2009

Why do we need to make it big? 1/2

90% of the data in the world today has been created in the last two years alone

Why do we need to make it big? 2/2

Big Data = 3 V's

  • Volume

  • Velocity

  • Variety

What's the objective?

Bring together and analyze large pools of data to discern patterns and make better decisions, which are impossible with regular technologies

The pros

  • scalability

  • open source

  • fail-safe system

The cons

meme

Big data is already here


We Are Data

Hadoop

Nice elephant! But what is it?


Quote from Wikipedia:


Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware created by Yahoo.

Where Is It From?

Apache Hadoop's MapReduce and HDFS components originally derived respectively from:


  • Google's MapReduce paper

  • Google File System (GFS) papers

The power of Hadoop

  • Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on commodity machines

  • Hadoop MapReduce - a programming model for large scale data processing.

Some key words...

  • Job Tracker

  • Task Tracker

  • Name Node

  • Secondary node

  • Data Node

What about an recap ?

Hadoop Distributed File System

HDFS is good for:

  • Very large files
  • Write once, read many-times
  • Any hardware

and not good for:

  • Low-latency access
  • Lots of small files
  • random writing

How does it work?

(Sorry guys, it's in French)

Map/Reduce

3 phases

  • Map phase

  • Shuffle phase

  • Reduce phase

Why is it so good?


Data locality optimization


Example

That's all!


Questions?