Me
- French engineering student at ISEN Brest
- CIR student
- Part-time internship at Crédit Mutuel Arkéa
Just to clarify
I'm just a student!
Why this presentation?
- I wanted to get a good overview of Big Data and the cloud
- These topics are not covered at ISEN
- And it's obviously something big
What's Big Data?
Size does matter 1/2
- Facebook owns 300 petabytes of data and generates 500 terabytes of information per day
- The experiments in the Large Hadron Collider produce about 15 petabytes of data per year
- Steam delivers over 30 petabytes of content monthly
Size does matter 2/2
- When its file-storage service was shut down in 2012, Megaupload held ~28 petabytes of user-uploaded data
- The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of the 3D CGI effects
- Google processed about 24 petabytes of data per day in 2009
Why do we need to make it big? 1/2
90% of the data in the world today has been created in the last two years alone
Why do we need to make it big? 2/2
What's the objective?
Bring together and analyze large pools of data to discern patterns and make better decisions, something conventional technologies cannot do
The pros
- Scalability
- Open source
- Fault tolerance
The cons
Hadoop
Nice elephant! But what is it?
Quote from Wikipedia:
"Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware."
It was created by Doug Cutting and Mike Cafarella; much of its early development happened at Yahoo!.
Where is it from?
Apache Hadoop's MapReduce and HDFS components were originally derived, respectively, from:
- Google's MapReduce paper
- Google's File System (GFS) paper
The power of Hadoop
- Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines
- Hadoop MapReduce - a programming model for large-scale data processing
Some key words...
- JobTracker
- TaskTracker
- NameNode
- Secondary NameNode
- DataNode (a configuration sketch after this list shows where some of these appear in client code)
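These are the daemons of a classic Hadoop 1.x cluster. A minimal sketch, assuming a Hadoop 1.x setup, of where the NameNode and JobTracker addresses would show up in client code; host names and ports are placeholders:

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // NameNode address: the HDFS metadata service clients contact first
        // (host and port are placeholders)
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
        // JobTracker address: where MapReduce jobs are submitted and scheduled
        // (Hadoop 1.x-era property name)
        conf.set("mapred.job.tracker", "jobtracker.example.com:9001");
        return conf;
    }
}
```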
What about a recap?
Hadoop Distributed File System
HDFS is good for:
- Very large files
- Write once, read many times (see the sketch after this list)
- Commodity hardware
and not good for:
- Low-latency access
- Lots of small files
- Random writes
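A minimal sketch of the write-once, read-many pattern through the HDFS Java client API; the file path is hypothetical, and the cluster address is assumed to come from the configuration on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Resolves to HDFS only if fs.default.name points at a NameNode
        FileSystem fs = FileSystem.get(conf);

        // Write once: create the file and stream data into it
        Path file = new Path("/user/demo/events.log");  // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: large sequential scans are what HDFS is built for
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```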
How does it work?
(Sorry guys, it's in French)
Map/Reduce
3 phases (a word-count sketch follows this list)
- Map phase
- Shuffle phase
- Reduce phase
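A minimal word-count sketch of the first and last phases; the shuffle in between is done by the framework, which sorts the map output and groups every value emitted under the same key before handing it to a reducer. Class names follow the classic example:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the shuffle has already grouped all counts for a word together
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```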
Why is it so good?
Data locality optimization: the computation is moved to the nodes where the data is stored, instead of moving the data across the network
Example
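For instance, a driver that could wire the mapper and reducer above into a job; the input and output paths are placeholders. Because the input lives in HDFS, Hadoop tries to schedule each map task on a node that already stores the corresponding block, which is the data locality optimization mentioned above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS paths; map tasks are scheduled close to these blocks
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```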