My recent presentation on Big Data: what it is, why there is so much hype now, startling facts, the opportunity, its history, important research papers such as GFS and MapReduce, technology platforms and organizations (Hadoop, Cassandra), an introduction to Hadoop, and the contributions of Indians working on various Big Data technologies at Google, Cloudera, Hortonworks, Yahoo, Facebook, and Aadhar. "All your answers lie in data" - @Sameer Sawhney
We all live in the Data Age. While data storage capacity has increased, the speed at which data can be read is still very slow, and the amount of data that is publicly available is growing at a very fast pace. Big data[1][2] is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
A colossal amount of data is being generated, and this has changed things.
In the good old days, we used an RDBMS to store and process this data: we brought the data to the processing units. But now the data is too huge to move, so the computation has to go to the data instead. Two technologies, described in the GFS and MapReduce papers, have made this possible.
Characteristics of Big Data (Gartner defined the 3 Vs: Volume, Velocity, and Variety):
- Volume: measured in petabytes and exabytes, not terabytes. Twitter alone generates around 7 TB of data every day, Facebook 10 TB, and Google 20 PB every day.
  - In 2013: 200 million active users creating over 400 million tweets each day.
  - In 2011: 200 million tweets every day, the equivalent of a 10-million-page book; reading this text would take 31 years.
  - In 2010: 65 million tweets a day.
  - In 2009: 2 million tweets a day.
- Velocity: the speed at which the data arrives.
- Variety: data comes from many different sources.
- Veracity: can this data be trusted?
- Value: is the data meaningful?
Questions to ask of any Big Data technology:
- What is the problem that the solution solves?
- Technology overview
- Specific solution
- Challenges in the current implementation/solution, if any
- Advantages and disadvantages
- Any alternatives to the specific solution
- Way forward for the technology/solution (optional)
In defining big data, it's also important to understand the mix of unstructured and multi-structured data that comprises the volume of information.

Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it's text-heavy. Metadata, Twitter tweets, and other social media posts are good examples of unstructured data.

Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.
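As a minimal sketch of how the structured part of such a log can be pulled out, here is a small Java example that parses one line in the Apache common log format. The sample line, class name, and field names are made up for illustration; this is not tied to any particular log-processing product.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
  // Common log format: host, identity, user, [timestamp], "request", status, bytes.
  private static final Pattern COMMON_LOG = Pattern.compile(
      "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+)");

  public static void main(String[] args) {
    // A made-up sample line in common log format.
    String line = "203.0.113.9 - - [12/Mar/2013:10:15:32 +0000] "
        + "\"GET /index.html HTTP/1.1\" 200 5120";
    Matcher m = COMMON_LOG.matcher(line);
    if (m.find()) {
      // The free-text log line becomes structured, queryable fields.
      System.out.printf("host=%s time=%s method=%s path=%s status=%s bytes=%s%n",
          m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(6));
    }
  }
}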
The Big Data pipeline: Data Source → Data Repository (where data persists) → Filter and Transform → Compute (a distributed, scale-out system). MapReduce is inevitable.

A short history:
- 1980s: the impedance mismatch problem; the rows and columns of relational databases served as the integration mechanism (relational dominance lasted into the 2000s).
- 1990s: object databases.
- 2000s: big Internet sites such as Amazon and Google faced enormous traffic. Bigger boxes hit real limits and real costs; the alternative was lots of little boxes, but SQL was designed for single-node systems. Google built BigTable; Amazon built Dynamo.
- The NoSQL movement: the term comes from Johan Oskarsson, who, from London, proposed a meetup in San Francisco (late 2000s) and needed a short, unique Twitter hashtag to advertise that single meeting: #nosql.

Data models:
1. Key-value: a key maps to an opaque value.
2. Document: JSON documents (no schema); queries can address portions of documents.
3. Column family: a single row key holds multiple column families, where each column family is an aggregate of columns that fit together.

An aggregate is about storing all related items in one cluster; a sketch of the three models follows.
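A rough sketch of the three data models, using plain Java maps to stand in for the stores. The record ("order:1001") and its fields are hypothetical, and no particular product's API is shown; the point is only how much structure each model exposes.

import java.util.List;
import java.util.Map;

public class DataModels {
  public static void main(String[] args) {
    // 1. Key-value: the store sees only an opaque blob per key;
    //    the application is responsible for (de)serializing it.
    Map<String, byte[]> keyValue = Map.of(
        "order:1001", "{\"customer\":\"asha\",\"total\":42.50}".getBytes());

    // 2. Document: the store understands the (JSON-like) structure,
    //    so queries can reach into portions of the document.
    Map<String, Object> document = Map.of(
        "_id", "order:1001",
        "customer", "asha",
        "items", List.of(Map.of("sku", "B-7", "qty", 2)));

    // 3. Column family: one row key, with related columns grouped into
    //    families that are read and written together.
    Map<String, Map<String, Map<String, String>>> columnFamily = Map.of(
        "order:1001", Map.of(
            "summary", Map.of("customer", "asha", "total", "42.50"),
            "items",   Map.of("B-7", "2")));

    System.out.println(keyValue.keySet() + " " + document + " " + columnFamily);
  }
}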
Hadoop

MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
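For concreteness, here is the classic WordCount example from the Hadoop MapReduce tutorial, in its new-API (org.apache.hadoop.mapreduce) form; the walkthrough below refers to this code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: receives one line at a time from the (default) TextInputFormat,
  // splits it on whitespace, and emits a <word, 1> pair for every token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives <word, [1, 1, ...]> and sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as a combiner: partial sums are computed map-side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it runs as: hadoop jar wordcount.jar WordCount <input-dir> <output-dir> (the jar name and paths here are just examples).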
The Mapper implementation, via its map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <<word>, 1>. The Reducer implementation, via its reduce method, just sums up the values, which are the occurrence counts for each key (i.e., words in this example).