Big data analytics: Technology's bleeding edge

Research Trainee at Defence Research and Development Organisation
24 Aug 2014

  1.  Big Data refers to massive, often unstructured data that is beyond the processing capabilities of traditional data management tools.  Big Data can take up terabytes and petabytes of storage space in diverse formats including text, video, sound and images.  Traditional relational database management systems cannot deal with such large masses of data.  Examples: user status updates on Facebook, click streams across the internet.
  2.  Volume refers to the huge amount of data being generated every minute.  90% of the data in existence today was created in just the past two years.  IP traffic is projected to quadruple by 2015.  3 billion people are expected to be online by 2015.
  3.  Velocity refers to the SPEED at which new data is generated and moves around.  It includes real-time systems such as online banking, which need low response times.  "In-memory analytics" technology is employed to deal with data in motion.
  4.  Variety refers to the many data types we can now use.  Earlier the focus was on neat, structured data kept in tables in an RDBMS.  80% of the data available now is unstructured.  Data types are heterogeneous, varying from text to videos to audio to pictures.
  5. Transform problems into possibilities
  6.  Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other real-time insights.  Uses of big data analytics: Google search recommendations, Satyamev Jayate, genome reading.  Data mining vs. big data analytics: data mining has constraints such as the data having to be neat and clean, but big data cannot be neat because it is unstructured.  Data mining also requires an elaborate ETL process, so you must wait for the ETL cycle to complete before getting insights; big data analytics provides real-time insights.
  7. The four types of analytics:  Descriptive  Diagnostic  Predictive  Prescriptive
  8.  Relational databases failed to store and process Big Data.  As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.  The technologies associated with big data analytics include:  Hadoop  MapReduce  NoSQL
  9.  Hadoop is an open-source framework  A Java-based programming framework  Processes and stores large data sets  in a distributed computing environment.  Components of Hadoop:  HDFS (Hadoop Distributed File System)  MapReduce
  10.  HDFS stores data in a DISTRIBUTED, SCALABLE and FAULT-TOLERANT way.  The NameNode holds metadata about the data on the DataNodes.  The DataNodes actually hold the data, in the form of blocks, and are capable of communicating with one another.
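To make the NameNode/DataNode split concrete, here is a minimal sketch of reading a file through the HDFS Java API. The path /data/tweets.txt is hypothetical, and the NameNode address is assumed to come from a core-site.xml on the classpath; treat it as an illustration, not cluster-ready code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadDemo {
        public static void main(String[] args) throws Exception {
            // Configuration picks up fs.defaultFS (the NameNode address)
            // from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The client asks the NameNode which DataNodes hold the blocks,
            // then streams those blocks directly from the DataNodes
            Path file = new Path("/data/tweets.txt"); // hypothetical path
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }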
  11. Hadoop vs. SQL  Storage: Hadoop stores data as compressed files spread across n commodity servers; SQL stores data as tables and columns with relations between them.  Failure handling: Hadoop is fault tolerant, so if one node fails the system still works; in SQL, if any one node crashes the system raises an error so as to maintain consistency. Any questions?
  12.  Copying the same file over all (thousands of) nodes? Doesn't that seem like a waste of space?  It actually is not wasted memory, for two reasons:  If one node fails, the system still works, because the data is never lost.  A query is scaled across the nodes, so it brings faster results through parallel processing. E.g., to count occurrences of the word 'happy' on Twitter, the query is split across multiple servers by some criterion (here, months), and the results are consolidated.
  13.  MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks, as in the previous example, where the Twitter data was processed on different servers on the basis of months.  Hadoop is the physical implementation of MapReduce.  It is a combination of two Java functions: Mapper() and Reducer().  Example: to check the popularity of a piece of text, use word count.
  14.  The Mapper function maps the split files and provides input to the Reducer:

        Mapper(filename, file-contents):
            for each word in file-contents:
                emit(word, 1)

   The Reducer function clubs together the input provided by the Mapper and produces the output:

        Reducer(word, values):
            sum = 0
            for each value in values:
                sum = sum + value
            emit(word, sum)

  Can anyone think of any disadvantages?
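Written out as a runnable Hadoop job, the same word count looks roughly like the sketch below. It follows the classic WordCount example that ships with Hadoop; the class names are conventional, and the input and output paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: for every word in the split, emit (word, 1)
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sum all the 1s emitted for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }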
  15.  There were two major disadvantages when Hadoop was developed, both of which have since been resolved:  HDFS depended on a single NameNode. Solution: a secondary NameNode is attached to the primary NameNode.  MapReduce is a Java framework and did not support SQL queries. Solution: Facebook developed Hive, which allows analysts to work with SQL on a distributed database.
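To give a feel for what Hive enables, here is a sketch of querying Hive over plain JDBC. It assumes a running HiveServer2 and the hive-jdbc driver on the classpath; the host, port, credentials and the tweets table are all placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveWordCountDemo {
        public static void main(String[] args) throws Exception {
            // HiveServer2 speaks JDBC; connection details are placeholders
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = con.createStatement();

            // Plain SQL -- Hive compiles it into MapReduce jobs behind
            // the scenes. The 'tweets' table is hypothetical.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt FROM tweets GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + " = "
                        + rs.getLong("cnt"));
            }
            con.close();
        }
    }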
  16.  Not Only SQL  A non-relational database management system  Used where no fixed schemas are required and data is scaled horizontally.  4 categories of NoSQL databases:  Key-value pair  Columnar databases  Graph databases  Document databases
  17.  KEY-VALUE PAIR  Keys are used to retrieve values from opaque data blocks  Essentially a hash map  Tremendously fast. Drawback: no provision for content-based queries.
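A plain Java HashMap is a reasonable mental model for this category (real stores such as Redis or Riak distribute the map across machines, which this toy sketch does not attempt):

    import java.util.HashMap;
    import java.util.Map;

    public class KeyValueDemo {
        public static void main(String[] args) {
            // A key-value store behaves like a giant hash map:
            // the value is an opaque blob, and lookups work only by key.
            Map<String, byte[]> store = new HashMap<>();
            store.put("user:42:profile",
                    "{\"name\":\"asha\",\"city\":\"Delhi\"}".getBytes());

            byte[] blob = store.get("user:42:profile"); // fast lookup by key
            System.out.println(new String(blob));

            // The drawback from the slide: there is no way to ask the store
            // "find all users whose city is Delhi" -- it cannot see inside
            // the opaque values.
        }
    }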
  18.  DOCUMENT DATABASE • Again a key-value store, but the value is in the form of a document. • Documents do not have fixed schemas. • Documents can be nested. • Queries can be based on content as well as keys. • Use case: blogging websites.
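As one concrete example, here is a sketch using the MongoDB Java driver; the database, collection and field names are invented, and a local mongod instance is assumed:

    import java.util.Arrays;
    import org.bson.Document;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import static com.mongodb.client.model.Filters.eq;

    public class BlogPostDemo {
        public static void main(String[] args) {
            try (MongoClient client =
                         MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> posts =
                        client.getDatabase("blog").getCollection("posts");

                // No fixed schema: this document nests an author
                // sub-document and a tag list; the next document in the
                // collection could have entirely different fields.
                Document post = new Document("title", "Dreaming big with Hadoop")
                        .append("tags", Arrays.asList("hadoop", "nosql"))
                        .append("author", new Document("name", "A. Writer"));
                posts.insertOne(post);

                // Unlike a plain key-value store, we can query by content:
                Document hit = posts.find(eq("tags", "hadoop")).first();
                System.out.println(hit.toJson());
            }
        }
    }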
  19.  COLUMNAR DATABASE  Works on attributes rather than tuples  The key here is a column name and the value is the contiguous run of that column's values  Best for aggregation queries  Typical query pattern: select (one or two columns' values) where (the same or another column's value) = some value.
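The in-memory toy below (not a real columnar engine) shows why that query pattern is cheap in a columnar layout: the aggregation scans only the two columns it needs and never touches whole rows.

    public class ColumnarDemo {
        public static void main(String[] args) {
            // Columnar layout: each attribute is stored contiguously,
            // like these two parallel arrays.
            String[] city  = {"Delhi", "Mumbai", "Delhi", "Pune"}; // column "city"
            int[]    sales = {100, 250, 175, 90};                  // column "sales"

            // SELECT SUM(sales) WHERE city = 'Delhi'
            long sum = 0;
            for (int i = 0; i < city.length; i++) {
                if (city[i].equals("Delhi")) {
                    sum += sales[i];
                }
            }
            System.out.println(sum); // 275
        }
    }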
  20.  GRAPH DATABASES • A collection of nodes and edges. • Nodes represent data while edges represent the links between them. • The most dynamic and flexible category.
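A minimal adjacency-list sketch in plain Java, standing in for a real graph store such as Neo4j; the user names are invented:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class GraphDemo {
        public static void main(String[] args) {
            // Nodes carry the data; edges carry the relationships.
            Map<String, List<String>> follows = new HashMap<>();
            follows.computeIfAbsent("alice", k -> new ArrayList<>()).add("bob");
            follows.computeIfAbsent("alice", k -> new ArrayList<>()).add("carol");
            follows.computeIfAbsent("bob",   k -> new ArrayList<>()).add("carol");

            // Traversing a relationship is a direct edge walk,
            // not a table join.
            for (String friend : follows.getOrDefault("alice",
                    Collections.emptyList())) {
                System.out.println("alice -> " + friend);
            }
        }
    }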
  21. Websites:
• http://searchbusinessanalytics.techtarget.com/ – Experts sound off on big data, analytics and its tools
• http://www.ibmbigdatahub.com/infographic/four-vs-big-data – Big Data & Analytics Hub
• https://bigdatauniversity.com/bdu-wp/bdu-course/hadoop-fundamentals-i-version-3/ – Hadoop Fundamentals
  Research papers:
• Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in OSDI'04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, CA, December 2004.
  22. Data is the new oil. Without big data analysis, companies are deaf and dumb, mere wanderers on the web... like cattle on the highway! Thank you! Keep dreaming BIG :D