An unsupervised framework
for effective indexing of Big
Data
Ramakrishna Sakhamuri, Dr. Pradeep Chowriappa
Outline:
 Indexing
 Regular File Systems
 Relational Databases
 Decision Making
 HDFS
 BIRCH Algorithm
 Project
Sample output of a
Data Analysis:
http://www.medsci.org/v13/p0099/ijmsv13p0099g003.jpg
Introduction:
Indexing:
4
• An index is a systematic arrangement of entries
designed to enable users to locate required
information.
• The process of creating an index is called
indexing.
File System:
• Every file system maintains
an index tree/table which
helps the user to store, parse
and retrieve the files.
• We can parse a file system in
two ways:
• Name driven.
• Content driven.
5
6
Databases:
• Primary Index
• Secondary Index
7
Traditional Decision
Making System:
http://www.jmir.org/article/viewFile/3555/1/39724
9
What are we looking at?
• Integration of Indexing and Clustering.
• Building a secondary index on top of
existing file system index.
Hadoop System
Architecture:
Hadoop is a free, Java-based
programming framework that
supports the processing of large
data sets in a distributed
computing environment.
11
HDFS Architecture:
HDFS, the Hadoop Distributed
File System, is a filesystem designed
for storing very large files with
streaming data access patterns,
running on clusters of commodity
hardware.
12
Design of HDFS:
• Very large files : “Very large” in this context means files that are hundreds of megabytes,
gigabytes or terabytes in size.
• Streaming data access : Each analysis will involve a large proportion, if not all, of the
dataset, so the time to read the whole dataset is more important than the latency in reading
the first record.
• Commodity Hardware: Hadoop doesn’t require expensive, highly reliable hardware to run
on. It’s designed to run on clusters of commodity hardware.
13
HDFS Concepts:
• Blocks: A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystem blocks are typically 512 bytes. HDFS, too, has the concept of a block, but it is a
much larger unit—64 MB by default. As in a filesystem for a single disk, files in HDFS are
broken into block-sized chunks.
• Namenodes and Datanodes: An HDFS cluster has two types of nodes operating in a master-
worker pattern: a namenode (the master) and a number of datanodes (workers).
• The namenode manages the filesystem namespace. It also knows the datanodes on
which all the blocks for a given file are located; however, it does not store block
locations persistently.
• Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they
are told to (by clients or the namenode), and they report back to the namenode
periodically with lists of blocks that they are storing.
14
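The block splitting described above can be sketched in a few lines. This helper is an illustration of the idea, not Hadoop code; it assumes the 64 MB default block size named on the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size (64 MB)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (start, length) extents for each block of a file.

    Mirrors how HDFS breaks a file into block-sized chunks; the last
    block is usually smaller than the block size.
    """
    extents = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        extents.append((offset, length))
        offset += length
    return extents

# A 150 MB file occupies three blocks: two full 64 MB blocks and one
# partial block holding the remaining 22 MB.
blocks = split_into_blocks(150 * 1024 * 1024)
```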
Data Flow (Data Read):
• The client opens the file it wishes to
read by calling open() on the
FileSystem object.
• DistributedFileSystem calls the
namenode, using RPC, to determine
the locations of the first few blocks
in the file. For each block, the
namenode returns the addresses of
the datanodes that have a copy of that
block.
• The DistributedFileSystem returns an
FSDataInputStream to the client for it
to read data from.
15
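The read path can be mimicked with toy stand-ins. The class names below are invented for illustration; the real DistributedFileSystem and FSDataInputStream live in Hadoop's Java API.

```python
# Toy model of the HDFS read path: the namenode maps a file to block
# locations, and the input stream plans reads against those replicas.
class ToyNameNode:
    def __init__(self, block_map):
        # block_map: filename -> list of (block_id, [datanode addresses])
        self.block_map = block_map

    def get_block_locations(self, path):
        # For each block, return the datanodes holding a replica.
        return self.block_map[path]

class ToyInputStream:
    def __init__(self, namenode, path):
        self.locations = namenode.get_block_locations(path)

    def read_plan(self):
        # The client reads each block from the closest replica (here,
        # simply the first address in the list).
        return [(block_id, nodes[0]) for block_id, nodes in self.locations]

nn = ToyNameNode({"/data/f1": [(0, ["dn1", "dn2"]), (1, ["dn3", "dn1"])]})
plan = ToyInputStream(nn, "/data/f1").read_plan()
```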
Data Flow (Data Write):
• DFSOutputStream splits the data being
written into packets, which it writes to
an internal queue called the data queue.
• The data queue is consumed by the
DataStreamer, whose responsibility it is
to ask the namenode to allocate new
blocks by picking a list of suitable
datanodes to store the replicas.
• The DataStreamer streams the packets to
the first datanode in the pipeline, which
stores each packet and forwards it to the
second datanode in the pipeline; the
second node then passes it on to the third.
16
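A toy model of the data queue and replication pipeline; the packet size and node names here are made up for illustration (real HDFS packets are on the order of 64 KB).

```python
from collections import deque

# Data is chopped into packets and queued; each packet is then passed
# datanode-to-datanode so every node in the pipeline stores a replica.
PACKET_SIZE = 4  # bytes, tiny for illustration

def write_through_pipeline(data, pipeline):
    data_queue = deque(data[i:i + PACKET_SIZE]
                       for i in range(0, len(data), PACKET_SIZE))
    stores = {dn: b"" for dn in pipeline}
    while data_queue:
        packet = data_queue.popleft()
        for dn in pipeline:          # first node stores and forwards,
            stores[dn] += packet     # second forwards to third, etc.
    return stores

replicas = write_through_pipeline(b"hello world!", ["dn1", "dn2", "dn3"])
```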
BIRCH:
 BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
 BIRCH is especially appropriate for very large data sets.
 The BIRCH clustering algorithm consists of two main phases.
 Phase 1: Build the CF Tree. Load the data into memory by building a cluster-feature tree
(CF tree, defined below). Optionally, condense this initial CF tree into a smaller one.
 Phase 2: Global Clustering. Apply an existing clustering algorithm on the leaves of the
CF tree. Optionally, refine these clusters.
17
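The two phases can be sketched on one-dimensional data. This is a deliberately flat simplification: phase 1 keeps a list of leaf CFs instead of a full CF tree, and phase 2 merges nearby leaf centroids in place of running a full clustering algorithm on the leaves.

```python
# Flat sketch of BIRCH's two phases on 1-D data.
def phase1_build_cfs(points, threshold):
    cfs = []  # each CF is a tuple (count, linear_sum, squared_sum)
    for x in points:
        best = min(cfs, key=lambda cf: abs(cf[1] / cf[0] - x), default=None)
        if best is not None and abs(best[1] / best[0] - x) <= threshold:
            i = cfs.index(best)           # absorb x into the closest CF
            cfs[i] = (best[0] + 1, best[1] + x, best[2] + x * x)
        else:
            cfs.append((1, x, x * x))     # start a new leaf CF
    return cfs

def phase2_global_cluster(cfs, gap):
    # Group leaf centroids whose neighbours are closer than `gap`.
    centroids = sorted(cf[1] / cf[0] for cf in cfs)
    clusters = [[centroids[0]]]
    for c in centroids[1:]:
        if c - clusters[-1][-1] <= gap:
            clusters[-1].append(c)
        else:
            clusters.append([c])
    return clusters

cfs = phase1_build_cfs([1.0, 1.8, 8.0], threshold=0.5)
clusters = phase2_global_cluster(cfs, gap=2.0)
```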
Cluster Feature:
 BIRCH clustering achieves its high efficiency by clever use of a small set of
summary statistics to represent a larger set of data points.
 For clustering purposes, these summary statistics constitute a CF, and represent a
sufficient substitute for the actual data.
 A CF is a set of three summary statistics that represent a set of data points in a
single cluster.
 Count. How many data values are in the cluster.
 Linear Sum. The sum of the individual coordinates; a measure of the location of
the cluster.
 Squared Sum. The sum of the squared coordinates; a measure of the spread of the
cluster.
18
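The three summary statistics can be captured directly. The class below is a sketch, showing the additivity of CFs that makes BIRCH efficient: merging two clusters only requires adding their statistics, never revisiting the points.

```python
from dataclasses import dataclass

# Cluster feature (CF) for d-dimensional points: three summary
# statistics stand in for all the points in the cluster.
@dataclass
class CF:
    n: int       # Count: how many points
    ls: list     # Linear Sum: per-coordinate sum (location)
    ss: float    # Squared Sum: sum of squared coordinates (spread)

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), sum(x * x for x in p))

    def absorb(self, other):
        # CFs are additive: merging clusters just adds the statistics.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [x / self.n for x in self.ls]

cf = CF.from_point((1.0, 2.0))
cf.absorb(CF.from_point((3.0, 4.0)))
```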
Cluster Feature (contd):
CF Tree:
 A CF tree is a tree structure composed of CFs.
 A CF tree represents the data in a characteristic, condensed form.
 There are three important parameters for any CF tree:
 Branching factor ‘B’, the maximum number of children allowed for a non-leaf node.
 Threshold ‘T’, the upper limit on the radius of the cluster in a leaf node.
 ‘L’, the maximum number of entries in a leaf node.
 For a CF entry in a root node or a non-leaf node, that CF entry equals the sum of
the CF entries in the child nodes of that entry.
20
Building a CF Tree:
 Compare the incoming CF with each CF in the root node, using the linear sum or mean of the CF.
 Create a dictionary which holds each CF and the respective distance.
 Enter the branch whose CF is closest to the incoming CF.
 If the current node is a non-leaf node, repeat the above step until a leaf node is reached.
 If the current node is a leaf node, compare the incoming record's CF to the leaf node's CFs
and enter the closest one.
 Perform one of (a) or (b):
a. If the radius of the chosen leaf, including the new record, does not exceed the threshold T, the
incoming record is assigned to that leaf and all of its parent CFs are updated.
b. If the radius of the chosen leaf, including the new record, does exceed the threshold T, a new leaf
is formed consisting of the incoming record only, and the parent CFs are updated.
21
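Steps (a) and (b) can be sketched for a flat list of leaf CFs; the descent through non-leaf nodes is omitted here, and the radius is the standard BIRCH formula computed from the cluster feature.

```python
import math

# 1-D data; each leaf CF is a tuple (count, linear_sum, squared_sum).
def radius(n, ls, ss):
    if n < 2:
        return 0.0
    return math.sqrt((2 * n * ss - 2 * ls * ls) / (n * (n - 1)))

def insert_record(leaves, x, T):
    if leaves:
        # Find the leaf CF whose centroid is closest to the record.
        i = min(range(len(leaves)),
                key=lambda j: abs(leaves[j][1] / leaves[j][0] - x))
        n, ls, ss = leaves[i]
        if radius(n + 1, ls + x, ss + x * x) <= T:
            leaves[i] = (n + 1, ls + x, ss + x * x)   # (a) absorb
            return leaves
    leaves.append((1, x, x * x))                      # (b) new leaf
    return leaves

leaves = []
for x in [1.0, 1.1, 9.0]:
    insert_record(leaves, x, T=0.5)
```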
Diameter of a
cluster:
The diameter of each candidate cluster is calculated and
compared with the given threshold value. The formula used
in the implementation (the standard BIRCH diameter,
denoted R here) is:

R = √( (2n·SS − 2·LS²) / (n(n−1)) )

22
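A quick check that the CF-based formula agrees with the value computed directly from the pairwise distances (one-dimensional example with three points):

```python
import math
from itertools import combinations

points = [1.0, 2.0, 4.0]
n = len(points)
LS = sum(points)                    # linear sum
SS = sum(x * x for x in points)     # squared sum

# CF-based computation: only n, LS, SS are needed.
R_cf = math.sqrt((2 * n * SS - 2 * LS * LS) / (n * (n - 1)))

# Direct computation: mean squared pairwise distance, then square root.
direct = math.sqrt(sum((a - b) ** 2 for a, b in combinations(points, 2))
                   * 2 / (n * (n - 1)))
```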
CF Tree structure:
General structure of a CF tree,
with branching factor B and up to
L entries in each leaf node.
23
Sample steps:
24
25
Project Data
Flow:
1. Writing Files into HDFS.
2. Pulling the block address
information from HDFS.
3. Passing the address and
the data to the BIRCH
algorithm.
Birch Algorithm in Python:
26
 In this project we implemented BIRCH in Python.
 We imported a package which contains a few implemented classes, such as cftree and
cfnode, plus two classes, non-leaf node and leaf node, which inherit from cfnode.
 We designed a Birch program which creates an instance of the cftree class and
passes it the input data along with a few other values, such as the branching factor,
initial diameter, and maximum node entries.
 The input data should contain only numbers, since the algorithm deals entirely
with linear and squared sums.
 Hence we preprocess the data before passing it to the algorithm.
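The preprocessing step can be sketched as follows; the row layout and example field values are assumptions for illustration, not the project's actual data.

```python
# BIRCH works on linear and squared sums, so every field passed to it
# must be numeric: drop non-numeric fields, convert numeric strings.
def preprocess(rows):
    cleaned = []
    for row in rows:
        numeric = []
        for field in row:
            try:
                numeric.append(float(field))
            except (TypeError, ValueError):
                continue  # drop non-numeric fields (ids, labels, ...)
        cleaned.append(numeric)
    return cleaned

data = preprocess([["patient1", "3.2", "7"], ["patient2", "1.1", "5"]])
```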
Birch Algorithm in Python: (contd.)
27
 Once cftree receives the data, it uses the other classes (cfnode, non-leaf and leaf nodes)
to build the CF tree, similar in structure to the one in slide 13.
 Once the tree is built, it returns all the leaves and the details of each leaf:
 How many CFs each leaf contains.
 The list of all the individual CFs in the leaf.
 How many successors its parent non-leaf node has.
 The address of the non-leaf node it belongs to.
Hadoop Implementation:
28
 Install a virtual machine on Windows or macOS.
 Install Ubuntu on the virtual machine.
 Download and install the Hadoop package on Ubuntu.
 Tell Hadoop where Java is installed (set JAVA_HOME).
 For pseudo-distributed mode, edit the configuration files:
a. core-site.xml -> sets the default filesystem scheme and authority.
b. hdfs-site.xml -> sets dfs.replication to 1 rather than the default of three; otherwise all
the blocks would be permanently flagged as under-replicated.
c. mapred-site.xml -> tells Hadoop the host and port pair where the JobTracker runs.
► Format the namenode.
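The three edits above correspond to property entries like the following. These are shown in Hadoop 1.x style; the exact property names (e.g. fs.default.name vs. the newer fs.defaultFS) and the JobTracker port depend on the Hadoop version in use.

```xml
<!-- core-site.xml: default filesystem scheme and authority -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single replica for pseudo-distributed mode -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: host:port where the JobTracker runs -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
```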
29
Project Implementation:
 Both these scripts are designed to run in the background all the time.

Client.sh
► Execute the shell script client.sh as a background process:
► bash $PATH/client.sh > $LOGFILE 2>&1 &
► This client script handles the following steps:
► It looks for files in the INPUT directory and, once any files arrive,
moves them into the HADOOP processing directory.
► It then loads the data into Hadoop and signals the Python process to
proceed with the next steps.

Client.py
► Execute the client.py Python script alongside the shell script:
► python $PATH/client.py > $LOGFILE 2>&1 &
► This script handles the following steps:
► It looks for the Hadoop-processed files and, once it finds them, pulls
their block addresses from HDFS.
► It feeds the data files to BIRCH one by one, along with their respective
addresses.
► After every run it writes the entire tree to a new file in a predefined
location.
Sample Output
References:
1. Larose, D. T. (2015). Data Mining and Predictive Analytics. Wiley.
2. Bouguettaya, A., et al. (2014). Efficient agglomerative hierarchical clustering.
Expert Systems with Applications.
3. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: an efficient data clustering
method for very large databases. In Proc. ACM SIGMOD.
4. https://codemphasis.files.wordpress.com/2012/09/hdfs-arch.jpg
31

Weitere ähnliche Inhalte

Was ist angesagt? (16)

Introduction to NetCDF-4
Introduction to NetCDF-4Introduction to NetCDF-4
Introduction to NetCDF-4
 
Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture
 
File Handling In C++(OOPs))
File Handling In C++(OOPs))File Handling In C++(OOPs))
File Handling In C++(OOPs))
 
Status of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and ToolsStatus of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and Tools
 
Python file handling
Python file handlingPython file handling
Python file handling
 
Ceph
CephCeph
Ceph
 
File handling in c++
File handling in c++File handling in c++
File handling in c++
 
Plank
PlankPlank
Plank
 
Implementing HDF5 in MATLAB
Implementing HDF5 in MATLABImplementing HDF5 in MATLAB
Implementing HDF5 in MATLAB
 
Parallel HDF5 Developments
Parallel HDF5 DevelopmentsParallel HDF5 Developments
Parallel HDF5 Developments
 
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
 
Sap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory databaseSap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory database
 
Dns
DnsDns
Dns
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfs
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 

Ähnlich wie An unsupervised framework for effective indexing of BigData

co-Hadoop: Data co-location on Hadoop.
co-Hadoop: Data co-location on Hadoop.co-Hadoop: Data co-location on Hadoop.
co-Hadoop: Data co-location on Hadoop.Yousef Fadila
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Fra...
Dache: A Data Aware Caching for Big-Data Applications Usingthe MapReduce Fra...Dache: A Data Aware Caching for Big-Data Applications Usingthe MapReduce Fra...
Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Fra...Govt.Engineering college, Idukki
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing datapreetik9044
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiUnmesh Baile
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentationAmrut Patil
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hivesrikanthhadoop
 

Ähnlich wie An unsupervised framework for effective indexing of BigData (20)

co-Hadoop: Data co-location on Hadoop.
co-Hadoop: Data co-location on Hadoop.co-Hadoop: Data co-location on Hadoop.
co-Hadoop: Data co-location on Hadoop.
 
HADOOP.pptx
HADOOP.pptxHADOOP.pptx
HADOOP.pptx
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Fra...
Dache: A Data Aware Caching for Big-Data Applications Usingthe MapReduce Fra...Dache: A Data Aware Caching for Big-Data Applications Usingthe MapReduce Fra...
Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Fra...
 
Hadoop -HDFS.ppt
Hadoop -HDFS.pptHadoop -HDFS.ppt
Hadoop -HDFS.ppt
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbai
 
Unit V.pptx
Unit V.pptxUnit V.pptx
Unit V.pptx
 
Hadoop
HadoopHadoop
Hadoop
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentation
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Lab 1 Essay
Lab 1 EssayLab 1 Essay
Lab 1 Essay
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 

Kürzlich hochgeladen

VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...Call Girls in Nagpur High Profile
 
Stark Industries Marketing Plan (1).pptx
Stark Industries Marketing Plan (1).pptxStark Industries Marketing Plan (1).pptx
Stark Industries Marketing Plan (1).pptxjeswinjees
 
2-tool presenthdbdbdbdbddhdhddation.pptx
2-tool presenthdbdbdbdbddhdhddation.pptx2-tool presenthdbdbdbdbddhdhddation.pptx
2-tool presenthdbdbdbdbddhdhddation.pptxsuhanimunjal27
 
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...babafaisel
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Top Rated Pune Call Girls Saswad ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated  Pune Call Girls Saswad ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...Top Rated  Pune Call Girls Saswad ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated Pune Call Girls Saswad ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...Call Girls in Nagpur High Profile
 
Top Rated Pune Call Girls Koregaon Park ⟟ 6297143586 ⟟ Call Me For Genuine S...
Top Rated  Pune Call Girls Koregaon Park ⟟ 6297143586 ⟟ Call Me For Genuine S...Top Rated  Pune Call Girls Koregaon Park ⟟ 6297143586 ⟟ Call Me For Genuine S...
Top Rated Pune Call Girls Koregaon Park ⟟ 6297143586 ⟟ Call Me For Genuine S...Call Girls in Nagpur High Profile
 
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...ranjana rawat
 
infant assessment fdbbdbdddinal ppt.pptx
infant assessment fdbbdbdddinal ppt.pptxinfant assessment fdbbdbdddinal ppt.pptx
infant assessment fdbbdbdddinal ppt.pptxsuhanimunjal27
 
UI:UX Design and Empowerment Strategies for Underprivileged Transgender Indiv...
UI:UX Design and Empowerment Strategies for Underprivileged Transgender Indiv...UI:UX Design and Empowerment Strategies for Underprivileged Transgender Indiv...
UI:UX Design and Empowerment Strategies for Underprivileged Transgender Indiv...RitikaRoy32
 
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service 🧵
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service  🧵CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service  🧵
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service 🧵anilsa9823
 
DragonBall PowerPoint Template for demo.pptx
DragonBall PowerPoint Template for demo.pptxDragonBall PowerPoint Template for demo.pptx
DragonBall PowerPoint Template for demo.pptxmirandajeremy200221
 
Brookefield Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Brookefield Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Brookefield Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Brookefield Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Tapestry Clothing Brands: Collapsing the Funnel
Tapestry Clothing Brands: Collapsing the FunnelTapestry Clothing Brands: Collapsing the Funnel
Tapestry Clothing Brands: Collapsing the Funneljen_giacalone
 
Call Girls Basavanagudi Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Basavanagudi Just Call 👗 7737669865 👗 Top Class Call Girl Service ...Call Girls Basavanagudi Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Basavanagudi Just Call 👗 7737669865 👗 Top Class Call Girl Service ...amitlee9823
 
💫✅jodhpur 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATISFACT...
💫✅jodhpur 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATISFACT...💫✅jodhpur 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATISFACT...
💫✅jodhpur 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATISFACT...sonalitrivedi431
 
Government polytechnic college-1.pptxabcd
Government polytechnic college-1.pptxabcdGovernment polytechnic college-1.pptxabcd
Government polytechnic college-1.pptxabcdshivubhavv
 
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Delhi Call girls
 

Kürzlich hochgeladen (20)

VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
 
Stark Industries Marketing Plan (1).pptx
Stark Industries Marketing Plan (1).pptxStark Industries Marketing Plan (1).pptx
Stark Industries Marketing Plan (1).pptx
 
Call Girls Service Mukherjee Nagar @9999965857 Delhi 🫦 No Advance VVIP 🍎 SER...
Call Girls Service Mukherjee Nagar @9999965857 Delhi 🫦 No Advance  VVIP 🍎 SER...Call Girls Service Mukherjee Nagar @9999965857 Delhi 🫦 No Advance  VVIP 🍎 SER...
Call Girls Service Mukherjee Nagar @9999965857 Delhi 🫦 No Advance VVIP 🍎 SER...
 
2-tool presenthdbdbdbdbddhdhddation.pptx
2-tool presenthdbdbdbdbddhdhddation.pptx2-tool presenthdbdbdbdbddhdhddation.pptx
2-tool presenthdbdbdbdbddhdhddation.pptx
 
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Top Rated Pune Call Girls Saswad ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated  Pune Call Girls Saswad ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...Top Rated  Pune Call Girls Saswad ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated Pune Call Girls Saswad ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
 
Top Rated Pune Call Girls Koregaon Park ⟟ 6297143586 ⟟ Call Me For Genuine S...
Top Rated  Pune Call Girls Koregaon Park ⟟ 6297143586 ⟟ Call Me For Genuine S...Top Rated  Pune Call Girls Koregaon Park ⟟ 6297143586 ⟟ Call Me For Genuine S...
Top Rated Pune Call Girls Koregaon Park ⟟ 6297143586 ⟟ Call Me For Genuine S...
 
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
 
infant assessment fdbbdbdddinal ppt.pptx
infant assessment fdbbdbdddinal ppt.pptxinfant assessment fdbbdbdddinal ppt.pptx
infant assessment fdbbdbdddinal ppt.pptx
 
UI:UX Design and Empowerment Strategies for Underprivileged Transgender Indiv...
UI:UX Design and Empowerment Strategies for Underprivileged Transgender Indiv...UI:UX Design and Empowerment Strategies for Underprivileged Transgender Indiv...
UI:UX Design and Empowerment Strategies for Underprivileged Transgender Indiv...
 
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service 🧵
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service  🧵CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service  🧵
CALL ON ➥8923113531 🔝Call Girls Kalyanpur Lucknow best Female service 🧵
 
young call girls in Pandav nagar 🔝 9953056974 🔝 Delhi escort Service
young call girls in Pandav nagar 🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Pandav nagar 🔝 9953056974 🔝 Delhi escort Service
young call girls in Pandav nagar 🔝 9953056974 🔝 Delhi escort Service
 
DragonBall PowerPoint Template for demo.pptx
DragonBall PowerPoint Template for demo.pptxDragonBall PowerPoint Template for demo.pptx
DragonBall PowerPoint Template for demo.pptx
 
Brookefield Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Brookefield Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Brookefield Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Brookefield Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Tapestry Clothing Brands: Collapsing the Funnel
Tapestry Clothing Brands: Collapsing the FunnelTapestry Clothing Brands: Collapsing the Funnel
Tapestry Clothing Brands: Collapsing the Funnel
 
Call Girls Basavanagudi Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Basavanagudi Just Call 👗 7737669865 👗 Top Class Call Girl Service ...Call Girls Basavanagudi Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Basavanagudi Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
 
💫✅jodhpur 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATISFACT...
💫✅jodhpur 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATISFACT...💫✅jodhpur 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATISFACT...
💫✅jodhpur 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATISFACT...
 
Government polytechnic college-1.pptxabcd
Government polytechnic college-1.pptxabcdGovernment polytechnic college-1.pptxabcd
Government polytechnic college-1.pptxabcd
 
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
 

An unsupervised framework for effective indexing of BigData

  • 1. An unsupervised framework for effective indexing of Big Data Ramakrishna Sakhamuri , Dr.Pradeep Chowriappa
  • 2. Outline:  Indexing  Regular File Systems.  Relational Databases.  Decision Making.  HDFS  BIRCH Algorithm  Project
  • 3. Sample output of a Data Analysis: http://www.medsci.org/v13/p0099/ijmsv13p0099g003.jpg Introduction:
  • 4. Indexing: 4 • An index is a systematic arrangement of entries designed to enable users to locate required information. • The process of creating an index is called indexing.
  • 5. File System: • Every file system maintains an index tree/table which helps the user to store, parse and retrieve the files. • We can parse a file system in two ways: • Name driven. • Content driven. 5
  • 6. 6
  • 9. 9 What are we looking at ?
  • 10. • Integration of Indexing and Clustering. • Building a secondary index on top of existing file system index.
  • 11. Hadoop System Architecture: Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. 11
  • 12. HDFS Architecture: HDFS stands for Hadoop Distributed Files System, is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. 12
  • 13. Design of HDFS: • Very large files : “Very large” in this context means files that are hundreds of megabytes, gigabytes or terabytes in size. • Streaming data access : Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record. • Commodity Hardware: Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on clusters of commodity hardware. 13
  • 14. HDFS Concepts: • Blocks: A disk has a block size, which is the minimum amount of data that it can read or write. Filesystem blocks are typically 512 bytes. HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks. • Namenodes and Datanodes: An HDFS cluster has two types of nodes operating in a master- worker pattern: a namenode (the master) and a number of datanodes (workers). • The namenode manages the filesystem namespace. The namenode also knows the datanodes on which all the blocks for a given file are located, however, it does not store block locations persistently • Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing. 14
  • 15. Data Flow (Data Read): • The client opens the file it wishes to read by calling open() on the FileSystem object. • Distributed FileSystem calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block. • The DistributedFileSystem returns an FSDataInputStream to the client for it to read data from 15
  • 16. Data Flow (Data Write): • DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. • The data queue is consumed by the DataStreamer , whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. • The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline and then the second node passes on to the third in the pipeline. 16
  • 17. BIRCH:  BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.  BIRCH is especially appropriate for very large data sets.  The BIRCH clustering algorithm consists of two main phases.  Phase 1: Build the CF Tree. Load the data into memory by building a cluster-feature tree (CF tree, defined below). Optionally, condense this initial CF tree into a smaller CF tree.  Phase 2: Global Clustering. Apply an existing clustering algorithm to the leaves of the CF tree. Optionally, refine these clusters.
  • 18. Cluster Feature:  BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to represent a larger set of data points.  For clustering purposes, these summary statistics constitute a CF and represent a sufficient substitute for the actual data.  A CF is a set of three summary statistics that represent a set of data points in a single cluster.  Count: how many data values are in the cluster.  Linear Sum: the sum of the individual coordinates. This is a measure of the location of the cluster.  Squared Sum: the sum of the squared coordinates. This is a measure of the spread of the cluster.
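These three statistics can be kept as a simple tuple. The key property behind BIRCH's efficiency is that CFs are additive: the CF of two merged clusters is just the component-wise sum of their CFs. A minimal sketch:

```python
# A CF is (n, LS, SS): count, per-coordinate linear sum, per-coordinate squared sum.

def cf_of_point(point):
    """CF of a cluster containing a single data point."""
    return (1, list(point), [x * x for x in point])

def cf_merge(cf1, cf2):
    """CF additivity: the CF of a merged cluster is the component-wise sum."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            [a + b for a, b in zip(ls1, ls2)],
            [a + b for a, b in zip(ss1, ss2)])

# Cluster holding the points (1, 2) and (3, 4):
cf = cf_merge(cf_of_point((1, 2)), cf_of_point((3, 4)))
print(cf)  # (2, [4, 6], [10, 20])
```

Because of this additivity, a new point can be absorbed into a cluster, or two subclusters combined, without ever revisiting the raw data.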
  • 20. CF Tree:  A CF tree is a tree structure composed of CFs.  A CF tree represents the data in a condensed, characteristic form.  There are three important parameters for any CF tree.  The branching factor 'B', which defines the maximum number of children allowed for a non-leaf node.  The threshold 'T', which is the upper limit on the radius of the cluster in a leaf node.  'L', the maximum number of entries in a leaf node.  For a CF entry in a root node or a non-leaf node, that CF entry equals the sum of the CF entries in the child nodes of that entry.
  • 21. Building a CF Tree:  Compare the incoming CF with each CF in the root node, using the linear sum or the mean of the CF.  Create a dictionary that holds each CF and the respective distance.  Descend into the branch whose CF is closest to the incoming CF.  If the current node is a non-leaf node, repeat the above step until a leaf node is reached.  If the current node is a leaf node, compare the incoming record's CF to the leaf node's CFs and enter the closest one.  Perform one of (a) or (b): a. If the radius of the chosen leaf, including the new record, does not exceed the threshold T, then the incoming record is assigned to that leaf and all of its parent CFs are updated. b. If the radius of the chosen leaf, including the new record, does exceed the threshold T, then a new leaf is formed, consisting of the incoming record only, and the parent is updated.
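The absorb-or-split decision above can be sketched for the simplest possible case: a flat list of leaf CFs over one-dimensional data. This toy version omits what a real CF tree also handles (non-leaf nodes, node splits, and the branching factor B), but shows steps (a) and (b):

```python
import math

# Scalar CFs: (n, LS, SS). We keep one flat list of leaf entries.

def diameter(n, ls, ss):
    """Pairwise-average diameter of a cluster, computed from its CF alone."""
    if n < 2:
        return 0.0
    return math.sqrt((2 * n * ss - 2 * ls * ls) / (n * (n - 1)))

def insert(leaves, x, threshold):
    """Insert value x: absorb it into the closest leaf if the enlarged leaf
    stays within the threshold (step a), else open a new leaf (step b)."""
    if leaves:
        # Closest leaf by distance from x to the leaf centroid LS/n.
        i = min(range(len(leaves)),
                key=lambda j: abs(x - leaves[j][1] / leaves[j][0]))
        n, ls, ss = leaves[i]
        merged = (n + 1, ls + x, ss + x * x)
        if diameter(*merged) <= threshold:
            leaves[i] = merged            # (a) absorb and update the CF
            return leaves
    leaves.append((1, x, x * x))          # (b) new leaf for this record only
    return leaves

leaves = []
for x in [1.0, 1.2, 8.0, 1.1, 8.3]:
    insert(leaves, x, threshold=1.0)
print(len(leaves))  # the two well-separated groups end up in two leaves: 2
```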
  • 22. Diameter of a cluster: The diameter of each candidate cluster is calculated and compared with the given threshold value. In terms of the CF entries (n, LS, SS), the formula used in the implementation is D = √[(2n·SS − 2·(LS)²) / (n(n−1))], where (LS)² denotes the dot product of the linear-sum vector with itself.
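As a sanity check on this formula, the CF-based diameter should agree with the definition computed by brute force, i.e. the square root of the average squared distance over all point pairs. A quick one-dimensional check:

```python
import math
from itertools import combinations

def diameter_from_cf(n, ls, ss):
    """Cluster diameter from its CF entry (n, LS, SS) alone."""
    return math.sqrt((2 * n * ss - 2 * ls * ls) / (n * (n - 1)))

def diameter_brute_force(points):
    """The definition: average squared distance over ordered pairs."""
    n = len(points)
    total = sum((a - b) ** 2 for a, b in combinations(points, 2))
    return math.sqrt(2 * total / (n * (n - 1)))  # x2: ordered pairs

pts = [1.0, 2.0, 4.0, 7.0]
n, ls, ss = len(pts), sum(pts), sum(x * x for x in pts)
print(abs(diameter_from_cf(n, ls, ss) - diameter_brute_force(pts)) < 1e-9)  # True
```

This identity is what lets BIRCH apply the threshold test without ever touching the raw points again.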
  • 23. CF Tree structure: General structure of a CF tree, with branching factor B and up to L entries in each leaf node.
  • 25. Project Data Flow: 1. Writing files into HDFS. 2. Pulling the block address information from HDFS. 3. Passing the address and the data to the BIRCH algorithm.
  • 26. BIRCH Algorithm in Python:  In this project we implemented BIRCH in Python.  We imported a package that contains a few pre-implemented classes, such as cftree and cfnode, along with two classes, non-leaf node and leaf node, which inherit from cfnode.  We designed a BIRCH driver program that creates an instance of the cftree class and passes it the input data along with a few other values, such as the branching factor, the initial diameter, and the maximum node entries.  The input data must contain only numbers, since the algorithm works entirely with linear and squared sums.  Hence we performed data preprocessing before passing the data to the algorithm.
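Since the algorithm consumes only numbers, the preprocessing step can be as simple as projecting each raw record onto its numeric fields. A minimal sketch (the record layout here is made up for illustration, not the project's actual input format):

```python
# Hypothetical raw records mixing text and numeric fields.
raw_records = [
    "fileA.txt,12,3.5",
    "fileB.txt,7,1.25",
    "fileC.txt,30,9.0",
]

def to_numeric_vector(record):
    """Keep only the fields that parse as numbers."""
    vector = []
    for field in record.split(","):
        try:
            vector.append(float(field))
        except ValueError:
            pass  # drop non-numeric fields such as file names
    return vector

data = [to_numeric_vector(r) for r in raw_records]
print(data)  # [[12.0, 3.5], [7.0, 1.25], [30.0, 9.0]]
```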
  • 27. BIRCH Algorithm in Python (cont.):  Once cftree receives the data, it uses the other classes (cfnode, non-leaf node, leaf node) to build the CF tree, which looks similar to the one in slide 13.  Once the tree is built, it returns all the leaves and the details about each leaf:  how many CFs each leaf contains;  the list of all the individual CFs in a leaf;  how many successors its parent non-leaf node has;  and the address of the non-leaf node it belongs to.
  • 28. Hadoop Implementation:  Install a virtual machine on Windows or OS X.  Install Ubuntu on the virtual machine.  Download and install the Hadoop package on Ubuntu.  Tell Hadoop where Java is installed (set JAVA_HOME).  For pseudo-distributed mode, change the configuration files to configure: a. core-site.xml -> to set the default schema and authority. b. hdfs-site.xml -> to set dfs.replication to 1 rather than the default of three; otherwise all blocks would constantly be flagged as under-replicated. c. mapred-site.xml -> to specify the host and port pair on which the JobTracker runs. ► Format the namenode.
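For reference, the three pseudo-distributed settings described above typically look like the following fragments. The property names follow classic Hadoop 1.x (which matches the JobTracker mentioned above), and the ports are the conventional defaults:

```xml
<!-- core-site.xml: default filesystem schema and authority -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single-node cluster, so one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: host:port pair where the JobTracker runs -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```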
  • 29. Project Implementation:  Both of these scripts are designed to run in the background at all times. Client.sh ► Execute the shell script client.sh as a background process: ► bash $PATH/client.sh > $LOGFILE 2>&1 & ► This client script handles the following steps: ► It watches the INPUT directory and, once it finds any files, moves them into the Hadoop processing directory. ► It then loads the data into Hadoop and signals the Python process to proceed with the further steps. Client.py ► Execute the client.py Python script alongside the above script: ► python $PATH/client.py > $LOGFILE 2>&1 & ► This script handles the following steps: ► It watches for the Hadoop-processed files and, once it gets those files, pulls their addresses from HDFS. ► It feeds the data files one by one to BIRCH along with their respective addresses. ► After every run it writes the entire tree to a new file in a predefined location.
  • 31. References: 1. Larose, D. T. (2015). Data Mining and Predictive Analytics. Wiley. 2. Bouguettaya, A., & Yu, Q. (2014). Efficient agglomerative hierarchical clustering. Expert Systems with Applications. 3. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proceedings of ACM SIGMOD. 4. https://codemphasis.files.wordpress.com/2012/09/hdfs-arch.jpg