The goal of this project is to develop an unsupervised indexing system for Big Data. With this indexing system, one can search for data files not only by keywords and file names but also by the content closest in meaning to the input (a clustering approach).
3. Sample output of a Data Analysis:
(figure: http://www.medsci.org/v13/p0099/ijmsv13p0099g003.jpg)
Introduction:
4. Indexing:
• An index is a systematic arrangement of entries
designed to enable users to locate required
information.
• The process of creating an index is called
indexing.
5. File System:
• Every file system maintains
an index tree/table which
helps the user to store, parse
and retrieve the files.
• We can search a file system in
two ways:
• Name driven.
• Content driven.
10. • Integration of Indexing and Clustering.
• Building a secondary index on top of
existing file system index.
11. Hadoop System
Architecture:
Hadoop is a free, Java-based
programming framework that
supports the processing of large
data sets in a distributed
computing environment.
12. HDFS Architecture:
HDFS, the Hadoop Distributed
File System, is a filesystem designed
for storing very large files with
streaming data access patterns,
running on clusters of commodity
hardware.
13. Design of HDFS:
• Very large files: “Very large” in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size.
• Streaming data access: Each analysis will involve a large proportion, if not all, of the
dataset, so the time to read the whole dataset is more important than the latency in reading
the first record.
• Commodity hardware: Hadoop doesn’t require expensive, highly reliable hardware to run
on. It’s designed to run on clusters of commodity hardware.
14. HDFS Concepts:
• Blocks: A disk has a block size, which is the minimum amount of data that it can read or write.
Disk blocks are normally 512 bytes, and filesystem blocks are typically a few kilobytes. HDFS,
too, has the concept of a block, but it is a much larger unit: 64 MB by default. As in a
filesystem for a single disk, files in HDFS are broken into block-sized chunks.
• Namenodes and Datanodes: An HDFS cluster has two types of nodes operating in a master-
worker pattern: a namenode (the master) and a number of datanodes (workers).
• The namenode manages the filesystem namespace. The namenode also knows the
datanodes on which all the blocks for a given file are located; however, it does not store
block locations persistently, since this information is reconstructed from datanode reports.
• Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they
are told to (by clients or the namenode), and they report back to the namenode
periodically with lists of blocks that they are storing.
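The block concept above can be sketched in a few lines; `split_into_blocks` is an illustrative helper for this deck, not part of any Hadoop API:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block-sized chunk of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file occupies four blocks; the last block holds only the
# remaining 8 MB (HDFS does not pad the final block to a full 64 MB).
blocks = split_into_blocks(200 * 1024 * 1024)
```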
15. Data Flow (Data Read):
• The client opens the file it wishes to
read by calling open() on the
FileSystem object.
• DistributedFileSystem calls the
namenode, using RPC, to determine
the locations of the first few blocks in
the file. For each block, the namenode
returns the addresses of the datanodes
that have a copy of that block.
• The DistributedFileSystem returns an
FSDataInputStream to the client for it
to read data from.
16. Data Flow (Data Write):
• DFSOutputStream splits the data written
by the client into packets, which it writes
to an internal queue called the data queue.
• The data queue is consumed by the
DataStreamer, which asks the namenode
to allocate new blocks by picking a list of
suitable datanodes to store the replicas.
• The DataStreamer streams the packets to
the first datanode in the pipeline, which
stores each packet and forwards it to the
second datanode in the pipeline; the
second node then passes it on to the third
in the pipeline.
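The write pipeline above can be modeled as a toy simulation (the packet strings and datanode names here are illustrative, not real Hadoop objects):

```python
from collections import deque

def replicate(packets, datanodes):
    """Toy model of the HDFS write pipeline: each packet from the data
    queue is stored on the first datanode, forwarded to the second,
    and then passed on to the third."""
    stored = {dn: [] for dn in datanodes}
    data_queue = deque(packets)        # the DFSOutputStream "data queue"
    while data_queue:
        packet = data_queue.popleft()
        for dn in datanodes:           # pipeline: dn1 -> dn2 -> dn3
            stored[dn].append(packet)
    return stored

stored = replicate(["pkt0", "pkt1"], ["dn1", "dn2", "dn3"])
```

Every datanode in the pipeline ends up with an identical copy of each packet, which is how HDFS reaches its target replication factor.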
17. BIRCH:
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
BIRCH is especially appropriate for very large data sets.
The BIRCH clustering algorithm consists of two main phases.
Phase 1: Build the CF Tree. Load the data into memory by building a cluster-feature tree
(CF tree, defined below). Optionally, condense this initial CF tree into a smaller one.
Phase 2: Global Clustering. Apply an existing clustering algorithm on the leaves of the
CF tree. Optionally, refine these clusters.
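Both phases are available off the shelf in scikit-learn's Birch class (an assumption about the reader's toolchain, not the project's own implementation described later): Phase 1 builds the CF tree under the given threshold and branching factor, and Phase 2 runs a global clustering over the leaf entries when n_clusters is set.

```python
from sklearn.cluster import Birch

# Two well-separated groups of 2-D points.
X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]

# threshold bounds the radius of a leaf subcluster (Phase 1);
# n_clusters triggers the global clustering step (Phase 2).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
```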
18. Cluster Feature:
BIRCH clustering achieves its high efficiency by clever use of a small set of
summary statistics to represent a larger set of data points.
For clustering purposes, these summary statistics constitute a CF, and represent a
sufficient substitute for the actual data.
A CF is a set of three summary statistics that represent a set of data points in a
single cluster.
Count. How many data values are in the cluster.
Linear Sum. The sum of the individual coordinates; a measure of the location of
the cluster.
Squared Sum. The sum of the squared coordinates; a measure of the spread of the
cluster.
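The three summary statistics can be captured in a small class; this is a one-dimensional sketch for illustration, not the project's actual code (for vectors, the linear and squared sums become per-coordinate sums):

```python
class CF:
    """Cluster feature: (count, linear sum, squared sum) for one cluster."""

    def __init__(self, point=None):
        self.n = 0       # Count
        self.ls = 0.0    # Linear Sum of the coordinates
        self.ss = 0.0    # Squared Sum of the coordinates
        if point is not None:
            self.add(point)

    def add(self, x):
        """Absorb one data value into the summary."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        """CFs are additive: merging two clusters just adds the triples."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n
```

Additivity is what makes the CF a sufficient substitute for the raw points: a parent node's CF is simply the sum of its children's CFs.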
20. CF Tree:
A CF tree is a tree structure composed of CFs.
A CF tree is a compact, height-balanced summary of the data.
There are three important parameters for any CF tree:
Branching factor ‘B’, which defines the maximum number of children allowed for a non-leaf node.
Threshold ‘T’, which is the upper limit on the radius of a cluster in a leaf node.
‘L’, the maximum number of CF entries in a leaf node.
For a CF entry in a root node or a non-leaf node, that CF entry equals the sum of
the CF entries in the child nodes of that entry.
21. Building a CF Tree:
Compare the incoming CF with each CF in the root node, using the linear sum or mean of the CF.
Create a dictionary that holds each CF and its respective distance.
Descend into the branch whose CF is closest to the incoming CF.
If the current node is a non-leaf node, repeat the above step until a leaf node is reached.
At the leaf node, compare the incoming record’s CF with the leaf node’s CFs
and choose the closest one.
Perform one of (a) or (b):
a. If the radius of the chosen leaf, including the new record, does not exceed the threshold T, then the
incoming record is assigned to that leaf and all of its parent CFs are updated.
b. If the radius of the chosen leaf, including the new record, does exceed the threshold T, then a new leaf
is formed, consisting of the incoming record only, and the parent is updated.
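Steps (a) and (b) can be sketched for one-dimensional data; the tree descent is omitted and leaves are kept as a flat list, so this is an illustration of the absorb-or-split rule only, not the project's implementation:

```python
import math

def diameter(n, ls, ss):
    """Diameter of a 1-D cluster from its CF triple (0 for a single point)."""
    if n < 2:
        return 0.0
    return math.sqrt(max(0.0, (2 * n * ss - 2 * ls * ls) / (n * (n - 1))))

def insert(leaves, x, threshold):
    """Absorb x into the closest leaf CF if the enlarged cluster stays
    within the threshold (step a); otherwise start a new leaf (step b)."""
    if leaves:
        closest = min(leaves, key=lambda cf: abs(cf["ls"] / cf["n"] - x))
        if diameter(closest["n"] + 1, closest["ls"] + x,
                    closest["ss"] + x * x) <= threshold:
            closest["n"] += 1
            closest["ls"] += x
            closest["ss"] += x * x
            return leaves
    leaves.append({"n": 1, "ls": x, "ss": x * x})
    return leaves
```

With a threshold of 2.0, the values 1.0 and 1.5 fall into one leaf, while 10.0 is too far away and starts a second leaf.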
22. Diameter of a cluster:
The diameter of each cluster is calculated and
compared with the given threshold value. The
formula used in the implementation is:

R = sqrt( (2·n·SS − 2·LS²) / (n·(n − 1)) )
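The formula is just the root mean squared pairwise distance, computable from the CF triple alone. A sketch for 1-D data (for vectors, LS² becomes the squared norm of the linear sum), with a brute-force version to confirm the identity 2n·SS − 2·LS² = Σᵢⱼ(xᵢ − xⱼ)²:

```python
import math

def cf_diameter(points):
    """Diameter from CF statistics: sqrt((2n*SS - 2*LS^2) / (n*(n-1)))."""
    n = len(points)
    ls = sum(points)
    ss = sum(x * x for x in points)
    return math.sqrt((2 * n * ss - 2 * ls * ls) / (n * (n - 1)))

def brute_diameter(points):
    """Same quantity computed directly over all ordered pairs of points."""
    n = len(points)
    total = sum((a - b) ** 2 for a in points for b in points)
    return math.sqrt(total / (n * (n - 1)))
```

The CF version needs only the stored (n, LS, SS) triple, which is why the radius test during insertion never has to touch the raw data.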
23. CF Tree structure:
General structure of a CF tree,
with branching factor B and up to
L entries in each leaf node.
25. Project Data
Flow:
1. Writing Files into HDFS.
2. Pulling the block address
information from HDFS.
3. Passing the address and
the data to the BIRCH
algorithm.
26. Birch Algorithm in Python:
In this project we have done the BIRCH implementation in Python.
We imported a package that contains a few implemented classes, such as cftree and
cfnode, and two other classes, non-leaf node and leaf node, which are inherited from
cfnode.
We designed a Birch program that creates an instance of the cftree class and
passes it the input data along with a few other values, such as the branching factor,
initial diameter, and maximum node entries.
The input data should contain only numbers, since the algorithm deals entirely
with linear and squared sums.
Hence we perform data preprocessing before passing the data to the algorithm.
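A minimal preprocessing sketch along these lines, assuming comma-delimited input rows (the project's actual input format and preprocessing may differ):

```python
def preprocess(rows):
    """Keep only fields that parse as numbers, since BIRCH works purely
    on linear and squared sums."""
    numeric_rows = []
    for line in rows:
        values = []
        for field in line.strip().split(","):
            try:
                values.append(float(field))
            except ValueError:
                pass  # drop non-numeric fields such as labels or names
        if values:
            numeric_rows.append(values)
    return numeric_rows
```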
27. Birch Algorithm in Python (cont.):
Once cftree gets the data, it uses the other classes (cfnode, non-leaf node, and leaf node)
to build the CF tree, which looks similar to the one in slide 13.
Once the tree is built, it returns all the leaves and the details about each leaf:
How many CFs the leaf contains.
The list of all the individual CFs in the leaf.
How many successors its parent non-leaf node has.
The address of the non-leaf node it belongs to.
28. Hadoop Implementation:
Install a virtual machine on Windows or macOS.
Install Ubuntu on the virtual machine.
Download and install the Hadoop package on Ubuntu.
Tell Hadoop where Java is installed.
For pseudo-distributed mode, change the configuration files to configure:
a. core-site.xml -> to set the default scheme and authority.
b. hdfs-site.xml -> to set dfs.replication to 1 rather than the default of three; otherwise all
the blocks would constantly be flagged as under-replicated.
c. mapred-site.xml -> to specify the host and port pair where the JobTracker runs.
► Format the namenode.
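For reference, the classic Hadoop 1.x pseudo-distributed settings for steps (a) to (c) look roughly like this (a sketch of the standard single-node setup; hostnames and the port are the conventional defaults, not values from this project):

```xml
<!-- core-site.xml: default filesystem scheme and authority -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single-node cluster, so one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: host:port where the JobTracker runs -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
```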
29. Project Implementation:
Both these scripts are designed to run in the background all the time.
Client.sh
► Execute the shell script client.sh as a
background process:
► bash $PATH/client.sh > $LOGFILE 2>&1 &
► This client script handles the following steps:
► It looks for files in the INPUT
directory and, once it gets any, moves
them into the Hadoop processing
directory.
► Then it loads the data into Hadoop
and signals the Python process to
proceed with the further steps.
Client.py
► Execute the client.py Python script alongside
the above script:
► python $PATH/client.py > $LOGFILE 2>&1 &
► This script handles the following steps:
► It looks for the Hadoop-processed
files and, once it gets them, pulls
the addresses of those files from
HDFS.
► It gives the data files one by one to
BIRCH, along with their respective
addresses.
► After every run it writes the entire tree to a
new file in a predefined location.
31. References:
1. Larose, D. T. (2015). Data Mining and Predictive Analytics. Wiley.
2. Bouguettaya, A., & Yu, Q. (2014). Efficient agglomerative hierarchical clustering.
Expert Systems with Applications.
3. Zhang, T., & Ramakrishnan, R. (1996). BIRCH: An efficient data clustering method for very large
databases. NY.
4. https://codemphasis.files.wordpress.com/2012/09/hdfs-arch.jpg