INDEXING
Jehan-François Pâris
Spring 2015
Overview
 Three main techniques
Conventional indexes
 Think of a page table, …
B and B+ trees
 Perform better when records are constantly
added or deleted
Hashing
Conventional
indexes
Indexes
 A database index is a data structure that
improves the speed of data retrieval operations
on a database table at the cost of additional
writes and storage space to maintain the index
data structure.
Wikipedia
Types of indexes
 An index can be
Sparse
 One entry per data block
 Identifies the first record of the block
 Requires data to be sorted
Dense
 One entry per record
 Data do not have to be sorted
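A minimal sketch of the two kinds of lookup, assuming a sorted file split into blocks of two records. The names `blocks`, `sparse_index`, `dense_index`, `sparse_lookup` and `dense_lookup` are illustrative and do not come from the slides.

```python
from bisect import bisect_right

# Sorted data file, split into blocks of two records (hypothetical data).
blocks = [
    ["Ahmed", "Amita"],
    ["Brenda", "Carlos"],
    ["Dana", "Dino"],
    ["Emily", "Frank"],
]

# Sparse index: one entry per block, holding the first key of the block.
sparse_index = [block[0] for block in blocks]

# Dense index: one entry per record, mapping key -> (block number, slot).
dense_index = {
    key: (b, s) for b, block in enumerate(blocks) for s, key in enumerate(block)
}


def sparse_lookup(key):
    """Return (block number, slot) of key, or None if it is absent."""
    # Last block whose first key is <= key.
    b = bisect_right(sparse_index, key) - 1
    if b < 0:
        return None
    # The block itself must be read to know whether the record exists.
    if key in blocks[b]:
        return (b, blocks[b].index(key))
    return None


def dense_lookup(key):
    """Return (block number, slot) of key without touching the data file."""
    return dense_index.get(key)
```

The sparse index has four entries against eight for the dense one, but only the dense index can answer "does this record exist?" from the index alone.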
Respective advantages
 Sparse
 Occupy much less space
 Can keep more of it in main memory
Faster access
Dense
 Can tell if a given record exists without
accessing the file
 Do not require data to be sorted
Indexes based on primary keys
 Each key value corresponds to a specific record
 Two cases to consider:
Table is sorted on its primary key
 Can use a sparse index
Table is either non-sorted or sorted on
another field
 Must use a dense index
Sparse Index
Ahmed … …
Amita … …
Brenda … …
Carlos … …
Dana … …
Dino … …
Emily … …
Frank … …
Alan .
Dana .
Gina .
Dense Index
Ahmed … …
Frank … …
Brenda … …
Dana … …
Emily … …
Dino … …
Carlos … …
Amita … …
Ahmed
Amita
Brenda
Carlos
Dana
Dino
Emily
Frank
Indexes based on other fields
 Each key value may correspond to more than
one record
clustering index
 Two cases to consider:
Table is sorted on the field
 Can use a sparse index
Table is either non-sorted or sorted on
another field
 Must use a dense index
Sparse clustering index
Ahmed Austin …
Frank Austin …
Brenda Austin …
Dana Dallas …
Emily Dallas …
Dino Dallas …
Carlos Laredo …
Amita Laredo …
Austin .
Dallas .
Laredo .
Dense clustering index
Austin
Austin
Austin
Dallas
Dallas
Dallas
Laredo
Laredo
Dana Dallas …
Dino Dallas …
Emily Dallas …
Frank Austin …
Ahmed Austin …
Amita Laredo …
Brenda Austin …
Carlos Laredo …
Another realization
Dana Dallas …
Dino Dallas …
Emily Dallas …
Frank Austin …
Ahmed Austin …
Amita Laredo …
Brenda Austin …
Carlos Laredo …
Austin
Dallas .
Laredo .
We save space
and add one extra
level of indirection
A side comment
 "We can solve any problem by introducing an
extra level of indirection, except of course for the
problem of too many indirections."
 David John Wheeler
Indexing the index
 When index is very large, it makes sense to
index the index
Two-level or three-level index
Index at top level is called master index
 Normally a sparse index
Two levels
AKA
Master Index
Top Index
Updating indexed tables
 Can be painful
No silver bullet
B-trees and B+ trees
Motivation
 To have dynamic indexing structures that can
evolve when records are added and deleted
Not the case for static indexes
 Would have to be completely rebuilt
 Optimized for searches on block devices
 Neither B trees nor B+ trees are binary
Objective is to increase branching factor
(degree or fan-out) to reduce the number of
device accesses
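A rough way to see the effect of fan-out, assuming a completely full tree and one device access per level. The function name `levels_needed` is an illustrative choice, not something from the slides.

```python
def levels_needed(num_keys, fan_out):
    """Smallest number of levels such that fan_out ** levels >= num_keys."""
    levels = 1
    capacity = fan_out
    while capacity < num_keys:
        capacity *= fan_out
        levels += 1
    return levels
```

Under these assumptions, one million keys need 20 levels with a fan-out of 2 but only 3 levels with a fan-out of 100.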
Binary vs. higher-order tree
 Binary trees:
Designed for in-
memory searches
Try to minimize the
number of memory
accesses
 Higher-order trees:
Designed for
searching data on
block devices
Try to minimize the
number of device
accesses
 Searching within
a block is cheap!
B trees
 Generalization of binary search trees
 Not binary trees
The B stands for Bayer (or Boeing)
 Designed for searching data stored on block-
oriented devices
A very small B tree
Bottom nodes are leaf nodes: all their
pointers are NULL
In reality
[Figure: node layout. Each node alternates in-tree pointers with <key, data pointer> pairs. The root holds keys 7 and 16 and three pointers to leaves; in the leaves, all in-tree pointers are null.]
Organization
 Each non-terminal node can have a variable
number of child nodes
Must all be in a specific key range
Number of child nodes typically varies between
d and 2d
 Will split nodes that would otherwise have
contained 2d + 1 child nodes
 Will merge nodes that contain fewer than d
child nodes
Searching the tree
keys < 7 keys > 16
7 < keys < 16
Balancing B trees
 Objective is to ensure that all terminal nodes are
at the same depth
Insertions
 Assume a tree where each node can contain
three pointers (not shown)
 Step 1: insert 1 → [1]
 Step 2: insert 2 → [1 2]
 Step 3: insert 3 → [1 2 3] overflows
Split node in middle
Root [2], leaves [1] and [3]
Insertions
 Step 4: insert 4 → root [2], leaves [1] and [3 4]
 Step 5: insert 5 → leaf [3 4 5] overflows
Split the leaf and move 4 up
Root [2 4], leaves [1], [3] and [5]
Insertions
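 Step 6: insert 6 → root [2 4], leaves [1], [3] and [5 6]
 Step 7: insert 7 → leaf [5 6 7] overflows
Step 7 continued
Split: leaf [5 6 7] becomes [5] and [7] around central value 6
Promote: 6 moves up, so the root becomes [2 4 6] and overflows
Step 7 continued
 Split after the promotion
Root [4], internal nodes [2] and [6], leaves [1], [3], [5] and [7]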
Two basic operations
 Split:
When trying to add to a full node
Split node at central value
 Promote:
Must insert root of split
node higher up
May require a new split
[Figure: full node [5 6 7] splits at its central value 6, which is promoted above the new nodes [5] and [7].]
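The split step can be sketched on a plain sorted list of keys. This is only an illustration of "split node at central value"; the name `split_node` and the tuple it returns are assumptions, and child pointers are left out.

```python
def split_node(keys):
    """Split an overfull, sorted key list at its central value.

    Returns (left keys, promoted key, right keys).
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]
```

The promoted key is then inserted into the parent, which may itself overflow and be split the same way.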
B+ trees
 Variant of B trees
 Two types of nodes
Internal nodes have no data pointers
Leaf nodes have no in-tree pointers
 Were all null!
B+ tree nodes
[Figure: an internal node alternates in-tree pointers and keys (in-tree ptr, key, in-tree ptr, key, …); a leaf node is a sequence of <key, data ptr> pairs.]
More about internal nodes
 Consist of n – 1 key values K_1, K_2, …, K_{n-1} and n
tree pointers P_1, P_2, …, P_n:
< P_1, K_1, P_2, K_2, P_3, …, P_{n-1}, K_{n-1}, P_n >
 The keys are ordered: K_1 < K_2 < … < K_{n-1}
 For each tree value X in the subtree pointed at
by tree pointer P_i, we have:
X > K_{i-1} for 2 ≤ i ≤ n
X ≤ K_i for 1 ≤ i ≤ n – 1
Warning
 Other authors assume that
For each tree value X in the subtree pointed
at by tree pointer P_i, we have:
 X ≥ K_{i-1} for 2 ≤ i ≤ n
 X < K_i for 1 ≤ i ≤ n – 1
 Changes the key value that is promoted when
an internal node is split
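The two conventions differ only in which pointer is followed when the search value equals a key. A small sketch, assuming the keys of an internal node are held in a sorted Python list and pointers are numbered from 0; the function names are illustrative.

```python
from bisect import bisect_left, bisect_right


def child_index_le(keys, x):
    """Convention of these slides: the subtree holds values X <= K_i.

    Returns the 0-based position of the pointer to follow.
    """
    return bisect_left(keys, x)


def child_index_lt(keys, x):
    """Other convention: the subtree holds values X < K_i."""
    return bisect_right(keys, x)
```

With keys 7 and 16, a search for 7 follows the first pointer under the first convention and the second pointer under the other one.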
Advantages
 Removing unneeded pointers allows us to pack
more keys in each node
Higher fan-out for a given node size
 Normally one block
 Having all keys present in the leaf nodes allows
us to build a linked list of all keys
Properties
 If m is the order of the tree
 Every internal node has at most m children.
 Every internal node (except root) has at least
⌈m ⁄ 2⌉ children.
 The root has at least two children if it is not a
leaf node.
 Every leaf has at most m − 1 keys
 An internal node with k children has k − 1
keys.
 All leaves appear in the same level
Best cases and worst cases
 A B+ tree of degree m and height h will store
At most m^{h-1}(m – 1) = m^h – m^{h-1} records
At least 2⌈m ⁄ 2⌉^{h-1} records
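A quick numeric check of these bounds, reading them as m^(h-1)(m – 1) for the maximum and 2⌈m/2⌉^(h-1) for the minimum. The function names are illustrative.

```python
from math import ceil


def max_records(m, h):
    """Upper bound: m ** (h - 1) leaves, each holding m - 1 keys."""
    return m ** (h - 1) * (m - 1)


def min_records(m, h):
    """Lower bound as read from the slide: 2 * ceil(m / 2) ** (h - 1)."""
    return 2 * ceil(m / 2) ** (h - 1)
```

For m = 100 and h = 3 this gives between 5,000 and 990,000 records.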
Searches
 def search(k):
    return tree_search(k, root)
Searches
def tree_search(k, node):
    # node holds keys k_0 … k_d and pointers p_0 … p_{d+1}
    if node is a leaf:
        return node
    elif k < k_0:
        return tree_search(k, p_0)
    …
    elif k_i ≤ k < k_{i+1}:
        return tree_search(k, p_{i+1})
    …
    elif k_d ≤ k:
        return tree_search(k, p_{d+1})
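A runnable sketch of the same search, under stated assumptions: the `Node` class, its `keys` and `children` fields and the helper `contains` are hypothetical, and a loop replaces the recursion. It follows the rule used in the pseudocode, where k_i ≤ k < k_{i+1} selects pointer p_{i+1}.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for a leaf

    @property
    def is_leaf(self):
        return not self.children


def tree_search(k, node):
    """Return the leaf that should contain key k."""
    while not node.is_leaf:
        # Number of keys <= k, i.e. the pointer p_{i+1} with k_i <= k < k_{i+1}.
        node = node.children[bisect_right(node.keys, k)]
    return node


def contains(k, root):
    """True if key k is stored in the leaf reached by the search."""
    return k in tree_search(k, root).keys
```

Because equal keys go to the right under this rule, each separator key in the example tree used for testing is the smallest key of the subtree to its right.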
Insertions
 def insert (entry) :
 Find target leaf L
 if L has fewer than m – 1 entries :
 add the entry
else :
 Allocate new leaf L'
 Pick the m/2 highest keys of L and move them to L'
 Insert the highest key of L and the address of L'
into the parent node
 If the parent is full :
 Split it and add the middle key to its parent node
 Repeat until a parent is found that is not full
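The leaf-level part of this procedure can be sketched as follows. It assumes a leaf is a sorted Python list that holds at most m – 1 keys and that "m/2" is rounded down; the name `insert_into_leaf` and its return shape are illustrative, and the propagation into parent nodes is left out.

```python
from bisect import insort


def insert_into_leaf(leaf, key, m):
    """Insert key into a sorted leaf of a B+ tree of order m.

    Returns (leaf, None, None) when the leaf had room, otherwise
    (L, separator, L') where separator is the highest key left in L.
    """
    insort(leaf, key)
    if len(leaf) <= m - 1:          # a leaf holds at most m - 1 keys
        return leaf, None, None
    moved = m // 2                  # number of highest keys moved to L'
    left, right = leaf[:-moved], leaf[-moved:]
    return left, left[-1], right
```

With m = 3, inserting 3 into the leaf [1, 2] yields the leaves [1, 2] and [3] and the separator 2, which the caller would then insert into the parent node.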
Deletions
 def delete (record) :
 Locate target leaf and remove the entry
 If leaf is less than half full:
 Try to re-distribute, taking from sibling (adjacent
node with same parent)
 If re-distribution fails:
 Merge leaf and sibling
 Delete the parent entry that points to one of the two merged leaves
 Merge could propagate to root
Insertions
 Assume a B+ tree of degree 3
 Step 1: insert 1 → [1]
 Step 2: insert 2 → [1 2]
 Step 3: insert 3 → [1 2 3] overflows
Split node in middle
Root [2], leaves [1 2] and [3]
Insertions
 Step 4: insert 4 → root [2], leaves [1 2] and [3 4]
 Step 5: insert 5 → leaf [3 4 5] overflows
Split the leaf and move 4 up
Root [2 4], leaves [1 2], [3 4] and [5]
Insertions
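 Step 6: insert 6 → root [2 4], leaves [1 2], [3 4] and [5 6]
 Step 7: insert 7 → leaf [5 6 7] overflows
Step 7 continued
Split: leaf [5 6 7] becomes [5 6] and [7]
Promote: 6 moves up, so the root becomes [2 4 6] and overflows
Step 7 continued
 Split after the promotion
Root [4], internal nodes [2] and [6], leaves [1 2], [3 4], [5 6] and [7]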
Importance
 B+ trees are used by
NTFS, ReiserFS, NSS, XFS, JFS, ReFS, and
BFS file systems for metadata indexing
BFS for storing directories.
IBM DB2, Informix, Microsoft SQL Server,
Oracle 8, Sybase ASE, and SQLite for table
indexes
An interesting variant
 Can simplify entry deletion by never merging
nodes that have fewer than ⌈m ⁄ 2⌉ entries
 Wait instead until they are empty and can be
deleted
 Requires more space
 Seems to be a reasonable tradeoff assuming
random insertions and deletions
Not on
Spring 2015
first quiz
Hashing
Fundamentals
 Define m target addresses (the "buckets")
 Create a hash function h(k) that is defined for
all possible values of the key k and returns an
integer value h such that 0 ≤ h ≤ m – 1
Key h(k)
The idea
Key
Hash
value
is
Bucket
address
Bucket sizes
 Each bucket consists of one or more blocks
Need some way to convert the hash value
into a logical block address
 Selecting large buckets means we will have to
search the contents of the target bucket to find
the desired record
If search time is critical and the database
infrequently updated, we should consider
sorting the records inside each bucket
Bucket organization
 Two possible solutions
Buckets contain records
 When bucket is full, records go to an
overflow bucket
Buckets contain pairs <key, address>
 When bucket is full, pairs <key, address>
go to an overflow bucket
Buckets contain records
Assume each
bucket contains
two records
Overflow bucket
Buckets contain pairs <key, address>
[Figure: a bucket of keys, each key pointing to a record stored elsewhere.]
A bucket can contain many more keys than records
Finding a good hash function
 Should distribute records evenly among the
buckets
A bad hash function will have too many
overflowing buckets and too many empty or
near-empty buckets
A good starting point
 If the key is numeric
Divide the key by the number of buckets
 If the number of buckets is a power of two,
this means selecting log2 m least significant
bits of key
 Otherwise
Transform the key into a numerical value
Divide that value by the number of buckets
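A minimal sketch of this starting point, taking the remainder of the division as the bucket number. The helper names `to_number` and `bucket_for` are hypothetical, and interpreting the UTF-8 bytes of a non-numeric key as one big integer is just one possible transformation.

```python
def to_number(key):
    """Turn a key into an integer (one possible transformation)."""
    if isinstance(key, int):
        return key
    return int.from_bytes(str(key).encode("utf-8"), "big")


def bucket_for(key, m):
    """Remainder of the division of the key by the number of buckets m."""
    return to_number(key) % m
```

When m is a power of two, `key % m` equals `key & (m - 1)` for a non-negative integer key, which is the selection of the log2 m least significant bits mentioned above.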
Looking further
 Hashing works best when the number of buckets
is a prime number
 If performance matters, consult
Donald Knuth's Art of Computer Programming
http://en.wikipedia.org/wiki/Hash_function
Selecting the load factor
 Fraction of used slots
Best range is between 0.5 and 0.8
 If load factor < 0.5
Too much space is wasted
 If load factor > 0.8
Bucket overflows start becoming a problem
 Depending on how evenly the hash
function distributes the keys among the
buckets
Dynamic hashing
 Conventional hashing techniques work well
when the maximum number of records is known
ahead of time
 Dynamic hashing lets the hash table grow as the
number of records grows
 Two techniques:
Extendible hashing
Linear hashing
Extendible hashing
 Represent hash values as bit strings:
100101, 001001, …
 Introduce an additional level of indirection, the
directory
One entry per key value
Multiple entries can point to the same bucket
Extendible hashing
 We assume a three-bit key
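Directory: one entry for each three-bit value, 000 to 111
Entries 000 to 011 point to the bucket holding records with key = 0* (here K = 010)
Entries 100 to 111 point to the bucket holding records with key = 1* (here K = 111)
Both buckets are at same depth d = 1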
Extendible hashing
 When a bucket overflows, we split it
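Directory: one entry for each three-bit value, 000 to 111
Entries 000 and 001 point to the bucket holding records with key = 00* (here K = 000), d = 2
Entries 010 and 011 point to the new bucket holding records with key = 01* (here K = 010 and K = 011), d = 2
Entries 100 to 111 point to the bucket holding records with key = 1* (here K = 111), d = 1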
Explanations (I)
 Choice of a bucket is based on the most
significant bits (MSBs) of hash value
 Start with a single bit
Will have two buckets
 One for MSB = 0
 Other for MSB = 1
 Depth of bucket is 1
Explanations (II)
 Each time a bucket overflows, we split it
Assume first bucket overflows
 Will add a new bucket containing records
with MSBs of hash value = 01
 Older bucket will keep records with MSBs
of hash value = 00
 The depth of these two buckets is 2
Explanations (III)
 At any given time, the hash table will contain
buckets at different depths
In our example, buckets 00 and 01 are at
depth 2 while bucket 1 is at depth 1
 Each bucket will include a record of its depth
Just a few bits
Discussion
 Extendible hashing
Allows hash table contents
 To grow, by splitting buckets
 To shrink by merging buckets
but
Adds one level of indirection
 No problem if the directory can reside in
main memory
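A simplified sketch of the scheme, assuming three-bit hash values, a fixed directory of eight entries and two records per bucket. The class names, `BITS` and `CAPACITY` are illustrative, and a full implementation would also grow the directory and merge buckets, which this sketch does not do.

```python
BITS = 3          # width of the hash value (assumption)
CAPACITY = 2      # records per bucket (assumption)


class Bucket:
    def __init__(self, depth):
        self.depth = depth      # number of MSBs shared by all its records
        self.records = []


class ExtendibleHashTable:
    def __init__(self):
        # Start with one bucket per value of the most significant bit.
        zero, one = Bucket(1), Bucket(1)
        half = 2 ** (BITS - 1)
        self.directory = [zero] * half + [one] * half

    def insert(self, h):
        """Insert a record identified by its BITS-bit hash value h."""
        bucket = self.directory[h]
        while len(bucket.records) >= CAPACITY:
            if bucket.depth == BITS:
                raise OverflowError("bucket cannot be split any further")
            self._split(bucket)
            bucket = self.directory[h]
        bucket.records.append(h)

    def _split(self, old):
        """Split a bucket on the next most significant bit."""
        old.depth += 1
        new = Bucket(old.depth)
        bit = 1 << (BITS - old.depth)   # mask of the bit now examined
        # Directory entries with that bit set move to the new bucket.
        for i, b in enumerate(self.directory):
            if b is old and i & bit:
                self.directory[i] = new
        # Redistribute the records the same way.
        new.records = [r for r in old.records if r & bit]
        old.records = [r for r in old.records if not r & bit]
```

Inserting the hash values 010, 111, 000 and 011 in that order splits the 0* bucket into a 00* bucket and a 01* bucket at depth 2, while the 1* bucket stays at depth 1.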
Linear hashing
 Does not add an additional level of indirection
 Reduces but does not eliminate overflow buckets
 Uses a family of hash functions
h_i(K) = K mod m
h_{i+1}(K) = K mod 2m
h_{i+2}(K) = K mod 4m
…
How it works (I)
 Start with
m buckets
h_i(K) = K mod m
 When any bucket overflows
Create an overflow bucket
Create a new bucket at location m
Apply hash function h_{i+1}(K) = K mod 2m to the
contents of bucket 0
 Will now be split between buckets 0 and m
How it works (II)
 When a second bucket overflows
Create an overflow bucket
Create a new bucket at location m + 1
Apply hash function h_{i+1}(K) = K mod 2m to the
contents of bucket 1
 Will now be split between buckets 1 and
m + 1
How it works (III)
 Each time a bucket overflows
Create an overflow bucket
Create a new bucket at location m + s + 1, where s is the last
bucket that was split
Apply hash function h_{i+1}(K) = K mod 2m to the
contents of the successor s + 1 of the last
bucket that was split
 Contents of bucket s + 1 will now be split
between buckets s + 1 and m + s + 1
 The size of the hash table grows linearly at each
split until all buckets use the new hash function
Advantages
 The hash table grows linearly
 As we split buckets in linear order, bookkeeping
is very simple:
Need only to keep track of the last bucket s
that was split
 Buckets 0 to s use the new hash function
h_{i+1}(K) = K mod 2m
 Buckets s + 1 to m – 1 still use the old hash
function h_i(K) = K mod m
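The bookkeeping rule translates into a very short address computation. A minimal sketch, assuming integer keys; the function name `bucket_address` and the convention s = -1 for "no bucket split yet" are assumptions made for the example.

```python
def bucket_address(key, m, s):
    """Bucket of key in a linear hash table that started with m buckets.

    s is the last bucket that was split in the current round
    (use s = -1 when no bucket has been split yet).
    """
    b = key % m                 # old hash function h_i
    if b <= s:                  # bucket already split
        b = key % (2 * m)       # new hash function h_{i+1}
    return b
```

With m = 4 and s = 0, key 4 goes to the new bucket 4 and key 8 stays in bucket 0, while key 6 still uses the old function and goes to bucket 2.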
Example (I)
 Assume m = 4 and one record per bucket
 Table contains two records
Hash value = 0
Hash value = 2
Example (II)
 We add one record with hash value = 2
Hash value = 2 Hash value = 2
Overflow bucket
Hash value = 4
New bucket
We assume that the contents of bucket 0 were
migrated to bucket 4
Multi-key indexes
 Not covered this semester

  • 2. Overview  Three main techniques Conventional indexes  Think of a page table, … B and B+ trees  Perform better when records are constantly added or deleted Hashing
  • 4. Indexes  A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Wikipedia
  • 5. Types of indexes  An index can be Sparse  One entry per data block  Identifies the first record of the block  Requires data to be sorted Dense  One entry per record  Data do not have to be sorted
  • 6. Respective advantages  Sparse  Occupy much less space  Can keep more of it in main memory Faster access Dense  Can tell if a given record exists without accessing the file  Do not require data to be sorted
  • 7. Indexes based on primary keys  Each key value corresponds to a specific record  Two cases to consider: Table is sorted on its primary key  Can use a sparse index Table is either non-sorted or sorted on another field  Must use a dense index
  • 8. Sparse Index Ahmed … … Amita … … Brenda … … Carlos … … Dana … … Dino … … Emily … … Frank … … Alan . Dana . Gina .
  • 9. Dense Index Ahmed … … Frank … … Brenda … … Dana … … Emily … … Dino … … Carlos … … Amita … … Ahmed Amita Brenda Carlos Dana Dino Emily Frank
  • 10. Indexes based on other fields  Each key value may correspond to more than one record clustering index  Two cases to consider: Table is sorted on the field  Can use a sparse index Table is either non-sorted or sorted on another field  Must use a dense index
  • 11. Sparse clustering index Ahmed Austin … Frank Austin … Brenda Austin … Dana Dallas … Emily Dallas … Dino Dallas … Carlos Laredo … Amita Laredo … Austin . Dallas . Laredo .
  • 12. Dense clustering index Austin Austin Austin Dallas Dallas Dallas Laredo Laredo Dana Dallas … Dino Dallas … Emily Dallas … Frank Austin … Ahmed Austin … Amita Laredo … Brenda Austin … Carlos Laredo …
  • 13. Another realization Dana Dallas … Dino Dallas … Emily Dallas … Frank Austin … Ahmed Austin … Amita Laredo … Brenda Austin … Carlos Laredo … Austin Dallas . Laredo . We save space and add one extra level of indirection
  • 14. A side comment  "We can solve any problem by introducing an extra level of indirection, except of course for the problem of too many indirections."  David John Wheeler
  • 15. Indexing the index  When index is very large, it makes sense to index the index Two-level or three-level index Index at top level is called master index  Normally a sparse index
  • 17. Updating indexed tables  Can be painful No silver bullet
  • 18. B-trees and B+ trees
  • 19. Motivation  To have dynamic indexing structures that can evolve when records are added and deleted Not the case for static indexes  Would have to be completely rebuilt  Optimized for searches on block devices  Both B trees and B+ trees are not binary Objective is to increase branching factor (degree or fan-out) to reduce the number of device accesses
  • 20. Binary vs. higher-order tree  Binary trees: Designed for in- memory searches Try to minimize the number of memory accesses  Higher-order trees: Designed for searching data on block devices Try to minimize the number of device accesses  Searching within a block is cheap!
  • 21. B trees  Generalization of binary search trees  Not binary trees The B stands for Bayer (or Boeing)  Designed for searching data stored on block- oriented devices
  • 22. A very small B tree Bottom nodes are leaf nodes: all their pointers are NULL
  • 23. In reality In tree ptr Key Data ptr In tree ptr Key Data ptr In tree ptr Key Data ptr In tree ptr Key Data ptr In tree ptr To Leaf 7 To leaf 16 To Leaf -- Null Null -- Null Null
  • 24. Organization  Each non-terminal node can have a variable number of child nodes Must all be in a specific key range Number of child nodes typically vary between d and 2d  Will split nodes that would otherwise have contained 2d + 1 child nodes  Will merge nodes that contain less than d child nodes
  • 25. Searching the tree keys < 7 keys > 16 7 < keys < 16
  • 26. Balancing B trees  Objective is to ensure that all terminals nodes be at the same depth
  • 27. Insertions  Assume a tree where each node can contain three pointers (non represented)  Step 1:  Step 2:  Step 3: Split node in middle 1 1 2 1 2 3 2 1 3
  • 28. Insertions  Step 4:  Step 5: Split Move up 5 3 2 1 4 3 2 1 4 4 2 1 3 5
  • 29. Insertions  Step 6:  Step 7: 4 2 1 3 5 6 4 2 1 3 5 6 7
  • 30. Step 7 continued 4 2 1 3 6 4 7 4 2 1 3 6 5 7 Split Promote
  • 31. Step 7 continued  Split after the promotion 4 2 1 3 6 5 7 4 2 1 3 6 5 7
  • 32. Two basic operations  Split: When trying to add to a full node Split node at central value  Promote: Must insert root of split node higher up May require a new split 7 5 6 6 5 7
  • 33. B+ trees  Variant of B trees  Two types of nodes Internal nodes have no data pointers Leaf nodes have no in-tree pointers  Were all null!
  • 35. More about internal nodes  Consist of n – 1 key values K_1, K_2, …, K_{n-1} and n tree pointers P_1, P_2, …, P_n: <P_1, K_1, P_2, K_2, P_3, …, P_{n-1}, K_{n-1}, P_n>  The keys are ordered: K_1 < K_2 < … < K_{n-1}  For each tree value X in the subtree pointed at by tree pointer P_i, we have: X > K_{i-1} for 1 < i ≤ n X ≤ K_i for 1 ≤ i ≤ n – 1
  • 36. Warning  Other authors assume that for each tree value X in the subtree pointed at by tree pointer P_i, we have:  X ≥ K_{i-1} for 1 < i ≤ n  X < K_i for 1 ≤ i ≤ n – 1  Changes the key value that is promoted when an internal node is split
  • 37. Advantages  Removing unneeded pointers lets us pack more keys in each node Higher fan-out for a given node size  Normally one block  Having all keys present in the leaf nodes allows us to build a linked list of all keys
  • 38. Properties  If m is the order of the tree  Every internal node has at most m children.  Every internal node (except root) has at least ⌈m ⁄ 2⌉ children.  The root has at least two children if it is not a leaf node.  Every leaf has at most m − 1 keys.  An internal node with k children has k − 1 keys.  All leaves appear on the same level.
  • 39. Best cases and worst cases  A B+ tree of degree m and height h will store At most m^(h-1) (m – 1) = m^h – m^(h-1) records At least 2⌈m ⁄ 2⌉^(h-1) records
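As a quick numeric check of these bounds, the following Python sketch evaluates them for a given degree m and height h. The function names are illustrative, and the lower bound is taken as stated on the slide rather than re-derived.

```python
from math import ceil

def max_records(m, h):
    # m^(h-1) leaves, each holding at most m - 1 keys
    return m ** (h - 1) * (m - 1)

def min_records(m, h):
    # lower bound as given on the slide: 2 * ceil(m/2)^(h-1)
    return 2 * ceil(m / 2) ** (h - 1)

print(max_records(100, 3))  # 990000
print(min_records(100, 3))  # 5000
```

For m = 100 and h = 3, the maximum is 100^3 – 100^2 = 990,000 records.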
  • 40. Searches  def search (k) : return tree_search (k, root)
  • 41. Searches def tree_search (k, node) : if node is a leaf : return node elif k < k_0 : return tree_search(k, p_0) … elif k_i ≤ k < k_{i+1} : return tree_search(k, p_{i+1}) … elif k_d ≤ k : return tree_search(k, p_{d+1})
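A runnable version of this search might look like the sketch below. It keeps the pseudocode's convention (a key equal to a separator goes to the right subtree), which `bisect_right` implements directly. The `Node` class, the `contains` helper, and the hand-built tree are assumptions made for illustration.

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children   # None marks a leaf

def tree_search(k, node):
    """Return the leaf that would hold k."""
    if node.children is None:
        return node
    # bisect_right gives 0 if k < k_0, i + 1 if k_i <= k < k_{i+1}
    return tree_search(k, node.children[bisect_right(node.keys, k)])

def contains(k, root):
    return k in tree_search(k, root).keys

# Hand-built two-level tree: separators 3 and 5 over three leaves.
root = Node([3, 5], [Node([1, 2]), Node([3, 4]), Node([5, 6])])
print(tree_search(4, root).keys)  # [3, 4]
print(contains(5, root))          # True
print(contains(0, root))          # False
```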
  • 42. Insertions  def insert (entry) :  Find target leaf L  if L has fewer than m – 1 entries :  add the entry else :  Allocate new leaf L'  Pick the m/2 highest keys of L and move them to L'  Insert highest key of L and the address of L' into the parent node  If the parent is full :  Split it and add the middle key to its parent node  Repeat until a parent is found that is not full
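The insertion procedure can be sketched in Python as below. This is a sketch under stated assumptions: degree 3, the deck's ordering convention (values X ≤ K_i go to the left of K_i), no duplicate keys, and no linked list between leaves. `Node`, `insert`, and `levels` are illustrative names.

```python
from bisect import bisect_left

M = 3  # degree: at most M children and M - 1 keys per node

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children      # None marks a leaf

def _insert(node, key):
    """Return (separator, new_right_sibling) if node was split, else None."""
    i = bisect_left(node.keys, key)   # X <= K_i goes to the left of K_i
    if node.children is None:
        node.keys.insert(i, key)
        if len(node.keys) <= M - 1:
            return None
        keep = len(node.keys) - M // 2    # move the M/2 highest keys to L'
        right = Node(node.keys[keep:])
        node.keys = node.keys[:keep]
        return node.keys[-1], right       # copy up the highest key left in L
    split = _insert(node.children[i], key)
    if split is None:
        return None
    sep, right = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right)
    if len(node.keys) <= M - 1:
        return None
    mid = len(node.keys) // 2             # middle key moves up (not copied)
    up = node.keys[mid]
    new_right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, new_right

def insert(root, key):
    split = _insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])

def levels(root):
    """Keys of every node, level by level."""
    out, frontier = [], [root]
    while frontier:
        out.append([n.keys for n in frontier])
        frontier = [c for n in frontier for c in (n.children or [])]
    return out

root = Node([])
for k in range(1, 8):
    root = insert(root, k)
print(levels(root))
# [[[4]], [[2], [6]], [[1, 2], [3, 4], [5, 6], [7]]]
```

Unlike the B tree case, a leaf split copies the separator up and keeps it in the leaf, so every key still appears at the leaf level.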
  • 43. Deletions  def delete (record) :  Locate target leaf and remove the entry  If leaf is less than half full:  Try to redistribute, taking from a sibling (adjacent node with same parent)  If redistribution fails:  Merge leaf and sibling  Delete the parent's entry for one of the two merged leaves  Merge could propagate to root
  • 44. Insertions  Assume a B+ tree of degree 3  Step 1:  Step 2:  Step 3: Split node in middle 1 1 2 1 2 3 2 1 2 3
  • 45. Insertions  Step 4:  Step 5: Split Move up 5 3 2 1 2 4 3 2 1 2 4 4 2 1 2 3 4 5
  • 46. Insertions  Step 6:  Step 7: 4 2 1 2 3 4 5 6 4 2 1 2 3 4 5 6 7
  • 47. Step 7 continued 4 2 1 2 3 4 6 5 6 7 4 2 1 2 3 4 6 5 6 7 Split Promote
  • 48. Step 7 continued  Split after the promotion 4 2 1 3 6 5 7 4 2 1 3 6 5 7
  • 49. Importance  B+ trees are used by the NTFS, ReiserFS, NSS, XFS, JFS, ReFS, and BFS file systems for metadata indexing BFS also uses them for storing directories  IBM DB2, Informix, Microsoft SQL Server, Oracle 8, Sybase ASE, and SQLite use them for table indexes
  • 50. An interesting variant  Can simplify entry deletion by never merging nodes that have fewer than ⌈m ⁄ 2⌉ entries  Wait instead until they are empty and can be deleted  Requires more space  Seems to be a reasonable tradeoff assuming random insertions and deletions Not on Spring 2015 first quiz
  • 52. Fundamentals  Define m target addresses (the "buckets")  Create a hash function h(k) that is defined for all possible values of the key k and returns an integer value h such that 0 ≤ h ≤ m – 1 Key h(k)
  • 54. Bucket sizes  Each bucket consists of one or more blocks Need some way to convert the hash value into a logical block address  Selecting large buckets means we will have to search the contents of the target bucket to find the desired record If search time is critical and the database infrequently updated, we should consider sorting the records inside each bucket
  • 55. Bucket organization  Two possible solutions Buckets contain records  When bucket is full, records go to an overflow bucket Buckets contain pairs <key, address>  When bucket is full, pairs <key, address> go to an overflow bucket
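The first organization (buckets that hold records and spill into overflow buckets) can be sketched in Python as follows. The class names, the integer keys, and the `key % num_buckets` hash are assumptions made for illustration.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []        # (key, value) pairs stored in this bucket
        self.overflow = None     # next bucket in the overflow chain

    def add(self, record):
        if len(self.records) < self.capacity:
            self.records.append(record)
        else:
            if self.overflow is None:
                self.overflow = Bucket(self.capacity)
            self.overflow.add(record)

    def find(self, key):
        bucket = self
        while bucket is not None:        # walk the overflow chain
            for k, value in bucket.records:
                if k == key:
                    return value
            bucket = bucket.overflow
        return None

class HashFile:
    def __init__(self, num_buckets, capacity=2):
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _home(self, key):
        return self.buckets[key % len(self.buckets)]

    def insert(self, key, value):
        self._home(key).add((key, value))

    def lookup(self, key):
        return self._home(key).find(key)

table = HashFile(num_buckets=4, capacity=2)
for key, value in [(1, "a"), (5, "b"), (9, "c")]:   # all hash to bucket 1
    table.insert(key, value)
print(table.buckets[1].overflow.records)  # [(9, 'c')]
print(table.lookup(9))                    # c
```

Storing pairs <key, address> instead of records would only change what goes into each tuple.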
  • 56. Buckets contain records Assume each bucket contains two records Overflow bucket
  • 57. Buckets contain records KEY A bucket can contain many more keys than records KEY A record Many more records
  • 58. Finding a good hash function  Should distribute records evenly among the buckets A bad hash function will have too many overflowing buckets and too many empty or near-empty buckets
  • 59. A good starting point  If the key is numeric Divide the key by the number of buckets and keep the remainder  If the number of buckets m is a power of two, this means selecting the log2 m least significant bits of the key  Otherwise Transform the key into a numerical value Divide that value by the number of buckets and keep the remainder
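A minimal Python sketch of this starting point follows. Reading the key's bytes as one large integer is just one simple way to turn a non-numeric key into a number, and both function names are illustrative.

```python
def hash_numeric(key: int, m: int) -> int:
    """Remainder of the key divided by the number of buckets."""
    return key % m

def hash_string(key: str, m: int) -> int:
    """Turn the key into a number first (assumption: read its UTF-8
    bytes as one big integer), then take the remainder."""
    value = int.from_bytes(key.encode("utf-8"), "big")
    return value % m

# With m = 8 = 2^3 buckets, the remainder is the 3 low-order bits.
print(hash_numeric(29, 8))   # 5
print(29 & 0b111)            # 5
print(0 <= hash_string("Brenda", 7) < 7)  # True
```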
  • 60. Looking further  Hashing works best when the number of buckets is a prime number  If performance matters, consult Donald Knuth's Art of Computer Programming http://en.wikipedia.org/wiki/Hash_function
  • 61. Selecting the load factor  Fraction of used slots Best range is between 0.5 and 0.8  If load factor < 0.5 Too much space is wasted  If load factor > 0.8 Bucket overflows start becoming a problem  Depending on how evenly the hash function distributes the keys among the buckets
  • 62. Dynamic hashing  Conventional hashing techniques work well when the maximum number of records is known ahead of time  Dynamic hashing lets the hash table grow as the number of records grows  Two techniques: Extendible hashing Linear hashing
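The load factor, and the number of buckets needed to hit a target load factor, can be computed as in this short Python sketch. The function and parameter names are illustrative.

```python
from math import ceil

def load_factor(num_records, num_buckets, slots_per_bucket):
    """Fraction of the available slots that are in use."""
    return num_records / (num_buckets * slots_per_bucket)

def buckets_for(num_records, slots_per_bucket, target):
    """Number of buckets needed to stay at or below the target load factor."""
    return ceil(num_records / (slots_per_bucket * target))

print(load_factor(700, 100, 10))     # 0.7
print(buckets_for(10_000, 10, 0.8))  # 1250
print(buckets_for(1_000, 10, 0.5))   # 200
```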
  • 63. Extendible hashing  Represent hash values as bit strings: 100101, 001001, …  Introduce an additional level of indirection, the directory One entry per key value Multiple entries can point to the same bucket
  • 64. Extendible hashing  We assume a three-bit key 000 001 010 011 100 101 110 111 Directory K = 010 K = 111 Records with key = 0* Records with key = 1* Both buckets are at same depth d d = 1 d = 1
  • 65. Extendible hashing  When a bucket overflows, we split it 000 001 010 011 100 101 110 111 Directory K = 000 K = 111 Records with key = 00* Records with key = 1* K = 011 K = 010 Records with key = 01* d = 2 d = 2 d = 1
  • 66. Explanations (I)  Choice of a bucket is based on the most significant bits (MSBs) of hash value  Start with a single bit Will have two buckets  One for MSB = 0  Other for MSB = 1  Depth of bucket is 1
  • 67. Explanations (II)  Each time a bucket overflows, we split it Assume first bucket overflows  Will add a new bucket containing records with MSBs of hash value = 01  Older bucket will keep records with MSBs of hash value = 00  The depth of these two buckets is 2
  • 68. Explanations (III)  At any given time, the hash table will contain buckets at different depths In our example, buckets 00 and 01 are at depth 2 while bucket 1 is at depth 1  Each bucket will include a record of its depth Just a few bits
  • 69. Discussion  Extendible hashing Allows hash table contents  To grow, by splitting buckets  To shrink by merging buckets but Adds one level of indirection  No problem if the directory can reside in main memory
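The splitting scheme can be sketched in Python as below. This sketch makes several assumptions that are not in the slides: keys are already three-bit hash values, each bucket holds two records, buckets are never merged, and the directory starts with two entries and doubles when needed instead of being pre-sized to eight entries. All class and method names are illustrative.

```python
HASH_BITS = 3     # width of the hash value, as in the three-bit example
BUCKET_SIZE = 2   # records per bucket (assumption)

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of leading bits shared by its keys
        self.items = {}      # key -> value

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]   # MSB = 0 and MSB = 1

    def _index(self, key):
        # Directory entry = the global_depth most significant bits.
        return key >> (HASH_BITS - self.global_depth)

    def lookup(self, key):
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key, value):
        if not 0 <= key < (1 << HASH_BITS):
            raise ValueError("key must be a HASH_BITS-bit hash value")
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < BUCKET_SIZE:
                bucket.items[key] = value
                return
            self._split(bucket)   # full: split, then try again

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Double the directory: every entry is duplicated.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        # Entries pointing at the bucket are contiguous; the upper half
        # (next bit = 1) now points at the new sibling.
        run = [i for i, b in enumerate(self.directory) if b is bucket]
        for i in run[len(run) // 2:]:
            self.directory[i] = sibling
        for key in list(bucket.items):
            if self.directory[self._index(key)] is sibling:
                sibling.items[key] = bucket.items.pop(key)

table = ExtendibleHash()
for key in (0b010, 0b111, 0b000, 0b011):
    table.insert(key, f"record {key:03b}")
print([sorted(b.items) for b in table.directory])  # [[0], [2, 3], [7], [7]]
print([b.depth for b in table.directory])          # [2, 2, 1, 1]
```

The result matches the example: buckets 00 and 01 are at depth 2, while the bucket for keys 1* stays at depth 1 and is shared by two directory entries.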
  • 70. Linear hashing  Does not add an additional level of indirection  Reduces but does not eliminate overflow buckets  Uses a family of hash functions h_i(K) = K mod m h_{i+1}(K) = K mod 2m h_{i+2}(K) = K mod 4m …
  • 71. How it works (I)  Start with m buckets h_i(K) = K mod m  When any bucket overflows Create an overflow bucket Create a new bucket at location m Apply hash function h_{i+1}(K) = K mod 2m to the contents of bucket 0  Will now be split between buckets 0 and m
  • 72. How it works (II)  When a second bucket overflows Create an overflow bucket Create a new bucket at location m + 1 Apply hash function h_{i+1}(K) = K mod 2m to the contents of bucket 1  Will now be split between buckets 1 and m + 1
  • 73. How it works (III)  Each time a bucket overflows Create an overflow bucket Apply hash function h_{i+1}(K) = K mod 2m to the contents of the successor s + 1 of the last bucket s that was split  Contents of bucket s + 1 will now be split between buckets s + 1 and m + s + 1  The size of the hash table grows linearly at each split until all buckets use the new hash function
  • 74. Advantages  The hash table grows linearly  As we split buckets in linear order, bookkeeping is very simple: Need only to keep track of the last bucket s that was split  Buckets 0 to s use the new hash function h_{i+1}(K) = K mod 2m  Buckets s + 1 to m – 1 still use the old hash function h_i(K) = K mod m
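The bookkeeping can be sketched in Python as follows. This is a minimal sketch with assumed names: it tracks the next bucket to split (`next_split`) rather than the last one split, stores integer keys only, and represents an overflow bucket simply as extra entries in the bucket's list.

```python
class LinearHash:
    def __init__(self, m=4, capacity=1):
        self.m = m                 # buckets at the start of the current round
        self.capacity = capacity   # records a bucket holds before overflowing
        self.next_split = 0        # next bucket to split
        self.buckets = [[] for _ in range(m)]

    def _address(self, key):
        b = key % self.m                 # old hash function h_i
        if b < self.next_split:          # bucket already split this round
            b = key % (2 * self.m)       # new hash function h_{i+1}
        return b

    def contains(self, key):
        return key in self.buckets[self._address(key)]

    def insert(self, key):
        bucket = self.buckets[self._address(key)]
        bucket.append(key)   # entries past capacity stand in for an overflow bucket
        if len(bucket) > self.capacity:
            self._split()

    def _split(self):
        s = self.next_split
        old, self.buckets[s] = self.buckets[s], []
        self.buckets.append([])          # new bucket at location m + s
        for key in old:                  # rehash bucket s with h_{i+1}
            self.buckets[key % (2 * self.m)].append(key)
        self.next_split += 1
        if self.next_split == self.m:    # all buckets of this round are split
            self.m *= 2
            self.next_split = 0

table = LinearHash(m=4, capacity=1)
for key in (4, 2, 6):     # 4 goes to bucket 0; 2 and 6 both go to bucket 2
    table.insert(key)
print(table.buckets)      # [[], [], [2, 6], [], [4]]
```

Note that the bucket that overflows (bucket 2) is not the one that gets split (bucket 0): splits always proceed in linear order.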
  • 75. Example (I)  Assume m = 4 and one record per bucket  Table contains two records Hash value = 0 Hash value = 2
  • 76. Example (II)  We add one record with hash value = 2 Hash value = 2 Hash value = 2 Overflow bucket Hash value = 4 New bucket We assume that the contents of bucket 0 were migrated to bucket 4
  • 77. Multi-key indexes  Not covered this semester