In the name of Allah, the Most Gracious, the Most Merciful




A Novel Web Search Engine Model Based on Index-
            Query Bit-Level Compression

                         Prepared By

                    Saif Mahmood Saab

                        Supervised By

                  Dr. Hussein Al-Bahadili

                         Dissertation

               Submitted In Partial Fulfillment

 of the Requirements for the Degree of Doctor of Philosophy

              in Computer Information Systems

        Faculty of Information Systems and Technology

         University of Banking and Financial Sciences

                       Amman - Jordan


                         (May - 2011)
i
Authorization

I, the undersigned, Saif Mahmood Saab, authorize the Arab
Academy for Banking and Financial Sciences to provide copies of
this Dissertation to libraries, institutions, agencies, and any
parties upon their request.




Name: Saif Mahmood Saab


Signature:


Date: 30/05/2011




                               ii
Dedications



                                  To the pure soul of my father ...

                          To my beloved mother ...

                 To my dear wife ...

           To my dear children ...

I dedicate this humble work.




                          iii
Acknowledgments



First and foremost, I thank Allah (Subhana Wa Taala) for endowing me
with health, patience, and knowledge to complete this work.


I am thankful to everyone who supported me during my study. I would like
to thank my esteemed supervisor, Dr. Hussein Al-Bahadili, who accepted
me as his Ph.D. student without any hesitation, offered me so much
advice, patiently supervised me, and always guided me in the right
direction.


Last but not least, I would like to thank my parents for their support over
the years, my wife for her understanding and continued encouragement,
and my friends, especially Mahmoud Alsiksek and Ali AlKhaledi.



Words are not enough to express my gratitude to all those people who
helped me; I would still like to give my many, many thanks to all of
them.




                                    iv
List of Figures

Figure                                 Title                                Page
 1.1     Architecture and main components of standard search engine          10
         model.
 3.1     Architecture and main components of the CIQ Web search              41
         engine model.
 3.2     Lists of IDs for each type of character sets assuming m=6.          48
3.3-a    Locations of data and parity bits in 7-bit codeword                 54
3.3-b    An uncompressed binary sequence of 21-bit length divided            54
         into 3 blocks of 7-bit length, where b1 and b3 are valid blocks,
         and b2 is a non-valid block
3.3-c    The compressed binary sequence (18-bit length).                     54
 3.4     The main steps of the HCDC compressor                               55
 3.5     The main steps of the HCDC decompressor.                            56
 3.6     Variation of Cmin and Cmax with p.                                  58
 3.7     Variation of r1 with p.                                             59
 3.8     Variations of C with respect to r for various values of p.          60
 3.9     The compressed file header of the HCDC scheme.                      65
 4.1     The compression ratio (C) for different sizes index files           75
 4.2     The reduction factor (Rs) for different sizes index files.          76
 4.3     Variation of C and average Sf for different sizes index files.      89
 4.4     Variation of Rs and Rt for different sizes index files.             89
 4.5     The CIQ performance triangle.                                       90
 5.1     The CIQ performance triangle.                                       92




                                        v
List of Tables
Table                                 Title                               Page
 1.1    Document ID and its contents.                                      8

 1.2    A record and word level inverted indexes for documents in          8
        Table (1.1).
 3.1    List of most popular stopwords (117 stop-words).                   47
 3.2    Type of character sets and equivalent maximum number of            47
        IDs
 3.4    Variation of Cmin, Cmax, and r1 with number of parity bits (p).    58
 3.6    Variations of C with respect to r for various values of p.         59
 3.7    Valid 7-bit codewords.                                             61
 3.8    The HCDC algorithm compressed file header.                         64
 4.1    List of visited Websites                                           71
 4.2    The sizes of the generated indexes.                                72
 4.3    Type and number of characters in each generated inverted           73
        index file.
 4.4    Type and frequency of characters in each generated inverted        74
        index file.
 4.5    Values of C and Rs for different sizes index files.                75
 4.6    Performance analysis and implementation validation.                77
 4.7    List of keywords.                                                  78
 4.8    Values of No, Nc, To, Tc, Sf and Rt for 1000 index file            79
 4.9    Values of No, Nc, To, Tc, Sf and Rt for 10000 index file           80
4.10    Values of No, Nc, To, Tc, Sf and Rt for 25000 index file           81
4.11    Values of No, Nc, To, Tc, Sf and Rt for 50000 index file           82
4.12    Values of No, Nc, To, Tc, Sf and Rt for 75000 index file           83
4.13    Variation of Sf for different index sizes and keywords.            85
4.14    Variation of No and Nc for different index sizes and keywords.     86
4.15    Variation of To and Tc for different index sizes and keywords.     87
4.16    Values of C, Rs, average Sf, and average Rt for different sizes    88
        index files.




                                      vi
Abbreviations

ACW      Adaptive Character Wordlength
API      Application Programming Interface
ASCII    American Standard Code for Information Interchange
ASF      Apache Software Foundation
BWT      Burrows-Wheeler block sorting transform
CIQ      compressed index-query
CPU      Central Processing Unit
DBA      Database Administrator
FLH      Fixed-Length Hamming
GFS      Google File System
GZIP     GNU zip
HCDC     Hamming Code Data Compression
HTML     Hypertext Mark-up Language
ID3      A metadata container used in conjunction with the MP3 audio file
format
JSON     JavaScript Object Notation
LAN      Local Area Networks
LANMAN   Microsoft LAN Manager
LDPC     Low-Density Parity Check
LZW      Lempel-Ziv-Welch
MP3      A patented digital audio encoding format
NTLM     Windows NT LAN Manager
PDF      Portable Document Format
RLE      Run Length Encoding
RSS      Really Simple Syndication
RTF      Rich Text Format
SAN      Storage Area Networks
SASE     Shrink And Search Engine
SP4      Windows Service Pack 4
UNIX     UNiplexed Information and Computing Service
URL      Uniform Resource Locator
XML      Extensible Markup Language
ZIP      A data compression and archive file format (the name "zip" implies speed)




                                   vii
Table of Contents
Authorization                                                                - ii -
Dedications                                                                  - iii -
Acknowledgments                                                              - iv -
List of Figures                                                              -v-
List of Tables                                                               - vi -
Abbreviations                                                                - vii -
Table of Contents                                                            - viii -
Abstract                                                                     -x-

Chapter One                                                                  -1-
Introduction                                                                 -1-
       1.1   Web Search Engine Model                                         -3-
             1.1.1 Web crawler                                               -3-
             1.1.2 Document analyzer and indexer                             -4-
             1.1.3 Searching process                                         -9-
       1.2   Challenges to Web Search Engines                                - 10 -
       1.3   Data Compression Techniques                                     - 12 -
             1.3.1 Definition of data compression                            - 12 -
             1.3.2 Data compression models                                   - 12 -
             1.3.3 Classification of data compression algorithms             - 14 -
             1.3.4 Performance evaluation parameters                         - 17 -
       1.4   Current Trends in Building High-Performance Web Search Engine   - 20 -
       1.5   Statement of the Problem                                        - 20 -
       1.6   Objectives of this Thesis                                       - 21 -
       1.7   Organization of this Thesis                                     - 21 -

Chapter Two                                                                  - 23 -
Literature Review                                                            - 23 -
        2.1    Trends Towards High-Performance Web Search Engine             - 23 -
               2.1.1 Succinct data structure                                 - 23 -
               2.1.2 Compressed full-text self-index                         - 24 -
               2.1.3 Query optimization                                      - 24 -
               2.1.4 Efficient architectural design                          - 25 -
               2.1.5 Scalability                                             - 25 -
               2.1.6 Semantic search engine                                  - 26 -
               2.1.7 Using Social Networks                                   - 26 -
               2.1.8 Caching                                                 - 27 -
        2.2    Recent Research on Web Search Engine                          - 27 -
        2.3    Recent Research on Bit-Level Data Compression Algorithms      - 33 -




                                            viii
Chapter Three                                                                                     - 39 -
The Novel CIQ Web Search Engine Model                                                             - 39 -
      3.1    The CIQ Web Search Engine Model                                                      - 40 -
      3.2    Implementation of the CIQ Model: CIQ-based Test Tool (CIQTT)                         - 42 -
             3.2.1 COLCOR: Collects the testing corpus (documents)                                - 42 -
             3.2.2 PROCOR: Processing and analyzing testing corpus (documents)                    - 46 -
             3.2.3 INVINX: Building the inverted index and start indexing.                        - 46 -
             3.2.4 COMINX: Compressing the inverted index                                         - 50 -
             3.2.5 SRHINX: Searching index (inverted or inverted/compressed index)                - 51 -
             3.2.6 COMRES: Comparing the outcomes of different search processes
                  performed by SRHINX procedure.                                                  - 52 -
      3.3    The Bit-Level Data Compression Algorithm                                             - 52 -
             3.3.1 The HCDC algorithm                                                             - 52 -
             3.3.2 Derivation and analysis of HCDC algorithm compression ratio                    - 56 -
             3.3.3 The Compressed File Header                                                     - 63 -
      3.4    Implementation of the HCDC algorithm in CIQTT                                        - 65 -
      3.5    Performance Measures                                                                 - 66 -

Chapter Four                                                                                      - 68 -
Results and Discussions                                                                           - 68 -
     4.1      Test Procedures                                                                     - 69 -
     4.2      Determination of the Compression Ratio (C) & the Storage Reduction Factor (Rs) - 70 -
              4.2.1 Step 1: Collect the testing corpus using COLCOR procedure                     - 70 -
              4.2.2 Step 2: Process and analyze the corpus to build the inverted index file using
                    PROCOR and INVINX procedures                                                  - 72 -
              4.2.3 Step 3: Compress the inverted index file using the INXCOM procedure - 72 -
     4.3      Determination of the Speedup Factor (Sf) and the Time Reduction Factor (Rt)         - 77 -
              4.3.1 Choose a list of keywords                                                     - 77 -
              4.3.2 Perform the search processes                                                  - 78 -
              4.3.3 Determine Sf and Rt.                                                          - 84 -
       4.4    Validation of the Accuracy of the CIQ Web Search Model                              - 88 -
       4.5    Summary of Results                                                                  - 88 -

Chapter Five                                                                                      - 91 -
Conclusions and Recommendations for Future Work                                                   - 91 -
      5.1     Conclusions                                                                         - 91 -
      5.2     Recommendations for Future Work                                                     - 93 -

References                                                                                        - 94 -
Appendix I                                                                                        - 105 -
Appendix II                                                                                       - 108 -
Appendix III                                                                                      - 112 -
Appendix IV                                                                                       - 115 -




                                                ix
Abstract
A Web search engine is an information retrieval system designed to help find
information stored on the Web. A standard Web search engine consists of three main
components: a Web crawler, a document analyzer and indexer, and a search processor.
Due to the rapid growth in the size of the Web, Web search engines are facing enormous
performance challenges in terms of storage capacity, data retrieval rate, query
processing time, and communication overhead. Large search engines, in particular,
have to be able to process tens of thousands of queries per second on tens of billions
of documents, making query throughput a critical issue. To satisfy this heavy
workload, search engines use a variety of performance optimizations, including
succinct data structures, compressed text indexing, query optimization, high-speed
processing and communication systems, and efficient search engine architectural
design. However, the performance of current Web search engine models still falls
short of meeting user and application needs.

In this work we develop a novel Web search engine model based on index-query
compression; therefore, it is referred to as the compressed index-query (CIQ) model.
The model incorporates two compression layers, both implemented at the back-end
processor (server) side: one layer resides after the indexer, acting as a second
compression layer to generate a double-compressed index, and the second layer is
located after the query parser, compressing the query to enable compressed index-
query search. The data compression algorithm used is the novel Hamming code data
compression (HCDC) algorithm.

The different components of the CIQ model are implemented in a number of
procedures forming what is referred to as the CIQ test tool (CIQTT), which is used as
a test bench to validate the accuracy and integrity of the retrieved data and to evaluate
the performance of the CIQ model. The results obtained demonstrate that the new
CIQ model attains excellent performance compared to the current uncompressed
model: the CIQ model achieves 100% agreement with the results of the current
uncompressed model.

The new model demands less disk space, as the HCDC algorithm achieves a
compression ratio of over 1.3 with a compression efficiency of more than 95%, which
implies a reduction in storage requirements of over 24%. The new CIQ model also
performs faster than the current model, as it achieves a speedup factor of over 1.3,
providing a reduction in processing time of over 24%.




                                            x
Chapter One
                                    Introduction
A search engine is an information retrieval system designed to help find files stored on a
computer, for example, on a public server on the World Wide Web (or simply the Web),
on a server in a private network of computers, or on a stand-alone computer [Bri 98]. The
search engine allows us to search the storage media for certain content in the form of text
meeting specific criteria (typically files containing a given word or phrase) and to
retrieve a list of files that match those criteria. In this work, we are concerned with the
type of search engine that is designed to help find files stored on the Web (the Web
search engine).

Webmasters and content providers began optimizing sites for Web search engines in the
mid-1990s, as the first search engines were cataloging the early Web. Initially, all a
webmaster needed to do was to submit the address of a page, or the uniform resource
locator (URL), to various engines which would send a spider to crawl that page, extract
links to other pages from it, and return information found on the page to be indexed [Bri
98]. The process involves a search engine crawler downloading a page and storing it on
the search engine's own server, where a second program, known as an indexer, extracts
various information about the page, such as the words it contains and where they are
located, as well as any weights for specific words and all the links the page contains,
which are then placed into a scheduler for crawling at a later date [Web 4].

A standard search engine consists of the following main components: a Web crawler, a
document analyzer and indexer, and a searching process [Bah 10d]. The main purpose of
using a particular data structure for searching is to construct an index that allows the
search for a given keyword (query) to be focused. The improvement in query performance
is paid for by the additional space necessary to store the index. Therefore, most of the
research in this field has been directed to designing data structures which offer a good
trade-off between query and update time versus space usage.

For this reason, compression always appears as an attractive, if not mandatory, choice.
However, space overhead is not the only resource to be optimized when managing large
data collections; in fact, data turn out to be useful only when properly indexed to support
search operations that efficiently extract the user-requested information. Approaches that
combine compression and indexing techniques are nowadays receiving more and more
attention. A first step towards the design of a compressed full-text index is achieving
guaranteed performance and lossless data [Fer 01].

The significant increase in CPU speed makes it more economical to store data in
compressed form than uncompressed. Storing data in a compressed form may introduce
significant improvements in space occupancy and also in processing time, because space
optimization is closely related to time optimization in disk memory [Fer 01].

A number of trends have been identified in the literature for building high-performance
search engines, such as: succinct data structures, compressed full-text self-indexes, query
optimization, and high-speed processing and communication systems.

Starting from these promising trends, many researchers have tried to combine text
compression with indexing techniques and searching algorithms. They have mainly
investigated and analyzed the compressed matching problem under various compression
schemes [Fer 01].

Due to the rapid growth in the size of the Web, Web search engines are facing enormous
performance challenges, in terms of: (i) storage capacity, (ii) data retrieval rate, (iii) query
processing time, and (iv) communication overhead. The large engines, in particular, have
to be able to process tens of thousands of queries per second on tens of billions of
documents, making query throughput a critical issue. To satisfy this heavy workload,
search engines use a variety of performance optimizations including index compression.

With the tremendous increase in user and application needs, we believe that current
search engine models need higher retrieval performance, and that more compact and
cost-effective systems are still required.

In this work we develop a novel Web search engine model that is based on index-query
bit-level compression. The model incorporates two bit-level data compression layers,
both implemented at the back-end processor side: one after the indexer, acting as a second
compression layer to generate a double-compressed index, and the other after the
query parser, compressing the query to enable bit-level compressed index-query search.
As a result, less disk space is required to store the compressed index file, disk I/O
overheads are reduced, and consequently a higher retrieval rate or performance is
achieved.

An important requirement for a bit-level technique used to perform the search process at
the compressed index-query level is that it generates the same compressed binary
sequence for a given character in both the search queries and the index files. The data
compression technique that satisfies this important requirement is the HCDC algorithm
[Bah 07b, Bah 08a]; therefore, it is used in this work. Recent investigations on using this
algorithm for text compression have demonstrated excellent performance in comparison
with many widely-used and well-known data compression algorithms and
state-of-the-art tools [Bah 07b, Bah 08a].


1.1     Web Search Engine Model
A Web search engine is an information retrieval system designed to help find files stored
on a public server on the Web [Bri 98, Mel 00]. A standard Web search engine consists of
the following main components:

   •    Web crawler

   •    Document analyzer and indexer

   •    Searching process

In what follows we provide a brief description for each of the above components.

1.1.1   Web crawler

A Web crawler is a computer program that browses the Web in a methodical, automated
manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web
spiders, and Web robots. Unfortunately, each spider has its own agenda as it indexes a
site: some search engines use the META tag, others may use the META description
of a page, and some use the first sentence or paragraph of a site. This means that a page
that ranks highly on one Web search engine may not rank as well on another. Given a set
of uniform resource locators (URLs), the crawler repeatedly removes one URL from the
set, downloads the targeted page, extracts all the URLs contained in it, and adds all
previously unknown URLs to the set [Bri 98, Jun 00].
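
To illustrate the crawling loop just described, the following minimal Python sketch maintains a frontier of URLs, downloads each page, and adds previously unknown links back to the frontier. The fetching and link-extraction details (urllib and a simple href regular expression) are illustrative assumptions, not the crawler used in this work.

    import re
    import urllib.request

    def crawl(seed_urls, max_pages=50):
        """Minimal crawling loop: repeatedly remove a URL from the frontier,
        download the targeted page, extract its links, and add unseen URLs."""
        frontier = list(seed_urls)      # URLs waiting to be visited
        seen = set(seed_urls)           # URLs already discovered
        pages = {}                      # url -> raw HTML, handed to the indexer
        while frontier and len(pages) < max_pages:
            url = frontier.pop(0)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue                # skip unreachable pages
            pages[url] = html
            for link in re.findall(r'href="(http[^"]+)"', html):
                if link not in seen:    # add previously unknown URLs to the set
                    seen.add(link)
                    frontier.append(link)
        return pages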

Web search engines work by storing information about many Web pages, which they
retrieve from the Web itself. These pages are retrieved by a spider, a sophisticated Web
browser which follows every link extracted from or stored in its database. The contents of
each page are then analyzed to determine how it should be indexed; for example, words
are extracted from the titles, headings, or special fields called meta tags.

1.1.2   Document analyzer and indexer

Indexing is the process of creating an index, a specialized file containing a compiled
version of the documents retrieved by the spider [Bah 10d]. The indexing process
collects, parses, and stores data to facilitate fast and accurate information retrieval. Index
design incorporates interdisciplinary concepts from linguistics, mathematics, informatics,
physics, and computer science [Web 5].

The purpose of storing an index is to optimize speed and performance in finding relevant
documents for a search query. Without an index, the search engine would have to scan
every (possible) document on the Internet, which would require considerable time and
computing power (impossible with the current Internet size). For example, while an index
of 10,000 documents can be queried within milliseconds, a sequential scan of every word
in the documents could take hours. The additional computer storage required to store the
index, as well as the considerable increase in the time required for an update to take
place, are traded off for the time saved during information retrieval [Web 5].




                                            4
Index design factors

Several major factors should be carefully considered when designing a search engine;
these include [Bri 98, Web 5]:

   •   Merge factors: How data enters the index, or how words or subject features are
       added to the index during text corpus traversal, and whether multiple indexers can
       work asynchronously. The indexer must first check whether it is updating old
       content or adding new content. Traversal typically correlates to the data collection
       policy. Search engine index merging is similar in concept to the SQL Merge
       command and other merge algorithms.

   •   Storage techniques: How to store the index data, that is, whether information
       should be data compressed or filtered.

   •   Index size: How much computer storage is required to support the index.

   •   Lookup speed: How quickly a word can be found in the index. The speed of
       finding an entry in a data structure, compared with how quickly it can be updated
       or removed, is a central focus of computer science.

   •   Maintenance: How the index is maintained over time.

   •   Fault tolerance: How important it is for the service to be robust. Issues include
       dealing with index corruption, determining whether bad data can be treated in
       isolation, dealing with bad hardware, partitioning, and schemes such as hash-
       based or composite partitioning, as well as replication.

Index data structures

Search engine architectures vary in the way indexing is performed and in the methods of
index storage used to meet the various design factors. There are many index
architectures, and the most widely used is the inverted index. An inverted index saves a
list of occurrences of every keyword, typically in the form of a hash table or binary tree
[Bah 10c].



                                           5
During indexing, several processes take place; here, the processes related to our work are
discussed. Whether these processes are used depends on the search engine configuration
[Bah 10d].

   •   Extract URLs. A process of extracting all URLs from the document being
       indexed; it is used to guide crawling of the website, perform link checking, build
       a site map, and build a table of internal and external links for the page.

   •   Code stripping. A process of removing Hypertext Markup Language (HTML)
       tags, scripts, and styles, and decoding HTML character references and entities
       used to embed special characters.

   •   Language recognition. A process by which a computer program attempts to
       automatically identify, or categorize, the language or languages in which a
       document is written.

   •   Document tokenization. A process of detecting the encoding used for the page;
       determining the language of the content (some pages use multiple languages);
       finding word, sentence, and paragraph boundaries; combining multiple adjacent
       words into one phrase; and changing the case of the text.

   •   Document parsing or syntactic analysis. The process of analyzing a sequence of
       tokens (for example, words) to determine their grammatical structure with respect
       to a given (more or less) formal grammar.

   •   Lemmatization/stemming. The process of reducing inflected (or sometimes
       derived) words to their stem, base, or root form, generally a written word form;
       this stage can be performed during indexing and/or searching. The stem does not
       need to be identical to the morphological root of the word; it is usually sufficient
       that related words map to the same stem, even if this stem is not in itself a valid
       root. The process is useful in search engines for query expansion or indexing and
       for other natural language processing problems.




                                           6
   •   Normalization. The process by which text is transformed to make it consistent
       in a way it might not have been before. Text normalization is often performed
       before text is processed in some way, such as generating synthesized speech,
       automated language translation, storage in a database, or comparison.
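
The following Python sketch strings together simplified versions of several of the steps above (code stripping, tokenization and case folding, stop-word removal, and a crude suffix-based stemmer). It is only an illustration of how such a pipeline might look; the stop-word list and stemming rule are assumptions and do not reproduce the exact procedures used later in the CIQTT tool.

    import re

    STOPWORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}   # tiny illustrative list

    def strip_code(html):
        # Code stripping: remove scripts, styles, and HTML tags.
        html = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
        return re.sub(r"<[^>]+>", " ", html)

    def tokenize(text):
        # Tokenization and case folding.
        return re.findall(r"[a-z0-9]+", text.lower())

    def stem(word):
        # Very crude stemming: strip a few common English suffixes.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def analyze(html):
        # Full analysis pipeline producing normalized index terms.
        tokens = tokenize(strip_code(html))
        return [stem(t) for t in tokens if t not in STOPWORDS]

    # analyze("<p>Aqaba is a hot city</p>")  ->  ['aqaba', 'hot', 'city']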

Inverted Index

The inverted index structure is widely used in modern, super-fast Web search engines
like Google, Yahoo, Lucene, and other major search engines. An inverted index (also
referred to as a postings file or inverted file) is an index data structure storing a mapping
from content, such as words or numbers, to its locations in a database file, or in a
document or a set of documents. The main purpose of using the inverted index is to allow
fast full-text searches, at a cost of increased processing when a document is added to the
index [Bri 98, Nag 02, Web 4]. The inverted index is one of the most widely used data
structures in information retrieval systems [Web 4, Bri 98].

There are two main variants of inverted indexes [Bae 99]:

  (1) A record level inverted index (or inverted file index or just inverted file) contains a
       list of references to documents for each word; we use this simple type in our
       search engine.

  (2) A word level inverted index (or full inverted index or inverted list) additionally
       contains the positions of each word within a document; these positions can be used
       to rank the results according to document relevancy to the query.

The latter form offers more functionality (like phrase searches), but needs more time and
space to be created. In order to simplify the understanding of the above two inverted
indexes let us consider the following example.




                                           7
Example

Let us consider a case in which six documents have the text shown in Table (1.1). The
contents of a record and word level indexes are shown in Table (1.2).

                                        Table (1.1)
                                 Document ID and its contents.
                        Document ID                    Text
                                 1       Aqaba is a hot city
                                 2       Amman is a cold city
                                 3       Aqaba is a port
                                 4       Amman is a modern city
                                 5       Aqaba in the south
                                 6       Amman in Jordan

                                         Table (1.2)
            A record and word level inverted indexes for documents in Table (1.1).
         Record level inverted index                       Word level inverted index
         Text              Documents                   Text            Documents: Location
 Aqaba              1, 3, 5                   Aqaba                  1:1 , 3:1 , 5:1
 is                 1, 2, 3, 4                is                     1:2 , 2:2 , 3:2 , 4:2
 a                  1, 2, 3, 4                a                      1:3 , 2:3 , 3:3 , 4:3
 hot                1                         hot                    1:4
 city               1, 2, 4                   city                   1:5 , 2:5 , 4:5
 Amman              2, 4, 6                   Amman                  2:1 , 4:1 , 6:1
 cold               2                         cold                   2:4
 the                5                         the                    5:3
 modern             4                         modern                 4:4
 south              5                         south                  5:4
 in                 5, 6                      in                     5:2 , 6:2
 Jordan             6                         Jordan                 6:3




                                             8
When we search for the word “Amman”, we get three results, which are documents 2, 4, and 6
if a record level inverted index is used, and 2:1, 4:1, 6:1 if a word level inverted index is
used. In this work, the record level inverted index is used for its simplicity and because
we do not need to rank our results.
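
A minimal Python sketch of the two index variants discussed above, built over the documents of Table (1.1): the record-level index stores only document IDs, while the word-level index also stores the position of each occurrence. This is an illustration of the data structures only, not the INVINX procedure itself (tokenization here is a plain lower-cased split, with no stop-word removal).

    from collections import defaultdict

    docs = {
        1: "Aqaba is a hot city",
        2: "Amman is a cold city",
        3: "Aqaba is a port",
        4: "Amman is a modern city",
        5: "Aqaba in the south",
        6: "Amman in Jordan",
    }

    record_index = defaultdict(set)    # word -> {doc_id, ...}
    word_index = defaultdict(list)     # word -> [(doc_id, position), ...]

    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            record_index[word].add(doc_id)
            word_index[word].append((doc_id, pos))

    print(sorted(record_index["amman"]))   # [2, 4, 6]                 (record level)
    print(word_index["amman"])             # [(2, 1), (4, 1), (6, 1)]  (word level)

    # A multi-keyword (AND) query is a posting-list intersection:
    print(sorted(record_index["aqaba"] & record_index["city"]))   # [1]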




1.1.3   Searching process

When the index is ready, searching can be performed through the query interface: a user
enters a query into the search engine (typically by using keywords), and the engine
examines its index and provides a listing of the best-matching Web pages according to its
criteria, usually with a short summary containing the document's title and sometimes
parts of the text [Bah 10d].

In this stage the results are ranked, where ranking is a relationship between a set of items
such that, for any two items, the first is either “ranked higher than”, “ranked lower than”,
or “ranked equal to” the second. In mathematics, this is known as a weak order or total
pre-order of objects. It is not necessarily a total order of documents, because two different
documents can have the same ranking. Ranking is done according to document relevancy
to the query, freshness, and popularity [Bri 98]. Figure (1.1) outlines the architecture and
main components of the standard search engine model.




                                            9
Figure (1.1). Architecture and main components of standard search engine model.

1.2      Challenges to Web Search Engines
Building and operating a large-scale Web search engine used by hundreds of millions of
people around the world poses a number of interesting challenges [Hen 03, Hui 09,
Ois 10, Ami 05]. Designing such systems requires making complex design trade-offs in a
number of dimensions, and the main challenges to designing an efficient, effective, and
reliable Web search engine are:

         •   The Web is growing much faster than any present-technology search engine
             can possibly index.

         •   The cost of index storage, which includes data storage cost, electricity, and
             cooling of the data center.

         •   The real-time Web, which is updated in real time, requires a fast and reliable
             crawler, and this content must then be indexed to make it searchable.




                                             10
•   Many Web pages are updated frequently, which forces the search engine to re-
    visit them periodically.

•   Query time (latency): the need to keep up with the increase in index size and
     to perform the query and show the results in less time.

•   Most search engines use keywords for searching, and this limits the results to
     text pages only.

•   Dynamically generated sites, which may be slow or difficult to index, or may
    result in excessive results from a single site.

•   Many dynamically generated sites are not indexable by search engines; this
    phenomenon is known as the invisible Web.

•   Several content types, such as multimedia and Flash content, are not crawlable
     and indexable by search engines.

•   Some sites use tricks to manipulate the search engine into displaying them as
     the first result returned for some keywords; this is known as spamming. It
     can lead to some search results being polluted, with more relevant links being
     pushed down in the result list.

•   Duplicate hosts: Web search engines try to avoid having duplicate and near-
     duplicate pages in their collection, since such pages increase the time it takes
     to add useful content to the collection.

•   Web graph modeling: the open problem is to come up with a random graph
     model that captures the behavior of the Web graph at the page and host levels.

•   Scalability: search engine technology should scale dramatically to keep up
     with the growth of the Web.

•   Reliability: a search engine requires reliable technology to support its 24-hour
     operation and meet user needs.



                                      11
1.3      Data Compression Techniques
This section presents definition, models, classification methodologies and classes, and
performance evaluation measures of data compression algorithms. Further details on data
compression can be found in [Say 00].

1.3.1    Definition of data compression

Data compression algorithms are designed to reduce the size of data so that it requires
less disk space for storage and less memory [Say 00]. Data compression is usually
obtained by substituting a shorter symbol for an original symbol in the source data,
containing the same information but with a smaller representation in length. The symbols
may be characters, words, phrases, or any other unit that may be stored in a dictionary of
symbols and processed by a computing system.


A data compression algorithm usually utilizes an efficient algorithmic transformation of
the data representation to produce a more compact representation. Such an algorithm is
also known as an encoding algorithm. It is important to be able to restore the original data,
either in an exact or an approximate form; therefore, a data decompression algorithm,
also known as a decoding algorithm, is also required.

1.3.2    Data compression models

There are a number of data compression algorithms that have been developed throughout
the years. These algorithms can be categorized into four major categories of data
compression models [Rab 08, Hay 08, Say 00]:

      1. Substitution data compression model

      2. Statistical data compression model

      3. Dictionary based data compression model

      4. Bit-level data compression model




                                              12
In substitution compression techniques, a shorter representation is used to replace a
sequence of repeating characters. Examples of substitution data compression techniques
include: null suppression [Pan 00], run-length encoding [Smi 97], bit mapping, and half-
byte packing [Pan 00].

In statistical techniques, the characters in the source file are converted to a binary code,
where the most common characters in the file have the shortest binary codes and the
least common have the longest; the binary codes are generated based on the estimated
probability of each character within the file. Then, the binary-coded file is compressed
using an 8-bit character wordlength, or by applying the adaptive character wordlength
(ACW) algorithm [Bah 08b, Bah 10a], or its variation, the ACW(n,s) scheme [Bah 10a].
Examples of statistical data compression techniques include: Shannon-Fano coding [Rue
06], static/adaptive/semi-adaptive Huffman coding [Huf 52, Knu 85, Vit 89], and
arithmetic coding [How 94, Wit 87].

Dictionary-based data compression techniques involve the substitution of sub-strings of
text by indices or pointer codes, relative to a dictionary of the sub-strings, such as Lempel-
Ziv-Welch (LZW) [Ziv 78, Ziv 77, Nel 89]. Many compression algorithms use a
combination of different data compression techniques to improve compression ratios.

Finally, since data files can be represented as binary digits, bit-level processing can be
performed to reduce the size of the data. A data file can be represented in binary digits by
concatenating the binary sequences of the characters within the file using a specific
mapping or coding format, such as ASCII codes, Huffman codes, adaptive codes, etc.
The coding format has a huge influence on the entropy of the generated binary sequence
and consequently on the compression ratio (C) or the coding rate (Cr) that can be achieved.

The entropy is a measure of the information content of a message and the smallest
number of bits per character needed, on average, to represent a message. Therefore, the
entropy of a complete message would be the sum of the individual characters’ entropy.
The entropy of a character (symbol) is represented as the negative logarithm of its
probability and expressed using base two.




                                            13
Where the probability of each symbol of the alphabet is constant, the entropy is
calculated as [Bel 89, Bel 90]:

        E = − Σ_{i=1}^{n} p_i log2(p_i)                                       (1.1)


Where          E is the entropy in bits

               p_i is the estimated probability of occurrence of character (symbol) i

               n is the number of characters (symbols).

In bit-level processing, n is equal to 2 as we have only two characters (0 and 1).
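
As an illustration of Eqn. (1.1), the short Python sketch below estimates symbol probabilities from a message and computes its zero-order entropy; for a binary sequence, only the two symbols 0 and 1 contribute to the sum. The example strings are arbitrary.

    import math
    from collections import Counter

    def entropy(message):
        # Zero-order entropy in bits per symbol, Eqn. (1.1).
        counts = Counter(message)
        total = len(message)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(entropy("0000000011111111"))   # 1.0 bit/symbol (equiprobable bits)
    print(entropy("0000000000000001"))   # about 0.34 bit/symbol (highly skewed)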

In bit-level data compression algorithms, the binary sequence is usually divided into
groups of bits that are called minterms, blocks, subsequences, etc. In this work we shall
use the term blocks to refer to each group of bits. These blocks might be considered as
representing a Boolean function.

Then, algebraic simplifications are performed on these Boolean functions to reduce the
size or the number of blocks, and hence, the number of bits representing the output
(compressed) data is reduced as well. Examples of such algorithms include: the
Hamming code data compression (HCDC) algorithm [Bah 07b, Bah 08a], the adaptive
HCDC(k) scheme [Bah 07a, Bah 10b, Rab 08], the adaptive character wordlength (ACW)
algorithm [Bah 08b, Bah 10a], the ACW(n,s) scheme [Bah 10a], the Boolean functions
algebraic simplifications algorithm [Nof 07], the fixed length Hamming (FLH) algorithm
[Sha 04], and the neural network based algorithm [Mah 00].
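
The HCDC algorithm itself is described in Chapter 3. As background for how Hamming-code-based schemes distinguish valid from non-valid blocks, the Python sketch below checks whether a 7-bit block is a valid Hamming (7,4) codeword by computing its syndrome. The bit ordering (parity bits at positions 1, 2, and 4) follows the standard textbook convention and is an assumption here; it is not necessarily the exact layout used by the HCDC algorithm.

    def is_valid_hamming_block(block):
        # Return True if the 7-bit block is a valid Hamming (7,4) codeword,
        # i.e. its syndrome is zero. Bits are indexed 1..7, with parity bits
        # at positions 1, 2, and 4 (standard convention).
        assert len(block) == 7 and set(block) <= {"0", "1"}
        bits = [int(b) for b in block]          # bits[0] holds position 1
        syndrome = 0
        for parity_pos in (1, 2, 4):
            s = 0
            for pos in range(1, 8):
                if pos & parity_pos:            # positions covered by this parity bit
                    s ^= bits[pos - 1]
            if s:                               # parity check failed
                syndrome |= parity_pos
        return syndrome == 0

    print(is_valid_hamming_block("0000000"))    # True  (the all-zero codeword)
    print(is_valid_hamming_block("1000000"))    # False (one bit flipped)

Only 16 of the 128 possible 7-bit blocks pass this test, which is what allows valid blocks to be represented more compactly than non-valid ones.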

1.3.3   Classification of data compression algorithms

Data compression algorithms are categorized by several characteristics, such as:

   •    Data compression fidelity

   •    Length of data compression symbols




                                           14
•    Data compression symbol table

   •    Data compression processing time

In what follows a brief definition is given for each of the above classification criteria.

Data compression fidelity

Basically data compression can be classified into two fundamentally different styles of
data compression depending on the fidelity of the restored data, these are:

(1) Lossless data compression algorithms

       In a lossless data compression, a transformation of the representation of the original
       data set is performed such that it is possible to reproduce exactly the original data
       set by performing a decompression transformation. This type of compression is
       usually used in compressing text files, executable codes, word processing files,
       database files, tabulation files, and whenever the original needs to be exactly
       restored from the compressed data.

       Many popular data compression applications have been developed utilizing lossless
       compression algorithms; for example, lossless compression algorithms are used in
       the popular ZIP file format and in the UNIX tool gzip. Lossless compression is
       mainly used for text and executable file compression, as for such files the data must
       be retrieved exactly, otherwise it is useless. It is also used as a component within
       lossy data compression technologies. It can usually achieve a compression ratio in
       the range of 2:1 to 8:1.

(2) Lossy data compression algorithms

        In a lossy data compression, a transformation of the representation of the original
        data set is performed such that an exact representation of the original data set
        cannot be reproduced, but an approximate representation is reproduced by
        performing a decompression transformation.




                                             15
Lossy data compression is used in applications wherein an exact representation of
        the original data is not necessary, such as streaming multimedia on the Internet,
        telephony and voice applications, and some image file formats. Lossy
        compression can provide higher compression ratios of 100:1 to 200:1, depending
        on the type of information being compressed. In addition, a higher compression
        ratio can be achieved if more errors are allowed to be introduced into the original
        data [Lel 87].

Length of data compression symbols

Data compression algorithms can be classified, depending on the length of the symbols
the algorithm can process, into fixed and variable length; regardless of whether the
algorithm uses variable length symbols in the original data or in the compressed data, or
both.

For example, the run-length encoding (RLE) uses fixed length symbols in both the
original and the compressed data. Huffman encoding uses variable length compressed
symbols to represent fixed-length original symbols. Other methods compress variable-
length original symbols into fixed-length or variable-length compressed data.

Data compression symbol table

Data compression algorithms can be classified as either static, adaptive, or semi-adaptive
data compression algorithms [Rue 06, Pla 06, Smi 97]. In static compression algorithms,
the encoding process is fixed regardless of the data content, while in adaptive algorithms,
the encoding process is data dependent. In semi-adaptive algorithms, the data to be
compressed are first analyzed in their entirety, an appropriate model is then built, and
afterwards the data are encoded. The model is stored as part of the compressed data, as it
is required by the decompressor to reverse the compression.

Data compression/decompression processing time

Data compression algorithms can be classified according to the compression/
decompression processing time as symmetric or asymmetric algorithms. In symmetric



                                            16
algorithms the compression/decompression processing time are almost the same; while
for asymmetric algorithms, normally, the compression time is much more than the
decompression processing time [Pla 06].

1.3.4         Performance evaluation parameters

In order to compare the efficiency of the different compression techniques reliably,
without allowing extreme cases to cloud or bias the comparison unfairly, certain issues
need to be considered.

The most important issues that need to be taken into account in evaluating the performance
of various algorithms include [Say 00]:

      (1)        Measuring the amount of compression

      (2)        Compression/decompression time (algorithm complexity)

These issues need to be carefully considered in the context for which the compression
algorithm is used. Practically, things like finite memory, error control, type of data, and
compression style (adaptive/dynamic, semi-adaptive or static) are also factors that should
be considered in comparing the different data compression algorithms.

(1)           Measuring the amount of compression

Several parameters are used to measure the amount of compression that can be achieved
by a particular data compression algorithm, such as:

      (i)        Compression ratio (C)

      (ii)       Reduction ratio (Rs)

      (iii)      Coding rate (Cr)




                                              17
Definitions of these parameters are given below.

(i)     Compression ratio (C)

The compression ratio (C) is defined as the ratio between the size of the data before
compression and the size of the data after compression. It is expressed as:

        C = So / Sc                                                           (1.2)

Where          So is the size of the original data (uncompressed data)

               Sc is the size of the compressed data

(ii)    Reduction ratio (Rs)

The reduction ratio (Rs) represents the ratio of the difference between the size of the
original data (So) and the size of the compressed data (Sc) to the size of the original data.
It is usually given as a percentage and is mathematically expressed as:

        Rs = ((So − Sc) / So) × 100%                                          (1.3)

which, using Eqn. (1.2), can also be written in terms of the compression ratio as:

        Rs = (1 − 1/C) × 100%                                                 (1.4)
(iii)   Coding rate (Cr)

The coding rate (Cr) expresses the same concept as the compression ratio, but relates
the ratio to a more tangible quantity. For example, for a text file, the coding rate may be
expressed in “bits/character” (bpc), where an uncompressed text file has a coding rate of
7 or 8 bpc. Similarly, the coding rate of an audio stream may be expressed in “bits/sample”,
while for still image compression the coding rate is expressed in “bits/pixel”. In general,
the coding rate can be expressed mathematically as:

        Cr = (q · Sc) / So                                                    (1.5)

Where q is the number of bits representing each symbol in the uncompressed file. The
relationship between the coding rate (Cr) and the compression ratio (C), for example for a
text file originally using 7 bpc, is given by:

        Cr = 7 / C                                                            (1.6)


It can be clearly seen from Eqn. (1.6) that a lower coding rate indicates a higher
compression ratio.
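
A small Python sketch that computes the three measures defined above from the original and compressed sizes, assuming q = 7 bits per character for a text file; the example sizes are arbitrary.

    def compression_measures(original_size, compressed_size, q=7):
        # Compression ratio C, reduction ratio Rs (percent), and coding rate Cr
        # (bits per character), following Eqns. (1.2), (1.3), and (1.5).
        c = original_size / compressed_size
        rs = (original_size - compressed_size) / original_size * 100
        cr = q * compressed_size / original_size
        return c, rs, cr

    c, rs, cr = compression_measures(130_000, 100_000)
    print(f"C = {c:.2f}, Rs = {rs:.1f}%, Cr = {cr:.2f} bpc")
    # C = 1.30, Rs = 23.1%, Cr = 5.38 bpc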

(2)    Compression/decompression time (algorithm complexity)

The compression/decompression time (which is an indication of the algorithm
complexity) is defined as the processing time required to compress or decompress the
data. The compression and decompression times have to be evaluated separately. As
discussed in Section 1.3.3, data compression algorithms are classified according
to the compression/decompression time into either symmetric or asymmetric algorithms.

In this context, data storage applications are mainly concerned with the amount of
compression that can be achieved and the decompression processing time required to
retrieve the data (asymmetric algorithms), since in such applications the compression is
performed only once or infrequently.

Data transmission applications focus predominantly on reducing the amount of data to be
transmitted over communication channels, and the compression and decompression
processing times are the same at the respective junctions or nodes (symmetric algorithms)
[Liu 05].

For a fair comparison between the different available algorithms, it is important to
consider both the amount of compression and the processing time. Therefore, it would be
useful to be able to parameterize the algorithm such that the compression ratio and
processing time could be optimized for a particular application.


                                            19
There are extreme cases where data compression works very well and other conditions
where it is inefficient; the type of data that the original file contains and the upper
limits of the processing time have an appreciable effect on the efficiency of the selected
technique. Therefore, it is important to select the most appropriate technique for a
particular data profile in terms of both data compression and processing time [Rue 06].

1.4      Current Trends in Building High-Performance Web Search
         Engine
Several major trends can be identified in the literature for building high-performance
Web search engines. A list of these trends is given below, and further discussion is given
in Chapter 2; these trends include:

      (1) Succinct data structure

      (2) Compressed full-text self-index

      (3) Query optimization

      (4) Efficient architectural design

      (5) Scalability

      (6) Semantic Search Engine

      (7) Using Social Network

      (8) Caching

1.5      Statement of the Problem
Due to the rapid growth in the size of the Web, Web search engines are facing enormous
performance challenges in terms of storage capacity, data retrieval rate, query processing
time, and communication overhead. Large search engines, in particular, have to be able to
process tens of thousands of queries per second on tens of billions of documents, making
query throughput a critical issue. To satisfy this heavy workload, search engines use a
variety of performance optimization techniques, including index compression; some
obvious solutions to these issues are to develop more succinct data structures, compressed
indexes, query optimization, and higher-speed processing and communication systems.

We believe that the current search engine model cannot meet user and application needs,
and that higher retrieval performance and more compact, cost-effective systems are still
required. The main contribution of this thesis is to develop a novel Web search engine
model that is based on index-query compression; therefore, it is referred to as the CIQ
Web search engine model or simply the CIQ model. The model incorporates two bit-level
compression layers, both implemented at the back-end processor side: one after the
indexer, acting as a second compression layer to generate a double-compressed index, and
the other after the query parser, compressing the query to enable bit-level compressed
index-query search. As a result, less disk space is required to store the index file, disk I/O
overheads are reduced, and consequently a higher retrieval rate or performance is achieved.

1.6       Objectives of this Thesis
The main objectives of this thesis can be summarized as follows:

      •   Develop a new Web search engine model that is as accurate as the current Web
          search engine model, requires less disk space for storing index files, performs the
          search process faster than current models, reduces disk I/O overheads, and
          consequently provides a higher retrieval rate or performance.

      •   Modify the HCDC algorithm to meet the requirements of the new CIQ model.

      •   Study and optimize the statistics of the inverted index files to achieve maximum
          possible performance (compression ratio and minimum searching time).

      •   Validate the searching accuracy of the new CIQ Web search engine model.

      •   Evaluate and compare the performance of the new Web search engine model in
          terms of disk space requirement and query processing time (searching time) for
          different search scenarios.




                                             21
1.7    Organization of this Thesis
This thesis is organized into five chapters. Chapter 1 provides an introduction to the
general domain of this thesis. The rest of this thesis is organized as follows: Chapter 2
presents a literature review and summarizes some of the previous work related to Web
search engines, in particular, work related to enhancing the performance of Web search
engines through data compression at different levels.

Chapter 3 describes the concept, methodology, and implementation of the novel CIQ Web
search engine model. It also includes a detailed description of the HCDC algorithm and
the modifications implemented to meet the new application's needs.

Chapter 4 presents a description of a number of scenarios simulated to evaluate the
performance of the new Web search engine model. The effect of index file size on the
performance of the new model is investigated and discussed. Finally, in Chapter 5, based
on the results obtained from the different simulations, conclusions are drawn and
recommendations for future work are pointed out.




                                          22
Chapter Two
                                 Literature Review
This work is concerned with the development of a novel high-performance Web search
engine model that is based on compressing the index files and search queries using a bit-
level data compression technique, namely, the novel Hamming-code-based data
compression (HCDC) algorithm [Bah 07b, Bah 08a]. In this model the search process is
performed at the compressed index-query level. The model produces a double-compressed
index file, which consequently requires less disk space to store the index files and reduces
communication time; compressing the search query, on the other hand, reduces I/O
overheads and increases the retrieval rate.

This chapter presents a literature review, which is divided into three sections. Section 2.1
presents a brief definition of the current trends towards enhancing the performance of
Web search engines. Then, in Sections 2.2 and 2.3, we present a review of some of the
most recent and related work on Web search engines and bit-level data compression
algorithms, respectively.

2.1     Trends Towards High-Performance Web Search Engine
Chapter 1 lists several major trends that can be identified in the literature for building
high-performance Web search engines. In what follows, we provide a brief definition for
each of these trends.

2.1.1   Succinct data structure

Recent years have witnessed an increasing interest in succinct data structures. Their aim
is to represent the data using as little space as possible, yet efficiently answer queries
on the represented data. Several results exist on the representation of sequences [Fer 07,
Ram 02], trees [Far 05], graphs [Mun 97], permutations and functions [Mun 03], and
texts [Far 05, Nav 04].

One of the most basic structures, which lies at the heart of the representation of more
complex ones, is the binary sequence with rank and select queries. Given a binary sequence
S = s1s2 … sn, Rankc(S; q) denotes the number of times the bit c appears in
S[1; q] = s1s2 … sq, and Selectc(S; q) denotes the position in S of the q-th occurrence of bit c.
The best results answer those queries in constant time, retrieve any sq in constant time, and
occupy nH0(S)+O(n) bits of storage, where H0(S) is the zero-order empirical entropy of S.
This space bound includes that for representing S itself, so the binary sequence is
represented in compressed form yet allows those queries to be answered optimally
[Ram 02].
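
To make the rank and select semantics concrete, the following minimal Python sketch answers both queries by a simple linear scan; the function names and the O(n) scan are illustrative assumptions only, not the succinct constant-time structures discussed above.

    def rank(S, c, q):
        """Number of occurrences of bit c in S[1..q] (1-based, inclusive)."""
        return S[:q].count(c)

    def select(S, c, q):
        """Position (1-based) of the q-th occurrence of bit c in S, or -1 if absent."""
        count = 0
        for i, bit in enumerate(S, start=1):
            if bit == c:
                count += 1
                if count == q:
                    return i
        return -1

    # Example: S = 0 1 1 0 1 0 1 1
    S = [0, 1, 1, 0, 1, 0, 1, 1]
    print(rank(S, 1, 5))    # 3 ones appear in S[1..5]
    print(select(S, 0, 2))  # the 2nd zero is at position 4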

For the general case of sequences over an arbitrary alphabet of size r, the only known
result is the one in [Gro 03], which still achieves nH0(S)+O(n) space occupancy. The data
structure in [Gro 03] is the elegant wavelet tree; it takes O(log r) time to answer Rankc(S;
q) and Selectc(S; q) queries, and to retrieve any character sq.

2.1.2   Compressed full-text self-index

A compressed full-text self-index [Nav 07] represents a text in a compressed form and
still answers queries efficiently. This represents a significant advancement over the full-
text indexing techniques of the previous decade, whose indexes required several times the
size of the text.

Although it is relatively new, this algorithmic technology has matured to a point where
theoretical research is giving way to practical developments. Nonetheless, this requires
significant programming skills, a deep engineering effort, and a strong algorithmic back-
ground to dig into the research results. To date, only isolated implementations and focused
comparisons of compressed indexes have been reported, and they lacked a common API,
which prevented their re-use or deployment within other applications.

2.1.3   Query optimization

Query optimization is an important skill for search engine developers and database ad-
ministrators (DBAs). In order to improve the performance of the search queries, develop-
ers and DBAs need to understand the query optimizer and the techniques it uses to select
an access path and prepare a query execution plan. Query tuning involves knowledge of
techniques such as cost-based and heuristic-based optimizers, plus the tools a search
platform provides for explaining a query execution plan [Che 01].

2.1.4   Efficient architectural design

Answering a large number of queries per second on a huge collection of data requires the
equivalent of a small supercomputer, and all current major engines are based on large
clusters of servers connected by high-speed local area networks (LANs) or storage area
networks (SANs).

There are two basic ways to partition an inverted index structure over the nodes:

   •    A local index organization where each node builds a complete index on its own
        subset of documents (used by AltaVista and Inktomi)

   •    A global index organization where each node contains complete inverted lists for
        a subset of the words.

Each scheme has advantages and disadvantages that we do not have space to discuss here
and further discussions can be found in [Bad 02, Mel 00].
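
As an illustration of the difference between the two organizations, the following Python sketch builds both partitionings for a toy document collection; the node-assignment rules (round-robin by document ID and hashing by term) are simplifying assumptions for demonstration only.

    from collections import defaultdict

    docs = {1: "web search engine", 2: "index compression", 3: "web index"}
    NODES = 2

    # Local (document-partitioned) organization: each node indexes its own documents.
    local = [defaultdict(list) for _ in range(NODES)]
    for doc_id, text in docs.items():
        node = doc_id % NODES                 # assumption: round-robin by doc ID
        for term in text.split():
            local[node][term].append(doc_id)

    # Global (term-partitioned) organization: each node holds complete lists for some terms.
    global_index = [defaultdict(list) for _ in range(NODES)]
    for doc_id, text in docs.items():
        for term in text.split():
            node = hash(term) % NODES         # assumption: hash terms to nodes
            global_index[node][term].append(doc_id)

    # A query for "web" touches every node in the local scheme,
    # but exactly one node in the global scheme.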

2.1.5   Scalability

Search engine technology should scale in a dramatic way to keep up with the growth of
the Web [Bri 98]. In 1994, one of the first Web search engines, the World Wide Web
Worm (WWWW), had an index of 110,000 pages [Mcb 94]. At the end of 1997, the top
search engines claimed to index from 2 million (WebCrawler) to 100 million Web docu-
ments [Bri 98]. In 2005, Google claimed to index 1.2 billion pages (as shown on the
Google home page), and in July 2008 Google announced that it had hit a new milestone:
1 trillion (as in 1,000,000,000,000) unique URLs on the Web at once [Web 2].

At the same time, the number of queries search engines handle has grown rapidly too. In
March and April 1994, the WWWW received an average of about 1500 queries per day.
In November 1997, AltaVista claimed it handled roughly 20 million queries per day. With
the increasing number of users on the Web, and automated systems which query search
engines, Google handled hundreds of millions of queries per day in 2000 and about 3
billion queries per day in 2009, and Twitter handled about 635 million queries per day
[Web 1].

Creating a Web search engine which scales even to today’s Web presents many challenges.
Fast crawling technology is needed to gather the Web documents and keep them up to
date. Storage space must be used efficiently to store indexes and, optionally, the docu-
ments themselves as cached pages. The indexing system must process hundreds of giga-
bytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thou-
sands per second.

2.1.6   Semantic search engine

The semantic Web is an extension of the current Web in which information is given well-
defined meaning, better enabling computers and people to work together in cooperation
[Guh 03]. It is the idea of having data on the Web defined and linked in a way that it can
be used for more effective discovery, automation, integration, and reuse across various
applications.

In particular, the semantic Web will contain resources corresponding not just to media ob-
jects (such as Web pages, images, audio clips, etc.) as the current Web does, but also to
objects such as people, places, organizations and events. Further, the semantic Web will
contain not just a single kind of relation (the hyperlink) between resources, but many dif-
ferent kinds of relations between the different types of resources mentioned above [Guh
03].

Semantic search attempts to augment and improve traditional search results (based on in-
formation retrieval technology) by using data from the semantic Web and to produce pre-
cise answers to user queries. This can be done easily by taking advantage of the availabil-
ity of explicit semantics of information in the context of the semantic Web search engine
[Lei 06].

2.1.7   Using Social Networks

There is an increasing interest in social networks. In general, recent studies suggest
that a person's social network has a significant impact on his or her information
acquisition [Kir 08]. It is an ongoing trend that people increasingly reveal very personal
information on social network sites in particular and on the Web in general.

As this information becomes more and more publicly available from these various social
network sites and the Web in general, the social relationships between people can be
identified. This in turn enables the automatic extraction of social networks. This trend is
furthermore driven and enforced by recent initiatives such as Facebook's Connect,
MySpace's Data Availability, and Google's FriendConnect, which make their social
network data available to anyone [Kir 08].

Combining social network data with search engine technology to improve the relevancy
of the results to the users and to increase the sociality of the results is therefore one of the
trends currently pursued by search engines such as Google and Bing. Microsoft and Facebook
have announced a new partnership that brings Facebook data and profile search to Bing.
The deal marks a big leap forward in social search and also represents a new advantage
for Bing [Web 3].

2.1.8   Caching

Popular Web search engines receive around hundreds of millions of queries per day, and for
each search query, return a result page(s) to the user who submitted the query. The user
may request additional result pages for the same query, submit a new query, or quit the
searching process altogether. An efficient scheme for caching query result pages may en-
able search engines to lower their response time and reduce their hardware requirements
[Lem 04].

Studies have shown that a small set of popular queries accounts for a significant fraction
of the query stream. These statistical properties of the query stream seem to call for the
caching of search results [Sar 01].
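
To illustrate how result-page caching exploits this skewed query distribution, the following minimal Python sketch caches result pages for repeated queries using a least-recently-used (LRU) policy; the cache size, the query normalization, and the fetch_results callback are illustrative assumptions rather than details of any particular engine.

    from collections import OrderedDict

    class QueryResultCache:
        """Tiny LRU cache for search-result pages (illustrative only)."""
        def __init__(self, capacity=1000):
            self.capacity = capacity
            self.pages = OrderedDict()              # normalized query -> result page

        def get(self, query, fetch_results):
            key = " ".join(query.lower().split())   # assumed normalization
            if key in self.pages:
                self.pages.move_to_end(key)         # cache hit: no index access needed
                return self.pages[key]
            page = fetch_results(key)               # cache miss: evaluate the query
            self.pages[key] = page
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)      # evict the least recently used entry
            return page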

2.2     Recent Research on Web Search Engine
E. Moura et al. [Mou 97] presented a technique to build an index based on suffix arrays
for compressed texts. They developed a compression scheme for textual databases based
on words that generates a compression code that preserves the lexicographical ordering of


the text words. As a consequence, it permits the sorting of the compressed strings to gen-
erate the suffix array without decompressing. Their results demonstrated that as the com-
pressed text is under 30% of the size of the original text, they were able to build the suffix
array twice as fast on the compressed text. The compressed text plus index is 55-60% of
the size of the original text plus index and search times were reduced to approximately
half the time. They presented analytical and experimental results for different variations
of the word-oriented compression paradigm.

S. Varadarajan and T. Chiueh [Var 97] described a text search engine called shrink and
search engine (SASE), which operates in the compressed domain. It provides an exact
search mechanism using an inverted index and an approximate search mechanism using a
vantage point tree. SASE allows a flexible trade-off between search time and storage
space required to maintain the search indexes. The experimental results showed that the
compression efficiency is within 7-17% of GZIP, which is one of the best lossless
compression utilities. The sum of the compressed file size and the inverted indexes is
only between 55-76% of the original database, while the search performance is
comparable to a fully inverted index.

S. Brin and L. Page [Bri 98] presented the Google search engine, a prototype of a large-
scale search engine which makes heavy use of the structure present in hypertext. Google
is designed to crawl and index the Web efficiently and produce much more satisfying
search results than existing systems. They provided an in-depth description of the large-
scale web search engine. Apart from the problems of scaling traditional search techniques
to data of large magnitude, there are many other technical challenges, such as the use of
the additional information present in hypertext to produce better search results. In their
work they addressed the question of how to build a practical large-scale system that can
exploit the additional information present in hypertext.

E. Moura et al. [Mou 00] presented a fast compression and decompression technique for
natural language texts. The novelties are that (i) decompression of arbitrary portions of
the text can be done very efficiently, (ii) exact search for words and phrases can be done
on the compressed text directly by using any known sequential pattern matching



algorithm, and (iii) word-based approximate and extended search can be done efficiently
without any decoding. The compression scheme uses a semi-static word-based model and
a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented.

N. Fuhr and N. Govert [Fuh 02] investigated two different approaches for reducing index
space of inverted files for XML documents. First, they considered methods for
compressing index entries. Second, they developed the new XS tree data structure which
contains the structural description of a document in a rather compact form, such that
these descriptions can be kept in main memory. Experimental results on two large XML
document collections show that very high compression rates for indexes can be achieved,
but any compression increases retrieval time.

A. Nagarajarao et al. [Nag 02] implemented an inverted index as a part of a mass
collaboration system. It provides the facility to search for documents that satisfy a given
query. It also supports incremental updates whereby documents can be added without re-
indexing. The index can be queried even when updates are being done to it. Further,
querying can be done in two modes: a normal mode that can be used when an immediate
response is required, and a batched mode that can provide better throughput at the cost of
increased response time for some requests. The batched mode may be useful in an alert
system where some of the queries can be scheduled. They implemented generators to
generate large data sets that they used as benchmarks. They tested their inverted index
with data sets of the order of gigabytes to ensure scalability.

R. Grossi et al. [Gro 03] presented a novel implementation of compressed suffix arrays
exhibiting new tradeoffs between search time and space occupancy for a given text (or
sequence) of n symbols over an alphabet α, where each symbol is encoded by log|α| bits.
They showed that compressed suffix arrays use just nHh + O(n log log n / log|α| n) bits,
while retaining full-text indexing functionalities, such as searching any pattern sequence
of length m in O(m log|α| + polylog(n)) time. The term Hh < log|α| denotes the hth-order
empirical entropy of the text, which means that the index is nearly optimal in space apart
from lower-order terms, achieving asymptotically the empirical entropy of the text (with a
multiplicative constant 1). If the text is highly compressible so that Hh = O(1)
and the alphabet size is small, they obtained a text index with O(m) search time that
requires only O(n) bits.

X. Long and T. Suel [Lon 03] studied pruning techniques that can significantly improve
query throughput and response times for query execution in large engines in the case
where there is a global ranking of pages, as provided by Page rank or any other method,
in addition to the standard term-based approach. They described pruning schemes for this
case and evaluated their efficiency on an experimental cluster-based search engine with
millions of Web pages. Their results showed that there is significant potential benefit in
such techniques.

V. N. Anh and A. Moffat [Anh 04] described a scheme for compressing lists of integers
as sequences of fixed binary codewords that had the twin benefits of being both effective
and efficient. Because Web search engines index large quantities of text the static costs
associated with storing the index can be traded against dynamic costs associated with
using it during query evaluation. Typically, index representations that are effective and
obtain good compression tend not to be efficient, in that they require more operations
during query processing. The approach described by Anh and Moffat results in a
reduction in index storage costs compared to their previous word-aligned version, with no
cost in terms of query throughput.

Udayan Khurana and Anirudh Koul [Khu 05] presented a new compression scheme for
text. The scheme is efficient in giving high compression ratios and enables super-fast
searching within the compressed text. Typical compression ratios of 70-80% and a reduc-
tion of the search time by 80-85% are the features of this scheme. Until then, a trade-off
between high compression ratios and searchability within compressed text had been
observed. In this paper, they showed that the greater the compression, the faster the search.

Stefan Buttcher and Charles L. A. Clarke [But 06] examined index compression tech-
niques for schema-independent inverted files used in text retrieval systems. Schema-inde-
pendent inverted files contain full positional information for all index terms and allow the
structural unit of retrieval to be specified dynamically at query time, rather than statically
during index construction. Schema-independent indexes have different characteristics
than document-oriented indexes, and this difference can greatly affect the effectiveness of
index compression algorithms. Their experimental results show that unaligned binary
codes that take into account the special properties of schema-independent indexes
achieve better compression rates than methods designed for compressing document in-
dexes, and that they can reduce the size of the index by around 15% compared to byte-
aligned index compression.

P. Ferragina et al. [Fer 07] proposed two new compressed representations for general se-
quences, which produce an index that improves over the one in [Gro 03] by removing
from the query times the dependence on the alphabet size and the polylogarithmic terms.

R. Gonzalez and G. Navarro [Gon 07a] introduced a new compression scheme for suffix
arrays which permits locating the occurrences extremely fast, while still being much
smaller than classical indexes. In addition, their index permits a very efficient secondary
memory implementation, where compression permits reducing the amount of I/O needed
to answer queries. Compressed text self-indexes had matured up to a point where they
can replace a text by a data structure that requires less space and, in addition to giving
access to arbitrary text passages, support indexed text searches. At this point those
indexes are competitive with traditional text indexes (which are very large) for counting
the number of occurrences of a pattern in the text. Yet, they are still hundreds to
thousands of times slower when it comes to locating those occurrences in the text.

R. Gonzalez and G. Navarro [Gon 07b] introduced a disk-based compressed text index
that, when the text is compressible, takes little more than the plain text size (and replaces
it). It provides very good I/O times for searching, which in particular improve when the
text is compressible. In this aspect the index is unique, as compressed indexes have been
slower than their classical counterparts on secondary memory. They analyzed their index
and showed experimentally that it is extremely competitive on compressible texts.

A. Moffat and J. S. Culpepper [Mof 07] showed that a relatively simple combination of
techniques allows fast calculation of Boolean conjunctions within a surprisingly small
amount of data transferred. This approach exploits the observation that queries tend to
contain common words, and that representing common words via a bitvector allows
random access testing of candidates, and, if necessary, fast intersection operations prior to
the list of candidates being developed. By using bitvectors for a very small number of
terms that occur frequently (in both documents and in queries), and byte-coded inverted
lists for the balance, both querying time and query-time data-transfer volumes can be reduced.
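
A minimal sketch of the idea follows, assuming Python integers are used as bitvectors (one bit per document) for the frequent terms and ordinary posting lists for the rare ones; the example terms and documents are hypothetical.

    NUM_DOCS = 8

    def postings_to_bitvector(postings):
        """Pack a posting list (doc IDs) into an integer bitvector, one bit per document."""
        bv = 0
        for doc_id in postings:
            bv |= 1 << doc_id
        return bv

    # Frequent terms are kept as bitvectors; rare terms keep ordinary posting lists.
    bitvectors = {"the": postings_to_bitvector([0, 1, 2, 3, 5, 6, 7]),
                  "web": postings_to_bitvector([1, 2, 5, 7])}
    posting_lists = {"hamming": [2, 5]}

    def conjunction(query_terms):
        """AND the frequent-term bitvectors, then filter candidates with the rare terms."""
        bv = (1 << NUM_DOCS) - 1
        for t in query_terms:
            if t in bitvectors:
                bv &= bitvectors[t]                  # cheap bitwise intersection
        candidates = [d for d in range(NUM_DOCS) if bv >> d & 1]
        for t in query_terms:
            if t in posting_lists:                   # random-access testing of candidates
                candidates = [d for d in candidates if d in posting_lists[t]]
        return candidates

    print(conjunction(["the", "web", "hamming"]))    # -> [2, 5]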

The techniques described in [Mof 07] are not applicable to other powerful forms of
querying. For example, index structures that support phrase and proximity queries have a
much more complex structure, and are not amenable to storage (in their full form) using
bitvectors. Nevertheless, there may be scope for evaluation regimes that make use of
preliminary conjunctive filtering before a more detailed index is consulted, in which case
the structures described in [Mof 07] would still be relevant.

Due to the rapid growth in the size of the web, web search engines are facing enormous
performance challenges. The larger engines in particular have to be able to process tens
of thousands of queries per second on tens of billions of documents, making query
throughput a critical issue. To satisfy this heavy workload, search engines use a variety of
performance optimizations including index compression, caching, and early termination.

J. Zhang et al. [Zha 08] focused on two techniques, inverted index compression and index
caching, which play a crucial role in Web search engines as well as other high-
performance information retrieval systems. They performed a comparison and evaluation
of several inverted list compression algorithms, including new variants of existing
algorithms that had not been studied before. They then evaluated different inverted list
caching policies on large query traces, and finally studied the possible performance benefits
of combining compression and caching. The overall goal of their paper was to provide an
updated discussion and evaluation of these two techniques, and to show how to select the
best set of approaches and settings depending on parameters such as disk speed and main
memory cache size.

P. Ferragina et al [Fer 09] presented an article to fill the gap between implementations
and focused comparisons of compressed indexes. They presented the existing implemen-
tations of compressed indexes from a practitioner's point of view; introduced the
Pizza&Chili site, which offers tuned implementations and a standardized API for the


most successful compressed full-text self-indexes, together with effective test-beds and
scripts for their automatic validation and test; and, finally, they showed the results of ex-
tensive experiments on a number of codes with the aim of demonstrating the practical rel-
evance of this novel algorithmic technology.


H. Yan et al [Yan 09] studied index compression and query processing techniques for
such reordered indexes. Previous work has focused on determining the best possible or-
dering of documents. In contrast, they assumed that such an ordering is already given,
and focus on how to optimize compression methods and query processing for this case.
They performed an extensive study of compression techniques for document IDs and pre-
sented new optimizations of existing techniques which can achieve significant improve-
ment in both compression and decompression performances. They also proposed and
evaluated techniques for compressing frequency values for this case. Finally, they studied
the effect of this approach on query processing performance. Their experiments showed
very significant improvements in index size and query processing speed on the TREC
GOV2 collection of 25.2 million Web pages.




2.3    Recent Research on Bit-Level Data Compression Algorithms
This section presents a review of some of the most recent research on developing
efficient bit-level data compression algorithms, as the algorithm used in this thesis is a bit-
level technique.

A. Jardat and M. Irshid [Jar 01] proposed a very simple and efficient binary run-length
compression technique. The technique is based on mapping the non-binary information



source into an equivalent binary source using a new fixed-length code instead of the
ASCII code. The codes are chosen such that the probability of one of the two binary
symbols, say zero, at the output of the mapper is made as small as possible. Moreover,
the "all ones" code is excluded from the code assignment table to ensure the presence of
at least one "zero" in each of the output codewords.

Compression is achieved by encoding the number of "ones" between two consecutive
"zeros" using either a fixed-length code or a variable-length code. When applying this
simple encoding technique to English text files, they achieved a compression of 5.44 bpc
(bits per character) and 4.6 bpc for the fixed-length code and the variable-length
(Huffman) code, respectively.
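
The run-length step can be illustrated with a short Python sketch that extracts the lengths of the "ones" runs between consecutive zeros; the output coding hinted at in the comments is a simplifying assumption, not the exact code of [Jar 01].

    def ones_run_lengths(bits):
        """Lengths of the runs of 1s terminated by each 0 (every codeword contains a 0)."""
        runs, current = [], 0
        for b in bits:
            if b == 1:
                current += 1
            else:                    # a 0 terminates the current run (possibly of length 0)
                runs.append(current)
                current = 0
        return runs

    bits = [1, 1, 0, 1, 1, 1, 0, 1, 0]
    print(ones_run_lengths(bits))    # [2, 3, 1]

    # A fixed-length encoder would store each run length in a fixed number of bits,
    # while a variable-length (Huffman) encoder would assign shorter codes to the
    # most frequent run lengths.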

Caire et al [Cai 04] presented a new approach to universal noiseless compression based
on error correcting codes. The scheme was based on the concatenation of the Burrows-
Wheeler block sorting transform (BWT) with the syndrome former of a low-density
parity-check (LDPC) code. Their scheme has linear encoding and decoding times and
uses a new closed-loop iterative doping algorithm that works in conjunction with belief-
propagation decoding. Unlike the leading data compression methods, their method is
resilient against errors, and lends itself to joint source-channel encoding/decoding;
furthermore their method offers very competitive data compression performance.

A. A. Sharieh [Sha 04] introduced a fixed-length Hamming (FLH) algorithm as
enhancement to Huffman coding (HU) to compress text and multimedia files. He
investigated and tested these algorithms on different text and multimedia files. His results
indicated that the HU-FLH and FLH-HU enhanced the compression ratio.

K. Barr and K. Asanović [Bar 06] presented a study of the energy savings possible by
losslessly compressing data prior to transmission. Because wireless transmission of a single
bit can require over 1000 times more energy than a single 32-bit computation, it can
be beneficial to perform additional computation to reduce the number of bits
transmitted.

If the energy required to compress data is less than the energy required to send it, there is



a net energy savings and an increase in battery life for portable computers. This work
demonstrated that, with several typical compression algorithms, there was actually a net
energy increase when compression was applied before transmission. Reasons for this
increase were explained and suggestions were made to avoid it. One such energy-aware
suggestion was asymmetric compression, the use of one compression algorithm on the
transmit side and a different algorithm for the receive path. By choosing the lowest-
energy compressor and decompressor on the test platform, overall energy to send and
receive data can be reduced by 11% compared with a well-chosen symmetric pair, or up
to 57% over the default symmetric scheme.

The value of this research is not merely to show that one can optimize a given algorithm
to achieve a certain reduction in energy, but to show that the choice of how and whether
to compress is not obvious. It is dependent on hardware factors such as relative energy of
the central processing unit (CPU), memory, and network, as well as software factors
including compression ratio and memory access patterns. These factors can change, so
techniques for lossless compression prior to transmission/reception of data must be re-
evaluated with each new generation of hardware and software.

A. Jaradat et al. [Jar 06] proposed a file splitting technique for the reduction of the nth-
order entropy of text files. The technique is based on mapping the original text file into a
non-ASCII binary file using a new codeword assignment method; the resulting binary file
is then split into several sub-files, each of which contains one or more bits from each
codeword of the mapped binary file. The statistical properties of the sub-files were studied,
and it was found that they reflect the statistical properties of the original text file, which
was not the case when the ASCII code was used as a mapper.

The nth-order entropy of these sub-files was determined, and it was found that the sum of
their entropies was less than that of the original text file for the same values of
extensions. These interesting statistical properties of the resulting sub-files can be used to
achieve better compression ratios when conventional compression techniques are
applied to these sub-files individually and on a bit-wise rather than a character-wise basis.
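
As a rough illustration of the splitting idea, the following Python sketch maps a text to fixed-length codewords and then splits the resulting bit stream into bit-position sub-files; the 4-bit codeword length and the tiny alphabet are hypothetical choices made only for this example.

    def split_into_subfiles(text, alphabet, code_len=4):
        """Map each character to a fixed-length codeword, then build one sub-file per
        bit position (sub-file i collects bit i of every codeword)."""
        code = {ch: format(i, f"0{code_len}b") for i, ch in enumerate(alphabet)}
        subfiles = ["" for _ in range(code_len)]
        for ch in text:
            codeword = code[ch]
            for i in range(code_len):
                subfiles[i] += codeword[i]
        return subfiles

    subs = split_into_subfiles("abracadabra", alphabet="abcdr")
    for i, s in enumerate(subs):
        print(f"sub-file {i}: {s}")
    # Each sub-file can then be compressed individually, on a bit-wise basis, with a
    # conventional compression technique, as suggested in [Jar 06].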

H. Al-Bahadili [Bah 07b, Bah 08a] developed a lossless binary data compression scheme


that is based on the error correcting Hamming codes. It was referred to as the HCDC
algorithm. In this algorithm, the binary sequence to be compressed is divided into blocks
of n bits length. To utilize the Hamming codes, the block is considered as a Hamming
codeword that consists of p parity bits and d data bits (n=d+p).

Then each block is tested to find if it is a valid or a non-valid Hamming codeword. For a
valid block, only the d data bits preceded by 1 are written to the compressed file, while
for a non-valid block all n bits preceded by 0 are written to the compressed file. These
additional 1 and 0 bits are used to distinguish the valid and the non-valid blocks during
the decompression process.

An analytical formula was derived for computing the compression ratio as a function of
block size, and fraction of valid data blocks in the sequence. The performance of the
HCDC algorithm was analyzed, and the results obtained were presented in tables and
graphs. The author concluded that the maximum compression ratio that can be achieved
by this algorithm is n/(d+1), if all blocks are valid Hamming codewords.
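
A minimal Python sketch of the HCDC compression step described above is given below, assuming the (7,4) Hamming code (n=7, d=4, p=3); the parity-bit layout follows the standard Hamming convention and the helper names are illustrative.

    def is_valid_hamming7(block):
        """Check whether a 7-bit block is a valid (7,4) Hamming codeword
        (parity bits at positions 1, 2 and 4, i.e. zero syndrome)."""
        b = [int(x) for x in block]                  # b[0] holds position 1
        s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
        s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
        s3 = b[3] ^ b[4] ^ b[5] ^ b[6]
        return s1 == s2 == s3 == 0

    def hcdc_compress(bits):
        """Compress a binary string whose length is a multiple of 7: valid blocks are
        written as '1' + 4 data bits, non-valid blocks as '0' + all 7 bits."""
        out = []
        for i in range(0, len(bits), 7):
            block = bits[i:i + 7]
            if is_valid_hamming7(block):
                data = block[2] + block[4] + block[5] + block[6]   # data positions 3,5,6,7
                out.append("1" + data)
            else:
                out.append("0" + block)
        return "".join(out)

    # "0000000" is a valid codeword (5 output bits); "1111110" is not (8 output bits).
    print(hcdc_compress("0000000" + "1111110"))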

S. Nofal [Nof 07] proposed a bit-level file compression algorithm. In this algorithm, the
binary sequence is divided into a set of groups of bits, which are considered as minterms
representing Boolean functions. Applying algebraic simplifications to these functions
reduces the number of minterms, and hence the number of bits of the file is
reduced as well. To make decompression possible, one should solve the problem of
dropped Boolean variables in the simplified functions. He investigated one possible
solution, and his evaluation showed that future work should find other solutions to
render this technique useful, as the maximum possible compression ratio achieved
was not more than 10%.

H. Al-Bahadili and S. Hussain [Bah 08b] proposed and investigated the performance of a
bit-level data compression algorithm, in which the binary sequence is divided into blocks
each of n-bit length. This gives each block a possible decimal value between 0 and 2^n-1. If
the number of different decimal values (d) is equal to or less than 256, then the binary
sequence can be compressed using the n-bit character wordlength. Thus, a compression
ratio of approximately n/8 can be achieved. They referred to this algorithm as the
adaptive character wordlength (ACW) algorithm; since the compression ratio of the
algorithm is a function of n, it is also referred to as the ACW(n) algorithm.

Implementation of the ACW(n) algorithm highlights a number of issues that may degrade
its performance, and need to be carefully resolved, such as: (i) If d is greater than 256,
then the binary sequence cannot be compressed using n-bit character wordlength, (ii) the
probability of being able to compress a binary sequence using n-bit character wordlength
is inversely proportional to n, and (iii) finding the optimum value of n that provides
maximum compression ratio is a time consuming process, especially for large binary
sequences. In addition, for text compression, converting text to binary using the
equivalent ASCII code of the characters gives a high entropy binary sequence, thus only a
small compression ratio or sometimes no compression can be achieved.

To overcome the drawbacks of the ACW(n) algorithm mentioned above, Al-Bahadili and
Hussain [Bah 10a] developed an efficient implementation scheme to enhance the
performance of the ACW(n) algorithm. In this scheme the binary sequence is divided
into a number of subsequences (s), each of which satisfies the condition that d is less than
256; therefore, it is referred to as the ACW(n,s) scheme. The scheme achieved
compression ratios of more than 2 on most text files from the most widely used corpora.

H. Al-Bahadili and A. Rababa’a [Bah 07a, Rab 08, Bah 10b] developed a new scheme
that consists of six steps, some of which are applied repetitively to enhance the compression
ratio of the HCDC algorithm [Bah 07b, Bah 08a]; therefore, the new scheme was referred
to as the HCDC(k) scheme, where k refers to the number of repetition loops. The
repetition loops continue until inflation is detected. The overall (accumulated)
compression ratio is the product of the compression ratios of the individual loops.
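
The repetition logic can be sketched as follows, reusing the hcdc_compress function from the earlier HCDC sketch; stopping on inflation and multiplying the per-loop ratios follow the description above, while padding the input to a multiple of 7 bits is an assumption of this example.

    def hcdc_k(bits, max_loops=10):
        """Apply HCDC repeatedly until inflation is detected or max_loops is reached.
        Returns the compressed bits, the number of loops used, and the accumulated ratio."""
        accumulated_ratio, loops = 1.0, 0
        for _ in range(max_loops):
            padded = bits + "0" * (-len(bits) % 7)    # assumption: pad to a multiple of 7
            compressed = hcdc_compress(padded)
            if len(compressed) >= len(padded):        # inflation: stop, keep previous result
                break
            accumulated_ratio *= len(padded) / len(compressed)
            bits, loops = compressed, loops + 1
        return bits, loops, accumulated_ratio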

The results obtained for the HCDC(k) scheme demonstrated that the scheme has a higher
compression ratio than most well-known text compression algorithms, and also exhibits a
competitive performance with respect to many widely-used state-of-the-art software. The
HCDC algorithm and the HCDC(k) scheme will be discussed in detail in the next
chapter.




S. Ogg and B. Al-Hashimi [Ogg 06] proposed a simple yet effective real-time compres-
sion technique that reduces the amount of bits sent over serial links. The proposed tech-
nique reduces the number of bits and the number of transitions when compared to the
original uncompressed data. Results of compression on two MPEG1 coded picture data
showed average bit reductions of approximately 17% to 47% and average transition re-
ductions of approximately 15% to 24% over a serial link. The technique can be employed
with such network-on-chip (NoC) technology to improve the bandwidth bottleneck issue.
Fixed and dynamic block sizing was considered and general guidelines for determining a
suitable fixed block length and an algorithm for dynamic block sizing were shown. The
technique exploits the fact that unused significant bits do not need to be transmitted. Also,
the authors outlined a possible implementation of the proposed compression technique,
and the area overhead costs and potential power and bandwidth savings within a NoC en-
vironment were presented.

J. Zhang and X. Ni [Zha 10] presented a new implementation of bit-level arithmetic cod-
ing using integer additions and shifts. The algorithm has less computational complexity
and more flexibility, and thus is very suitable for hardware design. They showed that their
implementation has the least complexity and the highest speed among Zhao’s algorithm
[Zha 98], the Rissanen and Mohiuddin (RM) algorithm [Ris 89], the Langdon and Rissanen
(LR) algorithm [Lan 82], and the basic arithmetic coding algorithm. Sometimes it achieves
a higher compression rate than the basic arithmetic coding algorithm. Therefore, it provides
an excellent compromise between good performance and low complexity.




Chapter Three
               The Novel CIQ Web Search Engine Model
This chapter presents a description of the proposed Web search engine model. The model
incorporates two bit-level data compression layers, both installed at the back-end
processor, one for index compression (index compressor), and one for query compression
(query or keyword compressor), so that the search process can be performed at the
compressed index-query level and avoid any decompression activities during the
searching process. Therefore, it is referred to as the compressed index-query (CIQ)
model. In order to be able to perform the search process at the compressed index-query
level, it is important to have a data compression technique that is capable of producing
the same pattern for the same character from both the query and the index.

The algorithm that meets the above main requirements is the novel Hamming code data
compression (HCDC) algorithm [Bah 07b, Bah 08a]. The HCDC algorithm creates a
compressed file header (compression header) to store some parameters that are relevant
to compression process, which mainly include the character-binary coding pattern. This
header should be stored separately to be accessed by the query compressor and the index
decompressor. Introducing the new compression layers should reduce disk space for
storing index files; increase query throughput and consequently retrieval rate. On the
other hand, compressing the search query reduces I/O overheads and query processing
time as well as the system response time.

This section outlines the main theme of this chapter. The rest of this chapter is organized
as follows: The detailed description of the new CIQ Web search engine model is given in
Section 3.1. Section 3.2 presents the implementation of the new model and its main
procedures. The data compression algorithm, namely, the HCDC algorithm, is described
in Section 3.3, together with the derivation and analysis of the HCDC compression ratio.
The performance measures that are used to evaluate and compare the performance of the
new model are introduced in Section 3.4.




3.1    The CIQ Web Search Engine Model
In this section, a description of the proposed Web search engine model is presented. The
new model incorporates two bit-level data compression layers, both installed at the back-
end processor, one for index compression (index compressor) and one for query
compression (query compressor or keyword compressor), so that the search process can
be performed at the compressed index-query level and avoid any decompression
activities, therefore, we refer to it as the compressed index-query (CIQ) Web search
engine model or simply the CIQ model.

In order to be able to perform the search process at the CIQ level, it is important to have a
data compression technique that is capable of producing the same pattern for the same
character from both the index and the query. The HCDC algorithm [Bah 07b, Bah 08a]
which will be described in the next section, satisfies this important feature, and it will be
used at the compression layers in the new model. Figure (3.1) outlines the main
components of the new CIQ model and where the compression layers are located.

It is believed that introducing the new compression layers reduces the disk space required
for storing index files, and increases query throughput and consequently the retrieval rate.
On the other hand, compressing the search query reduces I/O overheads and query
processing time as well as the system response time.

The CIQ model works as follows: At the back-end processor, after the indexer generates
the index, and before sending it to the index storage device, the index is kept in temporary
memory, where a lossless bit-level compression is applied using the HCDC algorithm;
the compressed index file is then sent to the storage device. As a result, the index requires
less disk space, enabling more documents to be indexed and accessed in comparatively
less CPU time.

The HCDC algorithm creates a compressed-file header (compression header) to store
some parameters that are relevant to compression process, which mainly include the
character-to-binary coding pattern. This header should be stored separately to be accessed
by the query compression layer (query compressor).




On the other hand, the query parser, instead of passing the query directly to the index file,
passes it to the query compressor before accessing the index file. In order to produce the
same binary pattern for the same compressed characters from the index and the query,
the character-to-binary codes used in converting the index file are passed to the query
compressor. If a match is found, the retrieved data is decompressed using the index
decompressor, and passed through the ranker and the search engine interface to the end-
user.
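
A highly simplified Python sketch of this workflow is given below: a shared character-to-binary coding (standing in for the compression header) is used to encode both the index terms and the query keyword, and matching is performed directly on the encoded patterns. The coding function and data structures are placeholders assumed for this example, not the actual HCDC layers.

    # Shared "compression header": one character-to-binary coding used by both layers.
    coding = {ch: format(i, "05b") for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}

    def encode(term):
        """Placeholder for the bit-level compression layer (same coding for index and query)."""
        return "".join(coding[ch] for ch in term)

    # Back-end: build the inverted index, then store the encoded (compressed) terms.
    inverted_index = {"search": [1, 3], "engine": [1], "compression": [2, 3]}
    compressed_index = {encode(term): postings for term, postings in inverted_index.items()}

    # Front-end: the parsed query keyword is compressed with the same coding, so the
    # lookup happens entirely at the compressed index-query level.
    def ciq_search(keyword):
        return compressed_index.get(encode(keyword), [])

    print(ciq_search("compression"))   # -> [2, 3]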




 Figure (3.1). Architecture and main components of the CIQ Web search engine model.




3.2      Implementation of the CIQ Model: The CIQ-based Test Tool
         (CIQTT)
This section describes the implementation of a CIQ-based test tool (CIQTT), which is
developed to:

      (1) Validate the accuracy and integrity of the retrieved data, ensuring that the same
         data sets can be retrieved using the new CIQ model.

      (2) Evaluate the performance of the CIQ model, estimating the reduction in the index
         file storage requirement and in processing or search time.

The CIQTT consists of six main procedures; these are:

      (1) COLCOR: Collecting the testing corpus (documents).

      (2) PROCOR: Processing and analyzing the testing corpus (documents).

      (3) INVINX: Building the inverted index and start indexing.

      (4) COMINX: Compressing the inverted index.

      (5) SRHINX: Searching the index file (inverted or inverted/compressed index).

      (6) COMRES: Comparing the outcomes of different search processes performed by
          SRHINX procedure.

In what follows, we shall provide a brief description for each of the above procedures.

3.2.1    COLCOR: Collects the testing corpus (documents)

In this procedure, the Nutch crawler [Web 6] is used to collect the targeted corpus
(documents). Nutch is an open-source search technology initially developed by Doug
Cutting, an advocate and creator of open-source search technology. He originated Lucene
and, with Mike Cafarella, Nutch, both open-source search projects which are now
managed through the Apache Software Foundation (ASF). Nutch builds on Lucene
[Web 7] and Solr [Web 8].


Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

List of Tables

Table                                  Title                                Page
 1.1     Document ID and its contents.                                         8
 1.2     A record and word level inverted indexes for documents in             8
         Table (1.1).
 3.1     List of most popular stopwords (117 stop-words).                     47
 3.2     Type of character sets and equivalent maximum number of IDs          47
 3.4     Variation of Cmin, Cmax, and r1 with number of parity bits (p).      58
 3.6     Variations of C with respect to r for various values of p.           59
 3.7     Valid 7-bit codewords.                                               61
 3.8     The HCDC algorithm compressed file header.                           64
 4.1     List of visited Websites                                             71
 4.2     The sizes of the generated indexes.                                  72
 4.3     Type and number of characters in each generated inverted             73
         index file.
 4.4     Type and frequency of characters in each generated inverted          74
         index file.
 4.5     Values of C and Rs for different sizes index files.                  75
 4.6     Performance analysis and implementation validation.                  77
 4.7     List of keywords.                                                    78
 4.8     Values of No, Nc, To, Tc, Sf and Rt for 1000 index file              79
 4.9     Values of No, Nc, To, Tc, Sf and Rt for 10000 index file             80
 4.10    Values of No, Nc, To, Tc, Sf and Rt for 25000 index file             81
 4.11    Values of No, Nc, To, Tc, Sf and Rt for 50000 index file             82
 4.12    Values of No, Nc, To, Tc, Sf and Rt for 75000 index file             83
 4.13    Variation of Sf for different index sizes and keywords.              85
 4.14    Variation of No and Nc for different index sizes and keywords.       86
 4.15    Variation of To and Tc for different index sizes and keywords.       87
 4.16    Values of C, Rs, average Sf, and average Rt for different sizes      88
         index files.
Abbreviations

ACW      Adaptive Character Wordlength
API      Application Programming Interface
ASCII    American Standard Code for Information Interchange
ASF      Apache Software Foundation
BWT      Burrows-Wheeler block sorting transform
CIQ      Compressed index-query
CPU      Central Processing Unit
DBA      Database Administrator
FLH      Fixed-Length Hamming
GFS      Google File System
GZIP     GNU zip
HCDC     Hamming Code Data Compression
HTML     Hypertext Mark-up Language
ID3      A metadata container used in conjunction with the MP3 audio file format
JSON     JavaScript Object Notation
LAN      Local Area Network
LANMAN   Microsoft LAN Manager
LDPC     Low-Density Parity Check
LZW      Lempel-Ziv-Welch
MP3      A patented digital audio encoding format
NTLM     Windows NT LAN Manager
PDF      Portable Document Format
RLE      Run-Length Encoding
RSS      Really Simple Syndication
RTF      Rich Text Format
SAN      Storage Area Network
SASE     Shrink And Search Engine
SP4      Windows Service Pack 4
UNIX     UNiplexed Information and Computing Service
URL      Uniform Resource Locator
XML      Extensible Markup Language
ZIP      A data compression and archive format; the name zip means "speed"
Table of Contents

Authorization - ii -
Dedications - iii -
Acknowledgments - iv -
List of Figures - v -
List of Tables - vi -
Abbreviations - vii -
Table of Contents - viii -
Abstract - x -
Chapter One - 1 -
Introduction - 1 -
1.1 Web Search Engine Model - 3 -
1.1.1 Web crawler - 3 -
1.1.2 Document analyzer and indexer - 4 -
1.1.3 Searching process - 9 -
1.2 Challenges to Web Search Engines - 10 -
1.3 Data Compression Techniques - 12 -
1.3.1 Definition of data compression - 12 -
1.3.2 Data compression models - 12 -
1.3.3 Classification of data compression algorithms - 14 -
1.3.4 Performance evaluation parameters - 17 -
1.4 Current Trends in Building High-Performance Web Search Engine - 20 -
1.5 Statement of the Problem - 20 -
1.6 Objectives of this Thesis - 21 -
1.7 Organization of this Thesis - 21 -
Chapter Two - 23 -
Literature Review - 23 -
2.1 Trends Towards High-Performance Web Search Engine - 23 -
2.1.1 Succinct data structure - 23 -
2.1.2 Compressed full-text self-index - 24 -
2.1.3 Query optimization - 24 -
2.1.4 Efficient architectural design - 25 -
2.1.5 Scalability - 25 -
2.1.6 Semantic search engine - 26 -
2.1.7 Using Social Networks - 26 -
2.1.8 Caching - 27 -
2.2 Recent Research on Web Search Engine - 27 -
2.3 Recent Research on Bit-Level Data Compression Algorithms - 33 -
Chapter Three - 39 -
The Novel CIQ Web Search Engine Model - 39 -
3.1 The CIQ Web Search Engine Model - 40 -
3.2 Implementation of the CIQ Model: CIQ-based Test Tool (CIQTT) - 42 -
3.2.1 COLCOR: Collects the testing corpus (documents) - 42 -
3.2.2 PROCOR: Processing and analyzing testing corpus (documents) - 46 -
3.2.3 INVINX: Building the inverted index and start indexing - 46 -
3.2.4 COMINX: Compressing the inverted index - 50 -
3.2.5 SRHINX: Searching index (inverted or inverted/compressed index) - 51 -
3.2.6 COMRES: Comparing the outcomes of different search processes performed by SRHINX procedure - 52 -
3.3 The Bit-Level Data Compression Algorithm - 52 -
3.3.1 The HCDC algorithm - 52 -
3.3.2 Derivation and analysis of HCDC algorithm compression ratio - 56 -
3.3.3 The Compressed File Header - 63 -
3.4 Implementation of the HCDC algorithm in CIQTT - 65 -
3.5 Performance Measures - 66 -
Chapter Four - 68 -
Results and Discussions - 68 -
4.1 Test Procedures - 69 -
4.2 Determination of the Compression Ratio (C) & the Storage Reduction Factor (Rs) - 70 -
4.2.1 Step 1: Collect the testing corpus using COLCOR procedure - 70 -
4.2.2 Step 2: Process and analyze the corpus to build the inverted index file using PROCOR and INVINX procedures - 72 -
4.2.3 Step 3: Compress the inverted index file using the INXCOM procedure - 72 -
4.3 Determination of the Speedup Factor (Sf) and the Time Reduction Factor (Rt) - 77 -
4.3.1 Choose a list of keywords - 77 -
4.3.2 Perform the search processes - 78 -
4.3.3 Determine Sf and Rt - 84 -
4.4 Validation of the Accuracy of the CIQ Web Search Model - 88 -
4.5 Summary of Results - 88 -
Chapter Five - 91 -
Conclusions and Recommendations for Future Work - 91 -
5.1 Conclusions - 91 -
5.2 Recommendations for Future Work - 93 -
References - 94 -
Appendix I - 105 -
Appendix II - 108 -
Appendix III - 112 -
Appendix IV - 115 -
Abstract

A Web search engine is an information retrieval system designed to help find information stored on the Web. A standard Web search engine consists of three main components: a Web crawler, a document analyzer and indexer, and a search processor. Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges in terms of storage capacity, data retrieval rate, query processing time, and communication overhead. Large search engines, in particular, have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations, including succinct data structures, compressed text indexing, query optimization, high-speed processing and communication systems, and efficient search engine architectural design. However, it is believed that the performance of current Web search engine models still falls short of meeting users' and applications' needs.

In this work we develop a novel Web search engine model based on index-query compression; therefore, it is referred to as the compressed index-query (CIQ) model. The model incorporates two compression layers, both implemented at the back-end processor (server) side: one layer resides after the indexer, acting as a second compression layer to generate a doubly compressed index, while the second layer is located after the query parser and compresses the query to enable compressed index-query search. The data compression algorithm used is the novel Hamming code data compression (HCDC) algorithm. The different components of the CIQ model are implemented in a number of procedures forming what is referred to as the CIQ test tool (CIQTT), which is used as a test bench to validate the accuracy and integrity of the retrieved data and to evaluate the performance of the CIQ model.

The results obtained demonstrate that the new CIQ model attains an excellent performance compared to the current uncompressed model: the CIQ model achieved 100% agreement with the current uncompressed model. The new model demands less disk space, as the HCDC algorithm achieves a compression ratio over 1.3 with a compression efficiency of more than 95%, which implies a reduction in storage requirement of over 24%. The new CIQ model also performs faster than the current model, as it achieves a speedup factor of over 1.3, providing a reduction in processing time of over 24%.
Chapter One
Introduction

A search engine is an information retrieval system designed to help find files stored on a computer, for example, a public server on the World Wide Web (or simply the Web), a server on a private network of computers, or a stand-alone computer [Bri 98]. The search engine allows us to search the storage media for certain content in the form of text meeting specific criteria (typically those containing a given word or phrase) and to retrieve a list of files that match those criteria. In this work, we are concerned with the type of search engine that is designed to help find files stored on the Web (the Web search engine).

Webmasters and content providers began optimizing sites for Web search engines in the mid-1990s, as the first search engines were cataloging the early Web. Initially, all a webmaster needed to do was submit the address of a page, or the uniform resource locator (URL), to the various engines, which would send a spider to crawl that page, extract links to other pages from it, and return the information found on the page to be indexed [Bri 98]. The process involves a search engine crawler downloading a page and storing it on the search engine's own server, where a second program, known as an indexer, extracts various information about the page, such as the words it contains and where they are located, as well as any weight for specific words and all the links the page contains, which are then placed into a scheduler for crawling at a later date [Web 4].

A standard search engine consists of the following main components: Web crawler, document analyzer and indexer, and searching process [Bah 10d]. The main purpose of using a certain data structure for searching is to construct an index that allows focusing the search for a given keyword (query). The improvement in query performance is paid for by the additional space necessary to store the index. Therefore, most of the research in this field has been directed at designing data structures which offer a good trade-off between query and update time versus space usage. For this reason compression always appears as an attractive choice, if not a mandatory one. However, space overhead is not the only resource to be optimized when managing large data collections; in fact, data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information. Approaches that combine compression and indexing techniques are nowadays receiving more and more attention. A first step towards the design of a compressed full-text index is achieving guaranteed performance and lossless data [Fer 01].

In the light of the significant increase in CPU speed, it has become more economical to store data in compressed form than uncompressed. Storing data in a compressed form may introduce significant improvement in space occupancy and also in processing time, because space optimization is closely related to time optimization in a disk memory (improved processing time) [Fer 01].

There are a number of trends that have been identified in the literature for building high-performance search engines, such as succinct data structures, compressed full-text self-indexes, query optimization, and high-speed processing and communication systems. Starting from these promising trends, many researchers have tried to combine text compression with indexing techniques and searching algorithms. They have mainly investigated and analyzed the compressed matching problem under various compression schemes [Fer 01].

Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges, in terms of: (i) storage capacity, (ii) data retrieval rate, (iii) query processing time, and (iv) communication overhead. The large engines, in particular, have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations, including index compression. With the tremendous increase in users' and applications' needs, we believe that the current search engine model requires more retrieval performance, and that more compact and cost-effective systems are still required.

In this work we develop a novel Web search engine model that is based on index-query bit-level compression. The model incorporates two bit-level data compression layers, both implemented at the back-end processor side: one after the indexer, acting as a second compression layer to generate a doubly compressed index, and the other after the query parser, compressing the query to enable bit-level compressed index-query search. As a result, less disk space is required to store the compressed index file, disk I/O overheads are reduced, and consequently a higher retrieval rate or performance is obtained.

An important feature required of the bit-level technique used to perform the search process at the compressed index-query level is that it generates the same compressed binary sequence for the same character in both the search queries and the index files. The data compression technique that satisfies this important feature is the HCDC algorithm [Bah 07b, Bah 08a]; therefore, it is used in this work. Recent investigations on using this algorithm for text compression have demonstrated an excellent performance in comparison with many widely used and well-known data compression algorithms and state-of-the-art tools [Bah 07b, Bah 08a].
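To make the compressed index-query idea concrete, the sketch below compresses index terms and query terms with one shared character-level codec and matches them directly in compressed form. The codec, the function names (toy_bit_codec, build_compressed_index, search_compressed), and the 6-bit mapping are illustrative assumptions only; they merely demonstrate the required property that identical characters always produce identical compressed sequences, and are not the HCDC algorithm described in Chapter 3.

# Illustrative sketch of the CIQ idea: compress index terms and the query
# with the same character-level codec, then match on the compressed forms.
def toy_bit_codec(text: str) -> str:
    """Map each character to a fixed 6-bit pattern (instead of 8 bits).
    Stand-in for a real bit-level compressor; the masking is not uniquely
    decodable over all of ASCII and is used here only for demonstration."""
    return "".join(format(ord(ch) & 0x3F, "06b") for ch in text)

def build_compressed_index(postings: dict[str, list[int]]) -> dict[str, list[int]]:
    """Compress every index term once; posting lists are kept as-is."""
    return {toy_bit_codec(term): docs for term, docs in postings.items()}

def search_compressed(index: dict[str, list[int]], query_term: str) -> list[int]:
    """Compress the query term and look it up directly in the compressed index."""
    return index.get(toy_bit_codec(query_term), [])

if __name__ == "__main__":
    postings = {"aqaba": [1, 3, 5], "amman": [2, 4, 6], "city": [1, 2, 4]}
    cindex = build_compressed_index(postings)
    print(search_compressed(cindex, "amman"))   # -> [2, 4, 6]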
1.1 Web Search Engine Model

A Web search engine is an information retrieval system designed to help find files stored on a public server on the Web [Bri 98, Mel 00]. A standard Web search engine consists of the following main components:
• Web crawler
• Document analyzer and indexer
• Searching process
In what follows we provide a brief description of each of the above components.

1.1.1 Web crawler

A Web crawler is a computer program that browses the Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders and Web robots. Unfortunately, each spider has its own personal agenda as it indexes a site. Some search engines use the META keywords tag; others may use the META description of a page, and some use the first sentence or paragraph on the site. This means that a page that ranks high on one Web search engine may not rank as well on another. Given a set of URLs (uniform resource locators), the crawler repeatedly removes one URL from the set, downloads the targeted page, extracts all the URLs contained in it, and adds all previously unknown URLs to the set [Bri 98, Jun 00]; a minimal sketch of this crawl loop is given at the end of this subsection.

Web search engines work by storing information about many Web pages, which they retrieve from the Web itself. These pages are retrieved by a spider, a sophisticated Web browser that follows every link extracted or stored in its database. The contents of each page are then analyzed to determine how it should be indexed; for example, words are extracted from the titles, headings, or special fields called META tags.
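The following is a minimal, illustrative version of that crawl loop. It is not the corpus-collection procedure (COLCOR) used later in this thesis: the frontier handling, link extraction, and error handling are deliberately simplified, and fetch_page and extract_urls are assumed helper names.

# Minimal sketch of the crawl loop: take a URL from the frontier, fetch the
# page, extract its links, and add previously unknown links to the frontier.
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch_page(url: str) -> str:
    """Download a page; a real crawler adds timeouts, retries, robots.txt checks."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def extract_urls(base_url: str, html: str) -> set[str]:
    """Very rough href extraction; a real crawler would use an HTML parser."""
    return {urljoin(base_url, m) for m in re.findall(r'href="([^"#]+)"', html)}

def crawl(seed_urls: set[str], max_pages: int = 100) -> dict[str, str]:
    frontier, seen, pages = set(seed_urls), set(), {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop()                         # remove one URL from the set
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch_page(url)
        except OSError:
            continue
        pages[url] = html                            # hand the page to the indexer
        frontier |= extract_urls(url, html) - seen   # add previously unknown URLs
    return pages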
1.1.2 Document analyzer and indexer

Indexing is the process of creating an index, which is a specialized file containing a compiled version of the documents retrieved by the spider [Bah 10d]. The indexing process collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, mathematics, informatics, physics and computer science [Web 5].

The purpose of storing an index is to optimize the speed and performance of finding relevant documents for a search query. Without an index, the search engine would scan every (possible) document on the Internet, which would require considerable time and computing power (impossible with the current Internet size). For example, while an index of 10000 documents can be queried within milliseconds, a sequential scan of every word in the documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval [Web 5].

Index design factors

Major factors that should be carefully considered when designing a search engine include [Bri 98, Web 5]:
• Merge factors: how data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL MERGE command and other merge algorithms.
• Storage techniques: how to store the index data, that is, whether the information should be compressed or filtered.
• Index size: how much computer storage is required to support the index.
• Lookup speed: how quickly a word can be found in the index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.
• Maintenance: how the index is maintained over time.
• Fault tolerance: how important it is for the service to be robust. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning and schemes such as hash-based or composite partitioning, as well as replication.

Index data structures

Search engine architectures vary in the way indexing is performed and in the methods of index storage used to meet the various design factors. There are many architectures for indexes, and the most widely used is the inverted index. An inverted index stores a list of occurrences of every keyword, typically in the form of a hash table or binary tree [Bah 10c].
During indexing, several processes take place; here only the processes related to our work are discussed. Which of these processes are actually used depends on the search engine configuration [Bah 10d]; a small sketch of such a processing pipeline is given after this list.
• Extract URLs. A process of extracting all URLs from the document being indexed; it is used to guide crawling of the website, do link checking, build a site map, and build a table of internal and external links from the page.
• Code stripping. A process of removing hyper-text markup language (HTML) tags, scripts, and styles, and decoding HTML character references and entities used to embed special characters.
• Language recognition. A process by which a computer program attempts to automatically identify, or categorize, the language or languages in which a document is written.
• Document tokenization. A process of detecting the encoding used for the page; determining the language of the content (some pages use multiple languages); finding word, sentence and paragraph boundaries; combining multiple adjacent words into one phrase; and changing the case of the text.
• Document parsing or syntactic analysis. The process of analyzing a sequence of tokens (for example, words) to determine their grammatical structure with respect to a given (more or less) formal grammar.
• Lemmatization/stemming. The process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form; this stage can be done in the indexing and/or searching stage. The stem does not need to be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The process is useful in search engines for query expansion or indexing and other natural language processing problems.
• Normalization. The process by which text is transformed in some way to make it consistent in a way it might not have been before. Text normalization is often performed before text is processed in some other way, such as generating synthesized speech, automated language translation, storage in a database, or comparison.
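The sketch below strings a few of these steps together (code stripping, tokenization, stop-word removal, and a crude suffix-stripping stemmer). It only illustrates the order of the pipeline; the tiny stop-word set and the stemming rule are toy assumptions and do not correspond to the 117-stop-word list or the PROCOR procedure used later in this thesis.

# Toy document-processing pipeline: strip markup, tokenize, drop stop-words,
# and apply a crude suffix-stripping "stemmer".
import re

STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}   # illustrative only

def strip_code(html: str) -> str:
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", text)           # remove remaining tags

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())  # lowercase word tokens

def stem(token: str) -> str:
    for suffix in ("ing", "ed", "es", "s"):        # naive rule, not a real stemmer
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(html: str) -> list[str]:
    return [stem(t) for t in tokenize(strip_code(html)) if t not in STOP_WORDS]

print(analyze("<p>Aqaba is a hot city</p>"))       # -> ['aqaba', 'hot', 'city']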
Inverted Index

The inverted index structure is widely used in modern, very fast Web search engines such as Google, Yahoo, Lucene and other major search engines. An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The main purpose of using the inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the index [Bri 98, Nag 02, Web 4]. The inverted index is one of the most used data structures in information retrieval systems [Web 4, Bri 98]. There are two main variants of inverted indexes [Bae 99]:
(1) A record-level inverted index (or inverted file index, or just inverted file) contains a list of references to documents for each word; we use this simple type in our search engine.
(2) A word-level inverted index (or full inverted index, or inverted list) additionally contains the positions of each word within a document; these positions can be used to rank the results according to document relevancy to the query.
The latter form offers more functionality (such as phrase searches), but needs more time and space to be created. In order to simplify the understanding of the above two inverted indexes, let us consider the following example.
Example

Let us consider a case in which six documents have the text shown in Table (1.1). The contents of the record-level and word-level indexes are shown in Table (1.2).

Table (1.1) Document ID and its contents.
Document ID    Text
1              Aqaba is a hot city
2              Amman is a cold city
3              Aqaba is a port
4              Amman is a modern city
5              Aqaba in the south
6              Amman in Jordan

Table (1.2) A record and word level inverted indexes for documents in Table (1.1).
Record level inverted index        Word level inverted index
Text       Documents               Text       Documents: Location
Aqaba      1, 3, 5                 Aqaba      1:1, 3:1, 5:1
is         1, 2, 3, 4              is         1:2, 2:2, 3:2, 4:2
a          1, 2, 3, 4              a          1:3, 2:3, 3:3, 4:3
hot        1                       hot        1:4
city       1, 2, 4                 city       1:5, 2:5, 4:5
Amman      2, 4, 6                 Amman      2:1, 4:1, 6:1
cold       2                       cold       2:4
the        5                       the        5:3
modern     4                       modern     4:4
south      5                       south      5:4
in         5, 6                    in         5:2, 6:2
Jordan     6                       Jordan     6:3
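As an illustration, the following sketch builds both index variants for the documents of Table (1.1) and answers a single-keyword query against each. It is a toy in-memory construction, not the INVINX indexing procedure used later in this work.

# Build record-level and word-level inverted indexes for the documents of
# Table (1.1), then answer a single-keyword query against each of them.
from collections import defaultdict

docs = {
    1: "Aqaba is a hot city",
    2: "Amman is a cold city",
    3: "Aqaba is a port",
    4: "Amman is a modern city",
    5: "Aqaba in the south",
    6: "Amman in Jordan",
}

record_index = defaultdict(set)        # word -> {doc_id, ...}
word_index = defaultdict(list)         # word -> [(doc_id, position), ...]

for doc_id, text in docs.items():
    for pos, word in enumerate(text.split(), start=1):
        record_index[word].add(doc_id)
        word_index[word].append((doc_id, pos))

print(sorted(record_index["Amman"]))   # -> [2, 4, 6]
print(word_index["Amman"])             # -> [(2, 1), (4, 1), (6, 1)]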
When we search for the word "Amman", we get three results, which are documents 2, 4 and 6 if a record-level inverted index is used, and 2:1, 4:1 and 6:1 if a word-level inverted index is used. In this work, the record-level inverted index is used for its simplicity and because we do not need to rank our results.

1.1.3 Searching process

When the index is ready, searching can be performed through a query interface: a user enters a query into the search engine (typically by using keywords), and the engine examines its index and provides a listing of the best-matching Web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text [Bah 10d]. In this stage the results are ranked, where ranking is a relationship between a set of items such that, for any two items, the first is either "ranked higher than", "ranked lower than" or "ranked equal to" the second. In mathematics, this is known as a weak order or total pre-order of objects. It is not necessarily a total order of documents, because two different documents can have the same ranking. Ranking is done according to document relevancy to the query, freshness and popularity [Bri 98]. Figure (1.1) outlines the architecture and main components of the standard search engine model.
Figure (1.1). Architecture and main components of standard search engine model.

1.2 Challenges to Web Search Engines

Building and operating a large-scale Web search engine used by hundreds of millions of people around the world provides a number of interesting challenges [Hen 03, Hui 09, Ois 10, Ami 05]. Designing such systems requires making complex design trade-offs in a number of dimensions, and the main challenges to designing an efficient, effective, and reliable Web search engine are:
• The Web is growing much faster than any present-technology search engine can possibly index.
• The cost of index storage, which includes data storage cost, electricity, and cooling of the data center.
• The real-time Web, which is updated in real time, requires a fast and reliable crawler and then indexing of this content to make it searchable.
• Many Web pages are updated frequently, which forces the search engine to revisit them periodically.
• Query time (latency): the need to keep up with the increase of index size and to perform the query and show the results in less time.
• Most search engines use keywords for searching, and this limits the results to text pages only.
• Dynamically generated sites, which may be slow or difficult to index, or may result in excessive results from a single site.
• Many dynamically generated sites are not indexable by search engines; this phenomenon is known as the invisible Web.
• Several content types are not crawlable and indexable by search engines, such as multimedia and Flash content.
• Some sites use tricks to manipulate the search engine into displaying them as the first result returned for some keywords; this is known as spamming. It can lead to some search results being polluted, with more relevant links being pushed down in the result list.
• Duplicate hosts: Web search engines try to avoid having duplicate and near-duplicate pages in their collection, since such pages increase the time it takes to add useful content to the collection.
• Web graph modeling: the open problem is to come up with a random graph model that models the behavior of the Web graph at the page and host level.
• Scalability: search engine technology should scale dramatically to keep up with the growth of the Web.
• Reliability: a search engine requires reliable technology to support its 24-hour operation and meet users' needs.
1.3 Data Compression Techniques

This section presents definitions, models, classification methodologies and classes, and performance evaluation measures of data compression algorithms. Further details on data compression can be found in [Say 00].

1.3.1 Definition of data compression

Data compression algorithms are designed to reduce the size of data so that it requires less disk space for storage and less memory [Say 00]. Data compression is usually obtained by substituting a shorter symbol for an original symbol in the source data, containing the same information but with a smaller representation in length. The symbols may be characters, words, phrases, or any other unit that may be stored in a dictionary of symbols and processed by a computing system. A data compression algorithm usually utilizes an efficient algorithmic transformation of the data representation to produce a more compact representation. Such an algorithm is also known as an encoding algorithm. It is important to be able to restore the original data, either in an exact or an approximate form; therefore a data decompression algorithm, also known as a decoding algorithm, is required.

1.3.2 Data compression models

There are a number of data compression algorithms that have been developed throughout the years. These algorithms can be categorized into four major categories of data compression models [Rab 08, Hay 08, Say 00]:
1. Substitution data compression model
2. Statistical data compression model
3. Dictionary-based data compression model
4. Bit-level data compression model
In substitution compression techniques, a shorter representation is used to replace a sequence of repeating characters. Examples of substitution data compression techniques include null suppression [Pan 00], run-length encoding [Smi 97], bit mapping and half-byte packing [Pan 00]; a short run-length encoding sketch is given below.

In statistical techniques, the characters in the source file are converted to a binary code, where the most common characters in the file have the shortest binary codes and the least common have the longest; the binary codes are generated based on the estimated probability of each character within the file. Then, the binary coded file is compressed using an 8-bit character wordlength, or by applying the adaptive character wordlength (ACW) algorithm [Bah 08b, Bah 10a] or its variation, the ACW(n,s) scheme [Bah 10a]. Examples of statistical data compression techniques include Shannon-Fano coding [Rue 06], static/adaptive/semi-adaptive Huffman coding [Huf 52, Knu 85, Vit 89], and arithmetic coding [How 94, Wit 87].

Dictionary-based data compression techniques involve the substitution of sub-strings of text by indices or pointer codes, relative to a dictionary of the sub-strings, such as Lempel-Ziv-Welch (LZW) [Ziv 78, Ziv 77, Nel 89]. Many compression algorithms use a combination of different data compression techniques to improve compression ratios.
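As a concrete instance of the substitution model, the following is a minimal run-length encoder/decoder. It is a generic textbook sketch, not one of the cited schemes, and it assumes the simple case in which run lengths are emitted directly as counts.

# Minimal run-length encoding (RLE): replace a run of identical characters
# by (count, character) pairs, and reverse the substitution on decoding.
from itertools import groupby

def rle_encode(text: str) -> list[tuple[int, str]]:
    return [(len(list(group)), ch) for ch, group in groupby(text)]

def rle_decode(pairs: list[tuple[int, str]]) -> str:
    return "".join(ch * count for count, ch in pairs)

data = "aaaabbbcccccd"
encoded = rle_encode(data)          # -> [(4, 'a'), (3, 'b'), (5, 'c'), (1, 'd')]
assert rle_decode(encoded) == data  # lossless round trip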
Finally, since data files can be represented in binary digits, bit-level processing can be performed to reduce the size of the data. A data file can be represented in binary digits by concatenating the binary sequences of the characters within the file using a specific mapping or coding format, such as ASCII codes, Huffman codes, adaptive codes, etc. The coding format has a huge influence on the entropy of the generated binary sequence and consequently on the compression ratio (C) or the coding rate (Cr) that can be achieved. The entropy is a measure of the information content of a message and of the smallest number of bits per character needed, on average, to represent the message. Therefore, the entropy of a complete message is the sum of the individual characters' entropy. The entropy of a character (symbol) is represented as the negative logarithm of its probability, expressed using base two. Where the probability of each symbol of the alphabet is constant, the entropy is calculated as [Bel 89, Bel 90]:

E = − Σ_{i=1}^{n} p_i log_2(p_i)        (1.1)

where E is the entropy in bits, p_i is the estimated probability of occurrence of character (symbol) i, and n is the number of characters. In bit-level processing, n is equal to 2, as we have only two characters (0 and 1).

In bit-level data compression algorithms, the binary sequence is usually divided into groups of bits that are called minterms, blocks, subsequences, etc. In this work we use the term blocks to refer to each group of bits. These blocks might be considered as representing a Boolean function. Algebraic simplifications are then performed on these Boolean functions to reduce the size or the number of blocks, and hence the number of bits representing the output (compressed) data is reduced as well. Examples of such algorithms include the Hamming code data compression (HCDC) algorithm [Bah 07b, Bah 08a], the adaptive HCDC(k) scheme [Bah 07a, Bah 10b, Rab 08], the adaptive character wordlength (ACW) algorithm [Bah 08b, Bah 10a], the ACW(n,s) scheme [Bah 10a], the Boolean function algebraic simplification algorithm [Nof 07], the fixed-length Hamming (FLH) algorithm [Sha 04], and the neural network based algorithm [Mah 00].
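Equation (1.1) can be evaluated directly from symbol frequencies. The short sketch below estimates the zero-order entropy of a string and of its equivalent binary (bit-level) sequence; it is only a worked illustration of the formula, with probabilities estimated from observed frequencies.

# Zero-order entropy E = -sum(p_i * log2 p_i), with p_i estimated from
# observed symbol frequencies (Eq. 1.1).
from collections import Counter
from math import log2

def entropy(symbols) -> float:
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

text = "engineering"
bits = "".join(format(ord(ch), "08b") for ch in text)   # 8-bit ASCII expansion

print(round(entropy(text), 3))   # entropy per character of the text
print(round(entropy(bits), 3))   # bit-level entropy (n = 2 symbols: '0' and '1')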
1.3.3 Classification of data compression algorithms

Data compression algorithms are categorized by several characteristics, such as:
• Data compression fidelity
• Length of data compression symbols
• Data compression symbol table
• Data compression processing time
In what follows, a brief definition is given for each of the above classification criteria.

Data compression fidelity

Basically, data compression can be classified into two fundamentally different styles of data compression, depending on the fidelity of the restored data. These are:

(1) Lossless data compression algorithms

In lossless data compression, a transformation of the representation of the original data set is performed such that it is possible to reproduce exactly the original data set by performing a decompression transformation. This type of compression is usually used for compressing text files, executable code, word processing files, database files, tabulation files, and whenever the original needs to be exactly restored from the compressed data. Many popular data compression applications have been developed utilizing lossless compression algorithms; for example, lossless compression algorithms are used in the popular ZIP file format and in the UNIX tool gzip. Lossless compression is mainly used for text and executable file compression, as in such files the data must be exactly retrieved, otherwise it is useless. It is also used as a component within lossy data compression technologies. It can usually achieve a compression ratio in the range of 2:1 to 8:1.
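A quick way to observe a lossless ratio of this order is to compress some text with a standard DEFLATE-based tool and compare sizes. The sketch below uses Python's zlib module; the exact ratio depends entirely on the input, so the 2:1 to 8:1 range quoted above is only indicative, and the sample string is an arbitrary placeholder.

# Measure a lossless compression ratio C = So / Sc using DEFLATE (zlib),
# the same family of algorithms behind ZIP and gzip.
import zlib

original = ("web search engine index compression " * 200).encode("ascii")
compressed = zlib.compress(original, level=9)

assert zlib.decompress(compressed) == original      # lossless: exact restoration
print(f"So = {len(original)} bytes, Sc = {len(compressed)} bytes, "
      f"C = {len(original) / len(compressed):.2f}")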
(2) Lossy data compression algorithms

In lossy data compression, a transformation of the representation of the original data set is performed such that an exact representation of the original data set cannot be reproduced; an approximate representation is reproduced instead by performing a decompression transformation. Lossy data compression is used in applications wherein an exact representation of the original data is not necessary, such as streaming multimedia on the Internet, telephony and voice applications, and some image file formats. Lossy compression can provide higher compression ratios of 100:1 to 200:1, depending on the type of information being compressed. In addition, a higher compression ratio can be achieved if more errors are allowed to be introduced into the original data [Lel 87].

Length of data compression symbols

Data compression algorithms can be classified, depending on the length of the symbols the algorithm can process, into fixed- and variable-length algorithms, regardless of whether the algorithm uses variable-length symbols in the original data, in the compressed data, or both. For example, run-length encoding (RLE) uses fixed-length symbols in both the original and the compressed data. Huffman encoding uses variable-length compressed symbols to represent fixed-length original symbols. Other methods compress variable-length original symbols into fixed-length or variable-length compressed data.

Data compression symbol table

Data compression algorithms can be classified as either static, adaptive, or semi-adaptive data compression algorithms [Rue 06, Pla 06, Smi 97]. In static compression algorithms, the encoding process is fixed regardless of the data content, while in adaptive algorithms the encoding process is data dependent. In semi-adaptive algorithms, the data to be compressed are first analyzed in their entirety, an appropriate model is then built, and afterwards the data is encoded. The model is stored as part of the compressed data, as it is required by the decompressor to reverse the compression.

Data compression/decompression processing time

Data compression algorithms can be classified according to the compression/decompression processing time as symmetric or asymmetric algorithms.
In symmetric algorithms, the compression and decompression processing times are almost the same, while in asymmetric algorithms the compression time is normally much longer than the decompression processing time [Pla 06].

1.3.4 Performance evaluation parameters

In order to be able to compare the efficiency of the different compression techniques reliably, and not allow extreme cases to cloud or bias a technique unfairly, certain issues need to be considered. The most important issues that need to be taken into account in evaluating the performance of the various algorithms include [Say 00]:
(1) Measuring the amount of compression
(2) Compression/decompression time (algorithm complexity)
These issues need to be carefully considered in the context in which the compression algorithm is used. Practically, factors such as finite memory, error control, type of data, and compression style (adaptive/dynamic, semi-adaptive or static) should also be considered when comparing different data compression algorithms.

(1) Measuring the amount of compression

Several parameters are used to measure the amount of compression that can be achieved by a particular data compression algorithm, such as:
(i) Compression ratio (C)
(ii) Reduction ratio (Rs)
(iii) Coding rate (Cr)
Definitions of these parameters are given below.

(i) Compression ratio (C)

The compression ratio (C) is defined as the ratio between the size of the data before compression and the size of the data after compression. It is expressed as:

C = S_o / S_c        (1.1)

where S_o is the size of the original (uncompressed) data and S_c is the size of the compressed data.

(ii) Reduction ratio (Rs)

The reduction ratio (Rs) represents the ratio of the difference between the size of the original data (S_o) and the size of the compressed data (S_c) to the size of the original data. It is usually given in percent and is mathematically expressed as:

R_s = ((S_o − S_c) / S_o) × 100%        (1.2)

which, in terms of the compression ratio, can be written as:

R_s = (1 − 1/C) × 100%        (1.3)

(iii) Coding rate (Cr)

The coding rate (Cr) expresses the same concept as the compression ratio, but it relates the ratio to a more tangible quantity. For example, for a text file, the coding rate may be expressed in "bits/character" (bpc), where an uncompressed text file has a coding rate of 7 or 8 bpc. In addition, the coding rate of an audio stream may be expressed in "bits/analogue", and for still image compression the coding rate is expressed in "bits/pixel". In general, the coding rate can be expressed mathematically as:

C_r = (q × S_c) / S_o        (1.4)

where q is the number of bits representing each symbol in the uncompressed file. The relationship between the coding rate (Cr) and the compression ratio (C), for example for a text file originally using 7 bpc, is given by:

C_r = 7 / C        (1.5)

It can be clearly seen from Eqn. (1.5) that a lower coding rate indicates a higher compression ratio.

(2) Compression/decompression time (algorithm complexity)

The compression/decompression time (which is an indication of the algorithm complexity) is defined as the processing time required to compress or decompress the data. The compression and decompression times have to be evaluated separately. As discussed in Section 1.3.3, data compression algorithms are classified according to the compression/decompression time into either symmetric or asymmetric algorithms. In this context, data storage applications are mainly concerned with the amount of compression that can be achieved and with the decompression processing time required to retrieve the data back (asymmetric algorithms), since in such applications compression is only performed once or is repeated infrequently. Data transmission applications focus predominantly on reducing the amount of data to be transmitted over communication channels, and both compression and decompression processing times matter equally at the respective junctions or nodes (symmetric algorithms) [Liu 05].

For a fair comparison between the different available algorithms, it is important to consider both the amount of compression and the processing time. Therefore, it would be useful to be able to parameterize the algorithm such that the compression ratio and processing time could be optimized for a particular application.
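The three amount-of-compression measures defined above can be computed directly from two file sizes. The sketch below does this for a pair of hypothetical index files; the sizes are made-up placeholders, and q is taken as 8 bpc.

# Compute compression ratio C, reduction ratio Rs, and coding rate Cr
# (Eqs. 1.1 to 1.5) from the original and compressed sizes of a file.
def compression_metrics(so: int, sc: int, q: int = 8) -> dict[str, float]:
    c = so / sc                          # C  = So / Sc
    rs = (1 - 1 / c) * 100               # Rs = (So - Sc) / So, in percent
    cr = q * sc / so                     # Cr = q * Sc / So, in bits per character
    return {"C": c, "Rs(%)": rs, "Cr(bpc)": cr}

# Hypothetical example: a 10 MB inverted index compressed to 7.5 MB.
print(compression_metrics(so=10_000_000, sc=7_500_000))
# -> {'C': 1.333..., 'Rs(%)': 25.0, 'Cr(bpc)': 6.0}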
There are extreme cases where data compression works very well and other conditions where it is inefficient; the type of data the original file contains and the upper limits on processing time have an appreciable effect on the efficiency of the technique selected. Therefore, it is important to select the most appropriate technique for a particular data profile in terms of both data compression and processing time [Rue 06].

1.4 Current Trends in Building High-Performance Web Search Engine

There are several major trends that can be identified in the literature for building high-performance Web search engines. A list of these trends is given below, and further discussion is given in Chapter 2; these trends include:
(1) Succinct data structure
(2) Compressed full-text self-index
(3) Query optimization
(4) Efficient architectural design
(5) Scalability
(6) Semantic search engine
(7) Using social networks
(8) Caching

1.5 Statement of the Problem

Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges in terms of storage capacity, data retrieval rate, query processing time, and communication overhead. Large search engines, in particular, have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimization techniques including index compression, and some obvious solutions to these issues are to develop more succinct data structures, compressed indexes, query optimization, and higher-speed processing and communication systems. We believe that the current search engine model cannot meet users' and applications' needs, and that more retrieval performance and more compact and cost-effective systems are still required.

The main contribution of this thesis is to develop a novel Web search engine model that is based on index-query compression; therefore, it is referred to as the CIQ Web search engine model, or simply the CIQ model. The model incorporates two bit-level compression layers, both implemented at the back-end processor side: one after the indexer, acting as a second compression layer to generate a doubly compressed index, and the other after the query parser, compressing the query to enable bit-level compressed index-query search. As a result, less disk space is required for storing the index file, disk I/O overheads are reduced, and consequently a higher retrieval rate or performance is obtained.

1.6 Objectives of this Thesis

The main objectives of this thesis can be summarized as follows:
• Develop a new Web search engine model that is as accurate as the current Web search engine model, requires less disk space for storing index files, performs the search process faster than current models, reduces disk I/O overheads, and consequently provides a higher retrieval rate or performance.
• Modify the HCDC algorithm to meet the requirements of the new CIQ model.
• Study and optimize the statistics of the inverted index files to achieve the maximum possible performance (compression ratio and minimum searching time).
• Validate the searching accuracy of the new CIQ Web search engine model.
• Evaluate and compare the performance of the new Web search engine model in terms of disk space requirement and query processing time (searching time) for different search scenarios.
1.7 Organization of this Thesis

This thesis is organized into five chapters. Chapter 1 provides an introduction to the general domain of this thesis. The rest of this thesis is organized as follows: Chapter 2 presents the literature review and summarizes some of the previous work related to Web search engines, in particular work related to enhancing the performance of the Web search engine through data compression at different levels. Chapter 3 describes the concept, methodology, and implementation of the novel CIQ Web search engine model. It also includes a detailed description of the HCDC algorithm and the modifications implemented to meet the needs of the new application. Chapter 4 presents a description of a number of scenarios simulated to evaluate the performance of the new Web search engine model; the effect of index file size on the performance of the new model is investigated and discussed. Finally, in Chapter 5, based on the results obtained from the different simulations, conclusions are drawn and recommendations for future work are pointed out.
Chapter Two
Literature Review

This work is concerned with the development of a novel high-performance Web search engine model that is based on compressing the index files and search queries using a bit-level data compression technique, namely, the novel Hamming codes based data compression (HCDC) algorithm [Bah 07b, Bah 08a]. In this model the search process is performed at the compressed index-query level. The model produces a double-compressed index file, which consequently requires less disk space to store the index files and reduces communication time; compressing the search query, in turn, reduces I/O overheads and increases the retrieval rate.

This chapter presents a literature review, which is divided into three sections. Section 2.1 presents a brief definition of the current trends towards enhancing the performance of Web search engines. Then, in Sections 2.2 and 2.3, we review some of the most recent and related work on Web search engines and on bit-level data compression algorithms, respectively.

2.1 Trends Towards High-Performance Web Search Engine

Chapter 1 lists several major trends that can be identified in the literature for building high-performance Web search engines. In what follows, we provide a brief definition of each of these trends.

2.1.1 Succinct data structure

Recent years have witnessed an increasing interest in succinct data structures. Their aim is to represent the data using as little space as possible, while still answering queries on the represented data efficiently. Several results exist on the representation of sequences [Fer 07, Ram 02], trees [Far 05], graphs [Mun 97], permutations and functions [Mun 03], and texts [Far 05, Nav 04]. One of the most basic structures, which lies at the heart of the representation of more complex ones, is the binary sequence with rank and select queries. Given a binary sequence S = s1 s2 ... sn, Rank_c(S, q) denotes the number of times the bit c appears in the prefix S[1, q] = s1 s2 ... sq, and Select_c(S, q) denotes the position in S of the q-th occurrence of bit c. The best results answer these queries in constant time, retrieve any s_q in constant time, and occupy nH0(S)+O(n) bits of storage, where H0(S) is the zero-order empirical entropy of S. This space bound includes that for representing S itself, so the binary sequence is represented in compressed form while still allowing those queries to be answered optimally [Ram 02]. For the general case of sequences over an arbitrary alphabet of size r, the only known result is the one in [Gro 03], which still achieves nH0(S)+O(n) space occupancy. The data structure in [Gro 03] is the elegant wavelet tree; it takes O(log r) time to answer Rank_c(S, q) and Select_c(S, q) queries and to retrieve any character s_q.
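For illustration, the following sketch shows the rank/select query semantics on a plain list of bits with simple block-level precomputation. It is only a demonstration of the interface, not a succinct structure: it uses O(n) words rather than nH0(S)+O(n) bits, and select is answered by scanning; the class and parameter names are ours, not taken from [Ram 02].

```python
# Illustrative (non-succinct) rank/select over a list of bits.
class RankSelect:
    def __init__(self, bits, block=64):
        self.bits = bits
        self.block = block
        # Prefix counts of 1-bits stored at block boundaries.
        self.block_ranks = [0]
        ones = 0
        for i, b in enumerate(bits, 1):
            ones += b
            if i % block == 0:
                self.block_ranks.append(ones)

    def rank(self, c, q):
        """Number of occurrences of bit c in the prefix bits[0:q]."""
        ones = self.block_ranks[q // self.block]
        ones += sum(self.bits[q - q % self.block:q])
        return ones if c == 1 else q - ones

    def select(self, c, k):
        """Position (0-based) of the k-th occurrence of bit c (k >= 1)."""
        seen = 0
        for i, b in enumerate(self.bits):
            if b == c:
                seen += 1
                if seen == k:
                    return i
        raise ValueError("fewer than k occurrences")

rs = RankSelect([1, 0, 1, 1, 0, 0, 1, 0])
assert rs.rank(1, 4) == 3      # three 1-bits among the first four positions
assert rs.select(0, 2) == 4    # the second 0-bit is at index 4
```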
2.1.2 Compressed full-text self-index

A compressed full-text self-index [Nav 07] represents a text in compressed form and still answers queries on it efficiently. This represents a significant advancement over the full-text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although this algorithmic technology is relatively new, it has matured to the point where theoretical research is giving way to practical developments. Nonetheless, exploiting these research results requires significant programming skills, a deep engineering effort, and a strong algorithmic background. To date, only isolated implementations and focused comparisons of compressed indexes have been reported, and they have lacked a common API, which has prevented their re-use or deployment within other applications.

2.1.3 Query optimization

Query optimization is an important skill for search engine developers and database administrators (DBAs). In order to improve the performance of search queries, developers and DBAs need to understand the query optimizer and the techniques it uses to select an access path and prepare a query execution plan. Query tuning involves knowledge of
techniques such as cost-based and heuristic-based optimizers, plus the tools a search platform provides for explaining a query execution plan [Che 01].

2.1.4 Efficient architectural design

Answering a large number of queries per second on a huge collection of data requires the equivalent of a small supercomputer, and all current major engines are based on large clusters of servers connected by high-speed local area networks (LANs) or storage area networks (SANs). There are two basic ways to partition an inverted index structure over the nodes:

• A local index organization, where each node builds a complete index on its own subset of documents (used by AltaVista and Inktomi).
• A global index organization, where each node contains complete inverted lists for a subset of the words.

Each scheme has advantages and disadvantages that we do not have space to discuss here; further discussion can be found in [Bad 02, Mel 00].

2.1.5 Scalability

Search engine technology has to scale dramatically to keep up with the growth of the Web [Bri 98]. In 1994, one of the first Web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 pages [Mcb 94]. At the end of 1997, the top search engines claimed to index from 2 million (WebCrawler) to 100 million Web documents [Bri 98]. In 2005, Google claimed to index 1.2 billion pages (as shown on the Google home page), and in July 2008 Google announced that it had hit a new milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the Web at once [Web 2]. At the same time, the number of queries search engines handle has grown rapidly too. In March and April 1994, the WWWW received an average of about 1,500 queries per day. In November 1997, AltaVista claimed it handled roughly 20 million queries per day. With the increasing number of users on the Web, and automated systems which query search engines, Google handled hundreds of millions of queries per day in 2000 and about 3
billion queries per day in 2009, and Twitter handled about 635 million queries per day [Web 1]. Creating a Web search engine which scales even to today's Web presents many challenges. Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indexes and, optionally, the documents themselves as cached pages. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

2.1.6 Semantic search engine

The semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work together in cooperation [Guh 03]. It is the idea of having data on the Web defined and linked in a way that it can be used for more effective discovery, automation, integration, and reuse across various applications. In particular, the semantic Web will contain resources corresponding not just to media objects (such as Web pages, images, audio clips, etc.), as the current Web does, but also to objects such as people, places, organizations and events. Further, the semantic Web will contain not just a single kind of relation (the hyperlink) between resources, but many different kinds of relations between the different types of resources mentioned above [Guh 03].

Semantic search attempts to augment and improve traditional search results (based on information retrieval technology) by using data from the semantic Web to produce precise answers to user queries. This can be done by taking advantage of the availability of explicit semantics of information in the context of the semantic Web search engine [Lei 06].

2.1.7 Using Social Networks

There is an increasing interest in social networks. In general, recent studies suggest that a person's social network has a significant impact on his or her information
acquisition [Kir 08]. It is an ongoing trend that people increasingly reveal very personal information on social network sites in particular and on the Web in general. As this information becomes more and more publicly available from these various social network sites and the Web in general, the social relationships between people can be identified, which in turn enables the automatic extraction of social networks. This trend is furthermore driven and enforced by recent initiatives such as Facebook Connect, MySpace's Data Availability and Google's Friend Connect, which make social network data available to anyone [Kir 08]. Combining social network data with search engine technology, to improve the relevancy of results to users and to increase the sociality of the results, is therefore one of the trends currently pursued by search engines such as Google and Bing. Microsoft and Facebook have announced a new partnership that brings Facebook data and profile search to Bing. The deal marks a big leap forward in social search and also represents a new advantage for Bing [Web 3].

2.1.8 Caching

Popular Web search engines receive around a hundred million queries per day and, for each search query, return a result page (or pages) to the user who submitted the query. The user may request additional result pages for the same query, submit a new query, or quit searching altogether. An efficient scheme for caching query result pages may enable search engines to lower their response time and reduce their hardware requirements [Lem 04]. Studies have shown that a small set of popular queries accounts for a significant fraction of the query stream. These statistical properties of the query stream seem to call for the caching of search results [Sar 01].

2.2 Recent Research on Web Search Engine

E. Moura et al. [Mou 97] presented a technique to build an index based on suffix arrays for compressed texts. They developed a compression scheme for textual databases, based on words, that generates a compression code preserving the lexicographical ordering of
the text words. As a consequence, it permits sorting the compressed strings to generate the suffix array without decompressing. Their results demonstrated that, as the compressed text is under 30% of the size of the original text, they were able to build the suffix array on the compressed text twice as fast. The compressed text plus index is 55-60% of the size of the original text plus index, and search times were reduced to approximately half. They presented analytical and experimental results for different variations of the word-oriented compression paradigm.

S. Varadarajan and T. Chiueh [Var 97] described a text search engine called shrink and search engine (SASE), which operates in the compressed domain. It provides an exact search mechanism using an inverted index and an approximate search mechanism using a vantage point tree. SASE allows a flexible trade-off between search time and the storage space required to maintain the search indexes. The experimental results showed that the compression efficiency is within 7-17% of GZIP, which is one of the best lossless compression utilities. The sum of the compressed file size and the inverted indexes is only between 55-76% of the original database, while the search performance is comparable to a fully inverted index.

S. Brin and L. Page [Bri 98] presented the Google search engine, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. They provided an in-depth description of this large-scale Web search engine. Apart from the problems of scaling traditional search techniques to data of such magnitude, there are many other technical challenges, such as the use of the additional information present in hypertext to produce better search results. In their work they addressed the question of how to build a practical large-scale system that can exploit the additional information present in hypertext.

E. Moura et al. [Mou 00] presented a fast compression and decompression technique for natural language texts. The novelties are that (i) decompression of arbitrary portions of the text can be done very efficiently, (ii) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern matching
algorithm, and (iii) word-based approximate and extended search can be done efficiently without any decoding. The compression scheme uses a semi-static word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented.

N. Fuhr and N. Govert [Fuh 02] investigated two different approaches for reducing the index space of inverted files for XML documents. First, they considered methods for compressing index entries. Second, they developed the new XS tree data structure, which contains the structural description of a document in a rather compact form, such that these descriptions can be kept in main memory. Experimental results on two large XML document collections show that very high compression rates for indexes can be achieved, but any compression increases retrieval time.

A. Nagarajarao et al. [Nag 02] implemented an inverted index as part of a mass collaboration system. It provides the facility to search for documents that satisfy a given query. It also supports incremental updates, whereby documents can be added without re-indexing, and the index can be queried even while updates are being applied to it. Further, querying can be done in two modes: a normal mode that can be used when an immediate response is required, and a batched mode that can provide better throughput at the cost of increased response time for some requests. The batched mode may be useful in an alert system where some of the queries can be scheduled. They implemented generators to produce the large data sets that they used as benchmarks, and they tested their inverted index with data sets on the order of gigabytes to ensure scalability.

R. Grossi et al. [Gro 03] presented a novel implementation of compressed suffix arrays exhibiting new trade-offs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet of size r, where each symbol is encoded by log r bits. They showed that compressed suffix arrays use just nHh + O(n log log n / log_r n) bits, while retaining full-text indexing functionalities, such as searching for any pattern sequence of length m in O(m log r + polylog(n)) time. The term Hh <= log r denotes the hth-order empirical entropy of the text, which means that the index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant of 1). If the text is highly compressible, so that Hh = O(1)
and the alphabet size is small, they obtained a text index with O(m) search time that requires only O(n) bits.

X. Long and T. Suel [Lon 03] studied pruning techniques that can significantly improve query throughput and response times for query execution in large engines in the case where there is a global ranking of pages, as provided by PageRank or any other method, in addition to the standard term-based approach. They described pruning schemes for this case and evaluated their efficiency on an experimental cluster-based search engine with millions of Web pages. Their results showed that there is significant potential benefit in such techniques.

V. N. Anh and A. Moffat [Anh 04] described a scheme for compressing lists of integers as sequences of fixed binary codewords that has the twin benefits of being both effective and efficient. Because Web search engines index large quantities of text, the static costs associated with storing the index can be traded against the dynamic costs associated with using it during query evaluation. Typically, index representations that are effective and obtain good compression tend not to be efficient, in that they require more operations during query processing. The approach described by Anh and Moffat results in a reduction in index storage costs compared to their previous word-aligned version, with no cost in terms of query throughput.

Udayan Khurana and Anirudh Koul [Khu 05] presented a new compression scheme for text that achieves high compression ratios and enables very fast searching within the compressed text. Typical compression ratios of 70-80% and a reduction of the search time by 80-85% are the main features of their work. Until then, a trade-off between high compression ratios and searchability within compressed text had been assumed; in their paper, they showed that the greater the compression, the faster the search.

Stefan Buttcher and Charles L. A. Clarke [But 06] examined index compression techniques for schema-independent inverted files used in text retrieval systems. Schema-independent inverted files contain full positional information for all index terms and allow the structural unit of retrieval to be specified dynamically at query time, rather than statically during index construction. Schema-independent indexes have different characteristics
than document-oriented indexes, and this difference can greatly affect the effectiveness of index compression algorithms. Their experimental results show that unaligned binary codes which take into account the special properties of schema-independent indexes achieve better compression rates than methods designed for compressing document indexes, and that they can reduce the size of the index by around 15% compared to byte-aligned index compression.

P. Ferragina et al. [Fer 07] proposed two new compressed representations for general sequences, which produce an index that improves over the one in [Gro 03] by removing from the query times the dependence on the alphabet size and the polylogarithmic terms.

R. Gonzalez and G. Navarro [Gon 07a] introduced a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes. In addition, their index permits a very efficient secondary-memory implementation, where compression reduces the amount of I/O needed to answer queries. Compressed text self-indexes have matured to the point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, supports indexed text searches. At this point those indexes are competitive with traditional text indexes (which are very large) for counting the number of occurrences of a pattern in the text; yet, they are still hundreds to thousands of times slower when it comes to locating those occurrences in the text.

R. Gonzalez and G. Navarro [Gon 07b] introduced a disk-based compressed text index that, when the text is compressible, takes little more than the plain text size (and replaces it). It provides very good I/O times for searching, which in particular improve when the text is compressible. In this respect the index is unique, as compressed indexes have previously been slower than their classical counterparts on secondary memory. They analyzed their index and showed experimentally that it is extremely competitive on compressible texts.

A. Moffat and J. S. Culpepper [Mof 07] showed that a relatively simple combination of techniques allows fast calculation of Boolean conjunctions within a surprisingly small amount of data transferred. This approach exploits the observation that queries tend to contain common words, and that representing common words via a bitvector allows random-access testing of candidates and, if necessary, fast intersection operations before the list of candidates is developed. By using bitvectors for a very small number of terms that occur frequently (in both documents and queries), and byte-coded inverted lists for the balance, both querying time and query-time data-transfer volumes can be reduced. The techniques described in [Mof 07] are not applicable to other powerful forms of querying; for example, index structures that support phrase and proximity queries have a much more complex structure and are not amenable to storage (in their full form) using bitvectors. Nevertheless, there may be scope for evaluation regimes that make use of preliminary conjunctive filtering before a more detailed index is consulted, in which case the structures described in [Mof 07] would still be relevant.
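To make the idea of mixing bitvectors and postings lists concrete, the following sketch evaluates a conjunctive query in the spirit of [Mof 07]: frequent terms are ANDed as bitvectors first, and the remaining candidates are checked against the postings lists of the rarer terms. The tiny collection, the term sets, and the function name are made-up examples, not taken from the paper.

```python
# Conjunctive evaluation: bitvectors for frequent terms, postings for the rest.
NUM_DOCS = 8

bitvectors = {            # one bit per document, bit d set if the term occurs in doc d
    "the": 0b11111110,
    "search": 0b01101101,
}
postings = {              # sorted document-ID lists for rarer terms
    "compression": [2, 3, 6],
    "hamming": [3, 6, 7],
}

def conjunction(terms):
    # Cheap filter: AND together the bitvectors of the frequent terms.
    mask = (1 << NUM_DOCS) - 1
    for t in terms:
        if t in bitvectors:
            mask &= bitvectors[t]
    # Then test candidates from the shortest rare-term list against the mask
    # and the remaining postings lists.
    rare = sorted((postings[t] for t in terms if t in postings), key=len)
    if not rare:
        return [d for d in range(NUM_DOCS) if mask >> d & 1]
    return [d for d in rare[0]
            if mask >> d & 1 and all(d in lst for lst in rare[1:])]

print(conjunction(["the", "search", "compression", "hamming"]))   # [3, 6]
```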
Due to the rapid growth in the size of the Web, Web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations including index compression, caching, and early termination. J. Zhang et al. [Zha 08] focused on two techniques, inverted index compression and index caching, which play a crucial role in Web search engines as well as in other high-performance information retrieval systems. They performed a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that had not been studied before. They then evaluated different inverted list caching policies on large query traces, and finally studied the possible performance benefits of combining compression and caching. The overall goal of their paper is to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size.

P. Ferragina et al. [Fer 09] presented an article to fill the gap between implementations and focused comparisons of compressed indexes. First, they presented the existing implementations of compressed indexes from a practitioner's point of view; second, they introduced the Pizza&Chili site, which offers tuned implementations and a standardized API for the
most successful compressed full-text self-indexes, together with effective test beds and scripts for their automatic validation and testing; and third, they showed the results of extensive experiments on a number of codes, with the aim of demonstrating the practical relevance of this novel algorithmic technology.

H. Yan et al. [Yan 09] studied index compression and query processing techniques for reordered document indexes. Previous work had focused on determining the best possible ordering of documents. In contrast, they assumed that such an ordering is already given and focused on how to optimize compression methods and query processing for this case. They performed an extensive study of compression techniques for document IDs and presented new optimizations of existing techniques which can achieve significant improvements in both compression and decompression performance. They also proposed and evaluated techniques for compressing frequency values for this case. Finally, they studied the effect of this approach on query processing performance. Their experiments showed very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million Web pages.

2.3 Recent Research on Bit-Level Data Compression Algorithms

This section presents a review of some of the most recent research on developing efficient bit-level data compression algorithms, as the algorithm we use in this thesis is a bit-level technique.

A. Jardat and M. Irshid [Jar 01] proposed a very simple and efficient binary run-length compression technique. The technique is based on mapping the non-binary information source into an equivalent binary source using a new fixed-length code instead of the ASCII code. The codes are chosen such that the probability of one of the two binary symbols, say zero, at the output of the mapper is made as small as possible. Moreover, the "all ones" code is excluded from the code assignment table to ensure the presence of at least one "zero" in each of the output codewords. Compression is achieved by encoding the number of "ones" between two consecutive "zeros" using either a fixed-length code or a variable-length code. When applying this simple encoding technique to English text files, they achieved a compression of 5.44 bpc (bits per character) and 4.6 bpc for the fixed-length code and the variable-length (Huffman) code, respectively.
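The following sketch illustrates only the run-length step of this idea: each maximal run of ones terminated by a zero is replaced by a fixed-length count. The 3-bit count width is a made-up parameter, the code-assignment stage of [Jar 01] is omitted, and the sketch assumes runs shorter than 2^3 and a sequence ending in a zero; with such short counts it demonstrates the mechanics rather than actual compression.

```python
# Sketch of run-length coding of 1-runs between consecutive 0s ([Jar 01]-style).
def encode_runs(bits, run_bits=3):
    """Replace each maximal run of 1s terminated by a 0 with a fixed-length count."""
    out, run = [], 0
    for b in bits:
        if b == 1:
            run += 1                                   # extend the current run of 1s
        else:
            out.append(format(run, f"0{run_bits}b"))   # count of 1s before this 0
            run = 0
    return "".join(out)          # assumes the sequence ends with a 0 (no trailing run)

def decode_runs(code, run_bits=3):
    bits = []
    for i in range(0, len(code), run_bits):
        run = int(code[i:i + run_bits], 2)
        bits.extend([1] * run + [0])
    return bits

original = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0]
assert decode_runs(encode_runs(original)) == original
```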
Caire et al. [Cai 04] presented a new approach to universal noiseless compression based on error-correcting codes. The scheme is based on the concatenation of the Burrows-Wheeler block sorting transform (BWT) with the syndrome former of a low-density parity-check (LDPC) code. Their scheme has linear encoding and decoding times and uses a new closed-loop iterative doping algorithm that works in conjunction with belief-propagation decoding. Unlike the leading data compression methods, their method is resilient against errors and lends itself to joint source-channel encoding/decoding; furthermore, it offers very competitive data compression performance.

A. A. Sharieh [Sha 04] introduced a fixed-length Hamming (FLH) algorithm as an enhancement to Huffman coding (HU) to compress text and multimedia files. He investigated and tested these algorithms on different text and multimedia files. His results indicated that the HU-FLH and FLH-HU combinations enhanced the compression ratio.

K. Barr and K. Asanović [Bar 06] presented a study of the energy savings possible by losslessly compressing data prior to transmission. Because wireless transmission of a single bit can require over 1000 times more energy than a single 32-bit computation, it can be beneficial to perform additional computation to reduce the number of bits transmitted. If the energy required to compress data is less than the energy required to send it, there is
a net energy saving and an increase in battery life for portable computers. This work demonstrated that, with several typical compression algorithms, there was actually a net energy increase when compression was applied before transmission. Reasons for this increase were explained, and suggestions were made to avoid it. One such energy-aware suggestion was asymmetric compression: the use of one compression algorithm on the transmit side and a different algorithm on the receive path. By choosing the lowest-energy compressor and decompressor on the test platform, the overall energy to send and receive data can be reduced by 11% compared with a well-chosen symmetric pair, or by up to 57% over the default symmetric scheme. The value of this research is not merely to show that one can optimize a given algorithm to achieve a certain reduction in energy, but to show that the choice of how and whether to compress is not obvious. It depends on hardware factors such as the relative energy cost of the central processing unit (CPU), memory, and network, as well as software factors including compression ratio and memory access patterns. These factors can change, so techniques for lossless compression prior to transmission/reception of data must be re-evaluated with each new generation of hardware and software.

A. Jaradat et al. [Jar 06] proposed a file splitting technique for the reduction of the nth-order entropy of text files. The technique is based on mapping the original text file into a non-ASCII binary file using a new codeword assignment method; the resulting binary file is then split into several sub-files, each containing one or more bits from each codeword of the mapped binary file. The statistical properties of the sub-files were studied, and it was found that they reflect the statistical properties of the original text file, which was not the case when the ASCII code was used as a mapper. The nth-order entropy of these sub-files was determined, and it was found that the sum of their entropies was less than that of the original text file for the same extension values. These interesting statistical properties of the resulting sub-files can be used to achieve better compression ratios when conventional compression techniques are applied to the sub-files individually and on a bit-wise rather than character-wise basis.

H. Al-Bahadili [Bah 07b, Bah 08a] developed a lossless binary data compression scheme
that is based on the error-correcting Hamming codes; it was referred to as the HCDC algorithm. In this algorithm, the binary sequence to be compressed is divided into blocks of n bits each. To utilize the Hamming codes, each block is considered as a Hamming codeword that consists of p parity bits and d data bits (n = d + p). Each block is then tested to determine whether it is a valid or a non-valid Hamming codeword. For a valid block, only the d data bits preceded by a 1 are written to the compressed file, while for a non-valid block all n bits preceded by a 0 are written to the compressed file. These additional 1 and 0 bits are used to distinguish the valid and non-valid blocks during the decompression process. An analytical formula was derived for computing the compression ratio as a function of the block size and the fraction of valid blocks in the sequence. The performance of the HCDC algorithm was analyzed, and the results obtained were presented in tables and graphs. The author concluded that the maximum compression ratio that can be achieved by this algorithm is n/(d+1), reached when all blocks are valid Hamming codewords.

S. Nofal [Nof 07] proposed a bit-level file compression algorithm. In this algorithm, the binary sequence is divided into a set of groups of bits, which are considered as minterms representing Boolean functions. Applying algebraic simplifications to these functions reduces the number of minterms and, hence, the number of bits in the file. To make decompression possible, one should solve the problem of dropped Boolean variables in the simplified functions. He investigated one possible solution, and his evaluation shows that future work should find other solutions to render this technique useful, as the maximum compression ratio achieved was not more than 10%.

H. Al-Bahadili and S. Hussain [Bah 08b] proposed and investigated the performance of a bit-level data compression algorithm in which the binary sequence is divided into blocks of n-bit length. This gives each block a possible decimal value between 0 and 2^n - 1. If the number of different decimal values (d) is equal to or less than 256, then the binary sequence can be compressed using the n-bit character wordlength; thus, a compression ratio of approximately n/8 can be achieved. They referred to this algorithm as the
adaptive character wordlength (ACW) algorithm; since the compression ratio of the algorithm is a function of n, it is referred to as the ACW(n) algorithm. Implementation of the ACW(n) algorithm highlights a number of issues that may degrade its performance and need to be carefully resolved, such as: (i) if d is greater than 256, the binary sequence cannot be compressed using the n-bit character wordlength; (ii) the probability of being able to compress a binary sequence using the n-bit character wordlength is inversely proportional to n; and (iii) finding the optimum value of n that provides the maximum compression ratio is a time-consuming process, especially for large binary sequences. In addition, for text compression, converting text to binary using the equivalent ASCII code of the characters gives a high-entropy binary sequence, so only a small compression ratio, or sometimes no compression at all, can be achieved.

To overcome the drawbacks of the ACW(n) algorithm, Al-Bahadili and Hussain [Bah 10a] developed an efficient implementation scheme to enhance its performance. In this scheme the binary sequence is divided into a number of subsequences (s), each of which satisfies the condition that d is less than 256; it is therefore referred to as the ACW(n,s) scheme. The scheme achieved compression ratios of more than 2 on most text files from the most widely used corpora.

H. Al-Bahadili and A. Rababa'a [Bah 07a, Rab 08, Bah 10b] developed a new scheme consisting of six steps, some of which are applied repetitively to enhance the compression ratio of the HCDC algorithm [Bah 07b, Bah 08a]; the new scheme is therefore referred to as the HCDC(k) scheme, where k refers to the number of repetition loops. The repetition loops continue until inflation is detected. The overall (accumulated) compression ratio is the product of the compression ratios of the individual loops. The results obtained for the HCDC(k) scheme demonstrated that the scheme has a higher compression ratio than most well-known text compression algorithms, and it also exhibits competitive performance with respect to many widely used state-of-the-art software tools. The HCDC algorithm and the HCDC(k) scheme will be discussed in detail in the next chapter.
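To make the valid/non-valid block flagging behind HCDC concrete before the detailed treatment in Chapter 3, the following minimal sketch compresses a bit string using (7,4) Hamming codewords (n = 7, d = 4, p = 3). It is only an illustration of the idea described above, not the author's implementation; the parity-bit layout (parity at positions 1, 2 and 4) is one common (7,4) convention and is an assumption of this sketch.

```python
# Minimal sketch of HCDC-style block flagging with (7,4) Hamming codewords.
def is_valid_codeword(block):
    """True if the 7-bit block has a zero Hamming syndrome."""
    b = [0] + block                       # shift to 1-based positions
    s1 = b[1] ^ b[3] ^ b[5] ^ b[7]
    s2 = b[2] ^ b[3] ^ b[6] ^ b[7]
    s4 = b[4] ^ b[5] ^ b[6] ^ b[7]
    return (s1, s2, s4) == (0, 0, 0)

def data_bits(block):
    """Data bits of a (7,4) codeword (positions 3, 5, 6, 7)."""
    return [block[2], block[4], block[5], block[6]]

def hcdc_compress(bits, n=7):
    out = []
    for i in range(0, len(bits) - len(bits) % n, n):   # whole blocks only
        block = bits[i:i + n]
        if is_valid_codeword(block):
            out += [1] + data_bits(block)              # flag 1 + d data bits
        else:
            out += [0] + block                         # flag 0 + all n bits
    return out

# A valid block shrinks from 7 to 5 bits; a non-valid block grows to 8 bits.
valid = [0, 0, 0, 0, 0, 0, 0]            # the all-zero word is always valid
invalid = [1, 0, 0, 0, 0, 0, 0]
print(len(hcdc_compress(valid + invalid)))   # 5 + 8 = 13 bits
```

With these parameters the best case of the n/(d+1) formula is 7/5 = 1.4, reached when every block happens to be a valid codeword; decompression simply reads the flag bit and then either d or n bits accordingly.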
S. Ogg and B. Al-Hashimi [Ogg 06] proposed a simple yet effective real-time compression technique that reduces the number of bits sent over serial links. The proposed technique reduces the number of bits and the number of transitions when compared to the original uncompressed data. Results of compression on two MPEG-1 coded pictures showed average bit reductions of approximately 17% to 47% and average transition reductions of approximately 15% to 24% over a serial link. The technique can be employed with network-on-chip (NoC) technology to alleviate the bandwidth bottleneck. Fixed and dynamic block sizing were considered, and general guidelines for determining a suitable fixed block length, together with an algorithm for dynamic block sizing, were presented. The technique exploits the fact that unused significant bits do not need to be transmitted. The authors also outlined a possible implementation of the proposed compression technique, and the area overhead costs and the potential power and bandwidth savings within a NoC environment were presented.

J. Zhang and X. Ni [Zha 10] presented a new implementation of bit-level arithmetic coding using integer additions and shifts. The algorithm has lower computational complexity and more flexibility, and thus is very suitable for hardware design. They showed that their implementation has the least complexity and the highest speed when compared with Zhao's algorithm [Zha 98], the Rissanen-Mohiuddin (RM) algorithm [Ris 89], the Langdon-Rissanen (LR) algorithm [Lan 82], and the basic arithmetic coding algorithm, and it sometimes achieves a higher compression rate than the basic arithmetic coding algorithm. It therefore provides an excellent compromise between good performance and low complexity.
Chapter Three
The Novel CIQ Web Search Engine Model

This chapter presents a description of the proposed Web search engine model. The model incorporates two bit-level data compression layers, both installed at the back-end processor: one for index compression (the index compressor) and one for query compression (the query or keyword compressor), so that the search process can be performed at the compressed index-query level and any decompression activity during searching is avoided. It is therefore referred to as the compressed index-query (CIQ) model.

In order to be able to perform the search process at the compressed index-query level, it is important to have a data compression technique that is capable of producing the same pattern for the same character in both the query and the index. The algorithm that meets this main requirement is the novel Hamming codes based data compression (HCDC) algorithm [Bah 07b, Bah 08a]. The HCDC algorithm creates a compressed file header (compression header) to store parameters that are relevant to the compression process, mainly the character-to-binary coding pattern. This header is stored separately so that it can be accessed by the query compressor and the index decompressor.

Introducing the new compression layers should reduce the disk space needed for storing index files and increase query throughput and, consequently, the retrieval rate. On the other hand, compressing the search query reduces I/O overheads and query processing time, as well as the system response time.

This section outlines the main theme of this chapter. The rest of this chapter is organized as follows: a detailed description of the new CIQ Web search engine model is given in Section 3.1. Section 3.2 presents the implementation of the new model and its main procedures. The data compression algorithm, namely the HCDC algorithm, is described in Section 3.3, together with the derivation and analysis of the HCDC compression ratio. The performance measures that are used to evaluate and compare the performance of the new model are introduced in Section 3.4.
3.1 The CIQ Web Search Engine Model

In this section, a description of the proposed Web search engine model is presented. The new model incorporates two bit-level data compression layers, both installed at the back-end processor: one for index compression (the index compressor) and one for query compression (the query compressor or keyword compressor), so that the search process can be performed at the compressed index-query level and any decompression activity is avoided; therefore, we refer to it as the compressed index-query (CIQ) Web search engine model, or simply the CIQ model.

In order to be able to perform the search process at the CIQ level, it is important to have a data compression technique that is capable of producing the same pattern for the same character in both the index and the query. The HCDC algorithm [Bah 07b, Bah 08a], which will be described in the next section, satisfies this important requirement, and it is used in the compression layers of the new model. Figure (3.1) outlines the main components of the new CIQ model and shows where the compression layers are located. It is believed that introducing the new compression layers reduces the disk space needed for storing index files and increases query throughput and, consequently, the retrieval rate. On the other hand, compressing the search query reduces I/O overheads and query processing time, as well as the system response time.

The CIQ model works as follows. At the back-end processor, after the indexer generates the index and before sending it to the index storage device, the index is kept in temporary memory, where lossless bit-level compression is applied using the HCDC algorithm; the compressed index file is then sent to the storage device. As a result, the index requires less disk space, enabling more documents to be indexed and accessed in comparatively less CPU time. The HCDC algorithm creates a compressed-file header (compression header) to store parameters that are relevant to the compression process, mainly the character-to-binary coding pattern. This header is stored separately so that it can be accessed by the query compression layer (the query compressor).
On the other hand, the query parser, instead of passing the query directly to the index file, passes it to the query compressor before the index file is accessed. In order to produce the same binary pattern for the same characters in the index and in the query, the character-to-binary codes used in converting the index file are passed to the query compressor and applied there as well. If a match is found, the retrieved data is decompressed using the index decompressor and passed through the ranker and the search engine interface to the end user.

Figure (3.1). Architecture and main components of the CIQ Web search engine model.
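The following sketch illustrates the compressed index-query matching idea described above. A toy character-to-binary coding table stands in for the coding pattern that HCDC stores in the compression header; the data, the dictionary-based index, and all function names are illustrative assumptions, not the author's implementation.

```python
# Minimal sketch of compressed index-query matching in the CIQ model.
# Shared coding pattern (in the real model, read from the compression header).
CODE = {ch: format(i, "05b") for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def compress_term(term):
    """Apply the same character-to-binary coding to index terms and query terms."""
    return "".join(CODE[ch] for ch in term)

# Back-end side: the inverted index is stored with compressed keys.
documents = {1: "compressed index search", 2: "query compression", 3: "web search engine"}
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(compress_term(term), set()).add(doc_id)

# Front-end side: the parsed query is compressed with the same coding,
# so matching happens entirely at the compressed level (no decompression).
def search(query):
    result = None
    for term in query.lower().split():
        postings = index.get(compress_term(term), set())
        result = postings if result is None else result & postings
    return result or set()

print(search("compressed search"))   # {1}
```

Only the entries that are finally retrieved would then be decompressed by the index decompressor before ranking, mirroring the workflow described above.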
3.2 Implementation of the CIQ Model: The CIQ-based Test Tool (CIQTT)

This section describes the implementation of a CIQ-based test tool (CIQTT), which is developed to:

(1) Validate the accuracy and integrity of the retrieved data, ensuring that the same data sets can be retrieved using the new CIQ model.
(2) Evaluate the performance of the CIQ model, estimating the reduction in the index file storage requirement and in the processing or search time.

The CIQTT consists of six main procedures; these are:

(1) COLCOR: Collecting the testing corpus (documents).
(2) PROCOR: Processing and analyzing the testing corpus (documents).
(3) INVINX: Building the inverted index and starting indexing.
(4) COMINX: Compressing the inverted index.
(5) SRHINX: Searching the index file (inverted or inverted/compressed index).
(6) COMRES: Comparing the outcomes of the different search processes performed by the SRHINX procedure.

In what follows, we provide a brief description of each of the above procedures.

3.2.1 COLCOR: Collects the testing corpus (documents)

In this procedure, the Nutch crawler [Web 6] is used to collect the targeted corpus (documents). Nutch is an open-source search technology initially developed by Douglas Reed Cutting, an advocate and creator of open-source search technology. He originated Lucene and, with Mike Cafarella, Nutch, both open-source search projects which are now managed through the Apache Software Foundation (ASF). Nutch builds on Lucene [Web 7] and Solr [Web 8].