This lab report describes developing a program to perform string operations using suffix arrays. It includes 3 modules: 1) Finding the longest repeated substring, 2) Finding the longest common substring, and 3) Finding the longest palindrome in a string. The report provides code for building a suffix tree from a string and performing traversal to solve each problem. It also includes sample outputs and references.
String functions using suffix array
PROJECT BASED LAB REPORT
On
FUNCTIONS OF STRING USING SUFFIX ARRAY
Submitted in partial fulfilment of the
Requirements for the award of the Degree of
Bachelor of Technology
In
Computer Science & Engineering
By
S.V.Rohith
(150031000)
P.Iswarya
(150030684)
K.Sri sai krishna
(150030496)
Under the esteemed guidance of
Sir, G.Swain
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(DST-FIST Sponsored Department)
K L University
Green Fields, Vaddeswaram, Guntur District-522 502
2015-2016
CERTIFICATE
This is to certify that this project based lab report entitled “String functions using Suffix
array” is a bonafide work done by the team
S.V.Rohith (150031000)
P.Iswarya (150030684)
K.Sri sai krishna (150030496)
in partial fulfilment of the requirement for the award of degree in BACHELOR OF
TECHNOLOGY in Computer Science and Engineering during the academic year 2015-2016.
Faculty in charge                    Head of the Department
K L University
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(DST-FIST Sponsored Department)
DECLARATION
We hereby declare that this project based lab report entitled “String functions using Suffix
arrays” has been prepared by us in partial fulfilment of the requirement for the award of degree
“BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING”
during the academic year 2015-2016.
We also declare that this project based lab report is the result of our own effort and that it has
not been submitted to any other university for the award of any degree.
Date: 18/04/16                       S.V.Rohith (150031000)
Place: Vaddeswaram                   P.Iswarya (150030684)
                                     K.Sri sai Krishna (150030382)
ACKNOWLEDGEMENTS
Our sincere thanks to G.Swain for the outstanding support in the lab throughout the project
and for the successful completion of the work.
We express our gratitude to Dr.V.Srikanth, Head of the Department of Computer Science
and Engineering, for providing us with adequate facilities, ways and means by which we were
able to complete this term paper work.
We would like to place on record our deep sense of gratitude to the honourable Vice Chancellor,
K L University, for providing the necessary facilities to carry out this term paper work.
Last but not least, we thank all the teaching and non-teaching staff of our department, and
especially our classmates and friends, for their support in the completion of our term paper
work.
S.V.Rohith (150031000)
P.Iswarya (150030684)
K.Sri sai Krishna (150030382)
TABLE OF CONTENTS
1. Introduction and Description
1.1 Module 1: Longest repeated sub-string
1.2 Module 2: Longest common substring
1.3 Module 3: Longest palindrome of a string
2. Basic requirements and development of the program
3. Longest repeated sub-string
3.1 Code of the module
3.2 Outputs and reference frames
4. Longest common substring
4.1 Code of the module
4.2 Outputs and reference frames
5. Longest palindrome in a string
5.1 Code of the module
5.2 Outputs and reference frames
6. References
1. INTRODUCTION
This project belongs to the study of advanced data structures and is implemented in the C
language. C is a structured, high-level, machine-independent language. A compiler translates
C source code into machine code for the target hardware, which allows software developers to
write programs without worrying about the hardware platforms on which they will run.
C descends from ALGOL, which was introduced in the early 1960s and gave the concept of
structured programming to the computer science community.
Later, Martin Richards developed a language known as BCPL in 1967, and in 1970 Ken Thompson
created a language derived from BCPL that he called B. Both BCPL and B are typeless system
programming languages. C evolved from ALGOL, BCPL and B at Bell Laboratories in 1972,
where it was developed by Dennis Ritchie. C uses many concepts from these languages and adds
the concept of data types. It was developed along with the UNIX operating system, which is
still one of the most widely used operating systems and runs much of the Internet's
infrastructure. C is a robust language because it supports a rich set of operators and built-in
functions, together with its keywords, operands and special characters.
Features of C programming:
It is a structured programming language with fundamental flow control
constructs.
It is highly portable. A program written on one computer can run on another
computer without any modification or with only slight modification.
ANSI C contains 32 keywords.
It is a simple and versatile programming language.
It has a rich set of operators and library functions.
Dynamic memory allocation is possible in C.
Structures
We have seen that arrays can be used to represent a group of data items that
belong to the same data type. If we want to represent a collection of data items of different
data types using a single name, then we cannot use an array. C supports a constructed
data type known as a structure, which is a method for packing data of different data types.
A structure is a convenient tool for handling a group of logically related data items.
Structures help to organize complex data in a more meaningful way. It is a powerful
concept that we may often need to use in our program design.
Definition:
A group of data items that belong to different data types is known as a structure.
struct: the keyword used to declare a structure.
Declaration of structure:
struct struct_name
{
Data item-1;
Data item-2;
…………
…………
Data item-n;
};
Declaration of structure variable:
struct struct_name identifier;
(or)
struct struct_name identifier-1,identifier-2,.......,identifier-n;
Access operator (.):
It is used to access the data items of a structure with the help of a structure variable.
Syntax:
struct_variable.data_item;
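The declaration and access syntax above can be illustrated with a small sketch. The names suffix_entry and make_suffix_entry are hypothetical and chosen only for this example; they do not appear in the project code.

```c
/* A structure groups data items of different types under one name.
   (Hypothetical example type, not part of the project code.) */
struct suffix_entry {
    int index;          /* position at which the suffix starts */
    const char *text;   /* pointer to the first character of the suffix */
};

/* Builds the entry for the suffix of s that starts at position index. */
struct suffix_entry make_suffix_entry(const char *s, int index)
{
    struct suffix_entry e;   /* declaration of a structure variable */
    e.index = index;         /* members are reached with the dot operator */
    e.text = s + index;
    return e;
}
```

For instance, make_suffix_entry("banana", 3) yields an entry whose index member is 3 and whose text member points at "ana".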
The program includes the declarations of all data variables.
It also includes print statements.
Dynamic memory allocation
SUFFIX ARRAY
In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data
structure used in, among others, full text indices, data compression algorithms and the field of
bioinformatics.
Suffix arrays were introduced by Manber & Myers (1990) as a simple, space efficient
alternative to suffix trees. They were independently discovered by Gaston Gonnet in 1987
under the name PAT array (Gonnet, Baeza-Yates & Snider 1992).
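As an illustration of the definition, the following is a minimal sketch that builds a suffix array by sorting suffix start positions with the standard qsort and strcmp functions. The function name build_suffix_array is an assumption made for this example, and the approach is the simple comparison sort, not the faster construction of Manber & Myers.

```c
#include <stdlib.h>
#include <string.h>

/* Text whose suffixes are compared. qsort comparators receive no
   context argument, so this sketch keeps the text in a file-scope pointer. */
static const char *sa_text;

/* Orders two suffix start positions by the suffixes they denote. */
static int compare_suffixes(const void *a, const void *b)
{
    int i = *(const int *)a;
    int j = *(const int *)b;
    return strcmp(sa_text + i, sa_text + j);
}

/* Fills sa[0..n-1] with the start positions of the suffixes of text in
   lexicographic order, where n = strlen(text). Returns n.
   The caller must provide room for n integers. */
int build_suffix_array(const char *text, int *sa)
{
    int n = (int)strlen(text);
    int i;

    for (i = 0; i < n; i++)
        sa[i] = i;
    sa_text = text;
    qsort(sa, (size_t)n, sizeof sa[0], compare_suffixes);
    return n;
}
```

For the text "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so the array holds 5, 3, 1, 0, 4, 2. Each comparison may scan a whole suffix, so this sketch favours clarity over speed.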
The task is to build a suffix array and perform the following operations on the obtained suffix
array.
Name of the Module | Function Number | Functions to be discharged
SUFFIX ARRAY       | #1              | Finding the longest repeated substring
                   | #2              | Finding the longest common substring
                   | #3              | Finding the longest palindrome in a string
The aim is to develop a program that combines structures, arrays, and other functions. The
program can perform operations on arrays such as insertion, deletion, sorting, searching,
update, retrieval, merging, append, and exit.
By implementing this program we can execute the string related operations. Doing this
analysis manually takes a lot of time and patience, but implementing it in a high level
language like C makes it much easier. Before arriving at the final solution, however, the
problem must be analysed.
First, the basic information about the program, which works on strings, is gathered. The
program can be written in several ways, for example with user defined functions, loops and
conditions, or goto statements. In this report we used functions, while loops, for loops,
switch case and if conditions, which make the problem much easier to solve. The following
steps are followed while running the program.
The user enters the input, i.e., the value of the choice (the menu number), to select a particular menu.
Control goes to that menu and then to the corresponding function.
The program prints the result produced by the execution.
Longest Repeated Substring
In computer science, the longest repeated substring problem is the problem of finding the
longest substring of a string that occurs at least twice. This problem can be solved in linear
time and space by building a suffix tree for the string, and finding the deepest internal node
in the tree. Depth is measured by the number of characters traversed from the root. The string
spelled by the edges from the root to such a node is a longest repeated substring. The problem
of finding the longest substring with at least k occurrences can be solved by first
pre-processing the tree to count the number of leaf descendants for each internal node, and
then finding the deepest node with at least k leaf descendants. For the string "ATCGATCGA$",
the longest repeated substring is "ATCGA", which occurs twice (the two occurrences overlap).
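The same answer can be obtained from sorted suffixes instead of a suffix tree. The sketch below is an illustration under that assumption, not the linear time suffix tree method described above: it sorts the suffix start positions and takes the longest common prefix of neighbouring suffixes. All names in it are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Text being sorted; qsort comparators receive no context argument. */
static const char *lrs_text;

static int lrs_compare(const void *a, const void *b)
{
    return strcmp(lrs_text + *(const int *)a, lrs_text + *(const int *)b);
}

/* Length of the longest common prefix of x and y. */
static int lrs_common_prefix(const char *x, const char *y)
{
    int k = 0;
    while (x[k] != '\0' && x[k] == y[k])
        k++;
    return k;
}

/* Returns the length of a longest repeated substring of text and stores
   the start of one occurrence in *start. Returns 0 when nothing repeats
   or when memory cannot be allocated. */
int longest_repeated_substring(const char *text, int *start)
{
    int n = (int)strlen(text);
    int best = 0, i;
    int *sa;

    *start = 0;
    if (n < 2)
        return 0;
    sa = malloc((size_t)n * sizeof *sa);
    if (sa == NULL)
        return 0;

    for (i = 0; i < n; i++)
        sa[i] = i;
    lrs_text = text;
    qsort(sa, (size_t)n, sizeof *sa, lrs_compare);

    /* A repeated substring is a common prefix of two suffixes, and the
       longest such prefix occurs between neighbours in sorted order. */
    for (i = 1; i < n; i++) {
        int lcp = lrs_common_prefix(text + sa[i - 1], text + sa[i]);
        if (lcp > best) {
            best = lcp;
            *start = sa[i - 1];
        }
    }
    free(sa);
    return best;
}
```

Because strcmp stops at the terminating null character, this sketch does not need the "$" end marker that the suffix tree construction uses. Its running time is dominated by the comparison sort, so it is slower than the suffix tree approach on long inputs.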
Longest Common Substring
In computer science, the longest common substring problem
is to find the longest string (or strings) that is a substring (or are substrings) of two or more
strings. The longest common substring of the strings "ABABC", "BABCA" and "ABCBA" is
string "ABC" of length 3. Other common substrings are "A", "AB", "B", "BA", "BC" and "C".
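For two strings, the problem can be sketched with sorted suffixes of the concatenation X#Y. The code below is a minimal illustration with hypothetical names; it assumes that the separator '#' occurs in neither input, and it handles only two strings, not the three of the example above.

```c
#include <stdlib.h>
#include <string.h>

/* Concatenated text X#Y being sorted by the comparator below. */
static const char *lcs_text;

static int lcs_compare(const void *a, const void *b)
{
    return strcmp(lcs_text + *(const int *)a, lcs_text + *(const int *)b);
}

/* Returns the length of a longest common substring of x and y and stores
   the start of one occurrence inside x in *start_in_x.
   Assumption: the character '#' appears in neither x nor y.
   Returns 0 when there is no common substring or allocation fails. */
int longest_common_substring(const char *x, const char *y, int *start_in_x)
{
    int nx = (int)strlen(x), ny = (int)strlen(y);
    int n = nx + 1 + ny;                 /* length of X#Y */
    char *text = malloc((size_t)n + 1);
    int *sa = malloc((size_t)n * sizeof *sa);
    int best = 0, i;

    *start_in_x = 0;
    if (text == NULL || sa == NULL) {
        free(text);
        free(sa);
        return 0;
    }
    memcpy(text, x, (size_t)nx);
    text[nx] = '#';
    memcpy(text + nx + 1, y, (size_t)ny + 1);   /* copies the final '\0' too */

    for (i = 0; i < n; i++)
        sa[i] = i;
    lcs_text = text;
    qsort(sa, (size_t)n, sizeof *sa, lcs_compare);

    /* Only neighbouring suffixes that start in different strings count. */
    for (i = 1; i < n; i++) {
        int a = sa[i - 1], b = sa[i];
        int k = 0;
        if ((a < nx) == (b < nx))
            continue;                    /* both from X, or both from #Y */
        while (text[a + k] != '\0' && text[a + k] == text[b + k])
            k++;                         /* a match can never cross the '#' */
        if (k > best) {
            best = k;
            *start_in_x = (a < nx) ? a : b;
        }
    }
    free(text);
    free(sa);
    return best;
}
```

With x = "abcde" and y = "fghie" the function returns 1 and reports position 4 in x (the letter "e"); with "pqrst" and "uvwxyz" it returns 0.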
Longest Palindrome in a String
In computer science, the longest
palindromic substring or longest symmetric factor problem is the problem of finding a
maximum-length contiguous substring of a given string that is also a palindrome. For example,
the longest palindromic substring of "bananas" is "anana". The longest palindromic substring
is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic
substring with length greater than three, but there are two palindromic substrings with length
three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal
palindromic substrings (that is, all substrings that are themselves palindromes and cannot be
extended to larger palindromic substrings) rather than returning only one substring or returning
the maximum length of a palindromic substring.
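The definition can be turned into code with a simple quadratic time sketch that grows a palindrome outward from every possible centre. The names pal_expand and longest_palindrome are hypothetical and used only for this illustration.

```c
#include <string.h>

/* Grows a palindrome outward from the centre (low, high) while the
   characters match, and returns the length reached. */
static int pal_expand(const char *s, int len, int low, int high)
{
    while (low >= 0 && high < len && s[low] == s[high]) {
        low--;
        high++;
    }
    return high - low - 1;
}

/* Returns the length of a longest palindromic substring of s and stores
   the start of the first one found in *start. */
int longest_palindrome(const char *s, int *start)
{
    int len = (int)strlen(s);
    int best = 0, i;

    *start = 0;
    for (i = 0; i < len; i++) {
        int odd = pal_expand(s, len, i, i);        /* centre on a character */
        int even = pal_expand(s, len, i, i + 1);   /* centre between two */
        int cur = odd > even ? odd : even;
        if (cur > best) {
            best = cur;
            *start = i - (cur - 1) / 2;
        }
    }
    return best;
}
```

For "bananas" the function returns 5 with start 1 ("anana"); for "abracadabra" it returns 3 with start 3 ("aca"), since only the first palindrome of maximum length is reported.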
Manacher (1975) found a linear time algorithm for listing all the palindromes that appear at the
start of a given string. However, as observed e.g., by Apostolico, Breslauer & Galil (1995), the
same algorithm can also be used to find all maximal palindromic substrings anywhere within
the input string, again in linear time. Therefore, it provides a linear time solution to the longest
palindromic substring problem. Alternative linear time solutions were provided by Jeuring
(1994), and by Gusfield (1997), who described a solution based on suffix trees. Efficient
parallel algorithms are also known for the problem.
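A common way to present Manacher's linear time idea is sketched below. It is an illustrative version with hypothetical names, working on a virtual string in which a separator is placed between all characters so that even and odd palindromes are handled alike; it is not taken from the project code.

```c
#include <stdlib.h>
#include <string.h>

/* Character k of the virtual string #s0#s1#...#; the separator is
   represented by '\0', which cannot occur inside s. */
static char manacher_char(const char *s, int k)
{
    return (k % 2 == 0) ? '\0' : s[k / 2];
}

/* Returns the length of a longest palindromic substring of s and stores
   its start in *start. Returns 0 for an empty string or when memory
   cannot be allocated. */
int manacher_longest_palindrome(const char *s, int *start)
{
    int n = (int)strlen(s);
    int m = 2 * n + 1;                   /* length of the virtual string */
    int center = 0, right = 0;           /* palindrome reaching furthest right */
    int best = 0, best_center = 0, i;
    int *p;

    *start = 0;
    if (n == 0)
        return 0;
    p = calloc((size_t)m, sizeof *p);    /* p[i]: radius around position i */
    if (p == NULL)
        return 0;

    for (i = 0; i < m; i++) {
        if (i < right) {
            /* Reuse the radius of the mirror position inside the window. */
            int mirror = 2 * center - i;
            p[i] = p[mirror] < right - i ? p[mirror] : right - i;
        }
        while (i - p[i] - 1 >= 0 && i + p[i] + 1 < m &&
               manacher_char(s, i - p[i] - 1) == manacher_char(s, i + p[i] + 1))
            p[i]++;
        if (i + p[i] > right) {
            center = i;
            right = i + p[i];
        }
        if (p[i] > best) {
            best = p[i];
            best_center = i;
        }
    }
    free(p);
    /* A radius in the virtual string equals a length in s. */
    *start = (best_center - best) / 2;
    return best;
}
```

The radius array p also describes every maximal palindrome in the string, which matches the observation above that the algorithm finds all maximal palindromic substrings, not only the longest one.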
2. Requirements and Development
SOFTWARE REQUIREMENTS:
• This application was developed on the Microsoft Windows XP or later operating system.
• The application was coded and built using the following tools:
1. Code::Blocks
2. Turbo C
3. DOSBox
HARDWARE REQUIREMENTS:
• Hard disk: the application occupies 38 KB and the source code 5 KB.
• RAM: minimum 256 MB.
• Basic components such as a mouse, keyboard and display monitor.
3. Longest repeated sub-string
Algorithm:
1. Start by including the required header files.
2. Declare the required structure variables and pointer variables.
3. The (start, end) interval specifies the edge by which a node is connected to its
parent node. Each edge connects two nodes, one parent and one child, and the
(start, end) interval of a given edge is stored in the child node.
4. For leaf nodes, suffixIndex stores the index of the suffix for the path from root to leaf.
5. Take the input string and a pointer to the root node.
6. activeEdge is represented as an index of a character in the input string.
7. For the root node, suffixLink is set to NULL. For internal nodes, suffixLink
is set to root by default in the current extension and may change in the next
extension.
8. suffixIndex is set to -1 by default, and the actual suffix index is set later
for leaves at the end of all phases.
9. Change the activePoint for walk down (APCFWD) using the Skip/Count Trick (Trick
1). If activeLength is greater than or equal to the current edge length, set the next
internal node as activeNode and adjust activeEdge and activeLength accordingly to
represent the same activePoint.
10. The module then performs the required operation of finding the longest
repeated substring.
11. Display the appropriate output.
12. Exit the program.
CODING AND EXECUTION
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_CHAR 256

struct SuffixTreeNode {
    struct SuffixTreeNode *children[MAX_CHAR];
    /* ... */
}

int main(int argc, char *argv[])
{
    /* Each test string ends with the unique terminator '$'. */
    strcpy(text, "ABCDEFG$");
    buildSuffixTree();
    getLongestRepeatedSubstring();
    freeSuffixTreeByPostOrder(root);

    strcpy(text, "ATCGATCGA$");
    buildSuffixTree();
    getLongestRepeatedSubstring();
    freeSuffixTreeByPostOrder(root);

    strcpy(text, "pqrpqpqabab$");
    buildSuffixTree();
    getLongestRepeatedSubstring();
    freeSuffixTreeByPostOrder(root);
    return 0;
}
Since the input strings are hard-coded in the program, it produces the following output when run:
4. Longest common substring
Algorithm:
1. Start by including the required header files.
2. Declare the required structure variables and pointer variables.
3. The (start, end) interval specifies the edge by which a node is connected to its
parent node. Each edge connects two nodes, one parent and one child, and the
(start, end) interval of a given edge is stored in the child node.
4. For leaf nodes, suffixIndex stores the index of the suffix for the path from root to leaf.
5. Take the input string and a pointer to the root node.
6. activeEdge is represented as an index of a character in the input string.
7. For the root node, suffixLink is set to NULL. For internal nodes, suffixLink
is set to root by default in the current extension and may change in the next
extension.
8. suffixIndex is set to -1 by default, and the actual suffix index is set later
for leaves at the end of all phases.
9. Change the activePoint for walk down (APCFWD) using the Skip/Count Trick (Trick
1). If activeLength is greater than or equal to the current edge length, set the next
internal node as activeNode and adjust activeEdge and activeLength accordingly to
represent the same activePoint.
10. The module then performs the required operation of finding the longest
common substring.
11. Display the appropriate output.
12. Exit the program.
CODING AND EXECUTION
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_CHAR 256

struct SuffixTreeNode {
    struct SuffixTreeNode *children[MAX_CHAR];
    struct SuffixTreeNode *suffixLink;   /* link to another internal node */
    int start, suffixIndex;              /* edge start; suffix index for leaves */
    int *end;                            /* edge end, shared by all leaves */
};
typedef struct SuffixTreeNode Node;

char text[100];
Node *root = NULL;
Node *lastNewNode = NULL;
Node *activeNode = NULL;
int activeEdge = -1, activeLength = 0, remainingSuffixCount = 0, leafEnd = -1;
int *rootEnd = NULL;
int *splitEnd = NULL;
int size = -1, size1 = 0;

/* Creates a node whose incoming edge is labelled text[start..*end]. */
Node *newNode(int start, int *end) {
    Node *node = (Node *)malloc(sizeof(Node));
    int i;
    for (i = 0; i < MAX_CHAR; i++)
        node->children[i] = NULL;
    node->suffixLink = root;
    node->start = start;
    node->end = end;
    node->suffixIndex = -1;
    return node;
}

/* Number of characters on the edge leading to n. */
int edgeLength(Node *n) {
    if (n == root)
        return 0;
    return *(n->end) - (n->start) + 1;
}

/* Skip/Count trick: moves the active point past currNode if possible. */
int walkDown(Node *currNode) {
    if (activeLength >= edgeLength(currNode)) {
        activeEdge += edgeLength(currNode);
        activeLength -= edgeLength(currNode);
        activeNode = currNode;
        return 1;
    }
    return 0;
}

void extendSuffixTree(int pos) {
    leafEnd = pos;
    remainingSuffixCount++;
    lastNewNode = NULL;
    while (remainingSuffixCount > 0) {
        if (activeLength == 0)
            activeEdge = pos;
        /* The cast keeps the array index non-negative where char is signed. */
        if (activeNode->children[(unsigned char)text[activeEdge]] == NULL) {
            activeNode->children[(unsigned char)text[activeEdge]] = newNode(pos, &leafEnd);
            if (lastNewNode != NULL) {
                lastNewNode->suffixLink = activeNode;
                lastNewNode = NULL;
            }
        }
        /* ... */
    int k, maxHeight = 0, substringStartIndex = 0;
    doTraversal(root, 0, &maxHeight, &substringStartIndex);
    for (k = 0; k < maxHeight; k++)
        printf("%c", text[k + substringStartIndex]);
    if (k == 0)
        printf("No common substring");
    else
        printf(", of length: %d", maxHeight);
    printf("\n");
}

int main(int argc, char *argv[]) {
    size1 = 6;   /* length of "abcde#" */
    printf("Longest Common Substring in abcde and fghie is: ");
    strcpy(text, "abcde#fghie$");
    buildSuffixTree();
    getLongestCommonSubstring();
    freeSuffixTreeByPostOrder(root);

    size1 = 6;   /* length of "pqrst#" */
    printf("Longest Common Substring in pqrst and uvwxyz is: ");
    strcpy(text, "pqrst#uvwxyz$");
    buildSuffixTree();
    getLongestCommonSubstring();
    freeSuffixTreeByPostOrder(root);
    return 0;
}
Since the input strings are hard-coded in the program, it produces the following output when run:
5. Longest palindrome in a string
Algorithm:
1. Start by including the required header files.
2. Declare the required structure variables and pointer variables.
3. The (start, end) interval specifies the edge by which a node is connected to its
parent node. Each edge connects two nodes, one parent and one child, and the
(start, end) interval of a given edge is stored in the child node.
4. For leaf nodes, suffixIndex stores the index of the suffix for the path from root to leaf.
5. Take the input string and a pointer to the root node.
6. activeEdge is represented as an index of a character in the input string.
7. For the root node, suffixLink is set to NULL. For internal nodes, suffixLink
is set to root by default in the current extension and may change in the next
extension.
8. suffixIndex is set to -1 by default, and the actual suffix index is set later
for leaves at the end of all phases.
9. Change the activePoint for walk down (APCFWD) using the Skip/Count Trick (Trick
1). If activeLength is greater than or equal to the current edge length, set the next
internal node as activeNode and adjust activeEdge and activeLength accordingly to
represent the same activePoint.
10. The module then performs the required operation of finding the longest
palindrome in a string.
11. Display the appropriate output.
12. Exit the program.
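CODING AND EXECUTION
#include <stdio.h>
#include <string.h>

/* Prints the characters str[low..high]. */
void printSubStr(char *str, int low, int high) {
    int i;
    for (i = low; i <= high; ++i)
        printf("%c", str[i]);
}

/* Searches str for its longest palindromic substring. */
int longestPalSubstr(char *str) {
    int maxLength = 1;
    int start = 0, len = strlen(str), i, low, high;
    for (i = 1; i < len; ++i) {
        /* ... */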
6. REFERENCES
We consulted the following online resources and used them in our project.
https://en.wikipedia.org/wiki/Suffix_array
https://en.wikipedia.org/wiki/Longest_common_substring_problem
https://en.wikipedia.org/wiki/Longest_palindromic_substring
http://www.geeksforgeeks.org/suffix-tree-application-1-substring-check/
http://www.geeksforgeeks.org/suffix-tree-application-6-longest-palindromic-substring/
http://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/