Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Document clustering for forensic analysis

3.782 Aufrufe

Veröffentlicht am

project final review ppt report

Veröffentlicht in: Bildung, Technologie
  • Als Erste(r) kommentieren

Document clustering for forensic analysis

  1. 1. P R O J E C T I N T E R N A L G U I D E : J . J A G A D E E S W A R R E D D Y M . T E C H , A S S I S T A N T P R O F E S S O R , D E P A R T M E N T O F C O M P U T E R S C I E N C E A N D E N G I N E E R I N G , A . S . C . E . T . P R E S E N T E D B Y : T . S A N J A Y H A R S H A ( 1 0 G 2 1 A 0 5 A 5 ) , P . J A Y A K U M A R ( 1 0 G 2 1 A 0 5 7 9 ) , B . S R E E N I V A S A T E J A ( 1 1 G 2 5 A 0 5 0 1 ) , S K . Z A H I D ( 1 0 G 2 1 A 0 5 B 9 ) . DOCUMENT CLUSTERING FOR FORENSIC ANALYSIS:AN APPROACH FOR IMPROVING COMPUTER INSPECTION
  2. 2. AGENDA  1. Abstract  2. Introduction  3. Existing systems and Disadvantages  4. Proposed systems and Advantages  5. System Architecture  6. UML Diagrams  7. Modules and their explanation  8. System Configurations  9. I/O Screen Shots  10. Conclusion  11. Future enhancements
  3. 3. ABSTRACT  In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is difficult to be performed.  In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis.  The present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations.
  4. 4. FORENSIC COMPUTING - INTRODUCTION • Digital forensics is a branch of forensic science encompassing The recovery and investigation of material found in digital devices or often in relation to computer crime. • Computer forensics is the application of investigation and analysis techniques to gather and preserve them. • Volume of data in the digital world has increased from 161 hexabytes in 2006 to 988 hexabytes in 2010. • Has a direct impact in Computer Forensics
  5. 5. EXISTING SYSTEM  From a more technical viewpoint, our datasets consist of unlabeled objects—the classes or categories of documents that can be found are a priori unknown.  A new data sample would come from a different population.  In this context, use of cluster algorithms for finding latent patterns from text documents found in seized computers.
  6. 6. DISADVANTAGES OF EXISTING SYSTEM  The literature on Computer Forensics only reports the use of algorithms that assume that the number of clusters is known and fixed a priori by the user.  Aimed at relaxing this assumption, which is often unrealistic in practical applications, a common approach in other domains involves estimating the number of clusters from data.
  7. 7. Example of an unstructured data
  8. 8. Examples of structured and unstructured data
  9. 9. PROPOSED SYSTEM  Here, we decided to choose a set of (six) representative algorithm in order to show the potential of the proposed approach, namely:  Partitional K-means and K-medoids,  Classical hierarchical algorithms like Single/Complete/Average Link, and  the cluster ensemble algorithm known as CSPA.  These algorithms were run with different combinations of their parameters, resulting in sixteen different algorithmic instantiations.
  10. 10. ADVANTAGES OF PROPOSED SYSTEM  Most importantly, we observed that clustering algorithms indeed tend to induce clusters formed by either relevant or irrelevant documents, thus contributing to enhance the expert examiner’s job.  Furthermore, our evaluation of the proposed approach in applications show that it has the potential to speed up the computer inspection process.
  11. 11. System Architecture
  12. 12. UML Diagrams Documents Preprocessing Term Frequency Similarity Calculation Cluster Formation Evaluating Query Results • Use Case Diagram of the system
  13. 13. Document Preprocessing Valid Similarity Computation Term Frequency Unconsidered Cluster Formation Query Results NO YES • Activity Diagram of the system
  14. 14. Data Flow Diagram Documents Preprocessing Term Frequency Similarity Calculation Cluster Formation Query Processing Evaluating Results
  15. 15. MODULES  Preprocessing Module  Calculating the number of clusters  Clustering techniques  Removing Outliers
  16. 16. Preprocessing:  In this, stopwords (prepositions, pronouns, articles, and irrelevant document metadata) have been removed.  Also, the Snow balls Stemming algorithm for Portuguese words has been used.  Documents are represented in a vector space model.  Term Variance (TV) is used to increase the effectiveness and efficiency of the clustering algorithm.  Uses two measures namely :  Cosine-based distance and Levenshtein-based distance.
  17. 17. Calculating the number of Clusters:  A widely used approach consists of getting a set of data partitions with different numbers of clusters and then selecting that particular partition that provides the best result according to a specific quality criterion (e.g., a relative validity index called Silhouettes).
  18. 18. • Uses both Classical hierarchical clustering algorithms and partitional algorithms like K-means. • let us assume that a set of data partitions with different numbers of clusters is available, from which we want to choose the best one. • Average dissimilarity of an object i to all objects will be called as d(i, C). • The smallest one is selected i.e.; b(i) = min d(i, C). • Average dissimilarity to its neighboring cluster and the silhouette for a given object s(i) is given by: • Thus, the higher the better the assignment of object to a given cluster that has the maximum average silhouette.
  19. 19. Clustering techniques:  Uses partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link, and the cluster ensemble based algorithm known as CSPA—are popular in the machine learning and data mining fields, and therefore they have been used in our study.
  20. 20. Removing Outliers:  We assess a simple approach to remove outliers. This approach makes recursive use of the silhouette.  Fundamentally, if the best partition chosen by the silhouette has singletons (i.e., clusters formed by a single object only), these are removed.
  21. 21. HARDWARE REQUIREMENTS  Processor - Pentium –IV  Speed - 1.1 GHz  RAM - 256 MB(min)  Hard Disk - 20 GB  Key Board - Standard Windows Keyboard  Mouse - Two or Three Button Mouse  Monitor - SVGA
  22. 22. SOFTWARE REQUIREMENTS  Operating System : Windows XP  Programming Language : JAVA  Java Version : JDK 1.6 & above.  Database : MySQL 4.5  IDE : Java Net Beans 7.4
  23. 23. An iterative semi-supervised machine learning approach
  24. 24. CONCLUSION • Using this proposed approach which can become an ideal application for document clustering to forensic analysis of computers, laptops and hard disks which are seized from criminals during investigation of police. • There are several practical results based on our work which are extremely useful for the experts working in forensic computing department.
  25. 25. FUTURE ENHANCEMENTS • Aimed at further leveraging the use of data clustering algorithms in similar applications, a promising venue for future work involves investigating automatic approaches for cluster labelling. • Inquiry on the possibility of endowing in the cluster calibration procedure semantic . • Another line of research could consist in using supervised learning tools to categorize data on already defined categories for investigative purposes.
  26. 26. REFERENCE  Louis Filipe da Cruz Nassif and Eduardo Raul Hruschka “Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection” - IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 8, NO. 1, JANUARY 2013.