Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection

ABSTRACT: In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to the forensic analysis of computers seized in police investigations.

Introduction to Data Mining
• Data mining refers to extracting knowledge from data.
• It is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
• The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Clustering-Introduction
What is clustering? Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.

Forensic Computing-Introduction
• Digital forensics is a branch of forensic science encompassing the recovery and investigation of material found in digital devices, often in relation to computer crime.
• Computer forensics is the application of investigation and analysis techniques to gather and preserve evidence from computing devices.

Text Mining-Definition
• Text mining, also referred to as text data mining, is roughly equivalent to text analysis.
• Text mining is the analysis of data contained in natural language text.
• The application of text mining techniques to solve problems is called text analysis.
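Before text can be clustered, each document is usually turned into numbers. The sketch below shows one simple way to do that, assuming a plain term-frequency (bag-of-words) representation; the slides do not say which representation is used, and the two sample documents and the helper names `term_frequencies` and `to_vector` are made up for illustration.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order so documents are comparable."""
    return [counts.get(term, 0) for term in vocabulary]

# Two made-up documents, for illustration only.
docs = ["Invoice for the wire transfer", "The transfer of the invoice"]
counts = [term_frequencies(d) for d in docs]

# The shared vocabulary is the sorted set of all terms seen in any document.
vocabulary = sorted(set().union(*counts))
vectors = [to_vector(c, vocabulary) for c in counts]

print(vocabulary)  # ['for', 'invoice', 'of', 'the', 'transfer', 'wire']
print(vectors)     # [[1, 1, 0, 1, 1, 1], [0, 1, 1, 2, 1, 0]]
```

Once every document is a vector of the same length, distances between documents can be computed, which is what a clustering algorithm needs.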

EXISTING SYSTEM:
• Clustering algorithms are typically used for exploratory data analysis, where there is little or no prior knowledge about the data.
• This is the case in computer forensics: more precisely, it is likely that a new data sample would come from a different population than earlier ones.
• By clustering the documents first, one can avoid the hard task of examining every document individually, although that can still be done if desired.

DISADVANTAGES OF EXISTING SYSTEM:
• The literature on computer forensics only reports the use of algorithms that assume that the number of clusters is known and fixed a priori by the user.
• A common approach in other domains involves estimating the number of clusters from data.

K-means Algorithm Definition
• The K-means algorithm is a method to cluster objects, based on their attributes, into k partitions.
• It assumes that the k clusters exhibit Gaussian distributions.
• It assumes that the object attributes form a vector space.
• The objective it tries to achieve is to minimize the total intra-cluster variance.
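To make the objective concrete, here is a minimal sketch that evaluates it for a given partition, assuming the usual reading of "total intra-cluster variance" as the sum of squared Euclidean distances from each point to its cluster centroid. The function names and the two toy partitions are illustrative, not taken from the slides.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def intra_cluster_variance(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total

# The same four points, partitioned in two different ways.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(intra_cluster_variance(tight))  # 4.0
print(intra_cluster_variance(loose))  # 100.0
```

K-means prefers the first partition because it has the lower objective value.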

K-means example
• Problem: Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
• Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as ρ(a, b) = |x2 – x1| + |y2 – y1|.
• Use the k-means algorithm to find the three cluster centers after the second iteration.

Solution:

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)
A2 (2, 5)
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

First we list all points in the first column of the table above. The initial cluster centers (means) are (2, 10), (5, 8) and (1, 2), chosen randomly. Next, we calculate the distance from the first point (2, 10) to each of the three means, using the distance function:

Solution (cont.):

point (x1, y1) = (2, 10), mean1 (x2, y2) = (2, 10)
ρ(point, mean1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0 + 0 = 0

point (x1, y1) = (2, 10), mean2 (x2, y2) = (5, 8)
ρ(point, mean2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3 + 2 = 5

point (x1, y1) = (2, 10), mean3 (x2, y2) = (1, 2)
ρ(point, mean3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1 + 8 = 9
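The three hand calculations can be checked with a few lines of Python. This is a small sketch of the distance function ρ defined in the problem; the name `manhattan` is our own label for it.

```python
def manhattan(a, b):
    """rho(a, b) = |x2 - x1| + |y2 - y1| for points a = (x1, y1), b = (x2, y2)."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

point = (2, 10)
means = [(2, 10), (5, 8), (1, 2)]

# Distance from the first point to each of the three initial means.
distances = [manhattan(point, m) for m in means]
print(distances)  # [0, 5, 9]
```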

Analogously, we fill in the rest of the table and place each point in the cluster with the nearest mean:

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)   0             5             9             1
A2 (2, 5)    5             6             4             3
A3 (8, 4)    12            7             9             2
A4 (5, 8)    5             0             10            2
A5 (7, 5)    10            5             9             2
A6 (6, 4)    10            5             7             2
A7 (1, 2)    9             10            0             3
A8 (4, 9)    3             2             10            2
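The slides stop after the first assignment step, while the problem asks for the centers after the second iteration. The sketch below carries the example through, assuming that one iteration means "assign every point to its nearest mean, then recompute each mean as the average of its members", and that no cluster becomes empty (which holds for this data). The helper names are our own.

```python
def manhattan(a, b):
    """Distance function from the problem statement."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

def assign(points, means):
    """Index of the nearest mean for each point (ties go to the lowest index)."""
    return [min(range(len(means)), key=lambda j: manhattan(p, means[j]))
            for p in points]

def update(points, labels, k):
    """Recompute each mean as the coordinate-wise average of its members."""
    new_means = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        # Assumes every cluster has at least one member.
        new_means.append((sum(x for x, _ in members) / len(members),
                          sum(y for _, y in members) / len(members)))
    return new_means

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
means = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4, A7

for iteration in range(2):
    labels = assign(points, means)
    means = update(points, labels, k=3)

print(means)  # [(3.0, 9.5), (6.5, 5.25), (1.5, 3.5)]
```

After the first iteration the means are (2, 10), (6, 6) and (1.5, 3.5); in the second iteration A8 moves from cluster 2 to cluster 1, which gives the centers printed above.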

PROPOSED SYSTEM:
• We present an approach that applies document clustering algorithms to the forensic analysis of computers seized in police investigations.
• Clustering algorithms have been studied for decades, and the literature on the subject is huge.
• We chose a set of six representative algorithms to show the potential of the proposed approach, namely: the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link, and the cluster ensemble algorithm known as CSPA.
• To make the comparative analysis of the algorithms more realistic, two relative validity indexes have been used to estimate the number of clusters automatically from data.
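The three hierarchical variants named above differ only in how they measure the distance between two clusters: Single Link uses the closest pair of points, Complete Link the farthest pair, and Average Link the mean over all pairs. The following sketch illustrates just that difference on two made-up clusters, assuming Euclidean distance; it is not the full agglomerative procedure and the function names are our own.

```python
from itertools import product
from math import dist  # Euclidean distance, available in Python 3.8+

def pairwise(cluster_a, cluster_b):
    """All distances between a point of cluster_a and a point of cluster_b."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_link(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))

def average_link(cluster_a, cluster_b):
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)

# Two made-up clusters on a line, for illustration only.
a = [(0, 0), (1, 0)]
b = [(3, 0), (5, 0)]

print(single_link(a, b))    # 2.0
print(complete_link(a, b))  # 5.0
print(average_link(a, b))   # 3.5
```

At each step, an agglomerative algorithm merges the two clusters whose linkage distance is smallest, so the choice of linkage changes which clusters get merged first.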

ADVANTAGES OF PROPOSED SYSTEM:
• Evaluation of the proposed approach in applications shows that it has the potential to speed up the computer inspection process.

ESTIMATING THE NUMBER OF CLUSTERS FROM DATA
A widely used relative validity index is the so-called silhouette. Let us consider an object i belonging to cluster A. The average dissimilarity of i to all other objects of A is denoted by a(i). Now let us take into account a cluster C other than A. The average dissimilarity of i to all objects of C will be called d(i, C). After computing d(i, C) for all clusters C ≠ A, the smallest one is selected, i.e., b(i) = min d(i, C), C ≠ A. This value represents the dissimilarity of i to its neighbor cluster, and the silhouette for a given object, s(i), is given by:
s(i) = (b(i) – a(i)) / max{a(i), b(i)}
Thus, the higher s(i), the better the assignment of object i to a given cluster.
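The definitions above translate almost directly into code. The sketch below computes s(i) and then compares two candidate partitions of the same toy data by their average silhouette, which is one plausible way to use the index for choosing the number of clusters. The data, the function names, and the convention s(i) = 0 for an object that is alone in its cluster are assumptions of this example; it also assumes at least two clusters and that a(i) and b(i) are not both zero.

```python
def mean_dissimilarity(p, others, dist):
    """Average dissimilarity of p to a non-empty list of objects."""
    return sum(dist(p, q) for q in others) / len(others)

def silhouette(i, points, labels, dist):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for the object at index i."""
    own = labels[i]
    same = [q for j, q in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumed convention for singleton clusters
    a = mean_dissimilarity(points[i], same, dist)
    # b(i): smallest average dissimilarity to any other cluster C != A.
    b = min(
        mean_dissimilarity(
            points[i], [q for j, q in enumerate(points) if labels[j] == c], dist)
        for c in set(labels) if c != own
    )
    return (b - a) / max(a, b)

def average_silhouette(points, labels, dist):
    return sum(silhouette(i, points, labels, dist)
               for i in range(len(points))) / len(points)

# Toy one-dimensional data with two obvious groups.
points = [0, 1, 10, 11]
absolute = lambda p, q: abs(p - q)

candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
scores = {k: average_silhouette(points, labels, absolute)
          for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k)  # 2
```

Here the two-cluster partition scores about 0.90 and the three-cluster one about 0.45, so the silhouette favors two clusters, matching what the data looks like.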

CONCLUSION
• With the proposed approach, the number of clusters can be estimated automatically from the data, which helps to achieve good results.

THANK YOU


Luís Filipe da Cruz Nassif, Eduardo Raul Hruschka, "Document Clustering for Forensic Computing: An Approach for Improving ...

46. IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 8, NO. 1, JANUARY 2013 Document Clustering for Forensic Analysis: An Approach for ...
