DHS COPLINK Data Mining 2003

Published on March 7, 2008

Author: Danielle

Source: authorstream.com

Hsinchun Chen, Ph.D., Director, COPLINK Center of Excellence, Artificial Intelligence Lab, University of Arizona
Crime Data Mining and Visualization for Intelligence and Security Informatics: The COPLINK Research
Acknowledgement: NSF, CIA, NIJ, COPS, TPD, PPD, KCC

Outline:
COPLINK Data Mining and Visualization Framework
COPLINK Testbed: Data Characteristics
COPLINK Connect and Detect Systems: Using COPLINK Data for Information Sharing and Crime Relationship Identification
COPLINK Visual Data Mining Research: Crime Visualization, Agent, Deception Detection, Criminal Network Analysis

Outline:
The COPLINK Crime Data Mining and Visualization Framework

Introduction:
The concern about national security has increased significantly since the terrorist attack on September 11, 2001
Intelligence agencies such as the CIA and FBI are actively collecting and analyzing information to investigate terrorists' activities
Local law enforcement agencies have also become more alert to criminal activities in their own jurisdictions that may be relevant to national security

Challenge:
The difficulty of analyzing the large volumes of data involved in criminal and terrorist activities
Some criminal activities are highly organized and relevant data can be voluminous, yet diffuse in geography and time span
Hard to see the overall picture until tragic events happen
New crime types emerge as technology evolves, e.g., cybercrimes can be difficult to detect because busy network traffic and frequent online transactions generate large amounts of data but only a tiny portion is related to criminal activities

KDD:
Knowledge discovery and dissemination (KDD) techniques hold the promise of making it easy, convenient, and practical to explore very large databases
We present a general research framework and suggest high-impact challenge problems for KDD
Two dimensions: (1) Crime types and security concerns; (2) Crime analysis approaches and techniques

Crime Types at the Local Law Enforcement Level (1):
Traffic violations: offenders are cited or arrested when traffic violations are discovered by police officers
Sexual assault and other sexual offenses (e.g., child molesting)
Theft: illegal seizure of properties (e.g., robbery, burglary, larceny, motor vehicle theft, etc.)
Fraud: intentional perversion of truth in order to induce another to part with something of value or to surrender a legal right, e.g., forgery and counterfeiting, embezzlement, and identity deception

Crime Types at the Local Law Enforcement Level (2):
Gang/drug offense: illegal sales or possession of drugs. Organized criminal activities are frequently found (e.g., with gangs) and can be traced through various sources of evidence (e.g., persons involved, vehicles, locations)
Violent crime: criminal activities that involve the use of force or armed weapons (e.g., guns, narcotics, bombs). Typically, behavior of the criminals can be traced and location and time of incident are critical in identifying the suspects

Crime Types at the National Security Level (1):
Sex crime: Prostitution can be an organized crime that involves more than one country. Examples include the illegal trading of prostitutes, organized pedophilia, etc.
Theft: The theft of national secret or weapon information can cause severe damage on the national or international level.
Fraud: It refers to deceptive behavior conducted in an illegal way. Specific crime types include transnational money laundering, identity fraud, and transnational financial fraud.
Crime Types at the National Security Level (2):
Gang/drug offenses: drug trafficking conducted by organized gangs across national borders is an important type of crime in this category
Other types: setting up and running international criminal organizations (e.g., the Mafia in Italy and the U.S., Yakuza in Japan, and the Chinese Triads in Hong Kong)
Violent crime: the act of terrorism. Examples include bombing, hijacking, bioterrorism, etc.
Terrorism: the unlawful use of force or violence against persons or property to intimidate or coerce a government, the civilian population, or any segment thereof, in furtherance of political or social objectives (FBI definition).

Crime Types at the National Security Level (3):
Cybercrime: computer-mediated activities which are illegal and which can be conducted through global electronic networks
Owing to the pervasiveness of the Internet, cybercrime can occur on both local and national levels
The intentions of cyber-criminals can be political, social, or financial
Examples of cybercrime include Internet frauds, network intrusion, cyber-piracy, cyber-pornography, theft of confidential information, hate crime (race and religion), etc.
Crime Analysis Approaches and Techniques (1):
Association Rules Mining: the process of discovering frequently occurring criminal elements in a database. Intrusion detection: to identify patterns of program executions and user activities as association rules
Classification: the process of finding the common properties among different crime entities and classifying them into groups
Clustering: the process of grouping criminal items into classes of similar characteristics

Crime Analysis Approaches and Techniques (2):
Social Network Analysis: establish a network that illustrates the roles of criminals, the flow of tangible/intangible goods and information, and the associations among these entities
Sequential Pattern Mining: find frequently occurring sequences of items over a set of transactions that occurred at different times (e.g., to detect temporal pattern of network attack)
String Comparators: algorithms used in fraud and deception detection
Entity Extraction: the process of identifying patterns of particular types from unstructured data such as text, image, or audio materials

A KDD Research Framework for Crime Data Mining:

Challenge Problems:
How can criminal identities and events be detected and extracted automatically and correctly across different crime types and from different media sources?
How can criminal and intelligence patterns be identified automatically and correctly as clusters and associations?
How can criminal and intelligence patterns be classified automatically and correctly and used for future event prediction?
How can criminal and intelligence analysis results be summarized and presented in an intuitive and effective visual format for analysts?
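
The association rules mining technique listed above ("discovering frequently occurring criminal elements in a database") can be illustrated with a minimal sketch. It is restricted to pairs of entities and uses plain support and confidence counts; the function name `pair_rules`, the thresholds, and the sample incident records are assumptions made for illustration and are not taken from COPLINK.

```python
from collections import Counter
from itertools import combinations


def pair_rules(incidents, min_support=2, min_confidence=0.6):
    """Find rules "lhs -> rhs" between entities that co-occur in incidents.

    incidents: iterable of lists of entity labels (one list per incident).
    Returns tuples (lhs, rhs, support, confidence), best confidence first.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for entities in incidents:
        unique = sorted(set(entities))  # count each entity once per incident
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), support in pair_counts.items():
        if support < min_support:
            continue  # pair does not co-occur often enough
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Hypothetical incident records: each lists the entities named in one report.
sample_incidents = [
    ["person:A", "vehicle:V1", "location:L1"],
    ["person:A", "vehicle:V1"],
    ["person:A", "location:L2"],
    ["person:B", "vehicle:V1"],
]
sample_rules = pair_rules(sample_incidents)
```

Here support is the number of incidents in which both entities appear, and confidence is that count divided by the number of incidents mentioning the left-hand entity. A production miner would also handle larger itemsets (for example with Apriori), which this sketch leaves out.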
Outline:
COPLINK Testbed: Data Characteristics

Tucson PD Data Sources:
TPD Record Management System: Stores a wide range of information from incident reports to warrants to pawn tickets, from person descriptions to vehicles to weapons and property items. Incident data goes back as early as 1983. Database: Litton PRC RMS31 on Oracle 7.3, Compaq OpenVMS
TPD Mug Shot Database: Stores about 90,000 mug shots taken by the ID Department. Database: ImageWare on SQL Server 7.0, Windows NT 4.0 Server
TPD Gang Database: Stores comprehensive information about 3,200 gang members: their activities, aliases, physical descriptions, vehicles, etc. Database: In House Access 97, Windows NT 4.0 Server

Tucson PD RMS Documents:
Incident Reports: Report number, crime type, precinct, MOs, date and time.
Pawn Tickets: Ticket number, date and time.
Warrants: Warrant number, docket number, type and issue date.
Field Interviews: FI number, type, precinct, date and time.

Tucson PD RMS Data Objects:
Person: True names, aliases, descriptions, addresses, IDs, marks and phone numbers.
Organization: Name, address and phones.
Vehicle: VIN, license plate, make, model, style, year and colors.
Property: Serial number, type, make, model, size and colors.
Weapon: Serial number, type, manufacturer, caliber and colors.

COPLINK Database: Tucson PD

Phoenix PD Data Sources:
Police Automated Computer Entry System (PACE): Stores a wide range of information including incident reports, citations, field interrogations, person descriptions, vehicles, property items, and weapons. Seven years of data (1996 to 2002) are extracted into Coplink 2.5. Database: Unisys DMS II on Unisys Clearpath system

Phoenix PD RMS Documents:
Incident Reports: Report number, crime type, precinct, MOs, date and time.
Citations: Citation number, type, charges, precinct, date and time.
Arrests: Booking number, type, charges, date and time.

Phoenix PD RMS Data Objects:
Person: True name, aliases, descriptions, addresses, IDs, marks and phones.
Organization: Name, address and phones.
Vehicle: VIN, license plate, make, model, style, year and colors.
Property: Serial number, type, make, model, size and colors.
Weapon: Serial number, type, manufacturer, caliber and colors.

COPLINK Database: Phoenix PD

TPD Data vs. PPD Data:
Area coverage: TPD data represent criminal data in Tucson area only. PPD data comprise incident data from several adjacent agencies: Phoenix PD, Scottsdale PD and Glendale PD.
Narrative data: PPD has a much more comprehensive collection of narrative data: 1,800,000 at PPD vs. 300,000 at TPD.
Property data: PPD has a more thorough collection of property data: 2,000,000 at PPD vs. 400,000 at TPD.
Mug shot data: While TPD data include mug shot information, PPD data do not.

Data Scrubbing (KDD testbed):
Data scrubbed: person last names, person IDs, phones and extensions, street and apartment numbers, vehicle license plates.
All scrubbed names remain meaningful, e.g., person names (Johnson, Martinez).
To maintain data consistency, the uniqueness of each entity is preserved after scrubbing, i.e., the same original entity has the same scrubbed entity.
Narratives are excluded from the testbed because there appears to be no reliable way to identify and scrub person names, person IDs, phones, addresses, and license plates in the narrative text.
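
The consistency requirement in the data scrubbing slide (the same original entity always receives the same scrubbed entity, and distinct entities stay distinct) can be sketched as a small lookup-based scrubber. This is only an illustration under stated assumptions: the class name `ConsistentScrubber`, the replacement pool, and the numeric suffix used once the pool is exhausted are hypothetical and do not describe how the COPLINK testbed was actually scrubbed.

```python
# Hypothetical pool of realistic replacement surnames (the slide names
# "Johnson" and "Martinez" as examples; the rest are assumptions).
SURNAME_POOL = ["Johnson", "Martinez", "Nguyen", "Smith", "Garcia", "Brown"]


class ConsistentScrubber:
    """Replace sensitive values while preserving entity uniqueness."""

    def __init__(self, pool):
        self._pool = list(pool)
        self._mapping = {}  # normalized original value -> scrubbed value

    def scrub(self, original):
        key = original.strip().upper()  # treat case/spacing variants as one entity
        if key not in self._mapping:
            index = len(self._mapping)
            base = self._pool[index % len(self._pool)]
            cycle = index // len(self._pool)
            # After the pool is used up, append a number so that distinct
            # originals never collapse onto the same scrubbed value.
            self._mapping[key] = base if cycle == 0 else f"{base}{cycle}"
        return self._mapping[key]


scrubber = ConsistentScrubber(SURNAME_POOL)
first = scrubber.scrub("Alvarez")
again = scrubber.scrub("alvarez ")
other = scrubber.scrub("Baker")
```

Because the mapping is kept in memory, the same scrubber instance has to be used for every table that refers to the entity; otherwise joins across tables would no longer line up after scrubbing.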
COPLINK Documentation:
Sample COPLINK ERD (Entity Relationship Diagram)

COPLINK Documentation:
COPLINK Data Dictionary: 217 tables, 1000 attributes

COPLINK Data Formats:
Delimited ASCII text files
SQL Server 2000 backup file
SQL Server 2000 detached database
Oracle 8i/9i dump file
Oracle 8i/9i transportable tablespace
DB2 UDB 7 backup file
TPD/PPD data available: 4/1/2003

Outline:
COPLINK Connect and Detect Systems: Using COPLINK Data for Information Sharing and Crime Relationship Identification

COPLINK Progression:
1990-present: NSF CISE funding (IIS, Digital Government, Digital Library, NSDL, ITR, IDM, CSS), NIH/NLM, DARPA
1997: NIJ COPLINK funding; Web-enabled data warehousing
NIJ AGILE interoperability funding; information sharing
NSF Digital Government funding; data/text mining, agents, and knowledge management; COPLINK Center for Excellence
NSF/CIA KDD funding; national security research, COPLINK testbed; Border Safe research
Goal: A model and testbed for law enforcement and national security research

COPLINK Recognitions:
Time Magazine Global Business, December 23, 2002: "Data Miners" Americans got a glimpse of how such a system might work this fall during the Washington-sniper investigation.
Life Week Magazine, November 18, 2002
The Washington Post, November 7, 2002: "A Missing Link Most Wanted" Linking facts in the sniper case will be a big test of what Coplink can do.
The New York Times, November 2, 2002: "An Electronic Cop That Plays Hunches" It is an Internet-based system called Coplink, developed at an artificial intelligence laboratory…
Tucson Citizen, October 23, 2002: "Tucson Cops, local software to help in D.C. sniper probe" A computer database system that Tucson police employ in crime investigations will be used in the hunt for the Washington, D.C.-area sniper or snipers.
Arizona Daily Star, October 23, 2002: "Sniper probe to get help from Tucson" A program developed by the University of Arizona will be used to try to capture the Washington, D.C., area sniper.
The Innovation Groups, August 5, 2002: "Regional Information Sharing Project for Huntsville, Texas Law Enforcement Agencies" The city of Huntsville, TX recently granted a contract to implement COPLINK.
Los Angeles Times, May 20, 2002: "Making a Digital Government" Lawrence Brandt's latest job is to get federal agencies to share technology and information.
KMWorld, March 2002: Law enforcement is an information-intensive process, beginning with data collection at crime scenes and extending through records management and analysis of data to support crime-solving.
DG Online, December 2001: "Super Detective" When University of Arizona professor Hsinchun Chen combined police databases for a consortium of city police agencies, a super-detective was born.
POLICE Magazine, March 2001: Coplink Shifts and Shares Information – Fast.
Law Enforcement Technology Magazine, March 2001: Software For Data Searchers.
The POLICE CHIEF, March 2001: Information Sharing System "Coplink".
United Daily News (Taiwan), February 2, 2001: AI Lab's Chinese semantic retrieval system is the engine behind UDN's (United Daily News) acclaimed intelligent news search service.
Tucson Citizen, January 17, 2001: "Use of COPLINK spreads, fuels company's growth"
Arizona Daily Star, January 7, 2001: "Technology developed in Tucson is helping police catch criminals faster."
FCW.com, April 03, 2000: Changing the Rules of the Game. How Coplink is Helping Police Departments Match Evidence Across Boundaries of Time and Space.
TechBeat, August 1999: It's called a "Web-based intuitive integrated interface." But in layman's terms it's called "Coplink." What it will do is help put an end to a serious problem faced by law enforcement every day... the inability to exchange information about criminal cases across jurisdictions.
NLECTC, June 1999: No tool similar to Coplink has been available previously because the technology that would foster this kind of connectivity and interoperability did not exist.
Government Computer News, January 1998: "COPLINK intranet (designed by the AI Lab) will bring Arizona crime fighters to the data they need."

Software Components:
(Source: Knowledge Computing Corporation, Tucson, AZ)

COPLINK Connect:
Consolidating and sharing information promotes problem solving and collaboration
Records Management Systems (RMS)
Mugshots Database
Gang Database

COPLINK Connect Functionality:
Generic, common XML-based representation of criminal elements
Data migration (batch and incremental) and mapping for all major databases and legacy systems
Database independent: ODBC-compliant data warehouse
Multi-layered Web-based architecture: database server, Web server, browser
Powerful and flexible search tools for various reports, e.g., incidents, warrants, pawns, etc.
Graphical browser-based GUI for ease of use, training and maintenance
H. Chen, J. Schroeder, R. V. Hauck, L. Ridgeway, H. Atabakhsh, H. Gupta, C. Boarman, K. Rasmussen, and A. W. Clements, “COPLINK Connect: Information and Knowledge Management for Law Enforcement,” Decision Support Systems, Special Issue on Digital Government, 2002, forthcoming.
COPLINK Detect:
Consolidated information enables targeted problem solving via powerful investigative criminal association analysis

COPLINK Detect Functionality:
Simple association rule mining applied to relationships among criminal elements
Generic, common XML-based representation for criminal relationships
Incremental data migration and association analysis on databases
Supports powerful, multi-attribute queries using partial crime information
Graphical browser-based GUI for simple crime relationship analysis and case retrieval
H. Chen, D. Zeng, H. Atabakhsh, W. Wyzga, J. Schroeder, “COPLINK: Managing Law Enforcement Data and Knowledge,” Communications of the ACM, 2002, forthcoming.

COPLINK Detect 2.0/2.5:

COPLINK Connect/Detect Status:
Systems widely adopted for law enforcement information sharing and analysis. Commercialized and supported by KCC
Systems deployed at: Tucson, Phoenix, Salt River (AZ), Huntsville (TX), Polk County/Des Moines (IA), Ann Arbor (MI), Montgomery (DC), Henderson County (NC), Boston (MA), Redmond, Spokane (WA), Shawnee County (KS)
Under deployment/development: San Diego (CA), Pima County (AZ), Philadelphia (PA), Hennepin County (MN), US Customs Border Patrol (AZ), Middlesex County (NJ), State of Alaska, State of Hawaii

Outline:
COPLINK Visual Data Mining Research: Crime Visualization, Criminal Network Analysis, Deception Detection, Agent

COPLINK Visual Data Mining Research:
COPLINK Criminal Network Analysis: Association Tree, Association Network Analysis, Temporal-Spatial Visualization
P1000: A picture is worth 1000 words. Use visual representations and effective HCI to assist in more efficient and effective crime analysis
Leverage different representations and algorithms: hyperbolic trees, network placement algorithms, structural analysis, geo-spatial mapping, time visualization
H. Chen, D. Zeng, H. Atabakhsh, W. Wyzga, J. Schroeder, “COPLINK: Managing Law Enforcement Data and Knowledge,” Communications of the ACM, 2002, forthcoming.

Existing Network Analysis Tools:
First generation (manual approach): Anacapa Chart (Harper & Harris, 1975)
Second generation (graphics-based approach): Analyst’s Notebook, Netmap, Watson; COPLINK hyperbolic tree view, network view
Third generation (structural analysis approach)

Anacapa Chart (1st generation):
Association matrix and link chart
Manually extract criminal associations from data files
Construct an association matrix and draw a link chart based on the association matrix

Analyst’s Notebook, Netmap, Watson (2nd generation):

COPLINK Association Tree and Network (2nd generation):

COPLINK Criminal Structural Analysis (3rd generation):
Criminal association identification: using shortest-path algorithms to find the strongest associations between two or more criminals in a network
SNA (Social Network Analysis): using blockmodel analysis to detect subgroups and patterns of interactions between groups; identifying leaders, gatekeepers, and outliers from a criminal network
J. Xu & H. Chen, “Criminal Network Analysis: A Data Mining Perspective,” AI Lab Technical Report, 2002.
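
The idea of using shortest-path algorithms to find the strongest association between two criminals can be sketched as follows. The sketch assumes each link carries an association strength in (0, 1] and that the strength of a chain is the product of its link strengths; taking the negative logarithm of each strength then turns "strongest chain" into an ordinary shortest-path problem that Dijkstra's algorithm solves. Both the weighting scheme and the names `strongest_path` and `sample_network` are assumptions for illustration; the slides do not specify how COPLINK weights associations.

```python
import heapq
import math


def strongest_path(graph, source, target):
    """Return (strength, path) of the strongest chain from source to target.

    graph: dict mapping node -> {neighbor: strength}, strengths in (0, 1].
    Returns (0.0, []) when the target cannot be reached.
    """
    best = {source: 0.0}      # lowest accumulated -log(strength) per node
    previous = {}             # back-pointers for path reconstruction
    heap = [(0.0, source)]
    visited = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, strength in graph.get(node, {}).items():
            new_cost = cost - math.log(strength)  # multiply strengths via logs
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))

    if target not in best:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return math.exp(-best[target]), path


# Hypothetical network: A and C are weakly linked directly (0.5) but
# strongly linked through B (0.9 * 0.8 = 0.72).
sample_network = {
    "A": {"B": 0.9, "C": 0.5},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"A": 0.5, "B": 0.8},
    "D": {},
}
strength_ac, path_ac = strongest_path(sample_network, "A", "C")
```

In the sample network the indirect chain through B is preferred over the weaker direct link, which is the kind of non-obvious association an analyst would want surfaced.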
A 9/11 Terrorist Network:

The proposed framework:

Experiment:
Data sets: TPD incident summaries
Time period: narcotics, 2000-present; gangs, 1995-present
Size: two testing networks, narcotics (60 individuals) and gang (24 individuals)

The COPLINK SNA Project (The narcotic network example):

The COPLINK SNA Project (The gang network example):

Patterns Found:
The chain structure of the narcotic network. Implications: disrupt the network by breaking the chain
The star structure of the gang network. Implications: disrupt the network by removing the leader

Expert Validation:
White gangs who were involved in murders and shootings
White gangs who sold crack cocaine
A group of black gangs
“Yes, these two groups are together very often”
“(211) and (173) are best friends”
“He is very important. He has a lot of money and sells drugs. His girl friend brings a lot of dancers in the city and buy drugs”

COPLINK Spatial-Temporal Visualization: Timeline Tool:
Visualizes the chronologically ordered set of events associated with user-selected database entities
Events placed along horizontal axis
Entities placed along vertical axis
Entities can be grouped together; each row contains all events associated with the entities in a group
Time-based zooming: user can zoom into a specific time interval for more detail, while hiding uninteresting portions of the timeline

COPLINK Spatial-Temporal Visualization: GeoMapping Tool:
Plots location of incident events within a selected time interval
Zooming/panning capabilities
User-selectable GIS layers
Overview map: provides context to the currently selected region
Plot events over time: plot events as they occur, using different color shadings to indicate when each occurred relative to other events; or plot events as they occur and remove them after they are over, using directed arrows to highlight movement from one event to the next in time

COPLINK Spatial-Temporal Visualization: Periodic Pattern Tool:
Reveals periodic patterns of incident occurrence
Incident events are plotted continuously on a circular graph
Time period represented along circle (day, week, month, etc.)
Height from center indicates number of incidents that occurred at that specific time
Customizable granularity (e.g., year, month, day, etc.)
3-sigma statistical significance line: indicates an unusually large or small number of occurrences at a specific time

COPLINK Visual Data Mining Research:
Deception Detection, a data mining approach
“An agent must spell a suspect’s name exactly right, or the FBI computer will not recognize it. That can be particularly frustrating in cases such as the Sept. 11 probe, in which suspects have used multiple names and sometimes created identities by switching a few letters in their names.” – FBI
FBI’s problem with 9/11 suspect names, e.g., “Majed M.GH Moqed,” “Majed Moqed,” and “Majed Mashaan Moqed,” and DOB, e.g., “01-01-1976” and “03-03-1976.”
A deception taxonomy was created based on criminal deceptions in law enforcement databases
Patterns existed in criminal deceptions, e.g., SSN variations, name variations, etc.
Phonetic and syntactic string comparators are adopted
Promising initial testing result: 94% accuracy in deception detection
G. Wang, H. Chen, H. Atabakhsh, “Automatically Detecting Deceptive Criminal Identities,” Communications of the ACM, forthcoming, 2002.

A Taxonomy for Deceptions in Criminal Identity:

A Taxonomy of Deceptions in Criminal Identity: Name Deception:
Either false first name or false last name (62.5%)
Only the middle initial is changed (62.5%)
Similar pronunciation but different spelling (42%)
A completely false name (29.2%)
Using abbreviated names or adding extra letters (29.2%)
Leaving out the first name or last name (29.2%)
Exchanging last name and first name (8%)

A Taxonomy of Deceptions in Criminal Identity: DOB, SSN, Residency:
DOB and ID (SSN) deception: In most cases, criminals only make minor changes in DOB and SSN, e.g., 19700207 → 19700208
Residency deception: 42% of criminals in the collection deceived on address information. In most cases, only one portion of the address is changed slightly, e.g., street number.

String Comparators:
Phonetic: Russell SoundEx code, Newcombe [1959], encodes a name with a format having a prefix letter followed by a three-digit number, e.g., PEARCE and PIERCE both coded as “P620”.
However, phonetic matching is particularly poor at finding matches [Zobel and Dart 1996].
Spelling string comparator [Jaro 1976; Winkler 1990]: compares spelling variations between two strings instead of phonetic codes
Limitation: common characters in both strings must be within half the length of the shorter string

Other Approximate String Matching tool:
Agrep [Wu, Manber 1992]: A general string matching algorithm that can handle character variations of insertion, deletion, and substitution.
The pattern is represented as a bit array. The computation only involves simple bit operations (RightShift) and logic operations (AND, OR) on bit arrays.
$R^{d}_{j+1} = \mathrm{Rshift}[R^{d}_{j}] \ \mathrm{AND}\ S_c \ \mathrm{OR}\ \mathrm{Rshift}[R^{d-1}_{j} \ \mathrm{OR}\ R^{d-1}_{j+1}] \ \mathrm{OR}\ R^{d-1}_{j}$
Agrep has been integrated into Unix and been in wide use since June 1991

Algorithm Design:
Compare corresponding fields of each pair of records (disagreement): Sname, SDOB, Saddr, and SID
To capture different types of name deceptions,
Calculate the Normalized Euclidean Distance for the overall dissimilarity between two records, i.e., Disagreement =

Experimental Results (Training: 80 cases):
Table: Distance matrix; the distance value shows the degree of disagreement between each pair of records in the training data set.
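
The record-comparison step in the algorithm design can be sketched in code. The transcript did not preserve the disagreement formula, so this sketch assumes one common reading of "normalized Euclidean distance": the square root of the mean of the squared per-field disagreements over name, DOB, address, and ID. It also swaps in a plain Levenshtein edit distance, scaled by the longer string's length, as a stand-in for the phonetic and spelling comparators named in the slides. All function names and the sample records are hypothetical.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_disagreement(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    a, b = a.strip().upper(), b.strip().upper()
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else edit_distance(a, b) / longest


FIELDS = ("name", "dob", "addr", "id")


def record_disagreement(record_a, record_b):
    """Assumed form: sqrt of the mean squared field disagreement."""
    total = sum(field_disagreement(record_a[f], record_b[f]) ** 2 for f in FIELDS)
    return math.sqrt(total / len(FIELDS))


def is_probable_match(record_a, record_b, threshold=0.48):
    # 0.48 is the threshold reported in the slides for the authors' own
    # comparators; it is not calibrated for this stand-in comparator.
    return record_disagreement(record_a, record_b) < threshold


# Hypothetical records illustrating small, deliberate variations.
rec_1 = {"name": "John Pearce", "dob": "19700207",
         "addr": "1200 E Speedway", "id": "123456789"}
rec_2 = {"name": "Jon Pierce", "dob": "19700208",
         "addr": "1202 E Speedway", "id": "123456798"}
rec_3 = {"name": "Maria Lopez", "dob": "19850630",
         "addr": "77 W Congress St", "id": "987654321"}
```

With these sample records, `rec_1` and `rec_2` differ by only a few characters per field and fall below the threshold, while `rec_3` differs in every field and does not. Any real deployment would need the threshold re-tuned on labeled training pairs, as the experimental results slides describe.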
Experimental Results (Training: 80 cases):
Table: Determining best threshold value (0.48)

Experimental Results (Testing: 40 cases):
Table: Accuracy of deception detection when the best threshold value (0.48) is applied to the testing data set (40 records)

COPLINK Agent Research:
COPLINK Agent: alert and collaboration in a wireless architecture
Enhance police information timeliness, collaboration, mobility, and safety via a web-based wireless alerting system (under testing at TPD)
Real-time alert of time-critical information from multiple databases, e.g., CAD (computer-aided dispatching) database, MVD
Identify and inform officers/detectives who are working on similar cases
Push time-critical information via wireless and personalized communications, i.e., web alert, email, cell phone, and pager

COPLINK Agent: Wireless Alert and Collaboration:
Allows Patrol Officers to enhance their community expertise
Further promotes Officer safety through curbside knowledge
Secure wireless access and alert: laptop, PDA, pager, cell phone
Alert: 24-7 monitoring of time-critical information from different databases
Collaboration: Automatically informing detectives working on similar cases

COPLINK Agent: Vehicle Search Form:
Multi-DB Search
Alert Method
Notification setting

COPLINK Agent: Web and E-mail Collaboration Alerts:
Web Alert
Email Alert

COPLINK Agent: Cell Phone and Pager Alert:
Pager alert with case number
Cell phone alert

Agent User Study and Result Summary:
Study Design: Case study method based on structured interviews, archival records analysis, and usability survey.
Used the QUIS (Questionnaire for User Interaction Satisfaction) survey instrument developed by the HCI Lab at the U. of Maryland.
10 participants: crime analysts and detectives in several TPD units.
Positive feedback on system effectiveness and efficiency:
Monitoring: “… the information I have received back was instrumental in making at least 2 felony cases that will be prosecuted on the federal level.”
Collaboration from CAD Alert: “… allowing us to respond to incidents we know are important that the field units perhaps don’t realize in a timely manner.”
Multi-database Search: “The Tucson City Court Search was helpful because I located one of my suspects on her court date.”
High user satisfaction from QUIS survey items: averaged 5.5 for 49 items on a 7-point Likert scale (7: most useful).
Strengths: Offers good investigative power; easy-to-read layout; potential for collaborative information sharing; CAD integration; high intention to use.
Weaknesses: Lack of help messages; difficult for inexperienced users; obscure user preference settings.

Cyber Crime Analysis:
Cyber crime refers to computer-mediated activities which are illegal and which can be conducted through global electronic networks [1].
Crime types:
Cyber attacks: network intrusion, email bombing
Distribution of illegal materials in cyberspace: pirated software/CDs, child pornography, cyber fraud
[1] Thomas, D. and Loader, B.D., “Introduction - Cyber Crime: law enforcement, security and surveillance in the information age,” in Cyber Crime: Law enforcement, security and surveillance in the information age, Taylor & Francis Group, New York, NY, 2000.

Challenge and Solution:
Data collection: use web spiders and peer-to-peer software to download messages/files from the network.
Rule forming: ask domain experts to form rules to determine the legality and severity of the messages/files.
Identity tracing:
There are hundreds of millions of Internet users; cyber criminals use different IDs to hide their identities on the Internet.
Law enforcement agencies do not have enough resources to trace the activities of each suspicious ID; they must focus their efforts on the major suspects.
The authorship analysis technique can help identify the author of a message based on the person’s writing style.

Authorship Analysis:
Authorship analysis attempts to determine the likelihood of a particular author having written a piece of work based on some characteristics of the author [2].
The essence of this technique is the formation of a set of metrics, or forensics, that remain relatively constant for a large number of writings created by the same person [3].
In the cyber crime research context, this technique can help determine whether a set of illegal Internet messages belong to the same user based on the person’s writing style.
[2] A. Gray, P. Sallis, and S. MacDonell, “Software forensics: Extending authorship analysis techniques to computer programs,” in Proc. 3rd Biannual Conf. Int. Assoc. of Forensic Linguists (IAFL'97), pages 1-8, 1997.
[3] O. de Vel, “Mining e-mail authorship,” in Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD) 2000.

Experiment – Data Collection:
An experiment was conducted to test the prediction accuracy of the authorship analysis algorithm. Two types of data were used:
70 email messages: 3 students provided 20-30 email messages each. Messages were randomly chosen by their authors and covered a variety of topics.
153 newsgroup messages: 3 popular USENET newsgroups related to software trading were selected.
misc.forsale.computers.other.software
misc.forsale.computers.pc-specific.software
misc.forsale.computers.mac-specific.software
9 users who frequently posted messages in the 3 newsgroups were chosen.
Messages posted by these users were manually checked, with help from domain experts, to determine whether they were illegal (i.e., involving sales of pirated software).
10-30 messages containing illegal content were manually downloaded per user.

Experiment – Feature Extraction:
Previous research suggested that style markers and structural features are good indicators of an author’s style [3].
Three types of message text features were used in this experiment to determine the authorship:
Style Markers (205 features): average sentence length, total number of characters, total number of punctuation marks, etc.
Structural Features (11 features): has a greeting, has a salutation, position of reply text, number of attachments, etc.
Content-specific Features (9 features, for newsgroup messages only): has a list of products, position of price (in subject, in body, in list), etc.
Style markers were extracted automatically using programs. Structural and content-specific features were extracted manually.
[3] O. de Vel, “Mining e-mail authorship,” in Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD) 2000.

Experiment – Classification Results:
A Support Vector Machine classifier [4] was used to predict the authorship of the messages based on the extracted features.
The 10-fold cross-validation method was used.
Improvement in accuracy was observed with different combinations of message features.
[4] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multi-class support vector machines,” IEEE Transactions on Neural Networks, 13, pages 415-425, 2002.

Implication:
A list of the most active cyber criminals can be compiled based on the number and severity of illegal messages they post.
Law enforcement agencies can assign resources accordingly to target those criminals at the top of the list.
The remaining challenge is how to validate the results of such an experiment, or any cyber crime research, so they can be used as grounds for prosecution.
A comprehensive validation method is necessary before research findings could be presented as evidence in court.

For project information: http://ai.bpa.arizona.edu/COPLINK
hchen@eller.arizona.edu
