Published on March 10, 2014
Abaca: Technically Assisted Sensitivity Review of Digital Records

Agenda
● Transferring of Records to Archives
● The Digital Problem
● The Abaca Project
● Abaca Classifier Experiment
● The Test Collection
● The Abaca Project - Where Next?
● Break-Out Group Session
● Group Discussion

Transferring of Records to Archives
● Department selects and appraises records for permanent preservation
  – In paper, about 5% of output is selected; for digital this may rise to 20%
● Prior to transfer, the department must complete a sensitivity review
  – Paper review is well understood
  – Digital presents many new challenges and is not so well understood
● Hence our research!

The Digital Problem
● The file has gone
● Volume will increase
  – The way business is done has changed
  – Largely unstructured despite EDRMs
● Big transfers of departmental records
● Appraisal
  – Separate issue not addressed today
● Precautionary closure
  – Need to research a solution
● Not unique to public records

Our Approach
● Provide a Framework of Utilities ...
  – to assist the Review Process
● Need Methods ...
  – that respect the reality of Digital Records in all their “Glory”
  – that can be tailored to specific circumstances
● Need tools ...
  – to help reviewers be more productive

The Abaca Project
● Research to show that utilities will help
● Two Phases
  – Proof of Concept (In Progress)
  – Full Project (Seeking external funding)
● Today we are describing our proof-of-concept work
● Abaca: Technically Assisted Sensitivity Review of Digital Records

Abaca Classifier Experiment
● Overview of the Task & Approach
● Predicting Exemptions using a Classifier
  – Features
  – Types of Features
● Example Sensitive Document
● Research Question
● Overview of Classification
● Evaluation Methodology
● Results

The Task
Produce a classifier that can predict the presence of sensitive material within unstructured text.
Initially focusing on two FOIA sensitivities:
● Section 27: International Relations
● Section 40: Personal Information

Approach
● Manually review sensitive data to create a test collection.
● Split the test collection into training and test sets.
● Train a classifier to predict the sensitivities in documents using the set of identified features.
● Test the classifier on previously “unseen” documents.
● Measure classification success.
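
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name the learning algorithm, so a small multinomial Naive Bayes model stands in for it, and the eight labelled "documents", the label names and the function names (`split_collection`, `train`, `predict`) are all made up for the example.

```python
import math
import random
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def split_collection(docs, test_fraction=0.25, seed=0):
    """Shuffle labelled documents and split them into training and test sets."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train(training_docs):
    """Learn per-class word counts (an assumed Naive Bayes stand-in)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in training_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary

def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, hand-written stand-in for a manually reviewed collection.
collection = [
    ("meeting with the ambassador about treaty negotiations", "sensitive"),
    ("ambassador reports on confidential treaty talks", "sensitive"),
    ("minister briefed on treaty negotiations with the embassy", "sensitive"),
    ("confidential embassy cable about the ambassador", "sensitive"),
    ("canteen menu for next week", "not_sensitive"),
    ("reminder about the office car park", "not_sensitive"),
    ("new printer installed on the second floor", "not_sensitive"),
    ("office closed for the bank holiday next week", "not_sensitive"),
]

training_set, test_set = split_collection(collection)
model = train(training_set)
correct = sum(predict(model, text) == label for text, label in test_set)
print(f"{correct} of {len(test_set)} unseen documents classified correctly")
```

With so few documents the printed score means nothing; the point is only the order of the steps: review, split, train, test on unseen documents, measure.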

Predict Exemptions Using a Classifier
[Diagram: two pipelines, each drawing on external resources.
Learn Classifier: feature extraction, then learning, produces the learned model.
Run Classifier: feature extraction, then the learned model, produces predictions.
In both, features are represented as real numbers and documents as feature vectors.]

Features
● Document features, such as the words it contains or the entities it references, convey information about a document.
● A document can be modelled by using a statistical representation of its features.
● We use external knowledge bases, Natural Language Processing and semantic analysis to better understand the document features.
● The classifier recognises patterns in the documents’ feature sets and uses them for prediction.
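
To make "a statistical representation of its features" concrete, here is a minimal sketch of turning documents into vectors of real numbers with tf/idf weighting. The weighting formula (term frequency times the log of inverse document frequency), the function name `tfidf_vectors` and the three sample sentences are assumptions for illustration; the slides do not say which tf/idf variant was used.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Represent each document as a list of tf/idf weights, one per vocabulary term."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero for an empty document
        vectors.append([
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Hypothetical sample documents.
documents = ["visit to the embassy", "the canteen menu", "the embassy menu"]
vocabulary, vectors = tfidf_vectors(documents)
for doc, vector in zip(documents, vectors):
    print(doc, [round(weight, 3) for weight in vector])
```

A word that occurs in every document (here "the") gets weight zero, while a word specific to one document gets a positive weight, which is what lets a classifier pick up patterns in the vectors.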

Types of Features
The features we use can be divided into three main categories.
● Structure
  – Examples: Lists of Words (tf/idf), Document Length, Number of Recipients
  – Comments: Ubiquitous throughout the collection. Can expose patterns in document types. High value information about the nature of the communication.
● Content
  – Examples: Subjectivity, Verbs, “D.O.B”, Negation
  – Comments: By applying techniques such as Natural Language Processing and dictionary based term matching, we can identify the tone of the communication.
● Entities
  – Examples: Countries, People, Organisations
  – Comments: Tells us what the document “is about”. Context related to the entity, such as a “high-risk” country or a “significant” person or role can suggest sensitivity likelihood.
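
As a rough illustration of dictionary based term matching, the sketch below derives one feature from each category: document length (structure), negation and “D.O.B” mentions (content) and country mentions (entities). The word lists, the regular expression and the feature names are hypothetical; the slides do not describe how the project's own scores are computed, and a real system would draw on external knowledge bases rather than hard-coded sets.

```python
import re

# Hypothetical dictionaries; stand-ins for external knowledge bases.
COUNTRIES = {"france", "germany", "japan"}
NEGATIONS = {"not", "no", "never", "cannot"}
DOB_PATTERN = re.compile(r"\b(d\.o\.b\.?|date of birth)", re.IGNORECASE)

def extract_features(text):
    """Map a document to named real-valued features."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        "length": float(len(tokens)),                                   # structure
        "negation_count": float(sum(t in NEGATIONS for t in tokens)),   # content
        "dob_mentions": float(len(DOB_PATTERN.findall(text))),          # content
        "country_count": float(sum(t in COUNTRIES for t in tokens)),    # entities
    }

features = extract_features(
    "Mr Smith (D.O.B 01/02/1960) did not travel to France or Japan."
)
print(features)
```

Each value is a real number, so these features can be appended to a document's text-based vector before it is passed to the classifier.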

Research Question: Can we produce a classifier that can predict the presence of sensitive material within unstructured text?
Measure: Balanced Accuracy, the arithmetic mean of the true positive rate and the true negative rate; a random classifier scores 0.5000.
Test Collection:
  Total Documents    1849
  Total Section 27    208
  Total Section 40    142
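
A short sketch of the measure, using the Section 27 counts from the test collection (208 sensitive documents out of 1849). The slides do not say why balanced accuracy was chosen; one likely reason, shown here, is that plain accuracy looks good on an imbalanced collection even for a classifier that never flags anything. The function name and the "never flags" classifier are illustrative assumptions.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return (true_positive_rate + true_negative_rate) / 2

# Section 27 proportions from the test collection: 208 of 1849 are sensitive.
y_true = [True] * 208 + [False] * (1849 - 208)
# A hypothetical classifier that never predicts "sensitive".
y_pred = [False] * 1849

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy          = {accuracy:.4f}")
print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.4f}")
```

The do-nothing classifier is right on 1641 of 1849 documents (roughly 89% accuracy) yet has a balanced accuracy of 0.5, the same as random guessing.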
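
Overview of Classification
[Diagram: the test collection supplies training data and unseen data. Learn Classifier on the training data produces the learned model; Run Classifier on the unseen data uses the learned model to produce predictions.]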

Evaluation Methodology
[Diagram: assessor judgments from the test collection and the classifier's predictions feed a statistical analysis, which produces the results.]

Results
By adding features to a tf/idf text classification baseline, we see noticeable improvement in both Section 27 and Section 40 predictions. But there is still much work to be done!

Balanced Accuracy
Features                 s27      s40
Text Classification      0.6327   0.6344
+ Source Count           0.6369   0.6303
+ Country Count          0.6453   0.6406
+ Country Risk Score     0.6417   0.6368
+ DOB Score              0.6327   0.6391
+ Negation Score         0.6378   0.6382

Test Collection - Aims
● To provide sensitivity judgements and training data to develop and measure tools
● To measure and understand assessors’ behaviour

Test Collection - Measurements
● Time
● Agreement of sensitivity
  – Not previously studied
● Hard Judgements
● Identify borderline cases
● Sensitivity sub-categories
  – Good indicator for features
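
The slides do not say how agreement between assessors is quantified. One common choice, assumed here purely for illustration, is Cohen's kappa, which corrects the raw agreement rate for the agreement two assessors would reach by chance. The judgement lists below are invented.

```python
def cohens_kappa(judgements_a, judgements_b):
    """Chance-corrected agreement between two assessors on the same documents."""
    if len(judgements_a) != len(judgements_b) or not judgements_a:
        raise ValueError("need two equally long, non-empty lists of judgements")
    n = len(judgements_a)
    observed = sum(a == b for a, b in zip(judgements_a, judgements_b)) / n
    labels = set(judgements_a) | set(judgements_b)
    # Agreement expected if each assessor labelled at random with their own rates.
    expected = sum(
        (judgements_a.count(label) / n) * (judgements_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both assessors used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgements (1 = sensitive, 0 = not sensitive) on eight documents.
assessor_1 = [1, 1, 0, 0, 1, 0, 0, 0]
assessor_2 = [1, 0, 0, 0, 1, 0, 1, 0]
print(f"kappa = {cohens_kappa(assessor_1, assessor_2):.3f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; documents on which assessors disagree are natural candidates for the borderline cases listed above.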

The Abaca Project - Where Next?
● Understanding the real digital environment
  – Changes in working practice
● Testing our proof-of-concept system against real data
● More, wider and deeper
  – More exemptions, more data, more features
  – BIS, HO, MOJ, FCO, ... and more to come!
  – Funding

Questions and Feedback

Break-Out Groups
Aims:
● Discuss sensitivity review in the Welsh Government and language context.
● Share your understanding and develop some ideas.

Break-Out Groups
Questions:
1. What digital records does the Welsh Government create?
2. What sort of sensitivities are expected within these digital records?
3. What aspects of the sensitivity review process could be technically supported by a software tool or system?
4. What document features could be used to identify the expected sensitivities?

Contact
http://projectabaca.wordpress.com/
email@example.com