Published on November 14, 2016
1. Challenges in Preparing and Sharing Open Data OpenCon 2016 Cape Town 14 December 2016 Michelle Willmers and Thomas King ROER4D Curation and Dissemination Manager CC BY
2. Research On Open Educational Resources (OER) for Development • Imperative to establish empirical baseline research on OER in Global South • 86 researchers in 26 countries across 3 continents • Project ‘Open’ ethos manifests in Open Research strategy, bridging ‘Open’ silos • Open content (typically used in a teaching and learning context) that can be reused, revised, remixed, redistributed and retained • Made possible by open licensing, although increasing focus on differentiating implicit vs. explicit open content • Focus on role OER can play in improving access to quality education • Focus on role project can play in building Global South Open Education research capacity • Strong advocacy and activism component (NGO, CBO sectors – not only career researchers) Focus on empirical baseline manifests in focus on curatorial and publishing capacity within the research project. The project acts as publisher, providing greater agency and control (but presenting some challenges in terms of accreditation/reward).
3. ROER4D Curation & Dissemination Strategy • Provide a content management and publishing service to SP researchers and the Network Hub team in order to advance research capacity development efforts and increase visibility of outputs. • Support Principal Investigators and SP researchers in editorial development of ROER4D outputs. • Address infrastructure deficits and provide content management solutions (including content hosting) in a research community with uneven institutional support and capacity challenges. • Ensure that the ROER4D legacy is freely accessible for reuse in line with international curatorial and publishing standards. • Complement Network Hub Communications efforts in an integrated communications/dissemination approach.
4. ROER4D data management principles • Data sharing as component of open content focus. • Organising and profiling open content increases the potential for reuse and citation (impact). • Well-organised, strategic research management and content organisation promotes rigour in the research process. • Copyright vests with the author > data-sharing activity determined by their willingness and capacity to engage. • Format and platform/tool agnostic. • Share openly by default on condition that it is valuable, legal and ethical.
5. ROER4D project data flow [diagram; labels: Project archive (external); Zenodo; Network Hub (Google, Vula); Internal sharing and collaboration; External sharing and collaboration]
6. Five pillars of ROER4D data publication approach
7. Step 1: Evaluate contractual framework, articulate strategy
8. Step 2: Get researchers on board
9. Step 3: Obtain source sub-project micro-data • Check ethics approval and consent • Ensure first-tier de-identification takes place prior to Network Hub transfer in order to ensure research subject confidentiality • ROER4D agnostic in its approach (in terms of scale, format and technical sophistication) • Challenges of varying researcher sophistication in terms of data collection and presentation • Challenges of varying researcher sophistication in terms of technology employed to capture, present, and analyse data
10. Step 4: Network Hub curation and quality assurance • Archive in Vula and UCT e-Research Centre secure institutional archive • Network Hub C&D team audits researchers’ submitted dataset > What does the dataset comprise? > Are all the pieces there? > What were the data collection processes, and do we have all the instruments to share? > What languages are represented? > Does something else like it exist? > Who might it be of use to? • Address file naming and format issues • Articulate sub-project-specific data management plan
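The file naming and format checks in Step 4 could be partly automated. The following is a minimal sketch only, not ROER4D's actual tooling: the naming rule (lowercase letters, digits, hyphens and underscores) and the list of open formats are assumptions chosen for illustration.

```python
import re
from pathlib import PurePath

# Hypothetical conventions, not taken from the ROER4D project.
SAFE_NAME = re.compile(r"^[a-z0-9][a-z0-9_-]*$")
OPEN_FORMATS = {".csv", ".txt", ".pdf", ".odt", ".ods"}


def audit_filenames(filenames):
    """Return a dict mapping each problematic file name to a list of issues."""
    report = {}
    for name in filenames:
        path = PurePath(name)
        issues = []
        # Check the name without its extension against the naming rule.
        if not SAFE_NAME.match(path.stem):
            issues.append("name should use lowercase letters, digits, '-' or '_'")
        # Check the extension against the assumed list of open formats.
        if path.suffix.lower() not in OPEN_FORMATS:
            issues.append(f"format {path.suffix or '(none)'} is not on the open-format list")
        if issues:
            report[name] = issues
    return report


if __name__ == "__main__":
    sample = ["sp1_interview_01.txt", "Final Survey (v2).xlsx", "sp3_survey.csv"]
    for name, issues in audit_filenames(sample).items():
        print(name, "->", "; ".join(issues))
```

A check like this only flags candidates for renaming or conversion; deciding what the dataset comprises and whether all the pieces are there still needs a human audit.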
11. Step 5: Prepare data for publication • Scope and conceptualise the dataset > Which components of the project-generated micro-data are you ethically and legally allowed to share? > Which components of the project-generated micro-data will you invest resources in curating and sharing? > Which instruments will you include? • Identify focus of data and points of sensitivity • Define appropriate second-tier de-identification approach
12. Step 6: Publish • Generate metadata and dataset description (accompanying narrative) • Submit content to publisher (DataFirst) • Link to published outputs • Include description of process in research Methodology statements • Profile in project communications activity
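As an illustration of the "generate metadata" step, a dataset description can be assembled as a small structured record. This is a sketch under stated assumptions: the field names are hypothetical and do not reproduce DataFirst's or any other publisher's schema, and all values are placeholders.

```python
import json


def build_metadata(title, creators, description, languages, licence, related_outputs=()):
    """Assemble a minimal dataset description as a plain dictionary.

    Field names are illustrative only; a real publisher will have its own schema.
    """
    if not title or not creators:
        raise ValueError("title and at least one creator are required")
    return {
        "title": title,
        "creators": list(creators),
        "description": description,
        "languages": sorted(languages),
        "licence": licence,
        "related_outputs": list(related_outputs),
    }


# Placeholder values, not a real ROER4D dataset.
record = build_metadata(
    title="Example sub-project interview dataset",
    creators=["Researcher A", "Researcher B"],
    description="De-identified interview transcripts with the interview schedule.",
    languages=["pt", "en"],
    licence="CC BY",
    related_outputs=["(link to published output goes here)"],
)
print(json.dumps(record, indent=2, ensure_ascii=False))
```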
13. Some lessons learned
14. 1. Openness increases rigour. Preparing data for publication promotes professional approach to research process. 2. Preparing data for publication exposes weaknesses in instrument design and research process. 3. Introducing C&D and data-sharing focus midway through a project poses many challenges, particularly in terms of ethical and consent components. 4. Data sharing drives focus on reproducibility, transforming traditional approach to crafting methodology statements. 5. The data preparation process takes time (approx. one week of researchers’ time in ROER4D context). 6. Obtaining balance between utility and adequate protection in de-identification of qualitative data is a challenge. 7. Openness is threatening to researchers in terms of exposing weakness in processes and perceived threat of losing publication advantage. 8. C&D and data sharing activity require support, capacity development and resourcing.
15. Qualitative de-identification Thomas King
16. Terms and definitions • De-identification – removing, eliding or replacing pieces of information that reveal research participants’ (possibly also referents’) identity. • Anonymity – personal details are not gathered. • Confidentiality – personal details are not shared. • E.g. an anonymous survey contains no questions about personal identifiers. A confidential survey does contain these questions, but will not share/publish them.
17. The two pillars of open data sharing [diagram; labels: Consensual, ethical, legal; Comprehensible, coherent, valuable; Research Data Management & Open Data sharing]
18. The de-identification balancing act • First, do no harm: remove as much as needed to ensure the confidentiality or anonymity of the research participants. Ensure that all ethical and consent processes have been adhered to. • Don’t go overboard: remove as little as is ethical to ensure the richness of the data. Take the unit of analysis as the guide: de-identify up to the unit of analysis. E.g. if Study X compares two universities, you can safely remove all identifiers lower than the university affiliation. HOWEVER, your data may be useful to others. The purpose of de-identification is to preserve confidentiality; don’t de-identify for the sake of it.
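The "de-identify up to the unit of analysis" guideline can be sketched for tabular records as follows. The identifier hierarchy and field names below are hypothetical, and the example data is invented for illustration.

```python
# Hypothetical identifier fields, ordered from broadest to most specific.
IDENTIFIER_LEVELS = ["country", "university", "department", "participant_name"]


def deidentify_to_unit(records, unit_of_analysis):
    """Drop identifier fields that are more specific than the unit of analysis."""
    cut = IDENTIFIER_LEVELS.index(unit_of_analysis) + 1
    to_remove = set(IDENTIFIER_LEVELS[cut:])
    return [
        {key: value for key, value in record.items() if key not in to_remove}
        for record in records
    ]


# Invented example record.
records = [
    {
        "country": "Country X",
        "university": "University A",
        "department": "Education",
        "participant_name": "Participant One",
        "response": "I reuse OER weekly.",
    },
]
cleaned = deidentify_to_unit(records, "university")
print(cleaned)
```

Dropping fields in this way addresses only structured identifiers. Free-text responses can still contain identifying details, and whether lower-level identifiers should be removed at all remains a judgement bounded by ethics approval and consent.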
19. Qualitative de-identification • De-identification located in the same ecosystem as data cleaning and data validation – no clear line between data improvement and de-identification – Cleaning up typos – Standardising presentation and layout – Identifying unanswered questions (or additional questions), mislabelled responses, etc. • Many of these also apply to quantitative data • Articulation of principles in RDM and description of these processes included in metadata
20. ROER4D data interrogation process [diagram of five numbered steps; labels: Read data; Coherence (format & layout); Editing (fix typos & identify anomalous data); De-identifying (remove identifiers); Validation (identify and account for missing data)]
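The validation label in this process ("identify and account for missing data") lends itself to a simple automated first pass. The sketch below is an assumption about how such a check might look, not the project's method; the set of values treated as missing is hypothetical.

```python
# Hypothetical markers that count as "no answer" in a response field.
MISSING_MARKERS = {"", "n/a", "na", "-"}


def missing_data_report(records, expected_fields):
    """Map each expected field to the indexes of records where it is missing."""
    report = {}
    for field in expected_fields:
        missing = []
        for index, record in enumerate(records):
            value = record.get(field)
            # A field is missing if absent, None, or one of the assumed markers.
            if value is None or str(value).strip().lower() in MISSING_MARKERS:
                missing.append(index)
        if missing:
            report[field] = missing
    return report


# Invented survey responses.
responses = [
    {"q1": "Yes", "q2": ""},
    {"q1": "N/A"},
    {"q1": "No", "q2": "Sometimes"},
]
print(missing_data_report(responses, ["q1", "q2"]))
```

The report identifies gaps; accounting for them (for example, noting in the metadata why a question went unanswered) is still a manual step.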
21. ROER4D project structure [diagram] • Network Hub: Principal Investigator; Curation and Dissemination team; Communication and Evaluation consultants • Sub-projects • Using largely mixed-methods data (both quantitative and qualitative)
22. ROER4D de-identification process 1. First-level de-identification by researcher – Removal of direct identifiers (names of people/institutions/companies, ID numbers, etc.) – Important to ensure that raw data is not shared 2. Second-level de-identification by C&D team to catch remaining direct identifiers 3. In-depth sweep of the text to identify indirect identifiers – Meticulous, thorough, repeated reading of the text • (which ties back to general data enhancement)
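Removal of known direct identifiers is the part of this process that is easiest to support with a script. The following is a minimal sketch, assuming the curator already has a list of names to replace; the function name, placeholder style and example text are all hypothetical.

```python
import re


def replace_direct_identifiers(text, identifiers):
    """Replace each known identifier with its placeholder, longest first.

    `identifiers` maps a literal string (e.g. a name) to a placeholder.
    Matching is case-insensitive and only hits whole words.
    Returns the new text and a count of replacements per identifier.
    """
    counts = {}
    for literal in sorted(identifiers, key=len, reverse=True):
        placeholder = identifiers[literal]
        pattern = re.compile(
            r"(?<!\w)" + re.escape(literal) + r"(?!\w)", re.IGNORECASE
        )
        # A function replacement avoids backslash handling in the placeholder.
        text, n = pattern.subn(lambda _match, p=placeholder: p, text)
        counts[literal] = n
    return text, counts


# Invented transcript fragment and identifier list.
transcript = "Dr Jane Doe of Example University said Jane prefers open textbooks."
known = {
    "Jane Doe": "[PARTICIPANT 1]",
    "Jane": "[PARTICIPANT 1]",
    "Example University": "[INSTITUTION 1]",
}
clean_text, counts = replace_direct_identifiers(transcript, known)
print(clean_text)
print(counts)
```

A script like this only catches identifiers someone has already listed. Indirect identifiers (a job title, a distinctive event) still require the repeated close reading described in the slide.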
23. Tricky situations • Data collected in multiple languages – De-identification (particularly in qualitative data) far more difficult – greater reliance on the researcher • Post-hoc consent process – Departments merge or close, participants retire or disappear • Data collected by multiple researchers – Different collection strategies, adherence to interview schedules, use/non-use of clarifying questions, etc.
24. Open by design • Help researchers write consent forms! Particularly for open data sharing. • ‘Red flag’ clauses abound in template consent forms, including: – “will be used for research purposes only” – “data will be destroyed after use” – “only researchers will have access to the data” • More open consent forms allow for data sharing but do not mandate it.
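A template consent form can be screened for the red-flag clauses quoted above before it is sent to participants. This is a rough sketch using exact-phrase matching on the three clauses from the slide; the function name and sample text are hypothetical, and paraphrased restrictions will not be detected.

```python
import re

# The three clauses quoted in the slide.
RED_FLAG_CLAUSES = [
    "will be used for research purposes only",
    "data will be destroyed after use",
    "only researchers will have access to the data",
]


def find_red_flags(consent_text, clauses=RED_FLAG_CLAUSES):
    """Return the red-flag clauses that occur in the consent form text."""
    # Collapse whitespace and lowercase so line breaks and case do not matter.
    normalised = re.sub(r"\s+", " ", consent_text).lower()
    return [clause for clause in clauses if clause in normalised]


# Invented consent form excerpt.
sample_form = """Your answers will be used for
research purposes only. The data will be destroyed after use."""
print(find_red_flags(sample_form))
```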