Published on March 11, 2014
Data Analysis Module ILRI Graduate Fellows skills training Nairobi 4th December 2013
Session Objectives
To be able to:
• Answer the research questions: what results do you need to show and in what format (tables, graphs, charts etc.)?
• Select data analysis methods (principles and concepts)
• Identify, evaluate and apply data analysis packages (software available, open source)
• Plan analysis of own data and carry out exploratory analysis
• Carry out formal data analysis using different tools and methods, e.g. R
Research Process
Phases: Project development, Implementation, Communicating findings
• Problem definition: definition of the problem domain & how the specific problem fits in
• Literature review: identification of gaps, appropriate methods & theory
• Objective & hypothesis: the research will support or refute the hypothesis
• Study design: research strategy to be used, sample size, sampling frame
• Sampling: sample selection
• Data collection: data collection tools
• Data management: database development and data cleaning
• Formal analysis: exploration, description, modelling & interpretation of statistical outputs
• Reporting: choice of reporting media & format
• Publication: advice on presentation of results
• Data archiving or publication: data sharing media
Data Analysis – Guiding Principles
Translating Research Questions / Objectives / Hypotheses into an ‘analysis plan’
• We used our research questions / objectives / hypotheses to design the study: experiment or survey, questions to ask & data to collect
• We use them again to plan the analysis: what differences do I need to show, what are my response variables, what types of model may I need to use etc.
• This is often a good time to draft the tables and graphs that you think will help answer the questions
Data Analysis – Multi-level Data Structures
• Many studies are designed at different levels and collect data at these multiple levels
• E.g. in experiments this could be animals or plots → blocks, and for surveys this may be animal → household → village → district
• Another aspect of ‘levels’ is repeated measurements over time
• These levels must not be forgotten when we reach the analysis stage: at which level do we summarise each variable? What is our ‘unit of analysis’?
• For formal analysis there are advanced methods that allow for the multi-level data structure (including the variation at each level)
• The analysis can often be simplified into fewer dimensions by summarising particular aspects, e.g. the mean between two points in time, the slope of the trend between two points, or the value reached at the end
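The last point above (reducing repeated measurements to one summary per unit) can be sketched in R as follows. This is only an illustration on simulated data: the data frame `long`, its columns `animal`, `week` and `weight`, and all values are hypothetical and not taken from the course datasets.

```r
# Simulated repeated measurements: 4 animals weighed at 5 time points.
# All names and values are hypothetical.
set.seed(1)
long <- data.frame(
  animal = rep(paste0("A", 1:4), each = 5),
  week   = rep(0:4, times = 4)
)
long$weight <- 20 + 1.5 * long$week + rnorm(nrow(long), sd = 0.5)

# Reduce each animal's series to one row: mean, slope of the trend, final value.
per_animal <- do.call(rbind, lapply(split(long, long$animal), function(d) {
  data.frame(
    animal       = d$animal[1],
    mean_weight  = mean(d$weight),
    slope        = unname(coef(lm(weight ~ week, data = d))["week"]),
    final_weight = d$weight[d$week == max(d$week)]
  )
}))
print(per_animal)
```

After this step the animal, not the individual measurement, is the unit of analysis, so simple methods can be applied to `per_animal`.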
Data Analysis – Response & Explanatory Variables
Terms are discipline specific:
• Response ≡ Dependent variable ≡ y
• Explanatory ≡ Independent variable ≡ x’s
• Explanatory variables can be continuous or discrete
• In epidemiology they can be ‘confounding’ or biologically ‘interacting’*
• In economics they can be exogenous or endogenous
*Note that statistical confounding and interactions have different interpretations
Data Analysis – Variable Types
• Variables can be CONTINUOUS or DISCRETE
• In analysis we may sometimes convert continuous data to discrete
• Both Response and Explanatory variables can be continuous or discrete
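As a small illustration of converting a continuous variable to a discrete one, the sketch below uses base R's cut(). The weights, the cut points (20 and 30) and the class labels are made up for the example.

```r
# Hypothetical body weights (continuous variable)
weight <- c(12.5, 18.0, 25.3, 31.7, 40.2, 22.1)

# Convert to a discrete variable with three classes.
# Intervals are (0,20], (20,30] and (30,Inf]; the cut points are arbitrary.
weight_class <- cut(weight,
                    breaks = c(0, 20, 30, Inf),
                    labels = c("light", "medium", "heavy"))

# Frequency table of the new discrete variable
table(weight_class)
```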
Data Analysis – Aim of Exploratory Analysis
• Data exploration is the first stage of any analysis of the data. People often jump straight to formal analysis and models, but it is at this stage that you will identify patterns and ‘odd’ data
• In some cases this may be 90% of the analysis you do on the data (e.g. Case Study 11)
• The more complicated the data set, the more interesting and necessary the exploratory phase becomes
• With some expertise in data management we can highlight the important patterns within the data and list the types of statistical models, and their forms, that need to be fitted at the next stage
Data Analysis – Activities of Exploratory Analysis
Data Analysis – Exploratory Analysis Methods (Continuous & Discrete Variables)
Tools for exploratory analysis:
• Means & ranges (Case Study 2)
• 1 & 2-way tables of means (Case Study 3)
• Frequency tables (Case Study 3 & Case Study 8; use of Excel pivot tables!)
• Histograms (Case Study 11)
• Scatterplots (Case Study 1 & Case Study 3)
• Boxplots (Case Study 3)
• Bar charts & pie charts (Case Study 11)
• Trend graphs, survival curves (Case Study 3)
Identify patterns and unusual values, e.g. outliers, zeros
Measures of variation: variance, standard deviation, confidence interval, standard error, CV
Transformation of variables?
Data Analysis – Confirmatory / Formal Analysis
Exploratory analysis was the first part of the analysis of our research data: it gives an initial understanding of the patterns that exist in the data and suggests further analysis needs.
By the time we reach Confirmatory / Formal analysis we have refined our objectives, and these should clearly define exactly what type of further statistical analysis we need (and which models to use).
In exploratory analysis it is difficult to look at many variables at the same time. Formal analysis allows us to do this and to see which variables are more important than others.
Data Analysis – Confirmatory / Formal Analysis (Some options…)

Explanatory / Independent Variable(s) | Response / Dependent Variable: Discrete | Response / Dependent Variable: Continuous
Discrete | Chi-square / Regression (Logistic – binary; Poisson – count) | T-Test / Analysis of Variance (/ Regression)
Continuous | Regression (Logistic – binary; Poisson – count) | Correlation / Linear Regression
Both | Regression (Logistic – binary; Poisson – count) | ANOVA / Linear Regression

(Non-parametric equivalents for small samples)
Data Analysis – Confirmatory / Formal Analysis (Some advanced options…)
• Mixed-effects models (REML): incorporate random effects, including spatial & temporal repeated measurements; better at managing data with many levels of hierarchy; used a lot in epidemiology and animal/plant genetics
• Survival models
• Multivariate (> 1 y) methods: can be both parametric & non-parametric
• Proportional-odds models for a categorical response with more than 2 categories
Data Analysis – Confirmatory / Formal Analysis Concepts
The underlying concept in formal statistical modeling is: Data = Pattern + Residual
• ‘Data’ are the raw data (responses) that you collected, or sometimes summaries derived from the raw data*. They could also be transformed values*
• ‘Pattern’ is all the variables (continuous or discrete) that are in the design of the study (e.g. treatment) or have been selected in your exploratory analysis as explaining some of the differences in the ‘Data’
• ‘Residual’ is the variation we can’t explain by the ‘Pattern’
The aim of our formal analysis is to put as much of the variation as possible into the ‘Pattern’ while keeping the model as simple as possible.
*See earlier slides
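The decomposition Data = Pattern + Residual can be checked directly in R. The sketch below is an illustration on simulated data: it borrows the variable names PCV and WEIGHT used later in the slides, but the values are generated here and are not the course data.

```r
# Simulated data; names follow the slides, values are hypothetical.
set.seed(42)
WEIGHT <- runif(30, 150, 300)
PCV    <- 20 + 0.04 * WEIGHT + rnorm(30, sd = 2)

fit      <- lm(PCV ~ WEIGHT)   # the model describing the 'Pattern'
pattern  <- fitted(fit)        # part of the data explained by the model
residual <- resid(fit)         # part left unexplained

# Data = Pattern + Residual holds for every observation
all.equal(unname(pattern + residual), PCV)

# Share of the total variation that the Pattern accounts for (R-squared)
ss_total <- sum((PCV - mean(PCV))^2)
ss_resid <- sum(residual^2)
1 - ss_resid / ss_total
```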
Data Analysis – Confirmatory / Formal Analysis: Correlation & simple Linear Regression 1
We will use correlation & fitting a straight line to data to explain the concepts of statistical modeling:
• The simplest way of looking at the relationship between an x-variable and a y-variable is with the CORRELATION
• An extension of this is to use a LINEAR REGRESSION model to fit a straight line through the points (Case Study 3)
• To look at how well this model fits we use an ‘analysis of variance’: the amount of variation in the Pattern vs. in the Residual
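To make the link between correlation and a fitted straight line concrete, here is a minimal R sketch. The data are simulated, and the names WEIGHT and PCV are reused from the later slides purely for illustration.

```r
# Simulated x and y; values are hypothetical.
set.seed(2)
WEIGHT <- runif(50, 150, 300)
PCV    <- 22 + 0.03 * WEIGHT + rnorm(50, sd = 2)

# Correlation between x and y
r <- cor(WEIGHT, PCV)

# Straight line fitted by linear regression
fit   <- lm(PCV ~ WEIGHT)
slope <- unname(coef(fit)["WEIGHT"])

# For simple linear regression the two are linked:
#   slope = r * sd(y) / sd(x), and R-squared = r^2
c(r = r, slope = slope, r_squared = summary(fit)$r.squared)
```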
Data Analysis – Confirmatory / Formal Analysis: Correlation & simple Linear Regression 2
Linear regression and similar models present the analysis in an ‘analysis of variance’ table that looks like this (Section 3.2):

Source of variation | d.f. | s.s. | m.s. | v.r. (F-value) | p-value
Slope of line | 1 | | | |
Residual (error) | N-2 | | | |
Total | N-1 | | | |

In this example the p-value will tell you if the slope of the line is significantly different from 0 (i.e. a flat line) (Section 5).
Models such as Logistic for binary data and proportions, or Poisson for counts, give a similar table, but it is now the ‘analysis of deviance’, with similar interpretations (Section 10.2).
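In R, a table of this kind can be obtained with anova() on a fitted lm object. The sketch below uses simulated x and y values (hypothetical, N = 25). Note that R prints the slope and residual rows only, so the total row is computed by hand here.

```r
# Simulated data with N = 25 observations; values are hypothetical.
set.seed(7)
n <- 25
x <- 1:n
y <- 3 + 0.8 * x + rnorm(n, sd = 2)

fit     <- lm(y ~ x)
aov_tab <- anova(fit)   # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
print(aov_tab)

# Row "x" is the slope (1 d.f.); row "Residuals" has N - 2 d.f.
# The total line (N - 1 d.f.) is the sum of the two rows.
total_df <- sum(aov_tab[["Df"]])
total_ss <- sum(aov_tab[["Sum Sq"]])
c(total_df = total_df, total_ss = total_ss)
```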
Data Analysis – Confirmatory / Formal Analysis: Correlation & simple Linear Regression 3
A key aspect of any model is ‘model checking’, part of which is done through examination of the residuals.
All regression models that assume either the data or the residuals are ‘normal’ rely on the same assumptions: independence, randomness and normal distribution.
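One possible way to examine residuals in R is sketched below. This is an assumption about a reasonable workflow rather than the procedure used in the course: it draws a residuals-versus-fitted plot and a normal Q-Q plot into a PDF in the temporary directory and runs a Shapiro-Wilk test. The data and the file name are hypothetical.

```r
# Simulated data; values are hypothetical.
set.seed(3)
x   <- runif(40, 0, 10)
y   <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)
res <- resid(fit)

# Write the two standard diagnostic plots to a PDF file
check_file <- file.path(tempdir(), "residual_checks.pdf")
pdf(check_file)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # residuals should scatter evenly around zero
qqnorm(res)
qqline(res)              # points near the line suggest normal residuals
invisible(dev.off())

# A formal (and rather strict) check of normality
sw <- shapiro.test(res)
sw$p.value
```

Plots are usually more informative than the test alone, since they also show patterns such as curvature or unequal spread.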
Data Analysis – Confirmatory / Formal Analysis: Parameter Estimates & Least Square Means
We also look at the parameter estimates and their standard errors. For the linear regression example the parameter estimates are the slope and the intercept.
For more complex models, and those with discrete explanatory variables, we use the parameter estimates to compare levels of the discrete variables (Case Study 3 for examples & discussion).
For models containing discrete variables as explanatory / independent variables we will often want to present Means and Standard Errors and compare these (with a t-test if comparing 2 means, or with multiple comparison tests if comparing more) (Section 6).
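A minimal R sketch of these ideas for a model with one discrete explanatory variable is given below. The data are simulated, and the names SEX and WEIGHT are borrowed from the later slides. With a single factor the model-based means equal the raw group means; dedicated packages exist for least squares means in more complex models, but they are not used here.

```r
# Simulated data: weight of 15 females and 15 males; values are hypothetical.
set.seed(11)
dat <- data.frame(SEX = factor(rep(c("F", "M"), each = 15)))
dat$WEIGHT <- ifelse(dat$SEX == "M", 260, 230) + rnorm(30, sd = 15)

fit <- lm(WEIGHT ~ SEX, data = dat)

# Parameter estimates with standard errors, t values and p-values.
# "(Intercept)" is the mean for F; "SEXM" is the difference M - F.
est <- coef(summary(fit))
print(est)

# Model-based means and standard errors for each level of SEX
new_levels <- data.frame(SEX = factor(c("F", "M"), levels = levels(dat$SEX)))
ls_means   <- predict(fit, newdata = new_levels, se.fit = TRUE)
data.frame(SEX = new_levels$SEX, mean = ls_means$fit, se = ls_means$se.fit)

# Raw group means for comparison
group_means <- tapply(dat$WEIGHT, dat$SEX, mean)
```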
Data Analysis – Confirmatory / Formal Analysis: Exercise
• Identify what sort of model you may use in your research (check the Statistical Modeling Teaching Guide), e.g. linear regression (section 3), designed experiment (section 4), response data which are proportions or binary (section 10), count response data (section 11), survival data (section 12)
• Which parameters may be included in your model as the Pattern?
• Draw a pretend analysis of variance / deviance table, or parameter estimates table, showing what you may expect to see in the analysis
Data Analysis – Application (Statistical Software Packages)

Features | Excel | Stata | SPSS | SAS | R
Learning curve | Gradual/flat | Steep/gradual | Gradual/flat | Pretty steep | Pretty steep
User interface | Point-and-click | Programming/point-and-click | Mostly point-and-click | Programming | Programming
Data manipulation | Weak/moderate | Very strong | Moderate | Very strong | Very strong
Data analysis | Modest | Powerful | Powerful | Powerful/versatile | Powerful/versatile
Graphics | Very good | Very good | Very good | Good | Excellent
Cost | Part of MS Office | Affordable (perpetual licenses, renew only when upgrading) | Expensive (with annual license renewal) | Expensive (annual renewal) | Open source (free)
Data Analysis – Application (Using R)
Outline
• Installing R
• R Environment: command prompt, RStudio, setting your workspace
• Loading and installing R packages
• Importing data into R - *.csv/*.xls
• Saving R data
• Data exploration – summary statistics
• Graphing in R - boxplot
• Data analysis – T-test/linear & logistic regression
Introduction to R
Installing R
- Download R from http://cran.r-project.org (CRAN: Comprehensive R Archive Network)
- The R version changes over time; the current one is R-2.15.0
- Installing RStudio
- Setting up the work environment
Introduction to R
R Environment
Command prompt
• R is primarily command-driven software: instructions are typed at the command prompt (> )
• R is case sensitive
RStudio
• RStudio has a limited set of commands that can be selected and executed from the menu
Setting your workspace
• It is important to set R preferences to suit your work environment; one such setting is the working directory
• The working directory is set using the command setwd, e.g. setwd("D:/My Documents/R course"), or from File->change dir on the menu
• Take note of the forward slash (/): R will not recognize a single backslash (\) when specifying subdirectories
Introduction to R
Loading and installing R packages
• Modules or sets of functions are referred to as PACKAGES in R. Some packages are part of the base installation while others have to be installed separately. There are several user-contributed packages
• Type library() to view installed packages
• To view the functions within a package type library(help="packagename"), e.g. library(help=stats) (the quotes are optional)
• Install packages using the menu Packages->Install package(s)…
• Use the find("item") command to identify the package containing an item of interest, e.g. find("plot"), if you are sure of the exact name; otherwise use apropos("item")
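The same package tasks can also be done from the command prompt instead of the menu. The sketch below is an illustration: the package name stored in `pkg` is a placeholder ("stats" always ships with R, so the example runs anywhere), and installing a real add-on package needs an internet connection.

```r
# Placeholder package name; replace with the package you need, e.g. "lattice".
pkg <- "stats"

# Install only if the package is not already available, then load it
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
library(pkg, character.only = TRUE)

# Packages attached in the current session
search()

# Which package holds a function whose exact name is known?
find("lm")

# Names matching a pattern (here: everything starting with "read.")
apropos("^read\\.")
```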
Introduction to R
Importing data into R - *.csv/*.xls
Although it is possible to enter data directly into R, importing data from a spreadsheet format is more efficient. Use:
i. read.table – to import space-separated data with column headings (*.txt); whitespace is the default separator
prod1 <- read.table("D:/My Documents/R course/PROD2B.txt", header=TRUE)
ii. read.csv – to import comma-separated data with column headings (*.csv)
prod2 <- read.csv("D:/My Documents/R course/PROD2B.csv", header=TRUE)
To save the data from R:
write.table(prod2, file="D:/My Documents/R course/proddata2", quote=FALSE)
iii. odbcConnectExcel() – to import an Excel worksheet; this function comes from the RODBC package, so load it first with library(RODBC). The sqtable argument is the name of the worksheet to read.
prod3 <- "D:/My Documents/R course/PROD2B.xls"
datachannel <- odbcConnectExcel(prod3)
outprod3 <- sqlFetch(channel=datachannel, sqtable="prod3")
write.table(outprod3, file="D:/My Documents/R course/proddata3", quote=FALSE)
Note that R is case sensitive, so the functions must be typed in lower case (read.table, not Read.table).
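Because the course files (PROD2B.*) are not available here, the sketch below shows the same import-and-save round trip on a small made-up data frame written to the temporary directory. The file names and values are hypothetical; only the column names follow the slides.

```r
# A small made-up dataset with the column names used in the slides
demo <- data.frame(
  HERD   = c("A", "A", "B", "B"),
  SEX    = c("F", "M", "F", "M"),
  WEIGHT = c(210.5, 250.0, 198.0, 265.2),
  PCV    = c(28, 31, 26, 33)
)

# Write it out as a comma-separated file, then import it again
csv_path <- file.path(tempdir(), "PROD2B_demo.csv")
write.csv(demo, csv_path, row.names = FALSE)
prod_demo <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)
str(prod_demo)   # check column types after import

# Save the imported data in R's own format and restore it later
rds_path <- file.path(tempdir(), "prod_demo.rds")
saveRDS(prod_demo, rds_path)
restored <- readRDS(rds_path)
identical(prod_demo, restored)
```

Checking the result with str() straight after import is a cheap way to catch columns that were read with the wrong type.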
Introduction to R
Data exploration – summary statistics
• One can get summary statistics on all numeric variables in the dataset using summary(datasetname), e.g. summary(outprod3)
• It is also possible to get summary statistics on a particular variable; use $ to select the variable from the data table, e.g. summary(outprod3$WEIGHT)
• Use aggregate to get summary statistics by group/category, e.g. aggregate(data.frame(WEIGHT), by=list(herd=HERD,sex=SEX), mean)
• The aggregate call above names WEIGHT, HERD and SEX without the data file, so the data file must be attached first: attach(outprod3). Attaching avoids having to specify the data file every time, particularly for long commands such as aggregate
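As an alternative sketch that does not rely on attach(), aggregate() also accepts a formula together with a data argument. The data frame `prod` below is made up for illustration; only the column names follow the slides.

```r
# Made-up data: 2 herds x 2 sexes, 2 animals in each combination
prod <- data.frame(
  HERD   = rep(c("A", "B"), each = 4),
  SEX    = rep(c("F", "M"), times = 4),
  WEIGHT = c(200, 250, 210, 260, 180, 240, 190, 230)
)

# Summary of a single variable
summary(prod$WEIGHT)

# Mean weight for each herd-by-sex combination, without attach()
means <- aggregate(WEIGHT ~ HERD + SEX, data = prod, FUN = mean)
print(means)
```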
Introduction to R
Graphing in R – boxplot
R has powerful graphing features that can be used in data exploration, such as histograms, boxplots and scatterplots. The histogram and xyplot functions come from the lattice package, so load it first with library(lattice). The examples assume the dataset has been attached. Colours are set with the col argument.
histogram(~PCV, nint=30, xlab="Packed Cell Volume")
boxplot(PCV, ylab="Packed Cell Volume")
boxplot(PCV~HERD, col="orange", ylab="Packed Cell Volume", xlab="Herd")
xyplot(PCV~WEIGHT, col="orange", ylab="Packed Cell Volume", xlab="Weight")
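The same three kinds of plot can also be drawn with base R graphics only, which avoids loading any extra package. The sketch below is an illustration on simulated data (column names follow the slides, values are hypothetical) and writes the plots to a PDF in the temporary directory so that it also runs without a screen device.

```r
# Simulated data; values are hypothetical.
set.seed(5)
prod <- data.frame(
  HERD   = factor(rep(c("A", "B", "C"), each = 20)),
  WEIGHT = runif(60, 150, 300)
)
prod$PCV <- 22 + 0.03 * prod$WEIGHT + rnorm(60, sd = 2)

plot_file <- file.path(tempdir(), "pcv_plots.pdf")
pdf(plot_file)   # one plot per page

# Histogram of a continuous variable
hist(prod$PCV, breaks = 15, main = "", xlab = "Packed Cell Volume")

# Boxplot of PCV by herd
boxplot(PCV ~ HERD, data = prod, col = "orange",
        ylab = "Packed Cell Volume", xlab = "Herd")

# Scatterplot of PCV against weight
plot(PCV ~ WEIGHT, data = prod, col = "orange", pch = 19,
     ylab = "Packed Cell Volume", xlab = "Weight")

invisible(dev.off())
```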
Introduction to R
Data analysis – T-test
- t.test(prod2$WEIGHT~prod2$SEX)
Data analysis – Linear Regression
- Remember to attach the dataset first to make it active: attach(prod2)
- output1<-lm(PCV~WEIGHT)
Data analysis – Logistic Regression
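The slide gives no code for logistic regression, so the sketch below is an assumption about how the three analyses might be run together, not the course's own example. The data are simulated, and the binary response ANAEMIC is a hypothetical variable invented for the illustration. The data argument is used instead of attach().

```r
# Simulated data; ANAEMIC is a hypothetical 0/1 response.
set.seed(9)
n    <- 80
prod <- data.frame(SEX = factor(rep(c("F", "M"), each = n / 2)))
prod$WEIGHT  <- ifelse(prod$SEX == "M", 260, 230) + rnorm(n, sd = 20)
prod$PCV     <- 20 + 0.04 * prod$WEIGHT + rnorm(n, sd = 2)
prod$ANAEMIC <- rbinom(n, 1, plogis(6 - 0.025 * prod$WEIGHT))

# T-test: does mean weight differ between the sexes?
tt <- t.test(WEIGHT ~ SEX, data = prod)
print(tt)

# Linear regression: continuous response
lin <- lm(PCV ~ WEIGHT, data = prod)
print(summary(lin))

# Logistic regression: binary response, fitted with glm()
logit <- glm(ANAEMIC ~ WEIGHT, family = binomial, data = prod)
print(summary(logit))

# Analysis of deviance table for the logistic model
print(anova(logit, test = "Chisq"))
```

Exponentiating the WEIGHT coefficient of the logistic model, exp(coef(logit)), gives the odds ratio per unit of weight.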
References
• Research Methods & Biometrics Teaching Resource: Case Studies 1 & 4 use R and Case Study 2 has ANOVA; many others are used in this session as examples. The study guides are useful reference material on Exploratory and Formal Analysis.
• Take home: Analysis the Data & Models Chapters in Green Book
• Good Statistical Practice for Natural Resources Research – Part IV
• R Intro Course Notes – Nicholas Ndiwa
• Reading University SSC – Approaches to Analysis of Survey Data; Confidence & Significance: Key Concepts of Inferential Statistics; Modern methods of analysis; Analysis of Experimental Data