Published on June 27, 2009
The Art and Science of Test Development, Part A: Planning, development frameworks & domain/test specification blueprints
Kevin S. McGrew, PhD, Educational Psychologist, Research Director, Woodcock-Muñoz Foundation
The basic structure and content of this presentation are grounded extensively in the test development procedures developed by Dr. Richard Woodcock
The Art and Science of Test Development
The above titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.
Part A: Planning, development frameworks & domain/test specification blueprints
Part B: Test and Item Development
Part C: Use of Rasch Technology
Part D: Develop norm (standardization) plan
Part E: Calculate norms and derived scores
Part F: Psychometric/technical and statistical analysis: Internal
Part G: Psychometric/technical and statistical analysis: External
The current module is designated by red bold font lettering
“In an ever-changing world, psychological testing remains the flagship of applied psychology” Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8 (4), 341-349.
Desirable Personality Traits of Test Developers • Obsessive-compulsive • Intellectually inquisitive • Masochistic • 1/99 % I/P ratio • Sadistic • Tough-skinned
Desirable Personality Traits of Test Developers • Willingness to take risks and giant leaps of faith
The approach to test development used: Item Response Theory (IRT). Classical Test Theory (CTT; X = T + E, observed score = true score + error) and IRT vs. CTT comparisons are not covered in this presentation
The bible of test development: The “Joint Standards”
Test development is a complex series of interconnected steps • The reality of the complexity of test development is not fully appreciated by most test users • The following complex flow-charts are intended to illustrate the magnitude of the overall project complexity • This presentation will focus on the more general, broad stroke test development framework • The process is much more non-linear than depicted by flow charts and presentations
“Generic” Woodcock test development flowchart
Test/Battery Development: Practical “Broad Stroke” Framework (Woodcock)
A detailed description for a test, often called a test blueprint, that specifies: • The number or proportion of items that assess each content and process/skill area • The format of items, response, and scoring rubrics and procedures, and • The desired psychometric properties of the items and test such as the distribution of item difficulty and discrimination abilities
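The first blueprint element above (the number or proportion of items per content area) can be sketched as a small data structure. This is a minimal illustration, not the presenter's procedure: the `ContentArea` class, the `allocate_items` helper, the area names, and the proportions are all hypothetical, and largest-remainder rounding is just one reasonable way to turn proportions into whole-item counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentArea:
    """One content or process/skill area and its target share of the items."""
    name: str
    proportion: float  # target share of the total item pool, between 0 and 1


def allocate_items(areas: list[ContentArea], total_items: int) -> dict[str, int]:
    """Convert target proportions into whole-item counts (largest-remainder rounding)."""
    if abs(sum(a.proportion for a in areas) - 1.0) > 1e-9:
        raise ValueError("blueprint proportions must sum to 1")
    exact = {a.name: a.proportion * total_items for a in areas}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the areas with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical blueprint: area names and proportions are illustrative only.
blueprint = [
    ContentArea("Spatial Relations", 0.25),
    ContentArea("Visualization", 0.50),
    ContentArea("Visual Memory", 0.25),
]
item_counts = allocate_items(blueprint, total_items=40)
print(item_counts)
```

A fuller blueprint would also record the item format, scoring rubric, and target difficulty distribution for each area; those fields are omitted here to keep the sketch short.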
Test/Battery Development: Common Conceptual Psychometric Validity Framework (Benson, 1998 summary)
Substantive Stage of Test Development
Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities; cognitive + achievement)
Questions asked: How should intelligence be defined and operationally measured?
Method and concepts:
• Theory development & validation
• Generate definitions
• Item and scale development
• Content validation
• Evaluate construct underrepresentation and construct irrelevancy
Characteristics of strong test validity program:
• A strong psychological theory plays a prominent role
• Theory provides a well-specified and bounded domain of constructs
• The empirical domain includes measures of all potential constructs (i.e., adequate construct representation)
• The empirical domain includes measures that only contain reliable variance related to the theoretical constructs (i.e., construct relevance)
Structural (Internal) Stage of Test Development
Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)
Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?
Method and concepts:
• Internal domain studies
• Item/subscale intercorrelations
• Exploratory/confirmatory factor analysis
• Item response theory (IRT)
• Multitrait-Multimethod matrix
• Generalizability theory
Characteristics of strong test validity program:
• Moderate item internal consistency
• Measures co-vary in a manner consistent with the intended theoretical structure
• Factors reflect trait rather than method variance
• Items/measures are representative of the empirical domain
• Items fit the theoretical structure
• The theoretical/empirical model is deemed plausible (especially when compared against other competing models) based on substantive and statistical criteria
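The slide lists item internal consistency without naming an index. As a minimal sketch, assuming Cronbach's alpha as that index (an assumption, not something the presentation specifies), the calculation on a persons-by-items score matrix might look like this. The `cronbach_alpha` function and the response data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Cronbach's alpha, where item_scores[p][i] is person p's score on item i."""
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    # Variance of each item across persons.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(n_items)]
    # Variance of the total (summed) score across persons.
    total_var = pvariance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 0/1 responses: 5 persons (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Population variance is used for both the item and total variances; using sample variance for both gives the same ratio, so the choice does not change alpha.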
External Stage of Test Development
Purpose: Examine the external relations among the focal construct (i.e., intelligence or cognitive abilities) and other constructs and/or subject characteristics
Questions asked: Do the focal constructs and observed measures “fit” within a network of expected construct relations (i.e., the nomological network)?
Method and concepts:
• Group differentiation
• Structural equation modeling
• Correlation of observed measures with other measures
• Multitrait-Multimethod matrix
Characteristics of strong test validity program:
• Focal constructs vary in theorized ways with other constructs
• Measures of the constructs differentiate existing groups that are known to differ on the constructs
• Measures of focal constructs correlate with other validated measures of the same constructs
• Theory-based hypotheses are supported, particularly when compared to rival hypotheses
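One of the methods listed above, correlating observed measures with other measures, can be illustrated with a plain Pearson correlation. This is only a sketch under stated assumptions: the three score lists are invented, and the labels "established measure of the same construct" and "measure of an unrelated construct" are hypothetical stand-ins for whatever external measures a real study would use.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same five examinees on three measures.
new_test = [10, 12, 14, 16, 18]           # the test under development
same_construct = [20, 25, 27, 33, 35]     # an established measure of the same construct
other_construct = [7, 3, 9, 4, 6]         # a measure of an unrelated construct

convergent_r = pearson_r(new_test, same_construct)
discriminant_r = pearson_r(new_test, other_construct)
print(round(convergent_r, 2), round(discriminant_r, 2))
```

In this toy data the new test correlates strongly with the same-construct measure and only weakly with the unrelated one, which is the pattern the last two characteristics on the slide describe.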
Practical “Broad Stroke” Framework: Typical Questions to Ask (Woodcock)
What is the intended purpose?
Who are the potential users?
Who are the intended examinees?
What domain(s) of behavior are to be measured and in what proportion?
• Content/substantive validity
• Maximize construct representation
• Minimize construct irrelevant variance
What type, or types, of items are to be used?
How is the test to be scored?
• By hand, machine, computer
• Scoring rubrics/guides
• Correction for guessing
What types of derived scores will be provided?
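The slide mentions correction for guessing without saying which correction is meant. A minimal sketch, assuming the classic formula-scoring correction for multiple-choice items (corrected score = right − wrong / (options − 1), with omitted items counted as neither right nor wrong), is shown below. The `corrected_score` function name and the example numbers are hypothetical.

```python
def corrected_score(n_right: int, n_wrong: int, n_options: int) -> float:
    """Formula-scoring correction for guessing on multiple-choice items.

    Omitted items are not counted as wrong. The correction assumes every
    wrong answer came from a blind guess among n_options equally likely choices.
    """
    if n_options < 2:
        raise ValueError("items need at least two response options")
    return n_right - n_wrong / (n_options - 1)


# Hypothetical examinee: 30 right, 9 wrong, on four-option items.
score = corrected_score(n_right=30, n_wrong=9, n_options=4)
print(score)
```

Whether to apply any correction is itself a design decision; the blind-guessing assumption rarely holds exactly, so many tests report uncorrected number-right scores instead.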
Practical “Broad Stroke” Framework: Typical Questions to Ask (Woodcock)
How are the scores to be interpreted?
• Types of profiles to provide
What physical materials are needed and how should they appear?
• Test books
• Test records
• Manipulatives
• Audio tapes/CDs
• Computer disks
• Scoring keys
• Manuals
• Training materials
• etc.
This presentation is an integration of the practical and psychometric test/battery frameworks Practical “Broad Stroke” Framework Common Conceptual Psychometric Validity Framework
Substantive Stage Structural (Internal) & External Stages
Examples used in this presentation come from the domain of intelligence or cognitive abilities (cognitive + achievement). Based on the presenter's experience as a coauthor of the Woodcock-Johnson Battery, Third Edition (WJ III; 2001)
Typically there are two types of test specification blueprints • Well defined a priori (typically theory-based) blueprints • Less well-defined (emerging) data-driven (empirical) blueprints
Possible theory-based intelligence model test design blueprints (select examples) Gardner MI theory Das-Naglieri PASS Theory Cattell-Horn-Carroll (CHC) theory
Possible emerging, empirical, or pragmatic intelligence model test design blueprints (select examples) Original Wechsler Verbal/Nonverbal model 1977 WJ Pragmatic Decision-Making model
Substantive Stage of Test Development
Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities; cognitive + achievement)
Questions asked: How should intelligence be defined and operationally measured?
Method and concepts:
• Theory development & validation
Characteristics of strong test validity program:
• A strong psychological theory plays a prominent role
• Theory provides a well-specified and bounded domain of constructs
• Psychometric approach: is the dominant approach, has inspired the most research, is used most widely in practical settings (p. 77).
• Several theorists argue that there are many different “intelligences” (systems of abilities), only a few of which can be captured by standard psychometric tests (p. 78)
CHC Theory Defined • Combination of research by Raymond Cattell, John Horn, and John Carroll • The most empirically-supported, psychometric-based, contemporary description of the structure of human cognitive abilities • Based on the analyses of hundreds of data sets that were not restricted to a particular test battery • The theory describes cognitive abilities as a function of degree of breadth/generality – Broad and narrow cognitive abilities
Carroll and Cattell-Horn Model Comparison
Cattell-Horn: Fluid Intelligence (Gf); Quantitative Knowledge (Gq); Crystallized Intelligence (Gc); Short-Term Memory (Gsm); Visual Processing (Gv); Auditory Processing (Ga); Long-Term Retrieval (Glr); Processing Speed (Gs); Correct Decision Speed (CDS); Reading/Writing (Grw)
Carroll: g; Fluid Intelligence (Gf); Crystallized Intelligence (Gc); Gen. Memory & Learning (Gy); Broad Visual Perception (Gv); Broad Auditory Perception (Gu); Broad Retrieval Ability (Gr); Broad Cognitive Speediness (Gs); Dec/Reaction Time/Speed (Gt)
...most disciplines have a common set of terms and definitions (i.e., a standard nomenclature) that facilitates communication among professionals and guards against misinterpretations. In chemistry, this standard nomenclature is reflected in the ‘Table of Periodic Elements’. Carroll (1993a) has provided an analogous table for intelligence….. (Flanagan & McGrew, 1998)
The verdict is unanimous re: the importance of Carroll’s (1993) work
Richard Snow (1993): “John Carroll has done a magnificent thing. He has reviewed and reanalyzed the world’s literature on individual differences in cognitive abilities…no one else could have done it… it defines the taxonomy of cognitive differential psychology for many years to come.”
Burns (1994): Carroll’s book “is simply the finest work of research and scholarship I have read and is destined to be the classic study and reference work on human abilities for decades to come” (p. 35).
John Horn (1998): A “tour de force summary and integration” that is the “definitive foundation for current theory” (p. 58). Horn compared Carroll’s summary to “Mendelyev’s first presentation of a periodic table of elements in chemistry” (p. 58).
Arthur Jensen (2004): “…on my first reading this tome, in 1993, I was reminded of the conductor Hans von Bülow’s exclamation on first reading the full orchestral score of Wagner’s Die Meistersinger, ‘It’s impossible, but there it is!’” “Carroll’s magnum opus thus distills and synthesizes the results of a century of factor analyses of mental tests. It is virtually the grand finale of the era of psychometric description and taxonomy of human cognitive abilities. It is unlikely that his monumental feat will ever be attempted again by anyone, or that it could be much improved on. It will long be the key reference point and a solid foundation for the explanatory era of differential psychology that we now see burgeoning in genetics and the brain sciences” (p. 5).
Carroll and Cattell-Horn Broad Ability Correspondence (vertically-aligned ovals represent similar broad domains)
A. Carroll Three-Stratum Model: Stratum III (general): g; Stratum II (broad): Gf, Gc, Gy, Gv, Gu, Gr, Gs, Gt
B. Cattell-Horn Extended Gf-Gc Model: Gf, Gc, Gsm (SAR), Glm (TSR), Gv, Ga, Gs, CDS, Grw, Gq
C. Cattell-Horn-Carroll (CHC) Integrated Model: g; Stratum II (broad): Gf, Gc, Gsm, Gv, Ga, Glr, Gs, Gt, Grw, Gq (Missing g-to-broad ability arrows acknowledges that Carroll and Cattell-Horn disagreed on the validity of the general factor)
D. Tentatively identified Stratum II (broad) domains [1]: Gkn, Gh, Gk, Go, Gp, Gps
Notes. Broad ability factor codes based on Carroll (1993) and Horn and Blankson (2005). See Table 1 for additional explanation. 80+ Stratum I (narrow) abilities have been identified under the Stratum II broad abilities. They are not listed here due to space limitations (see Table 1). Placement of g to the left-side of the Carroll Three-Stratum Model (A) is consistent with Carroll's (1993) published figures, a placement reflecting his finding that the broad abilities towards the left (e.g., Gf, Gc) had the highest loadings on the g-factor. The placement of the Stratum II Grw and Gq factors in the Cattell-Horn Extended Gf-Gc Model (B) is not consistent with this g-broad ability representation as Grw and Gq (broad) typically demonstrate high g-loadings. Grw and Gq are placed to the right in B to reflect their absence in model A.
CHC Broad (Stratum II) Ability Domains (see Table 1 for definitions)
Gf: Fluid reasoning
Gc: Comprehension-knowledge
Gsm: Short-term memory
Gv: Visual processing
Ga: Auditory processing
Glr: Long-term storage and retrieval
Gs: Processing speed
Gt: Decision and reaction speed
Grw: Reading and writing
Gq: Quantitative knowledge
Gkn: General (domain-specific) knowledge
Gh: Tactile abilities
Gk: Kinesthetic abilities
Go: Olfactory abilities
Gp: Psychomotor abilities
Gps: Psychomotor speed
[1] See McGrew (2004, 2005) for literature review supporting these domains
© Institute for Applied Psychometrics, LLC Kevin S. McGrew 7-22-08
Substantive Stage of Test Development: Develop Test Design and Specification Blueprint
Theoretical Domain Specification
[Figure: g and the broad abilities Gf, Gv, Glr, Gs, Gc, Gsm, Ga. Cylinders = broad CHC abilities; circles = narrow CHC abilities]
• What is the theoretical domain?
• How should intelligence be defined?
• What intelligence theory has the best validity evidence?
Answer: Cattell-Horn-Carroll (CHC) theory of cognitive abilities
Substantive Stage of Test Development: Develop Test Design and Specification Blueprint
Theoretical Domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities
[Figure: g and the broad abilities Gf, Gv, Glr, Gs, Gc, Gsm, Ga. Cylinders = broad CHC abilities; circles = narrow CHC abilities]
What broad and narrow ability domain(s) are to be measured and in what proportion?
• Answer relates to questions regarding intended purpose of battery, intended examinees, and intended users.
• How do we assure adequate construct representation? How do we define the broad and narrow ability constructs?
• Content validity important
Substantive Stage of Test Development: Develop Test Design and Specification Blueprint
Theoretical Domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities
[Figure: g and the broad abilities Gf, Gv, Glr, Gs, Gc, Gsm, Ga. Cylinders = broad CHC abilities; circles = narrow CHC abilities]
Example domain to be used for illustration of process: Gv (Visual Processing)
What narrow Gv ability domain(s) are to be measured and in what proportion?
• Answer relates to questions regarding intended purpose of battery, intended examinees, and intended users.
• How do we assure adequate construct representation?
Substantive Stage of Test Development
Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities; cognitive + achievement)
Questions asked: How should intelligence be defined and operationally measured?
Method and concepts:
• Generate definitions
Characteristics of strong test validity program:
• The empirical domain includes measures of all potential constructs (i.e., adequate construct representation)
Definition of broad Gv (Visual Processing)
• Ability to perceive, analyze, synthesize and think with visual patterns
• Ability to store and recall visual representations
• Fluent thinking with stimuli that are visual in the “mind’s eye”
What narrow Gv ability domain(s) are to be measured and in what proportion?
• Answer relates to questions regarding intended purpose of battery, intended examinees, and intended users.
• How do we assure adequate construct representation?
Narrow Gv ability definitions
Spatial Relations (SR): Ability to rapidly perceive and manipulate relatively simple visual patterns or to maintain orientation with respect to objects in space.
Visualization (Vz): The ability to apprehend a spatial form, object, or scene and match it with another spatial object, form, or scene with the requirement to rotate it (one or more times) in two or three dimensions. Requires the ability to mentally imagine, manipulate or transform objects or visual patterns (without regard to speed of responding).
Visual Memory (MV): Ability to form and store a mental representation or image of a visual stimulus and then recognize or recall it later.
We will focus on one: Visualization (Vz)
Substantive Stage of Test Development
Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities; cognitive + achievement)
Questions asked: How should intelligence be defined and operationally measured?
Method and concepts:
• Content validation
Characteristics of strong test validity program
Content validity evidence
Knowledge and skills covered (sampled) by the test items should be representative of the larger population domain of knowledge and skills.
Refers to logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores (Joint Test Standards).
This is a non-statistical type of validity that involves “the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured” (Anastasi & Urbina, 1997).
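A logical analysis of whether test content represents the content domain can start with a simple coverage check. The sketch below is only an illustration under stated assumptions: the target domain is restricted to the three narrow Gv abilities named in this presentation (SR, Vz, MV), the test names and their ability classifications are invented, and the `coverage_report` helper is hypothetical. In practice the classifications would come from expert judgment or empirical analysis.

```python
# Narrow Gv abilities named in this presentation, treated here as the target domain.
GV_DOMAIN = {
    "SR": "Spatial Relations",
    "Vz": "Visualization",
    "MV": "Visual Memory",
}

# Hypothetical tests and the narrow abilities each is judged to measure.
test_classifications = {
    "Block Rotation": {"Vz"},
    "Shape Assembly": {"Vz", "SR"},
}


def coverage_report(domain: dict[str, str],
                    classifications: dict[str, set[str]]) -> dict[str, list[str]]:
    """List which domain abilities are covered, missing, or measured but off-domain."""
    domain_codes = set(domain)
    measured: set[str] = set()
    for abilities in classifications.values():
        measured |= abilities
    return {
        "covered": sorted(measured & domain_codes),
        "missing": sorted(domain_codes - measured),         # possible underrepresentation
        "outside_domain": sorted(measured - domain_codes),  # possible irrelevant content
    }


report = coverage_report(GV_DOMAIN, test_classifications)
print(report)
```

Here the report flags Visual Memory (MV) as unmeasured, which is the kind of gap a content validity review is meant to surface before item writing begins.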
Content validity evidence: One example
Content validity evidence: One example (cont. – for all tests in battery) Etc…….
Content validity evidence: Another example in the domain of reading: Logical—theoretical skill hierarchy task analysis model
End of Part A Additional steps in test development process will be presented in subsequent modules as they are developed