Published on June 29, 2009
The Art and Science of Test Development: Part C
Test and item development: Use of Rasch scaling technology
Kevin S. McGrew, PhD, Educational Psychologist, Research Director, Woodcock-Muñoz Foundation
The basic structure and content of this presentation are grounded extensively in the test development procedures developed by Dr. Richard Woodcock
The Art and Science of Test Development
This topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.
• Part A: Planning, development frameworks & domain/test specification blueprints
• Part B: Test and Item Development
• Part C: Use of Rasch Technology (the current module)
• Part D: Develop norm (standardization) plan
• Part E: Calculate norms and derived scores
• Part F: Psychometric/technical and statistical analysis: Internal
• Part G: Psychometric/technical and statistical analysis: External
Important note: In the online public versions of this PPT module, certain items, information, etc. are obscured for test security or proprietary reasons. Sorry.
Use Rasch (IRT) scaling to evaluate the complete pool of items and to develop the Norming and Publication tests
Structural (Internal) Stage of Test Development
Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)
Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?
Method and concepts
• Internal domain studies
• Item/subscale intercorrelations
• Item response theory (IRT)
Characteristics of a strong test validity program
• Moderate item internal consistency
• Items/measures are representative of the empirical domain
• Items fit the theoretical structure
Item Scale Development via Rasch technology
Theoretical domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities: Gv domain & 3 selected narrow Gv abilities
Measurement or empirical domain: items ranging from low ability/easy to high ability/difficult
Rasch scale and evaluate the complete pool of items to develop Norming and Publication tests
Recall that Block Rotation items have 2 possible correct answers. Therefore there is a scoring question:
• Should items be scaled as 0/1 (need both correct to receive 1)?
• Should items be scaled as 0/1/2?
Item data can be Rasch-scaled with both scoring systems, and the one that provides the best reliability, etc. can then be selected. We decided to go with the 0/1/2 scoring system.
Important understanding regarding 0/1 and multiple-point (0/1/2) scoring systems when using Rasch/IRT
• Dichotomous (0/1) item scoring: 0 → 1 is 1 “step”
• Multiple-point (0/1/2) item scoring: 0 → 1 is 1 “step” and 1 → 2 is 1 “step”
Therefore, think of 2-step items as two 0/1 items
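The “steps” idea can be made concrete with the Rasch partial credit model, one common way of scaling 0/1/2 items. The sketch below is only an illustration: the slides do not say which polytomous model or software was used for Block Rotation, the function name is hypothetical, and the step difficulties are made-up values in logits.

```python
import math

def category_probabilities(theta, step_difficulties):
    """Partial credit model: probability of each score 0..m on one item.

    theta: person ability in logits.
    step_difficulties: difficulty of each step in logits (hypothetical values).
    """
    # Score k accumulates (theta - step difficulty) over the first k steps;
    # score 0 has an empty sum, i.e. 0.0.
    exponents = [0.0]
    for delta in step_difficulties:
        exponents.append(exponents[-1] + (theta - delta))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# A 0/1 item is the one-step special case of the same formula.
p_dichotomous = category_probabilities(0.0, [0.0])

# A 0/1/2 item has two steps, each with its own difficulty.
p_two_step = category_probabilities(0.0, [-1.0, 1.0])
```

With one step the function reduces to the ordinary dichotomous Rasch model, which is why a 2-step item can be thought of as behaving like two linked 0/1 items.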
Rasch IRT “norms” (calibrates) the scale! Think of the items as now having been placed in their proper position on an equal interval ruler or yardstick. Each item is a “tick” mark along the latent trait scale.
A major advantage/feature of a large Rasch IRT-scaled item pool
Once you have a large Rasch IRT-scaled item pool, you can develop different, customized scales that place people on the same underlying scale
• CAT (computer adaptive testing)
• Different and unique forms of the test
A major advantage/feature of a large IRT-scaled item pool
[Figure: three item scales running from Easy (bottom) to Hard (top): Norming test, Publication test, possible special Research Edition tests]
• All three tests have items on the same scale (W-scale)
• Although each test has a different number of items, the obtained person ability W-scores are equivalent but differ in degree of precision (reliability)
• The average size of the “gaps” between items on the respective scales is called “item density”
• The W-scale is an equal interval metric
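One simple way to read “item density” is as the mean gap between adjacent item difficulties once the items are sorted. The sketch below assumes that reading; the function name and the W-difficulties are hypothetical and are not taken from the Block Rotation calibration.

```python
def item_density(w_difficulties):
    """Mean gap between adjacent item difficulties after sorting."""
    ordered = sorted(w_difficulties)
    if len(ordered) < 2:
        raise ValueError("need at least two items to measure a gap")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical W-difficulties for a longer form of a test.
long_form = [440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540]
# A shorter form built from every other item: same scale, wider gaps.
short_form = long_form[::2]

long_gap = item_density(long_form)
short_gap = item_density(short_form)
```

Both forms span the same range of the scale, so they measure on the same ruler; the shorter form simply has fewer tick marks, which is where the loss of precision comes from.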
2 Major Rasch results
• People are assigned W-ability scores
• Items are assigned W-difficulties
Rasch puts person ability and item difficulty on the same scale (W scale)
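Because persons and items sit on one scale, the chance of success on a dichotomous Rasch item depends only on the difference between person ability and item difficulty. The sketch below illustrates this in W units. It assumes the commonly published W-scale transformation, W = 9.1024 × logit + 500, which is not stated in these slides; the function names are hypothetical.

```python
import math

W_PER_LOGIT = 9.1024   # assumed scaling constant (not stated in the slides)
W_CENTER = 500.0       # assumed center of the W scale

def logit_to_w(logit):
    """Convert a Rasch logit value to the W scale (assumed transformation)."""
    return W_PER_LOGIT * logit + W_CENTER

def p_correct(person_w, item_w):
    """Dichotomous Rasch probability of success, both values in W units."""
    diff_logits = (person_w - item_w) / W_PER_LOGIT
    return 1.0 / (1.0 + math.exp(-diff_logits))

p_equal = p_correct(500.0, 500.0)   # ability equals difficulty
p_above = p_correct(520.0, 500.0)   # ability 20 W points above difficulty
p_below = p_correct(480.0, 500.0)   # ability 20 W points below difficulty
```

Under the assumed constants, a person whose ability equals the item difficulty has a 50% chance of success, and being 20 W points above or below the item corresponds to roughly 90% or 10%.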
2 Major Rasch results: Item W-difficulties and Person W-ability scores
Select and order items for the Publication test based on inspection of Rasch results
• Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)
• Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects)
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Measure order and fit statistics table: used to select items with specified item density
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Distribution of Block Rotation W-ability scores in norm sample
• Majority of Block Rotation norm sample obtained W-scores from 480-520
• Complete range (including extremes) of Block Rotation W-scores is 432-546
Recall that the Block Rotation scoring system is 0/1/2: items have “steps”
Multiple-point (0/1/2) item scoring: 0 → 1 is 1 “step” and 1 → 2 is 1 “step”
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Item map with “steps” displayed for items
• Blue area represents the Block Rotation W-scores of the majority of norm sample subjects
• Item 1 (0/1/2) step structure: 1 “step” + 1 “step”
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Item map with “steps” displayed for items
• Blue area represents the Block Rotation W-scores of the majority of norm sample subjects
• Adequate “top” or “ceiling” for test scale
• Excellent “bottom” or “floor” for test scale
• Very good test scale coverage for majority of population
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Item map with “steps” displayed for items
• Red area represents the complete range (including extremes) of sample Block Rotation W-scores
• Good test scale coverage for complete range of population
Block Rotation Rasch floor/ceiling results confirmed by formal ±3 SD floor/ceiling analysis (24-300 months of age)
[Figure: BLKROT floor (rs = 1) and ceiling (rs = max) plot. Y-axis: Ref W ± 3 SDs, 430 to 550. X-axis: camos, 0 to 300.]
Block Rotation Rasch floor/ceiling results confirmed by formal ±3 SD floor/ceiling analysis (300-1200 months of age)
[Figure: BLKROT floor (rs = 1) and ceiling (rs = max) plot. Y-axis: Ref W ± 3 SDs, 430 to 550. X-axis: camos, 300 to 1200.]
2 Major Rasch results: Item W-difficulties and Person W-ability scores
Program generates final RS to W-ability scoring table
• Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)
• Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects)
Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)
Raw score to W-score “scoring table”
Note: The total possible raw score is 74 points for 37 items. These are 2-step items: 37 items x 2 steps = 74 total possible points.
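A raw-score-to-ability table can be built because, in Rasch models, each raw score maps to a single ability estimate: the ability at which the expected raw score equals the observed one. The sketch below shows that idea for 2-step items using the partial credit model and bisection. It is an illustration under stated assumptions, not the program used for Block Rotation; the function names and the three items are hypothetical, abilities are in logits, and extreme scores (0 and the maximum) are left out because they need a separate convention that the slides do not describe.

```python
import math

def expected_item_score(theta, steps):
    """Expected score on one partial credit item at ability theta (logits)."""
    exponents = [0.0]
    for delta in steps:
        exponents.append(exponents[-1] + (theta - delta))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return sum(k * w for k, w in enumerate(weights)) / total

def expected_raw_score(theta, items):
    """Expected total raw score over all items at ability theta."""
    return sum(expected_item_score(theta, steps) for steps in items)

def ability_for_raw_score(raw, items, lo=-15.0, hi=15.0, tol=1e-8):
    """Bisection: find the ability whose expected raw score equals `raw`."""
    max_score = sum(len(steps) for steps in items)
    if not 0 < raw < max_score:
        raise ValueError("extreme scores need a separate convention")
    # Expected raw score increases with theta, so bisection converges.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, items) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Three hypothetical two-step items (step difficulties in logits).
items = [[-2.0, -1.0], [-0.5, 0.5], [1.0, 2.0]]
max_score = sum(len(steps) for steps in items)   # 3 items x 2 steps = 6
table = {raw: ability_for_raw_score(raw, items) for raw in range(1, max_score)}
```

The same routine applied to a 37-item set of 2-step items would give a table for raw scores 1 to 73; converting the logit abilities to W-scores is a separate linear rescaling.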
Block Rotation Norming test (n = 44 items)
44 items x 2 steps = raw scores from 0 to 88 on the Rasch-based scoring table (the equal interval Visualization-Vz measurement “ruler” or “yardstick”)
Raw Score   W-score
88          545.7
87          539.0
...         ...
1           437.8
0           431.6
Block Rotation Norming and Publication tests, although having different numbers of items (and total Raw Scores), are on the same underlying measurement scale (ruler)
Norming test (n = 44 items)      Publication test (n = 37 items)
Raw Score   W-score              Raw Score   W-score
88          545.7                74          545.7
87          539.0                73          539.0
...         ...                  ...         ...
1           437.8                1           437.8
0           431.6                0           431.6
2 Major Rasch results: Item W-difficulties and Person W-ability scores
Program generates final RS to W-ability scoring table
Result: All norm subjects with Block Rotation scores (n = 4,722) now have scores on the equal interval W-score scale
• Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)
• Block Rotation Publication test (n = 37 items)
2 Major Rasch results: Item W-difficulties and Person W-ability scores
Program generates final RS to W-ability scoring table
Result: All norm subjects with Block Rotation scores (n = 4,722) now have scores on the equal interval W-score scale
These Block Rotation W-scores are then used for developing test “norms” and completing technical manual analysis and validity research
• Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)
• Block Rotation Publication test (n = 37 items)
Block Rotation Summary: Final Rasch for Publication test, graphic item map
n = 37 norming items (0-74 RS points); n = 4,722 norm subjects
Graphic display of distribution of Block Rotation person abilities
Pub. Test W-score scale: 432 to 546
These Block Rotation W-scores are then used for developing test “norms” and validity research
Recall the early warning to expect the unexpected and the non-linear “art and science” of test development
Last-minute question raised (prior to formal production) about the Block Rotation test: Should the blocks be shaded/colored instead of being black and white? Would adding shading/color change the nature of the task? What to do?
Answer: Do a study. Gather some empirical data to help make the decision. The question should be answered empirically; you should not assume that colorizing items will make no difference.
Special Block Rotation no-color vs color group administration study completed
Sample size plan: approx. 300+ subjects in 3 groups spanning the complete range of Block Rotation ability
• 2nd – 4th graders: approx. 100+
• 7th – 11th graders: approx. 100+
• College students: approx. 100+
• Final total sample was 380 subjects
Group administration version of test
Two forms of test constructed from complete set of ordered (scaled) items
• White version: even items
• Colored version: odd items
Analyses: Rasch analysis and comparison of respective item difficulties, and mean score comparison between versions
Conclusion: adding color did NOT change the psychometric characteristics of the items/test; therefore print the final test with colored items
Final Block Rotation Publication Test Constructed
n = 37 (0/1/2) items; Raw Scores from 0-74
Two sample items
• Rasch (IRT) is a magnificent tool for evaluating and constructing tests, with flexibility during the entire process. Embrace IRT methods in applied test development (vs CTT methods).
• It is important to remember that you are calibrating the scale, not norming the test, during this phase. Samples with rectangular distributions of ability are critical.
• Carefully inspect the Rasch results (esp. the measure order table) and determine whether you have enough easy and difficult items or need more items at certain places along the scale. Then use “linking/anchor” technology to add in new items.
• Item fit is a relative matter involving “reasonably acceptable approximate fit”. Don’t blindly follow black-and-white item fit rules from textbooks and articles. The “real world” of test development is not an ivory tower exercise.
• Follow the 3 basic Rasch assumptions (unidimensionality; equal discrimination; local independence) “within reason” (Woodcock).
• Many tests claim to use the Rasch model (Rasch “name dropping”) but use it only for item analyses and do not harness the advantages of the underlying Rasch ability scale (e.g., W-scale) for improved test construction and score interpretation procedures (e.g., RPIs).
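The “linking/anchor” idea can be sketched with one simple approach, a mean shift on common items: a new calibration has an arbitrary origin, so it is shifted until the anchor items have the same average difficulty as they do in the master pool. This is a minimal sketch under that assumption, not necessarily the linking procedure used in this project; the function name, item ids, and difficulties are all hypothetical.

```python
def link_to_master_scale(master_anchor, new_calibration):
    """Shift a new calibration onto the master scale using anchor items.

    master_anchor: {item_id: difficulty on the master scale} for anchor items.
    new_calibration: {item_id: difficulty from the new run}, anchors plus new items.
    Returns (shift, linked difficulties for the new, non-anchor items).
    """
    common = [i for i in master_anchor if i in new_calibration]
    if not common:
        raise ValueError("no anchor items in common")
    # Average amount by which anchors differ between the two calibrations.
    shift = sum(master_anchor[i] - new_calibration[i] for i in common) / len(common)
    linked = {
        item: difficulty + shift
        for item, difficulty in new_calibration.items()
        if item not in master_anchor
    }
    return shift, linked

# Hypothetical values: anchors A1-A3 exist in the master pool; N1-N2 are new.
master_anchor = {"A1": 480.0, "A2": 500.0, "A3": 520.0}
new_calibration = {"A1": -20.0, "A2": 0.5, "A3": 19.5, "N1": 30.0, "N2": -35.0}
shift, linked = link_to_master_scale(master_anchor, new_calibration)
```

In practice one would also check that each anchor item's difficulty is stable across the two calibrations before trusting the shift, since a drifting anchor distorts the placement of every new item.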
• Maintaining a master item pool
• Norming-calibration tests
• Linking/equating (alternate forms) tests
• Adding new items to master item pool (use of anchor items from master item pool)
• Checking for possible item bias (DIF: differential item functioning)
• Creating and using shortened special purpose versions of tests (norming tests; research edition tests; tests for special populations)
• Flagging potentially poor examiners via empirical “person fit” statistics report
• Computer adaptive testing (CAT)
End of Part C
Additional steps in the test development process will be presented in subsequent modules as they are developed
Intelligent Insights on Intelligence Theories and Tests