
Checking Equity: Why Differential Item Functioning Analysis Should Be a Routine Part of Developing Conceptual Assessments

CBE—Life Sciences Education. 2017 Summer;16(2).

Language: English. Country: United States. Media: Print.

Document type: Journal Article

We provide a tutorial on differential item functioning (DIF) analysis, an analytic method useful for identifying potentially biased items in assessments. After explaining a number of methodological approaches, we test for gender bias in two scenarios that demonstrate why DIF analysis is crucial for developing assessments, particularly because simply comparing two groups' total scores can lead to incorrect conclusions about test fairness. First, a significant difference between groups on total scores can exist even when items are not biased, as we illustrate with data collected during the validation of the Homeostasis Concept Inventory. Second, item bias can exist even when the two groups have exactly the same distribution of total scores, as we illustrate with a simulated data set. We also present a brief overview of how DIF analysis has been used in the biology education literature to illustrate the way DIF items need to be reevaluated by content experts to determine whether they should be revised or removed from the assessment. Finally, we conclude by arguing that DIF analysis should be used routinely to evaluate items in developing conceptual assessments. These steps will ensure more equitable, and therefore more valid, scores from conceptual assessments.
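
The analyses described above can be run with the R packages cited in the reference list (difR, difNLR, ShinyItemAnalysis). The sketch below is not code from the article itself; it assumes a placeholder data frame named responses containing dichotomously scored (0/1) items and a placeholder vector named gender giving each respondent's group, and it applies two DIF screens from the difR package.

# Minimal sketch with assumed placeholder data; not the article's own analysis.
library(difR)

# Mantel-Haenszel test: compares the two groups item by item while matching
# respondents on total score, so an overall score gap alone does not flag items.
mh <- difMH(Data = responses, group = gender, focal.name = "F")
print(mh)

# Logistic regression test (Swaminathan and Rogers, 1990): detects both uniform
# and nonuniform DIF, with Benjamini-Hochberg adjustment of the item-level
# p values to control the false discovery rate.
lr <- difLogistic(Data = responses, group = gender, focal.name = "F",
                  type = "both", p.adjust.method = "BH")
print(lr)   # lists items flagged as functioning differentially
plot(lr)    # item-level test statistics against the detection threshold

Because both procedures condition on a matching score, a genuine ability difference between the groups does not by itself flag items; only items that behave differently for respondents of comparable total score are reported as DIF.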

References

Ackerman T. A. A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement. 1992;29:67–91.

Adams W. K., Wieman C. E. Development and validation of instruments to measure learning of expert‐like thinking. International Journal of Science Education. 2011;33:1289–1312.

Agresti A. Categorical data analysis. Hoboken, NJ: Wiley-Interscience; 2002.

Allen M. J., Yen W. M. Introduction to measurement theory. Monterey, CA: Brooks/Cole; 1979.

American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for educational and psychological testing. Washington, DC: 2014.

Benjamini Y., Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, Statistical Methodology. 1995;57:289–300.

Berger M., Tutz G. Detection of uniform and nonuniform differential item functioning by item-focused trees. Journal of Educational and Behavioral Statistics. 2016;41:559–592.

Boone W. J. Rasch analysis for instrument development: Why, when and how? CBE—Life Sciences Education. 2016;15:rm4. PubMed PMC

Cai L., Thissen D., du Toit S. IRTPRO [Software manual]. Version 2.1. Skokie, IL: Scientific Software International; 2011. Retrieved January 24, 2016, from.

Camilli G. Test fairness. In: Brennan R., National Council on Measurement in Education, editor. Educational measurement. Westport, CT: Praeger; 2006. pp. 220–256.

Camilli G., Shepard L. A. Methods for identifying biased test items. Thousand Oaks, CA: Sage; 1994.

Clauser B. E., Mazor K. M. Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice. 1998;17:31–44.

Creech L. R., Sweeder R. D. Analysis of student performance in large-enrollment life science courses. CBE—Life Sciences Education. 2012;11:386–391. PubMed PMC

Deane T., Nomme K., Jeffery E., Pollock C., Birol G. Development of the Statistical Reasoning in Biology Concept Inventory (SRBCI). CBE—Life Sciences Education. 2016;15:ar5. doi: 10.1187/cbe.15-06-0131. PubMed DOI PMC

Dennis J. E., Gay D. M., Welsch R. E. An adaptive nonlinear least-squares algorithm. ACM Transactions on Mathematical Software. 1981;7:348–368.

Doolittle A. E. 1985. Understanding differential item performance as a consequence of gender differences in academic background. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

Downing S. M., Haladyna T. M. Handbook of test development. Hillsdale, NJ: Lawrence Erlbaum; 2006.

Drabinová A., Martinková P. Detection of differential item functioning with non-linear regression: Non-IRT approach accounting for guessing. 2016 Retrieved May 11, 2017, from http://hdl.handle.net/11104/0259498.

Drabinová A., Martinková P., Zvára K. difNLR: Detection of Dichotomous Differential Item Functioning (DIF) and Differential Distractor Functioning (DDF) by Non-Linear Regression Models. 2017. R package version 1.0.0. Retrieved May 11, 2017, from https://CRAN.R-project.org/package=difNLR.

Ercikan K., Arim R., Law D., Domene J., Gagnon F., Lacroix S. Application of think aloud protocols for examining and confirming sources of differential item functioning identified by expert reviews. Educational Measurement: Issues and Practice. 2010;29:24–35.

Federer M. R., Nehm R. H., Pearl D. K. Examining gender differences in written assessment tasks in biology: a case study of evolutionary explanations. CBE—Life Sciences Education. 2016;15:ar2. PubMed PMC

Gelman A., Hill J. Data analysis using regression and multilevel/hierarchical models. New York: Cambridge University Press; 2007.

Güler N., Penfield R. D. A comparison of the logistic regression and contingency table methods for simultaneous detection of uniform and nonuniform DIF. Journal of Educational Measurement. 2009;46:314–329.

Hamilton L. S. Detecting gender-based differential item functioning on a constructed-response science test. Applied Measurement in Education. 1999;12:211–235.

Hills J. R. Screening for potentially biased items in testing programs. Educational Measurement: Issues and Practice. 1989;8:5–11.

Holland P. W. In Proceedings of the 17th Annual Conference of the Military Testing Association. 1985. pp. 282–287.

Holland P. W., Thayer D. T. Differential item performance and the Mantel-Haenszel procedure. In: Wainer H, Braun H. I., editors. Test validity. Hillsdale, NJ: Lawrence Erlbaum; 1988. pp. 129–145.

Holland P. W., Wainer H. Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum; 1993.

Huang X., Wilson M., Wang L. Exploring plausible causes of differential item functioning in the PISA science assessment: language, curriculum or culture. Educational Psychology. 2016;36:378–390.

IBM. IBM SPSS statistics for Windows. Version 22.0. Armonk, NY: 2013.

Kendhammer L., Holme T., Murphy K. Identifying differential performance in general chemistry: Differential item functioning analysis of ACS general chemistry trial tests. Journal of Chemical Education. 2013;90:846–853.

Kim J., Oshima T. C. Effect of multiple testing adjustment in differential item functioning detection. Educational and Psychological Measurement. 2013;73:458–470.

Kingston N., Leary L., Wightman L. An exploratory study of the applicability of item response theory methods to the Graduate Management Admission Test. ETS Research Report Series. 1985;1985(2):i–56.

Legewie J., DiPrete T. A. The high school environment and the gender gap in science and engineering. Sociology of Education. 2014;87:259–280. PubMed PMC

Libarkin J. Paper presented at: National Research Council Promising Practices in Undergraduate STEM Education Workshop 2 (October 13–14, Washington, DC) 2008.

Linacre J. M. Rasch dichotomous model vs. one-parameter logistic model. Rasch Measurement Transactions. 2005;19(3):1032.

Liu O. L., Wilson M. Gender differences in large-scale math assessments: PISA trend 2000 and 2003. Applied Measurement in Education. 2009;22:164–184.

Lord F. Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum; 1980.

Magis D., Beland S., Raiche G. difR: Collection of methods to detect dichotomous differential item functioning (DIF) in psychometrics. 2016. R package Version 4.7. Retrieved May 11, 2017, from https://CRAN.R-project.org/package=difR.

Magis D., Beland S., Tuerlinckx F., De Boeck P. A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods. 2010;42:847–862. PubMed

Magis D., Tuerlinckx F., De Boeck P. Detection of differential item functioning using the lasso approach. Journal of Educational and Behavioral Statistics. 2014;40:111–135.

Mantel N., Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute. 1959;22:719–748. PubMed

Martinello M., Wolf M. K. Exploring ELLs’ understanding of word problems in mathematics assessments: the role of text complexity and student background knowledge. In: Celedón-Pattichis S., Ramirez N., editors. Beyond good teaching: Strategies that are imperative for English language learners in the mathematics classroom. Reston, VA: National Council of Teachers of Mathematics; 2012.

Martinková P., Drabinová A., Leder O., Houdek J. ShinyItemAnalysis: test and item analysis via Shiny. R package Version 1.1.0. 2017. Retrieved May 11, 2017, from https://CRAN.R-project.org/package=ShinyItemAnalysis.

McFarland J. L., Price R. M., Wenderoth M. P., Martinková P., Cliff W., Michael J., Modell H., Wright A. Development and validation of the homeostasis concept inventory. CBE—Life Sciences Education. 2017;16:ar35. PubMed PMC

Millsap R. E., Everson H. T. Methodology review: statistical approaches for assessing measurement bias. Applied Psychological Measurement. 1993;17:297–334.

Moore D., Notz W., Fligner M. A. The basic practice of statistics. New York: Freeman; 2015.

Narayanan P., Swaminathan H. Identification of items that show nonuniform DIF. Applied Psychological Measurement. 1996;20:257–274.

Neumann I., Neumann K., Nehm R. Evaluating instrument quality in science education: Rasch-based analyses of a nature of science test. International Journal of Science Education. 2011;33:1373–1405. doi: 10.1080/09500693.2010.511297. DOI

Noble T., Suarez C., Rosebery A., O’Connor M. C., Warren B., Hudicourt-Barnes J. “I never thought of it as freezing”: How students answer questions on large-scale science tests and what they know about science. Journal of Research in Science Teaching. 2012;49:778–803.

Penfield R. D., Lee O. Test-based accountability: potential benefits and pitfalls of science assessment with student diversity. Journal of Research in Science Teaching. 2010;47:6–24.

Raju N. S. The area between two item characteristic curves. Psychometrika. 1988;53:495–502.

Raju N. S. Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement. 1990;14:197–207.

R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016. Retrieved May 11, 2017, from www.R-project.org.

Rasch G. Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press; 1960.

Rauschenberger M. M., Sweeder R. D. Gender performance differences in biochemistry. Biochemistry and Molecular Biology Education. 2010;38:380–384. PubMed

Reeves T. D., Marbach-Ad G. Contemporary test validity in theory and practice: a primer for discipline-based education researchers. CBE—Life Sciences Education. 2016;15:rm1. PubMed PMC

Romine W. L., Miller M. E., Knese S. A., Folk W. R. Multilevel assessment of middle school students’ interest in the health sciences: Development and validation of a new measurement tool. CBE—Life Sciences Education. 2016;15:ar21. PubMed PMC

Roussos L., Stout W. A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement. 1996;20:355–371.

Sabatini J., Bruce K., Steinberg J., Weeks J. SARA reading components tests, rise forms: technical adequacy and test design. ETS Research Report Series. 2015;2015(2): 1–20.

SABER. Biology concept inventories and assessments. n.d. Retrieved January 24, 2016, from http://saber-biologyeducationresearch.wikispaces.com/DBER-Concept+Inventories.

SAS Institute. SAS 9.4 language reference concepts. Cary, NC: 2013.

Shealy R., Stout W. A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika. 1993;58:159–194.

Siegel M. A. Striving for equitable classroom assessments for linguistic minorities: Strategies for and effects of revising life science items. Journal of Research in Science Teaching. 2007;44:864–881.

Smith M. U., Snyder S. W., Devereaux R. S. The GAENE—Generalized Acceptance of EvolutioN Evaluation: development of a new measure of evolution acceptance. Journal of Research in Science Teaching. 2016;53:1289–1315.

StataCorp. Stata statistical software. Release 14. College Station, TX: 2015.

Steif P. S., Dantzler J. A. A statics concept inventory: development and psychometric analysis. Journal of Engineering Education. 2005;94:363–371.

Štuka Č., Martinková P., Zvára K., Zvárová J. The prediction and probability for successful completion in medical study based on tests and pre-admission grades. New Educational Review. 2012;28:138–152.

Sudweeks R. R., Tolman R. R. Empirical versus subjective procedures for identifying gender differences in science test items. Journal of Research in Science Teaching. 1993;30:3–19.

Swaminathan H., Rogers H. J. Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement. 1990;27:361–370.

Thissen D., Wainer H., Wang X. B. Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement. 1994;31:113–123.

Walker C. M. What’s the DIF? Why differential item functioning analyses are an important part of instrument development and validation. Journal of Psychoeducational Assessment. 2011;29:364–376.

Walker C. M., Beretvas S. N. An empirical investigation demonstrating the multidimensional DIF paradigm: A cognitive explanation for DIF. Journal of Educational Measurement. 2001;38:147–163.

Wright C. D., Eddy S. L., Wenderoth M. P., Abshire E., Blankenbiller M., Brownell S. E. Cognitive difficulty and format of exams predicts gender and socioeconomic gaps in exam performance of students in introductory biology courses. CBE—Life Sciences Education. 2016;15:ar23. PubMed PMC

Wu M. L., Adams R. J., Wilson M. R. ConQuest [computer software] Camberwell, Victoria: Australian Council for Educational Research; 1998.

Zenisky A. L., Hambleton R. K., Robin F. DIF detection and interpretation in large-scale science assessments: informing item writing practices. Educational Measurement. 2004;9:61–68.

Zieky M. Practical questions in the use of DIF statistics in test development. In: Holland P. W., Wainer H., editors. Differential item functioning. Hillsdale, NJ: Erlbaum; 1993. pp. 337–347.

Zieky M. A DIF primer. Princeton, NJ: Educational Testing Service; 2003. Retrieved January 24, 2016, from www.ets.org/s/praxis/pdf/dif_primer.pdf.

Zumbo B. D. A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa ON: Directorate of Human Resources Research and Evaluation, Department of National Defense; 1999. http://faculty.educ.ubc.ca/zumbo/DIF/handbook.pdf.

Zumbo B. D. Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly. 2007;4:223–233.
