Assessing quality of selection procedures: Lower bound of false positive rate as a function of inter-rater reliability

2024 Nov; 77(3): 651-671. [epub] 2024 Apr 15

Language: English. Country: Great Britain, England. Medium: print-electronic

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid38623032

Grant support
Grantová Agentura České Republiky
European Union

Inter-rater reliability (IRR) is one of the commonly used tools for assessing the quality of ratings from multiple raters. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome; the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. We outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a simple way of approximating the probability of correctly selecting the best applicants, which allows us to compute error probabilities of the selection procedure (i.e., false positive and false negative rate) or their lower bounds. We draw connections between IRR and binary classification metrics, showing that the binary classification metrics depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures. We also discuss possible uses of the explored connections in other contexts, such as educational testing, psychological assessment, and health-related measurement, and implement the computations in the R package IRR2FPR.
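The dependence of the error rates on the IRR coefficient and the proportion of selected applicants described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' IRR2FPR implementation: it assumes that the observed composite score and the true score are bivariate standard normal with correlation equal to the square root of the reliability, and that the top proportion of applicants by observed score is selected; the function name `selection_metrics` is invented for this sketch.

```python
# Hedged sketch (not the IRR2FPR package code): approximate false positive
# and false negative rates of a selection procedure from IRR alone, under
# a bivariate-normal model linking observed and true applicant scores.
from scipy.stats import norm, multivariate_normal

def selection_metrics(irr, prop_selected):
    """Approximate FPR and FNR given IRR and the proportion selected."""
    r = irr ** 0.5                   # correlation of observed and true scores
    q = norm.ppf(1 - prop_selected)  # common cutoff on the standard normal scale
    # P(observed > q and true > q); by symmetry equals P(X <= -q, T <= -q)
    tp = multivariate_normal(mean=[0, 0], cov=[[1, r], [r, 1]]).cdf([-q, -q])
    fp = prop_selected - tp          # selected but not among the truly best
    fn = prop_selected - tp          # truly best but not selected (symmetric)
    return {
        "FPR": fp / (1 - prop_selected),  # false positive rate
        "FNR": fn / prop_selected,        # false negative rate
    }
```

For example, with IRR = 0 the selection is uninformative, so the false positive rate under this model equals the proportion selected, while higher IRR lowers both error rates.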

