Assessing quality of selection procedures: Lower bound of false positive rate as a function of inter-rater reliability
Language: English Country: Great Britain, England Medium: print-electronic
Document type: journal articles
Grant support
Grantová Agentura České Republiky
European Union
PubMed: 38623032
DOI: 10.1111/bmsp.12343
- Keywords
- Type I error, Type II error, error rate, mixed‐effect models, rating,
- MeSH
- false positive reactions MeSH
- humans MeSH
- observer variation MeSH
- computer simulation MeSH
- peer review methods MeSH
- probability MeSH
- reproducibility of results MeSH
- statistical models * MeSH
- Check Tag
- humans MeSH
- Publication type
- journal articles MeSH
Inter-rater reliability (IRR) is one of the commonly used tools for assessing the quality of ratings from multiple raters. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome: the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. We outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a simple way of approximating the probability of correctly selecting the best applicants, which allows us to compute the error probabilities of the selection procedure (i.e., the false positive and false negative rates) or their lower bounds. We draw connections between IRR and the binary classification metrics, showing that the binary classification metrics depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures. We also discuss possible uses of the explored connections in other contexts, such as educational testing, psychological assessment, and health-related measurement, and implement the computations in the R package IRR2FPR.
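The sketch below illustrates the kind of computation the abstract describes, not the IRR2FPR package API. It assumes a bivariate-normal model in which the latent true quality and the observed composite rating correlate at sqrt(IRR), so that the classification metrics of the selection depend only on the IRR coefficient and the proportion of selected applicants; the function name and the exact metric definitions are illustrative and may differ from those used in the paper.

```r
## Illustrative sketch (hypothetical helper, not the IRR2FPR API):
## classification metrics of a selection procedure from IRR and the
## proportion of selected applicants, under a bivariate-normal model.
library(mvtnorm)

selection_metrics <- function(irr, prop_selected) {
  rho   <- sqrt(irr)                       # correlation of observed composite with true quality
  cut   <- qnorm(1 - prop_selected)        # same upper-tail cut-off on both scales
  sigma <- matrix(c(1, rho, rho, 1), 2, 2) # bivariate-normal correlation matrix

  # P(applicant is selected AND is truly among the best)
  p_tp <- pmvnorm(lower = c(cut, cut), upper = c(Inf, Inf), sigma = sigma)[1]

  c(
    # probability that a selected applicant is truly among the best
    prob_correct_selection = p_tp / prop_selected,
    # selected although not truly among the best (one common FPR definition)
    false_positive_rate    = (prop_selected - p_tp) / (1 - prop_selected),
    # truly among the best but not selected
    false_negative_rate    = (prop_selected - p_tp) / prop_selected
  )
}

# Example: IRR of .60 and 20% of applicants selected
selection_metrics(irr = 0.60, prop_selected = 0.20)
```

Because both inputs are just the IRR coefficient and the selection proportion, such a function can be applied directly to published IRR estimates of grant peer review panels to obtain approximate (lower-bound) error rates of the resulting funding decisions.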
Department of Psychological Methods University of Amsterdam Amsterdam The Netherlands
Faculty of Education Charles University Prague Czech Republic