Bioacoustic fundamental frequency estimation: a cross-species dataset and deep learning baseline

2025;34(4):419-446. Epub 2025-06-02.

Status: PubMed-not-MEDLINE. Language: English. Country: Great Britain, England. Medium: print-electronic.

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid40881464

Grant support
P50 MH100023 NIMH NIH HHS - United States
R01 DC008343 NIDCD NIH HHS - United States
R01 MH115831 NIMH NIH HHS - United States

The fundamental frequency (F0) is a key parameter for characterising structures in vertebrate vocalisations, for instance when defining vocal repertoires and their variations at different biological scales (e.g. population dialects, individual signatures). However, the task is too laborious to perform manually, and its automation is complex. Despite significant advances in automatic F0 estimation for speech and music, progress in bioacoustics has been limited. To address this gap, we compile and publish a benchmark dataset of over 250,000 calls from 14 taxa, each paired with ground-truth F0 values. These vocalisations range from infrasound to ultrasound and from high to low harmonicity, and some include non-linear phenomena. Testing different algorithms on these signals, we demonstrate the potential of neural networks for F0 estimation, even for taxa not seen in training or when trained without labels. To indicate how amenable a given signal is to automatic analysis, we also propose spectral measurements of F0 quality that correlate well with estimation performance. While current results are not yet satisfactory for all studied taxa, they suggest that deep learning could provide a more generic and reliable bioacoustic F0 tracker, helping the community analyse vocalisations via their F0 contours.
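As an illustration of the task described above, the minimal sketch below (not the authors' pipeline; the file path, frequency bounds, and 50-cent tolerance are illustrative assumptions) extracts an F0 contour from a recorded call with the off-the-shelf pYIN tracker in librosa and scores it against a ground-truth contour using raw pitch accuracy, a standard F0-evaluation metric.

    # Minimal sketch (not the authors' pipeline): estimate an F0 contour for a
    # recorded call with the pYIN tracker shipped in librosa, then score it
    # against a ground-truth contour with raw pitch accuracy. The frequency
    # bounds and the 50-cent tolerance are illustrative and must be tuned per taxon.
    import numpy as np
    import librosa

    def estimate_f0(wav_path, fmin=60.0, fmax=2000.0):
        """Return frame times (s) and F0 estimates (Hz; NaN where unvoiced)."""
        y, sr = librosa.load(wav_path, sr=None)  # keep the native sample rate
        f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
        times = librosa.times_like(f0, sr=sr)
        return times, f0

    def raw_pitch_accuracy(f0_true, f0_pred, cents_tol=50.0):
        """Fraction of reference-voiced frames whose estimate is voiced and
        within +/- cents_tol of the reference (NaN marks unvoiced frames)."""
        f0_true = np.asarray(f0_true, dtype=float)
        f0_pred = np.asarray(f0_pred, dtype=float)
        ref_voiced = ~np.isnan(f0_true)
        if not ref_voiced.any():
            return np.nan
        hit = ref_voiced & ~np.isnan(f0_pred)  # frames voiced in both contours
        cents = np.full(f0_true.shape, np.inf)
        cents[hit] = 1200.0 * np.abs(np.log2(f0_pred[hit] / f0_true[hit]))
        return float(np.mean(cents[ref_voiced] <= cents_tol))

In practice the search bounds must be adapted to the taxon (far higher for ultrasonic rodent calls than for infrasonic elephant rumbles), which is one reason a species-agnostic F0 tracker is difficult to build.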

Biology Department University of Konstanz

Centre de Recherche sur la Biodiversité et l'Environnement Université Paul Sabatier 31062 Toulouse Cedex 9 France

Czech Academy of Sciences Institute of Vertebrate Biology Brno Czech Republic

Department for the Ecology of Animal Societies Max Planck Institute of Animal Behavior Konstanz Germany

Department of Biology and Emory National Primate Research Center Emory University

Department of Computational Mathematics Science and Engineering Michigan State University East Lansing MI USA

Department of Computer Science Oxford University

Department of Computer Science San Diego State University

Department of Ecoscience Aarhus University Frederiksborgvej 399 4000 Roskilde Denmark

Department of Integrative Biology Michigan State University East Lansing MI USA

Department of Psychology University of Warwick

Department of Zoology Faculty of Science University of South Bohemia České Budějovice Czech Republic

Ecology Evolution and Behavior Program Michigan State University East Lansing MI USA

Escuela de Biología and Centro de Investigación en Neurociencias Universidad de Costa Rica

Facultad de Ciencias Universidad Autónoma de Madrid 28049 Madrid Spain

Faculty of Environmental Sciences Czech University of Life Sciences Prague Prague Czech Republic

Forestry and Game Management Research Institute v. v. i. Jíloviště Czech Republic

Human Biology Program Michigan State University

Institute of Biology University of Neuchâtel Neuchâtel Switzerland

National Museum of Natural Sciences Spanish National Research Council Madrid Spain

School of Life and Environmental Sciences University of Lincoln Lincoln United Kingdom

Speech Music and Hearing KTH Royal Institute of Technology

Université de Toulon Aix Marseille Univ CNRS LIS Toulon France

Wildlife Conservation Research Unit Recanati Kaplan Centre Department of Biology University of Oxford Oxford UK

Zoology Department Cambridge University

