GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

2020 Nov 18;9(11).

Language: English. Country: United States. Medium: print.

Document type: journal article, grant-supported work

Persistent link: https://www.medvik.cz/link/pmid33205814

BACKGROUND: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances make it easy to generate datasets with hundreds of millions of single-cell data points and >40 parameters, originating from thousands of individual samples. Analyzing that much high-dimensional data places heavy demands on both the hardware and the software of high-performance computing resources. Current software tools often do not scale to datasets of this size; users are thus forced to downsample the data to manageable sizes, in turn losing accuracy and the ability to detect many underlying complex phenomena.

RESULTS: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. The implementation of GigaSOM.jl in the high-level and high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study.

CONCLUSIONS: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on the data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.


