• This record comes from PubMed

MuTr: Multi-Stage Transformer for Hand Pose Estimation from Full-Scene Depth Image

2023 Jun 12; 23 (12). [epub] 20230612

Status PubMed-not-MEDLINE
Language English
Country Switzerland
Media electronic

Document type Journal Article

Grant support
SGS-2022-017 University of West Bohemia
CZ.02.1.01/0.0/0.0/15_003/0000466 European Regional Development Fund

This work presents DePOTR, a novel transformer-based method for hand pose estimation. We test DePOTR on four benchmark datasets, where it outperforms other transformer-based methods while achieving results on par with other state-of-the-art methods. To further demonstrate the strength of DePOTR, we propose MuTr, a novel multi-stage approach that estimates hand pose from a full-scene depth image. MuTr removes the need for two different models in the hand pose estimation pipeline (one for hand localization and one for pose estimation) while maintaining promising results. To the best of our knowledge, this is the first successful attempt to use the same model architecture in both the standard and the full-scene image setup while achieving competitive results in each. On the NYU dataset, DePOTR and MuTr reach precision equal to 7.85 mm and 8.71 mm, respectively.
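The NYU figures above are distances in millimetres. The abstract calls the quantity "precision" and does not define it. Assuming it refers to the mean per-joint 3D Euclidean error commonly reported on hand pose benchmarks, a minimal sketch of such a metric might look like the following. The function name and the sample coordinates are hypothetical and are not taken from the paper.

```python
import math


def mean_joint_error_mm(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 3D joints.

    Both arguments are sequences of (x, y, z) coordinates in millimetres,
    listed in the same joint order. This is an assumed definition of the
    reported metric, not one confirmed by the abstract.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    # math.dist returns the Euclidean distance between two points.
    distances = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(distances) / len(distances)


# Hypothetical two-joint example: per-joint errors are 5 mm and 12 mm.
pred = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
gt = [(3.0, 4.0, 0.0), (10.0, 0.0, 12.0)]
error = mean_joint_error_mm(pred, gt)
print(f"{error:.2f} mm")
```

Under this reading, a lower value means a more accurate estimate, so 7.85 mm for DePOTR would be a smaller average error than 8.71 mm for MuTr.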
