External validation of a commercially available deep learning algorithm for fracture detection in children

Published in

March 2020

Authors

Michel Dupuis, Léo Delbos, Raphael Veil, Catherine Adamsbaum

Objective

The purpose of this study was to externally validate a deep learning algorithm for fracture assessment (Rayvolve®) using digital radiographs from a real-life cohort of children presenting routinely to the emergency room.

Methods

This retrospective study was conducted on 2634 radiography sets (5865 images) from 2549 children (1459 boys, 1090 girls; mean age, 8.5 ± 4.5 [SD] years; age range: 0–17 years) referred by the pediatric emergency room for trauma. For each set, the presence of one or more fractures, the number of fractures, and their location were recorded, both as reported by the senior radiologists and as reported by the algorithm. Using the senior radiologists' diagnosis as the standard of reference, the diagnostic performance of the deep learning algorithm (Rayvolve®) was calculated using three different approaches: a detection approach (presence/absence of a fracture as a binary variable), an enumeration approach (exact number of fractures detected) and a localization approach (whether the detected fractures were correctly localized). Subgroup analyses were performed according to the presence or absence of a cast, age category (0–4 vs. 5–18 years) and anatomical region.
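As an illustration of the detection approach, the sketch below computes sensitivity, specificity, accuracy and negative predictive value from a 2×2 table, taking the senior radiologist reading as the reference. The counts are hypothetical, the function names are invented for this example, and the Wilson score interval is an assumption: the abstract does not state which confidence interval method was used.

```python
from math import sqrt


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 gives an approximate 95% interval. The choice of the Wilson
    method is an assumption; the study does not name its CI method.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


def diagnostic_performance(tp, fp, fn, tn):
    """Point estimates and intervals from a 2x2 table.

    tp: fracture per reference, flagged by the algorithm
    fn: fracture per reference, missed by the algorithm
    fp: no fracture per reference, flagged by the algorithm
    tn: no fracture per reference, not flagged by the algorithm
    """
    pairs = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "accuracy": (tp + tn, tp + fp + fn + tn),
        "npv": (tn, tn + fn),
    }
    return {
        name: {"estimate": k / n, "ci": wilson_ci(k, n)}
        for name, (k, n) in pairs.items()
    }


# Hypothetical counts for illustration only; these are not the study's data.
example = diagnostic_performance(tp=90, fp=20, fn=10, tn=180)
for name, result in example.items():
    low, high = result["ci"]
    print(f"{name}: {result['estimate']:.3f} (95% CI: {low:.3f}-{high:.3f})")
```

The same function could be applied per subgroup (for example, sets with and without a cast) by building one 2×2 table for each subgroup.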

Results

For the detection approach, the deep learning algorithm yielded 95.7% sensitivity (95% CI: 94.0–96.9), 91.2% specificity (95% CI: 89.8–92.5) and 92.6% accuracy (95% CI: 91.5–93.6). For both the enumeration and localization approaches, it yielded 94.1% sensitivity (95% CI: 92.1–95.6), 88.8% specificity (95% CI: 87.3–90.2) and 90.4% accuracy (95% CI: 89.2–91.5). In the age-related subgroup analyses, the algorithm yielded greater sensitivity and negative predictive value in the 5–18-years age group than in the 0–4-years age group for the detection approach (P < 0.001 and P = 0.002) and for the enumeration and localization approaches (P = 0.012 and P = 0.028). The high negative predictive value was robust, persisting in all of the subgroup analyses except for patients with casts (P = 0.001 for the detection approach and P < 0.001 for the enumeration and localization approaches).

Conclusion

The Rayvolve® deep learning algorithm is very reliable for detecting fractures in children, especially in those older than 4 years and without a cast.