INTRODUCTION
                
                The previous blog post demonstrated pattern recognition in aviation safety reports using clustering (an unsupervised machine learning method)
                and Pearson’s correlation coefficients calculated between selected report metadata attributes.
                
                This case study shows how to detect recurring incidents by using Natural Language Processing (NLP) methods and machine learning (ML) algorithms to
                calculate a similarity score between the occurrence narrative of any safety report (or any free-form text) and the narratives of all other safety reports in the database.
                
                The goal was to provide a production-ready, highly responsive, fully automatic solution that can be easily used by safety professionals without any manual intervention.
                
                
METHOD
                
                Database
                
                The solution is compatible with any safety report based on the ECCAIRS format and the ICAO ADREP taxonomy.
                For this demonstration, an existing airline GALIOT SMS report database is used to showcase the solution in action.
                
                
Text preprocessing
                
                Linguistic processing of the safety narratives is divided into two stages: aviation domain-specific methods and standard NLP methods.
                In the first stage, a custom-developed aviation dictionary is used to group commonly used aviation acronyms and replace them with a single term, which prevents semantic confusion.
                In the second stage, standard text processing methods such as stop-word removal, punctuation removal, stemming, and lemmatization are applied to prepare the text for vectorization.
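
The two stages could be sketched roughly as follows. This is only an illustration in plain Python, not the GALIOT implementation: the acronym map, the stop-word list, and the function name `preprocess` are all made up for the example, and stemming and lemmatization are left out because they normally come from an NLP library.

```python
import re
from typing import List

# Hypothetical stand-in for the custom aviation dictionary:
# several spellings of one concept collapse into a single term.
ACRONYM_MAP = {
    "a/c": "aircraft",
    "acft": "aircraft",
    "rwy": "runway",
    "t/o": "takeoff",
}

# Tiny illustrative stop-word list (a real one would be much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "was", "to", "during", "in", "on", "at"}


def preprocess(narrative: str) -> List[str]:
    """Return the cleaned tokens of one occurrence narrative."""
    tokens = []
    for raw in narrative.lower().split():
        # Stage 1: trim surrounding punctuation, then normalize acronyms.
        # Inner slashes are kept so that "a/c" can still be looked up.
        token = raw.strip(".,;:!?()[]\"'")
        token = ACRONYM_MAP.get(token, token)
        # Stage 2: drop remaining punctuation and stop words.
        token = re.sub(r"[^a-z0-9]", "", token)
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens
```

Normalizing acronyms before the generic punctuation removal matters in this sketch: stripping the slash from "a/c" first would leave "ac", which the dictionary could no longer recognize.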
                
                
Feature extraction and similarity calculation
                
                Report vectorization (or feature extraction) transforms the occurrence narratives into a two-dimensional numerical array
                in which rows are reports and columns are text features. In this case, vectorization is performed with TF-IDF (term frequency-inverse document frequency),
                a statistical measure that evaluates how relevant a word is to a report within the collection of reports in the database.
                
                After the transformation, each report is represented as an n-dimensional vector, and the similarity between two reports is calculated as
                the cosine similarity of their vectors.
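
To make the idea concrete, here is a minimal from-scratch sketch of TF-IDF weighting and cosine similarity. It is not the production code: it assumes the narratives are already tokenized, uses one common TF-IDF variant (relative term frequency multiplied by `log(N / df) + 1`), and stores each row of the report-by-feature array as a sparse dictionary. The function names are chosen for the example only.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Turn tokenized reports into sparse TF-IDF vectors (term -> weight)."""
    n_docs = len(docs)
    # Document frequency: in how many reports does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors, 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because all weights are non-negative, the score falls between 0 (no shared terms) and 1 (same term proportions), which makes it easy to use as a minimum-similarity threshold.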
                
                
IN ACTION
                
                Similarity and filter criteria
                
                The only task required of the safety officer is to specify:
                
                a)  Similarity criteria
                (the text to match, provided either by selecting one of the reports from the database or by entering free-form text)
                
                b) Filter criteria
                (aircraft from the fleet, aircraft make/model, last departure point, planned destination, and minimum similarity)
                
                through the easy-to-use interface below.
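
Behind such an interface, the filter criteria listed above could be applied to the scored reports along these lines. This is a hypothetical sketch: the class names, field names, and the rule that an empty filter matches everything are assumptions made for the example, not the actual GALIOT data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredReport:
    # Hypothetical fields mirroring the filter criteria in the text.
    report_id: str
    registration: str
    aircraft_model: str
    departure: str
    destination: str
    similarity: float


@dataclass
class FilterCriteria:
    # None means "do not filter on this attribute".
    registration: Optional[str] = None
    aircraft_model: Optional[str] = None
    departure: Optional[str] = None
    destination: Optional[str] = None
    min_similarity: float = 0.0


def apply_filters(reports: List[ScoredReport], criteria: FilterCriteria) -> List[ScoredReport]:
    """Keep only the reports that satisfy every filter that is set."""
    def matches(report: ScoredReport) -> bool:
        return (
            (criteria.registration is None or report.registration == criteria.registration)
            and (criteria.aircraft_model is None or report.aircraft_model == criteria.aircraft_model)
            and (criteria.departure is None or report.departure == criteria.departure)
            and (criteria.destination is None or report.destination == criteria.destination)
            and report.similarity >= criteria.min_similarity
        )
    return [report for report in reports if matches(report)]
```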
                
                
                Similarity result
                
                Based on the specified criteria, the safety report similarity is calculated and the results are presented in the Time Plot,
                with similar safety reports arranged chronologically on the x-axis and their calculated similarity on the y-axis.
                
                A built-in tooltip gives a quick overview of each report shown in the scatter plot.
                
                
                
                In addition, the 10 most similar safety reports overall are listed in a separate table for subsequent drill-down analysis.
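
Selecting the most similar reports for such a table is a simple ranking step. A minimal sketch, assuming the scores are available as `(report_id, similarity)` pairs (a made-up representation for this example):

```python
import heapq
from typing import List, Tuple


def top_similar(scores: List[Tuple[str, float]], n: int = 10) -> List[Tuple[str, float]]:
    """Return the n highest-scoring (report_id, similarity) pairs, best first."""
    return heapq.nlargest(n, scores, key=lambda item: item[1])
```

`heapq.nlargest` avoids sorting the whole list, which helps when the database holds far more reports than the handful shown in the table.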
                
                
                
                CONCLUSION
                
                This case study demonstrates how the simple and effective GALIOT AI report similarity score calculation can be used to
                detect recurring safety incidents in an aviation safety report database.
                
                
                    
                    
                
                Marino Tudor
                Founder & CEO
                
                Galiot Aero Ltd
                
                April, 2021