JAMIA: SAS-based natural language processing tools show potential for cancer research
Due to a substantial delay between the time cancer diagnoses are made and their capture by cancer registries, cancer researchers are currently forced to rely on chart review and medical claims data to identify primary and recurrent cancers. The proliferation of EHRs, however, creates the potential for timely and complete identification of cancers for clinical research.
To put that potential to the test, Kaiser Permanente Southern California (KPSC) researchers built a SAS-based coding, extraction and nomenclature tool (SCENT) to identify cancer diagnoses in the electronic pathology reports of 400 breast and 400 prostate cancer patients treated by the integrated health network between 2000 and 2007. A total of 915 pathology reports were included in the study and were also manually examined by trained abstractors. In the reports, SCENT recognized 51 of 54 new primary and 60 of 61 recurrent cancer cases, and produced only three false positives in 792 true benign cases.
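The counts above can be turned into proportions with simple arithmetic. The short sketch below does that; the variable names are illustrative, and the percentages are derived here from the reported counts rather than quoted from the study.

```python
# Counts as reported in the article.
primary_found, primary_total = 51, 54
recurrent_found, recurrent_total = 60, 61
false_positives, benign_total = 3, 792

# Sensitivity: share of true cancer cases that the tool recognized.
primary_sensitivity = primary_found / primary_total
recurrent_sensitivity = recurrent_found / recurrent_total

# Specificity: share of true benign cases not flagged as cancer.
specificity = (benign_total - false_positives) / benign_total

print(f"New primary sensitivity: {primary_sensitivity:.1%}")   # 94.4%
print(f"Recurrent sensitivity:   {recurrent_sensitivity:.1%}") # 98.4%
print(f"Specificity:             {specificity:.1%}")           # 99.6%
```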
Following a set of hierarchical classification rules, SCENT examined processed electronic text using a dictionary of approximately 1,000 terms to identify clinical concepts associated with cancer and report them in SNOMED format.
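A dictionary lookup followed by hierarchical rules might look roughly like the sketch below. It is written in Python rather than SAS and is not SCENT's actual logic: the five-term dictionary, the placeholder concept codes (which are not real SNOMED identifiers), the negation cues and the rule order are all assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionary: term -> (placeholder code, category).
# The codes are placeholders, NOT real SNOMED identifiers.
DICTIONARY = {
    "invasive ductal carcinoma": ("C-0001", "malignant"),
    "adenocarcinoma": ("C-0002", "malignant"),
    "recurrent": ("C-0003", "recurrence"),
    "benign": ("C-0004", "benign"),
    "fibroadenoma": ("C-0005", "benign"),
}

# Assumed negation cues; a match preceded by one of these is ignored.
NEGATION_CUES = ("no evidence of", "negative for")


def extract_concepts(report):
    """Return (term, code, category) for each non-negated dictionary hit."""
    found = []
    for sentence in re.split(r"[.;\n]", report.lower()):
        sentence = re.sub(r"\s+", " ", sentence).strip()
        for term, (code, category) in DICTIONARY.items():
            match = re.search(r"\b" + re.escape(term) + r"\b", sentence)
            if match is None:
                continue
            prefix = sentence[:match.start()]
            if any(cue in prefix for cue in NEGATION_CUES):
                continue  # term is negated in this sentence
            found.append((term, code, category))
    return found


def classify(report):
    """Apply the assumed hierarchy: recurrent, then new primary, then benign."""
    categories = {category for _, _, category in extract_concepts(report)}
    if "malignant" in categories and "recurrence" in categories:
        return "recurrent cancer"
    if "malignant" in categories:
        return "new primary cancer"
    return "benign"


print(classify("Invasive ductal carcinoma, left breast."))
print(classify("Recurrent adenocarcinoma of the prostate."))
print(classify("Fibroadenoma. No evidence of adenocarcinoma."))
```

The rule order matters in this sketch: a report mentioning both a malignant term and a recurrence term is labeled recurrent before the new-primary rule is considered, and benign is the fallback when no non-negated malignant term is found.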
Based on their findings, researchers led by Justin A. Strauss, MA, research associate III at KPSC, believe that SAS-based NLP tools could be easily implemented in most clinical settings for the purpose of analyzing electronic text.
“The widespread adoption of SAS in clinical analysis and research settings ensures that SCENT is highly accessible,” Strauss et al. wrote. “Integration of SAS with relevant data systems has already been established in these settings, allowing electronic text to be readily extracted for analysis.”
“This functionality has the potential to provide significant value to clinical and epidemiological researchers, particularly when statistical NLP is infeasible due to resource or other constraints,” they concluded. “SCENT is proof of concept for SAS-based NLP applications that can be easily shared between institutions to support clinical and epidemiologic research.”
Strauss conceptualized and developed the NLP system described in the paper, led the validation study and drafted the manuscript. His colleague, Virginia P. Quinn, PhD, a KPSC research scientist, provided breast cancer research expertise, assisted with data acquisition, contributed to the validation study design, interpreted results and had input into the manuscript.