Los Angeles,
09:58 AM

New AI Tool Mines Cancer Patients’ Pathology Data

Cedars-Sinai Investigators Facilitate Computer Access to Pathologists’ Notes in Patient Records, Paving the Way for Their Use in New Studies, Clinical Trials

Cedars-Sinai investigators have used artificial intelligence (AI) to help computers access some of the most important and difficult-to-mine information in cancer patients’ medical records: pathology reports. Their method, described in the peer-reviewed data science journal Patterns, could help physician-scientists who have obtained patient consent extract information from those patients’ pathology reports for research and clinical trial recruitment.

“Cancer is a complex disease, and rich information is contained in the notes that a pathologist makes when they review a patient’s cancer underneath the microscope,” said Nicholas Tatonetti, PhD, vice chair of Operations in the Department of Computational Biomedicine at Cedars-Sinai, associate director of Computational Oncology at Cedars-Sinai Cancer and senior author of the study. “But because these notes are in the form of scanned PDFs, the text they contain has been inaccessible to computers—until now.”

To create a machine-readable pathology dataset, Tatonetti and his team worked with The Cancer Genome Atlas, a publicly available collection of information from thousands of U.S. cancer patients who have given permission for investigators to examine their personal health records.

“The pathology reports in the atlas are scanned in at all angles and in different formats from each of the institutions that provided them,” Tatonetti said. “They’re messy and their scan quality is relatively poor—not unlike pathology forms you would find in patient records.”

Investigators used AI to clean up the scans so that optical character recognition software could turn them into machine-readable notes. When investigators compared these notes against the original reports, they found this method was highly accurate.

“By making the reports machine readable, we can train algorithms to extract information from them in response to investigators’ questions,” Tatonetti said. “This will help investigators identify and validate new disease markers, conduct research, and recruit patients for clinical trials.”

The resulting collection of text from the reports, now publicly available, includes data on almost 10,000 cancer patients. The data are provided in a format commonly used in machine learning so that computational biologists and computer scientists can readily use them, Tatonetti said. The method could also be used to extract pathology report data from other datasets.

“The true story of a patient’s condition, such as detailed information about their cancer and the effects of various therapies, is found in clinicians’ notes,” said Cedars-Sinai Cancer Director Dan Theodorescu, MD, PhD, the PHASE ONE Foundation Distinguished Chair and Director at the Samuel Oschin Comprehensive Cancer Institute. “Tools that help us mine this information further our efforts to conduct translational studies that bring the promise of precision medicine to each of our patients.”

Tatonetti and his team are now focused on training models to extract specific information—such as cancer staging—from the data.

“Our model can extract that information when it is present in the notes, but it can also accurately infer the stage when it is not explicitly stated,” Tatonetti said. “For instance, the pathologist might make a note about a secondary lesion or about evaluating a sample of a breast cancer from the liver. These notes don’t include the word metastatic, but they do imply it.”

The team is also working to apply its method to the Molecular Twin Precision Oncology Platform, a unique precision medicine and AI tool created at Cedars-Sinai that includes pathology reports and other data on the majority of Cedars-Sinai’s cancer patients. Team members are also developing tools to make other clinician notes from patient records machine readable, Tatonetti said.

“AI enhancements to optical character recognition are the key to extracting a wealth of data from some of the most clinically relevant portions of patient records,” said Jason Moore, PhD, chair of the Department of Computational Biomedicine at Cedars-Sinai. “This data will fuel new studies by researchers across specialties, including research clinicians, clinical trial investigators and investigators working to improve tools that allow computers to interpret clinical language.”

Funding: This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under grant number R35GM131905.
