Computer Science Master of Science Thesis Defense

May 3, 2023 3:30pm - 4:30pm

Location:
In Person - Tepper Building 1403

Speaker:
JUN (JOHN) LUO, Master's Student, Computer Science Department, Carnegie Mellon University

Using Computer Vision and Machine Learning to Unlock Historical Data

Historical data, especially those recorded in tables and forms, have significant value for contemporary research and industry applications. However, such data are rarely digitized or available in readily usable formats such as Excel sheets and database tables. Using historical property appraisals as a case study, we demonstrate how machine learning and computer vision methods can help address this data gap in a cost-effective way. The earliest standardized property appraisal records in the United States were typically handwritten on physical cards. Using scanned cards from Ohio in the 1930s, we test approaches to digitize a property's earliest appraised value. We find that image processing and Optical Character Recognition (OCR) deep learning models can retrieve this value accurately, with a Mean Absolute Percentage Error (MAPE) of 14.72%. For cases where OCR cannot be applied, such as when scanned documents are not available, our machine learning model can use contemporary data to estimate this value with a reduced accuracy of 17.48% MAPE. Both methods present a substantial saving over manually digitizing the same data, with OCR achieving a cost reduction of 81% and the machine learning model achieving a cost reduction of 89%.
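For readers unfamiliar with the metric, MAPE is the average of the absolute errors expressed as a fraction of the true value, reported as a percentage. The sketch below is a minimal illustration only: the function name `mape` and the sample values are hypothetical and are not taken from the thesis or its data.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned in percent.

    Assumes every value in `actual` is non-zero; MAPE is undefined otherwise.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up appraised values (in dollars) and made-up extracted values.
true_values = [1000.0, 2000.0, 4000.0]
extracted_values = [900.0, 2200.0, 4000.0]

print(f"MAPE: {mape(true_values, extracted_values):.2f}%")
```

Because each error is scaled by the true value, a $100 mistake on a $1,000 appraisal counts the same as a $200 mistake on a $2,000 appraisal.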

Additional Information

Thesis Committee:

Matt Gormley (Chair)
Rayid Ghani
