Aligning Keywords from Long-Form Prose to Controlled Vocabulary
HIVE-4-MAT is a linked-data automatic indexing application for vocabularies related to materials science. Over the past few months, its keyword alignment algorithm has been refactored to make it faster, more accurate, and more flexible, with the added flexibility coming at some expense of precision. This presentation reports on the lessons learned during that refactoring. Because HIVE-4-MAT has a fairly broad scope, it is a good use case for analyzing a complete keyword alignment pipeline: scraping raw article text, extracting keywords, and matching and aligning those keywords to a controlled vocabulary. The presentation will touch on common pitfalls of web scraping, different strategies for preparing raw text for keyword extraction, the difference in goals between keyword extraction and keyword alignment, and the potential benefits and drawbacks of using string distance in keyword alignment algorithms.
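As a rough illustration of the last point, the sketch below aligns an extracted keyword to a controlled vocabulary term using a string similarity score. It is not HIVE-4-MAT's algorithm: the abstract does not say which distance measure the application uses, so the similarity ratio from Python's standard `difflib` module stands in for one here. The function names (`normalize`, `similarity`, `align`), the sample vocabulary, and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(term: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse runs of whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized terms (1.0 means identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def align(keyword: str, vocabulary: list[str], threshold: float = 0.85):
    """Return (term, score) for the closest vocabulary term, or None.

    The threshold is an assumed value. Lowering it accepts more spelling
    variants (more flexible) but also more wrong matches (less precise).
    """
    best_term, best_score = None, 0.0
    for term in vocabulary:
        score = similarity(keyword, term)
        if score > best_score:
            best_term, best_score = term, score
    if best_term is not None and best_score >= threshold:
        return best_term, best_score
    return None


# Hypothetical controlled vocabulary, for illustration only.
vocabulary = ["carbon nanotubes", "graphene", "thin films"]

print(align("Carbon nano-tubes", vocabulary))  # close variant of a vocabulary term
print(align("polymer", vocabulary))            # no sufficiently close term
```

The threshold shows the trade-off in miniature: a permissive setting tolerates hyphenation and spelling variants in scraped text, while a strict one avoids aligning a keyword to a term that only looks similar.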