by William Underwood, Sandra Laib, and Sheila Isbell, Georgia Tech Research Institute, Georgia Institute of Technology
A method for automatic document type recognition and metadata extraction has been implemented and successfully tested. It builds on a method for automatically annotating semantic categories such as persons’ names, dates, and postal addresses, and extends it by: (1) identifying about 100 types of intellectual elements of documents, (2) parsing these elements using context-free grammars that define the documentary form of document types, and (3) interpreting the pragmatics of the form of the document to identify some or all of the following metadata: the chronological date, author(s), addressee(s), and topic. This metadata can be used for indexing and searching collections of records by person, organization, and location names, topics, dates, authors’ and addressees’ names, and document types. It can also be used for automatically describing items, file units, and record series.
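The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The four element types, the single "Memorandum" grammar, the regular expressions, and every function name below are hypothetical stand-ins chosen for illustration, and a small backtracking recognizer stands in for whatever parser the report actually uses.

```python
import re
from typing import Dict, Iterator, List, Optional, Tuple

# Step 1 (assumed): recognize a few "intellectual elements" line by line.
# The report mentions about 100 element types; these four are illustrative only.
ELEMENT_PATTERNS = [
    ("DateLine", re.compile(r"^(?:DATE:\s*)?([A-Z][a-z]+ \d{1,2}, \d{4})$")),
    ("ToLine", re.compile(r"^TO:\s*(.+)$", re.IGNORECASE)),
    ("FromLine", re.compile(r"^FROM:\s*(.+)$", re.IGNORECASE)),
    ("SubjectLine", re.compile(r"^SUBJECT:\s*(.+)$", re.IGNORECASE)),
]

# Step 2 (assumed): a toy context-free grammar for one documentary form.
# Symbols that are not keys of GRAMMAR are terminals (element types).
GRAMMAR: Dict[str, List[List[str]]] = {
    "Memorandum": [["Heading", "Body"]],
    "Heading": [["DateLine", "ToLine", "FromLine", "SubjectLine"]],
    "Body": [["BodyLine", "Body"], ["BodyLine"]],
}
DOCUMENT_TYPES = ["Memorandum"]

# Step 3 (assumed): the role each element plays in the form yields metadata.
ROLE = {"DateLine": "date", "ToLine": "addressee",
        "FromLine": "author", "SubjectLine": "topic"}


def tag_elements(text: str) -> List[Tuple[str, str]]:
    """Label each non-empty line with an element type and its captured value."""
    elements = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for name, pattern in ELEMENT_PATTERNS:
            match = pattern.match(line)
            if match:
                elements.append((name, match.group(1)))
                break
        else:
            elements.append(("BodyLine", line))
    return elements


def derive(symbol: str, types: List[str], pos: int) -> Iterator[int]:
    """Yield every position at which `symbol` can end when started at `pos`."""
    if symbol not in GRAMMAR:  # terminal: must match the next element type
        if pos < len(types) and types[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        ends = [pos]
        for part in production:
            ends = [e2 for e in ends for e2 in derive(part, types, e)]
        yield from ends


def recognize(types: List[str]) -> Optional[str]:
    """Return the first document type whose grammar covers the whole sequence."""
    for doc_type in DOCUMENT_TYPES:
        if len(types) in derive(doc_type, types, 0):
            return doc_type
    return None


def extract_metadata(text: str) -> Optional[Dict[str, str]]:
    """Tag elements, recognize the document type, then read off the metadata."""
    elements = tag_elements(text)
    doc_type = recognize([name for name, _ in elements])
    if doc_type is None:
        return None
    metadata = {"document_type": doc_type}
    for name, value in elements:
        if name in ROLE:
            metadata[ROLE[name]] = value
    return metadata


SAMPLE = """\
March 3, 1991
TO: Jane Doe
FROM: John Smith
SUBJECT: Budget review
Please see the attached figures.
"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

In this sketch a text whose element sequence does not fit the memorandum grammar returns `None`, which reflects the idea that metadata roles such as author and addressee are assigned only once the documentary form has been recognized.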
Available from this URL: http://perpos.gtri.gatech.edu/publications/TR%2009-06.pdf