Unsupervised Font Reconstruction Based on Token Co-occurrence

Michael P. Cutter, Joost van Beusekom, Faisal Shafait, Thomas Breuel

ACM Symposium on Document Engineering (DocEng-10), September 21-24, Manchester, United Kingdom, ACM, 2010
High quality conversions of scanned documents into PDF usually either rely on full OCR or token compression. This paper describes an approach intermediate between those two: it is based on token clustering, but additionally groups tokens into candidate fonts. Our approach has the potential of yielding OCR-like PDFs when the inputs are high quality and degrading to token based compression when the font analysis fails, while preserving full visual fidelity. Our approach is based on an unsupervised algorithm for grouping tokens into candidate fonts. The algorithm constructs a graph based on token proximity and derives token groups by partitioning this graph. In initial experiments on scanned 300 dpi pages containing multiple fonts, this technique reconstructs candidate fonts with 100% accuracy.
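
The graph construction and partitioning mentioned in the abstract can be illustrated with a minimal sketch. It assumes each token occurrence is given as a `(cluster_id, x, y)` tuple, links two token clusters when their occurrences lie within `max_dist` of each other at least `min_count` times, and partitions the resulting graph into connected components. The input format, the function names, both thresholds, and the connected-components step are illustrative assumptions, not the method used in the paper.

```python
from collections import defaultdict
from itertools import combinations
from math import hypot


def cooccurrence_graph(occurrences, max_dist=50.0, min_count=1):
    """Link token clusters whose occurrences appear close together on a page.

    occurrences: list of (cluster_id, x, y) tuples (hypothetical input format).
    Returns an adjacency dict mapping cluster_id -> set of neighbouring ids.
    """
    # Count how often each pair of distinct clusters occurs within max_dist.
    counts = defaultdict(int)
    for (a, ax, ay), (b, bx, by) in combinations(occurrences, 2):
        if a != b and hypot(ax - bx, ay - by) <= max_dist:
            counts[frozenset((a, b))] += 1

    # Keep only pairs that co-occur often enough (assumed threshold).
    graph = {cid: set() for cid, _, _ in occurrences}
    for pair, n in counts.items():
        if n >= min_count:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def partition(graph):
    """Split the graph into connected components (a stand-in partitioning step)."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(group)
    return groups


# Two text regions far apart on the page, each using its own pair of tokens.
occurrences = [("a1", 0, 0), ("b1", 10, 0), ("a2", 0, 500), ("b2", 12, 500)]
groups = partition(cooccurrence_graph(occurrences, max_dist=50.0))
print([sorted(g) for g in groups])  # [['a1', 'b1'], ['a2', 'b2']]
```

In this toy input, tokens that only ever appear near each other end up in the same candidate group, which is the intuition behind grouping by co-occurrence; a real page would need a more robust partitioning criterion than plain connectivity.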

BibTeX:

@inproceedings{cutter2010unsupervised,
       abstract = {High quality conversions of scanned documents into PDF
usually either rely on full OCR or token compression. This
paper describes an approach intermediate between those
two: it is based on token clustering, but additionally groups
tokens into candidate fonts. Our approach has the potential of yielding OCR-like PDFs when the inputs are high
quality and degrading to token based compression when the
font analysis fails, while preserving full visual fidelity. Our
approach is based on an unsupervised algorithm for grouping tokens into candidate fonts. The algorithm constructs a
graph based on token proximity and derives token groups by
partitioning this graph. In initial experiments on scanned
300 dpi pages containing multiple fonts, this technique reconstructs candidate fonts with 100% accuracy.},
       month = {9},
       year = {2010},
       title = {Unsupervised Font Reconstruction Based on Token Co-occurrence},
       booktitle = {ACM Symposium on Document Engineering (DocEng-10)},
       address = {Manchester, United Kingdom},
       publisher = {ACM},
       author = {Michael P. Cutter and Joost van Beusekom and Faisal Shafait and Thomas Breuel},
       url = {http://www.dfki.de/web/forschung/publikationen/renameFileForDownload?filename=Cutter-Font-Reconstruction-DocEng10.pdf&file_id=uploads_779}
}