January 25, 2018
How do you make a high volume of legacy data, in hard copy or PDF, accessible to modern digital media, quickly and without errors? It comes down to data conversion done right. A good data conversion process can turn Word documents, PDFs, and hard copy into XML and HTML5 formats that are widely accessible across eBook readers and other digital and mobile devices.
Yet a fully automated data conversion process is seldom the right way, because content needs human judgment to place it in the right context. A semi-automated conversion project is the best bet: it weeds out inconsistencies, accelerates turnaround, mitigates risks and errors, and still retains the human context.
Here are a few best practices for semi-automated conversion of high-volume legacy materials.
- Content inconsistencies
Source and legacy material typically spans years or decades, and in the case of research or conference papers, even centuries, all locked in unstructured formats that vary from one era and publisher to the next.
- Undefined tagging
There are multiple ways to mark up the same content, and you'll need to decide which is best. Document Type Definitions (DTDs), which define document structure only in general terms, contain many optional tags, and some tags can be used in multiple structures, leaving room for interpretation.
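To illustrate why tagging must be standardized up front, here is a minimal sketch: two well-formed XML fragments that mark up the same author line in different structures. The element names are illustrative, not drawn from any specific DTD; a permissive DTD may accept both, so the project has to pick one convention.

```python
import xml.etree.ElementTree as ET

# Two valid ways to mark up the same author information.
# A loose DTD could allow either, which creates inconsistent output
# unless the conversion project standardizes on one structure.
variant_a = ET.fromstring("<author><name>A. Smith</name></author>")
variant_b = ET.fromstring(
    "<author><surname>Smith</surname><given-names>A.</given-names></author>"
)

def author_text(elem):
    """Concatenate all text content, ignoring structure."""
    return "".join(elem.itertext())
```

Both variants carry the same text, yet downstream tools (search, display, citation extraction) see different structures, which is exactly the inconsistency a documented tagging guideline prevents.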
- Text extraction process
First, the text must be extracted from the varied inputs using the appropriate tools. Tools in common use include Adobe, Jade, and Gemini.
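In practice the extraction step often starts with routing each source file to the right tool by input type. A minimal sketch of such a dispatcher is below; the extractor names and the extension-to-tool mapping are hypothetical placeholders, since the actual tool choice depends on the pipeline.

```python
from pathlib import Path

# Hypothetical mapping from input format to extraction tool.
# Real pipelines would map to the tools actually licensed and deployed.
EXTRACTORS = {
    ".pdf": "pdf_extractor",
    ".doc": "word_extractor",
    ".docx": "word_extractor",
    ".tif": "ocr_extractor",
}

def pick_extractor(path):
    """Return the extraction tool name for a given source file."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"No extractor registered for {suffix!r}")
    return EXTRACTORS[suffix]
```

Centralizing the routing like this keeps the semi-automated workflow auditable: unsupported formats fail loudly instead of producing silently bad text.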
- Conversion process
Conversion tools that support various generic DTDs can handle the XML conversion, and subject matter experts then enhance the XML by improving its structure. The goal is always to deliver consistent, high-quality results.
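The automated half of this step can be pictured as a toy converter: extracted plain text goes in, structured XML comes out, and the expert's job is then to refine that structure. This sketch assumes a trivially simple convention (first line is the title, remaining lines are paragraphs) and an illustrative element set, not a real DTD.

```python
import xml.etree.ElementTree as ET

def text_to_xml(lines):
    """Toy conversion: first line becomes <title>, the rest become
    <p> elements inside <body>. Production converters target a full
    DTD and handle far richer structure (sections, tables, references)."""
    article = ET.Element("article")
    ET.SubElement(article, "title").text = lines[0]
    body = ET.SubElement(article, "body")
    for line in lines[1:]:
        ET.SubElement(body, "p").text = line
    return article
```

Even this toy version shows the division of labor: the tool produces a consistent skeleton mechanically, and the subject matter expert corrects what the heuristic got wrong.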
- XML parsing
The converted XML is parsed and validated against the DTD to confirm its structure, and the text is verified for proper display using the relevant XSLT or target platform (if any).
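As a minimal sketch of this verification gate, the check below confirms only that the converted XML is well formed, using the standard library. Full validation against a DTD would require a validating parser (for example, the third-party lxml library's `etree.DTD`), which is assumed here but not used.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    """Return True if the converted XML parses cleanly.
    This catches mismatched or unclosed tags before the file moves
    on to DTD validation and XSLT-based display checks."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False
```

Running a cheap well-formedness pass first means the more expensive DTD and display checks only ever see parseable files.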