The benefits of document digitization are substantial. Advanced document capture solutions help organizations reduce risk, save time, gain mobility, and improve productivity in office environments. Preserving fragile and historical documents is another reason to invest in digital imaging technology.
Bar-Ilan University (BIU) in Israel recently used document capture procedures, including optical character recognition (OCR), to digitize its historical documents. For this effort, known as the Responsa Project, the university needed a capture process that could convert text in a variety of Middle Eastern languages into machine-readable form, working from documents that were difficult to read due to wear.
The overall objective of the project was to digitize documents and apply intelligent document capture and management to develop a comprehensive, searchable online library. Documents needed to be scanned, digitized, and converted to text with OCR to support search.
A Phased Approach
The Responsa Project is a comprehensive, searchable digital Jewish library. It began at the Weizmann Institute in 1963 and is now managed by BIU. In the early 1970s, the U.S. National Endowment for the Humanities awarded a research grant to support the project, leading to its expansion. The library contains great works of Jewish wisdom, including the Bible and its principal commentaries, the Babylonian Talmud, the Jerusalem Talmud, Midrashim, Zohar, Rambam, and Shulchan Aruch, as well as a collection of questions and answers on matters of Jewish law.
To make the digitization process manageable, the Responsa Project was broken into phases. Approximately 28 subdirectories were created and organized by category. The first phase of the project focused on the Responsa literature database, which took months to develop and is continually updated. It consists of an account of questions and advice on various subjects within Jewish law, many of which were collected in books that needed to be available electronically.
The second phase of the project targeted a large body of halachic, historical, sociological, and economic data spanning 1,000 years. A special committee was established to set priorities and determine which pieces would be included in the database. Some of the material and ancient Hebrew scripture were difficult to read and digitize.
BIU decided to invest in advanced data extraction technology. Through research, the university found VERUS by NovoDynamics, which had been used for similar applications at other universities, including Project AMEEL, a joint effort by Yale and Stanford Universities to create a scholarly Web-based portal for studying the history, culture, and development of the Middle East. The St. Kliment Ohridsky University of Sofia, Bulgaria, also used the software to support a research project in Arabic language and linguistics.
The university chose the NovoDynamics OCR solution for its ability to meet the primary needs of the Responsa Project. VERUS is designed to provide superior accuracy, including recognition of Hebrew and Middle Eastern languages. It delivers high-quality text to integrated applications such as text retrieval and machine translation. Additionally, it can process up to 10,000 pages per month.
In an initial pilot to test the OCR capabilities, the university used VERUS to automatically clean test documents. The team was pleased with the improved image quality as well as the software’s accuracy and ease of use. During the trial, VERUS was quickly installed and running. It eliminated the need to manually presort pages by automatically detecting a page’s primary language. Additionally, the language detection capability proved to be more accurate than a visual inspection. The OCR software was able to recognize all the major font families in all supported languages. Using the software’s .NET and C++ programming interfaces, the university achieved rapid system integration into its text-oriented applications. By early fall 2010, BIU had fully implemented VERUS OCR technology.
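To make the idea of presorting by primary language concrete, here is a minimal Python sketch. It is not how VERUS works: the article does not describe the product's method, and VERUS detects language from page images. This sketch assumes the page text is already available as a Unicode string and simply counts letters by script, using the first word of each character's Unicode name. The function name `primary_script` is hypothetical, and a script is only a rough proxy for a language (Arabic script, for example, is shared by several languages).

```python
import unicodedata
from collections import Counter


def primary_script(text):
    """Return the most frequent script among the letters in text, or None.

    The script label is taken from the first word of each letter's Unicode
    name, e.g. 'HEBREW LETTER ALEF' -> 'HEBREW'. This is an illustrative
    heuristic, not the detection method of any particular OCR product.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skip digits, punctuation, spaces, vowel points
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical presorting step: group page texts by their dominant script.
pages = {"p1": "שלום עולם", "p2": "مرحبا بالعالم", "p3": "hello world"}
by_script = {page_id: primary_script(text) for page_id, text in pages.items()}
```

A real pipeline would route each page to the matching recognition engine based on this label; here the result is just a dictionary from page ID to script name.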
BIU’s IT department, the scientific team, and the Responsa committee were the first to be trained on the software and still use it today. The team runs VERUS on a development machine and a production machine, both running Windows 7.
Search & Retrieve
BIU uses a retrieval software engine that supports classic free-text searches for Boolean word combinations using an inverted index. With the help of advanced capture and OCR technology, there are currently more than 300 digitized books within the database portal. The data is available to students, researchers, and the public. Information can be retrieved with simple keyword queries as well as through a more sophisticated search and retrieval mechanism. Users can also request and print every Responsum in the system.
In 1991, Online Responsa launched and was released on CD for accessibility. The project contains useful content for the Torah scholar, but also provides a user-friendly method for the public to learn about the culture and language of different eras. For instance, a historian may use the search functions of the Responsa Project to learn about various aspects of a certain era. There are indexes to many halakhic works and keyword combinations that now exist in more modern and standardized languages. There is also a simulated thesaurus for users unfamiliar with the technical terminology.
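For readers unfamiliar with the technique, the following Python sketch shows how Boolean free-text search over an inverted index works in general. It is an illustration only, not BIU's retrieval engine: the function names, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, *words):
    """Documents containing every one of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


def search_or(index, *words):
    """Documents containing at least one of the given words."""
    return set().union(*(index.get(w.lower(), set()) for w in words))


def search_not(index, all_ids, word):
    """Documents that do not contain the given word."""
    return set(all_ids) - index.get(word.lower(), set())


# Hypothetical sample data; real responsa would be full texts.
docs = {
    1: "sabbath candle lighting",
    2: "sabbath travel",
    3: "loan interest",
}
index = build_index(docs)
both = search_and(index, "sabbath", "travel")
either = search_or(index, "candle", "loan")
without = search_not(index, docs, "sabbath")
```

Because each word maps directly to the documents that contain it, a Boolean query reduces to set intersection, union, or difference, so a lookup avoids rescanning the full text of every book.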
A Technology Win
For BIU, NovoDynamics’ OCR technology delivers high levels of accuracy when processing real-world documents, such as yellowed pages, poor copies, and stained documents. It uses advanced, proprietary image processing technology that automatically cleans and orients pages before recognizing text, which improves precision when recognizing Hebrew and Middle Eastern languages. OCR supports the Responsa Project’s ability to provide valuable texts drawn from thousands of years of writing.
The Responsa Project is a living database archive and continues to grow as more books and texts are uncovered. The project is scheduled to expand under a ten-year plan in which the team will integrate approximately 40,000 more identified Jewish books. The team expects to expand its content tenfold, from one billion to ten billion bits of data.