Digitalised Text

Before and after – Bulgarian in Abbyy FineReader

The miracles of the 21st century! Not only do we have the .pdf format for scanned documents, but we can also turn those scans into text documents that can be edited. This can be done in a number of ways, including with Google’s application, but we use Abbyy FineReader. The advantage of this software is that it supports many different languages, many of them with built-in dictionaries that help with word recognition. You can also train it (when using it on Windows) to recognise words and patterns in order to clean out mistakes. Good thing the software is Romanian and the creators included a Bulgarian dictionary! (Thanks, neighbours!)

My project will definitely be a continuation of my work at the Sofia Central Library (Столична библиотека). There I used to index, and often type up on a computer (digitalise by hand), the works and books of old Bulgarian collections. Most of the books were quite old: whatever was left after the bombing of the library in 1945. I hope I will be able to get scanned versions of such documents and digitalise them. So far, I am working with the library to find the most unique and intriguing works to put into my personal corpus. If that does not work out, I will take the most practical texts, the ones that would help them expand their collection.

On the left you can see the result of turning a Bulgarian .pdf into raw text to work with. The original was not a paper source, so it was quite easy for the software to recognise. Since this was a test in Bulgarian, I also noticed that it even underlined the places where spelling mistakes had been made. There are also a couple of lines of old Bulgarian poetry, and the software noted that something was a little bit off with the language. Amazing!
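
How FineReader decides what to underline internally is not something this post can confirm, but the basic idea of dictionary-based flagging can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tiny `KNOWN_WORDS` list stands in for a real Bulgarian dictionary, and `flag_unknown_words` is a hypothetical helper, not part of any OCR product.

```python
import re

# Hypothetical mini word list; a real dictionary would hold many thousands of entries.
KNOWN_WORDS = {"това", "е", "тест", "на", "български", "език"}


def flag_unknown_words(text, known_words):
    """Return the words in `text` that are missing from `known_words`, in order."""
    # [^\W\d_]+ matches runs of letters only, including Cyrillic letters.
    words = re.findall(r"[^\W\d_]+", text)
    # Compare in lower case so capitalised sentence starts are not flagged.
    return [word for word in words if word.lower() not in known_words]


sample = "Това е тестт на български език"
print(flag_unknown_words(sample, KNOWN_WORDS))
```

A check like this would also flag old or poetic word forms that a modern dictionary does not list, which may be why the lines of old poetry looked "off" to the software.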