
The Scan and Share Tutorial Version 1.07


Written by V.; translated into English by A. 2008

Contents
1 Introduction
2 Scanning a book
  2.1 Setting up IrfanView for scanning
  2.2 Handwork while scanning
3 Processing scans with ScanKromsator
  3.1 Draft run
  3.2 Set options
  3.3 Main run
4 Processing color figures and photos
5 Encoding scans into DJVU
6 Creating text layer with OCR
7 Adding book covers and color plates
8 Adding hyperlinks and bookmarks
A Where to download software

Translator's note: This document was originally written in Russian. Some English-language screenshots for IrfanView were inserted; some minor details were added by the translator. Screenshots for Djvu Hyperlinks Editor remain in Russian because that program has no other localization.

1 Introduction
This is a mini-tutorial about scanning books and making high-quality files. This tutorial is intended for newbies who would like to make good-quality electronic books but do not know where to start. There are many ways to get good results by scanning; this text shows you one reasonably easy way. The tutorial has step-by-step screenshots and assumes some familiarity with Windows. You may need to download and install a few programs (see Appendix A).

We will be mostly targeting the digitization of old books on science, mathematics, or technology. For these books, replacing the pages with OCR text is pointless because they contain too many equations, diagrams, graphs etc. The only solution is to scan and make images of all pages. Such books are almost always printed purely in black/white, with perhaps very few pages having color illustrations. For that kind of book, the highest quality of scanned e-books is achieved if one uses 600dpi black/white images for most pages. [1] So you need to scan either directly in 600dpi black/white, or in 300dpi greyscale and then process the scans to make them into 600dpi black/white. [2] If the book has a few pages with color illustrations, you will need to scan them separately in 300dpi 24-bit color mode. The same applies to colorful book covers that you may also want to scan.

Please note: Never scan at 300dpi black/white! The quality of the results is never as good as what you can get by scanning in 300dpi greyscale and following this tutorial or equivalent methods. On most scanners, scanning in 300dpi greyscale is exactly as quick as scanning in 300dpi black/white or in any lower resolution! You will not save time if you scan in 300dpi black/white or in 200dpi instead of 300dpi greyscale, but you will lose a lot of quality.

Scanning in 300dpi greyscale produces large intermediate scanned files, which will be processed into very small DJVU files. Scanning in 600dpi black/white produces smaller intermediate scanned files, but the process of scanning at 600dpi is much slower for most scanners. Also, it is easier to process 300dpi greyscale scans because they have less "digital dirt" than 600dpi black/white scans.

It is nearly impossible to improve the quality of a poorly scanned and/or incorrectly processed image of a book. For example, some e-books are made by inexperienced people in 150dpi, or in color instead of black/white. These e-book files are huge in size. The visual and print quality of such e-books is bad and cannot be improved! It is important (and not difficult) to make the scanned image correctly and ensure great quality of the resulting e-books. Read on!
[1] If you don't know what 600dpi means: it is called the resolution of the image and means the number of image points per inch (dpi = dots per inch).
[2] This kind of processing, in which the resolution of an image is increased, is called upsampling.
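The resolutions mentioned above translate into pixel counts by simple arithmetic. The following Python sketch is only an illustration: the 6 x 9 inch page size is an assumption (it is not a figure from this tutorial), and the function names are made up for the example. It suggests why a 600dpi black/white image, although it has four times as many pixels as a 300dpi greyscale image, needs only half as many bytes before any compression is applied.

```python
def page_pixels(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed image size in bytes; each row is padded to whole bytes."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page size: 6 x 9 inches (hypothetical, chosen for round numbers).
w300, h300 = page_pixels(6, 9, 300)   # greyscale scan, 8 bits per pixel
w600, h600 = page_pixels(6, 9, 600)   # upsampled black/white, 1 bit per pixel

grey_300 = raw_bytes(w300, h300, 8)
bw_600 = raw_bytes(w600, h600, 1)

print("300dpi greyscale:  ", w300, "x", h300, "pixels,", grey_300, "bytes")
print("600dpi black/white:", w600, "x", h600, "pixels,", bw_600, "bytes")
```

These are uncompressed sizes; the files actually stored on disk are smaller because TIFF and DJVU compress the data.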

A high-quality scanned e-book is small in size, has great visual appearance on the screen and also when printed, and has searchable text. There are many ways to achieve high quality of scanned e-books; all methods involve the resolution of 600dpi. Output files are in the DJVU [3] format and typically take about 5 KB to 10 KB per page.

You may of course experiment on your own with other programs. For example, some people use Photoshop with special plugins, Book Restorer, Corel PhotoPaint, RasterID, even Matlab and IDL for picture processing. This tutorial presents a particular method that practically guarantees good results. If you are a beginner, please make a few books by closely following the instructions in this tutorial. You will then see that you can achieve quite a high level of quality. If you develop your own methods, for example by using different ScanKromsator options or different programs, you will be able to decide which method is best because you can then compare the quality of the results with the reference quality obtained by the methods in this tutorial.

One word of warning concerns using FineReader for scanning. Please do not use FineReader for scanning and processing e-books! FineReader is a good program for OCR only; it is not optimal for scanning and for processing the scans with the goal of making a digital scanned e-book. FineReader attempts to give you a kind of all-in-one solution for scanning and processing e-books; resist the temptation to use just one program for everything. You will not get good results with FineReader; in any case, nowhere near as good as when you follow this tutorial. FineReader has the following technical drawbacks:

1) It sometimes uses JPEG for image compression. This is not appropriate for black/white texts!
2) It stores images internally as black/white 300dpi TIFFs and auto-rotates them. Black/white 300dpi is adequate for OCR but not optimal for digital scanned e-books. The auto-rotate algorithm is faulty and produces defects in the image (broken lines). The auto-rotation is hard-coded into FineReader 7.x and 8.x and cannot be disabled. [4]
3) If you scan in 300dpi greyscale, which is the procedure recommended here, FineReader will perform all operations at 300dpi rather than resample to 600dpi. ScanKromsator will first resample to 600dpi and then perform processing.

The results of FineReader processing are always going to be inferior for these reasons.

2 Scanning a book
You pick up a thick volume. Maybe you think that only a maniac could scan it, page after page. Yes, you are right! But you can become that kind of maniac and scan books of any size without much discomfort if you organize your work well.
[3] If you don't know what DJVU is, please use Google or Wikipedia to read about it. The DJVU format was specially developed for high-compression storage of scanned images. The PDF format was intended for documents created in a word processor, i.e. for vector documents rather than scanned documents. Scanned e-books in PDF format occupy much more space and/or display more slowly than in the DJVU format.
[4] An option to disable this auto-rotation was added only in FineReader version 9. However, FineReader version 9 cannot be used (yet) to produce the OCR layer in DJVU files.

Figure 1: Two images of the same page, one made by a digital camera, another by a cheap flatbed scanner. The image made by the flatbed scanner was scanned at 300dpi greyscale and upsampled to 600dpi black/white. You can guess which image that is! We recommend that you always use a flatbed scanner and scan at 300dpi greyscale or higher resolution.

First note: Please do not use a digital camera for scanning books! You will never get good results, even with an expensive 10-megapixel camera. Use an ordinary flatbed scanner; even a cheap one is adequate. Look at figure 1 and guess which of the two images of the same page was made by a digital camera.

For scanning, you need any program that can work with the TWAIN scanner driver. [5] It is convenient to have a program that can save the scanned image of every page to the hard disk, numbering the files like p0001.tif, p0002.tif, etc. For example, the image file viewers ACDSee, IrfanView, and XnView can also scan images. There is also a convenient scanning program, VueScan, if it works with your scanner.

2.1 Setting up IrfanView for scanning


As an example, we describe how to scan using IrfanView. (This program can be downloaded for free.) Scanning in other programs is quite similar.
[5] Most scanners are supported by TWAIN drivers; for other scanners you may need special drivers.

Start IrfanView. In the File menu, press "Choose TWAIN Source". Choose the scanner that you need to use.

Then in the same menu choose "Acquire/Batch scan".

Here you can choose how to number the scanned files, where to store them, and in which format to save them. As shown, the files will be named page0001.tif, page0002.tif, etc. You should select TIFF as the image format. (Do not use JPEG as the output format!) Click on Options to the right of the Save as field. This will set the options for the TIFF format.

You should select LZW compression; this will cut the TIFF file size in two, compared with no compression (None). [6] If you later find that you have compatibility problems with these TIFF files (i.e. you later use a program that cannot open them), then you need to change the compression method. Do not use the JPEG compression method for black/white text! JPEG compression introduces digital artifacts, that is, funny-looking shades around each letter (see figure 2). It is pointless to use JPEG for black/white images. [7]

Now press OK and go to the TWAIN driver window for your scanner. In the TWAIN window (or other configuration window if you are not using TWAIN drivers), set the resolution to 300dpi and the color mode to greyscale. These are the most important settings.

[6] Note that a typical page scanned in greyscale will occupy between 2 and 4 megabytes on the hard disk with LZW compression.

Figure 2: Digital artifacts appearing due to JPEG compression of black/white text. (In this example, the quality setting for the JPEG encoding was very low, so these artifacts are apparent to the eye.) At left: greyscale image with unnatural wavy-looking shadows around the letters. These digital shadows are typical for JPEG compression of black/white images. At right: the same image converted back to black/white, resulting in digital noise.

2.2 Handwork while scanning


The actual work is not complicated. First you need to try scanning some place in the book and check that everything works well. Take a book, open it somewhere where the pages are full of text, and put the book (both pages down) on the scanner glass. If necessary, press with your hand so that the crease is as close to the glass as possible. (You can also use a weight, e.g. another heavy book on top, but it is slower than pressing by hand.) Do a preview scan. Then you can see what has been scanned in the preview window. If needed, you can turn the page 90 degrees so that the text is straight up. You can also adjust contrast, brightness, and gamma correction if necessary. Your goal is that the text must be clearly visible.

[7] The JPEG format actually cannot handle black/white images; when one converts black/white images to JPEG, the software must convert those images into greyscale images. The JPEG compression then introduces a certain quality loss, as shown in the figure. The quality loss in JPEG compression is acceptable for photographs but may degrade black/white text quite significantly, unless a high-quality JPEG mode is selected. (The quality of JPEG compression is usually selectable as a number from 1% to 100%. No visible artifacts would appear at 90% quality or higher. But some programs, especially for making PDF files or for optimizing images, may not allow you to set the JPEG quality manually.)

Select the scanning region by using the mouse. You should select the scanning region such that some white space is left around the text. Click the Scan button and wait until the scanner finishes scanning the page. This will get the scan of one page (or two pages at once, if you can fit the book onto the scanner). The scanned file will be saved to the disk.

Now that the scanning program is set up, you can scan all the pages with the same settings. While the scanner lamp is moving back, turn to the next page and put the book back in the same place on the scanner. Then press the mouse button to scan again. (The mouse can be left pointing at the Scan button, so you don't need to look. Alternatively, some scanners have buttons on them that make the next scan.) This technique allows you to scan the entire book, one page after another, without looking at the computer screen or at the keyboard. You can watch TV or whatever while you are scanning. Depending on the scanner speed, you can get between 100 and 200 scans per hour. Some scanners are particularly fast (e.g. Plustek OpticBook).

It is not necessary to set the book onto the scanner absolutely straight (edge of the book parallel to the edge of the scanner). You should try to put it reasonably straight, but it is unavoidable that pages will not all be scanned completely straight; many pages will be slightly skewed. This small skew is okay and will be corrected later (after scanning) by software. Correcting this skew is called deskewing. When scanning you just need to avoid very large skews and cut pages, i.e. cases when some of the text falls outside the scanning region.

The region of the text around the book crease is often difficult to scan. You can try scanning one page at a time (rather than two pages) or pressing slightly harder onto the book binding. It is important that the text is directly next to the scanner glass. Even 1 mm distance between the glass and the paper will make a very fuzzy scanned image in almost all scanners!

It is faster to scan a book two pages per scan rather than one page at a time. But not all books can be scanned that way; some books are too large or don't open sufficiently to be scanned two pages per scan. You need to try and decide how to proceed. Regardless of how you scan, the processing software will be able to cut the images into single pages.

The result at this stage is a directory full of TIFF files. These files are the raw material that you will start processing after you finish scanning. Note that you need sufficient disk space to store all those scans (at least 4 MB per scanned image!). After you finish scanning, use the slideshow mode of some picture viewer to quickly preview the scanned images to make sure that you didn't miss any pages and that every page is adequately scanned. It will be too late if you discover only at the final processing stage that some pages are upside-down or missing, especially when the book has already left your hands!
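Before paging through the slideshow, a quick count can already reveal a missed scan. The following Python sketch is an illustration only: the function names and the 301-page book are made up, while the 4 MB per scan is the rough figure quoted above. Keep in mind that the first or last page of a book is often scanned alone, so the expected count may be off by one or two.

```python
import math


def expected_scans(book_pages, pages_per_scan=2):
    """Number of scans needed if every scan holds pages_per_scan book pages."""
    return math.ceil(book_pages / pages_per_scan)


def disk_space_mb(scans, mb_per_scan=4):
    """Rough disk space for the raw greyscale scans, in megabytes."""
    return scans * mb_per_scan


def count_scans(filenames, ext=".tif"):
    """Count files with the given extension, ignoring upper/lower case."""
    return sum(1 for name in filenames if name.lower().endswith(ext))


# Hypothetical book of 301 pages, scanned two pages at a time.
scans = expected_scans(301)
print("expected scans:", scans)
print("disk space needed: about", disk_space_mb(scans), "MB")
```

To compare with the actual result, one could pass the directory listing, e.g. count_scans(os.listdir("scans")) after importing os, where "scans" stands for whatever folder you chose in IrfanView.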

Note: When you scan the book, please do not omit title pages, front matter (including any information about the publisher), the table of contents, the index, the bibliography, empty pages, page numbers, or anything else! You will not save much time if you decide not to scan some 20 pages or so. However, a science book is almost unusable without bibliography and index and without exact information about its publication. Also, do not think that you will make your life easier from the legal point of view if you don't scan the publication information. However, try to avoid scanning the library stamps (just cover them with paper, or remove them with a digital image editor after scanning). Nobody wants to see those library stamps in the e-book.

3 Processing scans with ScanKromsator


The main piece of processing software is the wonderful ScanKromsator written by Bolega. [8] ScanKromsator is a very powerful tool for processing scanned material. ScanKromsator has a very large number of useful functions, but some of them are not intuitive or are difficult to understand if you just look at the user interface. [9] In this tutorial you will be walked through a particular simplified workflow with ScanKromsator, assuming that you scanned a book at 300dpi greyscale.

Start ScanKromsator and load the raw TIFF files into it (menu File). The list of files will appear in the top left column. The toolbar with several tabs (Book, etc.) will appear below the list of files.

[8] Please do not write email to Bolega asking for help, for documentation, for source code of ScanKromsator, or for adding extra features! Instead, just learn to use it and make some good quality e-books!
[9] We will talk only about the bare minimum of ScanKromsator functions here. Unfortunately the ScanKromsator program does not yet have a comprehensive user's manual describing all the functions.

In the example shown, a book was scanned with two pages per scan, and apparently there was some skewing. Our task now is to split, to deskew, and to cut the page images so that every page has the same size and margins. If your scan is single-page, you will not need to split, but you will still need to deskew and cut. This operation is called kromsating in the program. [10]

3.1 Draft run


The first step is a draft processing run, i.e. preparation for the final processing of the raw files. Click the tab Files in the toolbar. You get a dialog where you can set the output resolution (very important!) to 600dpi, the folder for storing the output files (the output folder is by default the subdirectory out in the current directory), and the way of numbering the output files (prefix, number of digits, starting number, step). Note the format for compressing the output files: it is TIFF G4 encoding, which is optimal for black/white TIFF images. This will be the output format after processing.

[10] The pseudoword kromsate is a mangled Russian word meaning to cut in pieces. Within ScanKromsator, kromsate means the operation of splitting a two-page scanned image into individual page images, and also the operation of cutting page images so that the margins become even and equal on all pages.

To start the draft processing run, click the button Draft kromsate bearing the pictogram of scissors, which is located to the left of the Process button in the toolbar. When you press the Draft kromsate button, you get the dialog shown at right. In this dialog you need to set tick marks on Split pages and Safe top/bottom. The field Kromsate=All means that the options are applied to all the pages. If some pages do not need to be split, you can select Kromsate=Current and unset Split pages for these pages. Press OK and wait 10-15 minutes until the Draft kromsate operation is finished. You will get the following screen.

Note that there are now green tick marks in the page list (top left column), meaning that these pages have been draft kromsated successfully. For each page you will see the blue lines across the page. These lines are the cutters that determine how the page image will be cut and split. Note that the program attempts to determine automatically where to cut the margins and where to split a two-page image into single pages. In some cases the program may make a mistake and cut too much or too little; in that case you will later be able to adjust the position of the cutters by hand.
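To get a feeling for what placing the cutters automatically involves, here is a deliberately naive Python sketch. It is not ScanKromsator's algorithm (which is not documented here); it simply finds the smallest rectangle containing all dark pixels of a greyscale page and keeps a fixed margin around it. The function names, the darkness limit of 128, and the tiny test image are all assumptions made for the illustration. A single speck of dirt near the page edge would fool this method, which is one reason why the cutters sometimes need adjusting by hand.

```python
def text_bounds(image, dark_below=128):
    """Find the box around all dark pixels.

    image is a list of rows; each row is a list of grey values 0..255.
    Returns (top, bottom, left, right) as inclusive indices,
    or None if the page has no dark pixels at all.
    """
    rows = [y for y, row in enumerate(image) if any(v < dark_below for v in row)]
    if not rows:
        return None  # blank page
    width = len(image[0])
    cols = [x for x in range(width) if any(row[x] < dark_below for row in image)]
    return rows[0], rows[-1], cols[0], cols[-1]


def cut_with_margin(image, margin, dark_below=128):
    """Crop the page to its dark content plus a margin (in pixels) on each side."""
    bounds = text_bounds(image, dark_below)
    if bounds is None:
        return image
    top, bottom, left, right = bounds
    top = max(0, top - margin)
    left = max(0, left - margin)
    bottom = min(len(image) - 1, bottom + margin)
    right = min(len(image[0]) - 1, right + margin)
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Tiny 5 x 5 "page": white everywhere, one black pixel in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
print(cut_with_margin(page, 1))
```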


3.2 Set options


The next important step is to go through the processing options and prepare for the main (not draft) run of ScanKromsator. The processing options are set in the many different tabs in the toolbar (left middle column).

Please note: Each option can be set either to apply to all pages at once, or only to the currently shown page. To apply an option to all pages, hold the Ctrl key while clicking the option box with the mouse. In this way, you can set some common options quickly for the entire task and then go to some problematic page and select other options just for that page.

First click the Page tab. Here you can set processing options for cutting the pages. The option Split means to split the two-page image into single pages. Deskew will deskew each single page image separately. Despeckle removes small dots. Sometimes Deskew makes pages significantly skewed; this is usually due to some complicated illustrations. In that case, check Art for these pages. You can set Ortho if the page needs to be rotated by 90 degrees. You can set these options separately for left and right (L and R) pages.

Now click on the Book tab. Here you set options related to the size and layout of the pages in the final book. H.Gap is the size of the margins. The value of 200 is good for 600dpi (meaning 1/3 inch). Page width and height can be set to Auto. You can also center the pages differently (align to center/align to top/align to bottom).

We already visited the Files tab at the draft stage. It is very important to have 600dpi as the output resolution in the Files tab!

Now click on the Options tab. Set Deskew method = Auto (shear), Resample filter = Lanczos3. The setting Despeckle=Fine+Normal or Safe switches on an intelligent despeckle method that avoids removing the dots over i or j, for example. Text sensitivity controls the logic of the autocutting. Low sensitivity might cut off the page numbers if they are too far away from the text. You may need to adjust the sensitivity settings a little bit, but in most cases they do not need to be adjusted. You can skip the Options 2 tab for now.

Click on the Convert tab. Here you set the threshold for converting greyscale images to black/white. Do not forget to hold the Ctrl key (to set this for all pages) as you select Threshold=MiddleDark. Experiment with other settings if you don't like the results.
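The idea of a threshold can be illustrated with a few lines of Python. This is a minimal sketch of plain global thresholding, not of what ScanKromsator's MiddleDark setting actually does internally (that is not described in this tutorial); the function name, the sample grey values, and the two threshold values are made up for the example. Every pixel darker than the threshold becomes black (0), everything else becomes white (1).

```python
def to_black_white(image, threshold=128):
    """Convert grey values 0..255 to 0 (black) or 1 (white) by a global threshold."""
    return [[0 if value < threshold else 1 for value in row] for row in image]


# One row of pixels crossing a letter stroke: paper, soft edge, ink, soft edge, paper.
row = [[250, 200, 120, 30, 110, 240]]

print(to_black_white(row, threshold=128))  # soft edges count as ink: thicker stroke
print(to_black_white(row, threshold=100))  # soft edges count as paper: thinner stroke
```

The two calls show why the setting matters: the grey pixels at the edge of each letter end up either black or white depending on the threshold, so letters come out bolder or thinner.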

Click the Quality tab; there you can further control the conversion to black/white. This is a very important function! Set Enhance image, Blur=1, and Sharpen=1. What is important is that the image will become smoother with this setting. The values of Blur and Sharpen could be 2 instead of 1, although the value 1 is usually good. A larger value will make the letters more black. You may need to experiment depending on the quality of printing in a particular book. Another important option is Gray enhance. Click on it since you have greyscale scans (which is what you should have!).

You will get a dialog with many options for greyscale images. Go to the Background cleaner tab and check Enable. Skip several tabs and click the Illumination tab; click Correct illumination. This will normalize the illumination of the page, which is important since usually some parts of the page are darker than others. This is a very useful feature that removes black shadows that would otherwise appear in darker places on the page!

Skip several tabs and click the Denoise tab. Set the parameters as shown at right. These parameters clean up the image. This is the last set of options that we are going to bother with right now. You can use the File/Options... menu to write the options to a file. This will save you all this work the next time.

The last step before the main processing is a visual check of the position of the cutters. You need to go through every page and check that the cutters are correctly positioned. Yes, this is a bit boring... but you can make it quick. Put two fingers of the left hand onto the keys q and w; pressing these keys will go to the previous/next page. With the right hand, you hold the mouse and adjust the position of the cutters wherever needed. Sometimes there is a skewed shadow, or it is necessary for some reason to set the cutter line at an angle rather than vertically or horizontally. Hold the Shift key and drag the cutter by its end to achieve this.

You can copy the cutter position from one page to another. Right-click on the cutter, and you will see the menu as shown. For instance, if the current cutter position needs to be applied to all subsequent pages, click Copy current position to/all down.

If some page contains a photograph or a color figure, you need to protect it from being converted to black/white. This can be done when checking the position of the cutters. Basically, you can select some arbitrary part of the page and mark it as a picture zone. See Section 4 for more details.

You can save the settings for this task by using the File/Save Task command in the menu. This command is useful if you want to stop the task and continue it later.

3.3 Main run


Now that everything is ready, you can begin the main run of ScanKromsator. Press the large button that says Process and bears the icon of a book, in the main toolbar at top:

The program will ask you to confirm that you really want to change the resolution of the images. Confirm! The process will then start. Now you need to wait a while. The upsampling operation can be quite slow; in recent versions of ScanKromsator (5.8 and up) this operation was made faster. You may expect to process 5 pages per minute or so, and the main processing run may take some hours on a slow computer.

When everything is finished, you should view the output files in the output folder. You should check that all pages are cut and deskewed correctly. If some pages are not processed correctly, you can repeat the processing of just those pages with some other options.

It is not necessary to process the entire book in one run. One can process only some portion of the pages; then one needs to set Book/Page width/Fixed to the size determined in the previous portion of the pages (so that all pages have equal size at the end of processing). It is usually sufficient to take 10 to 15 pages for determining the page size.
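To plan your time, the rates quoted in this tutorial (100 to 200 scans per hour, about 5 processed pages per minute) can be turned into a rough estimate. The Python sketch below is only arithmetic on those figures; the function names and the 400-page book are made up for the example, and your scanner and computer may be faster or slower.

```python
def scanning_hours(scans, scans_per_hour):
    """Time spent at the scanner, in hours."""
    return scans / scans_per_hour


def processing_minutes(pages, pages_per_minute=5):
    """Time for the main ScanKromsator run, in minutes."""
    return pages / pages_per_minute


# Hypothetical 400-page book scanned two pages at a time, i.e. 200 scans.
print("scanning, slow scanner:", scanning_hours(200, 100), "hours")
print("scanning, fast scanner:", scanning_hours(200, 200), "hours")
print("main run:", processing_minutes(400), "minutes")
```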

If you like, you can use the powerful cleaning features of ScanKromsator to remove the digital dirt from some pages. Typically, the digital dirt is any extraneous spots on the paper, pencil or pen marks, and library stamps. Of course, you can also use any graphics editor to clean the images by hand. Hopefully, there will not be many pages to clean.

4 Processing color figures and photos


We discuss color figures separately because they are not frequently needed. However, their place in the workflow is at the point where you check and adjust the position of the cutters. The latest version of Kromsator (5.9) includes a feature for color figure processing, the so-called picture zones.

On some pages there may be a picture, i.e. a non-black/white illustration such as a photograph or a colorful diagram. You need to protect these illustrations from being converted into black/white. To mark a picture zone, select a rectangle containing the illustration and click on the button Mark as picture zone bearing the icon of a blue frame in this toolbar:

It is also possible to have polygon-shaped picture zones. This is useful, for example, if the page was scanned with a large skew. Use the star-shaped tool button to mark such zones:

To set the options for a picture zone, double-click on the selected region. You will see the dialog Picture zone properties.

You need to set the color of the illustration. For example, if the page contains a greyscale photograph (rather than a color photograph or color diagram), set Color=Gray. We cannot discuss other zone options here; as you see, there are many options intended for advanced users. But note that after kromsating the picture zones will be saved to separate files. So after the main processing run you will have to merge them with the page files. This is done by using the menu command Zones/Picture zone/Merge zones. The resulting page files will be TIFF files in which the text is black/white but the picture zones have color.

5 Encoding scans into DJVU


Once the processing of raw scans is finished, you have in the output folder a bunch of TIFF files which are (almost all) black/white at 600dpi. These TIFF files typically take between 50 and 100 KB per page instead of the 4 MB that the greyscale files took. By now you should have checked these TIFF files and made sure that the quality of the black/white images is good: the letters are sharp and have smooth shapes, there is little or no dirt, etc. To check all that, you can view the TIFF files in a picture viewer (such as IrfanView) at high zoom.

Still, 50 to 100 KB per page is far too much. The next step is to encode these images to DJVU format; this will reduce their size dramatically, typically to 5-10 KB per page.

To make a good, well-optimized DJVU file, you need one of these programs: DjvuSolo version 3.1, Djvu Document Express (DDE) 4.x, 5.x, 6.x, or Djvu Document Express Enterprise (DEE) version 5.1. [11] The DDE and DEE programs are much faster than DjvuSolo, and DEE 5.1 can be configured to run in batch mode. On the other hand, DjvuSolo is a small and freely downloadable program. The results in terms of DJVU file quality from DjvuSolo and from DDE/DEE are pretty much the same if you set the options correctly.

There are two ways of making DJVU files: one is by hand, another by batch. To make a DJVU file by hand, run DjvuSolo or DDE and click File/Open to open the first TIFF file. Then click Edit/Insert pages... and select all the other TIFF files. Please note: the file selection box may have a bug: if you select many files by holding the Shift key and clicking with the mouse, they may be listed in the box in inverse order. Check that you are selecting the files in the correct order. Then you need to Save as... and select the Bundled format for DJVU and the Bitonal option at 600dpi. You can also edit the file documenttodjvu.conf in the profiles directory and set pages-per-dict=100 or 200.
The more pages per dictionary, the slower the compression process, but the smaller the resulting file size. Note that the Bitonal option (or profile) in the DJVU encoders is intended for purely black/white scans, while the Scanned option is intended for scans that have some (not many) colors but no photographs. Use the Photo option for photographs.

To make a DJVU file by batch, you need DEE 5.1. [12] First you need to create a special set of options (or custom profile) for the DJVU encoding job. Run the Document Express Configuration Manager, choose the profile Bitonal (600dpi) from the list of profiles, click Advanced settings, and you will see the following dialog.

[11] There is also a free software package called djvulibre, but it cannot produce sufficiently well compressed DJVU files.
[12] This is a rather large package; there exists a stripped-down version that takes only about 20 MB on the hard disk.

Now choose the Text tab as shown above. In that tab, set Pages per dictionary = 1000 (if this consumes too much RAM on your computer, or if this is too slow, set it to 200 or 300 instead of 1000). Save the custom profile under a new name, say Bitonal-1. Do the same for the Scanned (600dpi) profile if you need to encode books with color drawings.

Now run the Document Express Workflow Manager. Load all the TIFF pages into it. In the Job name field, write the name of the book if you want. Choose the previously created custom profile in the list Raster profile.


Then click the Output tab (the tabs are at the bottom of the window). In the list Separate document(s) choose One document only. Tick the box under Enable at far left. Wait until the encoding is finished. You can also look at the Log tab to watch the progress. That's all; the DJVU file is created. Do not delete the TIFF files yet! You may need to encode again if the DJVU file has some error. Also, the TIFF files are useful for OCR purposes (see section 6).

The result of DJVU encoding is a multipage DJVU file containing the entire e-book. You should rename that file to something sensible, not just math1.djvu. At the very least, the file name should contain the author's name, the title of the book, the publication year, and/or the ISBN if available. This is just a little work, but it will be so much easier to share that file on the Internet if its name is sensibly chosen.
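If you make many e-books, a small helper can keep the file names consistent. The Python sketch below shows one possible naming pattern (author, title, year, ISBN, separated by dashes); the pattern itself, the function name, and the sample author and title are all made up for the illustration and are not prescribed by this tutorial. The helper also removes the characters that Windows does not allow in file names.

```python
import re


def ebook_filename(author, title, year=None, isbn=None, ext=".djvu"):
    """Build a descriptive file name from the bibliographic data that is available."""
    parts = [author, title]
    if year:
        parts.append(str(year))
    if isbn:
        parts.append("ISBN " + isbn)
    name = " - ".join(parts)
    name = re.sub(r'[\\/:*?"<>|]', "", name)   # characters not allowed on Windows
    name = re.sub(r"\s+", " ", name).strip()   # collapse repeated spaces
    return name + ext


# Made-up example data, only to show the resulting pattern.
print(ebook_filename("Smith J.", "Introduction to Tensors: Theory", 1975))
```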

6 Creating text layer with OCR


Compared with the trouble needed to scan and process the book into a DJVU le, it is really peanuts to add OCR for it. An e-book with search is a lot easier to use. The search in DJVU les works only if the DJVU le has the so-called OCR layer. This layer is basically just a list of words stored inside the DJVU le in compressed form. You can create the OCR layer using two programs: FineReader and DjvuOCR. You need FineReader version 7 or 8.13 It is okay to use even a trial or unregistered or evaluation version that you can download for free. The result of running FineReader will be a set of FineReader batch les. The wonderful program DjvuOCR created by Gencho will read these les directly, extract the OCR information, and insert it into DJVU les.
Footnote 13: FineReader 9 is now available, but it cannot add OCR to DJVU files, and there is no DjvuOCR support for FR 9.


Suppose you have already created the DJVU file out of some TIFF files. Hopefully, you didn't delete the TIFF files. Load the TIFF files into a new batch in FineReader (keep in mind the problem with selecting many files at once!). Set the recognition language and press Read all. When the OCR process is finished, click Save batch.

It is not recommended to edit the OCR text. Previous versions of DjvuOCR could not process FineReader batches if the OCR text was edited. The most recent version, DjvuOCR 2.2, can deal with small edits. You should not rewrite large blocks of text; i.e. if you edit, you should keep most of the original symbols in their original positions. Also, you should not delete the end-of-line symbols, so that the number of lines in a paragraph remains the same. But we recommend that you do not edit the OCR text at all. After saving the FineReader batch, you can quit FineReader and run the program DjvuOCR.
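Before loading the TIFF files, you may want to check that no page is missing and look at the files in page order, since the FineReader batch presumably has to contain the same pages as the DJVU file. The Python sketch below is only an illustration. It assumes (this is not stated in the tutorial) that the page files have names ending in a page number, such as page1.tif; both function names are made up.

```python
import re
from pathlib import Path


def page_number(name):
    """Return the last run of digits in a file name as an int, or None."""
    digits = re.findall(r"\d+", Path(name).stem)
    return int(digits[-1]) if digits else None


def check_tiff_sequence(names):
    """Sort TIFF names by page number and list the missing page numbers."""
    numbered = []
    for name in names:
        if not name.lower().endswith((".tif", ".tiff")):
            continue  # ignore anything that is not a TIFF file
        number = page_number(name)
        if number is not None:
            numbered.append((number, name))
    numbered.sort()
    ordered = [name for _, name in numbered]
    present = {number for number, _ in numbered}
    missing = []
    if present:
        missing = [k for k in range(min(present), max(present) + 1)
                   if k not in present]
    return ordered, missing


# Hypothetical file names; in practice you would list your scan directory.
ordered, missing = check_tiff_sequence(
    ["page3.tif", "page1.tif", "notes.txt", "page4.tif"])
print(ordered)  # ['page1.tif', 'page3.tif', 'page4.tif']
print(missing)  # [2]
```

A non-empty list of missing numbers means that a page was skipped or deleted somewhere along the way.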

This program has several functions; for example, DjVu Decoder will produce TIFF files out of DJVU in case you deleted your TIFF files, or if you are working with somebody else's DJVU file. For now, you will use only the Manual mode OCR manager. Click that, and you get the following window.


In the FineReader Project directory field, select the directory where the FineReader batch is located. Output OCR text file will be the name of the new file; it doesn't matter what that name is. Tick the Burn DJVU file box and select the DJVU file below; this means that the OCR data will be inserted (burned) into the DJVU file. Click Process, wait a few minutes, and that's all. Now the DJVU file is full-text searchable!

7 Adding book covers and color plates


It is reasonably easy to add a simple book cover. Just scan the book cover in 300dpi color, or even in 200dpi. Slightly blur the image in a graphics editor. Encode into DJVU using the profile Photo(300) or Scanned. The resulting 1-page DJVU file needs to be inserted at the beginning of the DJVU e-book after all the other processing is finished. Usually the book cover should not be larger than 20-30 KB. It is probably not necessary to spend a lot of effort on making a great-looking book cover. Consider that the people who will read your e-book will spend most of the time reading the text rather than looking at the cover.

In the same way one can add color plates, that is, special pages that contain only color illustrations. Scan them separately and insert them into the finished DJVU file after all other processing is done.

To insert or rearrange pages in a DJVU file, use DjvuSolo or DDE. Open the DJVU file, and you will see the thumbnails of the pages in the left column. You can simply drag the thumbnails to rearrange the pages; you can also Cut, Copy, and Paste pages or groups of selected pages, or delete pages. Use the menu Edit > Insert pages... to add more DJVU pages to an existing DJVU file. You can insert single-page or multipage DJVU files anywhere (before or after any page), as you need.

8 Adding hyperlinks and bookmarks


After finishing all the preceding work with the DJVU file (including OCR), you can add some hyperlink navigation to it. There are two ways of adding hyperlinks. The first is to use the DjvuSolo or Djvu Editor programs and add hyperlinks by hand. Usually, one adds hyperlinks to pages in the table of contents for easier navigation. In DjvuSolo or Djvu Editor you can select any rectangular area on any page and then insert a hyperlink to a different page of the DJVU file. The user will go to this page when clicking anywhere in the area. Note that the hyperlink will point to a page number, so hyperlinks have to be added only after all changes to the page order and all insertions of additional pages into the DJVU file are done. So, if you want, you can sit and make rectangular areas into hyperlinks until you are blue in the face.
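To see why the order of operations matters, here is a tiny Python sketch. It only models the bookkeeping and does not touch any DJVU file; the function name and the sample table of contents are made up for this illustration.

```python
def targets_after_insert(links, position, count=1):
    """Return the page numbers the links should point to after `count`
    pages are inserted before page `position` (pages are numbered from 1)."""
    return {
        label: (target + count if target >= position else target)
        for label, target in links.items()
    }


# Hyperlinks created before the cover was added (made-up example).
toc_links = {"Chapter 1": 5, "Chapter 2": 20}

# A cover inserted as the new page 1 pushes every later page back by one,
# so links that still say 5 and 20 would now land one page too early.
print(targets_after_insert(toc_links, position=1))
# {'Chapter 1': 6, 'Chapter 2': 21}
```

Adding the cover and color plates first, and the hyperlinks last, avoids this correction altogether.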

The second way to add hyperlinks is semi-automatic, using the program DJVU Hyperlinks Editor (see footnote 14). Run the program and you will see the following window.
Footnote 14: This program has only a Russian-language interface.


First you need to specify options for the hyperlinks. Then you need to specify the page range in which the table of contents is located in the DJVU file. These are DJVU page numbers, which may be different from the page numbers printed in the book and in the table of contents (e.g. because some pages are taken by the cover and by the front matter). To compensate for this, one usually needs to add a certain offset to the page number; for instance, page 10 in the printed book may actually be page 11 in the DJVU file because one page is taken by the cover (see footnote 15). Then you need to enter the corresponding offset into the box whose label means offset.

Now that all options are entered, press the button whose label means Add. This will add a new DJVU file to the list in the left panel; the current options will apply to that file. You can now set different options and add a different file. Finally, press the button whose label means create. This will insert the hyperlink information into all the DJVU files.

Similarly, one can create hyperlinks in the subject index. One needs to select a different entry in the drop-down box. The default entry, as shown, means Table of contents. Other entries mean that you want to process the subject index. The same settings apply.

After finishing the processing, one should view the DJVU file and check that the hyperlinks were added correctly. The program relies on the OCR text for determining the page numbers for hyperlinks, so any errors in OCR may lead to errors in the position or targeting of the hyperlinks.
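The offset arithmetic can be illustrated with a short Python sketch. This is not how DJVU Hyperlinks Editor works internally (that is not documented here); it is only an assumed model in which each table-of-contents line ends with a printed page number, and the function name and sample lines are made up.

```python
import re


def toc_targets(toc_lines, offset):
    """Map each table-of-contents line to a DJVU page number."""
    targets = []
    for line in toc_lines:
        # Look for a page number at the very end of the line.
        match = re.search(r"(\d+)\s*$", line)
        if match is None:
            continue  # no trailing page number recognised on this line
        printed_page = int(match.group(1))
        # Drop the dot leaders and spaces between the title and the number.
        title = line[:match.start()].rstrip(" .")
        targets.append((title, printed_page + offset))
    return targets


toc = ["1 Introduction . . . . 10", "2 Scanning a book . . 25", "Preface"]
print(toc_targets(toc, offset=1))
# [('1 Introduction', 11), ('2 Scanning a book', 26)]
```

If the recognized text of a line has a wrong or unreadable page number, the line is either skipped or sent to the wrong page, which is why the resulting hyperlinks need to be checked by eye.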
Footnote 15: This is the Russian convention, where the page numbering starts right from the first page of the book. In Western typography the front matter usually has separate roman numbering, so typical offsets will be not 1 but between 10 and 20.


A Where to download software


Name of program                        Download site             Status
IrfanView 4.1                          www.irfanview.com         free
ScanKromsator 5.9                      www.djvu-soft.narod.ru    free
DjvuSolo 3.1                           www.djvu-soft.narod.ru    free
Djvu Editor 4.x, 5.x, 6.x (DDE/DEE)    www.djvu-soft.narod.ru    nonfree
FineReader 7.x, 8.x                    www.abbyy.com             trial
DjvuOCR 2.2 beta                       djvuocr.ucoz.ru           free
Djvu Hyperlinks Editor                 www.djvu-soft.narod.ru    free

Big thanks to monday2000 for creating the website djvu-soft.narod.ru!

Note for Linux users: All the programs in this table work reasonably well under Wine, the standard way of running Windows programs on Linux. However, the installers of some programs (IrfanView, DDE/DEE, FineReader) may fail if you run setup.exe under Wine. For these programs you need to get portable or already installed versions that do not require running an installer.


