resegment: running for 155 minutes(?)... #73

Open
jbarth-ubhd opened this issue Oct 12, 2020 · 6 comments

jbarth-ubhd commented Oct 12, 2020

and still running.

Workflow:

. /usr/local/ocrd_all/venv/bin/activate
export TMPDIR=/dwork/tmp
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
ocrd-create-mets.xml
( /usr/bin/time ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page" \
"pc-segmentation -I OCR-D-N5 -O OCR-D-N6" \
"cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region" \
"tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8" \
"cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9" \
"cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10" \
"calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json"

) >cmd.log 2>&1
ps axf
ls       66073  0.0  0.0   4384   744 pts/0    S    14:40   0:00                                  |   \_ /usr/bin/time ocrd process olena-binarize -I O[44/1843] -O OCR-D-N1 -P impl wolf anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page pc-segmentation -I OCR-D-N5 -O OCR-D-N6 cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json
ls       66074  0.0  0.0 2423620 68968 pts/0   S    14:40   0:05                                  |       \_ /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/python3.7 /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/ocrd process olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page pc-segmentation -I OCR-D-N5 -O OCR-D-N6 cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json
ls        2747  116  0.3 11505348 519324 pts/0 Rl   16:44 160:53                                  |           \_ /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/python3.7 /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/ocrd-cis-ocropy-resegment --working-dir /_digi8+9/digitalisate8/ocr-d/testset/x,pc-segmentation,tesserocr-segment-line,calamari-frak19th --mets mets.xml --input-file-grp OCR-D-N8 --output-file-grp OCR-D-N9 --parameter {"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}

@bertsky: same image set as in last email.

PS: no cis-ocropy-clip for obvious reasons :-)

@jbarth-ubhd jbarth-ubhd changed the title resegment: running for 155 minutes... resegment: running for 155 minutes(?)... Oct 12, 2020

jbarth-ubhd commented Oct 12, 2020

jbarth-ubhd commented Oct 14, 2020

Finally went through; took hours.

Since this only occurs in combination with pc-segmentation, and pc-segmentation currently seems to be the weakest segmentation method, I'll close this case.

bertsky commented Oct 14, 2020

> Finally went through; took hours.
>
> Since this only occurs in combination with pc-segmentation and pc-segmentation seems currently the weakest segmentation method, I'll close this case.

I would really like to debug this, but unfortunately I have not been able to run ocrd-pc-segmentation in the past. So could you please provide me with the last input file? I.e. fileGrp OCR-D-N8 file with pageId OCR-D-N8_00062 – only the PAGE-XML (since you gave me the images already)...

Before we close, we should make sure this is not a bug on the ocrd_cis side. Could you please run ocrd workspace validate, esp. on OCR-D-N8?

@bertsky bertsky reopened this Oct 14, 2020

jbarth-ubhd commented Oct 15, 2020

I've let it run again...

Note: the complete workflow took longer than with sbb_textline; resegment alone took 3:30 wall-clock time.

I don't know exactly which page affects the resegment execution time. Perhaps it is a consequence of very poor input to resegment. Let's wait and see whether someone else complains in combination with sbb_textline or similar.

21:19:33.979 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1 -p 
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:24:12.068 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1 -p 
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:24:12.073 INFO ocrd.task_sequence.run_tasks - Start processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 -p 
'{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 
0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, 
"operation_level": "page"}''
21:28:15.482 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 -p 
'{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 
0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, 
"operation_level": "page"}''
21:28:15.497 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3 -p 
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:33:18.981 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3 -p 
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:33:18.989 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -p 
'{"level-of-operation": "page", "noise_maxsize": 3.0, "dpi": 0}''
21:34:40.901 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 
-p '{"level-of-operation": "page", "noise_maxsize": 3.0, "dpi": 0}''
21:34:40.910 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p 
'{"level-of-operation": "page", "maxskew": 5.0}''
21:48:11.411 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p 
'{"level-of-operation": "page", "maxskew": 5.0}''
21:48:11.421 INFO ocrd.task_sequence.run_tasks - Start processing task 'pc-segmentation -I OCR-D-N5 -O OCR-D-N6 -p 
'{"overwrite_regions": true, "xheight": 8, "model": "__DEFAULT__", "gpu_allow_growth": false, "resize_height": 300}''
21:55:00.789 INFO ocrd.task_sequence.run_tasks - Finished processing task 'pc-segmentation -I OCR-D-N5 -O OCR-D-N6 -p 
'{"overwrite_regions": true, "xheight": 8, "model": "__DEFAULT__", "gpu_allow_growth": false, "resize_height": 300}''
21:55:00.816 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -p 
'{"level-of-operation": "region", "maxskew": 5.0}''
22:08:02.059 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -p 
'{"level-of-operation": "region", "maxskew": 5.0}''
22:08:02.073 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 
-p '{"dpi": -1, "overwrite_lines": true}''
22:09:49.340 INFO ocrd.task_sequence.run_tasks - Finished processing task 'tesserocr-segment-line -I OCR-D-N7 -O 
OCR-D-N8 -p '{"dpi": -1, "overwrite_lines": true}''
22:09:49.356 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 -p 
'{"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}''
01:39:31.500 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 
-p '{"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}''
01:39:31.533 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 -p 
'{"dpi": 0, "range": 4.0, "max_neighbour": 0.05}''
01:58:02.010 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 
-p '{"dpi": 0, "range": 4.0, "max_neighbour": 0.05}''
01:58:02.061 INFO ocrd.task_sequence.run_tasks - Start processing task 'calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -p 
'{"checkpoint": "/usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json", "voter": 
"confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}''
03:01:22.186 INFO ocrd.task_sequence.run_tasks - Finished processing task 'calamari-recognize -I OCR-D-N10 -O OCR-D-OCR 
-p '{"checkpoint": "/usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json", "voter": 
"confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}''
03:01:22.268 INFO ocrd.cli.process - Finished
32186.10user 9665.10system 5:42:03elapsed 203%CPU (0avgtext+0avgdata 12763172maxresident)k
6303240inputs+45550456outputs (12729major+450089605minor)pagefaults 0swaps
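
The per-step wall-clock times can be read off the Start/Finished pairs in the log above. Below is a minimal Python sketch for doing that automatically. It assumes each log entry has first been joined onto a single line (the paste above is wrapped), and that a step never runs longer than 24 hours, since the timestamps carry no date. The function name `task_durations` and the shortened log excerpt (parameter JSON omitted) are illustrative only.

```python
import re
from datetime import datetime, timedelta

# Excerpt of the log above: wrapped entries joined, parameter JSON omitted.
LOG = """\
22:08:02.073 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8'
22:09:49.340 INFO ocrd.task_sequence.run_tasks - Finished processing task 'tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8'
22:09:49.356 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9'
01:39:31.500 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9'
"""

ENTRY = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}) INFO ocrd\.task_sequence\.run_tasks - "
    r"(Start|Finished) processing task '([^\s']+) -I ([^\s']+) -O ([^\s']+)"
)


def task_durations(log_text):
    """Return a list of ((task, input, output), timedelta) in log order."""
    started = {}
    durations = []
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue
        stamp, event, task, in_grp, out_grp = match.groups()
        key = (task, in_grp, out_grp)  # a task name alone may occur twice
        when = datetime.strptime(stamp, "%H:%M:%S.%f")
        if event == "Start":
            started[key] = when
        elif key in started:
            elapsed = when - started.pop(key)
            if elapsed < timedelta(0):
                # The step ran past midnight; timestamps have no date.
                elapsed += timedelta(days=1)
            durations.append((key, elapsed))
    return durations


durations = task_durations(LOG)
for (task, in_grp, out_grp), elapsed in durations:
    print(f"{task} ({in_grp} -> {out_grp}): {elapsed}")
```

On the excerpt this gives about 1 minute 47 seconds for tesserocr-segment-line and about 3 hours 30 minutes for cis-ocropy-resegment, which matches the 3:30 quoted above.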

bertsky commented Oct 15, 2020

> > Finally went through; took hours.
> > Since this only occurs in combination with pc-segmentation and pc-segmentation seems currently the weakest segmentation method, I'll close this case.
>
> I would really like to debug this, but unfortunately I have not been able to run ocrd-pc-segmentation in the past. So could you please provide me with the last input file? I.e. fileGrp OCR-D-N8 file with pageId OCR-D-N8_00062 – only the PAGE-XML (since you gave me the images already)...
>
> Before we close, we should make sure this is not a bug on ocrd_cis side. Could you please ocrd workspace validate, esp. OCR-D-N8?

I was able to run ocrd-pc-segmentation now. I can reproduce the extremely long runtime of resegment afterwards.

From what I see, this is somewhat related to bad segmentation quality (undetected multi-column layouts). In addition, ocrd-pc-segmentation produces invalid PAGE (negative coordinates etc.).
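
One quick way to spot the negative coordinates is to scan the `points` attribute of every `Coords` element in a PAGE-XML file. The following is a minimal sketch using only the Python standard library. The sample document, the region and line ids, and the function name `negative_coords` are made up for illustration; element names are matched by local name so that the PAGE namespace version does not matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical PAGE-XML fragment with one region reaching outside the image.
SAMPLE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <Coords points="-5,10 90,10 90,40 -5,40"/>
      <TextLine id="r1_l1">
        <Coords points="0,12 88,12 88,25 0,25"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def negative_coords(page_xml):
    """Return [(id of the segment, [points with a negative x or y])]."""
    offenders = []
    root = ET.fromstring(page_xml)
    for parent in root.iter():
        for child in parent:
            if local_name(child.tag) != "Coords":
                continue
            points = [
                tuple(float(value) for value in pair.split(","))
                for pair in child.get("points", "").split()
            ]
            bad = [p for p in points if p[0] < 0 or p[1] < 0]
            if bad:
                offenders.append((parent.get("id"), bad))
    return offenders


offenders = negative_coords(SAMPLE)
for segment_id, bad_points in offenders:
    print(segment_id, bad_points)
```

For a real file, the text of the OCR-D-N8 PAGE-XML would be passed in place of `SAMPLE`. This only checks for negative values; it is not a substitute for full validation.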

But this also exposes a weakness in the resegmentation algorithm: if input regions are quite large, then the new line segmentation plus pair-wise comparison with existing lines and majority vote is inefficient.
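
To illustrate why pair-wise comparison scales badly with region size, here is a toy Python sketch. It is not the ocrd_cis implementation: the label maps, the function names, and the reading of `min_fraction` as "share of the existing line's pixels covered by one new line" are all assumptions made for the example. The pair-wise variant rescans the whole region once per (existing line, new line) pair, while the single-pass variant counts all overlaps in one scan.

```python
from collections import Counter
from itertools import chain

# Toy label maps for one region: 0 = background, other values = line labels.
OLD = [  # existing lines (e.g. from a previous line segmentation)
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [2, 2, 2, 2, 2, 2],
    [2, 2, 2, 2, 2, 2],
]
NEW = [  # lines from the new segmentation of the same region
    [7, 7, 7, 7, 7, 0],
    [7, 7, 7, 7, 8, 8],
    [8, 8, 8, 9, 9, 9],
    [8, 8, 8, 9, 9, 9],
]


def overlaps_pairwise(old, new):
    """Scan the full region once per (old, new) pair of labels."""
    old_ids = sorted(set(chain.from_iterable(old)) - {0})
    new_ids = sorted(set(chain.from_iterable(new)) - {0})
    counts = Counter()
    for o in old_ids:
        for n in new_ids:
            hits = sum(
                1
                for old_row, new_row in zip(old, new)
                for a, b in zip(old_row, new_row)
                if a == o and b == n
            )
            if hits:
                counts[(o, n)] = hits
    return counts


def overlaps_single_pass(old, new):
    """Scan the region once and count every (old, new) co-occurrence."""
    counts = Counter()
    for old_row, new_row in zip(old, new):
        for a, b in zip(old_row, new_row):
            if a and b:
                counts[(a, b)] += 1
    return counts


def majority_vote(counts, old, min_fraction=0.8):
    """Assign each old line to its best new line if the overlap is large enough."""
    sizes = Counter(v for v in chain.from_iterable(old) if v)
    assignment = {}
    for o, size in sizes.items():
        candidates = {n: c for (oo, n), c in counts.items() if oo == o}
        if not candidates:
            continue
        best = max(candidates, key=candidates.get)
        if candidates[best] / size >= min_fraction:
            assignment[o] = best
    return assignment


counts = overlaps_single_pass(OLD, NEW)
assignment = majority_vote(counts, OLD)
print(dict(counts))
print(assignment)
```

Both variants produce the same counts, but the pair-wise one costs roughly (existing lines) x (new lines) x (region pixels), which grows quickly when a single oversized region contains many lines. In the toy data, existing line 2 is split evenly between two new lines and therefore gets no assignment.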

I'll have to think about this.

bertsky commented Mar 12, 2022

Could you please revisit this with the current master version, @jbarth-ubhd?
