
Text Processing and Pattern Searching: Chapter - 6

The document discusses algorithms for text processing tasks like text formatting, justification, and keyword searching. It describes an algorithm to format text into lines of a specified maximum length without splitting words across lines. It also presents a Pascal implementation of this algorithm. It then discusses an algorithm for left and right justification of text to a fixed line length by inserting additional spaces between words in a way that avoids splitting words. Finally, it outlines an algorithm and Pascal implementation to count the number of occurrences of a given word in a text by scanning the text and matching substrings to the search word.

Uploaded by prasannakompalli
Copyright: © Attribution Non-Commercial (BY-NC)

Text line length adjustment
Given a set of lines of text of arbitrary length, reformat the text
so that no line of more than n characters is printed. Each output
line should hold the maximum number of words that fit in n
characters or fewer, and no word should extend across two lines.
Paragraphs should also remain indented.
• Tabs are not considered (treat each as a single space).
• Every end of line is read as a space.

Algorithm
1. Establish the line length limit and add one to it to allow for a
   trailing space.
2. Initialize the word and line character counts to zero and the
   end-of-line flag to false.
3. While not end-of-file do
   a) read and store the next character
   b) if the character is a space then
      b.1) if a new paragraph then
           1.a) move to the next line and reset the character count
                for the new line
      b.2) add the current word length to the current line length
      b.3) if the current word causes the line length limit to be exceeded then
           3.a) move to the next line and set the line length to the
                current word length
      b.4) write out the current word and its trailing space and
           reinitialize the word character count
      b.5) turn off the end-of-input-line flag
      b.6) if at end-of-input-line then
           6.a) set the end-of-input-line flag and move to the next input line.
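
The steps above can be sketched in Python as follows. This is only an illustrative sketch, not the Pascal procedure itself: the name format_text, the choice to pass the whole text as one string, and the stripping of trailing spaces from each output line are assumptions made here for readability.

```python
def format_text(text, limit):
    """Re-wrap text so that no output line exceeds `limit` characters.

    A space at the very start of an input line is taken to mark a new
    paragraph, and every end of line is treated as a space.
    """
    out = []          # output characters, including newlines
    word = []         # current word plus its trailing space
    linecnt = 0       # characters written on the current output line
    eol = False       # True just after an input line has ended
    limit += 1        # allow for the trailing space of the last word

    for line in text.splitlines():
        for ch in line + " ":              # end of line is read as a space
            word.append(ch)
            if ch == " ":
                if eol and len(word) == 1:
                    # leading space on a new input line: new paragraph
                    out.append("\n")
                    linecnt = 0
                linecnt += len(word)
                if linecnt > limit:
                    # the word does not fit: start a new output line
                    out.append("\n")
                    linecnt = len(word)
                out.extend(word)
                word = []
                eol = False
        eol = True

    # tidy up: drop the trailing space left at the end of each line
    return "\n".join(l.rstrip() for l in "".join(out).split("\n"))


sample = (" The quick brown fox\n"
          "jumps over the lazy dog.\n"
          " A new paragraph starts here.")
print(format_text(sample, 20))
```

Words longer than the limit are written on a line of their own in this sketch; how the original procedure treats them is not stated.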

Pascal Implementation
procedure textformat(limit: integer);
var i, linecnt, wordcnt: integer;
    chr, space: char;
    eol: boolean;                  {set when the previous input line has just ended}
    word: array[1..30] of char;    {current word plus its trailing space}
begin
  wordcnt := 0; linecnt := 0; eol := false; space := ' ';
  limit := limit + 1;              {allow for the trailing space}
  while not eof(input) do
  begin
    read(chr);
    wordcnt := wordcnt + 1;
    word[wordcnt] := chr;
    if chr = space then
    begin
      if eol and (wordcnt = 1) then
      begin                        {leading space on a new input line: new paragraph}
        writeln;
        linecnt := 0
      end;
      linecnt := linecnt + wordcnt;
      if linecnt > limit then
      begin                        {word does not fit: start a new output line}
        writeln;
        linecnt := wordcnt
      end;
      for i := 1 to wordcnt do
        write(word[i]);
      wordcnt := 0;
      eol := false;
      if eoln(input) then
      begin
        eol := true;
        readln
      end
    end
  end;
  writeln
end;

Left and Right Justification of Text

Design and implement a procedure that will left and right justify
text in a way that avoids splitting words and leaves paragraphs
indented. An attempt should also be made to distribute the
additional blanks as evenly as possible in the justified line.

• A fixed line length is achieved by inserting additional spaces
between words.
• For any particular line, one of the following holds:
– The line is already the correct length, so no processing is
needed.
– The number of extra spaces needed to expand the line to the
required length equals the number of spaces already present in
the line, so one space is simply added to each existing space.
– The number of spaces to be added is greater than the number of
existing spaces.
– The number of spaces to be added is less than the number of
existing spaces.

Ex:
Add 10 extra spaces to a line that already has 7 spaces.
• First add 7 spaces, one to each existing space.
• 3 are left over. In general, for the leftover spaces:
– if 1 space is left, add it to the middle space.
– if 2 spaces are left, position the first one-third of the way
across the line and the second two-thirds of the way across.
– if 3 spaces are left, add them after the 2nd, 4th and 6th words.

Algorithm
1. Establish the line to be justified, its current length and the justification length.
2. Include a test to see whether the line can be justified.
3. Initialize the space count and the alphabetic start of the line.
4. While the current character is a space do
   a) shift to the next character
   b) increment the alphabetic start of the line
   c) write out a space
5. For each position from the alphabetic start to the end of the line do
   a) if the current character is a space then
      a.1) increment the space count
      a.2) set the current position in the space count table to 1
6. Remove any spaces from the end of the line.
7. Determine the number of extra spaces to be added from the new and old line lengths.
8. While there are still extra spaces to add and it is possible to do so do
   a) compute the current template increment from the space count and the extra spaces count
   b) if the increment is 0 then set it to 1, since there are more extras than spaces
   c) if extra spaces > space count then
      c.1) determine the space block size using the extra spaces and the space count
      else
      c.1`) set the space block size to 1
   d) determine the starting position for the template
   e) while not at end-of-line and there are still spaces to add do
      e.1) add the space block to the current template position
      e.2) move to the next position in the template
      e.3) decrement the extra space count by the space block size
9. For each position from the start to the end of the line do
   a) if the next character is a space then
      a.1) move to the next position in the space count table
      a.2) write out the number of spaces given by the space count table
      else
      a.1`) write out the current character
10. Finish off with an end-of-line.
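
As a rough illustration of steps 4 to 9, the template idea can be sketched in Python. The function name justify and the decision to return the justified string instead of printing it are assumptions made for this sketch. Rounding is done half up, and the starting position is taken as (increment + 1) div 2; with that particular formula the three leftover spaces in the 10-extra, 7-gap example land in the 1st, 3rd and 5th gaps, a slightly different placement from the 2nd, 4th and 6th listed in the example.

```python
def justify(line, newlen):
    """Left and right justify `line` to exactly `newlen` characters.

    Leading spaces (the paragraph indentation) are kept as they are and the
    extra spaces are spread over the existing gaps using a template that
    holds the number of spaces to write at each gap.
    """
    body = line.rstrip(" ")                      # remove spaces from end of line
    if len(body) > newlen:
        raise ValueError("line too long")

    st = len(body) - len(body.lstrip(" "))       # alphabetic start of the line
    indent, rest = body[:st], body[st:]

    template = [1 for ch in rest if ch == " "]   # one entry per existing space
    nspaces = len(template)
    exspace = newlen - len(body)                 # extra spaces still to add

    while exspace > 0 and nspaces > 0:
        delta = int(nspaces / exspace + 0.5) or 1    # template increment
        spaceblock = exspace // nspaces if exspace > nspaces else 1
        nxt = (delta + 1) // 2                       # 1-based start position
        while nxt <= nspaces and exspace > 0:
            template[nxt - 1] += spaceblock
            nxt += delta
            exspace -= spaceblock

    out = [indent]
    gap = 0
    for ch in rest:
        if ch == " ":
            out.append(" " * template[gap])      # widened gap
            gap += 1
        else:
            out.append(ch)
    return "".join(out)


# 7 gaps and 10 extra spaces, as in the example
print(repr(justify(" a b c d e f g h", 26)))
```

A line with no interior spaces cannot be stretched, so it is returned unchanged.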

Pascal Implementation
procedure justify(line: nchars; oldlen, newlen: integer);
{nchars is assumed to be declared elsewhere as an array of char}
const tsize = 40;
var delta, exspace, j, ispace, nspaces, next, pos, st, spaceblock: integer;
    space: char;
    template: array[1..tsize] of integer;   {spaces to write at each gap}
begin
  if oldlen > newlen then writeln('line too long')
  else
  begin
    space := ' '; st := 1;
    while (line[st] = space) and (st <= oldlen) do
    begin                                  {copy the paragraph indentation}
      st := st + 1; write(space)
    end;
    nspaces := 0;
    if st <= oldlen then
      while line[oldlen] = space do oldlen := oldlen - 1;
    for pos := st to oldlen do
      if line[pos] = space then
      begin
        nspaces := nspaces + 1;
        template[nspaces] := 1
      end;
    exspace := newlen - oldlen;
    while (exspace > 0) and (nspaces > 0) do
    begin
      delta := round(nspaces / exspace);
      if delta = 0 then delta := 1;
      if exspace > nspaces then
        spaceblock := exspace div nspaces
      else
        spaceblock := 1;
      next := (delta + 1) div 2;
      while (next <= nspaces) and (exspace > 0) do
      begin
        template[next] := template[next] + spaceblock;
        next := next + delta;
        exspace := exspace - spaceblock
      end
    end;
    ispace := 0;
    for pos := st to oldlen do
      if line[pos] = space then
      begin
        ispace := ispace + 1;
        for j := 1 to template[ispace] do
          write(space)
      end
      else
        write(line[pos]);
    writeln
  end
end;

Keyword Searching in Text
Count the number of times a particular word occurs in a given text.
Algorithm
1. Establish the search word and its length wlength.
2. Initialize the match count nmatches, set the preceding character, and set the
   pointer i for the word array to 1.
3. While not at end-of-file do
   a) while not at end-of-line do
      a.1) read the next character
      a.2) if the current text character chr matches the ith character in word then
           2.a) extend the partial match i by 1
           2.b) if a word-pattern match then
                b.1) read the next character post
                b.2) if the preceding and following characters are not alphabetic then
                     2.a) update the match count nmatches
                b.3) reinitialize the pointer i to the word array
                b.4) save the following character post as the preceding character
           else
           2.a`) save the current text character as the preceding character for
                 the match
           2.b`) reset the word array pointer i to the first position
   b) read past the end-of-line.
4. Return the word-match count nmatches.
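
A minimal Python sketch of this scan is given below. The name count_word is hypothetical, the text is passed as a string instead of being read from a file, and str.isalpha is used for the "alphabetic" test (so upper-case letters also count as letters, which is an assumption beyond the lower-case alphabet used in the Pascal version).

```python
def count_word(text, word):
    """Count occurrences of `word` in `text` that stand alone as words.

    A match counts only when the characters immediately before and after
    it are not alphabetic. `word` is assumed to be non-empty.
    """
    nmatches = 0
    pre = " "          # character preceding the current partial match
    i = 0              # number of characters of `word` matched so far
    pos = 0
    n = len(text)
    while pos < n:
        ch = text[pos]
        pos += 1
        if ch == word[i]:
            i += 1
            if i == len(word):
                # whole word matched: read the following character
                post = text[pos] if pos < n else " "
                pos += 1
                if not pre.isalpha() and not post.isalpha():
                    nmatches += 1
                i = 0
                pre = post
        else:
            pre = ch
            i = 0
    return nmatches


print(count_word("the cat and the other theme, the", "the"))  # prints 3
```

In this example "other" and "theme" contain the letters of the search word but are rejected by the neighbour test.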

Pascal Implementation
procedure wordsearch(word: nchars; wlength: integer; var nmatches: integer);
type letters = 'a'..'z';
var i: integer;
    chr, pre, post: char;
    alphabet: set of letters;
begin
  alphabet := ['a'..'z'];
  nmatches := 0;
  pre := ' '; i := 1;
  while not eof(input) do
  begin
    while not eoln(input) do
    begin
      read(chr);
      if chr = word[i] then
      begin
        i := i + 1;
        if i > wlength then
        begin                      {whole word matched: check its neighbours}
          read(post);
          if (not (pre in alphabet)) and (not (post in alphabet)) then
            nmatches := nmatches + 1;
          i := 1;
          pre := post
        end
      end
      else
      begin
        pre := chr;
        i := 1
      end
    end;
    readln
  end
end;

Text Line Editing

Design and implement an algorithm that will search a line of text
for a particular pattern or substring. Should the pattern be found,
it is to be replaced by another given pattern.
Ex: replacing "wrong" with "right"
the two wrongs in this line are wrong -- original line
the two rights in this line are right -- edited line

Algorithm
1. Establish the text line, the search pattern and the replacement pattern, and
   their associated lengths.
2. Set initial values for the positions in the old text, the new text and the
   search pattern.
3. While not all pattern positions in the text have been examined do
   a) if the current text and pattern characters match then
      a.1) extend indices to the next pattern/text character pair
      a.2) if a complete match then
           2.a) copy the new pattern into the current position in the edited line
           2.b) move past the old pattern in the text
           2.c) reset the pointer for the search pattern
      else
      a`.1) copy the current text character to the next position in the edited text
      a`.2) reset the search pattern pointer
      a`.3) move the pattern to the next text position
4. Copy the leftover characters in the original text line.
5. Return the edited line of text.
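
The same steps can be sketched in Python. The name edit_line is an assumption made for illustration, and the edited line is built as a list of characters and returned instead of being written into a fixed array; the pattern is assumed to be non-empty.

```python
def edit_line(text, pattern, newpattern):
    """Return `text` with every occurrence of `pattern` replaced by `newpattern`."""
    out = []                     # characters of the edited line
    i = 0                        # start of the current alignment in text
    j = 0                        # pattern characters matched at this alignment
    n, m = len(text), len(pattern)

    while i <= n - m:            # alignments where the pattern still fits
        if text[i + j] == pattern[j]:
            j += 1
            if j == m:
                # complete match: copy the replacement, skip the old pattern
                out.extend(newpattern)
                i += m
                j = 0
        else:
            # mismatch: copy one text character and slide the pattern along
            out.append(text[i])
            i += 1
            j = 0

    out.extend(text[i:])         # leftover characters, too few to match
    return "".join(out)


print(edit_line("the two wrongs in this line are wrong", "wrong", "right"))
```

Because the pattern is re-compared from its first character after every mismatch, the running time of this simple scheme can grow with the product of the text and pattern lengths.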

Pascal Implementation
procedure textedit(var text, newtext: nchars; var pattern, newpattern: nchars;
                   var newlen: integer; textlen, patlen, newpatlen: integer);
var i, j, k, l: integer;
begin
  i := 1; j := 1; k := 0;
  while i <= textlen - patlen + 1 do
  begin
    if text[i + j - 1] = pattern[j] then
    begin
      j := j + 1;
      if j > patlen then
      begin                        {complete match: copy the replacement}
        for l := 1 to newpatlen do
        begin
          k := k + 1;
          newtext[k] := newpattern[l]
        end;
        i := i + patlen; j := 1
      end
    end
    else
    begin                          {mismatch: copy one text character}
      k := k + 1;
      newtext[k] := text[i];
      i := i + 1; j := 1
    end
  end;
  while i <= textlen do
  begin                            {copy the leftover characters}
    k := k + 1;
    newtext[k] := text[i];
    i := i + 1
  end;
  newlen := k
end;

Linear Pattern Search

Design and implement a pattern searching algorithm with a
performance that is linearly dependent on the length of the string
or text being searched. A count should be made of the number of
times the search pattern occurs in the string.

Algorithm Description

• Partial-match table setup algorithm
• Linear pattern searching algorithm
• Procedure for recovering from mismatches and complete matches

Partial-match table setup algorithm
1. Establish the search pattern.
2. Set the initial displacement between the pattern and itself to one.
3. Initialize the zero and first positions in the partial match array to zero.
4. While not all positions of the pattern relative to itself have been considered do
   a) if the current pattern and displaced pattern character pairs match then
      a.1) save the current degree of partial match
      a.2) move to the next position in the pattern and the displaced pattern
      else
      a`.1) a mismatch, so set the partial match to zero
      a`.2) reset the pointer to the start of the displaced pattern
      a`.3) move the start of the displaced pattern to the next available position
5. Return the partial match table.
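
One way to sketch the table setup in Python is shown below. The name partial_match_table is hypothetical, and entry k of the returned list is taken to be the length of the longest proper prefix of the first k pattern characters that is also a suffix of them. On a mismatch the sketch retries progressively shorter partial matches before settling on zero (the usual Knuth-Morris-Pratt fallback); reading step a`.2 that way is an assumption made here.

```python
def partial_match_table(pattern):
    """Build the partial-match (recovery) table for `pattern`.

    recover[k] is the size of the partial match that can be kept after
    k pattern characters have been matched; recover[0] and recover[1]
    are always zero.
    """
    m = len(pattern)
    recover = [0] * (m + 1)
    match = 0                                  # current degree of partial match
    for position in range(1, m):               # 0-based index into pattern
        # mismatch: fall back to shorter partial matches
        while match > 0 and pattern[position] != pattern[match]:
            match = recover[match]
        if pattern[position] == pattern[match]:
            match += 1
        recover[position + 1] = match
    return recover


print(partial_match_table("abaab"))   # prints [0, 0, 0, 1, 1, 2]
```

For "abaab" the last entry is 2 because the prefix "ab" reappears at the end of the pattern.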

Linear pattern searching algorithm
1. Establish the pattern to be searched for and the string in which it is to be
   sought, together with the lengths of the pattern and the string.
2. Set initial values for the start of the pattern and the string, and zero the match count.
3. While not all appropriate pattern positions in the string have been examined do
   a) if the current string and pattern characters match then
      a.1) extend indices to the next pattern/string pair
      a.2) if a complete match then
           2.a) update the complete match count
           2.b) reset the recovery position from the partial match table
      else
      a`.1) reset the recovery position from the partial match table
4. Return the count of the number of complete matches of the pattern in the string.

Procedure for recovering from mismatches and complete matches
1. Establish the partial match table, the current position in the string and
   the position in the pattern.
2. If there is no smaller partial match then
   a) move to the next position in the string
   b) return to the start of the pattern
   else
   a`) recover from the mismatch or complete match by using the table to set a new
       smaller partial match for the current position in the string
3. Return the smaller partial match and the current pattern position.
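
Putting the search and the recovery step together, a hedged Python sketch might look like this. The name kmp_count and the nested helper build_table are assumptions for illustration, indices are 0-based, and the pattern is assumed to be non-empty. The recovery rule follows step 2 above: with no partial match the string position advances, otherwise the table supplies a smaller partial match and the same string character is examined again.

```python
def kmp_count(pattern, string):
    """Count occurrences (overlapping ones included) of `pattern` in `string`."""

    def build_table(p):
        # recover[k]: partial match to keep after k matched characters
        recover = [0] * (len(p) + 1)
        match = 0
        for position in range(1, len(p)):
            while match > 0 and p[position] != p[match]:
                match = recover[match]
            if p[position] == p[match]:
                match += 1
            recover[position + 1] = match
        return recover

    recover = build_table(pattern)
    nmatches = 0
    match = 0                      # pattern characters matched so far
    position = 0                   # current position in the string
    while position < len(string):
        if pattern[match] == string[position]:
            match += 1
            position += 1
            if match == len(pattern):
                nmatches += 1
                match = recover[match]     # recover from a complete match
        elif match == 0:
            position += 1                  # no smaller partial match
        else:
            match = recover[match]         # smaller partial match, same character
    return nmatches


print(kmp_count("aa", "aaaa"))   # prints 3
```

The string position never moves backwards, which is what keeps the search linear in the length of the string.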

Pascal Implementation
procedure kmpsearch(pattern: nchars; string: nchars; var recover: ntchars;
                    var nmatches: integer; patlength, slength: integer);
{ntchars is assumed to be an integer array indexed from 0, declared elsewhere}
var position, match: integer;

  procedure restart(var recover: ntchars; var match, position: integer);
  begin
    if match = 1 then
      position := position + 1               {no smaller partial match}
    else
      match := recover[match - 1] + 1        {fall back to a smaller partial match}
  end;

  procedure partialmatch(pattern: nchars; var recover: ntchars; patlength: integer);
  var position, match: integer;
  begin
    position := 2; match := 1;
    recover[0] := 0; recover[1] := 0;
    while position <= patlength do
    begin
      if pattern[position] = pattern[match] then
      begin
        recover[position] := match;
        match := match + 1;
        position := position + 1
      end
      else
      begin
        recover[position] := 0;
        restart(recover, match, position)    {retry with a smaller partial match}
      end
    end
  end;

begin
  partialmatch(pattern, recover, patlength);
  nmatches := 0;
  position := 1; match := 1;
  while position <= slength do
  begin
    if pattern[match] = string[position] then
    begin
      match := match + 1;
      position := position + 1;
      if match > patlength then
      begin
        nmatches := nmatches + 1;
        restart(recover, match, position)
      end
    end
    else
      restart(recover, match, position)
  end
end;

Sublinear Pattern Search

Design and implement an algorithm that will efficiently search a
given text for a particular keyword or pattern and record the
number of times the keyword or pattern is found.

Algorithm
1. Establish the word and the text to be searched.
2. Set up the skip table.
3. Set the keyword match count to zero.
4. Set the character position i to the keyword length.
5. While the current character position <= text length do
   a) get the numeric value nxt of the current character at position i
   b) index into the skip table at position nxt
   c) if the skip value for the current character > 0 then
      c.1) increase the current position by the skip value
      else
      c`.1) backwards-match the text and the word
      c`.2) if a match is made, update the match count
      c`.3) recover from the mismatch
6. Return the match count.
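
The skip-table idea can be sketched in Python as follows; it resembles the Boyer-Moore-Horspool family of searches. The names set_skips and quick_search mirror the Pascal procedure names but the code is only an illustration: it uses 0-based indices, assumes a non-empty word, and assumes every character code is at most 127.

```python
def set_skips(word, asize=127):
    """Build the skip table; a negative entry marks the last character of word."""
    m = len(word)
    skip = [m] * (asize + 1)            # characters not in the word skip m places
    for j in range(m - 1):              # every character except the last
        skip[ord(word[j])] = m - 1 - j  # distance from this character to the end
    p = ord(word[-1])
    skip[p] = -skip[p]                  # flag the last character with a minus sign
    return skip


def quick_search(text, word):
    """Count occurrences of `word` in `text` using the skip table."""
    skip = set_skips(word)
    m, n = len(word), len(text)
    nmatches = 0
    i = m - 1                           # text index aligned with the word's last character
    while i < n:
        s = skip[ord(text[i])]
        if s > 0:
            i += s                      # last character cannot match here: skip ahead
        else:
            # last character matches: compare the rest of the word backwards
            j, k = i - 1, m - 2
            while k >= 0 and text[j] == word[k]:
                j -= 1
                k -= 1
            if k < 0:
                nmatches += 1
            i += -s                     # recover using the size of the flagged skip
    return nmatches


print(quick_search("the cat and the other theme", "the"))  # prints 4
```

Unlike the keyword search earlier, this counts every occurrence of the pattern, including those inside longer words such as "other".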

Pascal Implementation
procedure quicksearch(text, word: tc; tlength, wlength: integer; var nmatches: integer);
{tc is assumed to be declared elsewhere as an array of char}
const asize = 127;
type ascii = array[0..asize] of integer;
var i, j, k, nxt: integer;
    match: boolean;
    skip: ascii;

  procedure setskips(word: tc; var skip: ascii; wlength, asize: integer);
  var i, j, p: integer;
  begin
    for i := 0 to asize do
      skip[i] := wlength;
    for j := 1 to wlength - 1 do
    begin
      p := ord(word[j]);
      skip[p] := wlength - j
    end;
    p := ord(word[wlength]);
    skip[p] := -skip[p]            {a negative entry marks the last character of the word}
  end;

begin
  setskips(word, skip, wlength, asize);
  nmatches := 0; i := wlength;
  while i <= tlength do
  begin
    nxt := ord(text[i]);
    if skip[nxt] > 0 then
      i := i + skip[nxt]
    else
    begin                          {last character matches: compare the rest backwards}
      j := i - 1;
      k := wlength - 1;
      match := true;
      while (k > 0) and match do
      begin
        if text[j] = word[k] then
        begin
          j := j - 1;
          k := k - 1
        end
        else
          match := false
      end;
      if match then nmatches := nmatches + 1;
      i := i - skip[nxt]           {skip[nxt] is negative here, so i moves forward}
    end
  end
end;
