Greenstone3: A Modular Digital Library
Greenstone3: A Modular Digital Library
Greenstone3: A Modular Digital Library
Katherine Don
Department of Computer Science
University of Waikato
Hamilton, New Zealand
1
fashion, using SOAP as an example communications protocol. Finally, there are
several appendices, including how to install Greenstone from CVS, some notes
on Tomcat and SOAP, and a comparison of Greenstone2 and Greenstone3 format
statements.
2
Contents
1 Greenstone installation and administration 5
1.1 Get and install Greenstone . . . . . . . . . . . . . . . . . . . . . 5
1.2 How the library works . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Restarting the library . . . . . . . . . . . . . . . . . . . . 6
1.3 Directory structure . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Sites and interfaces . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Configuring Tomcat . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Configuring a Greenstone library . . . . . . . . . . . . . . . . . . 8
1.6.1 Site configuration file . . . . . . . . . . . . . . . . . . . . 9
1.6.2 Interface configuration file . . . . . . . . . . . . . . . . . 11
1.7 Run-time re-initialization . . . . . . . . . . . . . . . . . . . . . . 11
2 Using Greenstone3 14
2.1 Using a collection . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Building a collection . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Using the Librarian Interface . . . . . . . . . . . . . . . . 15
2.2.2 Importing from Greenstone2 . . . . . . . . . . . . . . . . 16
2.2.3 Using command line building . . . . . . . . . . . . . . . 16
2.3 Collection configuration files . . . . . . . . . . . . . . . . . . . . 18
2.3.1 collectionInit.xml . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 collectionConfig.xml . . . . . . . . . . . . . . . . . . . . 20
2.3.3 buildConfig.xml . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Formatting the collection . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Modifying the “format” xslt templates . . . . . . . . . . . 22
2.4.2 Changing the service text strings . . . . . . . . . . . . . . 29
2.5 Customizing the interface . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Modifying an existing interface . . . . . . . . . . . . . . 29
2.5.2 Defining a new interface . . . . . . . . . . . . . . . . . . 30
2.5.3 Changing the interface language . . . . . . . . . . . . . . 30
3
3.8.4 ’process’-type services . . . . . . . . . . . . . . . . . . . 49
3.8.5 ’applet’-type services . . . . . . . . . . . . . . . . . . . . 50
3.8.6 ’enrich’-type services . . . . . . . . . . . . . . . . . . . . 51
3.9 Page generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.9.1 ’page’-type requests and their arguments . . . . . . . . . 52
3.9.2 page format . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.9.3 Receptionists . . . . . . . . . . . . . . . . . . . . . . . . 55
3.9.4 Collection specific formatting . . . . . . . . . . . . . . . 55
3.9.5 CGI arguments . . . . . . . . . . . . . . . . . . . . . . . 55
3.9.6 Page action . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.9.7 Query action . . . . . . . . . . . . . . . . . . . . . . . . 56
3.9.8 Applet action . . . . . . . . . . . . . . . . . . . . . . . . 56
3.9.9 Document action . . . . . . . . . . . . . . . . . . . . . . 57
3.9.10 XML Document action . . . . . . . . . . . . . . . . . . . 57
3.9.11 GS2Browse action . . . . . . . . . . . . . . . . . . . . . 58
3.9.12 System action . . . . . . . . . . . . . . . . . . . . . . . . 58
3.10 Other code information . . . . . . . . . . . . . . . . . . . . . . . 58
5 Distributed Greenstone 65
5.1 Serving a site using soap . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Connecting to a site web service . . . . . . . . . . . . . . . . . . 66
B Tomcat 68
B.1 Proxying Tomcat with apache . . . . . . . . . . . . . . . . . . . . 69
B.2 Running Tomcat behind a proxy . . . . . . . . . . . . . . . . . . 69
C SOAP 70
C.1 Debugging SOAP . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4
1 Greenstone installation and administration
This section covers where to get Greenstone3 from, how to install it and how to run
it. The standard method of running Greenstone3 is as a Java servlet. We provide
the Tomcat servlet container to run the servlet. Standard web servers may be able
to be configured to provide servlet support, and thereby remove the need to use
Tomcat. Please see your web server documentation for this. This documentation
assumes that you are using Tomcat. To access Greenstone3, Tomcat must be started
up, and then it can be accessed via a web browser.
Ant (Java’s XML based build tool) is used for compilation, installation and
running Greenstone. The build.xml file is the configuration file for the Greenstone
project, and build.properties contains parameters that can be altered by the user.
5
arguments to the servlet, which processes these and returns a page of HTML. As
far as an end-user is concerned, a servlet is a Java version of a CGI program. The
interaction is similar: access is via a web browser, using arguments in a URL.
Other types of interfaces can be used, such as Java GUI programs. See Sec-
tion 4.3 for details about how to make these.
• $GSDL3HOME/WEB-INF/web.xml
• $GSDL3SRCHOME/packages/tomcat/conf/server.xml
• any classes or jar files used by the servlets
6
Table 1: The Greenstone directory structure
directory description
greenstone3 The main installation directory—$GSDL3SRCHOME is set to
this directory
greenstone3/src Source code lives here
greenstone3/src/java/ main Greenstone3 java source code
greenstone3/src/packages Imported source packages from other systems e.g. indexing
packages may go here
greenstone3/lib Shared library files
greenstone3/lib/java Java jar files not needed in the Greenstone3 runtime
greenstone3/lib/jni Jar files and shared library files (.so, .jnilib, .dll) needed for JNI
components
greenstone3/resources any resources that may be needed
greenstone3/resources/soap soap service description files
greenstone3/bin executable stuff lives here
greenstone3/bin/script some Perl and/or shell scripts
greenstone3/packages External packages that may be installed as part of greenstone,
e.g. Tomcat
greenstone3/docs Documentation
greenstone3/gli Greenstone Librarian Interface code
greenstone3/gs2build collection building code
greenstone3/web This is where the web site is defined. Any static HTML files
can go here. This directory is the root directory used by Tom-
cat when serving Greenstone3. $GSDL3HOME is set to this
directory.
greenstone3/web/WEB-INF The web.xml file lives here (servlet configuration information
for Tomcat)
greenstone3/web/WEB-INF/classes Individual class files needed by the servlet go in here, also prop-
erties files for java resource bundles - used to handle all the lan-
guage specific text. This directory is on the servlet classpath
greenstone3/web/WEB-INF/lib jar files needed by the servlets go here
greenstone3/web/sites Contains directories for different sites—a site is a set of collec-
tions and services served by a single MessageRouter (MR). The
MR may have connections (e.g. soap) to other sites
greenstone3/web/sites/localsite An example site - the site configuration file lives here
greenstone3/web/sites/localsite/collect The collections directory
greenstone3/web/sites/localsite/images Site specific images
greenstone3/web/sites/localsite/transforms Site specific transforms
greenstone3/web/interfaces Contains directories for different interfaces - an interface is de-
fined by its images and XSLT files
greenstone3/web/interfaces/default The default interface
greenstone3/web/interfaces/default/images The images for the default interface
greenstone3/web/interfaces/default/js The javascript libraries for the default interface
greenstone3/web/interfaces/default/style The CSS stylesheets for the default interface
greenstone3/web/interfaces/default/transforms The XSLT files for the default interface
greenstone3/web/applet jar files needed by applets can go here
7
providing a completely different look and feel for different audiences. Alterna-
tively, a standard interface may be used with many different sites—providing a
consistent mode of access to a lot of different content.
Collections live in the collect directory of a site. Any collections that are
found in this directory when the servlet is initialized will be loaded up. Public
collections will appear on the library home page, while private collections will be
hidden. These can still be accessed by typing in cgi arguments. Collections require
valid configuration files, but apart from this, nothing needs to be done to the site
to use new collections. Collections added while Tomcat is running will not be
noticed automatically. Either the server needs to be restarted, or a configuration
request may be sent to the library, triggering a (re)load of the collection (this is
described in Section 1.7).
There are two sites that come with the distribution: localsite, and gateway.
localsite has several demo collections, while gateway has none. gateway spec-
ifies that a SOAP connection should be made to localsite. Getting this to work
involves setting up a soap server for localsite: see Section 5 for details. There are
also two interfaces provided in the distribution: default and gs2. The default in-
terface is a generic Greenstone3 interface, while the gs2 interface aims to look like
the old Greenstone2 interface.
Each site and interface has a configuration file which specifies parameters for
the site or interface—these are described in Section 1.6.
8
Table 2: Greenstone servlet initialization parameters
name sample value description
library name library the web name of the servlet
interface name default the name of the interface to use
site name localsite the name of the local site to use (use either
site name or the three remote site parameters)
remote site name org.greenstone.site1 the name of a remote site (can be anything??)
remote site type soap the type of server running on the site
remote site address http://www.greenstone.org/ The address of the server
greenstone3/services/
localsite
default lang en the default language for the interface
receptionist class MyReceptionist (optional) specifies an alternative Receptionist
to use (default is DefaultReceptionist)
messagerouter class NewMessageRouter (optional) specifies an alternative Message-
Router to use (default is MessageRouter)
params class GS2Params (optional) specifies an alternative GSParams
class to use
uration files; these are discussed in Section 2.3. The configuration files are read
in when the system is initialized, and their contents are cached in memory. This
means that changes made to these files once the system is running will not take
immediate effect. Tomcat needs to be restarted for changes to the interface con-
figuration file to take effect. However, changes to the site configuration file can be
incorporated sending a system command to the library. There are a series of sys-
tem commands that can be sent to the library to induce reconfiguration of different
modules, including reloading the whole site. This removes the need to restart the
system to reflect these changes. These commands are described in Section 1.7.
9
<siteConfig>
<localSiteName value="org.greenstone.localsite"/>
<httpAddress value="http://localhost:8080/greenstone3/sites/localsite"/>
<serviceClusterList/>
<serviceRackList/>
<siteList/>
</siteConfig>
<siteConfig>
<localSiteName value="org.greenstone.gsdl1"/>
<httpAddress value="http://localhost:8080/greenstone3/sites/gsdl1"/>
<serviceClusterList>
<serviceCluster name="build">
<metadataList>
<metadata name="Title">Collection builder</metadata>
<metadata name="Description">Builds collections in a
gsdl2-style manner</metadata>
</metadataList>
<serviceRackList>
<serviceRack name="GS2Construct"/>
</serviceRackList>
</serviceCluster>
</serviceClusterList>
<siteList>
<site name="org.greenstone.localsite"
address="http://localhost:8080/greenstone3/services/localsite"
type="soap"/>
</siteList>
</siteConfig>
Figure 1 shows two example site configuration files. The first example is for
a rudimentary site with no site-wide services, which does not connect to any ex-
ternal sites. The second example is for a site with one site-wide service clus-
ter - a collection building cluster. It also connects to the first site using SOAP.
These two sites happen to be running on the same machine, which is why they can
use localhost in the address. For site gsdl1 to talk to site localsite, a SOAP
server must be run for localsite. The address of the SOAP server, in this case, is
http://localhost:8080/greenstone3/services/localsite.
Another element that can appear in a site configuration file is replaceList.
This must have an id attribute, and may contain one or more replace elements.
Replace elements are discussed in Section 2.3. The list found in a siteConfig.xml
file can be applied to any collection by adding a replaceListRef element (with
the appropriate id attribute) to its collectionConfig.xml file.
http://.
10
1.6.2 Interface configuration file
The interface configuration file interfaceConfig.xml lists all the actions that the
interface knows about at the start (other ones can be loaded dynamically). Actions
create the web pages for the library: there is generally one Action per type of page.
For example, a query action produces the pages for searching, while a document
action displays the documents. The configuration file specifies what short name
each action maps to (this is used in library URLs for the a (action) parameter) e.g.
QueryAction should use a=q. If the interface uses XSLT, it specifies what XSLT
file should be used for each action and possibly each subaction. This makes it easy
for developers to implement and use different actions and/or XSLT files without
recompilation. The server must be restarted, however.
It also lists all the languages that the interface text files have been translated
into. These have a name attribute, which is the ISO code for the language, and a
displayElement which gives the language name in that language (note that this
file should be encoded in UTF-8). This language list is used on the Preferences
page to allow the user to change the interface language. Details on how to add a
new language to a Greenstone3 library are shown in Section 2.5.3.
An optionList element can be used to disable or enable some optional func-
tionality for the interface. Currently there are three options that can be enabled:
highlightQueryTerms Whether search term highlighting is available
or not
berryBaskets Whether berry basket functionality is avail-
able or not
displayAnnotationService Whether any annotation services (specified in
the site config file) should be displayed with a
document or not.
An interface may be based on an existing one, for example, the gs2 interface
is based on the default interface. This means that it will use any images or tem-
plates from the base one unless overridden in the current one. The baseInterface
attribute of the <interfaceConfig> element is used to specify the base interface.
11
<interfaceConfig>
<actionList>
<action name=’p’ class=’PageAction’>
<subaction name=’home’ xslt=’home.xsl’/>
<subaction name=’about’ xslt=’about.xsl’/>
<subaction name=’help’ xslt=’help.xsl’/>
<subaction name=’pref’ xslt=’pref.xsl’/>
<subaction name=’nav’ xslt=’nav.xsl’/><!-- used for the
collection header frame -->
<subaction name="html" xslt="html.xsl"/> <!-- used to put an
external page into a frame with a collection header-->
</action>
<action name=’q’ class=’QueryAction’ xslt=’basicquery.xsl’/>
<action name=’b’ class=’GS2BrowseAction’ xslt=’classifier.xsl’/>
<action name=’a’ class=’AppletAction’ xslt=’applet.xsl’/>
<action name=’d’ class=’DocumentAction’ xslt=’document.xsl’/>
<action name=’xd’ class=’XMLDocumentAction’>
<subaction name=’toc’ xslt=’document-toc.xsl’/>
<subaction name=’text’ xslt=’document-content.xsl’/>
</action>
<action name=’pr’ class=’ProcessAction’ xslt=’process.xsl’/>
<action name=’s’ class=’SystemAction’ xslt=’system.xsl’/>
<action name=’g’ class=’GeneralAction’>
<subaction name="berry" xslt=’berry.xsl’/>
</action>
</actionList>
<languageList>
<language name="en">
<displayItem name=’name’>English</displayItem>
</language>
<language name="fr">
<displayItem name=’name’>Franais</displayItem>
</language>
<language name=’es’>
<displayItem name=’name’>Espaol</displayItem>
</language>
</languageList>
<optionList>
<option name="highlightQueryTerms" value="true"/>
<option name="berryBaskets" value="true"/>
</optionList>
</interfaceConfig>
12
Table 3: Example run-time configuration arguments.
a=s&sa=c reconfigures the whole site. Reads in siteConfig.xml, reloads all the
collections. Just part of this can be specified with another argu-
ment ss (system subset). The valid values are collectionList,
siteList, serviceList, clusterList.
a=s&sa=c&sc=XXX reconfigures the XXX collection or cluster. ss can also be used here,
valid values are metadataList and serviceList.
a=s&sa=a (re)activate a specific module. Modules are specified using two argu-
ments, st (system module type) and sn (system module name). Valid
types are collection, cluster site.
a=s&sa=d deactivate a module. st and sn can be used here too. Valid types are
collection, cluster, site, service. Modules are removed
from the current configuration, but will reappear if Tomcat is restarted.
a=s&sa=d&sc=XXX deactivate a module belonging to the XXX collection or cluster. st and
sn can be used here too. Valid types are service.
arguments into the URL; there is no nice web form yet to do this.
The arguments are entered after the library? part of the URL. There are
three types of commands: configure, activate, deactivate. These are specified by
a=s&sa=c, a=s&sa=a, and a=s&sa=d, respectively (a is action, sa is subaction). By
default, the requests are sent to the MessageRouter, but they can be sent to a col-
lection/cluster by the addition of sc=xxx, where xxx is the name of the collection
or cluster. Table 3 describes the commands and arguments in a bit more detail.
13
2 Using Greenstone3
Once Greenstone3 is installed, the sample collections can be accessed. The in-
stallation comes with several example collections, and Section 2.1 describes these
collections and how to use them. Section 2.2 describes how to build new collec-
tions.
The image at the top left is a link to the collection’s home page. The top
right has buttons to link to the library home page, help and preferences pages. All
the available services are arrayed along a navigation bar, along the bottom of the
banner. Clicking on a name accesses that service.
Search type services generally provide a form to fill in, with parameters includ-
ing what field or granularity to search, and the query itself. Clicking the search
button carries out the search, and a list of matching documents will be displayed.
Clicking on the icons in the result list takes you to the document itself.
2
of course, this is all customizable
14
Once you are looking at a document, clicking the open book icon at the top
of the document, underneath the navigation bar, will take you back to the service
page that you accessed the document from.
15
edit collections from another site. However, you can base a collection on one
from another site. To change the working site and/or servlet, go to Preferences-
>Connection in the File menu. By default, the GLI will use site localsite, and
servlet library.
Collection building using the GLI will use the Greenstone2 Perl scripts and plu-
gins. At the conclusion of the Greenstone2 build process, a conversion script will
be run to create the Greenstone3 configuration files. This means that format state-
ments are no longer ’live’—changing these will require changes to the Greenstone3
configuration files. Clicking the Preview Collection button will re-run the configu-
ration file conversion script. If you change anything on the Format panel, you will
need to click Preview Collection. Just reloading the collection via a browser will
not be enough.
Detailed instructions about using the GLI can be found in Sections 3.1 and 3.2
of the Greenstone2 User’s Guide (GS2-User-en.pdf). This can be found in your
Greenstone2 installation, or in the $GSDL3SRCHOME/docs/manual directory if you
have installed Greenstone3 from a distribution.
The script attempts to create Greenstone3 format statements from the old Green-
stone2 ones. The conversion may not always work properly, so if the collection
looks a bit strange under Greenstone3, you should check the format statements.
Format statements are described in Section 2.4.
Once again, to have the collection recognized by the library servlet, you can
either restart Tomcat, or load it dynamically.
16
steps are (for a new collection called testcol):
Linux:
cd greenstone3
source gs3-setup.sh
cd gs2build
source setup.bash
cd ../
mkcol.pl -collectdir $GSDL3HOME/sites/localsite/collect testcol
put source documents and metadata into
$GSDL3HOME/sites/localsite/collect/testcol/import
edit $GSDL3HOME/sites/localsite/collect/testcol/etc/collect.cfg as
appropriate
import.pl -collectdir $GSDL3HOME/sites/localsite/collect testcol
buildcol.pl -collectdir $GSDL3HOME/sites/localsite/collect testcol
rename the $GSDL3HOME/sites/localsite/collect/testcol/building
directory to index
convert_coll_from_gs2.pl -collectdir $GSDL3HOME/sites/localsite/collect
testcol
%$
Windows:
cd greenstone3
gs3-setup
cd gs2build
setup
cd ..
perl -S mkcol.pl -collectdir %GSDL3HOME%\sites\localsite\collect testcol
put source documents and metadata into
%GSDL3HOME%\sites\localsite\collect\testcol\import
edit %GSDL3HOME%\sites\localsite\collect\testcol\etc\collect.cfg as
appropriate
perl -S import.pl -collectdir %GSDL3HOME%\sites\localsite\collect testcol
perl -S buildcol.pl -collectdir %GSDL3HOME%\sites\localsite\collect testcol
rename the %GSDL3HOME%\sites\localsite\collect\testcol\building directory
to index
perl -S convert_coll_from_gs2.pl -collectdir
%GSDL3HOME%\sites\localsite\collect testcol
Once the build process is complete, Tomcat should be prompted to reload the
collection—either by restarting the server, or by sending an activate collection
command to the library servlet.
Metadata for documents can be added using metadata.xml files. A metadata.xml
file has a root element of <DirectoryMetadata>. This encloses a series of <FileSet>
items. Neither of these tags has any attributes. Each <FileSet> item includes two
parts: firstly, one or more <FileName> tags, each of which encloses a regular ex-
pression to identify the files which are to be assigned the metadata. Only files in
the same directory as the metadata.xml file, or in one of its child directories, will
be selected. The filename tag encloses the regular expression as text, e.g.:
<FileName>example</FileName>
17
This would match any file containing the text ’example’ in its name. The sec-
ond part of the <FileSet> item is a <Description> item. The <Description> tag
has no attributes, but encloses one or more <Metadata> tags. Each <Metadata>
tag contains one metadata item, i.e. a label to describe the metadata and a corre-
sponding value. The <Metadata> tag has one compulsory attribute: ’name’. This
attribute gives the metadata label to add to the document. Each <Metadata> tag
also has an optional attribute: ’mode’. If this attribute is set to ’accumulate’ then
the value is added to the document, and any existing values for that metadata item
are retained. If the attribute is set to ’set’ or is omitted, then any existing value of
the metadata item will be deleted.
Figure 4 shows an example metadata.xml file. Here, only one file pattern is
found in each file set. However, the Description tag contains a number of separate
metadata items. Note that the Title metadata does not have the mode=accumulate
attribute. This means that when this title is assigned to a document, any existing
Title information will be lost.
2.3.1 collectionInit.xml
This optional file is only used for non-standard, customized collections. It specifies
the class name of the non-standard collection class. The only syntax so far is the
class name:
<collectionInit class="XMLCollection"/>
Section 4.4 describes an example collection where this file is used. Depending
on the type of collection that this is used for, one or both of the other configuration
files may not be needed.
18
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DirectoryMetadata SYSTEM "http://greenstone.org/dtd/DirectoryMetadata
/1.0/DirectoryMetadata.dtd">
<DirectoryMetadata>
<FileSet>
<FileName>ec160e</FileName>
<Description>
<Metadata name="Title">The Courier - No.160 - Nov - Dec 1996 -
Dossier Habitat - Country reports: Fiji , Tonga (ec160e)</Metadata>
<Metadata mode="accumulate" name="Language">English</Metadata>
<Metadata mode="accumulate" name="Subject">Settlements and housing:
general works incl. low- cost housing, planning techniques, surveying,
etc.</Metadata>
<Metadata mode="accumulate" name="Subject">The Courier ACP 1990 - 1996
Africa-Caribbean-Pacific - European Union</Metadata>
<Metadata mode="accumulate" name="Organization">EC Courier</Metadata>
<Metadata mode="accumulate" name="AZList">T.1</Metadata>
</Description>
</FileSet>
<FileSet>
<FileName>b22bue</FileName>
<Description>
<Metadata name="Title">Butterfly Farming in Papua New Guinea
(b22bue)</Metadata>
<Metadata mode="accumulate" name="Language">English</Metadata>
<Metadata mode="accumulate" name="Subject">Other animals (micro-
livestock, little known animals, silkworms, reptiles, frogs,
snails, game, etc.)</Metadata>
<Metadata mode="accumulate" name="Organization">BOSTID</Metadata>
<Metadata mode="accumulate" name="AZList">T.1</Metadata>
<Metadata mode="accumulate" name="Keyword">start a butterfly farm
</Metadata>
</Description>
</FileSet>
</DirectoryMetadata>
19
2.3.2 collectionConfig.xml
The collection configuration file is where the collection designer (e.g. a librarian)
decides what form the collection should take. So far this file only includes the
presentation aspects needed by the run-time system. Instructions for collection
building have yet to be defined. Presentation aspects include collection metadata
such as title and description, display text for indexes, and format statements for
search results, classifiers etc. The format of collectionConfig.xml is still under
consideration. However, Figure 5 shows the parts of it that have been defined so
far.
Display elements for a collection can be entered in any language—use lang=’en’
attributes to specify which language they are in.
The <metadataList> element specifies some collection metadata, such as cre-
ator. The <displayItemList> specifies some language dependent information that
is used for collection display, such as collection name and short description. These
displayItem elements can be specified in different languages.
The <search> element provides some display and formatting information for
the search indexes, while the <browse> element concerns classifiers, and the <display>
element looks at document display.
Inside the <search> and <browse> elements, <displayItem> elements are
used to provide titles for the indexes or classifiers, while <format> elements pro-
vide formatting instructions, typically for a document or classifier node in a list of
results. Placing the <format> instructions at the top level in the search or browse
element will apply the format to all the indexes or classifiers, while placing it inside
an individual index or classifier element will restrict that formatting instruction
to that item.
The <display> element contains optional formatting information for the dis-
play of documents. Templates that can be specified here include documentHeading
and DocumentContent. Other formatting options may also be specified here, such
as whether to display a table of contents and/or cover image for the documents.
Format elements are described in Section 2.4.
An optional <replaceList> element can be included at the top level. This
contains a list of strings and their replacements. This is particularly useful for
Greenstone2 collections that use macros.
The format is like the following:
<replaceList>
<replace scope=’text’ macro="xxx" text="yyy"/>
<replace scope=’metadata’ macro="xxx" bundle="yyy" key="zzz"/>
<replace scope=’all’ macro=’xxx’ metadata=’yyy’/>
</replaceList>
Scope determines on what text the replacements are carried out: text, metadata,
and all (both text and metadata). An empty scope attribute is equivalent to scope=all.
Each replace type can be used with all scope values. Replacing uses Java’s ’String.replaceAll’
functionality, so macro and replacement text are actually regular expressions. The
20
<collectionConfig xmlns:gsf="http://www.greenstone.org/greenstone3/
schema/ConfigFormat" xmlns:xslt="http://www.w3.org/1999/XSL/Transform">
<metadataList>
<metadata name="creator">greenstone@cs.waikato.ac.nz</metadata>
<metadata name="public">true</metadata>
</metadataList>
<displayItemList>
<displayItem name=’name’ lang=’en’>Greenstone3 MG demo collection</displayItem>
<displayItem name=’description’ lang=’en’>This is a demonstration
collection for the Greenstone3 digital library software.</displayItem>
<displayItem name=’icon’ lang=’en’>gs3mgdemo.gif</displayItem>
<displayItem name=’smallicon’ lang=’en’>gs3mgdemo_sm.gif</displayItem>
</displayItemList>
<search>
<index name="ste">
<displayItem name=’name’ lang="en">chapters</displayItem>
<displayItem name=’name’ lang="fr">chapitres</displayItem>
<displayItem name=’name’ lang="es">captulos</displayItem>
</index>
[ ... more indexes ...]
<format>
<gsf:template match="documentNode"><td valign=’top’>
<gsf:link><gsf:icon/></gsf:link></td><td><gsf:metadata name=’Title’/>
</td></gsf:template>
</format>
</search>
<browse>
<classifier name="CL1" horizontalAtTop=’true’>
<displayItem name=’name’ lang=’en’>Titles</displayItem>
</classifier>
[... more classifiers ...]
<classifier name="CL4">
<displayItem name=’name’ lang=’en’>HowTo</displayItem>
<format>
<gsf:template match="documentNode">
<br /><gsf:link><gsf:metadata name=’Keyword’ />
</gsf:link></gsf:template>
</format>
</classifier>
</browse>
<display>
<format>
<gsf:option name="coverImages" value="false"/>
<gsf:option name="documentTOC" value="false"/>
</format>
</display>
</collectionConfig>
21
first example is a straight textual replacement. The second example uses dictionary
lookups. xxx will be replaced with the (language-dependent) value for key zzz in
resource bundle yyy. The third example uses metadata: xxx will be replaced by the
value of the yyy metadata for that document.
Appendix D.2 gives some examples that have been used for Greenstone2 col-
lections.
2.3.3 buildConfig.xml
The file buildConfig.xml is produced by the collection building process. Gener-
ally it is not necessary to look at this file, but it can be useful in determining what
went wrong if the collection doesn’t appear quite the way it was planned.
It contains metadata and other information about the collection that can be de-
termined automatically, such as the number of documents in the collection. It also
includes a list of ServiceRack classes that are required to provide the services
that have been built into the collection. The serviceRack names are Java classes
that are loaded dynamically at runtime. Any information inside the serviceRack
element is specific to that service—there is no set format. Figure 6 shows an ex-
ample. This configuration file specifies that the collection should load up 3 Ser-
viceRacks: GS2Browse, GS2MGPPRetrieve and GS2MGPPSearch. The contents of
each <serviceRack> element are passed to the appropriate ServiceRack objects
for configuration. The collectionConfig.xml file content is also passed to the
ServiceRack objects at configure time—the format and displayItem informa-
tion is used directly from the collectionConfig.xml file rather than added into
buildConfig.xml during building. This enables formatting and metadata changes
in collectionConfig.xml to take effect in the collection without rebuilding being
necessary. However, as these files are cached, the collection needs to be reloaded
for the changes to appear in the library.
22
<buildConfig>
<metadataList>
<metadata name="numDocs">11</metadata>
<metadata name="buildType">mgpp</metadata>
</metadataList>
<serviceRackList>
<serviceRack name="GS2Browse">
<indexStem name="gs2mgppdemo"/>
<classifierList>
<classifier name="CL1" content="Title"/>
<classifier name="CL2" content="Subject" />
<classifier name="CL3" content="Organization" />
<classifier name="CL4" content="Howto" />
</classifierList>
</serviceRack>
<serviceRack name="GS2MGPPRetrieve">
<indexStem name="gs2mgppdemo"/>
<defaultLevel name="Sec" />
</serviceRack>
<serviceRack name="GS2MGPPSearch">
<indexStem name="gs2mgppdemo"/>
<defaultLevel name="Sec" />
<levelList>
<level name="Sec" />
<level name="Doc" />
</levelList>
<fieldList>
<field shortname="ZZ" name="allfields" />
<field shortname="TX" name="text" />
<field shortname="DL" name="dls.Title" />
<field shortname="DS" name="dls.Subject" />
<field shortname="DO" name="dls.Organization" />
</fieldList>
<searchTypeList>
<searchType name="form" />
<searchType name="plain" />
</searchTypeList>
<indexOptionList>
<indexOption name="stemIndexes" value="3"/>
<indexOption name="maxnumeric" value="4"/>
</indexOptionList>
<defaultIndex name="idx" />
<indexList>
<index name="idx" />
</indexList>
</serviceRack>
</serviceRackList>
</buildConfig>
23
is a sample template for formatting a classifier list, to show Keyword metadata as
a link to the document.
<xsl:template match="documentNode" priority="2"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="collName"/>
<td><a href="{$library_name}?a=d&c={$collName}&
d={@nodeID}&dt={@docType}"><xsl:value-of
select="metadataList/metadata[@name=’Keyword’]"/></a>
</td>
</xsl:template>
We can use XSLT to transform XML into XSLT. Greenstone3 provides a sim-
plified set of formatting commands, written in XML, which will be transformed
into proper XSLT. The user specifies a <gsf:template> for what they want to
format—these typically match documentNode or classifierNode (for a node in a
classification hierarchy).
The template above can be represented as:
<gsf:template match=’documentNode’>
<td><gsf:link><gsf:metadata name=’Keyword’/></gsf:link></td>
</gsf:template>
Table 4 shows the set of ’gsf’ (Greenstone Format) elements. If you have
come from a Greenstone2 background, Appendix D.1 shows Greenstone2 format
elements and their equivalents in Greenstone3 .
The <gsf:metadata> elements are used to output metadata values. The sim-
plest case is <gsf:metadata name=’Title’/>—this outputs all the Title metadata
values for the current document or section. Namespaces are important here: if the
Title metadata is in the Dublin Core (dc) namespace, then the element should look
like <gsf:metadata name=’dc.Title’/>. There are three other attributes for this
element. By default, more than one value for the selected metadata is returned,
where multiple exist. The attribute pos is used when a particular value for the
selected metadata is requested (which can be the first, last or nth value). For in-
stance, one document may fall into several classification categories, and therefore
may have multiple Subject metadata values. When all are returned, the multiple
values are separated by commas by default. The separator attribute is used to
change the separating string. For example, adding separator=’: ’ to the element
24
Table 4: Format elements for GSF format language
Element Description
<gsf:text/> The document’s text
<gsf:link>...</gsf:link> The HTML link to the document itself
<gsf:link type=’document’>... Same as above
</gsf:link>
<gsf:link type=’classifier’>... A link to a classification node (use in classifierNode
</gsf:link> templates)
<gsf:link type=’source’>... The HTML link to the original file—set for doc-
</gsf:link> uments that have been converted from e.g. Word,
PDF, PS
<gsf:icon/> An appropriate icon
<gsf:icon type=’document’/> same as above
<gsf:icon type=’classifier’/> bookshelf icon for classification nodes
<gsf:icon type=’source’/> An appropriate icon for the original file e.g. Word,
PDF icon
<gsf:metadata name=’Title’/> All the values of a metadata element for the current
document or section, in this case, Title
<gsf:metadata name=’Title’ A more extended selection of metadata values. The
select=’select-type’ select field can be one of those shown in Table 5.
[separator=’y’ prefix="p" There are some optional attributes: separator gives a
suffix="s" pos=’first|last|n’]/> String that will be used to separate the fields (default
is “, “), prefix and suffix give strings that will be
output before and after each field, and pos can be
set to return either the first, last or nth value for that
metadata at each section.
<gsf:metadata name=’Title’> separator, prefix and suffix can all be specified as el-
<separator>y</separator> ements instead of attributes. This format allows for
<prefix>p</prefix> HTML tags to be used. e.g. <br/> as a separator.
<suffix>s</suffix> </gsf:metadata>
<gsf:metadata name=’Date’ The value of a metadata element for the current doc-
format=’formatDate’/> ument, formatted in some way. Current formatting
options available are listed in Table 6.
<gsf:choose-metadata> A choice of metadata. Will select the first existing
<gsf:metadata name=’metaA’/> one. the metadata elements can have the select, sep-
<gsf:metadata name=’metaB’/> arator and pos attributes like normal.
<gsf:metadata name=’metaC’/>
</gsf:choose-metadata>
<gsf:switch preprocess= switch on the value of a particular metadata - the
’preprocess-type’> metadata is specified in gsf:metadata, has the same
<gsf:metadata name=’Title’/> attributes as normal.
<gsf:when test=’test-type’
test-value=’xxx’>...</gsf:when>
<gsf:when test=’test-type’
test-value=’yyy’>...</gsf:when>
<gsf:otherwise>...</gsf:otherwise>
</gsf:switch>
25
Table 5: Select types for metadata format elements
Select Type Description
parent The immediate parent section
ancestors All the parents back to the root (topmost) section
root The root or topmost section
will separate all values by a colon and a space. Instead of retrieving all values
for a piece of metadata, adding pos=’first’ to the <gsf:metadata> element will
retrieve the first value.
Sometimes you may want to display metadata values for sections other than the
current one. For example, in the mgppdemo collection, in a search list we display
the Titles of all the enclosing sections, followed by the Title of the current section,
all separated by semi-colons. The display ends up looking something like: Farming
snails 2; Starting out; Selecting your snails where Selecting your snails is the Title
of the section in the results list, and Farming snails 2 and Starting out are the Titles
of the enclosing sections. The select attribute is used to display metadata for
sections other than the current one. Table 5 shows the options available for this
attribute. The separator attribute is used here also, to specify the separating text.
To get the previous metadata, the format statement would have the following
in it:
<gsf:metadata name=’Title’ select=’ancestors’ separator=’; ’/>;
<gsf:metadata name=’Title’/>
This will display dc.Title if available, otherwise it will use dls.Title if available,
otherwise it will use the Title metadata. If there are no values for any of these
metadata elements, then nothing will be displayed.
26
Table 7: Formatting options
option name values description
coverImages true, false whether or not to display cover images
for documents
documentTOC true, false whether or not to display the table of
contents for the document
<gsf:switch preprocess=’toLower;stripSpace’>
<gsf:metadata name=’Organization’/>
<gsf:when test=’equals’ test-value=’bostid’>
<!-- output BOSTID image --></gsf:when>
<gsf:when test=’equals’ test-value=’worldbank’>
<!-- output world bank image --></gsf:when>
<gsf:otherwise><!-- output default image--></gsf:otherwise>
</gsf:switch>
Preprocessing of the metadata value is optional. The preprocess types are listed
in Table 6. These operations are carried out on the value of the selected metadata
before the test is carried out. Multiple processing types can be specified, separated
by ; and they will be applied in the order specified (from left to right).
Each option specifies a test and a test value. Test values are just text. Tests
include startsWith, contains, exists, equals, endsWith. Exists doesn’t need
a test value. Having an otherwise option ensures that something will be displayed
even when none of the tests match.
If none of the gsf elements meets your needs for formatting, XSLT can be en-
tered directly into the format element, giving the collection designer full flexibility
over how the collection appears.
The collection specific templates are added into the configuration file collectionConfig.xml.
Any templates found in the XSLT files can be overridden. The important part to
adding templates into the configuration file is determining where to put them. For-
matting templates cannot go just anywhere—there are standard places for them.
Figure 7 shows the positions that templates can occur.
There are also formatting instructions that are not templates but are options.
These are described in Table 7. They are entered into the configuration file like
<gsf:option name=’coverImages’ value=’false’/>
Note, format templates are added into the XSLT files before transforming,
while the options are added into the page source, and used in tests in the XSLT.
27
<collectionConfig>
<metadataList/>
<displayItemList/>
<search>
<format> <!--Put here templates related to searching and
the query page. The common one is the documentNode
template -->
<gsf:template match=’documentNode’>...</gsf:template>
</format>
</search>
<browse>
<classifier name=’xx’>
<format><!-- put here templates related to formating a
particular classifier page. Common ones are documentNode
and classifierNode templates-->
<gsf:template match=’documentNode’>...</gsf:template>
<gsf:template match=’classifierNode’>...</gsf:template>
<gsf:template match=’classifierNode’ mode=’horizontal’>...
</gsf:template>
</format>
</classifier>
<classifier>...</classifier>
<format><!-- formatting for all the classifiers. these will
be overridden by any classifier specific formatting
instructions --></format>
</browse>
<display>
<format><!-- here goes any formatting relating to the display
of the documents. These are generally named templates,
and format options -->
<gsf:template name=’documentContent’>...</gsf:template>
<gsf:option name=’TOC’ value=’true’/>
</format>
</display>
</collectionConfig>
28
2.4.2 Changing the service text strings
Each collection has a set of services which are the access points for the information
in the collection. Each service has a set of text strings which are used to display
it. These include name, description, the text on the submit button, and names and
descriptions of all the parameters to the service.
These text strings are found in .properties files, in $GSDL3HOME/WEB-INF/classes.
The names of the files are based on class names. Subclasses can define their own
properties, or can use their parent class ones. For example, AbstractSearch de-
fines strings for the TextQuery service, in AbstractSearch.properties. GS2MGSearch
just uses these default ones, so doesn’t need its own properties file.
A particular collection can override the properties for any service. For example,
if a collection uses the GS2MGSearch service rack (look in the buildConfig.xml
file for a list of service racks used), and the collection builder wants to change the
text associated with this service, they can put a GS2MGSearch.properties file in
the resources directory of the collection. After a reconfigure of the collection, this
will be used in preference to the one in the default resources directory.
29
about.xsl file into the collection’s transform directory, and this will be used in-
stead. This is what we do for the Gutenberg sample collection.
This also applies to files that are included from other XSLT files. For example
the query.xsl for the query pages includes a file called querytools.xsl. To have
a particular site show a different query interface either of these files may need to
be modified. Creating a new version of either of these and putting it in the site
transform directory will work. Either the new query.xsl will include the default
querytools.xsl, or the default query.xsl will include the new querytools.xsl.
The xsl:include directives are preprocessed by the Java code and full paths added
based on availability of the files, so that the correct one is used.
Note that you cannot include a file with the same name as the including file.
For example query.xsl cannot include query.xsl (it is tempting to want to do
this if you just want to change one template for a particular file, and then include
the default. but you cant).
You can add the argument o=xml to any URL and you wil be returned the XML
before transformation by a stylesheet. This shows you the XML page source. It
can be useful when you are trying to write some new XSLT statements.
30
$GSDL3HOME/WEB-INF/classes. Each interface has one named interface name.properties
(where ’name’ is the interface name, for example, interface default.properties,
or interface gs2.properties). Each service class has one with the same name
as the class (e.g. GS2Search.properties). To add another language all of the
base .properties files must be translated. The translated files keep the same
names, but with a language extension added. For example, a French version of
interface default.properties would be named interface default fr.properties.
Keys will be looked up in the properties file closest to the specified language.
For example, if language fr CA was specified (French language, country Canada),
and the default locale was en GB, Java would look at properties files in the fol-
lowing order, until it found the key: XXX fr CA.properties, XXX fr.properties,
XXX en GB.properties, then XXX en.properties, and finally the default XXX.properties.
These new files are available straight away—to use the new language, add e.g.
l=fr to the arguments in the URL. To get Greenstone to add it in to the list of
languages on the preferences page, an entry needs to be added into the languages
list in the interfaceConfig.xml file (see Section 1.6.2). Modification of this file
requires a restart of the Tomcat server for the changes to be recognized.
31
Library
Servlet
Process
Action ImportCollection
Service GS2Construct
BuildCollection
Receptionist Browse Service
Action ActivateCollection
Service
TextQuery ClassifierBrowse
GS2MGPPSearch demo Service
Service Collection
fao GS2Browse
Collection
MetadataRetrieve MetadataRetrieve
Service Service
ResourceRetrieve
Service ClassifierBrowse ResourceRetrieve
Service Service
GS2MGPPSearch
page generation
]
32
MessageRouters, one for each site.
Collection and ServiceCluster: these are very similar, and group a set of ser-
vices into a conceptual group.. They both provide some metadata about the col-
lection/cluster, and a list of services. The services are provided by ServiceRack
objects that the collection/cluster loads up. A Collection is a specific type of Ser-
viceCluster. A ServiceCluster groups services that are related conceptually, e.g. all
the building services may be part of a cluster. What is part of a cluster is specified
by the site configuration file. A Collection’s services are grouped by the fact that
they all operate on some common data—the documents in the collection. Func-
tionally Collection and ServiceCluster are very similar, but conceptually, and to
the user, they are quite different.
Service: these provide the core functionality of the system e.g. searching, re-
trieving documents, building collections etc. One or more may be grouped into a
single Java class (ServiceRack) for code reuse, or to avoid instantiating the same
objects several times. For example, MGPP searching services all need to have the
index loaded into memory.
Communicator/Server: these facilitate communication between remote mod-
ules. For example, if you want MR1 to talk to MR2, you need a Communicator-
Server pair. The Server sits on top of MR2, and MR1 talks to the Communicator.
Each communication type needs a new pair. So far we have only been using SOAP,
so we have a SOAPCommunicator and a SOAPServer.
Receptionist: this is the point of contact for the ’front end’. Its core function-
ality involves routing requests to the Actions, but it may do more than that. For
example, a Receptionist may: modify the request in some way before sending it to
the appropriate Action; add some data to the page responses that is common to all
pages; transform the response into another form using XSLT. There is a hierarchy
of different Receptionist types, which is described in Section 3.9.3.
Actions: these do the job of creating the ’pages’. There is a different action for
each type of page, for example PageAction handles semi-static pages, QueryAc-
tion handles queries, DocumentAction displays documents. They know a little bit
about specific service types. Based on the ’CGI’ arguments passed in to them, they
construct requests for the system, and put together the responses into data for the
page. This data is returned to the Receptionist, which may transform it to HTML.
The various actions are described in more detail in Section 3.9.
33
been specified in the servlet initiation parameters (see Section 1.4). The appropri-
ate system variables are set for each object (interface name, site name, etc.) and
then configure() is called on both. The MessageRouter handle is passed to the
Receptionist. The servlet then communicates only with the Receptionist, not with
the MessageRouter.
The Receptionist reads in the interfaceConfig.xml file (see Section 1.6.2),
and loads up all the different Action classes. Other Actions may be loaded on
the fly as needed. Actions are added to a map, with shortnames for keys. Eg the
QueryAction is added with key ’q’. The Actions are passed the MessageRouter ref-
erence too. If the Receptionist is a TransformingReceptionist, a mapping between
shortnames and XSLT file names is also created.
The MessageRouter reads in its site configuration file siteConfig.xml (see
Section 1.6.1). It creates a module map that maps names to objects. This is used
for routing the messages. It also keeps small chunks of XML—serviceList, collec-
tionList, clusterList and siteList. These are part of what get returned in response to
a describe request (see Section 3.4.).
Each ServiceRack specified in the configuration file is created, then queried
for its list of services. Each service name is added to the map, pointing to the
ServiceRack object. Each service is also added to the serviceList. After this stage,
ServiceRacks are transparent to the system, and each service is treated as a separate
module.
ServiceClusters are created and passed the <serviceCluster> element for
configuration. They are added to the map as is, with the cluster name as a key.
A serviceCluster is also added to the serviceClusterList.
For each site specified, the MessageRouter creates an appropriate type of Com-
municator object. Then it tries to get the site description. If the server for the re-
mote site is up and running, this should be successful. The site will be added to the
mapping with its site name as a key. The site’s collections, services and clusters
will also be added into the static xml lists. If the server for the remote site is not
running, the site will not be included in the siteList or module map. To try again
to access the site, either Tomcat must be restarted, or a run-time reconfigure-site
command must be sent (see Section 1.7).
The MessageRouter also looks inside the site’s collect directory, and loads
up a Collection object for each valid collection found. If a collectionInit.xml
file is present, a subclass of Collection may be used. The Collection object reads
its buildConfig.xml and collectionConfig.xml files, determines the metadata,
and loads ServiceRack classes based on the names specified in buildConfig.xml.
The <serviceRack> XML element is passed to the object to be used in configura-
tion. The collectionConfig.xml contents are also passed in to the ServiceRacks.
Any format or display information that the services need must be extracted from
the collection configuration file. Collection objects are added to the module map
with their name as a key, and also a collection element is added into the collection-
List XML.
34
3.3 Message passing
There are two types of messages used by the system: external and internal mes-
sages. All messages have an enclosing <message> element, which contains either
one or more requests, or one or more responses. In the following descriptions, the
message element is not shown, but is assumed to be present. Action in Greenstone3
is originated by a request coming in from the outside. In the standard web-based
Greenstone, this comes from a servlet and is passed into the Receptionist. This
“external” type request is a request for a page of data, and contains a represen-
tation of the CGI style arguments. A page of XML is returned, which can be in
HTML format or other depending on the output parameter of the request.
Messages inside the system (“internal” messages) all follow the same basic
format: message elements contain multiple request elements, or multiple response
elements. Messaging is all synchronous. The same number of responses as re-
quests will be returned. Currently all requests are independent, so any requests can
be combined into the same message, and they will be answered separately, with
their responses being sent back in a single message.
When a page request (external request) comes in to the Receptionist, it looks
at the action attribute and passes the request to the appropriate Action module.
The Action will fire one or more internal requests to the MessageRouter, based
on the arguments. The data is gathered into a response, which is returned to the
Receptionist. The page that the receptionist returns contains the original request,
the response from the action and other info as needed (depends on the type of
Receptionist). The data may be transformed in some way — for the Greenstone
servlet we transform using XSLT to generate HTML pages.
Actions send internal style messages to the MessageRouter. Some can be an-
swered by it, others are passed on to collections, and maybe on to services. Internal
requests are for simple actions, such as search, retrieve metadata, retrieve document
text There are different internal request types: describe, process, system, format,
status. Process requests do the actual work of the system, while the other types
get auxiliary information. The format of the requests and responses for each in-
ternal request type are described in the following sections. External style requests,
and their page responses are described in the Section about page generation (Sec-
tion 3.9).
35
If the to field is empty, a request is answered by the MessageRouter. An example
response from a MessageRouter might look like this:
<response lang=’en’ type=’describe’>
<serviceList/>
<siteList>
<site name=’org.greenstone.gsdl1’
address=’http://localhost:8080/greenstone3/services/localsite’
type=’soap’ />
</siteList>
<serviceClusterList>
<serviceCluster name="build" />
</serviceClusterList>
<collectionList>
<collection name=’org.greenstone.gsdl1/
org.greenstone.gsdl2/fao’ />
<collection name=’org.greenstone.gsdl1/demo’ />
<collection name=’org.greenstone.gsdl1/fao’ />
<collection name=’myfiles’ />
</collectionList>
</response>
36
<response from="mgppdemo" type="describe">
<collection name="mgppdemo">
<displayItem lang="en" name="name">greenstone mgpp demo
</displayItem>
<displayItem lang="en" name="description">This is a
demonstration collection for the Greenstone digital
library software. It contains a small subset (11 books)
of the Humanity Development Library. It is built with
mgpp.</displayItem>
<displayItem lang="en" name="icon">mgppdemo.gif</displayItem>
<serviceList>
<service name="DocumentStructureRetrieve" type="retrieve" />
<service name="DocumentMetadataRetrieve" type="retrieve" />
<service name="DocumentContentRetrieve" type="retrieve" />
<service name="ClassifierBrowse" type="browse" />
<service name="ClassifierBrowseMetadataRetrieve"
type="retrieve" />
<service name="TextQuery" type="query" />
<service name="FieldQuery" type="query" />
<service name="AdvancedFieldQuery" type="query" />
<service name="PhindApplet" type="applet" />
</serviceList>
<metadataList>
<metadata name="creator">greenstone@cs.waikato.ac.nz</metadata>
<metadata name="numDocs">11</metadata>
<metadata name="buildType">mgpp</metadata>
<metadata name="httpPath">http://kanuka:8090/greenstone3/sites/
localsite/collect/mgppdemo</metadata>
</metadataList>
</collection>
</response>
37
<param name=’xxx’ type=’integer|boolean|string|invisible’ default=’yyy’/>
<param name=’xxx’ type=’enum_single|enum_multi’ default=’aa’/>
<option name=’aa’/><option name=’bb’/>...
</param>
<param name=’xxx’ type=’multi’ occurs=’4’>
<param .../>
<param .../>
</param>
<!-- this one is for the text box and field list for the
simple field query-->
<param name=’simpleField’ type=’multi’ occurs=’4’>
<param name=’fqv’ type=’string’/>
<param name=’fqf’ type=’enum_single’>
<option name=’TI’/><option name=’AU’/><option name=’OR’/>
</param>
</param>
The type attribute is used to determine how to display the parameters on a web
page or interface. For example, a string parameter may result in a text entry box,
a boolean an on/off button, enum single/enum multi a drop-down menu, where
one or many items, respectively, can be selected. A multi-type parameter indicates
that two or more parameters are associated, and should be displayed appropriately.
For example, in a field query, the text box and field list should be associated. The
occurs attribute specifies how many times the parameter should be displayed on the
page. Parameters also come with display information: all the text strings needed to
present them to the user. These include the name of the parameter and the display
values for any options. These are included in the above parameter descriptions in
the form of <displayItem> elements.
A service description also contains some display information—this includes
the name of the service, and the text for the submit button.
Here is a sample describe request to the FieldQuery service of collection mgp-
pdemo, along with its response. The parameters in this example include their dis-
play information. Figure 9 shows an example HTML search form that may be
generated from this describe response.
<request lang="en" to="mgppdemo/FieldQuery" type="describe" />
38
<response from="mgppdemo/FieldQuery" type="describe">
<service name="FieldQuery" type="query">
<displayItem name="name">Form Query</displayItem>
<displayItem name="submit">Search</displayItem>
<paramList>
<param default="Doc" name="level" type="enum_single">
<displayItem name="name">Granularity to search at</displayItem>
<option name="Doc">
<displayItem name="name">Document</displayItem>
</option>
<option name="Sec">
<displayItem name="name">Section</displayItem>
</option>
<option name="Para">
<displayItem name="name">Paragraph</displayItem>
</option>
</param>
<param default="1" name="case" type="boolean">
<displayItem name="name">Turn casefolding </displayItem>
<option name="0">
<displayItem name="name">off</displayItem>
</option>
<option name="1">
<displayItem name="name">on</displayItem>
</option>
</param>
<param default="1" name="stem" type="boolean">
<displayItem name="name">Turn stemming </displayItem>
<option name="0">
<displayItem name="name">off</displayItem>
</option>
<option name="1">
<displayItem name="name">on</displayItem>
</option>
</param>
<param default="10" name="maxDocs" type="integer">
<displayItem name="name">Maximum documents to return
</displayItem>
</param>
<param name="simpleField" occurs="4" type="multi">
<displayItem name="name"></displayItem>
<param name="fqv" type="string">
<displayItem name="name">Word or phrase </displayItem>
</param>
<param default="ZZ" name="fqf" type="enum_single">
<displayItem name="name">in field</displayItem>
<option name="ZZ">
<displayItem name="name">allfields</displayItem>
</option>
<option name="TX">
<displayItem name="name">text</displayItem>
</option>
<option name="TI">
39
Figure 9: The previous query service describe response as displayed on the search
page.
<displayItem name="name">Title</displayItem>
</option>
<option name="SU">
<displayItem name="name">Subject</displayItem>
</option>
<option name="ORG">
<displayItem name="name">Organization</displayItem>
</option>
<option name="SO">
<displayItem name="name">Source</displayItem>
</option>
</param>
</param>
</paramList>
</service>
</response>
A describe request to an applet type service returns the applet HTML element:
this will be embedded into a web page to run the applet.
<request type=’describe’ to=’mgppdemo/PhindApplet’/>
<response type=’describe’>
<service name=’PhindApplet’ type=’query’>
<applet ARCHIVE=’phind.jar, xercesImpl.jar, gsdl3.jar,
jaxp.jar, xml-apis.jar’
CODE=’org.greenstone.applet.phind.Phind.class’
CODEBASE=’lib/java’
HEIGHT=’400’ WIDTH=’500’>
<PARAM NAME=’library’ VALUE=’’/>
<PARAM NAME=’phindcgi’ VALUE=’?a=a&sa=r&sn=Phind’/>
40
<PARAM NAME=’collection’ VALUE=’mgppdemo’ />
<PARAM NAME=’classifier’ VALUE=’1’ />
<PARAM NAME=’orientation’ VALUE=’vertical’ />
<PARAM NAME=’depth’ VALUE=’2’ />
<PARAM NAME=’resultorder’ VALUE=’L,l,E,e,D,d’ />
<PARAM NAME=’backdrop’ VALUE=’interfaces/default/>
images/phindbg1.jpg’/>
<PARAM NAME=’fontsize’ VALUE=’10’ />
<PARAM NAME=’blocksize’ VALUE=’10’ />
The Phind java applet.
</applet>
<displayItem name="name">Browse phrase hierarchies</displayItem>
</service>
</response>
Note that the library parameter has been left blank. This is because library
refers to the current servlet that is running and the name is not necessarily known
in advance. So either the applet action or the Receptionist must fill in this parameter
before displaying the HTML.
One or more actual requests are specified in system elements. The following
are examples:
<system type=’configure’ subset=’’/>
<system type=’configure’ subset=’collectionList’/>
<system type=’activate’ moduleType=’collection’ moduleName=’demo’/>
<system type=’deactivate’ moduleType=’site’ moduleName=’site1’/>
The first request reconfigures the whole site—the MessageRouter goes through
its whole configure process again. The second request just reconfigures the collectionList—
the MessageRouter will delete all its collection information, and re-look through
the collect directory and reload all the collections again. The third request is to
activate collection demo. This could be a new collection, or a reactivation of an old
41
one. If a collection module already exists, it will be deleted, and a new one loaded.
The final request deactivates the site site1—this removes the site from the siteList
and module map, and also removes any of that sites collections/services from the
static lists.
A response just contains a status message5 , for example:
<status>MessageRouter reconfigured successfully</status>
<status>Error on reconfiguring collectionList</status>
<status>collection:demo activated</status>
<status>site:site1 deactivated</status>
The actual format statements are described in Section 2.4. They are templates
written directly in XSLT, or in GSF (GreenStone Format) which is a simple XML
representation of the more complicated XSLT templates. GSF-style format state-
ments need to be converted to proper XSLT. This is currently done by the Recep-
tionist (but may be moved to an ActionHelper): the format XML is transformed to
XSLT using XSLT with the config format.xsl stylesheet.
42
Table 8: Status codes currently used in Greenstone3
code name code meaning
value
SUCCESS 1 the request was accepted, and the process was completed
ACCEPTED 2 the request was accepted, and the process has been started, but
it is not completed yet
ERROR 3 there was an error and the process was stopped
CONTINUING 10 the process is still continuing
COMPLETED 11 the process has finished
HALTED 12 the process has stopped
INFO 20 just an info message that doesn’t imply anything
status requests can be sent repeatedly to the service to poll the status, using the pid
to identify the process. Status codes are used to identify the state of a process. The
values used at the moment are listed in Table 86 .
The following shows an example status request, along with two responses, the
first a ’OK but continuing’ response, and the second a ’successfully completed’
response. The content of the status elements in the two responses is the output
from the process since the last status update was sent back.
<response from="build/ImportCollection">
<status code="2" pid="2">Collection construction: import collection.
command = import.pl -collectdir /research/kjdon/home/greenstone3/web/sites/
localsite/collect test1
starting
</status>
</response>
<response from="build/ImportCollection">
<status code="11" pid="2">RecPlug: getting directory
/research/kjdon/home/greenstone3/web/sites/localsite/collect/test1/import
WARNING - no plugin could process /.keepme
*********************************************
Import Complete
*********************************************
* 1 document was considered for processing
* 0 were processed and included in the collection
* 1 was rejected. See /research/kjdon/home/greenstone3/web/sites/
localsite/collect/test1/etc/fail.log for a list of rejected documents
Success
</status>
</response>
6
A more standard set of codes should probably be used, for example, the HTTP codes
43
3.8 ’process’-type messages
Process requests and responses provide the major functionality of the system—
these are the ones that do the actual work. The format depends on the service they
are for, so I’ll describe these by service.
Query type services TextQuery, FieldQuery, AdvancedFieldQuery (GS2MGSearch,
GS2MGPPSearch), TextQuery (LuceneSearch) The main type of requests in the
system are for services. There are different types of services, currently: query,
browse, retrieve, process, applet, enrich. Query services do some kind of
search and return a list of document identifiers. Retrieve services can return the
content of those documents, metadata about the documents, or other resources.
Browse is for browsing lists or hierarchies of documents. Process type services are
those where the request is for a command to be run. A status code will be returned
immediately, and then if the command has not finished, an update of the status can
be requested. Applet services are those that run an applet. Enrich services take a
document and return the document with some extra markup added.
Other possibilities include transform, extract, accrete. These types of service
generally enhance the functionality of the first set. They may be used during col-
lection formation: ’accrete’ documents by adding them to a collection, ’transform’
the documents into a different format, ’extract’ information or acronyms from the
documents, ’enrich’ those documents with the information extracted or by adding
new information. They may also be used during querying: ’transform’ a query be-
fore using it to query a collection, or ’transform’ the documents you get back into
an appropriate form.
The basic structure of a service ’process’ request is as follows:
Some requests have other content—for document retrieval, this would be a list
of document identifiers to retrieve. For metadata retrieval, the content is the list of
documents to retrieve metadata for.
Responses vary depending on the type of request. The following sections look
at the process type requests and responses for each type of service.
44
3.8.1 ’query’-type services
Responses to query requests contain a list of document identifiers, along with some
other information, dependent on the query type. For a text query, this includes term
frequency information, and some metadata about the result. For instance, a text
query on ’snail farming’, with the parameter ’maxDocs=10’ might return the first
10 documents, and one of the query metadata items would be the total number of
documents that matched the query.7
The following shows an example query request and its response.
Find at most 10 Sections in the mgppdemo collection, containing the word
snail (stemmed), returning the results in ranked order:
<request lang=’en’ to="mgppdemo/TextQuery" type="process">
<paramList>
<param name="maxDocs" value="10"/>
<param name="queryLevel" value="Section"/>
<param name="stem" value="1"/>
<param name="matchMode" value="some"/>
<param name="sortBy" value="1"/>
<param name="index" value="t0"/>
<param name="case" value="0"/>
<param name="query" value="snail"/>
</paramList>
</request>
45
The list of document identifiers includes some information about document
type and node type. Currently, document types include simple, paged and hierarchy.
simple is for single section documents, i.e. ones with no sub-structure. paged is
documents that have a single list of sections, while hierarchy type documents
have a hierarchy of nested sections. For paged and hierarchy type documents,
the node type identifies whether a section is the root of the document, an internal
section, or a leaf.
The term list identifies, for each term in the query, what its frequency in the
collection is, how many documents contained that term, and a list of its equivalent
terms (if stemming or casefolding was used).
46
Structural info can also be requested in the paramList, and will be returned in
a <nodeStructureInfo> element. (See the section on DocumentStructureRetrieve
messages.) Possible values for info parameters are numSiblings, siblingPosition,
numChildren.
47
</documentNode>
</documentNodeList>
</response>
48
<response from="mgppdemo/DocumentStructureRetrieve" type="process">
<documentNodeList>
<documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2">
<nodeStructureInfo>
<info name="numSiblings" value="2" />
</nodeStructureInfo>
<nodeStructure>
<documentNode nodeID="HASHac0a04dd14571c60d7fbfd"
docType=’hierarchy’ nodeType="root">
<documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4"
docType=’hierarchy’ nodeType="interior">
<documentNode nodeID="HASHac0a04dd14571c60d7fbfd.4.2"
docType=’hierarchy’ nodeType="leaf" />
</documentNode>
</documentNode>
</nodeStructure>
</documentNode>
</documentNodeList>
</response>
49
<request lang=’en’ type=’process’ to=’build/ImportCollection’>
<paramList>
<param name=’collection’ value=’demo’/>
</paramlist>
</request>
<response from="build/ImportCollection">
<status code="2" pid="2">Starting process...</status>
</response>
The code attribute in the response specifies whether the command has been
successfully stated, whether its still going, etc (see Table 8 for a list of currently
used codes). The pid attribute specifies a process id number that can be used when
querying the status of this process. The content of the status element is (currently)
just the output from the process so far. Status messages, which were described
in Section 3.7, are used to find out how the process is going, and whether it has
finished or not.
50
in People Country Reports: Mali ; Western Samoa
</title>
</document>
...
</documentList>
<thesaurusList end=’10’ length=’15’ start=’0’>
<thesaurus df=’7’ id=’12387’ tf=’15’ type=’RT’>
<phrase>PUBLIC HEALTH</phrase>
</thesaurus>...
</thesaurusList>
</phindData>
</appletData>
</response>
51
</documentNodeList>
</response>
52
<param name=’rt’ value=’r’/>
<!-- the rest are the service specific params -->
<param name=’ca’ value=’0’/> <!-- casefold -->
<param name=’st’ value=’1’/> <!-- stem -->
<param name=’m’ value=’10’/> <!-- maxdocs -->
<param name=’q’ value=’snail’/> <!-- query string -->
</paramList>
</request>
There are some standard arguments used in Greenstone, and they are described
in Table 9. These are used by Receptionists and Actions. The GSParams class
specifies all the general basic arguments, and whether they should be saved or not
(Some arguments need to be saved during a session, and this needs to be imple-
mented outside Greenstone proper — currently we do this in the servlet, using
servlet session handling). The servlet has an init parameter params class which
specifies which params class to use: GSParams can be subclassed if necessary. The
Receptionist and Actions must not have conflicting argument names.
Other arguments are used dynamically and come from the Services. Service
arguments must always be saved during a session. Services may be created by dif-
ferent people, and may reside on a different site. There is no guarantee that there
is no conflict with argument names between services and actions. Therefore ser-
vice parameters are namespaced when they are put on the page, whereas interface
(receptionist and action) parameters have no namespace. The default namespace
is s1 (service1) — any parameters that are for the service will be prefixed by this.
For example, the case parameter for a search will be put in the page as s1.case,
and the resulting argument in a search URL will be s1.case. When actions are de-
ciding which parameters need to be sent in a request to a service, they can use the
namespace information.
If there are two or more services combined on a page with a single submit
button, they will use namespaces s1, s2, s3 etc as needed. The s (service) parameter
will end up with a list of services. For example, s=TextQuery,MusicQuery, and
the order of these determines the mapping order of the namespaces, i.e. s1 will
map to TextQuery, s2 to MusicQuery.
53
Argument Meaning Typical values
a action a (applet), q (query), b (browse), p (page), pr (process)
s (system)
sa subaction home, about (page action)
c collection or demo, build
service cluster
s service name TextQuery, ImportCollection
rt request type d (display), r (request), s (status)
ro response only 0 or 1 - if set to one, the request is carried out
but no processing of the results is done
currently only used in process actions
o output type XML, HTML, WML
l language en, fr, zh ...
d document id HASHxxx
r resource id ???
pid process handle an integer identifying a particular process request
the query options on the query form. The pageResponse contains all the data that
has been gathered from the system by the action. The other two elements con-
tain extra information needed by XSLT. Config contains run-time variables such
as the location of the gsdl home directory, the current site name, the name of the
executable that is running (e.g. library)—these are needed to allow the XSLT to
generate correct HTML URLs. Display contains some of the text strings needed in
the interface—these are separate from the XSLT to allow for internationalization.
The following subsections outline, for each action, what data is needed and
what requests are generated to send to the system.
Once the XML page has been put together, the page to return to the user is
created by transforming the XML using XSLT. The output is HTML at this stage,
but it will be possible to generate alternative outputs, such as XML, WML etc.
A set of XSLT files defines an ’interface’. Different users can change the look
of their web pages by creating new XSLT files for a new ’interface’. Just as we
have a sites directory where different sites ’live’ (ie where their configuration file
and collections are located), we have an interfaces directory where the different
interfaces ’live’ (ie their transforms and images are located there). The default
XSLT files are located in interfaces/default/transforms. Collections, sites and other
interfaces can override these files by having their own copy of the appropriate files.
New interfaces have their own directory inside interfaces/. Sites and collections can
have a transform directory containing XSLT files. The order in which the XSLT
files are looked for is collection, site, current interface, default interface.9 [TODO:
describe a bit more?? currently only can get this locally]
9
this currently breaks down for remote sites - need to rethink it a bit.
54
3.9.3 Receptionists
The receptionist is the controlling module for the page generation part of Green-
stone . It has the job of loading up all the actions, and it knows about the message
router it and the actions are supposed to talk to. It routes messages received to the
appropriate action (page-type messages) or directly to the message router (all other
types). Receptionists also do other things, for example, adding to the page received
back from the action any information that is common to all pages.
There are different ways of providing an interface to Greenstone , from web
based CGI style (using servlets) to Java GUI applications. These different inter-
faces require slightly different responses from a receptionist, so we provide several
standard types of receptionist.
Receptionist: This is the most basic receptionist. The page it returns consists
of the original request, and the response from the action it was sent to. Meth-
ods preProcessRequest, and postProcessPage are called on the request and page,
respectively, but in this basic receptionist, they don’t do anything.
TransformingReceptionist: This extends Receptionist, and overwrites postPro-
cessPage to transform the page using XSLT. An XSLT is listed for each action in
the receptionists configuration file, and this is used to transform the page. First,
some display information, and configuration information is added to the page.
Then it is transformed using the specified XSLT for the action, and returned.
WebReceptionist: The WebReceptionist extends TransformingReceptionist. It
doesn’t do much else except some argument conversion. To keep the URLs short,
parameters from the services are given shortnames, and these are used in the web
pages.
DefaultReceptionist: This extends WebReceptionist, and is the default one for
Greenstone3 servlets. Due to the page design, some extra information is needed
for each page: some metadata about the current collection. The receptionist sends
a describe request to the collection to get this, and appends it to the page before
transformation using XSLT.
By default, the LibraryServlet uses DefaultReceptionist. However, there is a
servlet init-param called receptionist which can be set to make the servlet use a
different one.
55
erences pages. These pages are not associated with specific services like the other
page types. In general, the data comes from describe requests to various modules.
The different pages are requested using the subaction argument. For the ’home’
page, a ’describe’ request is sent to the MessageRouter—this returns a list of all
the collections, services, serviceClusters and sites known about. For each collec-
tion, its metadata is retrieved via a ’describe’ request. This metadata is added into
the previous result, which is then added into the page. For the ’about’ page, a
describe request is sent to the module that the about page is about: this may be a
collection or a service cluster. This returns a list of metadata and a list of services.
To get an external html page embedded into a greenstone collection, i.e. a two
frame page, with the top frame containing the collection header and navigation bar,
and the second frame containg the external page, use subaction html. A url would
look like a=p&sa=html&c=collname&url=externalurl
56
pair.
Note that the applet HTML may need to know the name of the library pro-
gram. However, that name is chosen by the person who installed the software and
will not necessarily be “library”. To get around this, the applet can put a parameter
called “library” into the applet data with a null value:
<PARAM NAME=’library’ VALUE=’’/>
When the AppletAction encounters this parameter it inserts the name of the current
library servlet as its value.
57
Table 10: Configure CGI arguments
arg description
a=s system action
sa=c|a|d type of system request: c (configure), a (add/activate),
d (delete/deactivate)
c=demo the request will go to this collection/servicecluster
instead of the message router
ss=collectionList subset for configure: only reconfigure this part.
For the MessageRouter, can be serviceClusterList, serviceList,
collectionList, siteList.
For a collection/cluster, can be metadataList or serviceList.
sn=demo
st=collection
58
Table 11: The utility classes in org.greenstone.gsdl3.util
Utility class Description
CollectionClassLoader ClassLoader that knows about a collection’s resource directory
DBInfo Class to hold info from GDBM database entry
Dictionary wrapper around a Resource Bundle, providing strings with parameters
GDBMWrapper Wrapper for GDBM database. Uses JavaGDBM
GSConstants holds some constants used for servlet arguments and configuration variables
GSEntityResolver an EntityResolver which can be used to find resources such as DTDs
GSFile class to create all Greenstone file paths e.g. used to locate configuration
files, XSLT files and collection data.
GSHTML provides convenience methods for dealing with HTML, e.g. making strings
HTML safe
GSParams contains names and default values for interface parameters
GS2Params a subclass of GSParams which holds default service parameters too, neces-
sary for the gs2 style interface.
GSPath used to create, examine and modify message address paths
GSStatus some static codes for status messages
GSXML lots of methods for extracting information out of Greenstone XML, and cre-
ating some common types of elements. Also has static Strings for element
and attribute names used by Greenstone .
GSXSLT some manipulation functions for Greenstone XSLT
GlobalProperties Holds the global properties (from global.properties)
MacroResolver Used with replace elements in collection configuration files, replaces a
macro or string with another string, metadata or text from a dictionary
GS2MacroResolver MacroResolver for GS2 collections, that uses the GDBM database
Misc miscellaneous functions
MyNodeList A simple implementation of an XML NodeList
OID class to handle Greenstone (2) OIDs
Processing Runs an external process and prints the output from the process
SQLQuery contains a connection to a SQL database, along with some methods for ac-
cessing the data, such as converting MG numbers to and from Greenstone
OIDs.
XMLConverter provides methods to create new Documents, parse Strings or Files into Doc-
uments, and convert Nodes to Strings
XMLTransformer methods to transform XML using XSLT
XSLTUtil contains static methods to be called from within XSLT
59
same way as the standard query services). However, if it is a new type of service
that the interface and actions don’t know about, you willl need to add a new action
or modify an existing one so that your service is actually used.
60
different in the request/response format, then you may need to modify an existing
action to supply or use the new information. If the service is of a completely new
type, then you will probably need a new action to talk to the service and display
the results.
61
There are four levels of customization that may be needed with new collections:
service, collection, interface XSLT, and action levels. We will use the example
collections that come with Greenstone to describe these different levels.
Firstly, new service classes need to be written to provide the functionality to
search/browse/whatever the collection. If the services have similar interfaces and
functionality to the standard services, this may be all that is needed. For example,
MGPP collections were the first to be served in Greenstone3 . When we came
to do MG collections, all we had to do was write some new service classes that
interacted with MG instead of MGPP. Because these collections used the same
type of services, this was all we had to do. The format of the configuration files
was similar, they just specified MG serviceRack classes rather than MGPP ones.
The XML Sample Texts (gberg) collection, however, was done quite differently
to the standard collections. New services were provided to search the database
(built with Lucene) and to provide the documents and parts of documents (using
XSLT to transform the raw XML files). The collectionConfig file had some extra
information in it: a list of the documents in the collection along with their Titles.
Because the standard collection class has no notion of document lists, a new class
was created (org.greenstone.gsdl3.collection.XMLCollection). This class is basi-
cally the same as a standard collection class except that it looks for and stores in
memory the documentList from the collectionConfig file.
To tell Greenstone to load up a different type of collection class, we use an-
other configuration file: etc/collectionInit.xml. This specifies the name of the
collection class to use. Currently, this is all that is specified in that file, but you
may want to add parameters for the class etc.
<collectionInit class="XMLCollection"/>
The display for the collection is also quite different. The home page for the
collection displays the list of documents. To achieve this, the describe response
from the collection had to include the list, and a new XSLT was written for the
collection that displayed this. Collection XSLT should be put in the transform
directory of the collection10 .
Document display is significantly different to standard Greenstone . There are
two modes of display: table of contents mode, and content mode. Clicking on a
document link from the collection home page takes the user to the table of contents
for the collection. Clicking on one of the sections in the table of contents takes
them to a display of that section. To facilitate this, not only do we need new XSLT
files , we also needed a new action. XMLDocumentAction was created, that used
two subactions, toc and text, for the different modes of display.
The Receptionist was told about this new action by the addition of the following
element to the interfaceConfig.xml file:
62
<subaction name=’text’ xslt=’document-content.xsl’/>
</action>
XSLT files are linked to subactions rather than the action as a whole. The
collection supplies the two XSLT files written appropriately for the data it contains.
All links that link to the documents have to be changed to use the xd action
rather than the standard d action. These include the links from the home page, and
the links from query results.
Querying of the collection is almost the same as usual. The query service pro-
vides a list of parameters, does the query and then sends back a list of document
identifiers. The standard query action was fine for this collection. The change oc-
curs in the way that the results are displayed—this is accomplished using a format
statement supplied in the collectionConfig file inside the search node.
<search>
<format>
<gsf:template match="documentNode">
<xsl:param name="collName"/>
<xsl:param name="serviceName"/>
<td>
<b><a href="{$library_name}?a=xd&sa=text&c={$collName}&
amp;d={@nodeID}&p.a=q&p.s={$serviceName}">
<xsl:choose>
<xsl:when test="metadataList/metadata[@name=’Title’]">
<gsf:metadata name="Title"/>
</xsl:when>
<xsl:otherwise>(section)</xsl:otherwise>
</xsl:choose>
</a>
</b> from <b><a href="{$library_name}?a=xd&sa=toc&
c={$collName}&d={@nodeID}.rt&p.a=q&p.s={$serviceName}">
<gsf:metadata name="Title" select="root"/></a></b>
</td>
</gsf:template>
</format>
</search>
Instead of displaying an icon and the Title, it displays the Title of the section
and the title of the document. Both of these are linked to the document: the section
title to the content of that section, the document title to the table of contents for
the document. Because these require non-standard arguments to the library, these
parts of the template are written in XSLT not Greenstone format language. As is
shown here it is perfectly feasible to write a format statement that includes XSLT
mixed in with Greenstone format elements.
The document display uses CSS to format the output—these are kept in the
collection and specified in the collections XSLT files. The documents also specify
DTD files. Due to the way we read in the XML files, Tomcat sometimes has
trouble locating the DTDs. One option is to make all the links absolute links to
files in the collection folder, the other option is to put them in Greenstone ’s DTD
folder $GSDL3SRCHOME/resources/dtd.
63
4.5 The gs2 Interface
The library seen at http://www.greenstone.org/greenstone3/nzdl is like a
mirror to http://www.nzdl.org—it aims to present the same collections, in the
same way but using Greenstone3 instead of Greenstone2 . It uses a new site (nzdl)
with a new interface (nzdl) which is based on the gs2 interface. The web.xml file
had a new servlet entry in it to specify the combination of nzdl site and nzdl inter-
face.
The site was created by making a directory called nzdl in the sites folder. A
siteConfig file was created. Because it is running on Linux, we were able to link to
all the collections in the old Greenstone installation. The convert coll from gs2.pl
script was run over all the collections to produce the new XML configuration files.
The gs2 interface was created to be used by this site (and is now a standard
part of Greenstone). In many cases, creating a new interface just requires the new
images and XSLT to be added to the new directory(see Sections 1.4 and 2.5). This
gs2 interface required a bit more customization.
The standard Greenstone3 navigation bar lists all the services available for the
collection. In Greenstone2 , the navigation bar provides the search option, and
the different classifiers. This is not service specific, but hard coded to the search
and classifiers. The XSLT that produces the navigation bar needed to be altered to
produce this. The standard receptionist (DefaultReceptionist) gathers a little bit of
extra information for each page of XML before transforming it: this is the list of
services for the collection and their display information, allowing the services to
be listed along the navigation bar. This is information that is needed by every page
(except for the library home page) and therefore is obtained by the receptionist
instead of by each action. The nzdl interface uses the classifier list that comes in
the ClassifierBrowse service description to display teh list of classifiers.
The nzdl interface extends the gs2 interface to provide a different looking home
page and an extra static ’gsdl’ page.
64
5 Distributed Greenstone
Greenstone is designed to run in a distributed fashion. One Greenstone installation
can talk to several sites on different computers. This requires some sort of com-
munication protocol. Any protocol can be used, currently we have a simple SOAP
protocol.
more explanation..
Library
Servlet
Service
Service
Service
Receptionist Action
Service
ServiceCluster
Action Action
Service
SOAP MessageRouter
Communicator 3 Collection
MessageRouter
1 Service
SOAP Service
Server Collection
SOAP
Communicator
Service
Service
Service
Service
SOAP
Server
MessageRouter
2
Collection
Collection Collection
Figure 10: A distributed digital library configuration running over several servers
65
5.2 Connecting to a site web service
There are two ways to use a remote site. First, if you have a local site running, then
the site can also connect to other remote sites. In the siteConfig.xml file, you need
to add a site element into the siteList element.
For example, to get siteA to talk to siteB, you need to deploy a SOAP server on
siteB, then add a <site> element to the <siteList> of siteA’s siteConfig.xml
file (in $GSDL3HOME/sites/siteA/siteConfig.xml).
In the <siteList> element, add the following (substituting the chosen site uri
for siteBuri):
<site name="siteBuri"
address="http://localhost:8080/greenstone3/services/siteBuri"
type="soap"/>
(Note that localhost and 8080 should be changed to the values you entered
when installing Greenstone3. Localhost will only work for servers on the smae
machine.).
If you have changed the siteConfig.xml file for a site that is running, it will
need to be reconfigured. Either restart Tomcat, or reconfigure through a URL: e.g.
http://localhost:8080/greenstone3/library?a=s&sa=c. Several sites can be
connected to in this manner.
The second option is if you have a receptionist set up on a machine where
you have no site, and you only want to connect to a single remote site. Instead of
using site name in the servlet initialisation parameters (in $GSDL3HOME/WEB-
INF/web.xml), you can specify remote site name, remote site type and remote site address.
A communicator object will be set up instead of a MessageRouter and the recep-
tionist will talk to the communicator.
66
A Using Greenstone3 from CVS
Greenstone3 is also available via CVS. You can download the latest version of the
code. This is not guaranteed to be stable, in fact it is likely to be unstable. The
advantage of using CVS is that you can update the code and get the latest fixes.
Note that you will need the Java 2 SDK, version 1.4.0 or higher, and Ant
(Apache’s Java based build tool, http://ant.apache.org) installed.
To check out the Greenstone code, use:
cvs -d :pserver:cvs\_anon@cvs.scms.waikato.ac.nz:2402/usr/local/
global-cvs/gsdl-src co -P greenstone3
If you need it, the password for anonymous CVS access is anonymous. Note
that some older versions of CVS have trouble accessing this repository due to the
port number being present. We are using version 1.11.1p1.
Greenstone is built and installed using Ant (Apache’s Java based build tool,
http://ant.apache.org). You will need a Java Development Environment (1.4 or
higher), and Ant installed to use Greenstone. You can download Ant from
http://ant.apache.org/bindownload.cgi. Make sure that the environment vari-
ables JAVA HOME and ANT HOME are set.
In the greenstone3 directory, you can run ’ant’ which will give you a help
message. Running ’ant -projecthelp’ gives a list of the targets that you can run
— these do various things like compile the source code, startup the server etc.
The README.txt file has up-to-date instructions for installing from CVS. Briefly,
for a first time install, run ’ant prepare install’.
The file build.properties contains various parameters that can be set by the
user. Please check these settings before running the installation process. The in-
stall process will ask you if you accept the properties before starting. For a non-
interactive version of the install, run ’ant -Dproperties.accepted=yes install’
To log the output in build.log, run ’ant -Dproperties.accepted=yes -logfile
build.log install’
Compilation includes Java and C/C++. On Windows, you will need to have Vi-
sual Studio or equivalent installed. Please check the compile.windows.c++.setup
property in build.properties — make sure it is set to the setup script of Visual Stu-
dio.
Note: gs3-setup sets the environment variables GSDL3HOME, GSDL3SRCHOME,
CLASSPATH, PATH, JAVA HOME and needs to be done in a shell before doing col-
lection building etc.
To run the library, use the gs3-server.sh/bat shell scripts.
67
B Tomcat
Tomcat is a servlet container, and Greenstone3 runs as a servlet inside it.
The file $GSDL3SRCHOME/packages/tomcat/conf/server.xml is the Tomcat
configuration file. A context for Greenstone3 is given by the file
$GSDL3SRCHOME/packages/tomcat/conf/Catalina/localhost/greenstone3.xml.
This tells Tomcat where to find the web.xml file, and what URL (/greenstone3)
to give it. Anything inside the context directory is accessible via Tomcat11 . For
example, the index.html file that lives in $GSDL3HOME can be accessed through the
URL localhost:8080/greenstone3/index.html. The gs2mgdemo collection’s
images can be accessed through
localhost:8080/greenstone3/sites/localsite/collect/gs2mgdemo/images/.
Greenstone sets up Tomcat to run on port 8080 by default. To change this, you
can edit the tomcat.port property in build.properties. If you do this before installing
Greenstone, then running ’ant install’ will use the new port number. If you want
to change it later on, shutdown tomcat, run ’ant configure’, then when you restart
tomcat it will use the new port.
Note: Tomcat must be shutdown and restarted any time you make changes in
the following for those changes to take effect:
• $GSDL3HOME/WEB-INF/web.xml
• $GSDL3SRCHOME/packages/tomcat/conf/server.xml
• any classes or jar files used by the servlets
On startup, the servlet loads in its collections and services. If the site or col-
lection configuration files are changed, these changes will not take effect until the
site/collection is reloaded. This can be done through the reconfiguration messages
(see Section 1.7), or by restarting Tomcat.
We have disabled following symlinks for the greenstone servlet. To enable it,
edit $GSDL3SRCHOME/packages/tomcat/conf/Catalina/localhost/greenstone3.xml
and set ’allowLinking’ to true.
By default, Tomcat allows directory listings. To disable this, change the ’list-
ings’ parameter to false in the default servlet definition, in Tomcat’s web.xml file
($GSDL3SRCHOME/packages/tomcat/conf/web.xml):
We have set the greenstone context to be reloadable. This means that if a
class or resource file in web/WEB-INF/lib or web/WEB-INF/classes changes, the
servlet will be reloaded. This is useful for development, but should be turned off
for production mode (set the ’reloadable’ attribute to false).
Tomcat uses a Manager to handle HTTP session information. This may be
stored between restarts if possible. To use a persistent session handling manager,
uncomment the <Manager> element in
$GSDL3SRCHOME/packages/tomcat/conf/server.xml. For the default manager,
session information is stored in the work directory:
11
can we use .htaccess files to restrict access??
68
$GSDL3SRCHOME/packages/tomcat/work/Catalina/localhost/greenstone3/SESSIONS.ser.
Delete this file to clear the cached session info. Note that Tomcat needs to be shut-
down to delete this file.
<VirtualHost xx.xx.xx.xx>
ServerName www.greenstone.org
...
ProxyPass /greenstone3 http://puka.cs.waikato.ac.nz:8080/greenstone3
ProxyPassReverse /greenstone3 http://puka.cs.waikato.ac.nz:8080/greenstone3
</VirtualHost>
69
C SOAP
Greenstone uses the Apache Axis SOAP implementation for distributed commu-
nications. Axis runs as a servlet inside Tomcat, and SOAP web services can be
deployed by this Axis servlet. The Greenstone installation process sets up Axis for
Tomcat, but does not deploy any services.
To deploy the SOAP service for localsite, run ant deploy-localsite.
To deploy a SOAP service for other sites, run ant soap-deploy-site
This will prompt you for the sitename (the site’s directory name), and a unique
URI for the site. It creates a new SOAPServer class for the site
($GSDL3SRCHOME/src/java/org/greenstone/gsdl3/SOAPServer<sitename>.java),
creates a resource file for deployment ($GSDL3SRCHOME/resources/soap/<sitename>.wsdd),
and then tries to deploy the service.
Information about deployed services is maintained between Tomcat sessions—
you only need to deploy something once. To undeploy a site, use ant soap-undeploy-site.
The axis services can be accessed at localhost:8080/greenstone3/index.jsp.
The listen port is the port that you want the monitor to be listening on. It should
’act as’ a Listener, with target hostname 127.0.0.1 (localhost), and target port the
port that Tomcat is running on (8080). You need to modify the address used to
talk to the SOAP service. For example, if you want to monitor traffic between
the gateway site and the localsite SOAP server, you will need to edit gateway’s
siteConfig.xml file and change the port number (in the site element) to whatever
you have chosen as the listen port.
For example, in the Admin panel of TCPMonitor the Target Hostname might
be 127.0.0.1, and the Target Port # 8080. Set the Listen Port # to be a different port,
such as 8070 and click Add. This produces a new tab panel where you can see the
messages arriving at port 8070 before being forwarded to port 8080. You then need
to set your test request from your SOAP application to arrive at port 8070 and you
will see copies of the messages in the new tab panel.
70
D Tidying up the formatting for imported Greenstone2
collections
D.1 Format statements: Greenstone2 vs Greenstone3
The following table shows the Greenstone2 format elements, and their equivalents
in Greenstone3
71
Macro resolving using resource bundles and metadata:
<replace scope=’metadata’ macro="_magazines_" bundle="NZDLMacros"
key="Magazines"/>
<replace scope=’all’ macro=’_thisOID_’ metadata=’archivedir’/>
<replace macro="_httpcollimg_"
text="sites/localsite/collect/folktale/index/assoc"/>
These two examples show how to deal with Greenstone2’s external link macros.
The first one is for a ’relative’ external link. In this case, the links are like URL’s
but they actually refer to Greenstone internal documents. So the Greenstone3 link
is to the document, but with parameter s0.ext signifying that the d argument will
need translating before retrieving the content. The second example is a truly ex-
ternal link. This is translated into a HTML type page action, where the URL is
presented as a frame along with the collection header in a separate frame.
Sometimes we need to add in macros to be resolved in a second step:
<replace macro="_iconpdf_" scope="metadata"
text="<img title=’_texticonpdf_’ src=’interfaces/default/images/ipdf.gif’/>"/>
<replace macro="_texticonpdf_" scope="metadata" bundle="interface_gs2"
key="texticonpdf"/>
72