A search engine is a program that searches a dataset. On the World Wide
Web, search engines are most often used to search databases of HTML
documents gathered by robots or spiders.
The Verity Ultraseek server is the state search engine. A spider associated
with this server crawls all state agency Web sites overnight, during off-peak times.
In addition, customized interfaces, such as the Bridges search interface to Minnesota
Environmental Agencies, allow searching of portions of the state's Web sites.
You can add a search capability directly to your Web page.
Full text information on state agency Web servers is spidered and
deposited in the search engine database. When queried, the search engine
retrieves information from the database to find items matching your query.
Information stored within discrete Dublin Core elements helps match search
queries with corresponding information located in the resource. The full
text of the resource is searched and results are displayed according to
the quality of the match. This quality is determined by how often words in
the resource match the search query and by matches in the title of the
document and in the URL.
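The ranking idea described above can be illustrated with a rough sketch. It counts how often query words appear in the full text and adds a bonus for matches in the title and the URL. The function names and the bonus weights are hypothetical choices made for illustration; the actual Ultraseek scoring formula is not described here and is likely more involved.

```python
import re

# Hypothetical weights, chosen only for illustration.
TITLE_BONUS = 5
URL_BONUS = 3


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_quality(query, title, url, body):
    """Score a document: repeated body matches plus title and URL bonuses."""
    terms = set(tokenize(query))
    body_hits = sum(1 for word in tokenize(body) if word in terms)
    title_hits = sum(1 for word in tokenize(title) if word in terms)
    url_hits = sum(1 for word in tokenize(url) if word in terms)
    return body_hits + TITLE_BONUS * title_hits + URL_BONUS * url_hits


def rank(query, documents):
    """Return documents (dicts with title, url, body) ordered best match first."""
    return sorted(
        documents,
        key=lambda d: match_quality(query, d["title"], d["url"], d["body"]),
        reverse=True,
    )
```

Under these assumed weights, a page whose title and URL contain the query words is listed ahead of a page that only mentions them in its text.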
The Ultraseek spider searches the public areas of Minnesota State Government
Web sites. Only parts of the Web site that are connected to the root URL
will be searched; the spider will not crawl through the network. This root
URL would look like www.agencyname.state.mn.us. To exclude parts of a
site, such as test or administrative areas, use a robots.txt file.
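As a minimal sketch, a robots.txt file that keeps robots out of test and administrative areas might contain the three directive lines held in ROBOTS_TXT below. The /test/ and /admin/ paths are hypothetical examples. The check uses Python's standard urllib.robotparser module only to show how a well-behaved robot would read such a file; it is not the Ultraseek spider's own code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the excluded paths are examples only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test/
Disallow: /admin/
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if the given robots.txt rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

The file itself is placed at the root of the Web site, for example www.agencyname.state.mn.us/robots.txt, so that robots can find it before crawling.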
Currently, the Ultraseek robot is not spidering Web pages generated
from databases, network files or Common Gateway Interface (CGI) scripts.
Non-Web networks will also not be spidered for HTML pages. For those with
Intranets, use of a spider is not possible beyond password
What if I add new resources and I want the spider to include them?
Use the ADD URL feature.
This function allows users (usually Web developers) to add a URL to
the collection for indexing. This URL must match a set of patterns already
in place for a given collection -- Environmental Information or State of
Minnesota. For example, in the State of Minnesota collection, a URL must
contain the string "state.mn.us". There are exceptions, however,
such as URLs that have .org, which are handled separately. The advantage
of using ADD URL is that it allows URLs of new pages or those with major
changes to be submitted for almost immediate processing.
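The pattern check described above might look roughly like the following sketch. Only the "state.mn.us" rule for the State of Minnesota collection comes from the text; the table layout and the function name are assumptions, and the separately handled exceptions (such as .org URLs) and the Environmental Information patterns are not modeled because they are not specified here.

```python
# Assumed pattern table. Only the "state.mn.us" rule is taken from the text;
# patterns for other collections are not listed in this document.
COLLECTION_PATTERNS = {
    "State of Minnesota": ["state.mn.us"],
}


def accepts_url(collection, url):
    """Return True if `url` contains one of the collection's required strings."""
    patterns = COLLECTION_PATTERNS.get(collection, [])
    return any(pattern in url for pattern in patterns)
```

A URL that fails this check would not be accepted for the collection unless it falls under one of the separately handled exceptions.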
The spider will gather a new URL or resource on its next
visit to your site, usually within 24 hours.
on how to direct search engine robot navigation within your Web site
The following resources can be found through a State of Minnesota search:
Portable Document Format (PDF)
Geographic Information System (GIS)
The following settings are currently in use:
Disallow URLs to CGI scripts
Maximum number of directories in a URL = 10
Maximum number of hops from root URL = 100
Languages allowed = any
Documents are considered duplicates if they are identical or have identical metadata
Documents have higher relevancy ranking if the search words are found in the metadata, including Dublin Core elements.
Weighted elements include: Title, Description, Subject/Keywords, Alt Attribute, Remote Anchors
The minimum revisit interval is one day; the maximum is 32 days. The spider tunes itself according to how frequently a Web page is updated.
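To see how the URL limits in these settings interact, here is a minimal sketch that applies three of them: no CGI scripts, at most 10 directories in a URL, and at most 100 hops from the root URL. The way a CGI URL is recognized (a /cgi-bin/ directory or a .cgi suffix) and the way directories are counted are assumptions made for illustration; the spider's actual tests are not documented here.

```python
from urllib.parse import urlparse

MAX_DIRECTORIES = 10  # from the settings above
MAX_HOPS = 100        # from the settings above


def directory_depth(url):
    """Count directories in the URL path, ignoring a trailing file name."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # Assumption: the last segment is a file unless the path ends with "/".
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


def looks_like_cgi(url):
    """Assumed heuristic for CGI URLs: a cgi-bin directory or .cgi suffix."""
    path = urlparse(url).path.lower()
    return "/cgi-bin/" in path or path.endswith(".cgi")


def spider_would_accept(url, hops_from_root):
    """Apply the CGI, directory-depth, and hop-count limits to one URL."""
    return (
        not looks_like_cgi(url)
        and directory_depth(url) <= MAX_DIRECTORIES
        and hops_from_root <= MAX_HOPS
    )
```

For example, a page two directories deep and two links from the home page passes all three checks, while any URL under /cgi-bin/ is rejected regardless of depth.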
The State of Minnesota Thesaurus is a comprehensive, cross-indexed set of subjects intended
for search assistance. It is based on current vocabularies, including the:
Legislative Indexing Vocabulary
Minnesota GIS Community
The thesaurus allows like communities to use a common
vocabulary when describing and detailing Web resources.