The Web Archive Collection Service (WAX) is used by Harvard curators to harvest and archive selected web sites for purposes of teaching and research. This Help Guide describes the WAX public interface, which lets users browse and search the contents of these archived web sites. For more information about this service, see the About WAX page on the WAX site.

Need help? To report a problem or ask a question about WAX, use this feedback form.

Browser Setup


Browser support. The WAX public interface works best in modern browsers (version 6+ of Internet Explorer or current versions of other popular browsers such as Firefox, Safari or Opera). If you experience problems viewing archived content in Internet Explorer, you may have better luck using Firefox.

JavaScript support must be enabled in your browser to use WAX successfully.

Language of presentation. English is the default language of presentation in the WAX user interface. To assist users of the Constitutional Revision in Japan archive, Japanese translations of introductory text and menu choices are available. To change the presentation to Japanese, click the Japanese flag icon. To change back to an English presentation, click the US flag icon.

Character set support. Some archived web sites may contain content in non-Western languages (for example, Japanese). WAX uses Unicode UTF-8 encoding to express characters in these languages. To display and search WAX collections in non-Western languages, your browser's character encoding must be set to UTF-8.

    • Setting character encoding in Firefox:

From the View menu, select Character Encoding > More Encodings > Unicode > Unicode (UTF-8).

    • Setting character encoding in Internet Explorer:

From the View menu, select Encoding > More > Unicode (UTF-8).

To enter non-Western characters into the WAX search form, use the character input methods that are available on your computer.
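Why the browser's encoding setting matters can be illustrated with a minimal Python sketch. It is not part of WAX; the helper names `utf8_bytes` and `misread` are hypothetical, and Latin-1 simply stands in for any wrong browser setting. The same UTF-8 bytes read back correctly as UTF-8 but turn into garbled characters under another encoding.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte sequence for a string."""
    return text.encode("utf-8")


def misread(text, wrong_encoding="latin-1"):
    """Simulate a browser reading UTF-8 bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


sample = "憲法"  # "constitution" in Japanese

# Each of these two characters takes three bytes in UTF-8.
print(len(utf8_bytes(sample)))

# Decoding with the right encoding restores the original text.
print(utf8_bytes(sample).decode("utf-8") == sample)

# Decoding with the wrong encoding yields different, garbled characters.
print(misread(sample) == sample)
```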

Searching


WAX offers full text keyword searching of its archived web content, including text and links within web pages and Acrobat PDF files.

From the WAX home page, you can search one archived collection or search across several collections. If you drill down into a single web site collection, you can search the entire collection or select and search an individual archived web site.

Basic searching

    • Use a space or a plus sign (+) between words to find pages containing both words:

espp concentration

+podcast +blog

    • Use a dash (-) in front of a word to exclude that word from results:

supernova -hubble

    • Use double quotes to search a phrase:

"conservative voice"

No wildcard search options are available at this time.
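The basic syntax above can be sketched as a small Python function that sorts the parts of a query into phrases, required words, and excluded words. This only illustrates how the operators read; the function name `parse_query` is hypothetical and this is not the parser WAX itself uses.

```python
import re


def parse_query(query):
    """Split a basic query into quoted phrases, required words and excluded words."""
    # Double-quoted text is treated as a phrase.
    phrases = re.findall(r'"([^"]+)"', query)
    # Remove the quoted parts, then look at the remaining words.
    remainder = re.sub(r'"[^"]*"', " ", query)

    required, excluded = [], []
    for token in remainder.split():
        if token.startswith("-") and len(token) > 1:
            # A leading dash excludes the word.
            excluded.append(token[1:])
        else:
            # A leading plus sign is optional; a space means the same thing.
            word = token.lstrip("+")
            if word:
                required.append(word)
    return {"phrases": phrases, "required": required, "excluded": excluded}


print(parse_query("espp concentration"))
print(parse_query("+podcast +blog"))
print(parse_query("supernova -hubble"))
print(parse_query('"conservative voice"'))
```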

Advanced searching

WAX offers a few advanced options that can help you search for specific file types or URLs contained within an archive. These are called fielded searches.

The format of a fielded search is [field name]:[term], e.g., type:application/pdf. Available fields are:

    • type (MIME type) - limit a search to files of a particular MIME type, which indicates the type of information a file contains. For example:

type:application/pdf limits a search to documents in PDF format
type:text/html limits a search to web pages

See the list of common MIME types.

    • url - limit a search to words contained in a URL. For example:

url:harvard limits results to web pages with a URL containing the word "harvard"

    • exacturl - limit a search to a particular URL. For example:

exacturl:http://professorkim.blogspot.com/2007/02/andrea-mitchell-will-not-testify.html

    • site (domain search) - limit a search to content within a specific domain (that is, the first part of an internet address, before the first forward slash). For example:

site:www.harvard.edu limits a search to the Harvard domain

Note that the protocol (the http:// part of the URL) should not be included when using site.

It is possible to combine fielded searches with each other as well as with full text searches in one search query, for example:

ministry type:application/pdf
     will find any PDF file containing the keyword ministry.

"conservative voice" site:www.adamsweb.us
     will find content within the specified site that contains the phrase "conservative voice".

Be sure to put a space between each component in your search.
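As a rough illustration of these rules, the following Python sketch assembles a query string from keywords and fielded terms. The function `build_query` is a hypothetical helper, not something WAX provides. It separates components with spaces and removes the protocol from a site value, as described above.

```python
import re

# The four field names listed in this guide.
FIELDS = ("type", "url", "exacturl", "site")


def build_query(keywords="", **fields):
    """Combine full text keywords and fielded terms into one search string."""
    parts = [keywords] if keywords else []
    for name, term in fields.items():
        if name not in FIELDS:
            raise ValueError(f"unknown field: {name}")
        if name == "site":
            # The protocol (for example http://) must not be included with site.
            term = re.sub(r"^[a-zA-Z]+://", "", term)
        parts.append(f"{name}:{term}")
    # Each component is separated by a single space.
    return " ".join(parts)


print(build_query("ministry", type="application/pdf"))
print(build_query('"conservative voice"', site="http://www.adamsweb.us"))
```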

Search results

WAX search results display in order of relevance (with most relevant hits at the top).

Search results will be limited to a single relevant hit from each of the individual web site domains within the scope of your search. WAX imposes this limitation to prevent hits in one domain from overwhelming your search results.
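The effect of this limitation can be sketched in Python as follows. This guide does not describe how WAX implements it, so the function `one_hit_per_domain` and the sample URLs are assumptions for illustration only: given results already sorted by relevance, only the first hit from each domain is kept.

```python
from urllib.parse import urlparse


def one_hit_per_domain(ranked_urls):
    """Keep the first (most relevant) hit for each domain, preserving order."""
    seen = set()
    kept = []
    for url in ranked_urls:
        domain = urlparse(url).netloc
        if domain not in seen:
            seen.add(domain)
            kept.append(url)
    return kept


# Hypothetical result list, most relevant first.
hits = [
    "http://fas.harvard.edu/page-a",
    "http://fas.harvard.edu/page-b",
    "http://mcb.harvard.edu/page-c",
]
print(one_hit_per_domain(hits))
```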

You have several options for viewing an archived page from search results:

    • Click latest archived version to view the latest archived version of the page.

    • Click all archived versions to select from a list of all archived versions of the page.

    • Click more hits for [domain] to view more search results within the specified domain.

The screen shot below illustrates the results of a search within an archive of Harvard departmental web sites. The results include a single hit from each Harvard domain (fas.harvard.edu, mcb.harvard.edu, etc.).

Viewing an Archived Web Site


WAX collections usually offer multiple archived versions of a web site, distinguished by the date the site was crawled. When you select an archived version, you will be viewing the web site as it appeared on that crawl date. The number of archived versions available depends on the harvest schedule for each web site. Note that there is a delay of at least three months between when a web site is harvested and when it will display in WAX.

Important things to watch out for when you are viewing an archived web site:

    • Wait for the archived page to load before clicking links or buttons on the page. Clicking before the page has loaded may lead you to the live web site.

    • Some content in an archived web site may not work as expected. Common examples are request forms, search boxes, and drop-down menus.

    • Individual parts of an archived web page (for example, advertisements or images) may be missing because the content was out of scope, was prohibited by robot exclusion rules, or otherwise could not be harvested. The message "section not archived" will display in place of this missing content.

For additional hints about viewing archived sites, see the Known Issues part of this guide.

Citing an Archived Web Site


Please remember to cite the use of WAX archived content in your work. To assist you, WAX offers a "Cite This Resource" option that produces a ready-made citation in three styles (APA, Chicago and MLA).

Click on the "Cite This Resource" option to view and copy the WAX citation of your choice. This option appears on the collection's home page, the individual archived web site description page, and in the upper frame when you are viewing archived pages.

Consult these links to learn more about citation styles:

American Psychological Association (APA)

Chicago Manual of Style (Chicago)

MLA Style Manual and Guide to Scholarly Publishing

Known Issues


Web archiving technology is still in development with improvements being made continually. Currently, some of the original functionality found in web sites may not be preserved or may not display properly in the archived version of the sites. Issues that we have identified are listed below (with workarounds when we know them).

    • Blog content

If the navigation bar displays properly on the left hand side of the page, but the right hand side is blank, try scrolling down the page until content appears below the navigation bar.

    • Blog comments

Clicking on a blog comment may result in a "Not in Archive" message. Try locating the post by title on the left or right navigation bar. When the post is displayed by title, all associated comments should appear below the content of the post.

    • Links

Links that are not in <a href> HTML tags may unintentionally link out to the live web. This is a display problem related to harvesting dynamic web content that will be addressed in the future.

    • Images

Some images may not display. This may be due to harvesting restrictions on the original live site.

    • Expand / Collapse menus

Expand / Collapse menus may not work properly. This is a display problem related to harvested dynamic web content that will be addressed in the future.

    • Drop down menus

If you find that drop down menus do not work correctly, note that this is a display problem related to harvested dynamic web content that will be addressed in the future.

    • Non-Latin text

You may encounter garbled text when viewing archived pages that use non-Latin characters. In this case the text does not display as it does on the live site. This is a WAX presentation problem that occurs when presenting non-Latin text that is not encoded in Unicode. You can try correcting this on a page-by-page basis by changing the character encoding in your browser. We are working on resolving this issue.

    • Video or audio not captured

Video or audio resources that visually are part of the web site but are actually links to outside content (such as YouTube or content from another site) may not get captured correctly by the WAX harvester.

About WAX Harvesting


To archive web content, WAX uses Heritrix, a web crawler designed by the Internet Archive. The WAX crawler is a program that browses web sites and copies their content for the WAX archive.

The WAX crawler is called: hul-wax

The WAX crawler will obey all common instructions in robots.txt files. You may specifically instruct our crawler to harvest material from your site or not to harvest material from your web site by updating your robots.txt file to include us. The robots.txt file must be placed at the root of your server. More information about robots.txt files can be found at: http://www.robotstxt.org/robotstxt.html.

Allowing the WAX crawler. The following text added to the robots.txt file will allow our harvester to crawl your web site:

   User-agent: hul-wax
   Disallow:

Prohibiting the WAX crawler. The following text added to the robots.txt file will prevent our harvester from crawling your web site:

   User-agent: hul-wax
   Disallow: /
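If you want to check how a set of robots.txt rules will be read for the hul-wax user agent, one option is Python's standard `urllib.robotparser` module. The sketch below is not the WAX crawler itself, and its reading of the rules may differ in edge cases from Heritrix; the helper name `wax_may_crawl` and the example URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def wax_may_crawl(robots_txt, url, agent="hul-wax"):
    """Return True if the given robots.txt text lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# An empty Disallow line permits everything for the named agent.
allow_rules = "User-agent: hul-wax\nDisallow:\n"

# "Disallow: /" blocks the whole site for the named agent.
block_rules = "User-agent: hul-wax\nDisallow: /\n"

page = "http://www.example.org/faculty/index.html"  # placeholder URL
print(wax_may_crawl(allow_rules, page))
print(wax_may_crawl(block_rules, page))
```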

Information for Copyright Owners


If you own or control copyrighted content available in WAX and wish it to be taken down, please let us know. To make a takedown request or inquire about inclusion of your content in WAX, use the WAX feedback form. In your submission, please identify the URL(s) of the web page(s) carrying your content, the date(s) and time(s) of archiving, the specific content on the page(s) to which you claim rights, and the nature of your rights, e.g.:

http://www.school.edu/faculty archived January 1, 2009 at 12:00 AM, photograph of teachers, creator Jane Doe, photograph registered for copyright.
