Web Harvest FAQ
- What is the "2004 Presidential Term Web Harvest"?
- The 2004 Presidential Term Web Harvest is a National Archives and Records Administration (NARA) project that produced a collection of federal web sites copied, or harvested, from the World Wide Web between 10/14/04 and 11/19/04. The Heritrix web harvester (http://crawler.archive.org/) and a list of 982 active and unrestricted second-level URLs were used to capture all linked federal sites down to the fourth level. Those initial 982 ".gov" and ".mil" URLs were provided by the U.S. General Services Administration's (GSA) ".GOV" Internet Domain Registry and the Defense Information Systems Agency (DOD/DISA).
- What does "harvested" mean?
- Web harvesting is the process of automatically copying and organizing unstructured information from pages and data on the World Wide Web. It is also known as web mining, web scraping and web crawling. Web sites are identified with a "seed list" of URLs, which are "harvested" so that content within, or linked to, an identified site is captured and copied.
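The idea of starting from a seed list and following links to a limited depth can be illustrated with a minimal sketch. This is not how Heritrix is implemented or configured; it is a plain breadth-first traversal over a hypothetical in-memory link map. The function name `harvest`, the `links` dictionary, and the choice to count a seed URL as level 1 are all assumptions made for illustration.

```python
from collections import deque

def harvest(seeds, links, max_depth=4):
    """Breadth-first walk from the seed URLs, stopping at max_depth levels.

    `links` is a hypothetical map of URL -> list of URLs it links to.
    Assumption: a seed URL counts as level 1.
    Returns a dict of captured URL -> level at which it was first reached.
    """
    captured = {}
    queue = deque((url, 1) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        # Skip anything already captured or beyond the depth limit.
        if url in captured or depth > max_depth:
            continue
        captured[url] = depth
        for target in links.get(url, []):
            queue.append((target, depth + 1))
    return captured

# Hypothetical link chain: each page links one level deeper.
links = {
    "a.gov": ["a.gov/1"],
    "a.gov/1": ["a.gov/2"],
    "a.gov/2": ["a.gov/3"],
    "a.gov/3": ["a.gov/4"],
}
result = harvest(["a.gov"], links)
```

In this toy example the page "a.gov/4" sits at level 5 and is therefore not captured, which mirrors how pages many "clicks" away from a seed can fall outside a depth-limited harvest.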
- Who conducted the harvest?
- NARA contracted Information Systems Support (ISS) of Gaithersburg, Md., to manage the project, while Internet Archive (IA), a San Francisco nonprofit, performed the harvest.
- How large is the collection?
- The harvest collection contains approximately 6.5 terabytes of information (roughly 75 million web pages) and represents about 50,000 unrestricted ".gov" and ".mil" federal web sites active between 10/14/04 and 11/19/04.
- Why doesn't form input or streaming video work in the collection?
- Can I search the archive?
- Yes, if you know the name of the agency or the web address of the site you're looking for, you may:
- Search, by entering a web site address (with or without the "www"), among 50,000 web sites, or
- Browse an index listing approximately 982 web sites.
The difference between the number of web sites in the search versus the browse list is due to how the URLs were identified. The 982 URLs were identified prior to the harvest from the list provided by GSA and DISA. The 50,000 sites were identified by the harvester as sites linked to the original 982.
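Accepting an address "with or without the www" implies that the typed address is reduced to a common form before it is matched. How the archive's search actually does this is not documented here; the following is only a sketch of one possible approach using Python's standard `urllib.parse`. The function name `normalize_site` is hypothetical.

```python
from urllib.parse import urlsplit

def normalize_site(address):
    """Reduce a typed web address to a bare lowercase host name.

    Assumption: matching is done on the host name only, ignoring
    the scheme, any path, and a leading "www." label.
    """
    address = address.strip()
    # urlsplit only recognizes the host when a scheme is present.
    if "://" not in address:
        address = "http://" + address
    host = urlsplit(address).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host

with_www = normalize_site("www.archives.gov")
without_www = normalize_site("archives.gov")
full_url = normalize_site("http://WWW.Archives.gov/research/")
```

Under these assumptions all three inputs reduce to the same host name, so either form of the address would find the same site.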
- Why isn't the site I'm looking for in the archive?
- Sites were not harvested because:
- they were not on the GSA- or DISA-provided list of URLs,
- they were not linked to one of those supplied URLs,
- they were password protected,
- the harvest engine could not find or access them, or
- they were several "clicks" into a large web site and could not be harvested because of the project's time limitations.
- Why do I see the same site listed more than once?
- The GSA .GOV registration listed some sites more than once when there were variations in the spelling of a URL. In those cases, because the harvested content of those sites could not be adequately compared, some sites may have been captured and retained in the collection as multiple copies.
- What does the web site title listed in the index represent?
- The "titles" of the web sites are based on the names of the web sites as shown on their home pages, or were automatically taken from the web site's computer code that included the title name. In both cases, the title is that which the agency gave the web site.