Web Harvest of the 109th Congress (2006) FAQ
- What is the "109th Congress (2006) Web Harvest"?
- How accurate is the harvest?
- What does "harvested" mean?
- Who conducted the harvest?
- How large is the collection?
- Why doesn't form input or streaming video work in the collection?
- Can I search the archive?
- Why isn't the site I'm looking for in the archive?
- What is the "109th Congress (2006) Web Harvest"?
- The 2006 Congressional Web Harvest for the 109th Congress is a National Archives and Records Administration (NARA) project that produced a collection of congressional web sites copied, or harvested, from the World Wide Web between 11/13/06 and 12/11/06.
- How accurate is the harvest?
- The accuracy of each harvest was affected by these factors:
- The completeness of URL source lists,
- Whether URLs resolved successfully, and
- The capabilities of the crawler tools used (see Heritrix at http://crawler.archive.org/) and of the server environments being crawled. See a report on limitations of capabilities.
NARA has made every reasonable effort to ensure that web sites' code and programming were captured accurately. NARA is not responsible for any web site's compliance with Federal laws, regulations, and requirements. NARA is responsible for providing public access to these copied web sites but is not responsible for maintaining links, accessibility features, search, site maps, or other functionality that the sites may have had before they were copied.
- What does "harvested" mean?
- Web harvesting is the process of automatically copying and organizing unstructured information from pages and data on the World Wide Web. It is also known as web mining, web scraping, and web crawling. Web sites are identified in a "seed list" of URLs, which are then "harvested" so that content within, or linked to, an identified site is captured and copied.
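As a rough illustration of how a seed list drives a harvest, the following Python sketch walks a small in-memory set of pages. It is a toy model and not the Heritrix crawler or the procedure actually used for this collection. The URLs, the TOY_WEB table, and the harvest function are hypothetical, and the scope rule (capture pages on a seed's host plus pages they link to, but do not follow links any further from other hosts) is an assumption made for the example.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A URL that is absent from the table stands for one that does not resolve.
TOY_WEB = {
    "http://house.example/": [
        "http://house.example/members",
        "http://other.example/",
        "http://house.example/missing",   # dead link, not in the table
    ],
    "http://house.example/members": ["http://house.example/"],
    "http://house.example/orphan": [],    # exists, but no page links to it
    "http://other.example/": ["http://far.example/"],
    "http://far.example/": [],
}


def harvest(seed_list, fetch_links):
    """Breadth-first capture of pages reachable from the seed URLs.

    fetch_links(url) returns the list of links on a page, or None when
    the URL cannot be resolved.
    """
    seed_hosts = {urlparse(url).netloc for url in seed_list}
    captured = []
    seen = set(seed_list)
    queue = deque(seed_list)
    while queue:
        url = queue.popleft()
        links = fetch_links(url)
        if links is None:
            continue  # URL did not resolve, so nothing is captured
        captured.append(url)
        if urlparse(url).netloc not in seed_hosts:
            continue  # linked page is captured, but its own links are not followed
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return captured


captured = harvest(["http://house.example/"], TOY_WEB.get)
```

In this sketch the orphan page is never captured because no harvested page links to it, and the dead link is skipped because it does not resolve. Both cases mirror reasons a site can be absent from a harvest.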
- Who conducted the harvest?
- NARA contracted CACI-ISS to manage the project, while the Internet Archive (IA), a San Francisco nonprofit, performed the harvest.
- How large is the collection?
- The harvest collection contains approximately 242 GB of information and roughly 4,291,840 downloaded files active between 11/11/06 and 12/11/06.
- Why doesn't form input or streaming video work in the collection?
- A harvest engine cannot read or use forms, video, or JavaScript. As a result, forms and databases are not active in the harvest, and files that can only be streamed from a web site have not been harvested.
- Can I search the archive?
- Yes, by:
- Entering a search term, which searches the combined House and Senate harvests, or
- Browsing from the House or Senate home pages.
- Why isn't the site I'm looking for in the archive?
- Sites were not harvested if:
- they were not linked to one of the supplied URLs,
- they were password protected, or
- the harvest engine could not find or access them.
(Note: Harvest engines do not capture dynamic web content. See a report on limitations of capabilities.)
