In this guest blog post, Mark Crookston from the National Library shares some thoughts on the ongoing work to archive New Zealand's online heritage.
Collecting the New Zealand Web
Did you know the National Library of New Zealand collects websites? We’ve been doing it since 1999.
We collect websites because we’re in the business of societal memory: the information created in New Zealand that provides social, cultural and economic benefits and enables new knowledge to be created. An increasing amount of this activity now takes place online, and the web collecting programme is the National Library’s response to this information shift. It forms part of a broader function of collecting and preserving traditional and digital items, whether published, unpublished or online, created in New Zealand and by New Zealanders.
Check out the Web Standards website from 2009, which was the precursor to this site.
Selective Web Archiving
We have two web collecting streams. The first is the selective harvest stream, in which a small team picks and chooses which websites we archive. The selection is based on a range of criteria and judgements relating to:
- events
- existing subject priorities of the Library
- use of the web as evidence of New Zealand’s digital transformation.
We use the open source Web Curator Tool. So far we have archived over 14,000 websites, which are preserved in our National Digital Heritage Archive. Access to these is available through the search function on the National Library of New Zealand website.
The Whole of .nz Domain Programme
The second stream is the whole of the .nz domain harvest, which is done around every two years. The purpose of this programme is to get a snapshot of the NZ web, which can be used in the future for a range of research relating to New Zealand’s digital transformation and online interaction that we can’t even think of now. We started the whole of domain web programme in 2008, and harvested 105 million URLs at about 4TB. We followed up in 2010 and harvested 130 million URLs at about 8TB. We’re just now finishing the next one, and it looks like we will take in over 190 million URLs at about 11TB. These datasets are not currently available to the public.
Challenges, Future Initiatives and Questions
So what does all this mean for the government web? There are a lot of challenges and ideas for the future of the NZ Web Archive, so here are just a few points to hopefully get you thinking. Feel free to use the comments below for discussion (I’ll chip in as well), or email me at mark.crookston@dia.govt.nz if you're interested in engaging in a longer discourse.
- With the current .nz domain harvest project we’re looking to copy and extract the entire .govt.nz domain and manage it as a discrete ‘NZ Government Web Archive’. Would this archive assist how we develop the government web presence in the future?
- At the National Digital Forum 2012, Nate Solas talked about the long tail of the Walker Art Center website, where the public dives deep into the site archives for ‘quality content’. If you're a web manager, why manage your old web content when the NZ Web Archive can be your long tail? Like the Ministry for the Environment ….
- The technology to create and publish current information is always well ahead of the technology to capture, manage and provide access to that information for historical purposes. We can easily harvest HTML-formatted sites, and static documents and videos within websites, but have difficulty with Flash-based sites, complex social media sites, or sites that require passwords or registrations. Also, if your site can’t be navigated with JavaScript turned off, our harvester can’t navigate it either. What impact does this have for alternative access points within NZ Government Web Standards 2.0, like site maps?
- If you’re developing an organisational records management strategy that addresses your web records, have a think about the role the web archive can play. Archives New Zealand has, and they developed this guide to managing web records.
- Tim Berners-Lee was recently in Wellington and spoke about the uncapturable web. But some government agencies prohibit staff from accessing certain websites while at work, such as Facebook and other social media sites. How about this for a future collecting possibility—the Forbidden Web (insert dramatic music). That is, the sites that government agencies filter out and forbid staff to access for decency or productivity reasons. The old collections of Chief Censor archives are a fascinating weathervane of societal norms. What will the future think of our current web censorship? That ‘site blocked for the following reasons: humour’ message always cracked me up.
- Allowing our NZ web content to be compatible with Memento, a protocol (and Firefox add-on) that lets you link to previous versions of a site. A seamless transition between the current and archived web could add an interesting aspect to the web experience.
- With the web, our sense of retro is dramatically shortened. I wonder what role the web archive will play when web retro kicks in? C’mon, we know it’s going to happen… everything gets retro-ised eventually. Check out the archived version of the National Library website from 1999. Hilarious.
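The JavaScript point above is easy to demonstrate. A harvester that doesn't execute scripts only ever sees the links written into the raw HTML, so anything a script injects is invisible to it. Here's a minimal Python sketch (the page markup is hypothetical, standard library only) showing what a non-JavaScript crawler would extract:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in static HTML,
    mimicking what a crawler sees with JavaScript turned off."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page: one static link, plus one a script would inject.
page = """
<html><body>
  <a href="/about">About</a>
  <script>
    // A harvester never runs this, so /dynamic stays invisible to it.
    document.body.innerHTML += '<a href="/dynamic">Dynamic</a>';
  </script>
</body></html>
"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # only the static /about link is found
```

This is why an HTML site map helps: it puts every page behind a plain static link that a harvester can follow.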
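To give a flavour of how the Memento idea works under the hood: a client asks a ‘TimeGate’ for the archived copy of a page closest to a chosen date, using an Accept-Datetime HTTP header, and gets redirected to the best-matching snapshot. The sketch below just builds such a request (the TimeGate endpoint shown is illustrative, not a real NZ Web Archive address, and nothing is actually sent over the network):

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime

def timegate_request(timegate_url, original_url, when):
    """Build a Memento-style TimeGate request: ask for the version of
    original_url closest to the datetime `when`. The TimeGate would
    respond with a redirect to the best-matching archived snapshot."""
    req = urllib.request.Request(timegate_url + original_url)
    req.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    return req

# Illustrative endpoint only -- not a real archive service.
req = timegate_request(
    "https://example-archive.org/timegate/",
    "https://natlib.govt.nz/",
    datetime(2009, 6, 1, tzinfo=timezone.utc),
)
print(req.get_header("Accept-datetime"))  # the date the client is asking for
```

The appeal for web managers is that old URLs never have to die: the same address can resolve to the current page today and to an archived snapshot on request.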
I’ll stop there.