Tuesday, April 28, 2015

Legacy EAD Import into ArchivesSpace

As previously detailed by Max, the Bentley Historical Library is in the process of implementing ArchivesSpace as a component of its ArchivesSpace-Archivematica-DSpace Workflow Integration Project.

One of the requirements for using ArchivesSpace to manage our accession and collection metadata going forward is to migrate all of our legacy data into the new system. Up until now, the Bentley has been managing its accessions in a FileMaker Pro database, and its finding aids are all encoded as EAD (Encoded Archival Description). This creates some challenges for migrating that legacy data into ArchivesSpace.

As ArchivesSpace was formed as a merger between Archivists' Toolkit and Archon, institutions that were previously using one of those collection management systems can make use of migration tools to move information from the old system into ArchivesSpace. Institutions that were not using Archivists' Toolkit or Archon, such as the Bentley, must migrate their data using several standard formats, including CSV for accessions and EAD or MARCXML for collections. At the Bentley, we have chosen to start our migration to ArchivesSpace by first focusing on our EAD finding aids.

Legacy EAD Import Testing

Due to the variety of practices allowed by EAD, and the more specific requirements enforced by ArchivesSpace, migrating legacy collections data from EAD into ArchivesSpace is not as simple as starting a batch import job and ending up with all of that data properly imported out of the box.

In fall 2014, I conducted a student practicum at the Bentley Historical Library investigating the import of a selection of the Bentley's legacy EAD finding aids into ArchivesSpace, focusing primarily on identifying incompatibilities between the stock ArchivesSpace EAD importer and the Bentley's legacy encoding and descriptive practices.

Altogether, I tested 166 finding aids, a representative sample drawn from the Bentley's approximately 3,000 EADs. The sample was chosen to capture as much of the variety in the Bentley's legacy EADs as possible, along with some finding aids that seemed especially likely to have compatibility issues.

Of the 166 EADs that I tested, 107 imported successfully and 59 had errors on the initial import attempt, for an error rate of 35.54%. While this error rate could certainly be worse, it scales to roughly 1,000 of the Bentley's EADs not importing successfully into ArchivesSpace. A close examination of the errors encountered was necessary in order to identify some potential solutions.

Errors

During the testing process, several specific types of errors became apparent. It is worth noting that many of the errors are a result of ArchivesSpace, not EAD or DACS (Describing Archives: A Content Standard), requirements. The Bentley's legacy EADs are all valid EAD and conform to existing descriptive standards, and oftentimes contain the information that ArchivesSpace requires, just not necessarily in the exact place or in the exact form that ArchivesSpace expects.

The most common type of error that I encountered was that the Bentley's EADs do not always supply information that is required by ArchivesSpace in a way that the ArchivesSpace EAD importer understands. For example, the most common error (accounting for nearly half of all errors) was the result of digital objects (in the form of <dao> tags in the EAD) being imported into ArchivesSpace without titles. Digital objects in ArchivesSpace require titles, and the stock ArchivesSpace EAD importer looks for titles in the title attribute of <dao> tags. The practice at the Bentley, however, has been to indicate digital object titles in a <unittitle> tag; the title is there, it just isn't importing properly into ArchivesSpace.
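A fix like this lends itself to scripting. Below is a minimal sketch in Python using lxml (not our actual production script; the file names are hypothetical, and it assumes un-namespaced EAD 2002 with each <dao> sitting inside a <did>) that copies the text of the sibling <unittitle> into the title attribute the importer expects:

    from lxml import etree

    tree = etree.parse('findingaid.xml')  # hypothetical input file

    for dao in tree.iter('dao'):
        if dao.get('title'):
            continue  # this one would already import cleanly
        did = dao.getparent()  # assumes <dao> sits inside a <did>
        unittitle = did.find('unittitle') if did is not None else None
        if unittitle is not None:
            # itertext() flattens child elements (dates, emphasis) into plain text
            dao.set('title', ' '.join(''.join(unittitle.itertext()).split()))

    tree.write('findingaid-fixed.xml', encoding='utf-8', xml_declaration=True)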

Another common type of error was incompatibilities in the way some fields in the Bentley's EADs were structured or the way that some of the content within the fields was supplied. A common example of this type of error can be found in some of our extent statements. ArchivesSpace requires extent statements to be formatted as a number followed by letters, such as "2 linear feet." Some of the Bentley's extent statements, however, begin as letters followed by a number, such as "ca. 1000 linear feet." ArchivesSpace is not designed to allow these types of extent statements, so EADs that contain such statements return an error during the import process.
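Errors of this kind can often be fixed mechanically. As a simplified illustration (this sketch only knows a few qualifiers; real extent statements have many more variants), a regular expression can move a leading qualifier behind the number so that the statement begins with a digit:

    import re

    def normalize_extent(extent):
        # Move a leading qualifier (e.g., 'ca.' or 'approx.') behind the
        # number so the statement begins with a digit, as the importer
        # expects. A simplified sketch; real data has more variants.
        match = re.match(r'^(ca\.?|approx\.?|about)\s+(\d[\d.,]*)\s+(.*)$',
                         extent, re.IGNORECASE)
        if match:
            qualifier, number, rest = match.groups()
            return '{0} {1} ({2})'.format(number, rest, qualifier)
        return extent

    print(normalize_extent('ca. 1000 linear feet'))
    # -> 1000 linear feet (ca.)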

A detailed breakdown of all of the errors encountered during legacy EAD import testing is as follows:
  1. Digital objects missing title attributes: 28 occurrences; 47.46% of errors
  2. Indices not formatted in accordance with ArchivesSpace requirements: 14 occurrences; 23.73% of errors
  3. Component-level descriptions missing either a title or a date: 13 occurrences; 22.03% of errors
  4. Extent types not conforming to ArchivesSpace specifications: 7 occurrences; 11.86% of errors
  5. Extent statement formatting not conforming to ArchivesSpace specifications: 5 occurrences; 8.47% of errors
  6. Container tags improperly formatted: 3 occurrences; 5.08% of errors
  7. Unidentified archival_object error: 2 occurrences; 3.39% of errors
  8. Empty <unitdate> tags: 2 occurrences; 3.39% of errors
  9. Unidentified file_version error: 1 occurrence; 1.69% of errors
  10. Character encoding (in this instance, a right curly quote instead of a straight quote): 1 occurrence; 1.69% of errors
  11. Invalid EAD doctype definition: 1 occurrence; 1.69% of errors
  12. Empty <unitid> tag: 1 occurrence; 1.69% of errors
Notice those two "unidentified" errors (numbers 7 and 9) above? One of the biggest obstacles in determining what was causing an error during the EAD import testing was parsing the ArchivesSpace error messages. The error messages, at least in ArchivesSpace version 1.0.9, did not point to a particular line in the EAD that was causing the error. Rather, they were based on where the error occurred in the conversion of the EAD to the ArchivesSpace JSONModel.

Some of the error messages were fairly easy to understand, such as the following:

Error: Problem creating 'American Civil Liberties Union of Washtenaw County Records
1961-2000': id_0 That ID is already in use, ead_id Must be unique
What this says is that there is an existing resource with the same ead_id (likely as a result of the same EAD being imported previously), and that ead_ids must be unique. Simple enough.

However, other error messages are not quite so helpful, such as the error message for the "Unidentified archival_object error":
Error: Unexpected Object Type in Queue: Expected archival_object got file_version
Despite much initial confusion, however, I was eventually able to understand most of the error messages provided by the ArchivesSpace EAD importer, which provided a great deal of guidance in identifying potential strategies for moving forward with our legacy EAD migration.

Solutions

Once I had a list of all of the known compatibility issues between our EADs and the ArchivesSpace EAD importer, it was clear that there was much work to be done to make our EADs and ArchivesSpace work well together. In addition, beyond the error messages described in this post, there are numerous examples of fields in our EADs that import successfully, but not quite in the way we want the data to be in ArchivesSpace going forward (posts on those additional concerns forthcoming!). In order to migrate our EAD finding aids successfully, and with all of the data mapped as we would like, some changes are necessary to the ArchivesSpace importer and in our legacy encoded data, which will be detailed in future posts.

Ultimately, the challenges posed by migrating our legacy data into ArchivesSpace pale in comparison to the benefits and opportunities that will be afforded to us once the process is complete. The end result of the migration process will allow us to manage information about our collections in a single, standardized, community-supported tool, something about which we are very excited. We'll be sharing some detailed information about some of the solutions we've come up with to migrate our legacy data in later posts, including details about our own ArchivesSpace plugin, some custom Python scripts, and the use of OpenRefine. Until then, it is worth noting two principles that have helped guide us through this process:

1. View the ArchivesSpace migration as an unprecedented opportunity to clean up legacy metadata.
We are working with some EADs that were created years ago, and it's safe to assume that it will be a while until the opportunity arises to do metadata cleanup on this scale again.

2. Automate legacy metadata cleanup and ArchivesSpace error resolution as much as possible.
In the process of migrating our legacy EADs to ArchivesSpace, we have spent a good amount of time and effort improving our skills in programming, working with XML, and working with existing metadata cleanup tools. Improving our ability to automate some of this work has greatly enhanced our efficiency, given us the ability to quickly resolve some major issues, and increased our ability to focus on additional problems and concerns as they have arisen.
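To give a flavor of what that automation can look like, here is a rough sketch of an audit pass (hypothetical paths, and not one of our actual scripts; the checks mirror errors 1, 8 and 12 in the list above) that tallies a few known problems across a directory of EADs and writes the results to a CSV report:

    import csv
    import glob

    from lxml import etree

    problems = []

    for path in glob.glob('eads/*.xml'):  # hypothetical location
        tree = etree.parse(path)
        # Error 1: <dao> elements without the title attribute the importer wants
        for dao in tree.iter('dao'):
            if not dao.get('title'):
                problems.append((path, 'dao missing title attribute'))
        # Errors 8 and 12: empty <unitdate> and <unitid> tags
        for tag in ('unitdate', 'unitid'):
            for element in tree.iter(tag):
                if not ''.join(element.itertext()).strip():
                    problems.append((path, 'empty <%s> tag' % tag))

    with open('ead_audit.csv', 'w') as report:
        writer = csv.writer(report)
        writer.writerow(['file', 'problem'])
        writer.writerows(problems)

A report like this makes it easy to see which fixes can be applied in bulk and which EADs need individual attention.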

Friday, April 24, 2015

Implementing Archivematica

My last post detailed our work to implement ArchivesSpace, the open source archives information management application for managing and providing web access to archives, manuscripts and digital objects. Today's post is an overview of implementing Archivematica (don't worry, much more on our feature development work with Artefactual Systems, Inc. to come) here at the Bentley Historical Library:


Known for its active community and catchy slogans, Archivematica is a “web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content.” [1]

Archivematica Kick Off


Mike has already mentioned the Artefactual Systems site visit we hosted in January of this year. It was during this site visit that folks from the Library Information Technology division and Artefactual Systems, Inc. installed Archivematica and the Archivematica Storage Service. Unlike our installation of ArchivesSpace, our installation of Archivematica is currently hosted, maintained and supported (thanks, Aaron!) by the Library Information Technology division. When I started in late February, one of my initial work priorities was testing this local implementation in order to determine how Archivematica might "replace and extend" existing workflows and procedures.

Background


Note: Our proposal to the Andrew W. Mellon Foundation gave a great background to the problem that the ArchivesSpace-Archivematica-DSpace Workflow Integration project is trying to solve. That document isn't public, but much of this introduction is gleaned from that portion of the proposal which details our institutional context.

A previous post detailed the history of digital curation at the Bentley Historical Library. Relevant to our immediate conversation is Automated Processor (AutoPro), a homegrown tool that--if you couldn't guess from its name--automates digital processing from "Initial Survey" through "Deposit Content in Deep Blue" using 33 Windows CMD.EXE shell scripts that control more than 20 applications and various command line utilities:

Windows Command Prompt Interface

I know from personal experience that AutoPro has been, and continues to be, an effective processing tool. In fact, it has received a number of accolades, being recognized by conference reviewers at iPRES 2012 as:
  • "[A] successful implementation of various tools into a successful institutional workflow [...that] will be relevant to other implementers."
  • "[A] useful breakdown of the workflow steps used to process unstructured documents for ingest into an archival repository."
  • "[A] sound methodology including automated metadata generation following the EAD and PREMIS standards and creation of an audit trail."

A recent review of how to more efficiently process and deposit unstructured archival content (Microsoft Office documents, images, audio, video, &c.) in Deep Blue, however, determined that while AutoPro has been an effective processing tool, it is not an ideal solution, as:
  • Component programs installed on individual workstations must undergo frequent updates.
  • Windows CMD.EXE scripts have a limited capacity to handle errors and exceptions and little text-processing capability.
  • Scalability becomes an issue, as very large files or large collections can require a large amount of workstation resources.

This review just happened to coincide with enhancements to the University of Michigan Library's repository infrastructure and an increased budget for digital archives storage at the Bentley Historical Library. As a result, the Library Information Technology division recommended that we investigate Archivematica as an alternative to AutoPro, citing the following advantages:
  • A graphical user interface available via a web-based dashboard.
  • A "client/server processing architecture [that] allows it to be deployed in multi-node, distributed processing configurations."
  • Support for "large-scale, resource-intensive production environments" that would permit archivists to ingest and process simultaneously multiple large deposits of digital archives.
  • "Highly scalable configurations" that would permit granular control of settings for individual Virtual Machines (VMs) according to the size or contents of given Submission Information Packages (SIPs).
  • Ability to "control and trigger specific micro-services."
  • Improved exception handling and various notifications, "includ[ing] error reports, monitoring of [system] tasks and manual approvals in the workflow."
  • Simplified "alteration of preservation plans and user access levels." [2]

Initial Testing


Archivematica Dashboard

Using relevant procedures and workflows from Archivematica’s Testing page, I ran a number of representative transfers through Archivematica’s transfer and ingest micro-services using a variety of processing configurations. 

In addition to the sample transfers provided by Artefactual Systems, Inc. (some of which were intentionally designed to trigger failures in Archivematica micro-services, such as "Scan for viruses"), I tested a number of in-house transfers that had been previously run through AutoPro. These included all types of digital objects:
  • websites;
  • text(ual) materials like PDFs and Word documents;
  • spreadsheets;
  • images;
  • email; and
  • audio/video files.

Some of these were hierarchical in nature, and some were flat. One transfer was exceptionally large (about 10.7 GB, although that's only a small percentage of the total SIP). I also experimented with a disk image to test the new Forensic disk image ingest feature of Archivematica (released in September 2014) and a collection of sample files with personally identifiable information intended to test Archivematica’s existing integration with bulk_extractor.

Findings


Our main interest in all this testing was to find out if and how Archivematica would "replace and extend" the Bentley Historical Library's existing procedures (i.e., AutoPro).

Replacing AutoPro


After some initial trial and error (we've had some permissions-related trouble indexing and storing transfers and Archival Information Packages (AIPs), but I believe most of that stems from the way we have our server set up here) and communication with the Library Information Technology division, nearly all transfers could be ingested (I'll get to the one exception in a bit).

Most of the steps in the Bentley’s current digital processing workflow utilizing AutoPro can be replaced by one of Archivematica’s micro-services:

AutoPro Workflow Step                                 | Archivematica Micro-Service
------------------------------------------------------|------------------------------------------
Virus scan                                            | Scan for viruses
Create temporary backup                               | Create transfer backups
Open archive files (.ZIP, .TAR, etc.)                 | Extract packages
File and folder name normalization                    | Clean up names
Identify missing file extensions                      | Characterize and extract metadata
Create preservation copies                            | Normalize (Normalize preservation)
PII (credit card and Social Security number) scan     | Examine contents*
Appraisal and arrangement                             | [Appraisal and Arrangement tab]
Descriptive and administrative metadata creation      | Metadata
Extract technical metadata                            | Characterize and extract metadata
Transfer content (with metadata) to long-term storage | Store AIP
Clean up                                              | Store AIP (Remove processing directory)

There are two notable exceptions (the asterisked and bracketed entries above).

Notable Exception #1: Appraisal and Arrangement

The first notable exception is AutoPro’s “Appraisal and arrangement” step, for which there exists no comparable Archivematica micro-service. This functionality is very important to us. While it's true that additional steps are needed in the digital world to ensure the authenticity, integrity and security of content, digital processing is first and foremost traditional processing (this is also why we have one Curation division here at the Bentley Historical Library, not two). Traditional archival functions like appraisal, arrangement and description are just as important in the digital world as they are in the paper world.

This is why we are partnering with Artefactual Systems, Inc. to develop an Appraisal and Arrangement tab in Archivematica. We consider this functionality a high priority, and as such it is part of the first phase of development. The mockup below is what we're working on during the first sprint; it's the Transfer Backlog pane (the "appraisal" part). The final product will also include an ArchivesSpace pane (the "arrangement" part).

Be sure to keep an eye out on this page of the Archivematica wiki for the latest and greatest version of the Appraisal and Arrangement tab.

Notable Exception #2: PII

A second exception has to do with Personally Identifiable Information (PII). While the “Examine contents” micro-service of Archivematica does replicate AutoPro's functionality to identify documents that may contain PII, it does not replicate its ability to redact PII (via Identity Finder's “Scrub” functionality), and it does not currently replicate its ability to “Shred” or securely delete files containing PII.

As it turns out, the University of Michigan has decided to pull support for Identity Finder, so this is a bit of a moot point. However, part of our proposed feature development with Artefactual Systems, Inc. also includes introducing functionality in Archivematica to act on some of the bulk_extractor reports it is currently running on transfers. For example, we hope to be able to apply machine-actionable PREMIS rights statements to files and folders identified using the accounts scanner (or others) in bulk_extractor, which looks for credit card numbers, credit card track 2 information (the magnetic stripe data track read by ATMs and credit card checkers), phone numbers, and other formatted numbers. We would then use this metadata to automatically embargo or restrict access to content in Deep Blue.
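To illustrate the general idea (a simplified sketch, not the proposed development work itself; the report path is hypothetical, and the parsing assumes bulk_extractor's tab-separated feature-file layout, where comment lines start with '#' and data lines look like <location><TAB><feature><TAB><context>), a script could read the credit card feature file and list the locations that would need a rights statement or embargo:

    flagged = set()

    with open('reports/ccn.txt', 'rb') as feature_file:  # hypothetical path
        for raw_line in feature_file:
            line = raw_line.decode('utf-8', errors='replace').rstrip('\n')
            if not line or line.startswith('#'):
                continue  # skip comments and blank lines
            location = line.split('\t')[0]  # where the match was found
            flagged.add(location)

    for location in sorted(flagged):
        print('Possible credit card number near: %s' % location)
        # ...this is the point at which the proposed workflow would attach
        # a PREMIS rights statement or embargo flag instead of printing.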

Extending AutoPro


A number of Archivematica micro-services would actually extend the functionality of AutoPro, giving the Bentley the ability to:
  • automatically create UUIDs for transfers, SIPs and files, uniquely identifying and directly associating transfers and SIPs, as well as files and metadata, and, as part of the proposed development work, directly associating that with the DSpace Handle System;
  • create workflow “pipelines,” pre-configuring processing decisions for transfers and SIPs for groups of like material (e.g., born-digital acquisitions, digitization projects, audio/video, disk images vs. logical copies of directories, web archives, etc.);
  • automatically generate a robust METS.xml document, which is automatically added to any SIP generated from a transfer;
  • verify transfer checksums to compare data inside of Archivematica with data as it existed outside of Archivematica;
  • quarantine a transfer for a set period of time, until virus definitions update;
  • remove cache files;
  • automatically normalize files to create Dissemination Information Packages and thumbnails, if desired;
  • set permissions using PREMIS rights metadata, which, as part of the proposed development work, would also be recorded in ArchivesSpace and would carry over to the ability to embargo collections in DSpace; and
  • interact with AIPs and their METS files via an API.
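As a concrete (if simplified) illustration of that last point, here is a sketch of a request against the Archivematica Storage Service's REST API. The host and credentials are placeholders, and the endpoint details are assumptions to be checked against the Storage Service documentation for your version:

    import requests

    BASE = 'http://storage.example.edu:8000'  # placeholder Storage Service host
    HEADERS = {'Authorization': 'ApiKey demo_user:demo_key'}  # placeholder credentials

    # List the packages (transfers and AIPs) known to the Storage Service.
    response = requests.get(BASE + '/api/v2/file/', headers=HEADERS)
    response.raise_for_status()

    for package in response.json().get('objects', []):
        print(package['uuid'], package['package_type'], package['status'])

    # An individual file, such as an AIP's METS document, could then be pulled
    # out of a package via the extract_file endpoint, e.g.:
    #   GET /api/v2/file/<uuid>/extract_file/?relative_path_to_file=<path>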

Improving AutoPro


The original Mellon proposal noted that AutoPro is not an ideal solution because component programs installed on individual machines must undergo frequent updates, because Windows CMD.EXE scripts have a limited capacity to handle errors and exceptions and little text-processing functionality, and because scalability becomes an issue. 

Archivematica addresses some of these limitations:

Web-Based

Because Archivematica is web-based, there is no need to install clients on individual machines, and system updates only need to happen once.

Better Error-Handling

Archivematica was designed to anticipate a wide variety of processing errors. As a result, it also improves upon AutoPro’s ability to handle them. While some errors result in a process being halted and the transfer or SIP being moved to the failed directory, for others, processing can continue. Both types of errors were encountered and corrected during testing, as you can see in this typical "Archivematica Fail Report":

Type                                                          | Status                 | Started
--------------------------------------------------------------|------------------------|--------------------
Index AIP                                                     | Failed                 | 2015-03-16 17:26:35
Store the AIP                                                 | Completed successfully | 2015-03-16 16:53:27
Verify AIP                                                    | Completed successfully | 2015-03-16 16:52:17
Move to processing directory                                  | Completed successfully | 2015-03-16 16:52:17
...                                                           | ...                    | ...
Move to processing directory                                  | Completed successfully | 2015-03-16 15:00:38
Normalize                                                     | Completed successfully | 2015-03-16 15:00:38
Resume after normalization file identification tool selected. | Completed successfully | 2015-03-16 15:00:38
Identify file format                                          | Failed                 | 2015-03-16 14:42:38
Select pre-normalize file format identification command       | Completed successfully | 2015-03-16 14:42:38
Move to select file ID tool                                   | Completed successfully | 2015-03-16 14:42:37
Set resume link after tool selected.                          | Completed successfully | 2015-03-16 14:42:37
Set file permissions                                          | Completed successfully | 2015-03-16 14:37:09
Create removal from backlog PREMIS events                     | Completed successfully | 2015-03-16 14:37:09
Approve SIP Creation                                          | Completed successfully | 2015-03-16 14:18:16

As you can see, that's a lot of green (successful micro-services display in green in the dashboard, and the full report contains many more than are displayed here, hence the ellipses); the majority of these micro-services worked just fine. "Identify file format" is an example of an Archivematica error for which processing can and did continue. "Index AIP" is an example of an error for which processing is halted.

Scalability (To Be Determined)

Unfortunately, I'm not able to report out yet on how Archivematica does with scalability. We've heard tell that Archivematica can work on packages as large as one TB. However, I've attempted the 10.7 GB transfer twice, with no luck yet. Artefactual Systems, Inc. is currently working with the Library Information Technology division to get this resolved. Stay tuned for an update to this post.

Conclusion


While we did encounter some issues during Archivematica testing, for the most part it seems that Archivematica (or the proposed feature development) does indeed replace and extend the functionality of AutoPro. We're excited to start using it in production!

[1] https://www.archivematica.org/en/
[2] Quotations in this section are from https://www.archivematica.org/wiki/Overview.
[3] Curse you, thumbs.db!

Wednesday, April 22, 2015

On the road...at ARCHIVES 2015

Mark your calendars! The Bentley Historical Library and our good friends at LYRASIS and Artefactual Systems are organizing a brown bag at this year's annual meeting of the Society of American Archivists, ARCHIVES 2015 in Cleveland.

We'll be getting together on Thursday, August 20 from 12:15-1:30 to provide an update (and hopefully a demo!) of our ArchivesSpace-Archivematica integration work and get some feedback from the respective user communities.

More details will follow, but for now you can see us on pp. 21-22 of the conference preliminary program.  Huzzah!


Monday, April 20, 2015

Implementing ArchivesSpace

The man who moves a mountain begins by carrying away small stones.

It takes time to do great things. Mike has already outlined our lofty goal to employ ArchivesSpace, Archivematica and DSpace in a single, end-to-end digital archiving workflow. That won't happen overnight. Before we do anything else, we have to get each of these systems up and running individually.

DSpace, the repository software package underlying Deep Blue, the University of Michigan's institutional repository, has been in place for quite some time now, so I won't dwell on it here. An upcoming post will detail our work to implement Archivematica, to customize workflows and pipelines and to determine how it will replace and/or extend our existing systems and procedures.

Today's post is about ArchivesSpace:


Implementing ArchivesSpace


We've had an ArchivesSpace test environment running on a re-purposed Windows machine in the back of what would become my cubicle for just about a year now. When it was first set up, Bentley Historical Library staff and a number of graduate students from the University of Michigan and Wayne State University did some preliminary testing of the features and functionality of ArchivesSpace, including its import and export functionality.

In January of this year, a couple of folks from the ArchivesSpace team at LYRASIS came out to do a three-day workshop on ArchivesSpace. The first two days of the workshop covered the basics:
  • creating Accession records;
  • creating Resource records;
  • creating and managing Agent and Subject records, and linking them to Accession and Resource records;
  • recording and managing physical locations within a repository;
  • producing description output files in standardized data structures such as EAD and MARCXML; and
  • importing legacy data and performing data cleanup tasks.

The final day covered Digital Objects:
  • the functional scope of the ArchivesSpace Digital Objects module;
  • how ArchivesSpace might be used in tandem with external digital asset management systems;
  • modeling, creating and updating simple and complex Digital Object records;
  • linking and relating Digital Object records to Resource and Accession records; and
  • generating Digital Object metadata exports in standardized data formats such as METS, MODS and Dublin Core. 

By February, therefore, we had implemented the software and reviewed its functional and technical requirements for use and development. That is, we had our very own ArchivesSpace instance and we knew [basically] how to use it. We could check that preliminary item off the list. 

In order to fulfill the requirements of the grant, however, our test environment of ArchivesSpace would need to become a production environment of ArchivesSpace.

Our Strategy: Call the A-Team


If you have a problem, if no one else can help, and if you can find them, maybe you can hire the A-Team...

Not this one [1].

...that is, the ArchivesSpace Implementation Team at the Bentley Historical Library, or the "ASpace" Implementation Team (for short), or, simply, the A-Team (if you're trying to be pop culture savvy).

The four of us (Mike, Dallas and I, and our colleague, Lead Archivist for Description and Workflow Management Olga Virakhovskaya) sat down for our first meeting. After deciding which A-Team member we wanted to be (first things first, of course), we came up with a charter of sorts, which begins like this:
The Bentley Historical Library (BHL) will implement the ArchivesSpace (AS) archival management system to replace current resources and procedures, increase efficiency, and join peer institutions in establishing archival best practices for the 21st century. BHL seeks to have a full implementation of AS by March 2016, with the capacity to create new accession and resource records and manage legacy records (primarily accessions and EAD) imported from current resources. 
We also broke our tasks down into three main areas, and divided up the work.

     1.  Leading the Way (and Liaising with MLibrary)


Mike is the project leader. Using lessons learned from a recent Society of American Archivists offering here on "Project Management for Archivists," he has laid out project outcomes (see above), goals and objectives, roles, constraints (e.g., the [im]maturity of some features of ArchivesSpace, the timeline imposed by the Mellon grant and the current [in]compatibility of ArchivesSpace with existing software platforms in use here, such as Aeon and Digital Library eXtension Service, or DLXS), costs and timeline.

We're experimenting with Teamwork, a web-based project-management tool, to track our progress.

Mike is also responsible for liaising with the University of Michigan Library. A production environment of ArchivesSpace will not be able to run off of the computer we have sitting in the back of my cubicle. Instead, the current plan is that once we are "live," ArchivesSpace will be hosted, maintained and supported by the Library Information Technology division. Charged with "the design, development, management, and maintenance of a flexible and reliable technology environment," they have human and technological infrastructure in place to support library management systems like ArchivesSpace in a way that we simply cannot.

     2.  Developing New Accessioning Conventions and Descriptive Practice


At some point, we will have to begin creating accession records for new donations and transfers, as well as generating resource records for archival collections. Since both of these practices are likely to change after implementing ArchivesSpace, and since finer granularity adds overhead to data entry, Olga will take the lead in developing new accessioning conventions and descriptive practices.

     3.  Import of Legacy Accession Data and EAD Finding Aids


Finally, Dallas and I are working on importing legacy accession data (all 19,000+ records) and Encoded Archival Description (EAD) finding aids (all 2,800+ of them). We have decided to start with the latter, and hope to begin work on legacy accession data by the fall of this year.

Accession Data

The accession data will be coming from a CSV export of a homegrown FileMaker database, the Bentley Electronic Accessioning and Locating System, or BEAL, which also happens to be the name of the street on which we are located.


This database has been described as the "lifeblood" of the Bentley Historical Library, and one of the many challenges we foresee with migrating legacy accession data is ensuring that any tasks and processes that rely on accession data in BEAL (internal ones like mailing lists, reports, &c., as well as external ones) can still be carried out once we make the move to ArchivesSpace.
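Mechanically, at least, the first step of that migration is a remapping of CSV columns. A minimal sketch follows; the column names on both sides are hypothetical stand-ins, since the real BEAL export and the ArchivesSpace accession CSV import template each define their own:

    import csv

    # Hypothetical field mapping from BEAL export columns to ArchivesSpace
    # accession CSV template columns.
    FIELD_MAP = {
        'AccessionNo': 'accession_number',
        'DateReceived': 'accession_date',
        'Description': 'content_description',
    }

    with open('beal_export.csv') as source, \
            open('aspace_accessions.csv', 'w') as target:
        reader = csv.DictReader(source)
        writer = csv.DictWriter(target, fieldnames=list(FIELD_MAP.values()))
        writer.writeheader()
        for row in reader:
            writer.writerow({new: row.get(old, '') for old, new in FIELD_MAP.items()})

The hard part, of course, is not the remapping itself but deciding how each BEAL field should live in ArchivesSpace.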

EAD Finding Aids

Finding aids start life here as Microsoft Word documents and get converted to valid EAD XML through a series of Word macros. Next, they are made searchable and displayed using DLXS. You can read more about that process here.

Our work to import legacy finding aids created like this into ArchivesSpace builds off of work Dallas did during a practicum he completed here before becoming a Project Archivist.

It hasn't exactly gone smoothly.

We have an error rate of about 30%, and even when we don't get an error, finding aids don't always import the way we expect, or in a way that takes advantage of ArchivesSpace's functionality (or potential functionality, once we have a Hydra-based implementation of Deep Blue and, possibly, DLXS). Part of this is because the Microsoft Word method lends itself well to human-understandable finding aids, while data in ArchivesSpace needs to be both human- and computer-understandable.

Even if it doesn't affect import, we have discovered through this process that some of our data is fairly messy. We feel that we might as well clean it since we are spending so much time with it now and because it may be harder to manipulate in batch once it's in ArchivesSpace.

Next Steps


Longer term goals and objectives that the A-Team will have to address include:
  • importing legacy MARC records;
  • exporting EAD finding aids;
  • automating the transfer and upload of EADs to the DLXS platform;
  • exporting MARCXML records for import into the catalog; 
  • developing training materials and documentation; and
  • determining how ArchivesSpace will communicate with other Bentley Historical Library and university systems.

Be on the lookout for the aforementioned post on Archivematica and a post on some of the errors Dallas found during his practicum while importing a sample of our finding aids into ArchivesSpace. In the near future we'll also launch an ongoing series highlighting an ArchivesSpace plug-in we created, as well as some basic programs we have developed and tools we have used to address those errors.

[1] "Ateam" by Source. Licensed under Fair use via Wikipedia - http://en.wikipedia.org/wiki/File:Ateam.jpg#/media/File:Ateam.jpg

BHL Mellon Grant @ MMDP

The Bentley Historical Library and University of Michigan Library cohosted a gathering of the Mid-Michigan Digital Practitioners (MMDP) on March 26-27, 2015. This was the fifth biannual meeting of the group (for past agendas and presentations, click here), which was started by Ed Busch and his colleagues at Michigan State University (be sure to check out the great interview with Ed on the LOC Signal blog).

As the event host, the University of Michigan was given an opportunity to present on current projects and initiatives.  I gave an update on our "ArchivesSpace-Archivematica-DSpace Workflow Integration" project, highlighting our recent collaborations with Artefactual Systems and giving attendees an overview of our latest UML workflow diagram (version five) and a preview of an initial wireframe rendering of Archivematica's new appraisal and arrangement tab (which has already undergone some changes).

Check out my presentation slides here.  You can also view an introductory presentation on the project (delivered at the Sept. 18, 2014 meeting of the MMDP at Central Michigan University) here.

Video of the host presentation is also available on the University of Michigan Library website and includes a question and answer session with the audience.

For the host presentation, I was joined by my colleagues Matt Adair (Lead Archivist for Digitization, Bentley Historical Library) and Alix Keener (Digital Scholarship Librarian, University of Michigan Library); their respective presentations addressed the Bentley's implementation of the Aeon registration and reference request system and the upcoming symposium, Web Archives 2015: Capture, Curate, Analyze (check it out and submit a proposal before the May 15 deadline!!).

Saturday, April 18, 2015

One Year In: a Retrospective

Welcome back, readers! Previous posts have provided an overview of the Bentley Historical Library and University of Michigan Library's "ArchivesSpace-Archivematica-DSpace Workflow Integration" project and given some background on digital preservation and curation efforts at the Bentley.  In this post, I would like to give an update on what's been accomplished thus far and where we're going.

Unexpected Challenges

As with most endeavors, our project has faced some unexpected challenges, the first of which related to staffing.  Per our proposal, we originally intended to hire a software developer for a two-year term position at the University of Michigan Library to handle the technical aspects of integrating ArchivesSpace, Archivematica, and DSpace in a digital archives workflow.  This approach was selected during the planning phase so that the developer could share knowledge and expertise with other Library Information Technology (LIT) staff through daily interactions.  However, given the improving economy and the unique 'chaos to order' skills/experience required by the position, we were unable to secure any candidates after three months of intensive recruiting.

Rather than extend the posting for a fourth month and risk further delays, our team began to explore the possibility of contracting directly with Artefactual Systems Inc. for the necessary development work.  We soon realized this strategy would reap immense benefits for the project, due to the company's expert knowledge of Archivematica, extensive experience with the agile development of open source software for libraries and archives, and large network in the archival and digital preservation communities.  Staff members at Artefactual Systems were also highly familiar with the project, as President Evelyn McLellan and Director of Archivematica Technical Services Justin Simpson had been in regular communication with us since the planning stages of the grant in the summer of 2013, and staff had already been tapped to provide consulting services for LIT developers.  After completing the budget reallocation process, we finalized this arrangement in December 2014.

We faced even greater adversity with the illness and loss of our dear friend and colleague Nancy Deromedi.  The grant's original Principal Investigator, Deromedi pioneered the collection and preservation of digital and web archives at the Bentley, as outlined in my previous post.  As head of the library's Digital Curation Division from 2011-2014, she was incredibly supportive of my work developing the AutoPro ingest and processing tool and was a great advocate for open access to born-digital archives.  Upon being named Associate Director for Curation during the Bentley's 2014 reorganization, Deromedi asked me to unify our paper and digital processing procedures to ensure the standardization of descriptive practices and empower processing staff to handle all types of archival materials, regardless of format.  The move was typical of her progressive vision, dynamic leadership, and willingness to take risks.

Deromedi had been diagnosed with esophageal cancer in late 2013, but continued to work on the grant proposal and numerous other projects while undergoing rounds of treatment through early 2014.  In March 2014, she underwent major surgery, but after only three months of recovery and rehabilitation she was back at the Bentley, full of enthusiasm and determination to see the grant project to a successful outcome.  She provided leadership on the early stages of the budget reallocation process, but doctors discovered a recurrence of her cancer in September and on October 13, 2014 she passed away.
Nancy M. Deromedi
Nancy Deromedi was a skilled and knowledgeable archivist, a great mentor and leader, and a dear colleague and friend.  Our work on this grant is very much a tribute to her vision and record of achievement.

Moving Forward

While the delay in hiring a developer was vexing, Nancy Deromedi's illness and death posed very serious obstacles to progress.  Nevertheless, project staff made steady (albeit slow) progress on a number of fronts through 2014 to the present. These include:

  • Digital preservation policy review: Archivists undertook a review of current digital preservation policies and procedures.  While a final document is still in draft form, the exercise helped confirm preservation strategies (such as the creation of preservation copies of content in at-risk file formats) as well as approaches for handling sensitive personal information.
  • Software review and evaluation: While archivists had experimented with sandbox versions of Archivematica and ArchivesSpace, a more thorough review of each system was undertaken, which included local implementations of each platform.  In addition to understanding basic features and functionality, this work helped project staff identify development needs.  Future posts will provide more information about these undertakings.
  • Hiring additional staff: In January 2015, the Bentley hired Assistant Archivist for Digital Curation Max Eckard and Project Archivist Dallas Pillen.  Max currently devotes 100% of his time to the grant and Dallas is likewise fully engaged with the project, with exceptions for some weekly reference shifts and technical support for the Bentley's implementation of the Aeon registration and circulation management system.  Look for future posts from both Max and Dallas!
  • Artefactual Systems site visit: The Bentley hosted Evelyn McLellan and Justin Simpson from Artefactual Systems from January 13-15, 2015.  These were three days of nearly nonstop activity, which included:
    • An overview of the Bentley's existing digital backlog and its collections in Deep Blue, the University of Michigan's DSpace repository.
    • A thorough review of current Bentley procedures and workflows for the accession, ingest, and description of digital archives.
    • Analysis of current features and functionality of ArchivesSpace and Archivematica, with discussion of areas for future development and integration.
    • Meetings with LIT systems administrators about Michigan's current DSpace implementation and plans for the move to Hydra.
    • Archivematica demonstration for archivists, librarians, administrators, and IT staff from the Bentley Historical Library, University of Michigan Library, Clements Library, and Gerald R. Ford Presidential Library.
    • Review of Archivematica installation and maintenance for LIT staff and installation of a local Archivematica instance.
  •  UML workflow diagrams: Artefactual Systems prepared a set of basic workflow diagrams that were reviewed by project staff at the Bentley Historical Library and University of Michigan Library.  After five revisions, staff at Artefactual Systems and Michigan arrived at a version that will serve as a foundation for development work (but which will continue to be refined).  While a better version will be made available on the Archivematica wiki, a copy of version 4 is available here.
  • Consultation Final Report: Based upon the site visit and follow-up telecons, Artefactual Systems prepared a final consultation report for the Bentley, which identified key development tasks and time estimates and included an updated workflow diagram and suggested strategies for moving forward with the integration work.  The main areas of development will include:
    • Developing a new appraisal/arrangement dashboard tab in Archivematica.  Here's a mockup of this tab, displaying the transfer backlog and associated reports (an additional ASpace archival object pane would also be available in the finished tab).
    • Archivematica-ArchivesSpace integration: notably, the ability to create archival object records and associate content with them, thereby creating Archivematica SIPs and ArchivesSpace digital objects.
    • AIP repackaging: the Bentley will be providing access to AIPs (as opposed to DIPs) in its repository, to avoid redundant storage of content and ensure that researchers have access to original materials.  As part of this approach, the Bentley needs to be able to package multiple files and/or folders into zip files to simplify patron access to and archival management of content (a rough sketch follows this list).  See an example of how we do this with our A. Alfred Taubman collection.
    • DSpace/Deep Blue integration: including the ability to automatically upload data and administrative/descriptive metadata from ArchivesSpace as well as the ability to update file URIs in ASpace digital object records with DSpace handles.
    • External tools integration: the ability to review transfer contents using bulk_extractor; the ability to generate PREMIS rights information from bulk_extractor reports; and the addition of other external tools for analysis and file viewing.
  • Agile development sprints: After prioritizing and refining the proposed development tasks, we recently kicked off agile development cycles, in which we will use weekly telecons to identify priorities, review current work, and plan next steps.  As part of this effort, Bentley archivists are creating user stories to identify potential features and functionality.  In addition, the Archivematica wiki will feature development requirements, images of design features, our telecon meeting agendas, and other relevant information.  A page for the appraisal and arrangement tab is currently up.
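As promised above, here is a rough sketch of the AIP repackaging idea (hypothetical paths, and a simplified illustration rather than the grant deliverable): bundling one folder's worth of content into a single zip file for patron download:

    import os
    import zipfile

    def zip_folder(folder, destination):
        # Package one folder from an AIP as a single zip for patron download.
        with zipfile.ZipFile(destination, 'w', zipfile.ZIP_DEFLATED) as bundle:
            for root, _dirs, files in os.walk(folder):
                for name in files:
                    full_path = os.path.join(root, name)
                    # store paths relative to the folder so the zip unpacks cleanly
                    bundle.write(full_path, os.path.relpath(full_path, folder))

    zip_folder('aip/data/objects/correspondence', 'correspondence.zip')  # hypothetical paths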
Suffice it to say, a lot has been going on!  Stay tuned for more news and updates and, as always, feel free to drop us a line or leave a comment.

Monday, April 13, 2015

A Short(ish) History of Digital Curation at the Bentley Historical Library

Our inaugural post introduced readers to the Bentley Historical Library and University of Michigan Library's "ArchivesSpace-Archivematica-DSpace Workflow Integration" project.  In this post, I'd like to give some background and context for the Bentley's involvement in this project.

For close to two decades, the Bentley Historical Library has actively collected, processed, preserved, and provided access to born-digital archives from the University of Michigan as well as from private individuals and organizations from around the state.  These experiences have provided a strong foundation for the planning and implementation work already underway with the grant.

Laying the Foundation (1997-2008)

As early as 1979, archivists in the Bentley's University Archives and Records Program (UARP) were discussing the challenges posed by new technologies and machine readable records.  A 1991 NHPRC grant, “Study on the Uses of Electronic Communication to Document an Academic Community,” provided an opportunity for the library to explore the topic in more depth.  It was not until 1997, however, that the Bentley received its first significant collection of born-digital archives: the Macintosh personal computer of former University of Michigan President James J. Duderstadt.

At the time, Electronic Records Archivist Nancy Deromedi developed a preservation strategy for the approximately 2,100 files in the accession that included running virus scans, documenting file and folder naming conventions, and migrating content from the original MORE 3.1 and Microsoft Word 6.0 file formats to the Word 97 (and later PDF/A) format.  Through the late 1990s and early 2000s, the Bentley continued to collect born-digital archives and Deromedi later published accounts of her strategies in a series of SAA Campus Case Studies.
Deromedi also initiated a web archiving program at the Bentley in 2000, using desktop applications such as HTTrack and Teleport Pro to capture snapshots of the websites of key academic and administrative units and to document events such as the university's response to the Y2K bug and the Grutter v. Bollinger Supreme Court case on the use of affirmative action in Law School admission decisions.

These early efforts were instrumental in capturing historical and administrative records of long-term value, but each involved developing unique preservation strategies and relied upon heavily manual procedures.  As the university's production of electronic records with archival value increased, the library faced challenges of scalability and sustainability: the Bentley lacked in-house IT staff and extensive technical expertise and Deromedi balanced numerous responsibilities in addition to her work with digital archives.

MeMail (2009-2011)

Given the above issues and UARP's interest in developing a more proactive approach to documenting the history of the modern university, the Bentley launched the “MeMail” Project (formally titled "Email Archiving at the University of Michigan") in 2009 to explore strategies to collect and preserve the email of key administrators.  Email was selected as the project's focus because (a) the archives was no longer receiving correspondence with the same volume and regularity as in earlier decades and (b) the myriad complexities posed by email (unique platforms, proprietary formats, relationships of attachments to messages and messages to threads of correspondence, etc.) would help the Bentley enhance its overall capacity to preserve and provide access to digital content of unique, essential, and enduring value. 

A generous two-year grant from the Andrew W. Mellon Foundation in January 2010 allowed UARP to partner with the university’s Information and Technology Services (ITS), bringing both archival and IT expertise to bear on digital curation.  The grant also enabled UARP to hire two full-time archivists to serve as the project’s functional and technical leads. Working with ITS, project staff developed a system of 'archival mailboxes' that participating administrators used to collect email of long-term value (by dragging/dropping, forwarding, or CC'ing).  Having administrators conduct the appraisal and selection of their correspondence proved to be difficult due to the cumulative value of email threads and participants' concern over third-party privacy.  The university's decision in December 2010 to adopt Google collaborative tools further complicated the project by making the 'archival mailbox' strategy impractical.  For more information on the lessons learned from these efforts, see Functional Lead Aprille McKay's two SAA campus case studies.

The planning, development, and implementation work associated with MeMail laid the foundations for the Bentley's current digital curation program.  As Technical Lead, I explored software, procedures, and workflows required to ingest and preserve email and attachments (Office files, images, audio, video, etc.).  I worked closely with McKay and others to identify rights and access issues associated with acquiring digital content and making it accessible and also developed policies and procedures to address sensitive personal information (SSNs, credit card numbers, etc.).  In addition, the project gave us an impetus to review and enhance our infrastructure: we acquired secure server space to store our backlog and conduct ingest procedures and also negotiated for expanded use of Deep Blue, the University of Michigan's DSpace repository (with another copy of material stored in a local dark archive managed by ITS).

Digital Curation Division (2011-2014)

One of the most valuable legacies of the MeMail Project was that it helped the Bentley document the needs and demands of administrative and academic units for the preservation of University of Michigan digital assets.  With this information, then-Director Fran Blouin successfully lobbied for the creation of a new Digital Curation Division (headed by Nancy Deromedi) and the addition of a permanent position (yours truly).  Based upon the research and extensive testing from earlier phases of MeMail, we defined functional and technical requirements for digital archives ingest and processing procedures appropriate for our local needs and resources.  This work permitted us to draft a workflow diagram (which has since been updated a number of times) and accompanying guidelines for the manual processing of born-digital materials.  Progress on this manual workflow was tracked on a checklist that included more than 40 discrete steps and required staff to operate some twenty applications and command line utilities, follow strict naming conventions for directories and log files, and generate or record preservation metadata by hand.  While effective, this approach was highly labor-intensive, posed challenges for training staff, and presented numerous opportunities for user error.

Hoping to overcome these constraints and enable more of our staff to work with digital content, I started to explore the possibility of automating workflow steps.  In setting out, I was particularly influenced by the Archivematica digital preservation system and its 'microservice' design, whereby a specific tool is implemented to perform a specific function (and may be swapped out or replaced by another without impacting the rest of the system).  After a successful proof of concept in automating our format migration procedures (a step that creates preservation copies of content based upon migration pathways that reflect professional standards and best practices), I set about revising other steps.  By early 2012, I had produced the AutomatedProcessor (or AutoPro), a collection of 33 Visual Basic and Windows CMD.EXE shell scripts that moved content through an 11-step workflow.
AutoPro splash screen
Nancy Deromedi and I presented a poster on this work at the 2012 iPRES conference and I have continually refined and streamlined features in the intervening years to make procedures more efficient and user friendly. A comparison between earlier versions of the AutoPro user manual and our current procedures for digital processing reveals some alterations in the number and order of workflow steps and significant changes in the interface for adding descriptive and administrative metadata to content.  For more information on the basic AutoPro workflow and related procedures, see the overview in our manual.

Working Smarter (2014-)

Since its introduction, AutoPro has been used to prepare more than 230 accessions of digital content (approx. 1.2 TB) for deposit in our Deep Blue repository.  In helping us to address a growing backlog of digital archives in a standardized manner, the tool has been a smashing success.  At the same time, AutoPro was never intended to be a final solution for the Bentley: the command line interface is not particularly intuitive or user friendly, the CMD.EXE scripts have poor error-handling functionality, and maintaining and updating the scripts and software on individual workstations often takes an inordinate amount of time.  We also realized that we were entering the same descriptive and administrative metadata in numerous locations: once in our finding aids, again in our processing workflow (so that descriptions of content could be stored alongside materials in the Archival Information Package), and a third time when we manually uploaded material to the Deep Blue DSpace repository.

Given these complications and inefficiencies, Nancy Deromedi and I considered options for more than a year before deciding to explore integrating functionality of Archivematica and ArchivesSpace into a single workflow and automating the deposit of material into DSpace.  While the idea of bringing together these systems (especially the former two) has been discussed in various circles for years, the Bentley was fortunate enough to secure grant funding to push development work forward.  I've already described our basic goals and strategy in our first post—in the next one, I'll discuss the challenges and progress we've encountered thus far.  Stay tuned!

Wednesday, April 8, 2015

Hello World!

Welcome to the blog for the University of Michigan's ArchivesSpace-Archivematica-DSpace Workflow Integration project, a joint effort of the Bentley Historical Library and the University of Michigan Library with generous support from the Andrew W. Mellon Foundation.  This blog will provide updates and insights into the project as well as general digital curation practices at the Bentley Historical Library.

Project Overview

As outlined in the initial press release for the grant, this project seeks to expedite the ingest, description, and overall curation of digital archives by facilitating the creation and reuse of descriptive and administrative metadata among emerging platforms and streamlining the deposit of fully processed content into a digital preservation repository.

Many readers will already be familiar with the above-mentioned systems, but the following may be helpful for those in need of a refresher:

  • ArchivesSpace is open-source archival management software that combines the best features of Archon and Archivists’ Toolkit.  This system permits institutions to track accessions, manage collections, and generate Encoded Archival Description (EAD) finding aids and MARCXML.  Development of "ASpace" was funded by the Andrew W. Mellon Foundation (2011-2013) and LYRASIS now serves as its institutional home.
  • Archivematica is a free and open-source digital preservation system developed by Artefactual Systems (British Columbia). Archivematica employs a micro-service design to “provide an integrated suite of software tools that allows users to process digital objects from ingest to access in compliance with the ISO-OAIS functional model” and furthermore employs METS and PREMIS to record and track descriptive, administrative, and rights metadata.
  • DSpace is an open-source repository platform that “preserves and enables easy and open access to all types of digital content including text, images, moving images, mpegs and data sets."  Initially developed by MIT Libraries with partial support from a Mellon Foundation grant, DSpace has acquired a growing community of developers and is employed by the University of Michigan and approximately 1,400 other academic, nonprofit, and commercial organizations around the world.
For the purposes of the grant, ArchivesSpace will be employed to store descriptive, administrative, and rights metadata related to digital archives; Archivematica will be used to ingest content, associate it with descriptive metadata from ASpace, and prepare information packages for deposit; and DSpace will serve as a preservation repository and access portal for collections.

Goals and Deliverables

To achieve our goals, the Bentley has contracted with Artefactual Systems for development work in the following areas (all of which will be made clearer in future posts):
  • Introduce functionality into Archivematica that will permit users to review, appraise, deaccession, and arrange content in a new "Appraisal and Arrangement" tab in the system dashboard.
  • Load (and create) ASpace archival object records in the Archivematica "Appraisal and Arrangement" tab and then drag and drop content onto the appropriate archival objects to define Submission Information Packages (SIPs) that will in turn be described as 'digital objects' in ASpace and deposited as discrete 'items' in DSpace.  This work will build upon the SIP Arrangement panel developed for Simon Fraser University and the Rockefeller Archives Center's Archivematica-Archivists' Toolkit integration (as demonstrated around the 12 minute point of the first video here).
  • Create new archival object and digital object records in ASpace and associate the latter with DSpace handles to provide URIs/'href' values for <dao> elements in exported EADs.
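For a concrete (if simplified) picture of that last step, here is a sketch against the ArchivesSpace backend API. The login handshake and digital_object JSONModel follow standard ASpace conventions, but the host, credentials, handle and identifier below are placeholders:

    import requests

    BASE = 'http://aspace.example.edu:8089'  # placeholder backend URL

    # Log in and grab a session token.
    token = requests.post(BASE + '/users/admin/login',
                          params={'password': 'admin'}).json()['session']
    headers = {'X-ArchivesSpace-Session': token}

    # A minimal digital object whose file_uri points at a DSpace handle; on
    # EAD export, the file_uri provides the href value for the <dao> element.
    digital_object = {
        'jsonmodel_type': 'digital_object',
        'title': 'Sample digital object',
        'digital_object_id': 'bhl-demo-0001',
        'file_versions': [{
            'jsonmodel_type': 'file_version',
            'file_uri': 'http://hdl.handle.net/2027.42/000000',  # placeholder handle
        }],
    }

    response = requests.post(BASE + '/repositories/2/digital_objects',
                             headers=headers, json=digital_object)
    print(response.json())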

After extensive work defining use cases, functional requirements, workflows, and development tasks, the first phase of project development kicked off in early April 2015.  As work progresses, the Bentley and our partners have the following goals:
  • Meet-ups and/or presentations at the annual meeting of the Society of American Archivists (SAA) and other professional organizations will be used to disseminate information and invite feedback from user communities.  Additional outreach will be conducted in coordination with Artefactual Systems on appropriate listservs and forums.
  • All software produced by the project will be incorporated back into appropriate source code repositories and be made freely available to the archives and library communities.
  • ArchivesSpace-Archivematica integration will function independently of DSpace.  Integration with the repository will employ open and widely-used standards so that institutions can reconfigure the workflow to replace DSpace with another repository/access system (such as Hydra).
  • While some project features may be unique to the Bentley Historical Library (such as our use of modified Archival Information Packages for access purposes), the final product should be flexible/extensible enough to accommodate the widely varied practices of the digital preservation and curation communities.
  • Project documentation and reports will be made freely available to all users through this blog and other sources.

Next Steps

Project staff at the Bentley—which includes myself (Mike Shallcross, Principal Investigator), Max Eckard, and Dallas Pillen—will work closely with developers at Artefactual Systems and IT staff and librarians at the University of Michigan Library to further define requirements, test development features, and document procedures.  We also look forward to exploring other community initiatives (such as BitCurator and ArcLight) to identify possible synergies and integration points with our endeavors.

We plan to post regularly to this blog and welcome any and all feedback, questions, and suggestions.  Feel free to leave a comment or send us a message at bhl-mellon-grant (at) umich.edu.  Thanks!