Wednesday, February 17, 2016

The Ingest Entity at the Bentley Historical Library

The other night I gave a guest lecture for a University of Michigan School of Information course Mike is teaching on "Digital Preservation." The session was nominally titled "SIPs and File Format Identification," but the focus was on "Ingest" more broadly, including what ingest means here at the Bentley Historical Library, some examples of real-world SIPs and AIPs (current and future!) and a little checksum calculation exercise--because who doesn't like checksum calculation exercises?

At the heart of the presentation is the following diagram that documents our current ingest strategy and a [relatively] current version of our workflow. It's a useful reference Nancy and Mike put together a couple years ago that I don't think we've featured here before.
BHL Digital Processing Workflow

Here are the slides if you're interested!

Friday, February 5, 2016

A Primer on PREMIS and PREMIS Rights

In today's post, I'd like to talk a little about PREMIS (the data dictionary, not the working group--although I'm sure they're all great people, like Evelyn!). We've been using something akin to "PREMIS Lite" as part of our digital archiving workflow for a while now. As part of our work on the ArchivesSpace-Archivematica-DSpace Workflow Integration project, however, and in thinking about our eventual move to Hydra, we're gearing up to implement something more like PREMIS Proper, especially for PREMIS Rights Statements.

An Overview of PREMIS

http://www.loc.gov/standards/premis/images/premis-text2.gif
PREMIS Data Dictionary for Preservation Metadata, Version 3.0
Let's dive in with an overview of PREMIS. The following is lifted straight from their website:
The PREMIS Data Dictionary for Preservation Metadata is the international standard for metadata to support the preservation of digital objects and ensure their long-term usability. Developed by an international team of experts, PREMIS is implemented in digital preservation projects around the world, and support for PREMIS is incorporated into a number of commercial and open-source digital preservation tools and systems. The PREMIS Editorial Committee coordinates revisions and implementation of the standard, which consists of the Data Dictionary, an XML schema, and supporting documentation.
Leaving descriptive metadata, especially domain-specific descriptive metadata, to the many existing descriptive metadata schemes and encoding standards (like MARC, MODS, Dublin Core, EAD, etc.), and leaving super format-specific technical metadata to those who would get super nerdy about format-specific technical metadata (you know who you are), what I love about PREMIS is that it focuses specifically on one of my favorite things, digital preservation. It allows the digital archivist to record, and in some ways it even defines--although they might not like that I said that--the common denominator of preservation actions a preservation repository might perform on a digital object.

As of PREMIS 3.0, the data model consists of four entities:
The PREMIS Data Model
  • Digital Objects: Discrete units of information subject to digital preservation. This could also be an "Environment," that is, hardware or software that supports a Digital Object in some way, like rendering or executing it. Digital Objects are further broken down into the following subcategories:
Conceptual view between object categories
    • Intellectual Entity: A set of content that makes up a single unit for purposes of management and description. Pretty much anything, at any level, can be an intellectual entity, and intellectual entities may (or may not) be made up of other intellectual entities. A website (an intellectual entity you might describe or manage as an aggregate) may have a webpage (another intellectual entity) which may have an image (another intellectual entity, especially if you apply some additional descriptive metadata to it or migrate its file to a new format).
    • Representation: A set of files, including structural information, needed for a "complete" or at least "reasonable" rendition of the Intellectual Entity. To take a book as an example, consider different ways you might represent it: a single PDF file, 10 images, one for each page (which must be read in a particular order), etc.
    • File: A sequence of bytes. This is the thing that has a format, access permissions, last modification date, etc.
    • Bitstream: Contiguous and non-contiguous data within a file that has meaningful common properties for preservation purposes. Makes perfect sense, right?
  • Events: Actions, like preservation actions, that involve Digital Objects or Agents.
  • Agents: People, organizations or software that are associated with Events (or the rights attached to them). That is, people like me, organizations like the Bentley Historical Library or software like Archivematica or FIDO.
  • Rights Statements: Assertion of one (or more, which is the exciting part, but more on that later) rights or permissions statements pertaining to a Digital Object or Agent.
That's about it. If you want more information you can read the latest PREMIS Data Dictionary yourself!
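If it helps to see the four entities side by side, here is a minimal sketch of how they might link to one another. To be clear, this is just an illustration and not the PREMIS schema: the class names, field names and identifiers (like "obj-1") are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DigitalObject:
    identifier: str
    # "intellectual entity", "representation", "file" or "bitstream"
    category: str


@dataclass
class Agent:
    identifier: str
    name: str
    agent_type: str  # "person", "organization" or "software"


@dataclass
class Event:
    identifier: str
    event_type: str
    # Events point at the objects and agents they involve (illustrative fields).
    object_ids: List[str] = field(default_factory=list)
    agent_ids: List[str] = field(default_factory=list)


@dataclass
class RightsStatement:
    identifier: str
    basis: str
    object_ids: List[str] = field(default_factory=list)


def events_for(object_id: str, events: List[Event]) -> List[Event]:
    """Return the events linked to a given object identifier."""
    return [e for e in events if object_id in e.object_ids]


# Hypothetical records: one file, one software agent, two events.
report = DigitalObject("obj-1", "file")
fido = Agent("agent-1", "FIDO", "software")
identification = Event("event-1", "format identification", ["obj-1"], ["agent-1"])
scan = Event("event-2", "virus check", ["obj-2"], ["agent-1"])

linked = events_for(report.identifier, [identification, scan])
print([e.event_type for e in linked])
```

The point is only that entities refer to each other by identifier, so you can ask "what happened to this file, and who or what did it?" without any one record having to hold everything.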

PREMIS at the Bentley

We currently record information (that is, identifier, date and time, detail and outcome) about the following PREMIS event types (with ourselves as the agent as well as the piece of software we used):

  • Virus Scan/Check: The process of scanning a file for malicious programs.
  • Personally Identifiable Information Scan: Hmm, I can't really seem to find this one in the controlled vocabulary. Maybe we made this one up? In any case, internal consistency is what's really important, right?
  • Identify Missing File Extensions/File Extension Change: Assignment of a new filetype extension to a file object; typically done only if the existing extension was found to be incorrect.
  • Compression of Files: The process of coding data to save storage space or transmission time. We go for lossless and don't actually compress anything.
  • Technical Metadata Extraction: Extraction of technical (or non-technical) metadata like the resolution, color depth, etc. from a file using tools such as JHOVE.
That's it! PREMIS Lite! This information gets recorded in the humble CSV format (no fancy XML for us!). We also record additional information that technically counts as PREMIS Events, like checksums (and the algorithms we used to calculate them), as well as what files got normalized (and what they got normalized to). While this information doesn't make it into this particular CSV, it does get recorded elsewhere.
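For the curious, recording events this way can look something like the sketch below. This is not our actual script: the column names, the "hypothetical-scanner" software name and the helper functions are assumptions for illustration, loosely following the fields listed above (identifier, date and time, detail, outcome, plus agent and software). The checksum helper is kept separate, since that information lives outside the events CSV.

```python
import csv
import hashlib
import io
import uuid
from datetime import datetime, timezone

# Assumed column names; the real CSV may label these differently.
FIELDS = ["identifier", "event_type", "date_time", "detail", "outcome", "agent", "software"]


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data using the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def make_event(event_type, detail, outcome, agent, software):
    """Build one event row with a fresh identifier and a UTC timestamp."""
    return {
        "identifier": str(uuid.uuid4()),
        "event_type": event_type,
        "date_time": datetime.now(timezone.utc).isoformat(),
        "detail": detail,
        "outcome": outcome,
        "agent": agent,
        "software": software,
    }


def write_events(events, handle):
    """Write event rows to an open text handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(events)


events = [
    make_event("virus check", "scanned file for malicious programs", "pass",
               "archivist", "hypothetical-scanner"),
]
buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_events(events, buffer)
print(buffer.getvalue())

# Checksums are recorded alongside the algorithm used to calculate them.
print("sha256", checksum(b"some file contents"))
```

Swapping `buffer` for `open("events.csv", "w", newline="")` would write the same rows to disk.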

The Future: PREMIS Rights Statements

By virtue of us moving to Archivematica, within mere months we'll become much more PREMIS Proper-compliant (and we'll have that PREMIS in XML!). But that's not all--there's also been a lot of discussion around here (and around MLibrary, Artefactual, ArchivesSpace, the Hydra folks and various listservs) about PREMIS Rights, especially the communication of PREMIS Rights Statements between systems, and the enforcement of them by repositories.

First, the Why

Before I jump into an outline of PREMIS Rights and all of their granular goodness, I wanted to talk briefly about the problem that PREMIS Rights Statements solve. Take, for example, two familiar standards, Dublin Core and EAD:

<dc>
  ...
  <rights>Some random free text rights statement.</rights>
  ...
</dc>

<ead>
  ...
  <accessrestrict>Some free text Conditions Governing Access note that ends on this <date>date</date>.</accessrestrict>
  <userestrict>Some free text Conditions Governing Use note.</userestrict>
  ...
</ead> 


Look familiar? Now, I'm not trying to bash Dublin Core or EAD and, certainly, some rights statements are better than no rights statements. However, there's a big issue I see with these examples. Because they are free text fields, they aren't really machine-actionable. Of course, an institution can try to be as prescriptive as possible about how to fill these out (and, in fact, we are one of those institutions), and that can arguably lead to rudimentary machine-actionability, but over time it's hard to make the case that this is a sustainable approach. It's just too hard to keep folks entering metadata on the straight path. Eventually, dates get entered in all kinds of formats and non-standard text proliferates. DPLA knows this perhaps like no other institution, because they get rights statements from institutions all over the US: "For DPLA content alone, contributing organizations have used over 87,000 different rights statements," part of the motivation behind rightsstatements.org. This variety makes it hard for a computer to predict or parse and, ultimately, enforce rights statements. 

So, for example, you can say that a particular component of a collection has an "Executive Restriction until February 5, 2036," and while that works just fine for the reference archivist or researcher in your reading room looking at a finding aid or box list, if you're talking about digital content you still have to tell the repository separately (in our case, we "e-mail José") to embargo this content for 20 years, and you still have to go back in 20 years and update the original rights statement because it has expired. Human error abounds in these types of situations, and I'm sure we're not alone in having had a researcher let us know that a particular restriction has expired. To add a layer of complexity, "Executive Restriction," when taken out of the context of a University of Michigan collection at the Bentley Historical Library, doesn't mean much to a researcher halfway across the world who might be accessing our content through something like DPLA. To put on my librarian hat for a second, free text fields also don't lend themselves to explicit identification with standard licenses such as Creative Commons, but perhaps that's hardly ever the case for archival material.

PREMIS Rights Statements, by contrast, were designed to be machine-actionable and interoperable between software systems. They do allow for some free text explanation of what would otherwise probably be a human-unreadable rights statement, but first and foremost PREMIS Rights Statements are about solving the problem I outlined above.

PREMIS Rights Statements

PREMIS rights statements provide a flexible framework for describing both rights ("entitlements allowed to agents by copyright or other intellectual property law") and permissions ("powers or privileges granted by agreement between a rights holder and another party").

The data dictionary outlines four types of rights, or rights bases, to which permissions can be linked: copyright, license, statute and other (for everything else, including all those weird donor requests). Acts get applied to a basis, and restrictions can refine an act. A simple example is given in Implementing the Rights Entity in Archivematica by Evelyn McClellan, a draft chapter for an upcoming book on PREMIS we got to sneak a peek at.
Thus, a simple rights statement in a PREMIS implementation can consist of a rightsBasis (such as copyright); an act (such as replicate), and information about any restriction on the act (for example, replication permitted only for the purpose of making preservation copies).
The other cool thing about PREMIS is that you can apply multiple rights bases and associated acts and restrictions to a given digital object. For example, you may have an institutional policy (rightsBasis is Other) that allows you to have acts like migrating or deleting an item, while your transfer of copyright (rightsBasis is Copyright) allows you to have acts like modifying, publishing or disseminating an item. You can sort of do this in DACS and EAD since there are spots for both Conditions Governing Use and Access, but in those scenarios you only compound the issue I outlined above since now a machine won't be able to understand multiple human-readable rights statements.
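To make the multiple-bases idea a bit more concrete, here is a rough sketch of the example above as data a machine could query. The class and field names are my own shorthand rather than PREMIS semantic units, and the notes are invented; the bases and acts are the ones from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RightsGranted:
    act: str
    restriction: str = ""  # empty means no restriction recorded for this act


@dataclass
class RightsStatement:
    basis: str  # "copyright", "license", "statute" or "other"
    granted: List[RightsGranted] = field(default_factory=list)
    note: str = ""


def bases_permitting(act: str, statements: List[RightsStatement]) -> List[str]:
    """List the rights bases under which a given act has been recorded."""
    return [s.basis for s in statements if any(g.act == act for g in s.granted)]


# Two statements attached to the same (hypothetical) digital object.
statements = [
    RightsStatement(
        "other",
        [RightsGranted("migrate"), RightsGranted("delete")],
        note="Institutional policy",
    ),
    RightsStatement(
        "copyright",
        [RightsGranted("modify"), RightsGranted("publish"), RightsGranted("disseminate")],
        note="Transfer of copyright",
    ),
]

print(bases_permitting("migrate", statements))
print(bases_permitting("disseminate", statements))
print(bases_permitting("replicate", statements))
```

Because each act sits under an explicit basis, a system can answer "are we allowed to do X, and on what grounds?" without anybody having to read a note.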

Other Considerations

As part of our work on the grant project, I mentioned that we've been thinking a lot about PREMIS Rights. Here are some questions we're asking: 
  • Could we use PREMIS Rights Statements for access? According to the data dictionary, this is slightly out of scope for PREMIS and PREMIS Rights Statements, as the latter are meant to apply more to preservation actions. However, it seems feasible and desirable to use granular PREMIS Rights Statements to tell our repository, for example, to embargo or not embargo a particular digital object, or to restrict access to a particular IP address. Artefactual has expressed interest in developing this functionality between Archivematica and AtoM, and we're interested in being able to record PREMIS Rights Statements in Archivematica, manage them in ArchivesSpace, and have those be acted upon by DSpace or Fedora/Hydra, for example.
  • Is anybody using PREMIS Rights Statements in the way I just described? That is, is anyone actually using PREMIS Rights Statements to tell a repository what to do, particularly if those rights statements were recorded in a separate digital preservation or archival management system (particularly if those are Archivematica and ArchivesSpace, respectively)? So far as we know, the answer to that question is negative. If you are, let us know! We want to steal your stuff.
  • Can we pass PREMIS Rights Statements from Archivematica to ArchivesSpace for subsequent maintenance of PREMIS Rights Statements? We want ArchivesSpace to be the system of record for this type of information, and we may sometimes need to edit rights statements, for example, when a restriction is based on donor death date and the donor is living when we process the collection. So far, the answer to this question is also negative; rights in ArchivesSpace are PREMIS-ish (they have a spot for rightsbasis but associated acts are one-to-one not one-to-many), while rights in Archivematica are fully PREMIS. Basically, there's nowhere (at least not yet, but there are rumors that this is changing!) to put this information in ArchivesSpace.
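As a thought experiment for the first question, here is roughly what "telling the repository to embargo" could boil down to if a restriction carried a structured end date instead of a sentence. Nothing here is an existing Archivematica, ArchivesSpace or DSpace feature; the class, the "disallow" value and the rule that an embargo lifts on its end date are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AccessRestriction:
    act: str                  # the act being restricted, e.g. "disseminate"
    restriction: str          # "disallow" in this sketch; vocabulary is assumed
    end_date: Optional[date]  # None means the restriction is open-ended


def is_embargoed(r: AccessRestriction, today: date) -> bool:
    """Decide whether content is still embargoed on a given day.

    Assumption: the embargo lifts on the end date itself.
    """
    if r.restriction != "disallow":
        return False
    return r.end_date is None or today < r.end_date


# The "Executive Restriction until February 5, 2036" example, as data.
executive = AccessRestriction("disseminate", "disallow", date(2036, 2, 5))

print(is_embargoed(executive, date(2016, 2, 5)))
print(is_embargoed(executive, date(2036, 2, 5)))
```

With something like this, nobody has to remember to go back in 20 years; the repository can re-evaluate the restriction every time someone asks for the file.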

Conclusion

That's about it for this Primer on PREMIS and PREMIS Rights! It's a Brave New World out there for rights statements and machine-actionability!

As an FYI, PREMIS is constantly being updated. Just this week there was a call to review the controlled vocabulary for preservation events. Let them know what you think!