Friday, January 24, 2014

Charting Former Owners of Penn's Codex Manuscripts

Today is the American Library Association midwinter meeting LibHackathon here at the Penn Libraries. I thought I'd share a project using library data that I've been working on for a little while now, in the hope that it will not only be useful to scholars but might also generate some conversation about how libraries and archives distribute their valuable descriptive information.

In short, this piece is all about how we get to this:

Network diagram of Penn codex manuscripts and former owners
From this:

MARC record for UPenn Ms. Codex 465

Over the years and especially here at Penn I've been fortunate enough to work with a number of catalogers in both special and general collections. I can't think of a more under-appreciated part of the scholarly community. I've seen first-hand how much time, energy, and bibliographic skill goes into the description of texts and objects of all kinds. I've heard heated debates over whether one piece of information or another should go into one of the million-and-one MARC fields. What comes out of the other side of this process should be a goldmine of easily usable truly 'big' bibliographic data. Instead, I think it's safe to say that 99% of library users have no idea why one might want to search the 752 field instead of the 260 field for place of publication. Moreover, this is hardly the sole fault of users. Try searching any library online catalog for just information from subfield c of field 300 and see how far you get! So much structured data ignored and thousands of hours of cataloger effort hidden from the world [1].

Fortunately the data is there if you know how to find it [2]! I've been playing around with our catalog data at Penn for a while now and decided a few weeks ago that I wanted an easy way to visually display networks of provenance in our manuscript collection. Penn has a deep commitment to provenance and book history and for my money our catalogers have done some of the richest work in describing provenance of any manuscript collection I've seen. The Kislak Center here at the Penn Libraries currently has cataloged around 1,640 codex manuscripts (manuscripts bound in book form) as well as around 300 codex manuscripts from the Lawrence J. Schoenberg collection [3]. I knew from experience that most of these had detailed descriptions of former ownership in their online catalog records and it seemed reasonable to just download them all and make a quick visualization of who owned which manuscripts in common.

I realize now that this task would have been nearly impossible at most libraries, where the online catalogs and back-end databases don't easily allow public users to batch download full records. Fortunately at Penn all of our catalog records are available in MARC-XML form, which looks something like this:

I knew that our catalogers were keen on including structured data about former owners in the 700 field with a "former owner" phrase after their name. It was easy enough to download a list of all of the manuscripts that possessed this field. Then, after some much needed coaching from Dot Porter, the Kislak Center's XML guru and medievalist extraordinaire, I was able to write an XSL transformation which would spit out just what I wanted. At first glance, though, I didn't turn up nearly as many results as I'd hoped and I seemed to be missing a lot of data. Looking through the records I saw that, on the plus side, the 700 field was highly structured with authorized name headings. On the minus side, it didn't always incorporate all of the rich narrative information in the 561 field (labeled "provenance" in our public catalog). For example, an owner like Sir Thomas Phillipps would have his name included in the 700 field but the auction house which sold the manuscript would appear only in the 561. This is for very good reasons ("Sotheby's" is rarely a "former owner") but I really wanted to know everything about a text, so I moved on to extracting every 561 field from the manuscripts. Instead of nice, neat authorized names, I of course got a lot of fascinating narrative:
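For readers who would rather not write XSLT, the same kind of extraction can be sketched in Python with the standard library. This is not the transformation described above, just an illustration under stated assumptions: the sample record is invented, the function names are hypothetical, and the code simply treats any 700 field with a subfield containing the phrase "former owner" as an owner entry. Real records may need more careful handling.

```python
import xml.etree.ElementTree as ET

# Standard MARC-XML ("MARC21 slim") namespace.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record for illustration only; not a real Penn record.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="561" ind1=" " ind2=" ">
    <subfield code="a">Sold at Sotheby's. Formerly owned by Sir Thomas Phillipps.</subfield>
  </datafield>
  <datafield tag="700" ind1="1" ind2=" ">
    <subfield code="a">Phillipps, Thomas,</subfield>
    <subfield code="e">former owner.</subfield>
  </datafield>
  <datafield tag="700" ind1="1" ind2=" ">
    <subfield code="a">Example, Person,</subfield>
    <subfield code="e">scribe.</subfield>
  </datafield>
</record>"""


def former_owners(record):
    """Return names from 700 fields that carry a 'former owner' phrase."""
    owners = []
    for field in record.findall("marc:datafield[@tag='700']", NS):
        subfields = [(sf.get("code"), sf.text or "")
                     for sf in field.findall("marc:subfield", NS)]
        if any("former owner" in text.lower() for _, text in subfields):
            # Assumption: the name heading lives in subfield a.
            name = next((text for code, text in subfields if code == "a"), "")
            owners.append(name.strip(" ,."))
    return owners


def provenance_notes(record):
    """Return the text of every 561 field as one string per field."""
    return [
        " ".join((sf.text or "").strip()
                 for sf in field.findall("marc:subfield", NS))
        for field in record.findall("marc:datafield[@tag='561']", NS)
    ]


record = ET.fromstring(SAMPLE)
print(former_owners(record))
print(provenance_notes(record))
```

In the invented record, the auction house shows up only in the 561 note and not among the 700 names, which is the gap between the two fields described above.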

Provenance note for UPenn Ms. Codex 234
I broke each of these lines of narrative into sentences and began the arduous work of identifying each owner in a chain of provenance uniquely. After some maddening time using OpenRefine, regular expressions, and plain copying and pasting I got a list I was happy with. In the end I came up with 3,252 manuscript/provenance pairs, like so:
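To give a rough sense of what the sentence-splitting and name-matching step involves, here is a small Python sketch. It is only an approximation of the process: the real work relied on OpenRefine and a good deal of manual cleanup. The alias table, the shelfmark, and the provenance note below are all made up for illustration, and the naive split on periods would trip over abbreviations in real catalog notes.

```python
import re

# Hypothetical alias table: name variants (lowercased) mapped to one
# canonical label. In practice a table like this is built by hand.
ALIASES = {
    "sir thomas phillipps": "Phillipps, Thomas",
    "thomas phillipps": "Phillipps, Thomas",
    "sotheby's": "Sotheby's",
}


def split_sentences(note):
    """Naively split a note where a period is followed by a capital letter."""
    parts = re.split(r"(?<=\.)\s+(?=[A-Z])", note)
    return [p.strip() for p in parts if p.strip()]


def owners_in(sentence):
    """Return canonical names whose variants occur in the sentence."""
    lowered = sentence.lower()
    found = []
    for variant, canonical in ALIASES.items():
        if variant in lowered and canonical not in found:
            found.append(canonical)
    return found


def provenance_pairs(ms_id, note):
    """Build (manuscript, owner) pairs from one provenance note."""
    return [(ms_id, owner)
            for sentence in split_sentences(note)
            for owner in owners_in(sentence)]


# Invented shelfmark and note, for illustration only.
note = "Formerly owned by Sir Thomas Phillipps. Sold at Sotheby's in 1970."
print(provenance_pairs("Ms. Codex 0000", note))
```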

Of our 1,640-odd codices (including two ms. rolls, because: why not), 1,285 had at least some provenance data recorded, as did 265 of the 311 Schoenberg manuscripts we've cataloged. Out of these I was able to identify 985 "unique" entities through whose hands our manuscripts had passed. More interestingly, 225 of these owners had formerly been in possession of two or more of our manuscripts.
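Counts like these fall out of the manuscript/provenance pairs fairly directly. A minimal sketch in Python, using invented placeholder pairs rather than the actual data, and a hypothetical `summarize` helper:

```python
from collections import defaultdict

# Placeholder pairs for illustration; the real list has 3,252 rows.
pairs = [
    ("ms-A", "Owner 1"), ("ms-A", "Dealer X"), ("ms-B", "Dealer X"),
    ("ms-C", "Owner 2"), ("ms-C", "Dealer X"), ("ms-B", "Owner 2"),
    ("ms-A", "Owner 1"),  # duplicate rows should not inflate the counts
]


def summarize(pairs):
    """Return (manuscript count, owner count, owners of 2+ manuscripts)."""
    by_owner = defaultdict(set)
    for ms_id, owner in pairs:
        by_owner[owner].add(ms_id)
    manuscripts = {ms_id for ms_id, _ in pairs}
    repeat_owners = {owner: sorted(mss)
                     for owner, mss in by_owner.items() if len(mss) >= 2}
    return len(manuscripts), len(by_owner), repeat_owners


n_manuscripts, n_owners, repeat_owners = summarize(pairs)
print(n_manuscripts, n_owners)
print(repeat_owners)
```

Using a set per owner means a manuscript that mentions the same owner in several sentences is still counted once.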

Past possessors of Penn's manuscript codices in yellow with individual manuscripts in grey. (Gephi network diagram rendered with sigma.js.) [Full Screen View]
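One common way to get pairs like these into Gephi is a plain CSV edge list. The sketch below assumes Gephi's spreadsheet import with `Source`, `Target`, and `Type` columns. The pairs are placeholders, the function name is hypothetical, and this is not necessarily how the diagram above was prepared.

```python
import csv
import io


def edge_list_csv(pairs):
    """Serialize (manuscript, owner) pairs as an undirected CSV edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type"])
    # Deduplicate and sort so the output is stable between runs.
    for ms_id, owner in sorted(set(pairs)):
        writer.writerow([owner, ms_id, "Undirected"])
    return buffer.getvalue()


# Placeholder pairs for illustration only.
sample_pairs = [("ms-A", "Dealer X"), ("ms-B", "Dealer X"), ("ms-A", "Dealer X")]
print(edge_list_csv(sample_pairs))
```

Each owner and each manuscript becomes a node, so an owner linked to several manuscripts shows up as a hub in the resulting graph.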

The historical strengths of our collection and Penn's institutional history can be seen pretty clearly here at the center of the cluster. Our codices primarily come from European and American collections as mediated by the prominent dealers and auction houses of London, New York, Philadelphia, Paris, Florence, and Munich. In addition we have received several very large collections over the years including the Gondi-Medici collection via the dealer Bernard Rosenthal and the recent gift of the Lawrence J. Schoenberg collection.

Center Cluster showing a variety of donors, bookdealers, and auction houses

Thursday, January 2, 2014

Linking Archival Sources in the 2013 AHR

With the annual conference of the American Historical Association (AHA) starting today I'm excited to see friends and hear some great papers. I'm always struck by just how broad a field 'history' represents, yet how often historians are able to make connections to each other's work, even when far removed temporally and geographically. In reading the AHA's flagship journal, The American Historical Review (AHR), this year I especially enjoyed seeing places where seemingly unconnected articles spoke from similar frames of reference and, most interestingly, from overlapping source bases (be sure to check out my Penn colleague Vanessa Ogle's great article on the history of time reform!).

Authors of articles in the 2013 AHR connected by commonly used archives

As this site indicates, I'm very interested in tracking the circulation of texts, ideas, and archives over time as well as how these sources are used by scholars. Tracking networks of citation is nothing new and has been a favorite activity of scholars for centuries, but recently there's been a surge of interest in quantitative analysis of academic citation patterns. Most of this interest has been in the sciences and social sciences, where "impact factors" (put simply, the quantity and importance of articles citing one's work) are de rigueur in weighing scholarly merit. Though I'm wary of many of the developments in this "bibliometrics" field, some of the more useful advances have been in using data about authorship and citation to show the material ways fields are constructed, i.e. the influence of certain universities, graduate programs, or scholars in a specific sub-discipline. Here at Penn, for instance, my colleagues at the library have helped the School of Medicine and others to create a way to view co-authorship networks of particular researchers.

Though tracking citation of articles and secondary sources in a journal like the AHR would really illuminate networks of influence, interest, and argument, I'm more interested in how historians use archival sources. This is especially important given that the bibliometric wizards at big publishing companies like Elsevier and ProQuest have done a decent job of figuring out article and book citations and linking them together, but have had much less success with archival sources.

I extracted data on archival sources from 16 of the 17 feature articles in the five AHR issues for 2013 [1]. The authors of these pieces did not disappoint, citing 66 different archives and libraries located in 54 different cities from Berkeley to Sarajevo to Zanzibar [2].

Location of Archives and Libraries cited in 2013 AHR articles [Interactive map]
Despite disparate topics and the relatively random assortment of scholars and articles across the year's issues (as far as I can tell none of the articles were grouped in 'theme' issues) there were several nodes of archival overlap.

Archives used by multiple 2013 AHR authors
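The overlap behind these diagrams is simple to compute once each article's archives are listed. Here is a hedged Python sketch with placeholder authors and archives (not the 2013 data) and hypothetical function names. It links two authors whenever they cite at least one archive in common and lists the archives cited by more than one author.

```python
from itertools import combinations

# Placeholder data for illustration: author mapped to archives cited.
citations = {
    "Author A": {"Archive 1", "Archive 2"},
    "Author B": {"Archive 2", "Archive 3"},
    "Author C": {"Archive 4"},
    "Author D": {"Archive 1", "Archive 2"},
}


def shared_archive_edges(citations):
    """Map each author pair to the sorted archives they both cite."""
    edges = {}
    for a, b in combinations(sorted(citations), 2):
        common = citations[a] & citations[b]
        if common:
            edges[(a, b)] = sorted(common)
    return edges


def archives_with_multiple_authors(citations):
    """Map each archive cited by 2+ authors to those authors."""
    by_archive = {}
    for author, archives in citations.items():
        for archive in archives:
            by_archive.setdefault(archive, []).append(author)
    return {archive: sorted(authors)
            for archive, authors in by_archive.items() if len(authors) >= 2}


print(shared_archive_edges(citations))
print(archives_with_multiple_authors(citations))
```

An author whose archives nobody else cites (Author C in the placeholder data) simply has no edges and would sit as an isolated node in the network.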

Obviously one year of the AHR is a pretty weak sample but I suspect the pattern established would hold across a wider swath of the journal: an impressive array of geographically dispersed archives reflecting the focus of particular authors, alongside a concentration of overlapping citation from the major state and university archives and libraries of Europe and North America. Along these lines I would be curious to see how the influence of particular archives has waxed and waned over the years in the profession. I imagine that a select number of repositories (NARA, the UK National Archives, the British Library, the Library of Congress, the BN in Paris, various German archives, etc.) have long been dominant across geographic and temporal fields given the institutional makeup of the historical profession, but I would also be surprised if the dominance of these central archives hasn't decreased given methodological and theoretical shifts in the discipline since the 1970s.