When I began working on the digital archive of Blake/An Illustrated Quarterly, the vast scale of the project—over 2000 articles in multiple formats (PDF and either XML or HTML) published across nearly 50 years—made systemic corrections and adjustments very difficult. For example, I found that we needed to format each issue’s table of contents differently from regular tables, but the tables of contents were not labeled or distinguished in the XML in any way. I went through nearly 200 issues, labeling each table of contents by hand.
To take an example on an even larger scale, when I joined the project, each XML file represented an entire issue (not an individual article), and no tag or label distinguished an article from smaller units (a section of an article) or from larger units (a section of the issue including, for example, book reviews). This not only meant that an article could not be displayed on a separate page from other articles in the same issue, but also that search results could only tell you whether keywords appeared in a particular issue of the journal, not the title of the article in which the keywords appeared. When we decided that this was not an acceptable way to present the journal to users, the other members of the team and I manually split the issue files into their component articles and tables of contents.
Some tasks on this scale had to be done by hand, in part because they called for human discernment. But many tasks could be automated. For instance, I found that some tables had been marked up without a line break or space between cells. When the search system stripped the tags, it combined words from the end of one cell and the beginning of the next. For example, a cell ending in “color” appeared next to a cell beginning “Thel,” and the search index stored the words as the single entity “colorThel,” so that a search for either word would not return the article in which they appeared. A relatively simple PHP script identified all of the XML files containing cell tags with no space between them and added line breaks, saving what would otherwise have been many hours of manual searching and correcting. A keyword search now correctly finds “color Thel” in this article.
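A minimal sketch of such a script might look like the following. The cell element name and the articles/ directory here are stand-ins rather than the project’s actual markup and file layout, but the pattern of matching adjacent cell tags with nothing between them is the essential idea:

```php
<?php
// Insert a line break between table cells that touch with no whitespace,
// so that stripping the tags no longer fuses words like "colorThel".
// The <cell> tag name and the articles/ directory are illustrative only.

$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('articles/', FilesystemIterator::SKIP_DOTS)
);

foreach ($files as $file) {
    if ($file->getExtension() !== 'xml') {
        continue;
    }

    $xml = file_get_contents($file->getPathname());

    // Match a closing cell tag immediately followed by an opening cell tag.
    $fixed = preg_replace('/(<\/cell>)(<cell[ >])/', "$1\n$2", $xml, -1, $count);

    if ($count > 0) {
        file_put_contents($file->getPathname(), $fixed);
        echo $file->getPathname() . ": added $count line break(s)\n";
    }
}
```

Because the file is rewritten only when the pattern actually matches, the script’s output doubles as a list of exactly which files were affected.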
In some cases, human and script can collaborate productively. For example, our team identifies the black-and-white images in the BIQ articles and provides high-quality color versions from the Blake Archive (or, for images not in the Archive, such as works not by Blake, we crop images from the scanned issues). This requires discernment and cannot be automated. However, I have written scripts that identify images which have not been fully processed: either image tags in the XML with no file path, or tags whose listed file path does not correspond to an actual image file on the server. These scripts have enabled the team and me to track our progress and find the remaining articles that require image processing.
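A sketch of that check might run along these lines, assuming each article’s XML uses an image element with a src attribute and that processed images sit in a single directory; all of the names here are illustrative rather than the project’s actual schema:

```php
<?php
// Report image tags that still need processing: either no file path has
// been entered, or the listed path does not match a file on the server.
// The <img> tag, src attribute, and directory names are illustrative only.

$imageRoot = 'images/';
$pending   = [];

foreach (glob('articles/*.xml') as $path) {
    $doc = new DOMDocument();
    $doc->load($path);

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));

        if ($src === '' || !file_exists($imageRoot . $src)) {
            $pending[$path][] = ($src === '') ? '(no file path)' : $src;
        }
    }
}

foreach ($pending as $path => $images) {
    echo $path . "\n";
    foreach ($images as $image) {
        echo "    " . $image . "\n";
    }
}
```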
Similarly, we have found a number of misspellings in the transcribed text of BIQ articles. By comparing the transcription with a PDF of the original, we can identify errors introduced in transcription (including obvious mechanical OCR errors such as “$idden” for “hidden”). In other cases, the error is in the original printed journal (as in an author bio, corrected in the following issue, that printed historic “sties” for historic “sites”); in these cases, we make an emendation linked to an emendation note explaining the change. All this depends on the discerning eye and mind of the scholar. But the biggest problem, as noted before, is the sheer volume. Using a spell-checking tool article by article means encountering the same words over and over: correct words not in the spell-checking dictionary, such as names, foreign words, and Blake’s idiosyncratic spellings. To solve this problem, I wrote a script that lists every unique word appearing in our transcriptions of the print issues. When one runs a spell check on this list, one encounters each word only once and can skip over correct names and foreign words a single time rather than repeatedly. When we identify a word requiring correction, we can simply search for the misspelled word to find the articles in which it appears. Thus human and machine each play to their strengths, and we can find and correct misspellings more easily and quickly.
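The word-list script can be sketched as follows; the file locations, output file, and the exact rule for splitting words are stand-ins for whatever the real script does:

```php
<?php
// Build a list of every unique word in the transcriptions, so a spell
// check encounters each word once rather than once per article. The
// directory, output file, and word-splitting rule are illustrative only.

$words = [];

foreach (glob('articles/*.xml') as $path) {
    // Strip the markup, keeping only the transcribed text.
    $text = strip_tags(file_get_contents($path));

    // Split on anything that is not a letter or an apostrophe, keeping
    // case so that names and unusual spellings survive intact.
    foreach (preg_split("/[^A-Za-z']+/", $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $words[$word] = true;
    }
}

$unique = array_keys($words);
sort($unique);

file_put_contents('unique-words.txt', implode("\n", $unique));
echo count($unique) . " unique words written to unique-words.txt\n";
```

Once a misspelling turns up in the list, a search for that word points back to the specific articles that need the emendation.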