Upload

White Paper: Uploading to Internet Archive

By Joel Richard (Smithsonian Libraries)

Overview

So you want to contribute to the Internet Archive?

This document is aimed at the population of users and institutions that are scanning items to ultimately be shared with the world through the Internet Archive (IA). These may be public domain videos, recording, books or other items. For the sake of discussion and because this is primarily what we do at the Smithsonian Institution Libraries (the Libraries), this document will discuss digitized books. We’ll refer to these as “items”, just to be as generic as possible.

Rest assured that the techniques for uploading books are generally the same for other types of things with two exceptions:

Your metadata that identifies the item will be different.
The actual files that you upload may differ from the TIFF or JPEG 2000 files used for books.

This document is the result of our learning and experiences in sending items to IA reliably and in a timely fashion. We hope that the information here can help others jump-start their projects with a minimum of trouble.

Finally, although this document describes our own experiences, you may encounter different situations, especially if you are uploading non-book items. There will be a different learning curve if you encounter difficulties in getting an audio or video file to derive, for example.

Before You Begin

You must have an account with the Internet Archive.
You must be an admin of the collection you are adding your items to – all items must be part of at least one collection.
You will need something to upload, scans of pages, some audio or video file(s), etc.
You will need some metadata, in very specific formats (more about this later.)

Use S3, FTP is going out of style

Our understanding is that there was (or still is) a method for uploading items to IA through FTP. We believe that IA is actively moving away from FTP as it’s probably not as reliable as using the REST-based S3 system. We did not get definite confirmation on this from IA, but one of its documents indicates that IA is using the same (or similar) software that Amazon uses for its Simple Storage Service (S3).

Furthermore, existing modules for PHP, Perl, Ruby, Python, etc. that are built to interface with Amazon’s S3 service may be modified to work with IA simply by changing the base URL that the module calls (amazonaws.com changes to us.archive.org). We did not go this route and chose instead to use command-line based tools to talk to the S3 servers. The reasons for using the command-line will be discussed later.

There is no API

Lastly, as much as we might wish it were not so, IA does not have an API as we are familiar with them. So we resort to using a few well-crafted URLs and scan the resulting pages for bits of text that give us the answer to our question. These instances will also be discussed in detail later.

Background

A quick background on what we do here at the Libraries will help when we talk about what we are sending to IA. To summarize, we are scanning out-of-copyright books printed largely before 1923.

Scanning

The bibliographic or title level information of the book is placed into a queue. When the book arrives at the scanning center, the pages are photographed using a high-resolution digital camera and the resulting RAW files are converted to TIFF and passed to a piece of software. The software allows the user to enter in details about the page (page numbering, type, etc.) When the user has completed this process, the software packages the scanned pages, creates derivative images to be sent to IA, XML files, and metadata that IA requires.

Automation

The process is automated as much as possible to allow humans to focus on what’s important: operating the camera, examining and evaluating a page, handling errors and exceptions. The rest of the process happens behind-the-scenes and automatically.

Our Process

From this point forward, we will assume that our script has a number of JPEG2000 images and access to some item-level metadata, probably in a database and we are ready to upload this item to the Internet Archive. We made some decisions along the way and we will try to clearly explain our reasoning for the choices we made.

Getting an identifier

Probably the most critical part of the process is deciding on or generating the IA identifier for the item. The consequences of getting this incorrect range from inconsequential (an error trying to create the item) or catastrophic (completely replacing one item with another.) We have experienced both. It is worth noting that the possibility to overwrite only relates to item that belong to you, but it’s still troublesome when it happens.

The general form of the identifier that IA tends to use is something that looks like the following:

Revisiondelgene00Burm

The suggested format for the identifier is found in the (FAQ) and take the form of Title-VolumeNumber-Author containing up to the first 16 characters of the title, the number of the volume, followed by the first four characters of the author’s last name. This is the recommended scheme and has a tendency to create unique identifiers without much work. In cases where the identifier already exists, we append an increasing letter suffix to the end. Therefore if Revisiondelgene00Burm exists, we’ll use Revisiondelgene00BurmA. If Revisiondelgene00BurmA exists, we try Revisiondelgene00BurmB and so on.

In addition to all punctuation and spaces in the title, we strip out inconsequential words (a, and, or, the, of, etc.) before creating the identifier, but this is a personal preference. Additionally, we stick to ASCII for our identifier, preferring not to use diacritics and UTF8 strings. We developed a PHP function to convert down from UTF8 to ASCII when encountering an accented character, for example. We recognize that this is only applicable for languages based on the Roman alphabet. We are uncertain as to how the Internet Archive is handling other languages such those of Russia, or Asia with respect to identifiers. In many cases, a simple barcode or numeric identifier works, but is not terribly meaningful for humans.

In the S3 way of doing things, the identifiers described here are also used to name the “bucket” in the S3 system. We will use this term on occasion in this document, but simply put a bucket holds the files for an item.

Checking that an identifier exists

From the outset, we knew that we would need to check at IA to be sure that we weren’t about to attempt to overwrite something that was already there. Although IA will raise an error when trying to create the bucket, we need the name of the identifier as early in the process as possible given that the name of the files and the file system structure used by our script depend on the identifier.

To check for an identifier, we use the following URL:

https://archive.org/services/check_identifier.php?identifier=Revisiondelgene00Burm

This will return one of the two following XML responses indicating whether the identifier is available for use. Check the CODE attribute of the RESULT element for a computer-friendly response code.

<?xml version="1.0" encoding="utf-8"?>
<result type="success" code="available">
  <message>The identifier you have chosen is available</message>
  <identifier>Revisiondelgene00BurmTest</identifier>
</result>

…or…

<?xml version="1.0" encoding="utf-8"?>
<result type="success" code="not_available">
  <message>The identifier you have chosen is not available</message>
  <identifier>Revisiondelgene00Burm</identifier>
</result>

This method is case insensitive and is therefore the most reliable method of checking for an identifier.

Account Information

To create an account for uploading and checking the status of your uploads, go to http://www.archive.org/account/login.createaccount.php. To upload using the cURL technique, you’ll need your access key and secret code, which are found here: http://www.archive.org/account/s3.php.

Getting your account set up, creating or being linked to a collection will be left as an exercise for the reader. To do this, you’ll need to contact the Internet Archive to take care of this or coordinate with the collection’s owner.

Mandatory Metadata

There are only a few files that IA needs for a text-based item: title-level metadata (such as a MARCxml record), item-level metadata, and page-level metadata. The item-level metadata is the first we will discuss. When looking at all of the files (http://www.archive.org/download/Revisiondelgene00Burm) for an item, the item-level information resides in the IDENTIFIER_meta.xml file.

(As a side note, to get to the list of all files, start at the Details page for an item and click the All Files: HTTP link.)

Generally speaking, the metadata that we describe here is an acceptable baseline of what is required. Fewer data elements may be allowed, especially in the XML files, but we have not gone so far as to experiment and try to reverse-engineer the process to determine which are absolutely required. You may try this out on your own, but keep in mind that the log files are verbose and often cryptic and may not explicitly say “variable XYZ not found.” (We’ll discuss log files in a later section.)

The metadata contains all of the title level information about the item, including the title, author, publisher, copyright information, digitizing sponsor, date published, type of item, and who originally uploaded it. IA may also update this XML file with information as it processes the pages of the item.

Although we do upload files through the S3 system, the meta.xml entries are sent as HTTP headers, which are then processed by IA into the meta.xml file. These HTTP headers are sent up with the first file we upload to IA, which is usually the IDENTIFIER_scandata.xml file.

The headers look like the following:

x-archive-meta-collection: biodiversity
x-archive-meta-contributor: Smithsonian Institution Libraries
x-archive-meta-creator: Burmeister, Hermann
x-archive-meta-curation: [curator]biodiversitylibrary.org[/curator][date]20101122123714[/date][state]approved[/state]
x-archive-meta-date: 1883]
x-archive-meta-identifier: Revisiondelgene00Burm
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-possible-copyright-status: This image is in the public domain.
x-archive-meta-publisher: s. n.
x-archive-meta-sponsor: Smithsonian Institution
x-archive-meta-title: Revision del género Ecpantheria
x-archive-meta-uploader:[email address redacted for privacy reasons]
x-archive-meta-year: 1883]
x-archive-meta00-description: Caption title.
x-archive-meta01-description: Reprinted from Anales del Museo Publico de Buenos Aires, t. 3.

Certain elements are mandatory. We’re pretty sure these are the following, or we find it’s better to include them:

x-archive-meta-collection
x-archive-meta-contributor
x-archive-meta-identifier
x-archive-meta-mediatype

Additionally, there are certain rules for sending up duplicate entries, which you can see in the meta00-description headers. These are described in the http://www.archive.org/help/abouts3.txt file.

Page Level Information

The second file that IA needs describes the individual pages of the book. This is the IDENTIFIER_scandata.xml file. The scandata contains information about each page of the book, its pixel dimensions, cropping and rotating instructions, page numbering, page ordering, and other simple metadata.

The scandata file is an XML file, but there is no DTD or XSD to describe the file. This is one of the challenges of working with IA. The best we can do is to make sure that the XML is well-formed and that there are no errors in the log of activity for the item after we upload the file. (Again, we’ll discuss log files later.)

The XML file has two major sections: bookData and pageData.

bookData

The bookData section contains the identifier, number of leaves (scanned images) and the DPI at which they are scanned, and the pageNumData. The DPI information is used as a clue to the OCR engine to help when scanning the images for text.

In this section, it’s important to note that the term “leaf” means one scanned image of one side of a sheet of paper of the book. Page refers to the logical numbering on one side of the sheet of paper. (Conventional language usually treats leaf to be one sheet of paper and page as one side of a leaf. Page may also mean the entire leaf as well.)

The pageNumData section lists page number assertions for the book. This is really a list of where explicit (or implicit) page numbering starts and ends within the book. For example:

<pageNumData>
    <assertion>
        <leafNum>8</leafNum>
        <pageNum>4</pageNum>
    </assertion>
    <assertion>
        <leafNum>19</leafNum>
        <pageNum>15</pageNum>
    </assertion>
    <assertion>
        <leafNum>24</leafNum>
        <pageNum>2</pageNum>
    </assertion>
    <assertion>
        <leafNum>58</leafNum>
        <pageNum>36</pageNum>
    </assertion>
</pageNumData>

On leaf 8, numbering starts with “Page 4” and counts up to “Page 15” on leaf 19 where numbering stops. There is no page numbering on leaves 20-23 (inclusive.) Page numbers resume with “Page 2” on leaf 24 and continue through to “Page 36” on leaf 58. Page numbers can be entered in any way to accommodate the variety of numbering schemes we encounter in books.

Keep in mind that the assertions must appear in pairs indicating the start and stop leaves and numbers. Admittedly, the workflow at the Libraries does not always ensure an even number, but we have found no real adverse effect to having an odd number of assertions. The only expected effect would be to have the numbering continue through to the end of the file, but we have yet to verify if this is the case.

pageData

The second section of the scandata.xml file is the pageData containing the detailed metadata about the pages. One entry in the pageData looks like the following:

<page leafNum="27">
    <pageType>Normal</pageType>
    <addToAccessFormats>true</addToAccessFormats>
    <origWidth>4437</origWidth>
    <origHeight>6100</origHeight>
    <cropBox>
        <x>0</x>
        <y>0</y>
        <w>4437</w>
        <h>6100</h>
    </cropBox>
    <pageNumber>39</pageNumber>
    <handSide>RIGHT</handSide>
    <year>1883</year>
</page>

Only some of these entries are simply descriptive. Others affect how the item appears when displayed in the Read Online feature on IA.

The pageType is used when listing the pages of the book to give the user an indication of the type of page when selecting it. Additionally, a pageType of “Title” will be the first page that is shown when opening the book in the Read Online feature. A “normal” page type is the default. Other choices are “Illustrations”, “Blank”, “Cover”, “Map”, “Issue Start”, “Issue End”, “Color Card”, “White Card”, “Tissue”, and “Delete”.

The addToAccessFormats determines whether or not the page should be displayed on the Read Online feature and in the PDF, ePUB and other derivative files.

In case you don’t want to crop the images before sending them to the Internet Archive, you can send up instructions on how to have IA do this for you inside the cropBox element. The software we use at the Libraries crops before we upload, but the information is required, so we provide nominal values.

The pageNumber item is mostly self-explanatory, and is usually used to indicate the numeric portion of the number printed on the page, but it can also be used to indicate a logical (or implicit) page number that is not printed on a page but logically follows the convention on surrounding pages, such as the start of a chapter or issue that omits the page number on the first page, but explicitly numbers the subsequent pages. Examples of valid pageNumbers include:

<pageNumber>ix</pageNumber>
<pageNumber>Pl. 43</pageNumber>
<pageNumber>[24]</pageNumber>

Simply put, handSide is LEFT or RIGHT (in caps) to indicate the left- or right-hand page. This is likely a clue to IA on how to display the page in the Read Online feature.

We are able to include the publication year of the page. It isn’t necessary to include this, but it does make some sense when multiple different issues (of a journal, for example) are bound together in one physical book. Different sections of the book may have been originally published in different years. Either way, this information is easily added, so we include it. We’re not sure if this is used at all by IA, but we include it for posterity for now.

Two sets of page numbers?

You may be wondering why IA collects page numbering in two different places. This is a good question and we don’t know the answer. What we do know is that the pageNumData section is used to insert page numbers into the PDF files that IA derives from the uploaded images. There seems to be no other effect if the pageNumData section is omitted. It’s best to include it, however, even if you leave the <pageNumData></pageNumData> empty.

The pageData section is used, at least, for the Read Online feature of IA. It may be used in other places, possibly for the derivation of other file formats.

Title Level Information

The third file we send to IA is the MARC record for the item from our library catalog. This contains some repeated information from the IDENTIFIER_meta.xml file, but it often has more specific information that only comes from the library catalog. This file is mostly informational and is in the MARCXML format.

It is worth noting that the inclusion of the IDENTIFIER_marc.xml file will allow IA to create the IDENTIFIER_dc.xml file for you. Also, the MARC data will be used to populate data in the IDENTIFIER_meta.xml if that data is not already there (i.e., title, author, publication information.) We have taken the approach of including some title level information in the meta.xml headers, but recognize that some of this work is redundant.

JP2 files (original versus derived)

When it comes to uploading your images, you have a choice. You can either upload the original files, which aren’t necessarily the true originals or you can upload a set of smaller derived files and save everyone a bit of bandwidth and processing time.

Internet Archive, on their SCRIBE book scanning machines, converts camera RAW files first to lossy JPEG2000, compressed generally at a ratio of 15 to 1 (these are their ‘original masters’.) They then crop, rotate, and compress again to 20:1 – these are the final ‘processed masters’. Files are then compressed to around 120:1 for the access format. We at the Libraries initially were uploading “original” lossy JP2 files (compression quality 70) but kept running into false errors in the communication with IA. Even with the much smaller size of the JP2 images, the upload was still taking well over an hour or two to complete, which may have run into some hard timeout limit on the receiving server.

We at the Libraries decided to uploadcompressed JP2 derivatives (Compression Quality 15) that IA uses for delivering pages to the Read Online feature. These JP2 images are stored in the IDENTIFIER_jp2.zip file (as opposed to the IDENTIFIER_orig_jp2.tar file for the original images) This preempts IA from creating the derivative JP2 files and also allows it to deliver the
book via the Read Online almost immediately. The downside is that we don’t have the “original” images in a secondary, redundant location outside of our offices.

Within the .zip or .tar files the naming convention of the filenames must be very specific. The general format is IDENTIFIER_COUNT.jp2 (for JPEG200 images, for example) The identifier is the same as that of your item. (Again, this is another reason to get an available identifier before attempting to upload.) The counter is a four digit-based number that counts up without gaps in the numbering. Example:

Revisiondelgene00Burm_0001.jp2
Revisiondelgene00Burm_0002.jp2
Revisiondelgene00Burm_0003.jp2
Revisiondelgene00Burm_0004.jp2

The numbering can start at 0 or 1, and IA seems to be lenient but the numbers must correspond to those in your IDENTIFIER_scandata.xml file.

IA Processing

Once these four sets of data are sent to IA, the derive signal is given and IA begins processing the files, creating derivative files such as PDF, ePUB , Kindle, DjVu, “flippy book” animated GIF , and Abbyy OCR text. If you upload any of these files on your own, IA will not attempt to create them. In order to re-derive a file, you need to first delete it using IA’s web interface and then signal the derivation process which will find that the file is missing and attempt to regenerate it.

Monitoring

The best way to monitor the progress of your uploads is to use the My Outstanding Tasks Page, which lists all of your pending and currently running tasks as well as those that had errors. There is no programmatic way to check this, although we find that there’s no need to. To determine if there is a problem, we wait a day or two and start looking for the last of the derived files.

Verifying

When we think the derivation is completed, we go to the IA website and look for the IDENTIFIER_djvu.txt file. We have determined this to be the last file that IA creates in its derivation process, based on the timestamp of the file. The general URL of this would be of the form:

http://www.archive.org/download/IDENTIFIER/IDENTIFIER_djvu.txt

We could also check the IDENTIFIER_files.xml file to see if the file we are looking for has been created.

http://www.archive.org/download/IDENTIFIER/IDENTIFIER_files.xml

Harvesting

After the item is verified, we copy some items back. We like to keep some of the derived files handy, such as the OCR text, the EPub files, etc. Using the same list of files in the IDENTIFIER_files.xml, we can simply download the file to our server.

Using cURL to Upload

We are using the command-line version of cURL to upload to the internet archive. We found this to be a more reliable method than using the built-in PHP curl functions. Either the command was not flexible enough for our needs or we had issues with memory usage or timeouts when uploading large files.

Our PHP script is itself called from the command-line (by Linux’s cron) and therefore offers some opportunity for monitoring the upload process in that the output from curl is sent to the screen.

The exact curl functions look like the following. They have been wrapped to multiple lines for legibility and certain variables have been replaced with green capital letters.

/usr/local/bin/curl --location
    --header "authorization: LOW ACCESS_KEY:SECRET”
    --header "x-archive-auto-make-bucket:1"
    --header "x-archive-queue-derive:0"
    --header "x-archive-meta-mediatype:texts"
    --header "x-archive-meta-possible-copyright-status:No Known Copyright Issues"
    --header "x-archive-meta-contributor:Smithsonian Institution Libraries"
    --header "x-archive-meta-uploader:[email address redacted for privacy reasons]"
    --header "x-archive-meta-identifier:IDENTIFIER"
    --header "x-archive-meta-sponsor:Smithsonian Institution Libraries"
    --header "x-archive-meta-collection:biodiversity"
    --header "x-archive-meta-curation:[curator]biodiversitylibrary.org[/curator]
        [date]20110531093928[/date][state]approved[/state]"
    --header "x-archive-meta-title:Beiträge zur Insecten-Fauna von Neu-Granada
        und Venezuela"
    --header "x-archive-meta-creator:Kollar, Vincenz"
    –-header "x-archive-meta-year:1850]"
    --header "x-archive-meta-date:1850]"
    --header "x-archive-meta-publisher:K. K. Hofund Staatsdruckerei"
    --header "x-archive-meta-language:ger"
    --header "x-archive-meta-call_number:39088002393080"
    --header "x-archive-meta-identifier-bib:39088002393080"
    --header "x-archive-meta-description:Plates bound in separate volume."
    --upload-file "/path/to/file/IDENTIFIER_scandata.xml"
        "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_scandata.xml"

This first command serves multiple purposes.

It creates the bucket with the meta tag x-archive-auto-make-bucket:1
It sends in all of the metadata for the item, which allows IA to create the IDENTIFIER_meta.xml file.
It uploads the IDENTIFIER_scandata.xml file to describe the pages of the book.

A few other things are worth noting here. We specifically send a command to not start the derivation of the files for this item (x-archive-queue-derive:0) and we are telling IA that we are uploading a book (x-archive-meta-mediatype:texts).

You can also programmatically replace the metadata by sending a header that instructs IA to remove the existing IDENTIFIER_meta.xml and replace with the headers in the PUT method (x-archive-ignore-preexisting-bucket:1). We typically do not use this method since the derivation process creates new metadata and we prefer not to risk losing that information.

We also never issue the DELETE method, preferring not to allow our software to remove things from IA. (Read the http://www.archive.org/help/abouts3.txt page for more information on this feature.)

We follow this by uploading the MARCXML record from the Libraries’ catalog:

/usr/local/bin/curl --location
    --header "authorization: LOW ACCESS_KEY:SECRET”
    --header "x-archive-queue-derive:0"
    --upload-file "/path/to/file/IDENTIFIER_marc.xml"
        "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_marc.xml"

Again, worth noting here is that we aren’t uploading any metadata and we still tell IA to not derive anything.

/usr/local/bin/curl --location
    --header "authorization: LOW ACCESS_KEY:SECRET”
    --header "x-archive-queue-derive:1"
    --header "x-archive-size-hint:32681063"
    --upload-file "/path/to/file/IDENTIFIER_jp2.zip"
        "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_jp2.zip"

Finally, we still aren’t uploading any metadata and we tell the system how large the file is, which has some unknown effect, but it’s nice to offer that information anyway. Also, since this is the last file we are sending for this item, we give the signal to derive the item and create all of the supplemental files (x-archive-queue-derive:1).

Additional Info & Error Handling

IA intervention on errors

At the time of the writing of this document, it seems that sometimes when there is an error in one of the derivations, IA staff either investigate on their own or contact us when there’s an error. Lately this hasn’t been the case and any errors that we cannot resolve have specifically been brought to their attention. Your experiences may be different.

Metadata Errors

First, make sure you understand what UTF-8 is and why you should use it. That’s a topic for a different whitepaper. That said, our database, scripts and files are all stored in UTF-8 format and we talk to the server using the same. This allows us to send in accented characters with a minimum of hassle. It’s important to note that we are not explicitly indicating UTF-8 content to curl, so we’re lucky in the sense that IA is expecting it.

Second, in the headers you send to IA, make sure that all quotes and double quotes are properly escaped as needed for your implementation. Our command-line technique uses double-quotes to surround the metadata name-value pairs. A double-quote in the content would break the command line, so we added backslash escapes before the double quotes.

Understanding the Log files

If anything goes wrong with your derive, then the log file for the action will be the place to go. From either the My Outstanding Tasks Page or the details page for an item, the log file is found by clicking the task ID number for the task. This will display the log_show.php page for the action. The file can be quite long, but it will have clues to what should the action have failed and be displayed in red. A discussion of what is found in the log file is too long for the scope of this document.

Monitoring your progress

As mentioned before, the My Outstanding Tasks Page lists all of your pending, running or otherwise incomplete tasks that. Anything in red indicates that there is a problem, anything in green is pending and anything in blue is currently being processed.

OCR problems

The first major issue we found in the derivation of one of our books was a situation where the OCR of a complicated image was taking an extremely long time (many hours) to complete. As it turns out, we suspected that the fine lines of shading in the image were throwing off the OCR engine and was likely using up a large amount of memory. The log file, though cryptic, did show an error that, once deciphered, indicated trouble with scanning the page for text.

Again, on the advice of our contact at IA, we added a piece of metadata that told the OCR engine to skip the page entirely. The method was to add a special metadata element to IDENTIFIER_meta.xml called “skipocr”. This metadata element can be added by editing the item’s information at IA’s website or through the HTTP headers described earlier. The value is a restricted regular expression matching the 4-digit index numbers of the images to skip. Some examples:

0101
010[12]
010[1-5]
01(0[89]|1[0-4])

In the first example, we skip page 101. In the second, we skip page 101 and 102. In the third, pages 101 through 105. In we skip pages 108, 109, 110, 112, 113, and 114.
Allowed characters are digits, square brackets, hyphen, parentheses and vertical bar. Through experimentation, we found that this regular expression can be very complicated, as long as it is valid. For example the following would appear on one line for the “skipocr” meta variable.

(000[1234]|000[6789]|00[124][0-9]|003[01246789]|005[012468]|006[02346]|
007[0245689]|009[01234568]|01[0-4][0-9]|0150)

This book had a lot of images and only a few pages with text.