At the University of North Texas Digital Newspaper Unit we use many tools to create and edit metadata. One of our workflows involves gathering metadata from the US Newspaper Directory and entering it in an Excel spreadsheet.
The metadata is later transferred manually to an online interface called the New Record Creator (NRC) that exports XML templates for applying metadata to groups of newspapers as they are uploaded to the Portal to Texas History (PTH).
I recently developed a proof of concept script in Ruby that uses OLE automation to create an XML file directly from a spreadsheet. It also provides an option to immediately upload the metadata to the NRC with Selenium WebDriver.
The metadata is later transferred manually to an online interface called the New Record Creator (NRC) that exports XML templates for applying metadata to groups of newspapers as they are uploaded to the Portal to Texas History (PTH).
I recently developed a proof of concept script in Ruby that uses OLE automation to create an XML file directly from a spreadsheet. It also provides an option to immediately upload the metadata to the NRC with Selenium WebDriver.
WIN32OLE for Excel
There are a number of great resources for learning about OLE automation with Ruby. One of the best is the Ruby On Windows blog. It has not been updated recently, but there are many useful code snippets to get you started.
The first step in exporting XML metadata from a spreadsheet is to create a WIN32OLE object for Excel and link it to the desired workbook and worksheet. The script takes the spreadsheet file name as a command line argument and links to the first worksheet in the workbook, the 'Reel Info' tab where we record metadata about the newspaper(s) on a reel of microfilm.
# create new Excel object
excel = WIN32OLE.new('Excel.Application')
puts "Excel failed to start." unless excel
excel.visible = false # hidden
# file name is first command line argument
spreadsheet = File.absolute_path "#{ARGV[0]}"
workbook = excel.Workbooks.Open("#{spreadsheet}")
worksheet = workbook.Worksheets(1) # 'Reel Info'
Next, assign metadata variables according to the cell's grid location. Here are a few examples:
serial_title = worksheet.Range('c19').value.strip
lccn = worksheet.Range('c21').value.gsub(/\s+/, "")
height = worksheet.Range('c25').value.to_i
Finally, open a new XML file and write the metadata. Format the templated XML as you would in a text editor, one tagged line per line of code with spaces or tabs entered manually between the quotation marks:
filename = File.basename("#{spreadsheet}", ".xls") # grab short file name
File.open("#{filename}(#{num})_super_metadata.xml", "w") do |supermetadata| # write file
supermetadata << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" << "\n"
supermetadata << "<metadata>" << "\n"
supermetadata << " <text>...data with " << variables << " in appropriate spots...</text>" << "\n"
supermetadata << "</metadata>"
end
It's good practice to destroy the Excel object when you're done with it, and my script also adds an export time and date to the spreadsheet before saving and closing it:
# add export message to spreadsheet
t = Time.now
worksheet.Range('c16').value = "EXPORTED: #{t}"
workbook.Close(1) # save and close
excel.Quit
excel = nil # destroy Excel object
Selenium WebDriver
I've done some work with Selenium IDE, pioneering a method for batch editing PTH metadata via the online edit system. Effective use of the Selenium WebDriver Ruby Bindings merits an article of its own, and I'll only briefly describe my method for uploading XML metadata to the PTH NRC.
Each browser has its WebDriver quirks, and I quickly discovered that our enterprise Frontmotion Firefox installation does not play well with the NRC loaded in a WebDriver object, producing slow load times and display issues with the PTH edit application. Switching to Chrome solved those problems, but also requires the ChromeDriver.
After sorting out the technical problems, working with WebDriver is very easy. Create a browser object and navigate to the desired page:
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "http://edit.texashistory.unt.edu/nrc/import"
WebDriver loads the URL and waits for the page to load. Find the appropriate element, enter the data, and submit the element:
element = driver.find_element(:name, 'text_input')
element.send_keys "#{data}"
element.submit
In this case, I had already slurped the XML file into the data object:
data = File.read("#{filename}(#{num})_super_metadata.xml")
When the exported spreadsheet metadata is loaded in the NRC we save another template online in the edit application. That web template can be reused to create validated XML files with additional information, not recorded in the spreadsheet, that changes over time (e.g. editor, publisher) during the course of a newspaper run.