The USDA’s National Agricultural Library (NAL), a national library of the United States, is home to the world’s largest collection of agricultural science materials and information. Located in Beltsville, Maryland, the Library serves scientists and researchers by assisting with database searches, literature reviews, and requests for hard-to-find documents and records. NAL’s Information Systems Division (ISD) develops and maintains applications that enable public access to the Library’s digital resources.
Since 2015, DSFederal has supported a large-scale development and O&M effort for NAL, with a team of developers who work onsite at the Beltsville facility. We architect data discovery platforms and develop data integration capabilities using advanced search and database technologies. As a result of our work, the NAL website now serves as a powerful digital library that offers easy access to information scattered throughout hundreds of thousands of web pages.
Under a new Data Science task order, our team has been working on a SCINet big data storage plan for Ag Data Commons, a pilot effort to design, implement, and test a process by which Ag Data Commons can accept high volumes of data. We are also working on a recommendation system for NAL’s catalogs of datasets and published articles, using SCINet to parallelize NAL’s natural language processing pipeline. The work involves cleaning and preprocessing text, learning vector representations of ~1.4 million agricultural terms, computing the semantic similarity between texts, and executing various unsupervised clustering algorithms.
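To make the similarity step concrete, here is a minimal sketch of how two texts might be compared once term vectors exist. It is an illustration under stated assumptions, not a description of NAL's production pipeline: the three-dimensional vectors in `TOY_VECTORS` are invented, averaging term vectors is just one common way to represent a text, and cosine similarity is assumed as the similarity measure. The function names are hypothetical.

```python
import math
import re

# Hypothetical toy embeddings (invented values). The real pipeline learns
# vectors for roughly 1.4 million agricultural terms.
TOY_VECTORS = {
    "soil": [0.9, 0.1, 0.0],
    "nitrogen": [0.8, 0.2, 0.1],
    "maize": [0.1, 0.9, 0.0],
    "yield": [0.2, 0.8, 0.1],
    "cattle": [0.0, 0.1, 0.9],
}


def preprocess(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def embed(text, vectors=TOY_VECTORS):
    """Average the vectors of known terms; return None if no term is known."""
    known = [vectors[tok] for tok in preprocess(text) if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    dataset = embed("Soil nitrogen measurements")
    article = embed("Effects of nitrogen on maize yield")
    print(cosine_similarity(dataset, article))
```

Each pairwise comparison is independent of the others, which is what makes this kind of workload a natural fit for parallel execution on a cluster.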
NAL currently stores 1,921 datasets on Ag Data Commons and 2,480,381 articles on PubAg. This means that our team must compute the similarity for each of approximately 4.8 billion dataset-to-dataset or dataset-to-article combinations. The volume of data and the computational cost involved make this project infeasible on personal computers.
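The 4.8 billion figure can be checked with a few lines of arithmetic. The sketch below assumes that each dataset is compared once with every other dataset (unordered pairs) and once with every article; the variable names are illustrative.

```python
from math import comb

N_DATASETS = 1_921      # datasets on Ag Data Commons
N_ARTICLES = 2_480_381  # articles on PubAg

# Unordered dataset-to-dataset pairs: n * (n - 1) / 2
dataset_to_dataset = comb(N_DATASETS, 2)

# Every dataset paired with every article
dataset_to_article = N_DATASETS * N_ARTICLES

total = dataset_to_dataset + dataset_to_article

print(f"dataset-to-dataset: {dataset_to_dataset:,}")  # 1,844,160
print(f"dataset-to-article: {dataset_to_article:,}")  # 4,764,811,901
print(f"total:              {total:,}")               # 4,766,656,061
```

Nearly all of the work comes from the dataset-to-article comparisons; the dataset-to-dataset pairs contribute fewer than two million of the total.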
As a long-time contractor-partner to NAL, DSFederal has developed extensive domain knowledge of NAL’s systems, and our team is energized by the new technical challenges this project offers. We look forward to continuing to support NAL and its vital role in scientific research.