James Myers


Consortium of Universities for the Advancement of Hydrologic Science, Inc.

WDC Informatics Conference 2013

Light-weight Data Services for Sustainability Research
James Myers1, Praveen Kumar2, Margaret L. Hedstrom1, Beth Plale4, Robert H. McDonald5, Rob Kooper3, Luigi Marini3, Jong Lee3, Inna Kouper4, Kavitha Chandrasekar4

1School of Information, University of Michigan, Ann Arbor, MI, {hedstrom, myersjd}@umich.edu
2Department of Civil and Environmental Engineering, Univ. of Illinois, Urbana, IL, [email protected]
3National Center for Supercomputing Applications, Univ. of Illinois, Urbana, IL, {kooper, lmarini, jonglee1}@illinois.edu
4School of Informatics and Computing, Indiana University, Bloomington, IN, {plale, kavchand, inkouper}@indiana.edu
5Data to Insight Center, Indiana University, Bloomington, IN, [email protected]

Introduction
The Sustainable Environment – Actionable Data (SEAD) project began in late 2011 as part of NSF’s DataNet Partnership. Its mission is to develop light-weight data services that meet the needs of next-generation sustainability projects for sophisticated management of highly heterogeneous data, while dramatically lowering the cost and effort required to curate and preserve data for long-term community use. A key motivation for the project is the recognition that modern Web 2.0 and 3.0 technologies can make it easier to manage data gathered from multiple sources. They can also eliminate constraints such as having to fully define the list of file formats and vocabularies before the project begins, or having to fill in onerous forms to enter and re-enter information about data (metadata).

Instead, SEAD creates a ‘write-once, use-repeatedly’ set of services. Once data is uploaded, it is visible to whom you want, where you want it. Put data in your SEAD project repository and your collaborators see it in the project dashboard, on individual data pages, and in customizable maps and other data display tools. Metadata buried in files is automatically extracted and displayed, and immediately becomes useful for finding specific data sets. Mark the data for publication and curators see what you see, both data and metadata, and don’t need to ask you to re-enter anything. When publication is complete, your data is preserved for the long term and the new collection shows up on your project’s web page, in your online research profile, and in national data catalogs. You can then see as others cite, use, and generate new derived data from your work.

SEAD calls this new approach “Active and Social Curation”: making curation a natural part of producing and using data, leveraging technology to automate as much as possible, and integrating information provided by producers, consumers, and curators into a coherent and increasingly valuable whole.

SEAD’s Data Services

Active Content Repository
• Active Project Spaces
• Individual Collection and Data Pages
• Automated Metadata Extraction
• Branded Public Access

VIVO Community Research Profiles
• People/Organizations/Projects/Publications
• Data Citations
• Community Network and Dynamics Visualizations

Virtual Archive
• Policy-Driven Curation and Preservation
• Institutional / Cloud / Grid Storage
• DOI Assignment
• Faceted Search and Catalog Registration

Design
SEAD’s services are primarily written in Java and are organized into three primary components that extend existing open source applications and common open source components and libraries:
• An Active Content Repository (ACR) providing secure project spaces where data can be collected, shared, annotated, analyzed, used to create new data products, and ultimately published;
• A community research profile and analytics service (VIVO) that tracks information about real-world entities (e.g., people, projects, centers) and provides links and citation information about papers, presentations, and data; and
• A Virtual Archive (VA) that packages fixed, bounded versions of the data and information from ACR spaces and VIVO into new data collections, generates Digital Object Identifiers (DOIs), matches collections to appropriate long-term repositories working with SEAD, and registers the new data across SEAD’s components and with internal and external data discovery services.
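The Virtual Archive's packaging step can be illustrated with a minimal sketch. This is not SEAD's code or its package format: the function name build_package, the directory layout, and the JSON manifest (a stand-in for the metadata files that accompany a collection) are all assumptions made for illustration, and the sketch is in Python rather than Java for brevity. It only shows the idea of freezing a bounded set of files into a hierarchical collection with checksums so the snapshot can be verified later.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_package(collection_id, files, metadata, out_dir):
    """Write a fixed snapshot of `files` plus a manifest (hypothetical layout).

    files: dict mapping a relative path to the file's bytes
    metadata: dict of descriptive metadata for the whole collection
    """
    root = Path(out_dir) / "package"
    root.mkdir(parents=True, exist_ok=True)
    entries = []
    for rel_path, content in sorted(files.items()):
        target = root / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # The checksum lets a long-term repository verify the frozen copy.
        entries.append({
            "path": f"data/{rel_path}",
            "sha256": hashlib.sha256(content).hexdigest(),
            "bytes": len(content),
        })
    manifest = {"id": collection_id, "metadata": metadata, "files": entries}
    (root / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest


# Usage with made-up identifiers and file contents.
with tempfile.TemporaryDirectory() as tmp:
    manifest = build_package(
        "urn:example:collection/1",
        {"site-a/readings.csv": b"time,depth\n0,1.2\n", "README.txt": b"demo\n"},
        {"title": "Demo collection"},
        tmp,
    )
    written = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
```

Writing the files first and the manifest last means the manifest describes exactly what was packaged; DOI assignment and repository matching are separate steps that this sketch does not attempt.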

Pilot Operations
SEAD is entering a pilot operations phase and is seeking project groups interested in becoming early adopters of its data services. SEAD will offer free and/or highly subsidized services to groups that are interested in leveraging SEAD to tackle more complex projects and/or are willing to take an active role in helping SEAD improve and evolve its services. During this phase, SEAD will offer both hosted services and downloadable (cloud-hostable) packages of its open source software. It will also provide both end-user and developer support, and it anticipates providing direct access to large reference collections, starting with 1.6 TB (~2.2 M files) from the National Center for Earth Surface Dynamics. SEAD follows an agile methodology and will evolve based on the interests of its growing community of active users.

Getting Started
Visit SEAD
Website: http://www.SEAD-data.net
Project info, videos, demo services

Work with SEAD
Contact: [email protected]
For projects, proposals, collaborations

Follow SEAD
Twitter: @SEADdatanet
Email: [email protected]

Conceptually, SEAD’s ACR and VIVO components are web application/service stacks that leverage a distributed/cloud semantic content repository. This repository manages ‘things’ identified by global identifiers; each may have associated content (e.g., a file) and be described by arbitrary metadata and inter-relationships, encoded in the standard Resource Description Framework (RDF) format. This architecture lets SEAD be dynamically extended with new data types and vocabularies and support network graph browsing and analytics, while retaining the ability to provide search, query, and customized views as one would expect from a traditional database with a fixed data model. SEAD’s VA maps this community graph of semantic content into standard archival packages composed of hierarchical file collections plus resource map and metadata files, which are well suited to long-term storage in institutional repositories and related long-term storage services. The VA is virtual in the sense that it only caches data moving between the ACR/VIVO and long-term stores, but it does leverage XML and geospatial stores to index and catalog information about published collections for future discovery and retrieval.
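To make the 'things plus arbitrary metadata' model concrete, here is a toy in-memory sketch. It is an assumption-laden illustration, not SEAD's repository or an RDF library: the class TinyGraphStore, its methods, and the urn:example identifiers are hypothetical, and only the Dublin Core Terms namespace is a real vocabulary. It shows why a store of (subject, predicate, object) statements can accept a new vocabulary term without any schema change while still answering simple queries.

```python
DCT = "http://purl.org/dc/terms/"  # Dublin Core Terms namespace


class TinyGraphStore:
    """Toy stand-in for a semantic content repository (hypothetical)."""

    def __init__(self):
        self._triples = set()   # (subject, predicate, object) statements
        self._content = {}      # subject identifier -> associated bytes

    def add(self, subject, predicate, obj):
        """Record one statement about a 'thing'."""
        self._triples.add((subject, predicate, obj))

    def attach_content(self, subject, data):
        """Associate content (e.g. a file's bytes) with a 'thing'."""
        self._content[subject] = data

    def content(self, subject):
        """Return the attached content, or None if there is none."""
        return self._content.get(subject)

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted statements matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


# Usage with made-up identifiers.
store = TinyGraphStore()
dataset = "urn:example:dataset/1"
person = "urn:example:person/42"

store.add(dataset, DCT + "title", "Stream gauge readings")
store.add(dataset, DCT + "creator", person)
store.attach_content(dataset, b"time,depth\n0,1.2\n")

# A term from a new, project-specific vocabulary needs no schema migration.
store.add(dataset, "urn:example:vocab/sensorDepthUnit", "m")

titles = store.match(predicate=DCT + "title")
about_dataset = store.match(subject=dataset)
made_by_person = store.match(predicate=DCT + "creator", obj=person)
```

The pattern query plays the role that search and customized views play over the real repository; following the creator statement from the dataset to the person is the kind of link that graph browsing builds on.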

During the next 3 years, SEAD will be enhancing its services, adding rich support for a growing set of data types, expanding its data holdings and community information, growing a network of affiliated long-term repositories, and creating and refining a sustainable business model and long-term organizational structure.

Acknowledgements
SEAD is funded by the National Science Foundation under Cooperative Agreement #OCI0940824.
SEAD gratefully acknowledges all of our partner participants who have been involved in developing our services framework. This includes the research teams from the following organizations: School of Information, University of Michigan; Department of Civil and Environmental Engineering, the National Center for Supercomputing Applications (NCSA), and UIUC Libraries, University of Illinois at Urbana-Champaign; Data to Insight Center, IU Libraries, and School of Informatics and Computing, Indiana University; the Inter-university Consortium for Political and Social Research (ICPSR); the National Center for Earth-Surface Dynamics (NCED); and the Data Conservancy Project, Johns Hopkins University.
