
Spatiotemporal Indexing Approach (SIA) for efficient management, access, and analysis of Big Climate and Remote Sensing Data

MapReduce, a parallel data processing framework pioneered by Google, has proven effective at handling big data challenges. As an open source implementation of MapReduce, Hadoop has gained increasing popularity over the past several years. However, Hadoop is not designed to handle spatiotemporal data. To bridge the gap, we propose a novel spatiotemporal indexing approach that significantly accelerates querying and processing of big climate data in its native format with MapReduce.

The Spatiotemporal Indexing Approach (SIA) bridges the gap between array-based data models and block-oriented HDFS storage models by linking the logical spatiotemporal information (space, time, and variables) to the physical location information (node, file, and byte). Based on the index, a grid partition algorithm was developed to optimize MapReduce processing performance by maximizing data locality and balancing the workload across cluster nodes.
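To make the idea concrete, the following is a minimal Python sketch of an index that ties logical coordinates (variable, time step, grid extent) to physical locations (node, file, byte range), followed by a simple locality-aware task assignment. It is an illustration under stated assumptions, not the published SIA implementation: the names IndexEntry, query, and assign_to_nodes, the sample paths and node names, and the greedy least-loaded heuristic are all hypothetical stand-ins for the actual index layout and grid partition algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical index record for a chunk of an array-based file."""
    # Logical (spatiotemporal) information
    variable: str
    time_step: int
    row_range: Tuple[int, int]  # [start, stop) grid rows covered by the chunk
    col_range: Tuple[int, int]  # [start, stop) grid columns covered by the chunk
    # Physical location information
    nodes: Tuple[str, ...]      # nodes assumed to hold a replica of the block
    file_path: str
    byte_offset: int
    byte_length: int


def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if two half-open intervals [start, stop) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def query(index: List[IndexEntry], variable: str,
          time_range: Tuple[int, int],
          row_range: Tuple[int, int],
          col_range: Tuple[int, int]) -> List[IndexEntry]:
    """Return the entries whose chunk intersects the requested subset."""
    return [
        e for e in index
        if e.variable == variable
        and time_range[0] <= e.time_step < time_range[1]
        and overlaps(e.row_range, row_range)
        and overlaps(e.col_range, col_range)
    ]


def assign_to_nodes(entries: List[IndexEntry]) -> Dict[str, List[IndexEntry]]:
    """Greedy stand-in for a locality-aware partition (an assumption).

    Each chunk is assigned to one of the nodes that already stores it
    (data locality), choosing the node with the fewest bytes assigned
    so far (workload balance).
    """
    load: Dict[str, int] = defaultdict(int)
    plan: Dict[str, List[IndexEntry]] = defaultdict(list)
    for e in sorted(entries, key=lambda e: -e.byte_length):
        node = min(e.nodes, key=lambda n: (load[n], n))
        plan[node].append(e)
        load[node] += e.byte_length
    return dict(plan)


# Hypothetical sample index: two files, three nodes, two variables.
index = [
    IndexEntry("T", 0, (0, 90), (0, 180), ("node1", "node2"), "/data/a.nc", 1024, 4000),
    IndexEntry("T", 0, (90, 180), (0, 180), ("node1", "node2"), "/data/a.nc", 5024, 4000),
    IndexEntry("T", 1, (0, 90), (0, 180), ("node2", "node3"), "/data/b.nc", 1024, 4000),
    IndexEntry("U", 0, (0, 90), (0, 180), ("node3",), "/data/a.nc", 9024, 4000),
]

# Variable "T", time steps 0 and 1, rows 0-49, columns 0-99.
hits = query(index, "T", (0, 2), (0, 50), (0, 100))
plan = assign_to_nodes(hits)
for node, chunks in plan.items():
    for c in chunks:
        print(node, c.file_path, c.byte_offset, c.byte_length)
```

In this sketch, only the byte ranges that intersect the query are scheduled, and each one is scheduled on a node that already stores it, which is the kind of effect the index and grid partition described above aim for.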

Structure of the spatiotemporal index

SIA was adopted by NASA as one of the key technologies in its Data Analytics and Storage System (DASS). SIA has also been extended and adapted to build a hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data (Hu et al., 2020).

“Testing the Spatiotemporal Indexing Approach (SIA) under a variety of configurations, including HDFS, General Parallel File System (GPFS), and Lustre, has helped to clarify and define the architecture of the DASS hardware and the software stack. The DASS will provide engineers and scientists with a platform for analyzing large climate datasets without the need to move the data.”
https://www.nas.nasa.gov/SC16/demos/demo37.html

Benchmarking performance measurements for the Spatiotemporal Indexing Approach (SIA) on multiple Portable Operating System Interface (POSIX) architectures leveraging connectors to the Hadoop Distributed File System (HDFS) environment; lower runtime is better. Carrie Spear, Michael Bowen, NASA/Goddard (Source: https://www.nas.nasa.gov/SC16/demos/demo37.html)

Credit/Source: Carrie Spear (carrie.e.spear@nasa.gov), HPC Architect/Contractor at the NASA Center for Climate Simulation (NCCS). http://files.gpfsug.org/presentations/2016/SC16/06_0-_Carrie_Spear_-_Spectrum_Sclale_and_HDFS.pdf
Source: https://www.nas.nasa.gov/SC16/demos/demo37.html

Publications:

Li, Z., Hu, F., Schnase, J. L., Duffy, D. Q., Lee, T., Bowen, M. K., & Yang, C. (2017). A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce. International Journal of Geographical Information Science, 31(1), 17-35.

Hu, F., Yang, C., Jiang, Y., Li, Y., Song, W., Duffy, D. Q., … & Lee, T. (2020). A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data. International Journal of Digital Earth, 13(3), 410-428.