Managing large hierarchical datasets with PyTables
PyTables is a free and open-source Python library for managing large hierarchical datasets. It is built on top of NumPy and the HDF5 scientific dataset library, and it focuses on both performance and interactive analysis of very large datasets.
For large data streams (think multi-dimensional arrays or billions of records), it outperforms databases in terms of speed, memory usage, and I/O bandwidth. It is not a replacement for traditional relational databases, however, as PyTables does not support broad relationships between dataset variables.
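To give a flavour of what working with records looks like, here is a minimal sketch that writes a small table and then selects rows with an in-kernel query. The `Particle` layout, the `detector`/`readout` node names, and the `build_and_query` helper are made up for illustration and are not part of the workshop material.

```python
import os
import tempfile

import numpy as np
import tables


class Particle(tables.IsDescription):
    # Hypothetical record layout, chosen only for this illustration
    ident = tables.Int64Col()
    energy = tables.Float64Col()


def build_and_query(path, nrows=1000):
    """Write nrows records to an HDF5 table, then select those with energy > 0.5."""
    rng = np.random.default_rng(0)
    with tables.open_file(path, mode="w", title="Demo file") as h5:
        group = h5.create_group("/", "detector", "Detector data")
        table = h5.create_table(group, "readout", Particle, "Readout example")
        row = table.row
        for i in range(nrows):
            row["ident"] = i
            row["energy"] = rng.random()
            row.append()
        table.flush()  # push buffered rows to disk

    with tables.open_file(path, mode="r") as h5:
        table = h5.root.detector.readout
        # The condition string is evaluated by PyTables itself (an in-kernel
        # query), so only the matching rows are returned as a NumPy array.
        selected = table.read_where("energy > 0.5")
        return table.nrows, selected


demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
total, selected = build_and_query(demo_path, nrows=200)
print(total, len(selected))
```

The same pattern applies to much larger tables, since rows are written through a buffer and the query does not require loading the full table into memory first.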
PyTables can even be used to organize a workflow with many (thousands to millions of) small files, as you can create a PyTables database of nodes that can be used like regular open files in Python. This lets you store a large number of arbitrary files in a PyTables database with on-the-fly compression, making it very efficient for handling huge amounts of data.
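As a rough sketch of the file-node idea, the snippet below stores a byte payload as a compressed node inside an HDF5 file and reads it back through a file-like object, using the `tables.nodes.filenode` module. The `store_and_read` helper, the node name, and the zlib compression level are assumptions chosen for illustration.

```python
import os
import tempfile

import tables
from tables.nodes import filenode


def store_and_read(h5path, name, payload):
    """Store bytes as a compressed file node, then read them back."""
    # Compression settings are an arbitrary choice for this example
    filters = tables.Filters(complevel=5, complib="zlib")

    with tables.open_file(h5path, mode="a") as h5:
        # new_node returns a file-like object backed by a node in the HDF5 file
        fnode = filenode.new_node(h5, where="/", name=name, filters=filters)
        fnode.write(payload)
        fnode.close()

    with tables.open_file(h5path, mode="r") as h5:
        # open_node wraps an existing node in a readable file-like object
        fnode = filenode.open_node(h5.get_node("/", name), "r")
        data = fnode.read()
        fnode.close()
    return data


archive = os.path.join(tempfile.mkdtemp(), "files.h5")
text = b"one small file among many\n" * 100
print(store_and_read(archive, "notes_txt", text) == text)
```

Calling the helper repeatedly with different node names accumulates many small "files" in a single HDF5 container, each compressed transparently on write.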
This workshop will guide you through the basics with no previous PyTables or HDF5 knowledge.
Some basic Python knowledge would be useful, although many attendees will probably pick it up on the fly, as we'll try to go slowly.
Alex Razoumov earned his Ph.D. in computational astrophysics from the University of British Columbia and held postdoctoral positions in Urbana-Champaign, San Diego, Oak Ridge, and Halifax. He spent five years as an HPC Analyst at SHARCNET and in 2014 moved back to Vancouver to focus on scientific visualization and training researchers to use advanced computing tools. Alex is currently based at Simon Fraser University.
- Friday, April 28, 2023
- 1:00pm - 2:30pm
- Koerner Library
- Data Research Commons Research Data Management
- Alex Razoumov, Digital Research Alliance of Canada