MAST API Tutorial

An introduction to using the MAST API to query MAST data and catalogs programmatically.

To start with, here are all the imports we need:

In [1]:
import sys
import os
import time
import re
import json

import requests
from urllib.parse import quote as urlencode

from astropy.table import Table
import numpy as np

import pprint
pp = pprint.PrettyPrinter(indent=4)

Basic MAST Query

Here we will perform a basic MAST query on M101, equivalent to choosing "All MAST Observations" and searching for M101 in the Portal.

We will then select some observations, view their data products, and download some data.

Step 0: MAST Request

All MAST requests (except direct download requests) have the same form:

  • HTTPS connect to MAST server
  • POST MAST request to /api/v0/invoke
  • The MAST request is of the form "request={request json object}"

Because every request looks the same, we will write a function to handle the HTTPS interaction, taking in a MAST request and returning the server response.

In [14]:
def mast_query(request):
    """Perform a MAST query.
    
        Parameters
        ----------
        request (dictionary): The MAST request json object
        
        Returns head,content where head is the response HTTP headers, and content is the returned data"""
    
    # Base API url
    request_url='https://mast.stsci.edu/api/v0/invoke'    
    
    # Grab Python Version 
    version = ".".join(map(str, sys.version_info[:3]))

    # Create Http Header Variables
    headers = {"Content-type": "application/x-www-form-urlencoded",
               "Accept": "text/plain",
               "User-agent":"python-requests/"+version}

    # Encoding the request as a json string
    req_string = json.dumps(request)
    req_string = urlencode(req_string)
    
    # Perform the HTTP request
    resp = requests.post(request_url, data="request="+req_string, headers=headers)
    
    # Pull out the headers and response content
    head = resp.headers
    content = resp.content.decode('utf-8')

    return head, content

Step 1: Name Resolver

The first step of this query is to "resolve" M101 into a position on the sky. To do this we use the Mast.Name.Lookup service.

As with all of our services, we recommend using the json format, as the json output is most easily parsed.

In [18]:
object_of_interest = 'M101'

resolver_request = {'service':'Mast.Name.Lookup',
                     'params':{'input':object_of_interest,
                               'format':'json'},
                     }

headers, resolved_object_string = mast_query(resolver_request)

resolved_object = json.loads(resolved_object_string)

pp.pprint(resolved_object)
{   'resolvedCoordinate': [   {   'cacheDate': 'Jan 15, 2021, 10:17:45 AM',
                                  'cached': True,
                                  'canonicalName': 'MESSIER 101',
                                  'decl': 54.34895,
                                  'objectType': 'G',
                                  'ra': 210.80227,
                                  'radius': 0.24000000000000002,
                                  'resolver': 'NED',
                                  'resolverTime': 241,
                                  'searchRadius': -1.0,
                                  'searchString': 'm101'}],
    'status': ''}

The resolver returns a variety of information about the resolved object; however, for our purposes all we need are the RA and Dec:

In [20]:
obj_ra = resolved_object['resolvedCoordinate'][0]['ra']
obj_dec = resolved_object['resolvedCoordinate'][0]['decl']
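
The resolver can fail to match a name; if that happens, the resolvedCoordinate list presumably comes back empty (an assumption based on the response shape above, not documented behavior). A small guard avoids an IndexError:

coords = resolved_object.get('resolvedCoordinate', [])
if not coords:
    # Assumed failure mode: no match means an empty list
    raise ValueError("Could not resolve object name: " + object_of_interest)

obj_ra = coords[0]['ra']
obj_dec = coords[0]['decl']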

Step 2: MAST Query

Now that we have the RA and Dec, we can perform the MAST query on M101. To do this we will use the Mast.Caom.Cone service. The output of this query is the information that gets loaded into the grid when running a Portal query.

Because M101 has been observed many times, there will be several thousand results. We can use the MashupRequest 'page' and 'pagesize' properties to control how we view these results, either by choosing a pagesize large enough to accommodate all of the results, or by choosing a smaller pagesize and paging through them with the page property. The json response object includes paging information; check it to see whether you need to collect additional results (a paging sketch follows the query below).

Note: page and pagesize must both be specified (or neither); if only one is specified, it will be ignored.

In [21]:
mast_request = {'service':'Mast.Caom.Cone',
                'params':{'ra':obj_ra,
                          'dec':obj_dec,
                          'radius':0.2},
                'format':'json',
                'pagesize':2000,
                'page':1,
                'removenullcolumns':True,
                'removecache':True}

headers, mast_data_str = mast_query(mast_request)

mast_data = json.loads(mast_data_str)

print(mast_data.keys())
print("Query status:",mast_data['status'])
dict_keys(['status', 'msg', 'data', 'fields', 'paging'])
Query status: COMPLETE
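
If the paging information reports more rows than were returned on the first page, the remaining pages can be collected in a loop. A minimal sketch, assuming the paging dictionary exposes the total row count under 'rowsTotal' (as it does in the filtered-query responses shown later in this tutorial):

all_rows = list(mast_data['data'])
total_rows = mast_data['paging'].get('rowsTotal', len(all_rows))

page = 1
while len(all_rows) < total_rows:
    # Request the next page with the same query parameters
    page += 1
    mast_request['page'] = page
    _, page_str = mast_query(mast_request)
    page_data = json.loads(page_str)['data']
    if not page_data:  # stop if the server returns no further rows
        break
    all_rows.extend(page_data)

print("Total rows collected:", len(all_rows))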

In the json response object, the "fields" dictionary holds the column names and types. The column names are not the formatted column headings that appear in the Portal grid (those are not guaranteed to be unique), but instead are the column names from the database. These names can be accessed in the Portal by hovering over a column name, or in the details pane of "Show Details." Details about returned columns for various queries can be found in the "Related Pages" section of the API documentation.

In [22]:
pp.pprint(mast_data['fields'][:5])
[   {'name': 'intentType', 'type': 'string'},
    {'name': 'obs_collection', 'type': 'string'},
    {'name': 'provenance_name', 'type': 'string'},
    {'name': 'instrument_name', 'type': 'string'},
    {'name': 'project', 'type': 'string'}]

The data is found (predictably) under the "data" keyword. The data is a list of dictionaries, where each row corresponds to one observation collection (just as in the Portal grid):

In [23]:
pp.pprint(mast_data['data'][0])
{   '_selected_': None,
    'calib_level': 3,
    'dataRights': 'PUBLIC',
    'dataURL': None,
    'dataproduct_type': 'image',
    'distance': 0,
    'em_max': 1000,
    'em_min': 600,
    'filters': 'TESS',
    'instrument_name': 'Photometer',
    'intentType': 'science',
    'jpegURL': None,
    'mtFlag': False,
    'obs_collection': 'TESS',
    'obs_id': 'tess-s0015-4-1',
    'obs_title': None,
    'obsid': 17001016095,
    'project': 'TESS',
    'proposal_id': 'N/A',
    'proposal_pi': 'Ricker, George',
    'proposal_type': None,
    'provenance_name': 'SPOC',
    's_dec': 59.23577326662502,
    's_ra': 213.663757013405,
    's_region': 'POLYGON 227.56190400 55.89237000 210.04086200 50.98859500 '
                '197.01254300 60.84640200 220.28641600 67.20814100 '
                '227.56190400 55.89237000 ',
    'sequence_number': 15,
    'srcDen': None,
    't_exptime': 1425.599379,
    't_max': 58736.89295962,
    't_min': 58710.87239573,
    't_obs_release': 58756.33333,
    'target_classification': None,
    'target_name': 'TESS FFI',
    'wavelength_region': 'Optical'}

The data table can be used as is, but it can also be translated into different formats depending on user preference. Here we will demonstrate how to put the results of a MAST query into an Astropy Table.

In [24]:
mast_data_table = Table()

for col, atype in [(x['name'], x['type']) for x in mast_data['fields']]:
    # Map the MAST type names onto numpy-compatible dtypes
    if atype == "string":
        atype = "str"
    if atype == "boolean":
        atype = "bool"
    # Build each column from the corresponding entry in every data row
    mast_data_table[col] = np.array([x.get(col, None) for x in mast_data['data']], dtype=atype)
    
print(mast_data_table)
intentType obs_collection provenance_name ...      distance      _selected_
---------- -------------- --------------- ... ------------------ ----------
   science           TESS            SPOC ...                0.0      False
   science           TESS            SPOC ...                0.0      False
   science           TESS            SPOC ...                0.0      False
   science           TESS            SPOC ...                0.0      False
   science           TESS            SPOC ...  407.3642445717816      False
   science           TESS            SPOC ...  407.3642445717816      False
   science           TESS            SPOC ...  407.3642445717816      False
   science          SWIFT            None ...                0.0      False
   science          SWIFT            None ...                0.0      False
   science          SWIFT            None ...                0.0      False
       ...            ...             ... ...                ...        ...
   science            HST        CALWFPC2 ...  24.39043796264874      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
   science            HST        CALWFPC2 ... 24.393940669256285      False
Length = 2000 rows
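
The same list of dictionaries also converts directly to a pandas DataFrame, if that format is more convenient. A minimal sketch (pandas is not otherwise used in this tutorial):

import pandas as pd

# Each entry of 'data' is a dict keyed by the database column names
mast_data_frame = pd.DataFrame(mast_data['data'])
print(mast_data_frame[['obs_collection', 'instrument_name', 'target_name']].head())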

At this point we are ready to do analysis on these observations. However, if we want to access the actual data products, there are a few more steps.

Step 2 tangent: filtered query

An alternative to the cone search query is the filtered query. This is analogous to an Advanced Search in the Portal and returns the same kind of observation list as the cone search, but selected on criteria other than position. The services we'll use to do this are Mast.Caom.Filtered and Mast.Caom.Filtered.Position.

Filtered queries can often end up being quite large, so we will first run a query that just returns the number of results and decide whether that is manageable before doing the full query. We do this by supplying the parameter "columns":"COUNT_BIG(*)".

In [25]:
mashup_request = {"service":"Mast.Caom.Filtered",
                  "format":"json",
                  "params":{
                      "columns":"COUNT_BIG(*)",
                      "filters":[
                          {"paramName":"filters",
                           "values":["NUV","FUV"],
                           "separator":";"
                          },
                          {"paramName":"t_max",
                           "values":[{"min":52264.4586,"max":54452.8914}], #MJD
                          },
                          {"paramName":"obsid",
                           "values":[],
                           "freeText":"%200%"}
                      ]}}
    
headers, out_string = mast_query(mashup_request)
count = json.loads(out_string)

pp.pprint(count)
{   'data': [{'Column1': 1068}],
    'fields': [{'name': 'Column1', 'type': 'string'}],
    'msg': '',
    'paging': {   'page': 1,
                  'pageSize': 1,
                  'pagesFiltered': 1,
                  'rows': 1,
                  'rowsFiltered': 1,
                  'rowsTotal': 1},
    'status': 'COMPLETE'}
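
The count itself is returned under the 'Column1' key, so it can be pulled out programmatically to decide whether to run the full query; the threshold below is an arbitrary example:

n_results = count['data'][0]['Column1']
print(n_results, "matching observations")

# Arbitrary cutoff for what we consider a manageable result set
if n_results < 50000:
    print("Small enough to request the full result set")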

1,068 isn't too many observations, so we can go ahead and request them. The only thing we need to do differently is change "columns":"COUNT_BIG(*)" to "columns":"*".

In [26]:
mashup_request = {"service":"Mast.Caom.Filtered",
                 "format":"json",
                 "params":{
                     "columns":"*",
                     "filters":[
                         {"paramName":"filters",
                          "values":["NUV","FUV"],
                          "separator":";"
                         },
                         {"paramName":"t_max",
                          "values":[{"min":52264.4586,"max":54452.8914}], #MJD
                         },
                         {"paramName":"obsid",
                          "values":[],
                          "freeText":"%200%"}
                     ]}}
    
headers, out_string = mast_query(mashup_request)
filtered_data = json.loads(out_string)

print(filtered_data.keys())
print("Query status:", filtered_data['status'])
dict_keys(['status', 'msg', 'data', 'fields', 'paging'])
Query status: COMPLETE
In [27]:
pp.pprint(filtered_data['data'][0])
{   'calib_level': 2,
    'dataRights': 'PUBLIC',
    'dataURL': 'http://galex.stsci.edu/data/GR6/pipe/02-vsn/50152-AIS_152/d/01-main/0001-img/07-try/AIS_152_sg05-nd-int.fits.gz',
    'dataproduct_type': 'image',
    'em_max': 300700000000,
    'em_min': 169300000000,
    'filters': 'NUV',
    'instrument_name': 'GALEX',
    'intentType': 'science',
    'jpegURL': 'http://galex.stsci.edu/data/GR6/pipe/02-vsn/50152-AIS_152/d/01-main/0001-img/07-try/qa/AIS_152_sg05-xd-int_2color.jpg',
    'mtFlag': None,
    'objID': 1000019991,
    'obs_collection': 'GALEX',
    'obs_id': '6376263785413345280',
    'obs_title': None,
    'obsid': 1000020010,
    'project': 'AIS',
    'proposal_id': None,
    'proposal_pi': None,
    'proposal_type': 'AIS',
    'provenance_name': 'AIS',
    's_dec': 8.29313154479086,
    's_ra': 339.243468698603,
    's_region': 'CIRCLE ICRS 339.24346870   8.29313154 0.625',
    'sequence_number': -999,
    'srcDen': 5885,
    't_exptime': 166,
    't_max': 53228.78150462963,
    't_min': 53228.71202546296,
    't_obs_release': 55426.59541,
    'target_classification': None,
    'target_name': 'AIS_152_1_5',
    'wavelength_region': 'UV'}

To add a position to a filtered query, we use the Mast.Caom.Filtered.Position service and add a new parameter "position":"positionString", where positionString has the form "ra dec radius" in degrees.

In [28]:
mashup_request = {
        "service":"Mast.Caom.Filtered.Position",
        "format":"json",
        "params":{
            "columns":"COUNT_BIG(*)",
            "filters":[
                {"paramName":"dataproduct_type",
                 "values":["cube"]
                }],
            "position":"210.8023, 54.349, 0.24"
        }}

headers, out_string = mast_query(mashup_request)
count = json.loads(out_string)

pp.pprint(count)
{   'data': [{'Column1': 797}],
    'fields': [{'name': 'Column1', 'type': 'string'}],
    'msg': '',
    'paging': {   'page': 1,
                  'pageSize': 1,
                  'pagesFiltered': 1,
                  'rows': 1,
                  'rowsFiltered': 1,
                  'rowsTotal': 1},
    'status': 'COMPLETE'}

Step 3: Getting Data Products

Before we can download observational data, we need to figure out what data products are associated with the observation(s) we are interested in. To do that we will use the Mast.Caom.Products service. This service takes the "obsid" ("Product Group ID" is the formatted label visible in the Portal) and returns information about the associated data products. This query can be thought of as somewhat analogous to adding an observation to the basket in the Portal.

In [30]:
# Picking the first Hubble Space Telescope observation
interesting_observation = mast_data_table[mast_data_table["obs_collection"] == "HST"][0]
print("Observation:",
      [interesting_observation[x] for x in ['dataproduct_type', 'obs_collection', 'instrument_name']])
Observation: ['image', 'HST', 'WFC3/UVIS']
In [31]:
obsid = interesting_observation['obsid']

product_request = {'service':'Mast.Caom.Products',
                  'params':{'obsid':obsid},
                  'format':'json',
                  'pagesize':100,
                  'page':1}   

headers, obs_products_string = mast_query(product_request)

obs_products = json.loads(obs_products_string)

print("Number of data products:", len(obs_products["data"]))
print("Product information column names:")
pp.pprint(obs_products['fields'])
Number of data products: 26
Product information column names:
[   {'name': 'obsID', 'type': 'string'},
    {'name': 'obs_collection', 'type': 'string'},
    {'name': 'dataproduct_type', 'type': 'string'},
    {'name': 'obs_id', 'type': 'string'},
    {'name': 'description', 'type': 'string'},
    {'name': 'type', 'type': 'string'},
    {'name': 'dataURI', 'type': 'string'},
    {'name': 'productType', 'type': 'string'},
    {'name': 'productGroupDescription', 'type': 'string'},
    {'name': 'productSubGroupDescription', 'type': 'string'},
    {'name': 'productDocumentationURL', 'type': 'string'},
    {'name': 'project', 'type': 'string'},
    {'name': 'prvversion', 'type': 'string'},
    {'name': 'proposal_id', 'type': 'string'},
    {'name': 'productFilename', 'type': 'string'},
    {'name': 'size', 'type': 'int'},
    {'name': 'parent_obsid', 'type': 'string'},
    {'name': 'dataRights', 'type': 'string'},
    {'name': 'calib_level', 'type': 'int'},
    {'name': '_selected_', 'type': 'boolean'}]

We might not want to download all of the available products, so let's take a closer look and see which ones are important.

In [32]:
pp.pprint([x.get('productType',"") for x in obs_products["data"]])
[   'AUXILIARY',
    'PREVIEW',
    'SCIENCE',
    'AUXILIARY',
    'AUXILIARY',
    'PREVIEW',
    'SCIENCE',
    'SCIENCE',
    'AUXILIARY',
    'AUXILIARY',
    'PREVIEW',
    'SCIENCE',
    'SCIENCE',
    'AUXILIARY',
    'AUXILIARY',
    'PREVIEW',
    'SCIENCE',
    'SCIENCE',
    'AUXILIARY',
    'AUXILIARY',
    'PREVIEW',
    'SCIENCE',
    'SCIENCE',
    'AUXILIARY',
    'PREVIEW',
    'SCIENCE']
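
A more compact view of the same information is a tally of each product type; a quick sketch using the standard library's Counter:

from collections import Counter

# Count how many products of each type are attached to this observation
print(Counter(x.get('productType', '') for x in obs_products['data']))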

Let's download all of the science products. We'll start by making an Astropy Table containing just the science product information. Then we'll download the data files using two different methods.

In [33]:
sci_prod_arr = [x for x in obs_products['data'] if x.get("productType", None) == 'SCIENCE']
science_products = Table()

for col, atype in [(x['name'], x['type']) for x in obs_products['fields']]:
    if atype=="string":
        atype="str"
    if atype=="boolean":
        atype="bool"
    if atype == "int":
        atype = "float" # array may contain nan values, and they do not exist in numpy integer arrays
    science_products[col] = np.array([x.get(col,None) for x in sci_prod_arr],dtype=atype)

print("Number of science products:",len(science_products))
print(science_products)
Number of science products: 10
  obsID    obs_collection dataproduct_type ... dataRights calib_level _selected_
---------- -------------- ---------------- ... ---------- ----------- ----------
2008401691            HST            image ...     PUBLIC         2.0      False
2008401652            HST            image ...     PUBLIC         2.0      False
2008401652            HST            image ...     PUBLIC         2.0      False
2008401622            HST            image ...     PUBLIC         2.0      False
2008401622            HST            image ...     PUBLIC         2.0      False
2008401594            HST            image ...     PUBLIC         2.0      False
2008401594            HST            image ...     PUBLIC         2.0      False
2008401681            HST            image ...     PUBLIC         2.0      False
2008401681            HST            image ...     PUBLIC         2.0      False
2008401713            HST            image ...     PUBLIC         2.0      False

Step 4a: Downloading files individually

We can download the data files directly, one at a time, using the file endpoint of the MAST download service.

To download data files directly, we really only need the 'dataURI' field; however, we will also use the obs_collection, obs_id, and productFilename fields to create a unique download path for each file.

We will loop through the files and download them, saving each one as mastFiles/obs_collection/obs_id/productFilename. While you can use any naming convention (or none), this one is recommended because it guarantees a unique path for each file.

In [35]:
download_url = 'https://mast.stsci.edu/api/v0.1/Download/file?'

for row in science_products[:2]:     

    # make file path
    out_path = os.path.join("mastFiles", row['obs_collection'], row['obs_id'])
    if not os.path.exists(out_path):
        os.makedirs(out_path)
    out_path = os.path.join(out_path, os.path.basename(row['productFilename']))
        
    # Download the data
    payload = {"uri":row['dataURI']}
    resp = requests.get(download_url, params=payload)
    
    # save to file
    with open(out_path,'wb') as FLE:
        FLE.write(resp.content)
        
    # check for file 
    if not os.path.isfile(out_path):
        print("ERROR: " + out_path + " failed to download.")
    else:
        print("COMPLETE: ", out_path)
COMPLETE:  mastFiles/HST/hst_11635_06_wfc3_uvis_f469n_ib3p06/hst_11635_06_wfc3_uvis_f469n_ib3p06_drc.fits
COMPLETE:  mastFiles/HST/hst_11635_06_wfc3_uvis_f469n_ib3p06wj/hst_11635_06_wfc3_uvis_f469n_ib3p06wj_drc.fits
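
For large files you may prefer to stream the response to disk rather than holding it all in memory. A sketch of the download-and-save portion of the loop above, using requests' streaming interface (same download_url, payload, and out_path as in the loop):

resp = requests.get(download_url, params=payload, stream=True)
with open(out_path, 'wb') as FLE:
    # Write the file in 1 MB chunks instead of loading it all at once
    for chunk in resp.iter_content(chunk_size=1024 * 1024):
        FLE.write(chunk)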

Step 4b: Downloading products in a "bundle"

This is how downloading is done through the Portal and will result in either a compressed file containing all the data products or a curl script that can be run to download the data products at a later time. This can be a more complicated way to access data products; however, it produces exactly the same output as downloading through the Discovery Portal, and the resulting file paths are guaranteed to be unique.

We will use the MAST bundle service to download all of our desired data products as a gzipped tarball.

In [45]:
url_list = [("uri", url) for url in science_products['dataURI'][:2]]
extension = ".tar.gz"

download_url = 'https://mast.stsci.edu/api/v0.1/Download/bundle'
resp = requests.post(download_url + extension, data=url_list)

out_file = "mastDownload" + extension
with open(out_file, 'wb') as FLE:
    FLE.write(resp.content)
    
# check for file 
if not os.path.isfile(out_file):
    print("ERROR: " + out_file + " failed to download.")
else:
    print("COMPLETE: ", out_file)
COMPLETE:  mastDownload.tar.gz
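
Once the bundle has downloaded, it can be unpacked with the standard library's tarfile module; a minimal sketch, extracting into a directory of our choosing (here "mastBundle"):

import tarfile

# Unpack the bundle, preserving the directory structure stored in the archive
with tarfile.open(out_file, "r:gz") as bundle:
    bundle.extractall("mastBundle")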