MAST API Tutorial

An introduction to using the MAST API to query MAST data and catalogs programmatically.

To start with, here are all the imports we need:

In [1]:
import sys
import os
import time
import re
import json

try: # Python 3.x
    from urllib.parse import quote as urlencode
    from urllib.request import urlretrieve
except ImportError:  # Python 2.x
    from urllib import pathname2url as urlencode
    from urllib import urlretrieve

try: # Python 3.x
    import http.client as httplib 
except ImportError:  # Python 2.x
    import httplib   

from astropy.table import Table
import numpy as np

import pprint
pp = pprint.PrettyPrinter(indent=4)

Basic MAST Query

Here we will perform a basic MAST query on M101, equivalent to choosing "All MAST Observations" and searching for M101 in the Portal.

We will then select some observations, view their data products, and download some data.

Step 0: MAST Request

All MAST requests (except direct download requests) have the same form:

  • HTTPS connect to MAST server
  • POST MAST request to /api/v0/invoke
  • MAST request is of the form "request={request json object}"

Because every request looks the same, we will write a function to handle the HTTPS interaction, taking in a MAST request and returning the server response.

In [2]:
def mastQuery(request):
    """Perform a MAST query.
    
        Parameters
        ----------
        request (dictionary): The MAST request json object
        
        Returns head,content where head is the response HTTP headers, and content is the returned data"""
    
    server='mast.stsci.edu'

    # Grab Python Version 
    version = ".".join(map(str, sys.version_info[:3]))

    # Create Http Header Variables
    headers = {"Content-type": "application/x-www-form-urlencoded",
               "Accept": "text/plain",
               "User-agent":"python-requests/"+version}

    # Encoding the request as a json string
    requestString = json.dumps(request)
    requestString = urlencode(requestString)
    
    # opening the https connection
    conn = httplib.HTTPSConnection(server)

    # Making the query
    conn.request("POST", "/api/v0/invoke", "request="+requestString, headers)

    # Getting the response
    resp = conn.getresponse()
    head = resp.getheaders()
    content = resp.read().decode('utf-8')

    # Close the https connection
    conn.close()

    return head,content
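
To make the request format concrete, here is a minimal sketch of how the POST body is assembled, without contacting the server. The helper name build_request_body is hypothetical and is not part of the MAST API; it mirrors the encoding steps inside mastQuery using the Python 3 standard library.

```python
import json
from urllib.parse import quote, unquote

def build_request_body(request):
    """Return the form-encoded POST body for a MAST request dictionary.

    Hypothetical helper: it repeats the encoding steps used in mastQuery.
    """
    # Serialize the request to JSON, then percent-encode it for the form body
    return "request=" + quote(json.dumps(request))

example = {"service": "Mast.Name.Lookup",
           "params": {"input": "M101", "format": "json"}}
body = build_request_body(example)
print(body)

# Decoding the body recovers the original request dictionary
decoded = json.loads(unquote(body[len("request="):]))
print(decoded == example)
```

Printing the body before sending it can help when a request is rejected, since it shows exactly what the server would receive.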

Step 1: Name Resolver

The first step of this query is to "resolve" M101 into a position on the sky. To do this we use the Mast.Name.Lookup service.

As with all of our services, we recommend using the json format, as the json output is most easily parsed.

In [3]:
objectOfInterest = 'M101'

resolverRequest = {'service':'Mast.Name.Lookup',
                     'params':{'input':objectOfInterest,
                               'format':'json'},
                     }

headers,resolvedObjectString = mastQuery(resolverRequest)

resolvedObject = json.loads(resolvedObjectString)

pp.pprint(resolvedObject)
{   'resolvedCoordinate': [   {   'cacheDate': 'Nov 14, 2017 8:17:09 AM',
                                  'cached': True,
                                  'canonicalName': 'MESSIER 101',
                                  'decl': 54.34895,
                                  'objectType': 'G',
                                  'ra': 210.80227,
                                  'radius': 0.24000000000000002,
                                  'resolver': 'NED',
                                  'resolverTime': 132,
                                  'searchRadius': -1.0,
                                  'searchString': 'm101'}],
    'status': ''}

The resolver returns a variety of information about the resolved object; for our purposes, all we need are the RA and Dec:

In [4]:
objRa = resolvedObject['resolvedCoordinate'][0]['ra']
objDec = resolvedObject['resolvedCoordinate'][0]['decl']
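
Indexing element [0] fails if the resolver returns no coordinates. The sketch below is a slightly more defensive version; the function name first_resolved_position is hypothetical, and it assumes (not confirmed by this tutorial) that an unresolved name produces an empty or missing 'resolvedCoordinate' list.

```python
def first_resolved_position(resolved):
    """Return (ra, dec) of the first resolved coordinate, or None if there is none.

    Assumption: an unresolved name yields an empty or missing
    'resolvedCoordinate' list in the decoded response.
    """
    coords = resolved.get('resolvedCoordinate') or []
    if not coords:
        return None
    return coords[0]['ra'], coords[0]['decl']

# Sample responses shaped like the resolver output shown above
sample = {'resolvedCoordinate': [{'ra': 210.80227, 'decl': 54.34895}],
          'status': ''}
empty = {'resolvedCoordinate': [], 'status': ''}

print(first_resolved_position(sample))
print(first_resolved_position(empty))
```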

Step 2: MAST Query

Now that we have the RA and Dec we can perform the MAST query on M101. To do this we will use the Mast.Caom.Cone service. The output of this query is the information that gets loaded into the grid when running a Portal query.

Because M101 has been observed many times, there will be several thousand results. We can use the MashupRequest 'page' and 'pagesize' properties to control how we view these results, either by choosing a pagesize large enough to accommodate all of the results, or choosing a smaller pagesize and paging through them using the page property. The json response object will include information about paging, so check that to see if you need to collect additional results.

Note: page and pagesize must both be specified (or neither); if only one is specified, the other will be ignored.

In [5]:
mastRequest = {'service':'Mast.Caom.Cone',
               'params':{'ra':objRa,
                         'dec':objDec,
                         'radius':0.2},
               'format':'json',
               'pagesize':2000,
               'page':1,
               'removenullcolumns':True,
               'removecache':True}

headers,mastDataString = mastQuery(mastRequest)

mastData = json.loads(mastDataString)

print(mastData.keys())
print("Query status:",mastData['status'])
dict_keys(['status', 'msg', 'data', 'fields', 'paging'])
Query status: COMPLETE
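
The paging approach described above can be sketched as a loop that requests one page at a time. This is a minimal sketch under assumptions: collect_all_pages is a hypothetical helper, the query argument is any callable with the same signature as mastQuery, and the 'pagesFiltered' entry of the 'paging' dictionary is taken to be the total number of pages. It does not handle a status of EXECUTING.

```python
import json

def collect_all_pages(query, request, pagesize=500, max_pages=100):
    """Gather the 'data' rows from every page of a MAST query.

    query: callable taking a request dict and returning (headers, content),
           for example the mastQuery function defined earlier.
    Assumption: paging['pagesFiltered'] holds the total page count.
    """
    rows = []
    page = 1
    while page <= max_pages:
        # Copy the request so the caller's dictionary is left unchanged
        paged = dict(request, page=page, pagesize=pagesize)
        headers, content = query(paged)
        result = json.loads(content)
        rows.extend(result.get('data', []))
        total_pages = result.get('paging', {}).get('pagesFiltered', 1)
        if page >= total_pages:
            break
        page += 1
    return rows

# Usage, assuming mastQuery and mastRequest are defined as in this tutorial:
# allRows = collect_all_pages(mastQuery, mastRequest, pagesize=1000)
```

The max_pages cap is there only to stop a runaway loop if the paging information is not what the sketch expects.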

In the json response object, the "fields" dictionary holds the column names and types. The column names are not the formatted column headings that appear in the Portal grid (these are not guaranteed to be unique), but instead are the column names from the database. These names can be accessed in the Portal by hovering over a column name, or in the details pane of "Show Details." Details about returned columns for various queries can be found in the "Related Pages" section of the API documentation.

In [6]:
pp.pprint(mastData['fields'][:5])
[   {'name': 'dataproduct_type', 'type': 'string'},
    {'name': 'obs_collection', 'type': 'string'},
    {'name': 'instrument_name', 'type': 'string'},
    {'name': 'project', 'type': 'string'},
    {'name': 'filters', 'type': 'string'}]

The data is found (predictably) under the "data" keyword. The data is a list of dictionaries, where each row corresponds to one observation collection (just like in the Portal grid):

In [7]:
pp.pprint(mastData['data'][0])
{   '_selected_': None,
    'calib_level': 2,
    'dataRights': 'PUBLIC',
    'dataURL': None,
    'dataproduct_type': 'cube',
    'distance': 0,
    'em_max': 394.2,
    'em_min': 301.4,
    'filters': 'U',
    'instrument_name': 'UVOT',
    'jpegURL': 'http://archive.stsci.edu/cgi-bin/hla/fitscut.cgi?red=sw00030896001uuu[6]&size=ALL&output_size=2320',
    'mtFlag': None,
    'objID': 15000541440,
    'obs_collection': 'SWIFT',
    'obs_id': '00030896001',
    'obs_title': None,
    'obsid': 15000804761,
    'project': None,
    'proposal_id': None,
    'proposal_pi': None,
    's_dec': 54.3320323874192,
    's_ra': 210.884468835902,
    's_region': 'POLYGON -148.68197799999996 54.250447 -148.72010699999998 '
                '54.380959 -148.771094 54.525161 -148.80113844209316 '
                '54.522015545787568 -148.80339800000002 54.528132 '
                '-149.00890800000002 54.505617 -149.22571413697636 '
                '54.475317193971954 -149.23336 54.474477 -149.23331320240226 '
                '54.4742473291218 -149.25735499999996 54.470859 '
                '-149.25591688496976 54.465950017458638 -149.37013000000002 '
                '54.452035 -149.36860315816716 54.445640407505977 '
                '-149.38111000000004 54.443785 -149.339132 54.309445 '
                '-149.32235856865978 54.265305792795992 -149.31201271706755 '
                '54.2313318148205 -149.29647799999998 54.175928999999996 '
                '-149.28874914509623 54.176570406780428 -149.28693099999998 '
                '54.171759 -149.13378760314151 54.189321891688223 -149.070602 '
                '54.194462 -148.83541400000001 54.220082 -148.83659838226143 '
                '54.22526868924821 -148.83561599999996 54.225382 '
                '-148.83630396949408 54.228123086154028 -148.706526 54.241669 '
                '-148.70769828707938 54.247103078929328 -148.68197799999996 '
                '54.250447 -148.68197799999996 54.250447',
    'srcDen': 5885,
    't_exptime': 1556.7224828509793,
    't_max': 54160.8269792,
    't_min': 54160.089537,
    't_obs_release': None,
    'target_classification': None,
    'target_name': 'M101ULX-1',
    'wavelength_region': 'OPTICAL'}

The data table can be used as is, but it can also be translated into different formats depending on user preference. Here we will demonstrate how to put the results of a MAST query into an Astropy Table.

In [8]:
mastDataTable = Table()

for col,atype in [(x['name'],x['type']) for x in mastData['fields']]:
    if atype=="string":
        atype="str"
    if atype=="boolean":
        atype="bool"
    mastDataTable[col] = np.array([x.get(col,None) for x in mastData['data']],dtype=atype)
    
print(mastDataTable)
dataproduct_type obs_collection instrument_name ...    distance   _selected_
---------------- -------------- --------------- ... ------------- ----------
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
             ...            ...             ... ...           ...        ...
           image           HLSP       WFC3/UVIS ...  36.206866078      False
           image           HLSP       WFC3/UVIS ...  36.206866078      False
           image           HLSP             ACS ... 70.8762227799      False
           image           HLSP             ACS ... 70.8762227799      False
           image           HLSP             ACS ... 70.8762227799      False
           image           HLSP             ACS ... 91.5821757216      False
           image           HLSP             ACS ... 91.5821757216      False
           image           HLSP             ACS ... 91.5821757216      False
           image           HLSP             ACS ... 91.5821757216      False
           image           HLSP             ACS ... 91.5821757216      False
           image           HLSP             ACS ... 91.5821757216      False
Length = 2000 rows
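
As an example of another output format, the same response can be written out as CSV using only the standard library. This is a sketch, not part of the MAST API: rows_to_csv is a hypothetical helper that takes the 'fields' and 'data' entries of a decoded response.

```python
import csv
import io

def rows_to_csv(fields, rows):
    """Serialize MAST response rows to CSV text, with the header in 'fields' order."""
    names = [f['name'] for f in fields]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=names, restval='',
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    for row in rows:
        # Write missing values (None) as empty cells
        writer.writerow({k: ('' if v is None else v) for k, v in row.items()})
    return buffer.getvalue()

# Small sample shaped like the response above
fields = [{'name': 'obs_collection', 'type': 'string'},
          {'name': 'filters', 'type': 'string'}]
rows = [{'obs_collection': 'SWIFT', 'filters': 'U', 'project': None}]
csv_text = rows_to_csv(fields, rows)
print(csv_text)
```

With the objects from this tutorial, the call would be rows_to_csv(mastData['fields'], mastData['data']).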

At this point we are ready to do analysis on these observations. However, if we want to access the actual data products, there are a few more steps.

Step 2 tangent: filtered query

An alternative to the cone search query is the filtered query. This is analogous to Advanced Search in the Portal and results in the same list of observations as the cone search, but filtered on other criteria. The services we'll use to do this are Mast.Caom.Filtered and Mast.Caom.Filtered.Position.

Filtered queries can often end up being quite large, so we will first do a query that just returns the number of results and decide if it is manageable before we do the full query. We do this by supplying the parameter "columns":"COUNT_BIG(*)".

In [9]:
mashupRequest = {"service":"Mast.Caom.Filtered",
                 "format":"json",
                 "params":{
                     "columns":"COUNT_BIG(*)",
                     "filters":[
                         {"paramName":"filters",
                          "values":["NUV","FUV"],
                          "separator":";"
                         },
                         {"paramName":"t_max",
                          "values":[{"min":52264.4586,"max":54452.8914}], #MJD
                         },
                         {"paramName":"obsid",
                          "values":[],
                          "freeText":"%200%"}
                     ]}}
    
headers,outString = mastQuery(mashupRequest)
countData = json.loads(outString)

pp.pprint(countData)
{   'data': [{'Column1': 1068}],
    'fields': [{'name': 'Column1', 'type': 'string'}],
    'msg': '',
    'paging': {   'page': 1,
                  'pageSize': 1,
                  'pagesFiltered': 1,
                  'rows': 1,
                  'rowsFiltered': 1,
                  'rowsTotal': 1},
    'status': 'COMPLETE'}

1,068 isn't too many observations so we can go ahead and request them. The only thing we need to do differently is change "columns":"COUNT_BIG(*)" to "columns":"*".

In [10]:
mashupRequest = {"service":"Mast.Caom.Filtered",
                 "format":"json",
                 "params":{
                     "columns":"*",
                     "filters":[
                         {"paramName":"filters",
                          "values":["NUV","FUV"],
                          "separator":";"
                         },
                         {"paramName":"t_max",
                          "values":[{"min":52264.4586,"max":54452.8914}], #MJD
                         },
                         {"paramName":"obsid",
                          "values":[],
                          "freeText":"%200%"}
                     ]}}
    
headers,outString = mastQuery(mashupRequest)
filteredData = json.loads(outString)

print(filteredData.keys())
print("Query status:",filteredData['status'])
dict_keys(['status', 'msg', 'data', 'fields', 'paging'])
Query status: COMPLETE
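
Since the count request and the full request differ only in the "columns" parameter, the pair can be built from one helper. The sketch below uses hypothetical names (build_filtered_request, result_count); the 'Column1' key is taken from the count response shown above.

```python
def build_filtered_request(filters, columns="*"):
    """Build a Mast.Caom.Filtered request for the given filter list."""
    return {"service": "Mast.Caom.Filtered",
            "format": "json",
            "params": {"columns": columns, "filters": filters}}

def result_count(count_response):
    """Extract the row count from a decoded COUNT_BIG(*) response."""
    return int(count_response['data'][0]['Column1'])

filters = [{"paramName": "filters", "values": ["NUV", "FUV"], "separator": ";"}]

# Same filters, two requests: one for the count, one for the rows
count_request = build_filtered_request(filters, columns="COUNT_BIG(*)")
full_request = build_filtered_request(filters)

# Count response copied from the output above
print(result_count({'data': [{'Column1': 1068}]}))
```
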
In [11]:
pp.pprint(filteredData['data'][0])
{   'calib_level': 2,
    'dataRights': 'PUBLIC',
    'dataURL': None,
    'dataproduct_type': 'image',
    'em_max': 180.6,
    'em_min': 134,
    'filters': 'FUV',
    'instrument_name': 'GALEX',
    'jpegURL': 'http://galex.stsci.edu/data/GR6/pipe/01-vsn/03201-MISDR1_16976_0422/d/01-main/0007-img/07-try/qa/MISDR1_16976_0422-xd-int_2color.jpg',
    'mtFlag': None,
    'objID': 1000044986,
    'obs_collection': 'GALEX',
    'obs_id': '2418470520996495360',
    'obs_title': None,
    'obsid': 1000000200,
    'project': 'MIS',
    'proposal_id': None,
    'proposal_pi': None,
    's_dec': 15.0328418507974,
    's_ra': 18.6161994329736,
    's_region': 'CIRCLE ICRS  18.61619943  15.03284185 0.625',
    'srcDen': 5885,
    't_exptime': 1440,
    't_max': 52932.32303240741,
    't_min': 52932.30636574075,
    't_obs_release': 55992.07319,
    'target_classification': None,
    'target_name': 'MISDR1_16976_0422',
    'wavelength_region': 'UV'}
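
The time filter in the request above is given in MJD (Modified Julian Date), and the t_min and t_max values in the result appear to use the same scale. The sketch below converts between MJD and Python datetime objects using the standard definition of MJD as days since 1858-11-17 00:00. It treats the values as UTC and ignores leap seconds, which is an approximation; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# MJD 0 corresponds to 1858-11-17 00:00 (MJD = JD - 2400000.5)
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)

def mjd_to_datetime(mjd):
    """Convert a Modified Julian Date to a timezone-aware datetime (UTC assumed)."""
    return MJD_EPOCH + timedelta(days=mjd)

def datetime_to_mjd(moment):
    """Convert a timezone-aware datetime to a Modified Julian Date."""
    return (moment - MJD_EPOCH) / timedelta(days=1)

# t_min of the first filtered result shown above
print(mjd_to_datetime(52932.30636574075))

# A calendar date turned into a value usable in a t_max filter
print(datetime_to_mjd(datetime(2000, 1, 1, tzinfo=timezone.utc)))
```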

To add position to a filtered query we use the service Mast.Caom.Filtered.Position and add a new parameter "position":"positionString" where positionString has the form "ra, dec, radius", with all three values in degrees.

In [12]:
mashupRequest = {
        "service":"Mast.Caom.Filtered.Position",
        "format":"json",
        "params":{
            "columns":"COUNT_BIG(*)",
            "filters":[
                {"paramName":"dataproduct_type",
                 "values":["cube"]
                }],
            "position":"210.8023, 54.349, 0.24"
        }}

headers,outString = mastQuery(mashupRequest)
countData = json.loads(outString)

pp.pprint(countData)
{   'data': [{'Column1': 789}],
    'fields': [{'name': 'Column1', 'type': 'string'}],
    'msg': '',
    'paging': {   'page': 1,
                  'pageSize': 1,
                  'pagesFiltered': 1,
                  'rows': 1,
                  'rowsFiltered': 1,
                  'rowsTotal': 1},
    'status': 'COMPLETE'}
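
The position string can also be built from numeric values, such as the RA and Dec returned by the name resolver. This is a small sketch with hypothetical helper names; it produces the same comma-separated string used in the request above.

```python
def position_string(ra, dec, radius):
    """Format ra, dec, and radius (all in degrees) as a position string."""
    return "{}, {}, {}".format(ra, dec, radius)

def build_position_request(ra, dec, radius, filters, columns="*"):
    """Build a Mast.Caom.Filtered.Position request around the given position."""
    return {"service": "Mast.Caom.Filtered.Position",
            "format": "json",
            "params": {"columns": columns,
                       "filters": filters,
                       "position": position_string(ra, dec, radius)}}

cube_filter = [{"paramName": "dataproduct_type", "values": ["cube"]}]
request = build_position_request(210.8023, 54.349, 0.24, cube_filter,
                                 columns="COUNT_BIG(*)")
print(request["params"]["position"])
```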

Step 3: Getting Data Products

Before we can download observational data, we need to figure out what data products are associated with the observation(s) we are interested in. To do that we will use the Mast.Caom.Products service. This service takes the "obsid" ("Product Group ID" is the formatted label visible in the Portal) and returns information about the associated data products. This query can be thought of as somewhat analogous to adding an observation to the basket in the Portal.

In [13]:
interestingObservation = mastDataTable[1400]
print("Observation:",
      [interestingObservation[x] for x in ['dataproduct_type', 'obs_collection', 'instrument_name']])
Observation: ['image', 'HST', 'WFPC2/WFC']
In [14]:
obsid = interestingObservation['obsid']

productRequest = {'service':'Mast.Caom.Products',
                 'params':{'obsid':obsid},
                 'format':'json',
                 'pagesize':100,
                 'page':1}   

headers,obsProductsString = mastQuery(productRequest)

obsProducts = json.loads(obsProductsString)

print("Number of data products:",len(obsProducts["data"]))
print("Product information column names:")
pp.pprint(obsProducts['fields'])
Number of data products: 22
Product information column names:
[   {'name': 'obsID', 'type': 'string'},
    {'name': 'obs_collection', 'type': 'string'},
    {'name': 'dataproduct_type', 'type': 'string'},
    {'name': 'obs_id', 'type': 'string'},
    {'name': 'description', 'type': 'string'},
    {'name': 'type', 'type': 'string'},
    {'name': 'dataURI', 'type': 'string'},
    {'name': 'productType', 'type': 'string'},
    {'name': 'productGroupDescription', 'type': 'string'},
    {'name': 'productSubGroupDescription', 'type': 'string'},
    {'name': 'productDocumentationURL', 'type': 'string'},
    {'name': 'project', 'type': 'string'},
    {'name': 'prvversion', 'type': 'string'},
    {'name': 'productFilename', 'type': 'string'},
    {'name': 'size', 'type': 'int'},
    {'name': '_selected_', 'type': 'boolean'}]

We might not want to download all of the available products, so let's take a closer look and see which ones are important.

In [15]:
pp.pprint([x.get('productType',"") for x in obsProducts["data"]])
[   'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'AUXILIARY',
    'SCIENCE',
    'SCIENCE',
    'SCIENCE',
    'SCIENCE',
    'SCIENCE',
    'SCIENCE',
    'SCIENCE']
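
For a longer product list, a tally is easier to read than the full printout. Here is a minimal sketch using collections.Counter; count_product_types is a hypothetical helper, and the sample list simply reproduces the counts visible in the output above.

```python
from collections import Counter

def count_product_types(products):
    """Tally the 'productType' values in a list of product dictionaries."""
    return Counter(p.get('productType', '') for p in products)

# Sample data matching the listing above: 15 auxiliary and 7 science products
sample = [{'productType': 'AUXILIARY'}] * 15 + [{'productType': 'SCIENCE'}] * 7
type_counts = count_product_types(sample)
print(type_counts)
```

With the objects from this tutorial, the call would be count_product_types(obsProducts['data']).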

Let's download all of the science products. We'll start by making an Astropy Table containing just the science product information. Then we'll download the datafiles using two different methods.

In [16]:
sciProdArr = [x for x in obsProducts['data'] if x.get("productType",None) == 'SCIENCE']
scienceProducts = Table()

for col,atype in [(x['name'],x['type']) for x in obsProducts['fields']]:
    if atype=="string":
        atype="str"
    if atype=="boolean":
        atype="bool"
    if atype == "int":
        atype = "float" # array may contain nan values, and they do not exist in numpy integer arrays
    scienceProducts[col] = np.array([x.get(col,None) for x in sciProdArr],dtype=atype)

print("Number of science products:",len(scienceProducts))
print(scienceProducts)
Number of science products: 7
  obsID    obs_collection dataproduct_type ...    size    _selected_
---------- -------------- ---------------- ... ---------- ----------
2003738722            HST            image ... 10307520.0      False
2003738722            HST            image ...  5178240.0      False
2003738722            HST            image ...  5184000.0      False
2003738722            HST            image ...  5178240.0      False
2003738722            HST            image ... 10307520.0      False
2003738722            HST            image ...  5184000.0      False
2003738722            HST            image ... 27244800.0      False
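
Before downloading, it can be useful to know how much data is involved. The sketch below sums the 'size' field over a list of product dictionaries such as sciProdArr; total_size_bytes is a hypothetical helper, and the sample sizes are copied from the table above.

```python
def total_size_bytes(products):
    """Sum the 'size' field (bytes) over product dictionaries, skipping missing sizes."""
    return sum(int(p['size']) for p in products if p.get('size') is not None)

# Sizes of the seven science products listed above
sizes = [10307520, 5178240, 5184000, 5178240, 10307520, 5184000, 27244800]
total = total_size_bytes([{'size': s} for s in sizes])
print(total, "bytes, about", round(total / 1e6, 1), "MB")
```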

Step 4a: Downloading products using the bundler

This is how downloading is done through the Portal and will result in either a compressed file containing all the data products, or a curl script that can be run to download the data products at a later time. This can be a more complicated way to access data products; however, it will produce exactly the same output as downloading through the Discovery Portal, and the resulting file paths are guaranteed to be unique.

We will use the Mast.Bundle.Request service to download all of our desired data products as a gzipped tarball.

The fields we need to create the download request are: dataURI, description, and dataproduct_type. We will also use obs_collection, obs_id, and productFilename to create the download path for each file.

In [17]:
urls = scienceProducts['dataURI']
descriptions = scienceProducts['description'] 
productTypes = scienceProducts['dataproduct_type']
outPaths = ["mastFiles/"+x['obs_collection']+'/'+x['obs_id']+'/'+x['productFilename'] for x in scienceProducts]
zipFilename = "mastDownload"
extension = "tar.gz"

Now that we have collected all the information we need, we can build the download request. Note that two of the parameters (urlList and pathList) take comma separated strings, while another two (descriptionList and productTypeList) take lists. This is a known issue, and will be fixed, but for now it's the way it is.

This query may take some time. If we are returned a status of EXECUTING, we will simply rerun the query until it completes.
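
One way to rerun a query until it completes is a small polling loop. The following is a sketch under assumptions: query_until_complete is a hypothetical helper, the query argument is any callable with the same signature as mastQuery, and the wait time and attempt limit are arbitrary choices.

```python
import json
import time

def query_until_complete(query, request, wait_seconds=5, max_attempts=20):
    """Rerun a MAST query until its status is no longer EXECUTING.

    query: callable taking a request dict and returning (headers, content).
    Returns the decoded response, or raises RuntimeError after max_attempts.
    """
    for attempt in range(max_attempts):
        headers, content = query(request)
        result = json.loads(content)
        if result.get('status') != 'EXECUTING':
            return result
        # Pause before asking again so the server is not flooded with requests
        time.sleep(wait_seconds)
    raise RuntimeError("query still EXECUTING after %d attempts" % max_attempts)

# Usage, assuming mastQuery and a request dictionary are defined:
# bundleInfo = query_until_complete(mastQuery, bundleRequest)
```

The returned status may be something other than COMPLETE, so it is still worth checking it before using the result.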

In [18]:
bundleRequest = {"service":"Mast.Bundle.Request",
                 "params":{"urlList":",".join(urls),
                           "filename":zipFilename,
                           "pathList":",".join(outPaths),
                           "descriptionList":list(descriptions),
                           "productTypeList":list(productTypes),
                           "extension":extension},
                 "format":"json",
                 "page":1,
                 "pagesize":1000}  

headers,bundleString = mastQuery(bundleRequest)
bundleInfo = json.loads(bundleString)

pp.pprint(bundleInfo)
{   'bytesStreamed': 68586096,
    'fileStatusList': {   'mast:HST/product/u9o40406m/u9o40406m_c0f.fits': '{"status":"COMPLETE"}',
                          'mast:HST/product/u9o40406m/u9o40406m_c0m.fits': '{"status":"COMPLETE"}',
                          'mast:HST/product/u9o40406m/u9o40406m_c1f.fits': '{"status":"COMPLETE"}',
                          'mast:HST/product/u9o40406m/u9o40406m_c1m.fits': '{"status":"COMPLETE"}',
                          'mast:HST/product/u9o40406m/u9o40406m_d0f.fits': '{"status":"COMPLETE"}',
                          'mast:HST/product/u9o40406m/u9o40406m_d0m.fits': '{"status":"COMPLETE"}',
                          'mast:HST/product/u9o40406m/u9o40406m_drz.fits': '{"status":"COMPLETE"}'},
    'manifestUrl': 'https://mast.stsci.edu/portal/Download/stage/anonymous/public/af7e1604-d14c-44b6-b747-d66fd84d0a6b/mastDownload_MANIFEST.HTML',
    'msg': '',
    'progress': 1,
    'status': 'COMPLETE',
    'statusList': {   'mast:HST/product/u9o40406m/u9o40406m_c0f.fits': 'COMPLETE',
                      'mast:HST/product/u9o40406m/u9o40406m_c0m.fits': 'COMPLETE',
                      'mast:HST/product/u9o40406m/u9o40406m_c1f.fits': 'COMPLETE',
                      'mast:HST/product/u9o40406m/u9o40406m_c1m.fits': 'COMPLETE',
                      'mast:HST/product/u9o40406m/u9o40406m_d0f.fits': 'COMPLETE',
                      'mast:HST/product/u9o40406m/u9o40406m_d0m.fits': 'COMPLETE',
                      'mast:HST/product/u9o40406m/u9o40406m_drz.fits': 'COMPLETE'},
    'url': 'https://mast.stsci.edu/portal/Download/stage/anonymous/public/af7e1604-d14c-44b6-b747-d66fd84d0a6b/mastDownload.tar.gz'}

The information returned from this query tells us the status of each file we tried to download and gives us two important urls. The 'manifestUrl' displays information about each downloaded file, and if it was not downloaded, gives the associated error message. This is the manifest.html document that you get when downloading through the Portal. The 'url' is the location of the actual file containing our data. We can download this file using any method we like.

In [19]:
urlretrieve(bundleInfo['url'], zipFilename+"."+extension)
Out[19]:
('mastDownload.tar.gz', <http.client.HTTPMessage at 0x10a4178d0>)
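
Once the tarball is on disk it can be unpacked with the standard tarfile module. This is a sketch with a hypothetical helper name, extract_bundle. It includes a basic check that no archive member would be written outside the destination directory; that check is a general precaution and does not cover every case (symbolic links, for instance).

```python
import os
import tarfile

def extract_bundle(archive_path, dest_dir):
    """Extract a .tar.gz bundle into dest_dir and return the extracted file names."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Refuse members that would be written outside dest_dir
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe path in archive: " + member.name)
        archive.extractall(dest_root, members=members)
    return [m.name for m in members if m.isfile()]

# Usage with the file downloaded above (the destination name is arbitrary):
# extractedFiles = extract_bundle("mastDownload.tar.gz", "mastDownload")
```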

Step 4b: Direct Download

Instead of going through the Mast.Bundle.Request service, we can directly download the data files one at a time, using the MAST download service. (Note: this does not work for HSC spectra which should be downloaded using the bundler method outlined above.)

To download data files directly we really only need the 'dataURI' field; however, we will also use the obs_collection, obs_id, and productFilename fields to create a unique download path for each file.

We will loop through the files and download them, saving each one as mastFiles/obs_collection/obs_id/productFilename. While you can use any naming convention (or none) this one is recommended because it guarantees a unique path for each file.

In [20]:
server='mast.stsci.edu'
conn = httplib.HTTPSConnection(server)

for row in scienceProducts:     

    # make file path
    outPath = "mastFiles/"+row['obs_collection']+'/'+row['obs_id']
    if not os.path.exists(outPath):
        os.makedirs(outPath)
    outPath += '/'+row['productFilename']
        
    # Download the data
    uri = row['dataURI']
    conn.request("GET", "/api/v0/download/file?uri="+uri)
    resp = conn.getresponse()
    fileContent = resp.read()
    
    # save to file
    with open(outPath,'wb') as FLE:
        FLE.write(fileContent)
        
    # check for file 
    if not os.path.isfile(outPath):
        print("ERROR: " + outPath + " failed to download.")
    else:
        print("COMPLETE: ", outPath)

conn.close()
COMPLETE:  mastFiles/HST/U9O40406M/u9o40406m_c0m.fits
COMPLETE:  mastFiles/HST/U9O40406M/u9o40406m_c1m.fits
COMPLETE:  mastFiles/HST/U9O40406M/u9o40406m_d0f.fits
COMPLETE:  mastFiles/HST/U9O40406M/u9o40406m_d0m.fits
COMPLETE:  mastFiles/HST/U9O40406M/u9o40406m_c0f.fits
COMPLETE:  mastFiles/HST/U9O40406M/u9o40406m_c1f.fits
COMPLETE:  mastFiles/HST/U9O40406M/u9o40406m_drz.fits
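
The loop above writes whatever the server returns, so the file check passes even if the response was an error page. A sketch of a stricter variant follows: save_response is a hypothetical helper that writes the body only when the HTTP status is 200. It relies only on the status attribute and read() method of the response object returned by conn.getresponse().

```python
import os

def save_response(resp, out_path):
    """Write the response body to out_path only if the HTTP status is 200.

    resp: object with a 'status' attribute and a read() method, such as
          the response returned by conn.getresponse().
    Returns True if the file was written, False otherwise.
    """
    if resp.status != 200:
        resp.read()  # drain the body so the connection can be reused
        return False
    # Create the target directory if needed, then write the file
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    with open(out_path, "wb") as handle:
        handle.write(resp.read())
    return True

# Usage inside the download loop above, in place of the read and write steps:
# ok = save_response(conn.getresponse(), outPath)
```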