Important

This page is for preview purposes only to show the content of Amazon SageMaker JumpStart Industry Example Notebooks.

Note

The SageMaker JumpStart Industry example notebooks are hosted and runnable only through SageMaker Studio. Log in to the SageMaker console, and launch SageMaker Studio. For instructions on how to access the notebooks, see SageMaker JumpStart and SageMaker JumpStart Industry in the Amazon SageMaker Developer Guide.

Important

The example notebooks are for demonstrative purposes only. The notebooks are not financial advice and should not be relied on as financial or investment advice.

Machine Learning on a TabText Dataframe

An Example Based on the Paycheck Protection Program

The Paycheck Protection Program (PPP) was created by the U.S. government to enable employers struggling with COVID-related business adversities to make payments to their employees. For more information, see the Paycheck Protection Program. In this example notebook, you’ll learn how to run a machine learning model on a sample of companies in the program over the first two quarters of 2020.

In this notebook, we take U.S. Securities and Exchange Commission (SEC) filing data from some of the companies that took loans under this program. We demonstrate how to merge the SEC filing data (text data) with stock price data (tabular data) using the SageMaker JumpStart Industry Python SDK. The library’s build_tabText function merges text dataframes with numeric dataframes to create a multimodal dataframe for machine learning.

A subset of the list of tickers of firms that took PPP loans was obtained from the authors of the following paper:
- Balyuk, T., Prabhala, N. and Puri, M. (November 2020, revised June 2021), Indirect Costs of Government Aid and Intermediary Supply Effects: Lessons from the Paycheck Protection Program. NBER Working Paper No. w28114, Available at SSRN: https://ssrn.com/abstract=3735682.

The PPP was created to help companies that were experiencing financial hardship and might otherwise have been unable to make payroll. To the extent that companies without financial constraints took PPP money, they may have experienced a decrease in share value and then returned the money to recoup that value.

We are interested in seeing if an ML model is able to detect whether the text of filings of companies that returned PPP money is different from that of companies that retained PPP money.

Important: This example notebook is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice.

General Steps

This notebook takes the following steps:

  1. Read in over 400 tickers of companies that took the PPP loans.

  2. Read in the 10-K, 10-Q, and 8-K filings for all Paycheck Protection Program tickers during Q1 and Q2 of 2020. Texts in the SEC filings are loaded using the smjsindustry.DataLoader class.

  3. Load synthetic time-series data of daily stock prices for the given tickers during Q1 and Q2 of 2020, and convert the prices to returns. The simulated data is generated to be correlated with the appropriate labels so that the exercise is meaningful. An analogous exercise with real data yields similar results.

  4. Merge text and tabular datasets using the smjsindustry.build_tabText API.

  5. Conduct machine learning analysis to obtain a baseline accuracy.

  6. Build an AutoGluon model to analyze how stock prices and the texts of the SEC filings relate to each company’s decision to accept or return the money. This notebook flags all filings of companies that returned the money with a 1 and the filings of companies that kept the money with a 0. A good fit to the data implies that the model can use the text to distinguish the companies that returned PPP funding from those that did not.

Objective

The goal of this notebook is to fit an ML model to data on companies that took funding from the Paycheck Protection Program (PPP) and to study how stock prices, returns, and text from the SEC forms relate to their decisions to return the money.

Participation in the PPP is reported in each company’s 8-K, an SEC filing that is required when a public company experiences a material change in business conditions. In addition to the 8-K filings, 10-K and 10-Q filings, which present a comprehensive summary of a company’s financial performance, are also used as sources of input for this study. The stock data is synthetically generated to be correlated with the labels. You can repeat this exercise with actual data as needed.

SageMaker Studio Kernel Setup

The recommended kernel is Python 3 (Data Science). DO NOT use the Python 3 (SageMaker JumpStart Data Science 1.0) kernel, because its preinstalled dependencies differ. For the instance type, a larger instance with sufficient memory is helpful for downloading the materials that follow.

Load Data, SDK, and Dependencies

The following code cells download the smjsindustry SDK (https://pypi.org/project/smjsindustry/), its dependencies, and the dataset from an S3 bucket prepared by SageMaker JumpStart Industry. You will learn how to use the smjsindustry SDK, which contains various APIs to curate SEC datasets. The dataset in this example was synthetically generated using the smjsindustry package’s SEC Forms Retrieval tool. For more information, see the SageMaker JumpStart Industry Python SDK documentation.

notebook_artifact_bucket = 'jumpstart-cache-alpha-us-west-2'
notebook_data_prefix = 'smfinance-notebook-data/ppp'
notebook_sdk_prefix = 'smfinance-notebook-dependency/smjsindustry'
notebook_autogluon_prefix = 'smfinance-notebook-dependency/autogluon'
data_bucket = f's3://{notebook_artifact_bucket}/{notebook_data_prefix}'
!aws s3 sync $data_bucket ./

Install the SageMaker JumpStart Industry Python SDK and the dependencies that are needed for machine learning, because they are not available by default in Studio.

sdk_bucket = f's3://{notebook_artifact_bucket}/{notebook_sdk_prefix}'
!aws s3 sync $sdk_bucket ./

!pip install --no-index smjsindustry-1.0.0-py3-none-any.whl

Step 1: Read in the Tickers

Over 400 tickers are used for this study.

%pylab inline
import pandas as pd
import os

ppp_tickers = pd.read_excel("ppp_tickers.xlsx", index_col = None, sheet_name=0)
print("Number of PPP tickers =", ppp_tickers.shape[0])
ticker_list = list(set(ppp_tickers.ticker))
ppp_tickers.head()

Step 2: Read in the SEC Forms Filed by These Companies

  1. This notebook retrieves all 10-K, 10-Q, and 8-K forms from the SEC servers for Q1 and Q2 of 2020. This was done using the SageMaker JumpStart Industry Python SDK’s DataLoader class. For reference, the data loader process took around 30 minutes to curate a dataframe of over 4,000 filings.

  2. There is one 10-K or 10-Q form per quarter: the 10-Q is the quarterly report, and the 10-K is the comprehensive annual report filed in place of a fourth-quarter 10-Q.

  3. There can be multiple 8-K forms per quarter, because these are filed whenever there is a material change in business conditions. Depending on how many such events occur, several 8-Ks might be filed. As you will see, this notebook retrieves more than one form per quarter.

  4. The dataset was stored in a CSV file named ppp_10kq_8k_data.csv (351 MB).

Important: This example notebook uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions located in the Accessing EDGAR Data page.
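
The curated CSV file ships with this notebook, so you do not need to rerun the retrieval. For reference, the retrieval step would look roughly like the following sketch. The import paths and argument names follow the smjsindustry SDK documentation, but verify them against your installed version, and replace the S3 output path and email address with your own.

# Schematic sketch of the SEC filings retrieval (for reference only; the
# curated CSV is already provided). Verify the import paths and arguments
# against the smjsindustry documentation for your installed version.
import sagemaker
from smjsindustry import DataLoader
from smjsindustry.finance.processor_config import EDGARDataSetConfig

dataset_config = EDGARDataSetConfig(
    tickers_or_ciks=ticker_list,                    # tickers read in Step 1
    form_types=['10-K', '10-Q', '8-K'],
    filing_date_start='2020-01-01',
    filing_date_end='2020-06-30',
    email_as_user_agent='your-email@example.com')   # SEC EDGAR requires a user agent

data_loader = DataLoader(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.c5.2xlarge',
    volume_size_in_gb=30)

data_loader.load(
    dataset_config,
    's3://<your-bucket>/ppp-filings',               # placeholder output location
    'ppp_10kq_8k_data.csv',                         # output file name
    wait=True,
    logs=True)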

%%time
df_sec = pd.read_csv("ppp_10kq_8k_data.csv")  # Text data
print("Number of SEC filings: ", df_sec.shape[0])

Step 3: Collect Stock Prices and Convert to Returns

  • Given the list of tickers, we synthetically generated stock prices using a geometric Brownian motion simulation. The prices are generated to be consistent with real market data. You can purchase real data for commercial use if needed.

  • Convert the stock prices to returns.

Some tickers might have been delisted since the time of the PPP program.
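
For reference, the following is a minimal sketch of how such prices could be simulated with geometric Brownian motion (GBM). The notebook’s actual generation code is not included; the drift, volatility, and starting price below are illustrative assumptions only.

# Hypothetical GBM simulation of one synthetic daily price series.
# mu, sigma, and s0 are illustrative; the notebook's real parameters are not published.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_days, mu, sigma, s0 = 126, 0.05, 0.30, 100.0      # ~2 quarters of trading days
dt = 1.0 / 252                                      # one trading day, in years

# GBM step: S_{t+1} = S_t * exp((mu - sigma^2/2)*dt + sigma*sqrt(dt)*Z)
z = rng.standard_normal(n_days)
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
prices = s0 * np.exp(np.cumsum(log_returns))

dates = pd.bdate_range('2020-01-02', periods=n_days)
pd.DataFrame({'Date': dates, 'TICKER': prices}).head()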

df_prices = pd.read_csv("ppp_stock_prices_synthetic.csv")
print('Total number of days for the stock time series: ', df_prices.shape[0])
print('Total number of stocks: ', df_prices.shape[1])
df_prices.head()

The following code cell converts the prices into percentage returns:

  • It defines a helper function that converts prices to returns with the pandas pct_change function and drops the first row, which is NaN.

  • It removes any stock column that contains only NaN values.

  • It writes the resulting returns to a CSV file for later use.

def convert_price_to_return(df_prices):
    ticker_list = list(df_prices.columns[1:])                         # all columns except Date
    df_returns = df_prices[ticker_list].pct_change()                  # not using fill_method='ffill'
    df_returns = pd.concat([df_prices.Date, df_returns], axis=1)[1:]  # drop first row as it is NaN
    df_returns = df_returns.reset_index(drop=True)
    return df_returns

df_returns = convert_price_to_return(df_prices)
df_returns.dropna(axis=1, how='all', inplace=True)                    # drop columns that contain only NaN values
df_returns.set_index('Date', inplace=True)
print('Total number of stocks: ', len(df_returns.columns))            # all remaining columns are tickers
df_returns.head()
df_returns.to_csv('ppp_returns.csv', index=True)

Step 4: Merge Text and Tabular Datasets

The stock returns and the SEC forms were saved to CSV files in the earlier code blocks. In this step, you’ll learn how to read the files back in and merge the text data with the tabular data.

  • Line up the returns from day -5 before the filing date to day +5 after the filing date. Including the return on the filing date itself, we get 11 days of returns around the filing date. Three types of returns are considered here:

    - Ret: the stock return

    - MktRet: the S&P 500 return

    - NetRet: the difference between Ret and MktRet

  • Merge the SEC text data and the tabular data with the build_tabText API, so that we can see how returns evolve around the filing date.

%%time

df_returns = pd.read_csv("ppp_returns.csv")    # Tabular/numeric data
df_sec = pd.read_csv("ppp_10kq_8k_data.csv")   # Text data

Define helper functions that create the 3 types of returns for the 5 days before and the 5 days after the filing date. The functions fill in the returns for each ticker and the corresponding S&P 500 returns.

%%time

def fillReturn(df_returns, ticker, dt, displacement):
    if np.where(df_returns.columns == ticker)[0].size > 0:
        bwd = list(df_returns[ticker].loc[:dt][-(displacement+1):]) # 5 days before filing plus the filing date
        fwd = list(df_returns[ticker].loc[dt:][1:(displacement+1)]) # 5 days after filing
        if len(bwd) < displacement+1:
            bwd = [np.nan]*(displacement+1-len(bwd)) + bwd          # pad with NaN at the start if fewer bwd days
        if len(fwd) < displacement:
            fwd = fwd + [np.nan]*(displacement-len(fwd))            # pad with NaN at the end if fewer fwd days
        return bwd+fwd
    else:
        return [np.nan for idx in range(2*displacement+1)]          # ticker is missing from the returns data

def create_df_5_days_return(df_returns):
    displace = 5
    cols = ['Date', 'ticker', 'Ret-5', 'Ret-4', 'Ret-3', 'Ret-2', 'Ret-1', 'Ret0', 'Ret1', 'Ret2', 'Ret3', 'Ret4', 'Ret5',
            'MktRet-5', 'MktRet-4', 'MktRet-3', 'MktRet-2', 'MktRet-1', 'MktRet0', 'MktRet1', 'MktRet2', 'MktRet3', 'MktRet4', 'MktRet5',
            'NetRet-5', 'NetRet-4', 'NetRet-3', 'NetRet-2', 'NetRet-1', 'NetRet0', 'NetRet1', 'NetRet2', 'NetRet3', 'NetRet4', 'NetRet5']
    df_indexed = df_returns.set_index('Date')    # index by date so the .loc date slicing in fillReturn works
    df_trans_dict = {}
    idx = 0
    for ticker in df_indexed.columns:
        for dt in df_indexed.index:
            rets = fillReturn(df_indexed, ticker, dt, displace)
            mkt_rets = fillReturn(df_indexed, '^GSPC', dt, displace)
            net_rets = [a-b for a, b in zip(rets, mkt_rets)]
            row_data = [dt, ticker] + rets + mkt_rets + net_rets
            df_trans_dict[idx] = row_data
            idx += 1
    df_returns_trans = pd.DataFrame.from_dict(df_trans_dict, orient='index', columns=cols)
    return df_returns_trans

df_returns_trans = create_df_5_days_return(df_returns)
pd.set_option('display.max_columns', 50)
df_returns_trans.head(5)

Create a TabText dataframe

The following code cell calls the smjsindustry.build_tabText function to create a multimodal TabText dataframe, merging the tabular data and the text data together. Each dataframe needs a date column and a shared join column ('ticker' in this case) to generate a time-series TabText dataset.

%%time
# Use build_tabText API to merge text and tabular datasets
from smjsindustry import build_tabText

tab_text = build_tabText(
        df_sec,
        "ticker",
        "filing_date",
        df_returns_trans,
        "ticker",
        "Date",
        freq='D'
    )
tab_text.head()

Write the merged dataframe into a CSV file

tab_text.to_csv("ppp_10kq_8k_stock_data.csv", index=False)

Step 5: Machine Learning Analysis

Some of these companies subsequently returned the money. Returning the money signals an improvement in business conditions, with a subsequent uptick in stock prices. Thus, predicting which firms would return the money based on their SEC filings is an exercise of interest.

The following code cells prepare the dataset for the ML study:

  • They flag all filings of the companies that returned the PPP money with a 1 and the others with a 0. An ML model fit to these labels therefore teases out whether the text of companies that retained the PPP money is distinguishable from the text of companies that returned it.

The resultant dataframe from the previous steps is stored as a CSV file titled ppp_model_TabText.csv (354 MB). This file contains both text and numerical columns of data.

tab_text = pd.read_csv('ppp_10kq_8k_stock_data.csv')

ppp_tickers_returned = pd.read_excel('ppp_tickers_returned.xlsx', index_col = None, sheet_name=0)
print("Number of PPP Returned tickers =", ppp_tickers_returned.shape[0])
ticker_list_returned = list(set(ppp_tickers_returned.ticker))
tab_text['returned'] = [1 if j in ticker_list_returned else 0 for j in tab_text['ticker']]
tab_text
tab_text = tab_text.drop(['Date'], axis=1)
tab_text.to_csv('ppp_model_TabText.csv', index=False)

You can start by examining the mean return in the 5 days before the filing (-5, 0) and the 5 days after the filing (0, +5) to see how the firms that returned the money fared compared to those that did not. You’ll see how the mean excess return (over the S&P 500 return) is calculated for each of the two groups.

df = pd.read_csv("ppp_model_TabText.csv")
print(df.shape)
print(df.columns)

Next, the following cell curates the TabText dataframe by creating a cumulative (net of market) return for the 5 days before the filing (df["First5"]) and the 5 days after the filing (df["Second5"]). The resulting feature columns appear in the dataframe output of the cell.

# Add up the returns for days (-5,0), denoted "First5", and days (0,5), denoted "Second5".
# Note that each window actually spans 6 days of returns, because both include day 0 (NetRet0).
df["First5"] = df["NetRet-5"] + df["NetRet-4"] + df["NetRet-3"] + df["NetRet-2"] + df["NetRet-1"] + df["NetRet0"]
df["Second5"] = df["NetRet5"] + df["NetRet4"] + df["NetRet3"] + df["NetRet2"] + df["NetRet1"] + df["NetRet0"]
df.head()
res = df.groupby(['returned']).count()['ticker']
print(res)
# Baseline accuracy = share of the majority class (always predicting "not returned")
print("Baseline accuracy =", res[0]/sum(res))
df.groupby(['returned']).mean()[["First5","Second5"]]

From the output of the preceding cell, the mean return for the "First5" window is slightly worse for the returned=0 group, and the mean return for the "Second5" window is higher for the returned=1 group. Perhaps the firms that returned the money were signaling to the market that they were in good shape, and the market rewarded them with a stock price bounce.

Step 6: Machine Learning on the TabText Dataframe

In this notebook, an AutoGluon model is used with the SageMaker MXNet training framework to analyze how the stock returns for the 5 days leading up to the filing (numerical data) and the 10-K, 10-Q, and 8-K filings (text) relate to each company’s decision to accept or return the money.

Train an AutoGluon Model for Classification

Here, you’ll see how easy it is to undertake seamless machine learning on multimodal (TabText) data. In this section, you’ll learn how to use AutoGluon, an open-source AWS library that is part of the Gluon family of tools. To learn more, see GluonNLP: NLP made easy.

In particular, we use the AutoGluon-Tabular model, which is designed for tabular data such as TabText and delivers strong performance. For more information about the model, see AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.

For a quick start, see Predicting Columns in a Table - Quick Start. To find the AutoGluon-Tabular model in AWS Marketplace, see AutoGluon-Tabular.

The AutoGluon-Tabular model processes the data and trains a diverse ensemble of ML models to create a “predictor” that can predict the "returned" label in this data. This example uses both the return data and the text data to build the model.

Create a sample dataset

For demonstration purposes, take a sample from the original dataset to reduce the time for training.

sample_df = pd.concat([df[df["returned"]==1].sample(n=500), df[df["returned"]==0].sample(n=500)]).sample(frac=1)

Save the dataframe into a CSV file.

sample_df.to_csv('ppp_model_sample_input.csv', index=False)

Prepare the SageMaker Training Environment

The following cells download the installation packages and create a lib folder and a requirements.txt file to store the AutoGluon-related dependencies. These dependencies will be installed in the training containers. For more information, see Use third-party libraries in the Amazon SageMaker Python SDK documentation.

Download AutoGluon installation packages

autogluon_bucket = f"s3://{notebook_artifact_bucket}/{notebook_autogluon_prefix}"
!aws s3 sync $autogluon_bucket ./
!mkdir model-training/lib
!tar -zxvf autogluon.tar.gz -C model-training/lib --strip-components=1 --no-same-owner

Save paths for dependency requirements

!cd model-training/lib && ls > ../requirements.txt
!cd model-training && sed -i -e 's#^#lib/#' requirements.txt

Split the sample dataset into a training dataset and a test dataset

from sklearn.model_selection import train_test_split

sample_df_ag = sample_df[["First5","text","returned"]]
train_data, test_data = train_test_split(sample_df_ag, test_size=0.2, random_state=123)
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()

train_data.to_csv("train_data.csv", index=False)
test_data.to_csv("test_data.csv", index=False)

train_s3_path = session.upload_data('train_data.csv', bucket=bucket, key_prefix='ppp_model/data')
test_s3_path = session.upload_data('test_data.csv', bucket=bucket, key_prefix='ppp_model/data')
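
For comparison, the same predictor could also be fit locally with AutoGluon’s TabularPredictor API instead of a SageMaker training job. The following is a minimal sketch, assuming the autogluon package downloaded above is installed in the notebook kernel; the results are illustrative.

# Hypothetical local alternative to the SageMaker training job that follows.
# Assumes autogluon is importable in this kernel.
from autogluon.tabular import TabularPredictor

local_predictor = TabularPredictor(label='returned').fit(train_data)
print(local_predictor.evaluate(test_data))     # accuracy, F1, MCC, and other metrics
local_predictor.leaderboard(test_data)         # per-model scores on the test set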

Run a SageMaker training job

The training job takes around 20 minutes with the sample dataset. If you want to train a model with your own data, you might need to update the training script train.py in the model-training folder. If you want to use a GPU instance to achieve better accuracy, replace train_instance_type with the desired GPU instance type, and uncomment fit_args and hyperparameters to pass the related arguments to the training script as hyperparameters.

from sagemaker.mxnet import MXNet

# Define required label and additional parameters for Autogluon TabularPredictor
init_args = {
  'label': 'returned'
}

# Define parameters for Autogluon TabularPredictor fit method
#fit_args = {
#  'ag_args_fit': {'num_gpus': 1}
#}

hyperparameters = {'init_args': str(init_args)}
# hyperparameters = {'init_args': str(init_args), 'fit_args': str(fit_args)}

tags = [{'Key' : 'AlgorithmName', 'Value' : 'AutoGluon-Tabular'},
        {'Key' : 'ProjectName', 'Value' : 'Jumpstart-Industry'},]

estimator = MXNet(
    entry_point="train.py",
    role=sagemaker.get_execution_role(),
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge",
    framework_version="1.8.0",
    py_version="py37",
    source_dir="model-training",
    base_job_name='jumpstart-industry-example-ppp',
    hyperparameters=hyperparameters,
    tags=tags,
    disable_profiler=True,
    debugger_hook_config=False,
    enable_network_isolation=True,  # run the training job in a secure, network-isolated environment
)

inputs = {'training': train_s3_path, 'testing': test_s3_path}

estimator.fit(inputs)

Download Model Outputs

Download the following files (training job artifacts) from the SageMaker session’s default S3 bucket:

  • leaderboard.csv

  • predictions.csv

  • feature_importance.csv

  • evaluation.json

import boto3

s3_client = boto3.client("s3")
job_name = estimator._current_job_name
s3_client.download_file(bucket, f"{job_name}/output/output.tar.gz", "output.tar.gz")
!tar -xvzf output.tar.gz

The result of the training evaluation

import json

with open('evaluation.json') as f:
    data = json.load(f)
print(data)

The evaluation.json file reports all the usual metrics as well as the Matthews correlation coefficient (MCC), which is a more comprehensive metric for an unbalanced dataset. It ranges from -1 to +1, where -1 implies perfect misclassification and +1 implies perfect classification.
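
For reference, the MCC is computed from the four confusion-matrix counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))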

Reference: Davide Chicco & Giuseppe Jurman (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, Article 6.

Note: Various metrics are discussed in Receiver operating characteristic in Wikipedia, the free encyclopedia.

Clean Up

After you are done using this notebook, delete the model artifacts and other resources to avoid incurring any charges.

Caution: You need to manually delete resources that you may have created while running the notebook, such as Amazon S3 buckets for model artifacts, training datasets, processing artifacts, and Amazon CloudWatch log groups.

For more information about cleaning up resources, see Clean Up in the Amazon SageMaker Developer Guide.

Further Support

The SEC filings retrieval API operations introduced at the beginning of this example notebook can also download and parse other SEC forms, such as 495, 497, 497K, S-3ASR, and N-1A. If you need further support for any other types of finance documents, reach out to the SageMaker JumpStart team through AWS Support or the AWS Developer Forums for Amazon SageMaker.


License

The SageMaker JumpStart Industry product and its related materials are under the Legal License Terms.