Important
This page is for preview purposes only, to show the content of the Amazon SageMaker JumpStart Industry example notebooks.
Note
The SageMaker JumpStart Industry example notebooks are hosted and runnable only through SageMaker Studio. Log in to the SageMaker console, and launch SageMaker Studio. For instructions on how to access the notebooks, see SageMaker JumpStart and SageMaker JumpStart Industry in the Amazon SageMaker Developer Guide.
Important
The example notebooks are for demonstrative purposes only. The notebooks are not financial advice and should not be relied on as financial or investment advice.
Machine Learning on a TabText Dataframe
An Example Based on the Paycheck Protection Program
The Paycheck Protection Program (PPP) was created by the U.S. government to enable employers struggling with COVID-related business adversities to make payments to their employees. For more information, see the Paycheck Protection Program. In this example notebook, you’ll learn how to run a machine learning model on a sample of companies in the program over the first two quarters of 2020.
In this notebook, we take U.S. Securities and Exchange Commission (SEC) filing data from some of the companies that took loans under this program. We demonstrate how to merge the SEC filing data (text data) with stock price data (tabular data) using the SageMaker JumpStart Industry Python SDK. The build_tabText API of the library merges text dataframes with numeric dataframes to create a multimodal dataframe for machine learning.
The PPP was created to help companies that were experiencing financial hardship and might otherwise have been unable to make payroll. To the degree that companies not experiencing financial constraints took PPP money, they may have seen a decrease in share value and returned the money to recoup that value.
We are interested in seeing if an ML model is able to detect whether the text of filings of companies that returned PPP money is different from that of companies that retained PPP money.
General Steps
This notebook takes the following steps:
1. Read in over 400 tickers of companies that took the PPP loans.
2. Read in the 10-K, 10-Q, and 8-K filings for all PPP tickers during Q1 and Q2 of 2020. Texts in the SEC filings are loaded using the smjsindustry.DataLoader class.
3. Load synthetic time series data of daily stock prices for the given tickers during Q1 and Q2 of 2020, and convert the prices to returns. The simulated data is generated to be correlated with the appropriate labels so that the exercise is meaningful; an analogous exercise with true data yields similar results.
4. Merge the text and tabular datasets using the smjsindustry.build_tabText API.
5. Conduct machine learning analysis to obtain a baseline accuracy.
6. Build an AutoGluon model to analyze how stock prices and texts in the SEC filings are related to each company's decision to accept or return the money. This notebook flags all filings of companies that returned the money with a 1 and the filings of companies that did not with a 0. A good fit to the data implies the model can separate the companies into two categories, those that returned PPP funding versus those that did not, based on the text.
Objective
The goal of this notebook is to fit an ML model to data on companies that received funding from the Paycheck Protection Program (PPP) and to study how stock prices, returns, and text from the SEC forms relate to their decisions to return the money.
The PPP program is reported in each company's 8-K, an SEC filing that is required when a public company experiences a material change in business conditions. In addition to the 8-K filings, the 10-K and 10-Q filings, which present a comprehensive summary of a company's financial performance, are also used as sources of input for this study. The stock data is synthetically generated to be correlated with the labels. You can repeat this exercise with actual data as needed.
SageMaker Studio Kernel Setup
The recommended kernel is Python 3 (Data Science). Do NOT use the Python 3 (SageMaker JumpStart Data Science 1.0) kernel, because there are some differences in its preinstalled dependencies. For the instance type, a larger instance with sufficient memory is helpful for downloading the following materials.
Load Data, SDK, and Dependencies
The following code cells download the smjsindustry SDK (https://pypi.org/project/smjsindustry/), dependencies, and the dataset from an S3 bucket prepared by SageMaker JumpStart Industry. You will learn how to use the smjsindustry SDK, which contains various APIs for curating SEC datasets. The dataset in this example was synthetically generated using the smjsindustry package's SEC Forms Retrieval tool. For more information, see the SageMaker JumpStart Industry Python SDK documentation.
# S3 bucket and prefixes prepared by SageMaker JumpStart Industry
notebook_artifact_bucket = 'jumpstart-cache-alpha-us-west-2'
notebook_data_prefix = 'smfinance-notebook-data/ppp'
notebook_sdk_prefix = 'smfinance-notebook-dependency/smjsindustry'
notebook_autogluon_prefix = 'smfinance-notebook-dependency/autogluon'

# Download the example dataset into the current directory
data_bucket = f's3://{notebook_artifact_bucket}/{notebook_data_prefix}'
!aws s3 sync $data_bucket ./
Install the SageMaker JumpStart Industry Python SDK and the dependencies needed for machine learning, because they are not available by default in Studio.
sdk_bucket = f's3://{notebook_artifact_bucket}/{notebook_sdk_prefix}'
!aws s3 sync $sdk_bucket ./
!pip install --no-index smjsindustry-1.0.0-py3-none-any.whl
Step 1: Read in the Tickers
Over 400 tickers are used for this study.
# %pylab pulls numpy (as np) and matplotlib into the namespace; np is used in later cells
%pylab inline
import pandas as pd
import os
ppp_tickers = pd.read_excel("ppp_tickers.xlsx", index_col = None, sheet_name=0)
print("Number of PPP tickers =", ppp_tickers.shape[0])
ticker_list = list(set(ppp_tickers.ticker))
ppp_tickers.head()
Step 2: Read in the SEC Forms Filed by These Companies
This notebook retrieves all 10-K, 10-Q, and 8-K forms filed during Q1 and Q2 of 2020 from the SEC servers. This was done using the SageMaker JumpStart Industry Python SDK's DataLoader class. For reference, the data loader process took around 30 minutes to curate a dataframe of over 4,000 filings.
There is one 10-K or 10-Q form per quarter; these are quarterly reports. There can be multiple 8-K forms per quarter, because these are filed for material changes in business conditions; depending on how many such events occur, several 8-Ks might need to be filed. As you will see, this notebook retrieves more than one form per quarter.
The dataset was stored in a CSV file named ppp_10kq_8k_data.csv (351 MB).
Important: This example notebook uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions located in the Accessing EDGAR Data page.
%%time
df_sec = pd.read_csv("ppp_10kq_8k_data.csv") # Text data
print("Number of SEC filings: ", df_sec.shape[0])
Step 3: Collect Stock Prices and Convert to Returns
Given the list of tickers, we synthetically generated stock prices using a simulation of geometric Brownian motion. The stock prices are generated to be consistent with real market data. You can buy data for commercial use if needed.
Convert the stock prices to returns. Note that some tickers might have been delisted since the time of the PPP program.
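For illustration, the following is a minimal sketch of how such a geometric Brownian motion price series could be generated. The function name and the drift and volatility parameters here are hypothetical; the actual dataset was produced separately and calibrated to be correlated with the labels.

# Hypothetical sketch, not the notebook's actual data generator: simulate daily
# prices with geometric Brownian motion, S_{t+1} = S_t * exp(mu - sigma^2/2 + sigma*Z_t)
import numpy as np
import pandas as pd

def simulate_gbm_prices(tickers, n_days, s0=100.0, mu=0.0005, sigma=0.02, seed=0):
    rng = np.random.default_rng(seed)
    # daily log-returns drawn i.i.d. normal, with the GBM drift correction
    log_rets = rng.normal(mu - 0.5 * sigma**2, sigma, size=(n_days, len(tickers)))
    prices = s0 * np.exp(np.cumsum(log_rets, axis=0))
    dates = pd.bdate_range('2020-01-02', periods=n_days)  # business days
    return pd.DataFrame(prices, index=dates, columns=tickers)

# Example: simulate_gbm_prices(['TICK1', 'TICK2'], 125).head()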
df_prices = pd.read_csv("ppp_stock_prices_synthetic.csv")
print('Total number of days for the stock time series: ', df_prices.shape[0])
print('Total number of stocks: ', df_prices.shape[1] - 1)  # exclude the Date column
df_prices.head()
The following code cell converts the prices into percentage returns, \(r_t = P_t/P_{t-1} - 1\). It does the following:
- Calls a helper function that converts the prices to returns using the pct_change function.
- Removes any stock that has only NaN values.
def convert_price_to_return(df_prices):
    ticker_list = list(df_prices.columns[1:])
    df_returns = df_prices[ticker_list].pct_change()  # not using fill_method='ffill'
    df_returns = pd.concat([df_prices.Date, df_returns], axis=1)[1:]  # drop the first row, which is NaN
    df_returns = df_returns.reset_index(drop=True)
    return df_returns

df_returns = convert_price_to_return(df_prices)
df_returns.dropna(axis=1, how='all', inplace=True)  # drop columns that contain only NaN values
df_returns.set_index('Date', inplace=True)
print('Total number of stocks: ', len(df_returns.columns))  # Date is now the index, so all columns are tickers
df_returns.head()
df_returns.to_csv('ppp_returns.csv', index=True)
Step 4: Merge Text and Tabular Datasets
The stock returns and the SEC forms were saved to CSV files in earlier code blocks. In this step, you'll learn how to read in the files and merge the text data with the tabular data.
Line up the returns from day -5 before the filing date to day +5 after the filing date. Including the return on the filing date itself, we get 11 days of returns around the filing date. Three types of returns are considered here:
- Ret: stock return
- MktRet: S&P 500 return
- NetRet: difference between Ret and MktRet
Merge the SEC text data and the tabular data with the build_tabText API. We need to see how the returns evolve around the filing date.
%%time
df_returns = pd.read_csv("ppp_returns.csv") # Tabular/numeric data
df_sec = pd.read_csv("ppp_10kq_8k_data.csv") # Text data
Define helper functions to create the 3 types of returns for the 5 days before and the 5 days after the filing date. The functions fill in the returns for the ticker and the corresponding S&P 500 return.
%%time
def fillReturn(df_returns, ticker, dt, displacement):
    if np.where(df_returns.columns == ticker)[0].size > 0:
        bwd = list(df_returns[ticker].loc[:dt][-(displacement+1):])  # 5 days before filing plus the filing date
        fwd = list(df_returns[ticker].loc[dt:][1:(displacement+1)])  # 5 days after filing
        if len(bwd) < displacement+1:
            bwd = [np.nan]*(displacement+1-len(bwd)) + bwd  # pad with NaN at the beginning if too few backward returns
        if len(fwd) < displacement:
            fwd = fwd + [np.nan]*(displacement-len(fwd))  # pad with NaN at the end if too few forward returns
        return bwd + fwd
    else:
        return [np.nan for idx in range(2*displacement+1)]

def create_df_5_days_return(df_returns):
    displace = 5
    cols = ['Date', 'ticker', 'Ret-5', 'Ret-4', 'Ret-3', 'Ret-2', 'Ret-1', 'Ret0', 'Ret1', 'Ret2', 'Ret3', 'Ret4', 'Ret5',
            'MktRet-5', 'MktRet-4', 'MktRet-3', 'MktRet-2', 'MktRet-1', 'MktRet0', 'MktRet1', 'MktRet2', 'MktRet3', 'MktRet4', 'MktRet5',
            'NetRet-5', 'NetRet-4', 'NetRet-3', 'NetRet-2', 'NetRet-1', 'NetRet0', 'NetRet1', 'NetRet2', 'NetRet3', 'NetRet4', 'NetRet5']
    df_dated = df_returns.set_index('Date')  # date-indexed view so the .loc date slices in fillReturn work
    df_trans_dict = {}
    idx = 0
    for ticker in df_dated.columns:
        for dt in df_returns.Date:
            rets = fillReturn(df_dated, ticker, dt, displace)
            mkt_rets = fillReturn(df_dated, '^GSPC', dt, displace)
            net_rets = [a - b for a, b in zip(rets, mkt_rets)]
            row_data = [dt, ticker] + rets + mkt_rets + net_rets
            df_trans_dict[idx] = row_data
            idx += 1
    df_returns_trans = pd.DataFrame.from_dict(df_trans_dict, orient='index', columns=cols)
    return df_returns_trans

df_returns_trans = create_df_5_days_return(df_returns)
pd.set_option('display.max_columns', 50)
df_returns_trans.head(5)
Create a TabText dataframe
The following code cell calls the smjsindustry.build_tabText API to create a multimodal TabText dataframe, merging the tabular data and the text data together. Each input dataframe needs a date column and a common identifier column ('ticker' in this case) to generate a time-series TabText dataset.
%%time
# Use the build_tabText API to merge the text and tabular datasets
from smjsindustry import build_tabText

tab_text = build_tabText(
    df_sec,            # text dataframe
    "ticker",          # identifier column of the text dataframe
    "filing_date",     # date column of the text dataframe
    df_returns_trans,  # tabular dataframe
    "ticker",          # identifier column of the tabular dataframe
    "Date",            # date column of the tabular dataframe
    freq='D'
)

tab_text.head()
Write the merged dataframe into a CSV file
tab_text.to_csv("ppp_10kq_8k_stock_data.csv", index=False)
Step 5: Machine Learning Analysis
Some of these companies subsequently returned the money. Returning the money signals an improvement in business conditions, with a subsequent uptick in stock prices. Thus, an exercise to predict which firms would return the money based on their SEC filings is of interest.
The resultant dataframe from the previous steps is stored as a CSV file titled ppp_model_TabText.csv (354 MB). This file contains both text and numerical columns of data.
tab_text = pd.read_csv('ppp_10kq_8k_stock_data.csv')
ppp_tickers_returned = pd.read_excel('ppp_tickers_returned.xlsx', index_col = None, sheet_name=0)
print("Number of PPP Returned tickers =", ppp_tickers_returned.shape[0])
ticker_list_returned = list(set(ppp_tickers_returned.ticker))
tab_text['returned'] = [1 if j in ticker_list_returned else 0 for j in tab_text['ticker']]
tab_text
tab_text = tab_text.drop(['Date'], axis=1)
tab_text.to_csv('ppp_model_TabText.csv', index=False)
You can start by examining the mean return in the 5 days before the filing (-5, 0) and the 5 days after the filing (0, +5) to see how the firms that returned the money fared compared to those that did not. You'll learn how the mean excess return (over the S&P 500 return) is calculated for the two groups.
df = pd.read_csv("ppp_model_TabText.csv")
print(df.shape)
print(df.columns)
Next, the following cell curates the TabText dataframe by creating a cumulative (net of market) return for the 5 days before the filing (df["First5"]) and the 5 days after the filing (df["Second5"]). You can also see the various feature columns in the dataframe, as shown in the following cell.
# Add up the returns for days (-5,0), denoted "First5", and days (0,5), denoted "Second5"
# Note that each sum actually covers 6 days of returns, because both include the filing-date return (NetRet0)
df["First5"] = df["NetRet-5"] + df["NetRet-4"] + df["NetRet-3"] + df["NetRet-2"] + df["NetRet-1"] + df["NetRet0"]
df["Second5"] = df["NetRet5"] + df["NetRet4"] + df["NetRet3"] + df["NetRet2"] + df["NetRet1"] + df["NetRet0"]
df.head()
res = df.groupby(['returned']).count()['ticker']
print(res)
# Baseline accuracy = the majority-class share (always predicting returned=0)
print("Baseline accuracy =", res[0]/sum(res))
df.groupby(['returned']).mean()[["First5","Second5"]]
From the output of the preceding cell, the mean return for the "First5" set is slightly worse for the "returned=0" case, and the mean return for the "Second5" set is higher for the "returned=1" case. Maybe firms that returned the money were signaling to the market that they were in good shape, and the market rewarded them with a stock price bounce.
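As a quick sanity check on whether this difference in group means is more than noise, you could run Welch's two-sample t-test on the post-filing returns. The following is a hypothetical sketch, not part of the original analysis; it assumes scipy is available in the kernel.

# Hypothetical significance check, not in the original notebook: compare the
# post-filing cumulative net returns ("Second5") between the two groups
from scipy import stats

returned = df.loc[df['returned'] == 1, 'Second5'].dropna()
kept = df.loc[df['returned'] == 0, 'Second5'].dropna()
t_stat, p_value = stats.ttest_ind(returned, kept, equal_var=False)  # Welch's t-test
print(f't = {t_stat:.3f}, p = {p_value:.4f}')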
Step 6: Machine Learning on the TabText Dataframe
In this notebook, an AutoGluon model is used with the SageMaker MXNet training framework to analyze how the leading 5-day stock returns (numerical data) and the 10-K, 10-Q, and 8-K filings (text) are related to each company's decision to accept or return the money.
Train an AutoGluon Model for Classification
Here, you'll see how easy it is to run seamless ML on multimodal data (TabText). In this section, you'll learn how to use one of the open-source AWS libraries known as AutoGluon, which is part of the GluonNLP family of tools. To learn more, see GluonNLP: NLP made easy.
In particular, we use the AutoGluon-Tabular model, which is designed for TabText and has superior performance. For more information about the model, see AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.
For a quick start, see Predicting Columns in a Table - Quick Start. To find the AutoGluon-Tabular model in AWS Marketplace, see AutoGluon-Tabular.
The AutoGluon-Tabular model processes the data and trains a diverse ensemble of ML models to create a "predictor" that is able to predict the "returned" label in this data. This example uses both the return data and the text data to build the model.
Create a sample dataset
For demonstration purposes, take a sample from the original dataset to reduce the time for training.
sample_df = pd.concat([df[df["returned"]==1].sample(n=500), df[df["returned"]==0].sample(n=500)]).sample(frac=1)
Save the dataframe into a CSV file.
sample_df.to_csv('ppp_model_sample_input.csv', index=False)
Prepare the SageMaker Training Environment
The following cells download installation packages and create a lib folder and a requirements.txt file to store the AutoGluon-related dependencies. These dependencies will be installed in the training containers. For more information, see Use third-party libraries in the Amazon SageMaker Python SDK documentation.
Download AutoGluon installation packages
autogluon_bucket = f"s3://{notebook_artifact_bucket}/{notebook_autogluon_prefix}"
!aws s3 sync $autogluon_bucket ./
!mkdir model-training/lib
!tar -zxvf autogluon.tar.gz -C model-training/lib --strip-components=1 --no-same-owner
Save paths for dependency requirements
!cd model-training/lib && ls > ../requirements.txt
!cd model-training && sed -i -e 's#^#lib/#' requirements.txt
Split the sample dataset into a training dataset and a test dataset
from sklearn.model_selection import train_test_split
sample_df_ag = sample_df[["First5","text","returned"]]
train_data, test_data = train_test_split(sample_df_ag, test_size=0.2, random_state=123)
import sagemaker
session = sagemaker.Session()
bucket = session.default_bucket()
train_data.to_csv("train_data.csv", index=False)
test_data.to_csv("test_data.csv", index=False)
train_s3_path = session.upload_data('train_data.csv', bucket=bucket, key_prefix='ppp_model/data')
test_s3_path = session.upload_data('test_data.csv', bucket=bucket, key_prefix='ppp_model/data')
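Before launching the SageMaker training job, it may help to see roughly what the training script does. The following is a hypothetical local sketch, not the notebook's actual train.py; it assumes the autogluon.tabular package is installed in the kernel.

# Hypothetical local sketch, not the notebook's train.py: fit AutoGluon's
# TabularPredictor on the numeric + text sample and evaluate on the held-out split
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label='returned').fit(train_data=train_data, time_limit=600)
print(predictor.evaluate(test_data))     # accuracy and related metrics
print(predictor.leaderboard(test_data))  # per-model scores within the ensemble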
Run a SageMaker training job
The training job takes around 20 minutes with the sample dataset. If you want to train a model with your own data, you might need to update the training script train.py in the model-training folder. If you want to use a GPU instance to achieve better accuracy, replace train_instance_type with the desired GPU instance type, and uncomment fit_args and hyperparameters to pass the related arguments to the training script as hyperparameters.
from sagemaker.mxnet import MXNet

# Define the required label and additional parameters for the AutoGluon TabularPredictor
init_args = {
    'label': 'returned'
}

# Define parameters for the AutoGluon TabularPredictor fit method
# fit_args = {
#     'ag_args_fit': {'num_gpus': 1}
# }

hyperparameters = {'init_args': str(init_args)}
# hyperparameters = {'init_args': str(init_args), 'fit_args': str(fit_args)}

tags = [{'Key': 'AlgorithmName', 'Value': 'AutoGluon-Tabular'},
        {'Key': 'ProjectName', 'Value': 'Jumpstart-Industry'}]

estimator = MXNet(
    entry_point="train.py",
    role=sagemaker.get_execution_role(),
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge",
    framework_version="1.8.0",
    py_version="py37",
    source_dir="model-training",
    base_job_name='jumpstart-industry-example-ppp',
    hyperparameters=hyperparameters,
    tags=tags,
    disable_profiler=True,
    debugger_hook_config=False,
    enable_network_isolation=True,  # set True to ensure a secure running environment
)
inputs = {'training': train_s3_path, 'testing': test_s3_path}
estimator.fit(inputs)
Download Model Outputs
Download the following files (training job artifacts) from the SageMaker session's default S3 bucket:
- leaderboard.csv
- predictions.csv
- feature_importance.csv
- evaluation.json
import boto3
s3_client = boto3.client("s3")
job_name = estimator._current_job_name
s3_client.download_file(bucket, f"{job_name}/output/output.tar.gz", "output.tar.gz")
!tar -xvzf output.tar.gz
The result of the training evaluation
import json

with open('evaluation.json') as f:
    data = json.load(f)
print(data)
The evaluation.json file reports all the usual metrics as well as the Matthews correlation coefficient (MCC). The MCC is a more comprehensive metric for an unbalanced dataset. It ranges from \(-1\) to \(+1\), where \(-1\) implies perfect misclassification and \(+1\) perfect classification.
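For binary classification, the MCC is computed from the confusion-matrix counts of true/false positives and negatives:

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\]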
Reference: Davide Chicco & Giuseppe Jurman (2020), The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics volume 21, Article number: 6
Note: Various metrics are discussed in Receiver operating characteristic in Wikipedia, the free encyclopedia.
Clean Up
After you are done using this notebook, delete the model artifacts and other resources to avoid incurring any charges.
Caution: You need to manually delete resources that you may have created while running the notebook, such as Amazon S3 buckets for model artifacts, training datasets, processing artifacts, and Amazon CloudWatch log groups.
For more information about cleaning up resources, see Clean Up in the Amazon SageMaker Developer Guide.
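For example, the datasets this notebook uploaded under the ppp_model/ prefix of the session's default bucket could be removed with a few lines of boto3. This is a sketch; verify the prefix before deleting anything.

# Hypothetical cleanup sketch: delete the objects this notebook uploaded to the
# session's default bucket ("bucket" was defined earlier in this notebook)
import boto3

s3 = boto3.resource('s3')
s3.Bucket(bucket).objects.filter(Prefix='ppp_model/').delete()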
Further Support
The SEC filings retrieval API operations we introduced at the beginning of this example notebook also download and parse other SEC forms, such as 495, 497, 497K, S-3ASR, and N-1A. If you need further support for any other types of finance documents, reach out to the SageMaker JumpStart team through AWS Support or AWS Developer Forums for Amazon SageMaker.
Reference
Documentation and links to the SageMaker JumpStart Industry Python SDK:
ReadTheDocs: https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/index.html
GitHub Repository: https://github.com/aws/sagemaker-jumpstart-industry-pack/
Official SageMaker Developer Guide: https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart-industry.html
License
The SageMaker JumpStart Industry product and its related materials are under the Legal License Terms.