Financial DataLoader and Parser Module APIs

class smjsindustry.finance.DataLoader(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[sagemaker.session.Session] = None, tags: Optional[List[Dict[str, str]]] = None, network_config: Optional[sagemaker.network.NetworkConfig] = None)

Bases: smjsindustry.finance.processor.FinanceProcessor

Initializes a DataLoader instance to load a dataset.

For the general processing job configuration parameters of this class, see the parameters in the FinanceProcessor class.

The following load method, called with an EDGARDataSetConfig, downloads SEC XML filings from the SEC EDGAR database and parses them into plain text files.

load(dataset_config: smjsindustry.finance.processor_config.EDGARDataSetConfig, s3_output_path: str, output_file_name: str, wait: bool = True, logs: bool = True)

Runs a processing job to load a dataset from the SEC EDGAR database.

Parameters
  • dataset_config (EDGARDataSetConfig) – The config for the DataLoader.

  • s3_output_path (str) – An S3 prefix in the format of 's3://<output bucket name>/output/path'.

  • output_file_name (str) – The output file name. The full path is 's3://<output bucket name>/output/path/output_file_name'.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job (default: True).

Raises

ValueError – if logs is True but wait is False.
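The two rules stated above (how the full output path is formed, and that logs=True requires wait=True) can be checked locally before a job is submitted. The following is a minimal sketch, not the library's own code: full_output_path and check_wait_logs are hypothetical helpers, and the bucket and file names are placeholders.

```python
def full_output_path(s3_output_path: str, output_file_name: str) -> str:
    """Join the S3 prefix and the file name as the parameter docs describe."""
    # Strip a trailing slash so the joined path has exactly one separator.
    return f"{s3_output_path.rstrip('/')}/{output_file_name}"


def check_wait_logs(wait: bool = True, logs: bool = True) -> None:
    """Raise ValueError for the documented invalid combination."""
    if logs and not wait:
        raise ValueError("logs=True requires wait=True; pass logs=False when wait=False")


# Hypothetical bucket and file name, for illustration only.
print(full_output_path("s3://my-bucket/output/path", "filings.csv"))
check_wait_logs(wait=False, logs=False)  # valid: do not wait, do not stream logs
```

Running the same check with wait=False and logs=True raises ValueError, which mirrors the behavior documented for load.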

class smjsindustry.finance.EDGARDataSetConfig(tickers_or_ciks: Optional[List[str]] = None, form_types: Optional[List[str]] = None, filing_date_start: Optional[str] = None, filing_date_end: Optional[str] = None, email_as_user_agent: Optional[str] = None)

Bases: smjsindustry.finance.processor_config.FinanceProcessorConfig

Config class for loading SEC filings from SEC EDGAR.

It specifies the details of SEC filings required by the DataLoader.

Parameters
  • tickers_or_ciks (List[str]) – A list of stock tickers or CIKs. For example, ['amzn'].

  • form_types (List[str]) – A list of SEC form types. The supported form types are 10-K, 10-Q, 8-K, 497, 497K, S-3ASR, N-1A, 485BXT, 485BPOS, 485APOS, S-3, S-3/A, DEF 14A, SC 13D, and SC 13D/A. For example, ['10-K'].

  • filing_date_start (str) – The starting filing date in the format of 'YYYY-MM-DD'. For example, '2021-01-01'.

  • filing_date_end (str) – The ending filing date in the format of 'YYYY-MM-DD'. For example, '2021-12-31'.

  • email_as_user_agent (str) – The user email used as a user_agent for SEC EDGAR HTTP requests. For example, "gecko_demo_user@amazon.com".
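The constraints listed above (the supported form types and the 'YYYY-MM-DD' date format) can be checked on the client side before a config is built. This is a sketch under stated assumptions: validate_edgar_args is a hypothetical helper, not part of smjsindustry; it assumes exact, case-sensitive form type matching; and the requirement that the start date not be later than the end date is an assumption rather than something the parameter descriptions state.

```python
from datetime import datetime
from typing import List

# Form types copied from the form_types parameter description.
SUPPORTED_FORM_TYPES = frozenset({
    "10-K", "10-Q", "8-K", "497", "497K", "S-3ASR", "N-1A", "485BXT",
    "485BPOS", "485APOS", "S-3", "S-3/A", "DEF 14A", "SC 13D", "SC 13D/A",
})


def validate_edgar_args(
    form_types: List[str], filing_date_start: str, filing_date_end: str
) -> None:
    """Hypothetical client-side check; raises ValueError on bad input."""
    unsupported = sorted(set(form_types) - SUPPORTED_FORM_TYPES)
    if unsupported:
        raise ValueError(f"Unsupported form types: {unsupported}")
    # strptime raises ValueError when a string is not a valid year-month-day date.
    start = datetime.strptime(filing_date_start, "%Y-%m-%d")
    end = datetime.strptime(filing_date_end, "%Y-%m-%d")
    # Assumption: an inverted date range is treated as a mistake.
    if start > end:
        raise ValueError("filing_date_start must not be later than filing_date_end")


# Values taken from the examples in the parameter descriptions.
validate_edgar_args(["10-K"], "2021-01-01", "2021-12-31")
```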

get_config()

Returns config to be passed to a SageMaker JumpStart Industry DataLoader instance.
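Putting the two classes together, a call sequence might look like the sketch below. It uses only the constructor and load parameters documented in this section. The import path follows the class headings here, and the instance type, bucket, file name, and email address are placeholder assumptions. According to its signature, load accepts the EDGARDataSetConfig object itself, so the sketch does not call get_config() directly. Running it requires the smjsindustry package, AWS credentials, and a valid IAM role.

```python
def load_edgar_filings(role: str, bucket: str) -> None:
    """Build an EDGARDataSetConfig and pass it to DataLoader.load()."""
    # Imported lazily so this function can be defined without the SDK installed.
    from smjsindustry.finance import DataLoader, EDGARDataSetConfig

    dataset_config = EDGARDataSetConfig(
        tickers_or_ciks=["amzn"],
        form_types=["10-K"],
        filing_date_start="2021-01-01",
        filing_date_end="2021-12-31",
        email_as_user_agent="user@example.com",  # placeholder address
    )

    loader = DataLoader(
        role=role,                      # IAM role ARN supplied by the caller
        instance_count=1,
        instance_type="ml.c5.2xlarge",  # assumed instance type
    )

    # Blocks until the processing job finishes and streams its logs.
    loader.load(
        dataset_config,
        s3_output_path=f"s3://{bucket}/output/path",
        output_file_name="filings.csv",  # hypothetical file name
        wait=True,
        logs=True,
    )
```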

property tickers_or_ciks

Gets the string of the tickers_or_ciks parameter.

property form_types

Gets the string of the form_types parameter.

property filing_date_start

Gets the string of the filing_date_start parameter.

property filing_date_end

Gets the string of the filing_date_end parameter.

property email_as_user_agent

Gets the string of the email_as_user_agent parameter.

class smjsindustry.finance.SECXMLFilingParser(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[sagemaker.session.Session] = None, tags: Optional[List[Dict[str, str]]] = None, network_config: Optional[sagemaker.network.NetworkConfig] = None)

Bases: smjsindustry.finance.processor.FinanceProcessor

Initializes a SECXMLFilingParser instance that parses SEC XML filings.

For the general processing job configuration parameters of this class, see the parameters in the FinanceProcessor class.

The following parse method parses user-downloaded SEC XML filings into plain text files.

parse(input_data_path: str, s3_output_path: str, wait: bool = True, logs: bool = True)

Runs a processing job to parse SEC XML filings.

Parameters
  • input_data_path (str) – The input path pointing to the directory that contains the SEC XML filings to be parsed. It can be a local folder or an S3 path.

  • s3_output_path (str) – An S3 prefix in the format of 's3://<output bucket name>/output/path'.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job (default: True).

Raises

ValueError – if logs is True but wait is False.
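Because input_data_path may be either a local folder or an S3 path, it can help to tell the two apart before submitting a job. The sketch below is illustrative only: is_s3_path and parse_filings are hypothetical helpers, the instance type and bucket are placeholder assumptions, and the import path follows the class heading in this section. Calling parse_filings requires the smjsindustry package, AWS credentials, and a valid IAM role.

```python
from urllib.parse import urlparse


def is_s3_path(path: str) -> bool:
    """Return True for 's3://bucket/...' URIs and False for anything else."""
    parsed = urlparse(path)
    return parsed.scheme == "s3" and bool(parsed.netloc)


def parse_filings(role: str, input_data_path: str, bucket: str) -> None:
    """Run SECXMLFilingParser.parse() on a local folder or an S3 path."""
    # Imported lazily so this function can be defined without the SDK installed.
    from smjsindustry.finance import SECXMLFilingParser

    parser = SECXMLFilingParser(
        role=role,
        instance_count=1,
        instance_type="ml.c5.2xlarge",  # assumed instance type
    )
    parser.parse(
        input_data_path,
        s3_output_path=f"s3://{bucket}/output/path",
        wait=True,
        logs=True,
    )


# Hypothetical paths, for illustration only.
for candidate in ("s3://my-bucket/sec/xml", "./sec_xml"):
    kind = "S3 path" if is_s3_path(candidate) else "local folder"
    print(f"{candidate}: {kind}")
```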