Text Summarizer Module APIs

class smjsindustry.finance.processor.FinanceProcessor(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[Session] = None, tags: Optional[List[Dict[str, str]]] = None, base_job_name: Optional[str] = None, network_config: Optional[NetworkConfig] = None)

Bases: Processor

Handles SageMaker JumpStart Industry processing tasks.

This base class is for handling SageMaker JumpStart Industry processing tasks. See its subclasses, such as Summarizer and NLPScorer, for concrete examples of FinanceProcessors that perform specific computation tasks.

Parameters

role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
instance_count (int) – The number of instances with which to run a processing job.
instance_type (str) – The type of Amazon EC2 instance to use for processing. For example, 'ml.c4.xlarge'.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – An AWS KMS key for the processing volume (default: None).
output_kms_key (str) – The AWS KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
sagemaker_session (Session) – A SageMaker Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
tags (List[Dict[str, str]]) – List of tags to be passed to the processing job (default: None). To learn more, see Tag in the Amazon SageMaker API Reference.
base_job_name (str) – A prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and the current timestamp.
network_config (NetworkConfig) – A SageMaker NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

run(**kwargs): Overrides the base class method.

class smjsindustry.Summarizer(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[Session] = None, tags: Optional[List[Dict[str, str]]] = None, network_config: Optional[NetworkConfig] = None)

Bases: FinanceProcessor

Initializes a Summarizer instance that summarizes text.

For the general processing job configuration parameters of this class, see the parameters in the FinanceProcessor class.

The Summarizer condenses text while preserving its key information content and overall meaning. Summarization can be performed using either the Jaccard algorithm or the k-medoids algorithm. See the summarize method for details regarding the specific algorithms used.

summarize(summarizer_config: Union[JaccardSummarizerConfig, KMedoidsSummarizerConfig], text_column_name: str, input_file_path: str, s3_output_path: str, output_file_name: str, new_summary_column_name: str = 'summary', wait: bool = True, logs: bool = True)

Runs a processing job to generate a Jaccard or k-medoids summary.

The summaries generated by the Jaccard algorithm give the main theme of the document by extracting the sentences with the greatest similarity to all other sentences. Similarity is measured using the Jaccard coefficient, which, for a pair of sentences, is the number of words they have in common divided by the size of the union of the words in the two sentences.

The k-medoids algorithm clusters sentences and outputs the medoid of each cluster as the summary.

Parameters

summarizer_config (Union[JaccardSummarizerConfig, KMedoidsSummarizerConfig]) – The config for the JaccardSummarizer or KMedoidsSummarizer.
text_column_name (str) – The name of the column containing the text to be summarized.
input_file_path (str) – The input file path pointing to the input dataframe containing the text to be summarized. It can be a local file or an S3 path.
s3_output_path (str) – An S3 prefix in the format of 's3://<output bucket name>/output/path'.
output_file_name (str) – The output file name. The full path is 's3://<output bucket name>/output/path/output_file_name'.
new_summary_column_name (str) – The column name for the summary in the given dataframe (default: "summary").
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job (default: True).

Raises

ValueError – if logs is True but wait is False.
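The call sequence described above can be sketched as follows. This is a minimal, hedged example rather than an official recipe: the helper names build_summarize_kwargs and run_jaccard_summary, the column name "text", the file names "input.csv" and "summaries.csv", and the bucket and prefix are all hypothetical placeholders. Only the class names, method name, and parameter names come from the signatures documented here. The client-side check on wait and logs mirrors the documented ValueError condition so that the mistake is caught before a job is submitted.

```python
def build_summarize_kwargs(bucket, prefix, output_file_name, wait=True, logs=True):
    """Assemble keyword arguments for Summarizer.summarize (hypothetical helper)."""
    # Mirror the documented constraint: logs can only be streamed while waiting.
    if logs and not wait:
        raise ValueError("logs=True requires wait=True")
    return {
        "text_column_name": "text",          # assumed column name in the input data
        "input_file_path": "input.csv",      # assumed local input file
        "s3_output_path": f"s3://{bucket}/{prefix.strip('/')}",
        "output_file_name": output_file_name,
        "new_summary_column_name": "summary",
        "wait": wait,
        "logs": logs,
    }


def run_jaccard_summary(role, bucket):
    """Submit a Jaccard summarization job (needs smjsindustry and AWS credentials)."""
    # Imported lazily so the helper above can be used without the package installed.
    from smjsindustry import JaccardSummarizerConfig, Summarizer

    config = JaccardSummarizerConfig(summary_size=10)
    summarizer = Summarizer(role=role, instance_count=1, instance_type="ml.c4.xlarge")
    summarizer.summarize(
        summarizer_config=config,
        **build_summarize_kwargs(bucket, "output/path", "summaries.csv"),
    )
```

With these placeholder values, the job output would be written to 's3://<bucket>/output/path/summaries.csv'. For a non-blocking call, pass wait=False together with logs=False.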

class smjsindustry.JaccardSummarizerConfig(summary_size: int = 0, summary_percentage: float = 0.0, max_tokens: int = 0, cutoff: float = 0.0, vocabulary: Optional[Set[str]] = None)

Bases: FinanceProcessorConfig

The configuration class for JaccardSummarizer.

The aim of the JaccardSummarizer is to extract the main thematic sentences of the document. The JaccardSummarizer is a traditional summarizer that scores the sentences in a document using similarities. The sentences with higher similarities to other sentences in the document are ranked higher. The top-scoring sentences are selected as the summary of the document.

More specifically, the similarity is calculated in terms of Jaccard Similarity. The Jaccard Similarity of two sentences A and B is the ratio of the size of the intersection of the tokens in A and B to the size of the union of the tokens in A and B.
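The scoring idea can be illustrated with a short, self-contained sketch. It is not the library's implementation: the tokenizer (lowercased alphanumeric runs), the scoring rule (sum of similarities to all other sentences), and the function names tokenize, jaccard, and rank_sentences are assumptions made for illustration only.

```python
import re


def tokenize(sentence):
    """Split a sentence into a set of lowercase alphanumeric tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard Similarity: |tokens(a) & tokens(b)| / |tokens(a) | tokens(b)|."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_sentences(sentences, summary_size):
    """Return the summary_size sentences most similar to the rest, in document order."""
    scored = []
    for i, sentence in enumerate(sentences):
        # Score each sentence by its total similarity to every other sentence.
        total = sum(jaccard(sentence, other)
                    for j, other in enumerate(sentences) if j != i)
        scored.append((total, i))
    # Highest score first; ties are broken by position in the document.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:summary_size]
    return [sentences[i] for _, i in sorted(top, key=lambda t: t[1])]


sentences = [
    "Revenue grew in the quarter",
    "Revenue grew strongly in the quarter",
    "The weather was mild",
]
summary = rank_sentences(sentences, summary_size=1)
```

In this toy input the first two sentences share most of their tokens, so one of them is selected, while the unrelated third sentence scores lowest.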

The JaccardSummarizer is based on extraction-based summarization. The extractive method is more practical because the summaries it creates are more grammatically correct and semantically relevant to the document. Abstraction-based summarization is avoided because it may alter the legal meaning of texts from SEC filings and legal financial texts that have strict meanings; small changes in the structure of a sentence may alter the legal meaning of the text. Extractive summarization also works for very long documents that cannot be easily processed with abstractive summarization.

To use the JaccardSummarizer algorithm, pass an instance of this configuration class as the summarizer_config argument of the summarize method of a Summarizer instance.

Parameters

summary_size (int) – The maximum number of sentences in the summary (default: 0).
summary_percentage (float) – An upper bound on the summary length, expressed relative to the original text: the number of sentences in the summary should not exceed summary_percentage of the sentences in the original text (default: 0.0).
max_tokens (int) – The max number of tokens in the summary (default: 0).
cutoff (float) – The similarity cut off (default: 0.0).
vocabulary (Set[str]) – A set of sentiment words (default: None).

get_config() → Dict[str, Union[str, int, float, Set[str]]]: Returns the config to be passed to a SageMaker JumpStart Industry Summarizer instance.

property summary_size: int: Gets the value of the summary_size parameter.

property summary_percentage: float: Gets the value of the summary_percentage parameter.

property max_tokens: int: Gets the value of the max_tokens parameter.

property cutoff: float: Gets the value of the cutoff parameter.

property vocabulary: Set[str]: Gets the value of the vocabulary parameter.

class smjsindustry.KMedoidsSummarizerConfig(summary_size: int, vector_size: int = 100, min_count: int = 0, epochs: int = 60, metric: str = 'euclidean', init: str = 'heuristic')

Bases: FinanceProcessorConfig

Configuration class for KMedoidsSummarizer.

The KMedoidsSummarizer is an extractive summarizer and uses the k-medoids based approach.

First, it creates sentence embeddings using Gensim’s Doc2Vec. Second, it performs k-medoids clustering on the sentence vectors. Note that k-medoids is used instead of k-means clustering. Whereas k-means minimizes the total squared error from a central position in each cluster (the centroid), k-medoids minimizes the sum of dissimilarities between the vectors in a cluster and one of those vectors designated as the representative of that cluster; the representative vectors are called medoids. The sentences in the document that correspond to the cluster medoids are returned as the summary. The goal of this summarizer is different from that of the JaccardSummarizer. The KMedoidsSummarizer picks up peripheral sentences, not just the main theme of the document, in case there are items of importance that are buried in sentences different from the main theme.
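The clustering step can be sketched in plain Python. This is an illustrative toy, not the library's implementation: it skips the Doc2Vec embedding and uses hand-written 2-D vectors, uses Euclidean distance only, initializes the medoids naively with the first k points (the documented init options are not reproduced here), and uses a simple alternating assign-and-update loop. The function names euclidean and k_medoids are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_medoids(vectors, k, max_iter=100):
    """Return the sorted indices of k medoids found by alternating assignment and update."""
    medoids = list(range(k))  # naive initialization: the first k points (assumption)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, v in enumerate(vectors):
            if i in clusters:  # a medoid always stays in its own cluster
                clusters[i].append(i)
                continue
            nearest = min(medoids, key=lambda m: euclidean(v, vectors[m]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members as the new medoid.
        new_medoids = [
            min(members,
                key=lambda c: sum(euclidean(vectors[c], vectors[o]) for o in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break  # converged: medoids did not change
        medoids = new_medoids
    return sorted(medoids)


# Toy "sentence vectors": two well-separated groups of three points each.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
medoid_ids = k_medoids(vectors, k=2)
summary = [sentences[i] for i in medoid_ids]
```

Because each medoid is an actual data point, it maps directly back to a sentence in the document, which is what makes the approach extractive. A centroid from k-means would generally not correspond to any real sentence.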

The KMedoidsSummarizer is based on extraction-based summarization. The extractive method is more practical because the summaries it creates are more grammatically correct and semantically relevant to the document. Abstraction-based summarization is avoided because it may alter the legal meaning of texts from SEC filings and legal financial texts that have strict meanings; small changes in the structure of a sentence may alter the legal meaning of the text. Extractive summarization also works for very long documents that cannot be easily processed with abstractive summarization.

To use the KMedoidsSummarizer algorithm, pass an instance of this configuration class as the summarizer_config argument of the summarize method of a Summarizer instance.

Parameters

summary_size (int) – Required. The number of sentences to be extracted.
vector_size (int) – The embedding dimensions (default: 100).
min_count (int) – The minimal word occurrences to be included (default: 0).
epochs (int) – The number of training epochs (default: 60).
metric (str) – The distance metric to use. Possible values are 'euclidean', 'cosine', 'dot-product' (default: 'euclidean').
init (str) – The medoid initialization method. Possible values are 'random', 'heuristic', 'k-medoids++', 'build' (default: 'heuristic').
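As a hedged illustration of the parameters above, the following sketch checks the metric and init values against the documented options before constructing the config. The helper names kmedoids_config_kwargs and make_kmedoids_config are hypothetical, and whether the library performs the same validation itself is not stated here; the check is only a client-side convenience.

```python
ALLOWED_METRICS = {"euclidean", "cosine", "dot-product"}
ALLOWED_INITS = {"random", "heuristic", "k-medoids++", "build"}


def kmedoids_config_kwargs(summary_size, vector_size=100, min_count=0,
                           epochs=60, metric="euclidean", init="heuristic"):
    """Validate and collect KMedoidsSummarizerConfig arguments (hypothetical helper)."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"metric must be one of {sorted(ALLOWED_METRICS)}")
    if init not in ALLOWED_INITS:
        raise ValueError(f"init must be one of {sorted(ALLOWED_INITS)}")
    return {
        "summary_size": summary_size,  # required: number of sentences to extract
        "vector_size": vector_size,
        "min_count": min_count,
        "epochs": epochs,
        "metric": metric,
        "init": init,
    }


def make_kmedoids_config(summary_size, **overrides):
    """Build the config object (needs the smjsindustry package installed)."""
    # Imported lazily so the validation helper works without the package.
    from smjsindustry import KMedoidsSummarizerConfig

    return KMedoidsSummarizerConfig(**kmedoids_config_kwargs(summary_size, **overrides))
```

For example, make_kmedoids_config(5, metric="cosine") would request a five-sentence summary using cosine distance, with the remaining parameters left at their documented defaults.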

get_config() → Dict[str, Union[int, str]]: Returns the config to be passed to a SageMaker JumpStart Industry Summarizer instance.

property summary_size: int: Gets the value of the summary_size parameter.

property vector_size: int: Gets the value of the vector_size parameter.

property min_count: int: Gets the value of the min_count parameter.

property epochs: int: Gets the value of the epochs parameter.

property metric: str: Gets the value of the metric parameter.

property init: str: Gets the value of the init parameter.