Text Summarizer Module APIs
- class smjsindustry.finance.processor.FinanceProcessor(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[Session] = None, tags: Optional[List[Dict[str, str]]] = None, base_job_name: Optional[str] = None, network_config: Optional[NetworkConfig] = None)
Bases: Processor
Handles SageMaker JumpStart Industry processing tasks.
This base class is for handling SageMaker JumpStart Industry processing tasks. See its subclasses, such as Summarizer and NLPScorer, for concrete examples of FinanceProcessors that perform specific computation tasks.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
instance_count (int) – The number of instances with which to run a processing job.
instance_type (str) – The type of Amazon EC2 instance to use for processing. For example, 'ml.c4.xlarge'.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – An AWS KMS key for the processing volume (default: None).
output_kms_key (str) – The AWS KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
sagemaker_session (Session) – A SageMaker Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
tags (List[Dict[str, str]]) – List of tags to be passed to the processing job (default: None). To learn more, see Tag in the Amazon SageMaker API Reference.
base_job_name (str) – A prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and the current timestamp.
network_config (NetworkConfig) – A SageMaker NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
- run(**kwargs)
Overrides the base class method.
- class smjsindustry.Summarizer(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[Session] = None, tags: Optional[List[Dict[str, str]]] = None, network_config: Optional[NetworkConfig] = None)
Bases: FinanceProcessor
Initializes a Summarizer instance that summarizes text.
For the general processing job configuration parameters of this class, see the parameters in the FinanceProcessor class.
The Summarizer summarizes text while preserving key information content and overall meaning. Summarization can be performed using either the Jaccard algorithm or the k-medoids algorithm. See the summarize method for details regarding the specific algorithms used.
- summarize(summarizer_config: Union[JaccardSummarizerConfig, KMedoidsSummarizerConfig], text_column_name: str, input_file_path: str, s3_output_path: str, output_file_name: str, new_summary_column_name: str = 'summary', wait: bool = True, logs: bool = True)
Runs a processing job to generate a Jaccard or k-medoids summary.
The summaries generated by the Jaccard algorithm give the main theme of the document by extracting the sentences with the greatest similarity among all sentences. Similarity is measured using the Jaccard coefficient, which, for a pair of sentences, is the number of words they have in common divided by the number of distinct words in the two sentences combined (the size of their union).
The k-medoids algorithm clusters sentences and outputs the medoid of each cluster as the summary.
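As a minimal sketch of the Jaccard coefficient described above, the following computes the similarity of one pair of sentences. The whitespace tokenizer, the lowercasing, and the names tokenize and jaccard_coefficient are assumptions made for illustration; they are not part of smjsindustry, and the library's own tokenization may differ.

```python
from typing import Set


def tokenize(sentence: str) -> Set[str]:
    """Hypothetical tokenizer: lowercase the text and split on whitespace."""
    return set(sentence.lower().split())


def jaccard_coefficient(a: str, b: str) -> float:
    """Number of shared words divided by the number of distinct words overall."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # assumption: two empty sentences have similarity 0
    return len(tokens_a & tokens_b) / len(union)


s1 = "revenue increased in the fourth quarter"
s2 = "revenue decreased in the third quarter"
# 4 shared words out of 8 distinct words
print(jaccard_coefficient(s1, s2))  # 0.5
```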
- Parameters
summarizer_config (Union[JaccardSummarizerConfig, KMedoidsSummarizerConfig]) – The config for the JaccardSummarizer or KMedoidsSummarizer.
text_column_name (str) – The name of the column containing the text to be summarized.
input_file_path (str) – The input file path pointing to the input dataframe containing the text to be summarized. It can be a local file or an S3 path.
s3_output_path (str) – An S3 prefix in the format of 's3://<output bucket name>/output/path'.
output_file_name (str) – The output file name. The full path is 's3://<output bucket name>/output/path/output_file_name'.
new_summary_column_name (str) – The column name for the summary in the given dataframe (default: "summary").
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job (default: True).
- Raises
ValueError – if logs is True but wait is False.
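The argument rules listed for summarize can be sketched as a small pre-flight check. This is only an illustration: check_summarize_args is a hypothetical helper, not a function in smjsindustry, the 's3://' prefix test is an added assumption rather than documented behavior, and the bucket name is a placeholder.

```python
def check_summarize_args(
    s3_output_path: str,
    output_file_name: str,
    wait: bool = True,
    logs: bool = True,
) -> str:
    """Hypothetical check that returns the full output location."""
    if logs and not wait:
        # Mirrors the documented condition: logs=True needs wait=True.
        raise ValueError("logs can only be True when wait is True")
    if not s3_output_path.startswith("s3://"):
        # Assumption: reject anything that is not an S3 prefix.
        raise ValueError("s3_output_path must look like 's3://<bucket>/output/path'")
    # Full path = S3 prefix + "/" + output file name.
    return f"{s3_output_path.rstrip('/')}/{output_file_name}"


print(check_summarize_args("s3://my-bucket/output/path", "summary.csv"))
# s3://my-bucket/output/path/summary.csv
```

Setting wait=False therefore also requires logs=False, since logs defaults to True.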
- class smjsindustry.JaccardSummarizerConfig(summary_size: int = 0, summary_percentage: float = 0.0, max_tokens: int = 0, cutoff: float = 0.0, vocabulary: Optional[Set[str]] = None)
Bases: FinanceProcessorConfig
The configuration class for JaccardSummarizer.
The aim of the JaccardSummarizer is to extract the main thematic sentences of the document. The JaccardSummarizer is a traditional summarizer that scores the sentences in a document using similarities. The sentences with higher similarities to other sentences in the document are ranked higher. The top scoring sentences are selected as the summary of the document.
More specifically, the similarity is calculated in terms of Jaccard Similarity. The Jaccard Similarity of two sentences A and B is the ratio of the size of the intersection of tokens in A and B to the size of the union of tokens in A and B.
The JaccardSummarizer is based on extraction-based summarization. The extractive method is more practical because the summaries it creates are more grammatically correct and semantically relevant to the document. Abstraction-based summarization is avoided because it may alter the legal meaning of texts from SEC filings and legal financial texts that have strict meanings; small changes in the structure of a sentence may alter the legal meaning of the text. Extractive summarization also works for very long documents that cannot be easily processed with abstractive summarization.
To use the JaccardSummarizer algorithm, pass an instance of this configuration class as the summarizer_config parameter of the Summarizer instance's summarize method.
- Parameters
summary_size (int) – The maximum number of sentences in the summary (default: 0).
summary_percentage (float) – The number of sentences in the summary should not exceed summary_percentage of the sentences in the original text (default: 0.0).
max_tokens (int) – The maximum number of tokens in the summary (default: 0).
cutoff (float) – The similarity cutoff (default: 0.0).
vocabulary (Set[str]) – A set of sentiment words (default: None).
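The ranking idea behind the JaccardSummarizer (score each sentence by its similarity to the others, then keep the top scorers) can be sketched as follows. This is not the library's implementation: the whitespace tokenizer, the summed-similarity score, the tie-breaking rule, and returning sentences in document order are all assumptions, and the sketch models only summary_size, not summary_percentage, max_tokens, cutoff, vocabulary, or the meaning of the 0 defaults.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_sentences(sentences: List[str], summary_size: int) -> List[str]:
    """Return up to summary_size top-scoring sentences, in document order."""
    tokens = [set(s.lower().split()) for s in sentences]
    # Score each sentence by its summed similarity to every other sentence.
    scores = [
        sum(jaccard(t, other) for j, other in enumerate(tokens) if j != i)
        for i, t in enumerate(tokens)
    ]
    # Highest score first; on ties the earlier sentence wins.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    chosen = sorted(ranked[:summary_size])
    return [sentences[i] for i in chosen]


doc = [
    "net income rose on higher sales",
    "higher sales lifted net income",
    "the board met in march",
]
print(top_sentences(doc, summary_size=1))
# ['net income rose on higher sales']
```

The first two sentences share four words and tie on score, so the earlier one is kept; the third shares no words with either and scores zero.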
- get_config() Dict[str, Union[str, int, float, Set[str]]]
Returns the config to be passed to a SageMaker JumpStart Industry Summarizer instance.
- property summary_size: int
Gets the value of the summary_size parameter.
- property summary_percentage: float
Gets the value of the summary_percentage parameter.
- property max_tokens: int
Gets the value of the max_tokens parameter.
- property cutoff: float
Gets the value of the cutoff parameter.
- property vocabulary: Set[str]
Gets the value of the vocabulary parameter.
- class smjsindustry.KMedoidsSummarizerConfig(summary_size: int, vector_size: int = 100, min_count: int = 0, epochs: int = 60, metric: str = 'euclidean', init: str = 'heuristic')
Bases: FinanceProcessorConfig
Configuration class for KMedoidsSummarizer.
The KMedoidsSummarizer is an extractive summarizer and uses the k-medoids based approach.
First, it creates sentence embeddings using Gensim’s Doc2Vec. Second, k-medoids clustering is performed on the sentence vectors. Note that we use k-medoids instead of k-means clustering. Whereas k-means minimizes the total squared error from a central position in each cluster (centroid), k-medoids minimizes the sum of dissimilarities between vectors in a cluster and one of the vectors designated as the representative of that cluster; the representative vectors are called medoids. The m sentences in the document corresponding to the cluster medoids are returned as the summary.
The goal of this summarizer is different from the JaccardSummarizer. The KMedoidsSummarizer picks up peripheral sentences, not just the main theme of the document, in case there are items of importance that are buried in sentences different from the main theme.
The KMedoidsSummarizer is based on extraction-based summarization. The extractive method is more practical because the summaries it creates are more grammatically correct and semantically relevant to the document. Abstraction-based summarization is avoided because it may alter the legal meaning of texts from SEC filings and legal financial texts that have strict meanings; small changes in the structure of a sentence may alter the legal meaning of the text. Extractive summarization also works for very long documents that cannot be easily processed with abstractive summarization.
To use the KMedoidsSummarizer algorithm, pass an instance of this configuration class as the summarizer_config parameter of the Summarizer instance's summarize method.
- Parameters
summary_size (int) – Required. The number of sentences to be extracted.
vector_size (int) – The embedding dimensions (default: 100).
min_count (int) – The minimum number of occurrences for a word to be included (default: 0).
epochs (int) – The number of training epochs (default: 60).
metric (str) – The distance metric to use. Possible values are 'euclidean', 'cosine', 'dot-product' (default: 'euclidean').
init (str) – The medoid initialization method. Possible values are 'random', 'heuristic', 'k-medoids++', 'build' (default: 'heuristic').
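The k-medoids step described for the KMedoidsSummarizer can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the library's implementation: it uses only Euclidean distance, a naive initialization (the first k points) instead of any of the documented init options, a simple alternating assign-and-update loop, and toy 2-D vectors standing in for Doc2Vec sentence embeddings. The function name k_medoids is hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def k_medoids(points: List[Vector], k: int, max_iter: int = 100) -> List[int]:
    """Return the sorted indices of k medoids found by alternating updates."""
    medoids = list(range(k))  # assumption: naive init with the first k points
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters: Dict[int, List[int]] = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep the medoid of an empty cluster
                continue
            best = min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break  # medoids stopped changing
        medoids = new_medoids
    return sorted(medoids)


# Two well-separated groups of toy "sentence vectors".
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(k_medoids(vectors, k=2))  # [0, 3]
```

Unlike a k-means centroid, each returned medoid is the index of an actual input vector, which is what lets a summarizer map it straight back to a sentence in the document.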
- get_config() Dict[str, Union[int, str]]
Returns the config to be passed to a SageMaker JumpStart Industry Summarizer instance.
- property summary_size: int
Gets the value of the summary_size parameter.
- property vector_size: int
Gets the value of the vector_size parameter.
- property min_count: int
Gets the value of the min_count parameter.
- property epochs: int
Gets the value of the epochs parameter.
- property metric: str
Gets the value of the metric parameter.
- property init: str
Gets the value of the init parameter.