NLP Scorer Module APIs

class smjsindustry.NLPScorer(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[sagemaker.session.Session] = None, tags: Optional[List[Dict[str, str]]] = None, network_config: Optional[sagemaker.network.NetworkConfig] = None)

Bases: smjsindustry.finance.processor.FinanceProcessor

Calculates NLP scores for text using default or user-provided word lists.

Text that contains many words and phrases from the provided word lists receives high scores, while unrelated text scores lower.

The NLP scores report the percentage of words in a document that match a list of words, which is called a lexicon. Matching is performed after stemming both the document and the lexicon. NLP scoring of sentiment is based on the VADER sentiment lexicon. NLP scoring of readability is based on the Gunning-Fog index.
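To make the lexicon-matching idea concrete, the following is a minimal sketch of scoring text against a word list after stemming. It is not the library's implementation: `naive_stem` and `lexicon_score` are hypothetical helpers, the suffix-stripping stemmer is a crude stand-in for whatever stemmer the library actually uses, and the score is returned as a fraction between 0 and 1 rather than a percentage.

```python
import re
from typing import List


def naive_stem(word: str) -> str:
    """Strip a few common English suffixes (illustrative stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexicon_score(text: str, lexicon: List[str]) -> float:
    """Return the fraction of tokens in `text` whose stem is in the stemmed lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    stemmed_lexicon = {naive_stem(w.lower()) for w in lexicon}
    matches = sum(1 for t in tokens if naive_stem(t) in stemmed_lexicon)
    return matches / len(tokens)


# "risks" and "worsened" match the lexicon after stemming: 2 of 6 tokens.
score = lexicon_score("Risks increased and the outlook worsened", ["risk", "worsen"])
print(round(score, 3))
```

Stemming both sides is what lets "risks" match "risk"; without it, only exact word forms in the lexicon would count.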

For the general processing job configuration parameters of this class, see the parameters in the FinanceProcessor class.

calculate(score_config: smjsindustry.finance.processor_config.NLPScorerConfig, text_column_name: str, input_file_path: str, s3_output_path: str, output_file_name: str, wait: bool = True, logs: bool = True)

Runs a processing job to generate NLP scores for input text.

Parameters
  • score_config (NLPScorerConfig) – The config for the NLP scorer.

  • text_column_name (str) – The name of the column containing the text to be scored.

  • input_file_path (str) – The path to the input dataframe containing the text to be scored. It can be a local path or an S3 path.

  • s3_output_path (str) – An S3 prefix in the format of 's3://<output bucket name>/output/path'.

  • output_file_name (str) – The output file name. The full path is 's3://<output bucket name>/output/path/output_file_name'.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job (default: True).

Raises

ValueError – if logs is True but wait is False.
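The two rules stated above (how the full output path is formed, and when ValueError is raised) can be sketched in plain Python. This is an illustration only: `full_output_path` and `check_wait_and_logs` are hypothetical helpers, not part of smjsindustry, and the bucket and file names are placeholders.

```python
def full_output_path(s3_output_path: str, output_file_name: str) -> str:
    """Join the S3 prefix and the file name as described for output_file_name."""
    return f"{s3_output_path.rstrip('/')}/{output_file_name}"


def check_wait_and_logs(wait: bool = True, logs: bool = True) -> None:
    """Mirror the documented rule: logs cannot be shown unless the call waits."""
    if logs and not wait:
        raise ValueError("logs=True requires wait=True")


path = full_output_path("s3://my-bucket/output/path", "scores.csv")
check_wait_and_logs(wait=True, logs=True)    # allowed
check_wait_and_logs(wait=False, logs=False)  # allowed: run in the background, no logs
```

In practice this means a non-blocking call needs both `wait=False` and `logs=False`, since both parameters default to True.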

class smjsindustry.NLPScorerConfig(nlp_score_types: List[smjsindustry.finance.nlp_score_type.NLPScoreType])

Bases: smjsindustry.finance.processor_config.FinanceProcessorConfig

Config class for NLPScorer.

The NLP scores report the percentage of words in a document that match a list of words, which is called a lexicon. Matching is performed after stemming both the document and the lexicon. NLP scoring of sentiment is based on the VADER sentiment lexicon. NLP scoring of readability is based on the Gunning-Fog index.
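For reference, the Gunning-Fog index is commonly defined as 0.4 × (average words per sentence + 100 × share of complex words), where a complex word has three or more syllables. The sketch below applies that formula with a crude vowel-group syllable counter; `count_syllables` and `gunning_fog` are hypothetical helpers, and the library's own tokenization and syllable counting may well differ.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as the number of vowel groups (a rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text: str) -> float:
    """Compute 0.4 * (words per sentence + 100 * complex words / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    # Treat words with three or more (approximate) syllables as complex.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))


fog = gunning_fog("The cat sat. The company reported extraordinary results.")
print(round(fog, 1))
```

Higher values indicate text that is harder to read, so longer sentences and more multi-syllable words both push the index up.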

Use this configuration class to specify the word lists and their corresponding names that will be used when performing NLP scoring on a document.

Parameters

nlp_score_types (List[NLPScoreType]) – The score types that will be used for NLP scoring.

get_config() Dict[str, Union[str, Dict[str, List[str]]]]

Returns the config to be passed to a SageMaker JumpStart Industry NLPScorer instance.

class smjsindustry.NLPScoreType(score_name: str, word_list: List[str])

Bases: object

Initializes an NLPScoreType instance.

It wraps score names and their corresponding word lists used for NLP scoring.

It provides an organized standard for passing required data to an NLPScorerConfig and defines several constants, such as POSITIVE and READABILITY, which can be used to perform NLP scoring using SageMaker JumpStart Industry’s internal word lists.

A single NLPScoreType or a list of NLPScoreTypes is required when initializing an NLPScorerConfig. Passing the data required by the NLPScorerConfig via NLPScoreTypes ensures that any potential errors which could affect the creation of the config are caught at the earliest possible stage.

To create an NLPScoreType using SageMaker JumpStart Industry’s internal word lists, use an NLPScoreType constant (such as NLPScoreType.POSITIVE) for the score_name argument, and either [] or None for the word_list argument.

Parameters
  • score_name (str) –

    A name that describes the overall topic represented by the words in the word_list argument. For example, if the word_list argument is ["promising", "prodigy", "talented", "adept"], the score_name argument could be "talent".

    SageMaker JumpStart Industry has internal word lists corresponding to the following score_name values: NLPScoreType.POSITIVE, NLPScoreType.NEGATIVE, NLPScoreType.POLARITY, NLPScoreType.CERTAINTY, NLPScoreType.UNCERTAINTY, NLPScoreType.FRAUD, NLPScoreType.LITIGIOUS, NLPScoreType.RISK, NLPScoreType.SAFE, NLPScoreType.READABILITY, NLPScoreType.SENTIMENT.

  • word_list (List[str]) –

    A list of words corresponding to the topic indicated by score_name.

    The following score_name values require the word_list argument to be None (the remaining built-in score names require word_list to be []): NLPScoreType.POLARITY, NLPScoreType.READABILITY, NLPScoreType.SENTIMENT.
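The word_list rule for built-in score names can be summarized as a small lookup. The sketch below uses plain strings that mirror the constant values listed in this section; `expected_word_list` is a hypothetical helper written for illustration and is not part of smjsindustry.

```python
from typing import List, Optional

# String values mirroring the NLPScoreType constants documented in this section.
BUILT_IN_SCORE_NAMES = [
    "positive", "negative", "certainty", "uncertainty", "risk", "safe",
    "litigious", "fraud", "sentiment", "polarity", "readability",
]
# Score names that are not lexicon-based and therefore take word_list=None.
NO_WORD_LIST = ["sentiment", "polarity", "readability"]


def expected_word_list(score_name: str) -> Optional[List[str]]:
    """Return the word_list value documented for a built-in score name."""
    if score_name in NO_WORD_LIST:
        return None
    if score_name in BUILT_IN_SCORE_NAMES:
        return []
    # Custom score names need a user-supplied, non-empty word list instead.
    raise ValueError(f"{score_name!r} is not a built-in score name")


print(expected_word_list("readability"))  # None
print(expected_word_list("positive"))     # []
```

A custom score such as "talent" falls outside both lists, which is why it must be created with an explicit word list.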

POSITIVE = 'positive'
NEGATIVE = 'negative'
CERTAINTY = 'certainty'
UNCERTAINTY = 'uncertainty'
RISK = 'risk'
SAFE = 'safe'
LITIGIOUS = 'litigious'
FRAUD = 'fraud'
SENTIMENT = 'sentiment'
POLARITY = 'polarity'
READABILITY = 'readability'
DEFAULT_SCORE_TYPES = ['positive', 'negative', 'certainty', 'uncertainty', 'risk', 'safe', 'litigious', 'fraud', 'sentiment', 'polarity', 'readability']
property score_name: str

Gets the string of the score_name argument.

property word_list: List[str]

Gets the list of words passed as the word_list argument.

smjsindustry.NLPSCORE_NO_WORD_LIST

alias of ['sentiment', 'polarity', 'readability']