NLP Scorer Module APIs
- class smjsindustry.NLPScorer(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[Session] = None, tags: Optional[List[Dict[str, str]]] = None, network_config: Optional[NetworkConfig] = None)
Bases: FinanceProcessor
Calculates NLP scores for text using default or user-provided word lists.
Text that contains many words and phrases that are related to the provided word lists will receive high scores while text that is unrelated will score lower.
The NLP scores report the percentage of words in a document that match a list of words, which is called a lexicon. Matching is performed after stemming both the document and the lexicon. NLP scoring of sentiment is based on the Vader sentiment lexicon. NLP scoring of readability is based on the Gunning-Fog index.
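For reference, the Gunning-Fog index is conventionally defined as below. This is the textbook formula, given as an aid to reading the readability score. How the library tokenizes sentences or decides which words count as complex is not stated here, so treat the symbol names as illustrative.

```latex
\text{Fog} = 0.4 \left( \frac{N_{\text{words}}}{N_{\text{sentences}}}
           + 100 \cdot \frac{N_{\text{complex}}}{N_{\text{words}}} \right)
% N_complex: number of words with three or more syllables
```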
For the general processing job configuration parameters of this class, see the parameters in the FinanceProcessor class.
- calculate(score_config: NLPScorerConfig, text_column_name: str, input_file_path: str, s3_output_path: str, output_file_name: str, wait: bool = True, logs: bool = True)
Runs a processing job to generate NLP scores for input text.
- Parameters
score_config (NLPScorerConfig) – The config for the NLP scorer.
text_column_name (str) – The name of the column containing the text to be scored.
input_file_path (str) – The path to the input dataframe containing the text to be scored. It can be a local path or an S3 path.
s3_output_path (str) – An S3 prefix in the format 's3://<output bucket name>/output/path'.
output_file_name (str) – The output file name. The full path is 's3://<output bucket name>/output/path/output_file_name'.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job (default: True).
- Raises
ValueError – if logs is True but wait is False.
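A call to calculate might look like the following sketch. The constructor and method arguments follow the signatures documented above. The instance type, the column name "text", and the helper names (full_output_uri, check_wait_and_logs, run_scoring_job) are assumptions made for illustration and are not part of the library. The SDK import is deferred so that the two small helpers can be used without it installed.

```python
def full_output_uri(s3_output_path: str, output_file_name: str) -> str:
    """Join the S3 prefix and the file name as the parameter docs describe."""
    return f"{s3_output_path.rstrip('/')}/{output_file_name}"


def check_wait_and_logs(wait: bool, logs: bool) -> None:
    """Pre-flight check mirroring the documented rule: logs needs wait."""
    if logs and not wait:
        raise ValueError("logs=True requires wait=True")


def run_scoring_job(role, score_config, input_file_path,
                    s3_output_path, output_file_name,
                    wait=True, logs=True):
    """Hypothetical wrapper: run an NLP scoring job, return the output URI."""
    # Imported lazily; requires the smjsindustry package and AWS credentials.
    from smjsindustry import NLPScorer

    check_wait_and_logs(wait, logs)
    scorer = NLPScorer(
        role=role,
        instance_count=1,
        instance_type="ml.c5.2xlarge",  # placeholder; choose for your workload
    )
    scorer.calculate(
        score_config=score_config,
        text_column_name="text",  # assumed column name in the input dataframe
        input_file_path=input_file_path,
        s3_output_path=s3_output_path,
        output_file_name=output_file_name,
        wait=wait,
        logs=logs,
    )
    return full_output_uri(s3_output_path, output_file_name)
```

Validating the wait and logs combination before creating the scorer surfaces the documented ValueError condition without starting a processing job.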
- class smjsindustry.NLPScorerConfig(nlp_score_types: List[NLPScoreType])
Bases: FinanceProcessorConfig
Config class for NLPScorer.
The NLP scores report the percentage of words in a document that match a list of words, which is called a lexicon. Matching is performed after stemming both the document and the lexicon. NLP scoring of sentiment is based on the Vader sentiment lexicon. NLP scoring of readability is based on the Gunning-Fog index.
Use this configuration class to specify the word lists and their corresponding names that will be used when performing NLP scoring on a document.
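The lexicon matching described above can be pictured with a small sketch. It is not the library's implementation: naive_stem is a crude stand-in for a real stemmer, the tokenizer is a simple regular expression, and the score is returned as a fraction between 0 and 1 (multiply by 100 for a percentage). All function names here are hypothetical.

```python
import re
from typing import List


def naive_stem(word: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexicon_score(text: str, lexicon: List[str]) -> float:
    """Fraction of tokens whose stem appears in the stemmed lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    lexicon_stems = {naive_stem(word.lower()) for word in lexicon}
    hits = sum(1 for token in tokens if naive_stem(token) in lexicon_stems)
    return hits / len(tokens)


# Two of the five tokens ("talented", "adapted") stem to lexicon entries.
score = lexicon_score("The talented team adapted quickly", ["talent", "adapt"])
```

Stemming both sides means that inflected forms such as "talented" still count as matches for the lexicon entry "talent".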
- Parameters
nlp_score_types (List[NLPScoreType]) – The score types that will be used for NLP scoring.
- get_config() → Dict[str, Union[str, Dict[str, List[str]]]]
Returns the config to be passed to a SageMaker JumpStart Industry NLPScorer instance.
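As a sketch of how a config might be assembled, the following pairs score names with word lists and then builds an NLPScorerConfig from them. The "talent" word list is the example used in the NLPScoreType parameter description, and "positive" is the value of NLPScoreType.POSITIVE. The helper names and the ScoreSpec alias are hypothetical, and the exact contents of the dictionary returned by get_config are not shown because they are not specified here.

```python
from typing import List, Optional, Tuple

# (score_name, word_list) pairs; a plain-Python stand-in used for illustration.
ScoreSpec = Tuple[str, Optional[List[str]]]


def example_score_specs() -> List[ScoreSpec]:
    """Return one built-in and one custom score specification."""
    return [
        ("positive", []),  # built-in lexicon, empty list as the placeholder
        ("talent", ["promising", "prodigy", "talented", "adept"]),  # custom
    ]


def build_config(specs: List[ScoreSpec]):
    """Turn the specs into an NLPScorerConfig and its config dictionary."""
    # Imported lazily so example_score_specs runs without the SDK installed.
    from smjsindustry import NLPScorerConfig, NLPScoreType

    score_types = [NLPScoreType(name, words) for name, words in specs]
    config = NLPScorerConfig(score_types)
    return config, config.get_config()
```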
- class smjsindustry.NLPScoreType(score_name: str, word_list: List[str])
Bases: object
Initializes an NLPScoreType instance.
It wraps score names and their corresponding word lists used for NLP scoring.
It provides an organized standard for passing required data to an NLPScorerConfig and defines several constants, such as POSITIVE and READABILITY, which can be used to perform NLP scoring using SageMaker JumpStart Industry’s internal word lists.
A single NLPScoreType or a list of NLPScoreTypes is required when initializing an NLPScorerConfig. Passing the data required by the NLPScorerConfig via NLPScoreTypes ensures that any potential errors which could affect the creation of the config are caught at the earliest possible stage.
To create an NLPScoreType using SageMaker JumpStart Industry’s internal word lists, use an NLPScoreType constant (such as NLPScoreType.POSITIVE) for the score_name argument, and either [] or None for the word_list argument.
- Parameters
score_name (str) – A name that describes the overall topic represented by the words in the word_list argument. For example, if the word_list argument is ["promising", "prodigy", "talented", "adept"], the score_name argument could be "talent".
SageMaker JumpStart Industry has internal word lists corresponding to the following score_name values: NLPScoreType.POSITIVE, NLPScoreType.NEGATIVE, NLPScoreType.POLARITY, NLPScoreType.CERTAINTY, NLPScoreType.UNCERTAINTY, NLPScoreType.FRAUD, NLPScoreType.LITIGIOUS, NLPScoreType.RISK, NLPScoreType.SAFE, NLPScoreType.READABILITY, NLPScoreType.SENTIMENT.
word_list (List[str]) – A list of words corresponding to the topic indicated by score_name.
The following score_name values require the word_list argument to be None (the remaining built-in score names require word_list to be []): NLPScoreType.POLARITY, NLPScoreType.READABILITY, NLPScoreType.SENTIMENT.
- POSITIVE = 'positive'
- NEGATIVE = 'negative'
- CERTAINTY = 'certainty'
- UNCERTAINTY = 'uncertainty'
- RISK = 'risk'
- SAFE = 'safe'
- LITIGIOUS = 'litigious'
- FRAUD = 'fraud'
- SENTIMENT = 'sentiment'
- POLARITY = 'polarity'
- READABILITY = 'readability'
- DEFAULT_SCORE_TYPES = ['positive', 'negative', 'certainty', 'uncertainty', 'risk', 'safe', 'litigious', 'fraud', 'sentiment', 'polarity', 'readability']
- property score_name: str
Gets the string of the score_name argument.
- property word_list: List[str]
Gets the list of strings passed as the word_list argument.
- smjsindustry.NLPSCORE_NO_WORD_LIST
alias of ['sentiment', 'polarity', 'readability']
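The word_list rule for built-in score names (None for the names in NLPSCORE_NO_WORD_LIST, [] for the rest) can be captured in a small helper. This is a sketch only: the two lists are copied from the constants documented above, and builtin_word_list_placeholder is a hypothetical name, not a function provided by the library.

```python
from typing import List, Optional

# Values copied from the documented DEFAULT_SCORE_TYPES and NLPSCORE_NO_WORD_LIST.
DEFAULT_SCORE_TYPES = [
    "positive", "negative", "certainty", "uncertainty", "risk", "safe",
    "litigious", "fraud", "sentiment", "polarity", "readability",
]
NLPSCORE_NO_WORD_LIST = ["sentiment", "polarity", "readability"]


def builtin_word_list_placeholder(score_name: str) -> Optional[List[str]]:
    """Return the word_list value to pass for a built-in score name."""
    if score_name not in DEFAULT_SCORE_TYPES:
        raise ValueError(f"{score_name!r} is not a built-in score name")
    # Scores that do not use a word list take None; the others take [].
    return None if score_name in NLPSCORE_NO_WORD_LIST else []
```

A custom score name such as "talent" is rejected by this helper because it needs a real word list rather than a placeholder.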