NLP Scorer Module APIs
- class smjsindustry.NLPScorer(role: str, instance_count: int, instance_type: str, volume_size_in_gb: int = 30, volume_kms_key: Optional[str] = None, output_kms_key: Optional[str] = None, max_runtime_in_seconds: Optional[int] = None, sagemaker_session: Optional[Session] = None, tags: Optional[List[Dict[str, str]]] = None, network_config: Optional[NetworkConfig] = None)
Bases: FinanceProcessor
Calculates NLP scores for text using default or user-provided word lists.
Text that contains many words and phrases that are related to the provided word lists will receive high scores while text that is unrelated will score lower.
The NLP scores report the percentage of words in a document that match a list of words, which is called a lexicon. The matching is performed after stemming both the document and the lexicon. NLP scoring of sentiment is based on the Vader sentiment lexicon. NLP scoring of readability is based on the Gunning-Fog index.
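The lexicon matching described above can be illustrated with a minimal sketch. This is not the library's implementation: `crude_stem` is a hypothetical stand-in for a real stemmer, `lexicon_score` is an illustrative name, and the result is returned as a fraction between 0 and 1 rather than in whatever scale the processing job reports.

```python
import re
from typing import Iterable


def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexicon_score(text: str, lexicon: Iterable[str]) -> float:
    """Return the fraction of words in `text` whose stem is in the stemmed lexicon."""
    stemmed_lexicon = {crude_stem(w.lower()) for w in lexicon}
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    matches = sum(1 for w in words if crude_stem(w) in stemmed_lexicon)
    return matches / len(words)


talent_lexicon = ["promising", "prodigy", "talented", "adept"]
score = lexicon_score("The talented team is promising", talent_lexicon)
```

Because both sides are stemmed before comparison, inflected forms such as "promises" still match a lexicon entry such as "promising" in this sketch.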
For the general processing job configuration parameters of this class, see the parameters in the FinanceProcessor class.
- calculate(score_config: NLPScorerConfig, text_column_name: str, input_file_path: str, s3_output_path: str, output_file_name: str, wait: bool = True, logs: bool = True)
Runs a processing job to generate NLP scores for input text.
- Parameters
score_config (NLPScorerConfig) – The config for the NLP scorer.
text_column_name (str) – The name of the column containing the text to be scored.
input_file_path (str) – The path to the input dataframe containing the text to be scored. It can be a local path or an S3 path.
s3_output_path (str) – An S3 prefix in the format 's3://<output bucket name>/output/path'.
output_file_name (str) – The output file name. The full path is 's3://<output bucket name>/output/path/output_file_name'.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job (default: True).
- Raises
ValueError – if logs is True but wait is False.
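A possible end-to-end call, sketched from the signatures documented in this module. The function name `run_nlp_scoring`, the instance type, the column name `"text"`, and the output file name are placeholders chosen for illustration; the IAM role and the S3 locations must be supplied by the caller.

```python
def run_nlp_scoring(role: str, input_file_path: str, s3_output_path: str) -> None:
    """Hypothetical wrapper that builds a config and launches a scoring job."""
    # Imported inside the function so the sketch can be loaded without the SDK installed.
    from smjsindustry import NLPScorer, NLPScorerConfig, NLPScoreType

    score_types = [
        # Built-in lexicon-based score: word_list is [].
        NLPScoreType(NLPScoreType.POSITIVE, []),
        # Built-in score that takes no word list: word_list is None.
        NLPScoreType(NLPScoreType.READABILITY, None),
        # Custom score with a user-provided word list.
        NLPScoreType("talent", ["promising", "prodigy", "talented", "adept"]),
    ]
    config = NLPScorerConfig(score_types)

    scorer = NLPScorer(
        role=role,
        instance_count=1,
        instance_type="ml.c5.18xlarge",  # placeholder instance type
    )
    scorer.calculate(
        score_config=config,
        text_column_name="text",  # assumed column name in the input dataframe
        input_file_path=input_file_path,
        s3_output_path=s3_output_path,
        output_file_name="nlp_scores.csv",  # placeholder output file name
        wait=True,  # logs=True requires wait=True, otherwise ValueError is raised
        logs=True,
    )
```

Calling the function starts a processing job, so it needs valid AWS credentials and an execution role with access to the input and output locations.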
- class smjsindustry.NLPScorerConfig(nlp_score_types: List[NLPScoreType])
Bases: FinanceProcessorConfig
Config class for NLPScorer.
The NLP scores report the percentage of words in a document that match a list of words, which is called a lexicon. The matching is performed after stemming both the document and the lexicon. NLP scoring of sentiment is based on the Vader sentiment lexicon. NLP scoring of readability is based on the Gunning-Fog index.
Use this configuration class to specify the word lists and their corresponding names that will be used when performing NLP scoring on a document.
- Parameters
nlp_score_types (List[NLPScoreType]) – The score types that will be used for NLP scoring.
- get_config() → Dict[str, Union[str, Dict[str, List[str]]]]
Returns the config to be passed to a SageMaker JumpStart Industry NLPScorer instance.
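The readability score mentioned for this config is based on the Gunning-Fog index, which combines average sentence length with the share of words that have three or more syllables. The sketch below shows the standard formula only; it is not the library's implementation, and `count_syllables` is a rough vowel-group heuristic assumed here for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text: str) -> float:
    """Gunning-Fog index: 0.4 * (words per sentence + 100 * complex-word ratio)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those with three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100.0 * complex_words / len(words))


simple = gunning_fog("The cat sat. The dog ran.")
dense = gunning_fog("Readability matters.")
```

Short sentences of one-syllable words give a low index, while long words push it up quickly, which is why `dense` is much larger than `simple` here.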
- class smjsindustry.NLPScoreType(score_name: str, word_list: List[str])
Bases: object
Initializes an NLPScoreType instance.
It wraps score names and their corresponding word lists used for NLP scoring.
It provides an organized standard for passing required data to an NLPScorerConfig and defines several constants, such as POSITIVE and READABILITY, which can be used to perform NLP scoring using SageMaker JumpStart Industry’s internal word lists.
A single NLPScoreType or a list of NLPScoreTypes is required when initializing an NLPScorerConfig. Passing the data required by the NLPScorerConfig via NLPScoreTypes ensures that any potential errors which could affect the creation of the config are caught at the earliest possible stage.
To create an NLPScoreType using SageMaker JumpStart Industry’s internal word lists, use an NLPScoreType constant (such as NLPScoreType.POSITIVE) for the score_name argument, and either [] or None for the word_list argument.
- Parameters
score_name (str) – A name that describes the overall topic represented by the words in the word_list argument. For example, if the word_list argument is ["promising", "prodigy", "talented", "adept"], the score_name argument could be "talent".
SageMaker JumpStart Industry has internal word lists corresponding to the following score_name values: NLPScoreType.POSITIVE, NLPScoreType.NEGATIVE, NLPScoreType.POLARITY, NLPScoreType.CERTAINTY, NLPScoreType.UNCERTAINTY, NLPScoreType.FRAUD, NLPScoreType.LITIGIOUS, NLPScoreType.RISK, NLPScoreType.SAFE, NLPScoreType.READABILITY, NLPScoreType.SENTIMENT.
word_list (List[str]) – A list of words corresponding to the topic indicated by score_name.
The following score_name values require the word_list argument to be None (the remaining built-in score names require word_list to be []): NLPScoreType.POLARITY, NLPScoreType.READABILITY, NLPScoreType.SENTIMENT.
- POSITIVE = 'positive'
- NEGATIVE = 'negative'
- CERTAINTY = 'certainty'
- UNCERTAINTY = 'uncertainty'
- RISK = 'risk'
- SAFE = 'safe'
- LITIGIOUS = 'litigious'
- FRAUD = 'fraud'
- SENTIMENT = 'sentiment'
- POLARITY = 'polarity'
- READABILITY = 'readability'
- DEFAULT_SCORE_TYPES = ['positive', 'negative', 'certainty', 'uncertainty', 'risk', 'safe', 'litigious', 'fraud', 'sentiment', 'polarity', 'readability']
- property score_name: str
Gets the string of the score_name argument.
- property word_list: List[str]
Gets the list of words passed as the word_list argument.
- smjsindustry.NLPSCORE_NO_WORD_LIST
alias of ['sentiment', 'polarity', 'readability']
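The rule for the word_list argument of built-in score names (None for the names in NLPSCORE_NO_WORD_LIST, [] for the others) can be captured in a small helper. The function `builtin_word_list` is hypothetical and not part of smjsindustry; the two lists are copied from the constants documented above so that the sketch runs without the SDK.

```python
from typing import List, Optional

# Values copied from NLPSCORE_NO_WORD_LIST and NLPScoreType.DEFAULT_SCORE_TYPES above.
NO_WORD_LIST = ["sentiment", "polarity", "readability"]
DEFAULT_SCORE_TYPES = [
    "positive", "negative", "certainty", "uncertainty", "risk", "safe",
    "litigious", "fraud", "sentiment", "polarity", "readability",
]


def builtin_word_list(score_name: str) -> Optional[List[str]]:
    """Return the word_list value a built-in score name expects (None or [])."""
    if score_name not in DEFAULT_SCORE_TYPES:
        raise ValueError(f"{score_name!r} is not a built-in score name")
    return None if score_name in NO_WORD_LIST else []


readability_words = builtin_word_list("readability")
positive_words = builtin_word_list("positive")
```

With the SDK installed, the returned value could be passed as the second argument when constructing an NLPScoreType for a built-in score name; custom score names still need an explicit word list.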