Data exploration is an essential piece of any new ETL (extraction-transformation-load) process. Knowing the structure, type, and even the semantics of data sources can help the ETL developer or architect to make better decisions on data type and length settings, transformation logic, and error handling procedures. Additionally, there are situations in which the exploration of the data is the principal purpose of the ETL, such as text mining and other pattern analysis. In most scenarios, SQL Server Integration Services is used as a traditional ETL tool and not necessarily for data exploration or text mining. However, there are some tools built into SSIS that are ideally suited for exploring source feed to expose patterns in the data.
In this brief SSIS Basics post, I’ll cover one such tool: the SSIS term extraction component.
The SSIS Term Extraction Component
The SSIS term extraction component is a data flow transformation that will aggregate the frequency of words found in a specified column supplied by an upstream data source. This component expects a Unicode text or text stream field as an input, and calculates as an output the frequency of the nouns and/or noun phrases in the specified source column. As shown below on the data flow surface, the term extraction component (circled) is always used as a transformation, accepting exactly one input and expecting either one or two outputs (the second being an optional error output).
Configuring the term extraction component can be a relatively simple exercise. As mentioned, you’ll need to specify the upstream column that will be used as an input, and this column must be Unicode text (DT_WSTR) or stream (DT_NTEXT).
Purposefully skipping over the Exclusion tab for just a moment, we’ll review the Advanced tab. This pane exposes several of the key configuration elements of the SSIS term extraction component. In particular, we can use this page to set the output for noun and/or noun phrase, set the score type to frequency (raw count) or TFIDF (a calculation based both on the frequency of the term and the ratio of the analyzed terms to the entire document), and set the minimum score threshold and maximum term length. We can also set the case-sensitive flag, in case our analytics need to be case sensitive.
Additionally, the exclusion list (on the Exclusion tab) can help to refine your calculation results. This feature allows the ETL developer to specify a known list of terms that will be excluded from the output. This is useful in cases where industry or common language terms might occur so frequently that including them in the output simply dilutes the rest of the data.
Although the term extraction component does not have a lot of moving parts, it does provide a quick way to perform data exploration and simple text analysis.
In this post, I have briefly demonstrated the attributes of the SSIS term extraction transformation, a data flow tool that can be used for data exploration. This transformation provides a quick and easy way to perform triage analysis and mining of raw text data.