Enterprise Vault™ Classification using the Veritas Information Classifier
About policy conditions
A condition specifies the criteria that an item must meet for the Veritas Information Classifier to consider it a match. Your policies can contain any number of conditions.
All conditions have this basic form:
property operator value
For example, in the condition Content contains text Stocks, "Content" is the property, "contains text" is the operator, and "Stocks" is the value.
The property specifies the part or characteristic of an item that you want to evaluate: its content, title, modified date, file size, and so on. When you choose a property from the list, the options in the two other fields change to suit it. For example, if you choose the "Modified date" property, the other fields provide options with which you can set one or more dates. For properties such as "Content" the available operators are as follows:
contains text
matches regex
matches pattern
is similar to
contains exact data match in
language is
contains entity
sentiment score
At the right of each condition, you can specify the minimum number of times that an item must meet the criteria for the Veritas Information Classifier to consider it a match.
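The property, operator, value form and the minimum match count can be pictured as a small data structure. The following Python sketch is illustrative only: the `Condition` class, its field names, and the `matches` helper are hypothetical and are not part of the Veritas Information Classifier. It also assumes case-insensitive substring counting for the "contains text" operator.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    prop: str           # part of the item to evaluate, e.g. "Content"
    operator: str       # e.g. "contains text"
    value: str          # e.g. "Stocks"
    min_count: int = 1  # minimum number of times the criteria must be met


def matches(condition: Condition, item: dict) -> bool:
    """Evaluate a 'contains text' condition against an item.

    The item is modeled as a dict of property name -> text (an assumption).
    """
    if condition.operator != "contains text":
        raise NotImplementedError(condition.operator)
    text = item.get(condition.prop, "").lower()
    # Count non-overlapping, case-insensitive occurrences of the value.
    return text.count(condition.value.lower()) >= condition.min_count


item = {"Content": "Stocks rose. Tech stocks fell."}
print(matches(Condition("Content", "contains text", "Stocks"), item))
print(matches(Condition("Content", "contains text", "Stocks", min_count=3), item))
```

The second call fails the condition because the word occurs only twice in the sample content.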
Various applications that you use in your organization may add custom property information to the items that you want to classify. For example, when Enterprise Vault processes an item, it populates a number of the item's metadata properties with information and stores this information with the archived item: the date on which Enterprise Vault archived the item, the number of attachments that it has, and so on.
If you know the name of a property that particularly interests you, you can enter it as a custom field in your policy conditions.
See About the Enterprise Vault properties.
If a property that you need is not available in the property list when you create a policy, you can create a new property by using the custom property fields.
To create a new property, use the custom property fields while creating or editing a policy, as follows:
- Set the other fields as described in the topic Creating or editing policies.
- In the Conditions section, from the Property drop-down list, select the required custom property field: Custom date field, Custom number field, or Custom string field.
- Specify the name for the new custom property.
Note:
The custom property name must be the same as the metadata property name that the text extraction engine (for example, Apache Tika) identifies. For Veritas Enterprise Vault, the custom property name must match one of the indexing properties.
- Complete the rest of the steps to create the policy.
The new policy is created with the new custom property.
Use the Veritas Information Classifier's YAML file to add a custom property under the property list on the UI.
The metadataDefinitions section of YAML file lists all the existing properties in the property list as follows:
The following table shows the data structure for an existing property:
Property Item | Description |
---|---|
name | Specifies the metadata property as recognized by the text extraction engine, for example Apache Tika. For Veritas Enterprise Vault, specify one of the captured indexing properties. |
displayName | Name of the property as displayed in the property list on the UI, for example "Title". |
type | Associated property type, for example String, Datetime, or Number. |
aliases | Specifies the additional metadata properties to be mapped to displayName. |
To make a new property available in the UI on the policy condition page:
- Add the new property details, as described in the previous table, to the metadataDefinitions section in the YAML file.
- Restart the Veritas Information Classifier service of the respective application.
Observe the following guidelines when you set up a condition to look for specific words or phrases in the items that you submit for classification:
The condition can look for multiple words or phrases, if you place each one on a line of its own. An item needs to contain just one word or phrase in the list to meet the condition.
Select the case-sensitive matching option to find only exact matches for the uppercase and lowercase characters in the specified words or phrases.
Select the substring matching option to find instances where the specified words or phrases are contained within other ones. For example, if you select this option, the word enter matches enters, entertainment and carpenter. If you clear the option, enter matches only enter. Similarly, if you select this option, the phrase call me matches call media and recall meeting, but not surgically mend.
You can place the proximity operators NEAR and BEFORE between two words in the same line. For example, tax NEAR/10 reform matches instances where there are no more than ten words between tax and reform. sales BEFORE/5 report matches instances where sales precedes report and there are no more than five words between them. The number is mandatory in both cases.
Note:
These proximity operators may not work as expected when evaluating formatted data, such as tables and spreadsheets. The conversion process that this data undergoes before it is classified can swap the order of the table cells. For example, suppose that a spreadsheet contains the word sales in one cell and report in the cell immediately to the right. This should match the operator sales BEFORE/5 report but may not do so after the spreadsheet has been converted, because the conversion process has transposed the two words.
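The word-distance rule behind NEAR and BEFORE can be sketched in a few lines of Python. This is an illustrative approximation, not the product's implementation: the function names and the tokenization (runs of word characters, compared case-insensitively) are assumptions.

```python
import re


def _words(text):
    # Tokenization is an assumption: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def _positions(tokens, word):
    return [i for i, tok in enumerate(tokens) if tok == word.lower()]


def near(text, a, b, n):
    """a NEAR/n b: a and b occur in either order with at most n words between."""
    toks = _words(text)
    return any(0 < abs(i - j) <= n + 1
               for i in _positions(toks, a) for j in _positions(toks, b))


def before(text, a, b, n):
    """a BEFORE/n b: a precedes b with at most n words between them."""
    toks = _words(text)
    return any(0 < j - i <= n + 1
               for i in _positions(toks, a) for j in _positions(toks, b))


print(near("reform of the tax code", "tax", "reform", 10))
print(before("the report on sales", "sales", "report", 5))
```

In the second call the word report comes first, so the BEFORE test fails even though the two words are close together.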
Words and phrases can include the asterisk (*) and question mark (?) wildcard characters. As part of a word, an asterisk matches zero or more characters. On its own, the asterisk matches exactly one word. A question mark matches exactly one character. For example:
stock* matches stock, stocks, and stockings.
*ock matches stock and clock.
*ock* matches stock and clocks.
??ock matches stock and clock, but not dock.
sell * stock matches sell the stock and sell some stock, but not sell stock.
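One way to reason about these wildcard rules is to translate them into a regular expression. The sketch below is an assumption-laden illustration in Python, not the classifier's own logic: `wildcard_to_regex` is a hypothetical helper, a "word" is taken to be a run of `\w` characters, and matching is case-insensitive.

```python
import re


def wildcard_to_regex(phrase: str) -> re.Pattern:
    """Translate a word or phrase with * and ? wildcards into a regex."""
    parts = []
    for token in phrase.split():
        if token == "*":
            # A standalone asterisk stands for exactly one word.
            parts.append(r"\w+")
        else:
            # Inside a word: * is zero or more characters, ? is exactly one.
            parts.append("".join(
                r"\w*" if ch == "*" else r"\w" if ch == "?" else re.escape(ch)
                for ch in token))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


for pattern, text in [("stock*", "stockings"), ("??ock", "dock"),
                      ("sell * stock", "sell the stock"),
                      ("sell * stock", "sell stock")]:
    print(pattern, "|", text, "|", bool(wildcard_to_regex(pattern).search(text)))
```

The loop reproduces four of the examples above, including the two non-matches (??ock against dock, and sell * stock against sell stock).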
You can use wildcards in combination with the NEAR and BEFORE operators. For example:
s?l? BEFORE/1 stock* matches sold the stock, sell stocks, and sale of stockings.
A regular expression, or regex for short, is a pattern of text that consists of ordinary characters (for example, letters a through z) and special characters, called metacharacters. The pattern describes one or more strings to match when searching text. For example, the following regular expression matches the sequence of digits in all Visa card numbers:
\b4[0-9]{12}(?:[0-9]{3})?\b
Your regular expressions must conform to the Perl regular expression syntax.
See the online Help for the Veritas Information Classifier for extensive information on this syntax.
You may find it helpful to build and test your regular expressions using the free online tool at https://regex101.com. This tool displays an explanation of your regular expression as you type it, and also lists all matches between the regular expression and a test string of your choice. The default regular expression flavor, pcre (php), is compatible with the Veritas Information Classifier.
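If you prefer to test offline, a short script can exercise the same expression. The following Python sketch uses the Visa pattern shown above. Python's `re` module is not the Perl engine, but this particular pattern uses only constructs that the two share. The sample text and the `find_visa_numbers` helper are hypothetical.

```python
import re

# The Visa pattern from this topic: a 4 followed by 12 digits,
# optionally followed by 3 more digits, bounded by word boundaries.
VISA = re.compile(r"\b4[0-9]{12}(?:[0-9]{3})?\b")


def find_visa_numbers(text):
    """Return every substring of text that matches the Visa pattern."""
    return VISA.findall(text)


sample = "Cards on file: 4111111111111111, 4222222222222 and 5500000000000004."
print(find_visa_numbers(sample))
```

Only the 16-digit and 13-digit numbers that begin with 4 are returned. The number that begins with 5 is ignored.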
Note:
Looking for regular expression matches is considerably slower than looking for matches for specific words or phrases. You can greatly improve performance and accuracy by looking for instances where both types of matches occur in proximity to each other. To do this, set up an All of condition group that contains both a regular expression condition and a condition for finding specific words and phrases, and specify the required distance within which matches must occur. The Veritas Information Classifier first evaluates the words and phrases condition and only then looks for a regular expression match.
A pattern match evaluates the selected item property against an existing Veritas Information Classifier pattern. Depending on the selected pattern, you may be able to set the confidence levels that you are willing to accept. A high confidence level is likely to produce fewer but more relevant matches.
Note the following if you do not get the expected results when you test a policy that makes use of a built-in pattern:
It is important to check that your test item meets the pattern confidence levels. For example, by default, the Credit Card Policy looks for content that matches the pattern "Credit/Debit Card Number" with medium to very high confidence. To meet the requirements of the medium confidence level, an item must contain either of the following:
A delimited credit card number (one that contains spaces or dashes between the numbers).
Both a non-delimited credit card number and one or more credit card keywords, such as "AMEX" or "Visa".
So, an item does not meet these requirements if it contains a non-delimited credit card number but it does not also contain credit card keywords.
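The medium confidence requirement described above can be approximated as follows. This Python sketch is a deliberately simplified illustration and does not reproduce the built-in pattern: it assumes 16-digit numbers only, treats spaces and dashes as the only delimiters, and uses just the two keywords named in this topic. The function name is hypothetical.

```python
import re

# Four groups of four digits separated by the same delimiter (space or dash).
DELIMITED = re.compile(r"\b\d{4}([ -])\d{4}\1\d{4}\1\d{4}\b")
# Sixteen consecutive digits with no delimiters.
NON_DELIMITED = re.compile(r"\b\d{16}\b")
# Keyword list is an assumption; only these two appear in this topic.
KEYWORDS = ("amex", "visa")


def meets_medium_confidence(text: str) -> bool:
    """True if text has a delimited number, or a plain number plus a keyword."""
    if DELIMITED.search(text):
        return True
    has_keyword = any(word in text.lower() for word in KEYWORDS)
    return bool(NON_DELIMITED.search(text)) and has_keyword


print(meets_medium_confidence("Pay with 4111-1111-1111-1111"))
print(meets_medium_confidence("Pay with 4111111111111111"))
print(meets_medium_confidence("Visa 4111111111111111"))
```

The second call illustrates the case called out above: a non-delimited number without a keyword does not meet the requirement.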
After you run a test and view the results, the Test classification results window may fail to highlight some or all of the matches. This is a known issue with certain patterns only. A future version of the Veritas Information Classifier will correct the issue.
Unlike most classification techniques that rely on pattern matching to identify sensitive data, Exact Data Match (EDM) triggers a classification response when the actual data that needs to be protected is detected. Matching on the exact data reduces the rate of false positives and allows for much higher levels of accuracy in automatic classification. EDM uses a fingerprint method whereby an extract of a database or table is provided as a source file in either CSV or TXT format. The table is ingested, and rules are created that indicate a match when one or more columns of a single row are detected in proximity. EDM is ideal when you need to identify discrete customer data, employee data, or any other sensitive data repository that is maintained within a table.
To classify information using Exact Data Match
Create an EDM pattern by setting the configuration options and providing the source document (typically containing the desired fields exported from a data store, such as a database). See “To create an Exact Data Match based pattern”.
Use the resulting EDM pattern in any policy to be used for EDM based classification.
Exact Data Match can be enabled or disabled using YAML.
The Exact Data Match feature allows you to detect specific data sets from a database, for example, employee records. You can match one or more fields and optional fields as per the configured proximity value. It supports large data sets (like database records) and text in all languages, and provides data protection by hashing the stored data. The main benefit of using Exact Data Match is to reduce false positives by matching data exactly (unlike pattern-based matching).
For example, if you have the following content in the document to classify:
Name: Teresa M. Brown
Employee ID: 624828
and you are trying to match against an EDM source document that contains a row with the same name and employee ID, then this triggers a match.
Exact Data Match provides the following benefits:
Provides the ability to detect specific data sets from a database, for example, employee records.
Supports matching of combinations of data, for example, matching one or more fields and optional fields as per the configured proximity value.
Supports large data sets like database records.
Provides data protection by hashing of stored data.
Supports text in all languages.
To create a policy using an Exact Data Match pattern
- Follow the initial steps for creating or editing a policy as described earlier.
- In the operator list box, select contains exact data match in and then select the required EDM pattern from the value list box next to it.
- Click Save.
When you test a document against an EDM based policy, the Veritas Information Classifier shows the result. Also, the first column of the matching row is highlighted.
Example 1:
If source document content is as follows,
with Exact Data Matching Options as follows,
Name | Value |
---|---|
First row contains column headers | Yes |
Column delimiter | , |
Perform hashing to secure data fields | No |
Use case-sensitive matching | No |
Proximity for matches | 200 |
Minimum columns to match | 2 |
All columns | Not selected |
And if test document content is as follows,
The classification result shows a match for two records: Stuart and James.
Example 2:
For the same source document and test document as in the earlier example, if the Minimum columns to match value is set to 3, as follows:
Name | Value |
---|---|
First row contains column headers | Yes |
Column delimiter | , |
Perform hashing to secure data fields | No |
Use case-sensitive matching | No |
Proximity for matches | 200 |
Minimum columns to match | 3 |
All columns | Not selected |
The classification result shows a match for a single record, Stuart, because all three fields from the first record are present in the test document.
Example 3:
For the same source document and test document as in the first example, if the Proximity for matches value is set to 50, as follows:
Name | Value |
---|---|
First row contains column headers | Yes |
Column delimiter | , |
Perform hashing to secure data fields | No |
Use case-sensitive matching | No |
Proximity for matches | 50 |
Minimum columns to match | 3 |
All columns | Not selected |
In this case, the required words are not within 50 characters of each other. Therefore, the result shows no match.
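The interplay between the Minimum columns to match and Proximity for matches options in the three examples can be sketched in Python. This is an illustration under stated assumptions, not the product's algorithm: the `edm_matches` function, the way proximity is measured (distance in characters between the starting positions of field values), and the sample rows are all hypothetical. Hashing of the stored data is not shown.

```python
import re


def edm_matches(text, rows, min_columns=2, proximity=200, case_sensitive=False):
    """Return rows with at least min_columns field values found in text
    within `proximity` characters of each other (measured start to start)."""
    haystack = text if case_sensitive else text.lower()
    matched = []
    for row in rows:
        hits = []  # (position in text, column index)
        for col, value in enumerate(row):
            if not value:
                continue  # skip empty fields
            needle = value if case_sensitive else value.lower()
            hits += [(m.start(), col)
                     for m in re.finditer(re.escape(needle), haystack)]
        hits.sort()
        for i, (start, _) in enumerate(hits):
            # Distinct columns found within the proximity of this hit.
            cols = {c for pos, c in hits[i:] if pos - start <= proximity}
            if len(cols) >= min_columns:
                matched.append(row)
                break
    return matched


# Hypothetical source rows: name, employee ID, department.
rows = [("Teresa M. Brown", "624828", "Finance"),
        ("John Q. Public", "100200", "Sales")]
text = "Name: Teresa M. Brown\nEmployee ID: 624828"
print(edm_matches(text, rows, min_columns=2, proximity=200))
print(edm_matches(text, rows, min_columns=3, proximity=200))
print(edm_matches(text, rows, min_columns=2, proximity=10))
```

Raising the minimum column count or shrinking the proximity turns the match in the first call into a non-match, which mirrors the progression from Example 1 to Example 3.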
Classification performance for an Exact Data Match based policy depends on the following factors:
Number of records to be matched
Number of fields and field size
Data being classified
Number of matches
Proximity and column matches found
Compute hardware and available resources
You can set up a condition to restrict policy matching to items in a particular language. For example, set the condition like the one below to find items whose content is primarily in French:
One of the options in the language list matches items that contain at least two languages.
To safeguard against the Veritas Information Classifier ignoring items because it cannot determine their primary language, also select the language list option that covers this case. The most common reason why the Veritas Information Classifier may be unable to determine an item's primary language is that the item has a very small amount of content.
You can set up a condition to restrict policy matching to content that includes a person name or location.
Note:
The "contains entity" condition is available only if nlp-service-0.1.6.jar is used while running the Veritas Information Classifier application. Also, Named Entity Recognition (NER) is available only for English.
For example, set the condition like the one below to find content including the person name.
Note:
Named Entity Recognition (NER) consumes more time and resources compared to normal classification. NER is not suitable for large documents, especially documents bigger than 10 MB.
Starting with release 3.2.0, the risk score and risk level for each classified item are sent to the consuming applications. Consuming applications can analyze this information and support features such as sort, filter, search, and report on items by risk score, risk level, or both. By understanding the level of risk, you can optimize efforts on data management, review, and control, and prioritize activities and resources on the items of highest risk.
The risk score and risk level are based on the number of pattern or policy condition hits. Items with more hits are categorized as high risk. Items with fewer hits are categorized as low risk.
You can configure YAML to lower the weightage of patterns when the risk score is calculated. For more details, see Configuring hit count weightage through YAML.
The risk information is sent as part of the classify response only if the following conditions are met:
The matchDetailLevel in the classify request is set to LOW, MEDIUM, or HIGH
At least one policy hit is observed
The risk computation of potentially sensitive content is based on the degree of hits against patterns or policy conditions.
Consider the following example: a document is analyzed against a policy that contains five patterns, and the number of hits per pattern is as follows:
Pattern Number | Hits per Pattern |
---|---|
Pattern 1 | 7 |
Pattern 2 | 2 |
Pattern 3 | 4 |
Pattern 4 | 0 |
Pattern 5 | 3 |
The risk score for the document is the sum of all pattern hits. Therefore, the risk score is 7 + 2 + 4 + 0 + 3 = 16.
Risk is categorized into risk levels according to the risk score, as follows:
Risk Score | Risk Levels |
---|---|
0 | No |
1-2 | Low |
3-5 | Medium |
6-10 | High |
>=11 | Very High |
In the earlier example, the risk score is 16. Therefore, the risk level is Very High.
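The calculation and the level lookup can be expressed as a short Python sketch. The function names are hypothetical; the thresholds are taken from the risk level table above.

```python
def risk_score(hits_per_pattern: dict) -> int:
    """The risk score is the sum of the hits across all patterns."""
    return sum(hits_per_pattern.values())


def risk_level(score: int) -> str:
    """Map a risk score to a level using the thresholds in the table above."""
    if score == 0:
        return "No"
    if score <= 2:
        return "Low"
    if score <= 5:
        return "Medium"
    if score <= 10:
        return "High"
    return "Very High"


hits = {"Pattern 1": 7, "Pattern 2": 2, "Pattern 3": 4,
        "Pattern 4": 0, "Pattern 5": 3}
score = risk_score(hits)
print(score, risk_level(score))
```

With the five hit counts from the example, the sketch arrives at the same score and level as the text.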
For a less sensitive pattern, you can reduce its weightage in the risk score calculation, irrespective of the actual number of hits against it.
Use the application-specific YAML to configure a list of patterns for which any number of actual hits is always counted as a single (1) hit. Add the list of patterns under the lowerRiskRuleNameParts field in YAML. Note that the setting is common to all tenants.
Note:
In this release, the PDF report may show an inaccurate graph for 'Most Common Sensitive Data' after this configuration.
For example, see the following table that shows the effect on overall risk score when Pattern 1 is listed under lowerRiskRuleNameParts in YAML:
Pattern Number | Hits per Pattern | Final Hit Count |
---|---|---|
Pattern 1 | 7 | 1 |
Pattern 2 | 2 | 2 |
Pattern 3 | 4 | 4 |
Pattern 4 | 0 | 0 |
Pattern 5 | 3 | 3 |
Therefore, the risk score in this case is 1 + 2 + 4 + 0 + 3 = 10.
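The effect of the lowerRiskRuleNameParts setting on the hit counts can be sketched as follows. This Python illustration makes an assumption that the source does not confirm: that an entry in the list matches a pattern when the entry appears anywhere in the pattern name. The function name is hypothetical.

```python
def weighted_risk_score(hits_per_pattern: dict, lower_risk_name_parts: list) -> int:
    """Sum the hits, counting any hits on a lower-risk pattern as one hit."""
    total = 0
    for name, hits in hits_per_pattern.items():
        # Assumption: a list entry matches if it is contained in the name.
        is_lower_risk = any(part in name for part in lower_risk_name_parts)
        total += min(hits, 1) if is_lower_risk else hits
    return total


hits = {"Pattern 1": 7, "Pattern 2": 2, "Pattern 3": 4,
        "Pattern 4": 0, "Pattern 5": 3}
print(weighted_risk_score(hits, ["Pattern 1"]))
print(weighted_risk_score(hits, []))
```

A lower-risk pattern with zero hits still contributes zero, because the cap applies only when there is at least one hit.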
Policy condition hits that are based on sentiment score or named entities do not contribute to the risk score.
The following policy conditions contribute their actual number of hits to the risk score:
Content
Title
Author
Content Type
Recipient
The following policy conditions (properties) will start contributing to the risk score in later releases:
Modified Date
Creation Date
Sensitivity
Category
Size (Bytes)
Custom date field
Custom number field
Custom string field
A custom pattern that is listed under lowerRiskRuleNameParts in YAML lowers the risk score for a document only if the pattern is used in a policy condition. The risk score is not lowered if the pattern is used individually.
You can group a set of conditions and nest grouped conditions within other grouped conditions. The group operator that you choose determines whether an item must meet all, some, or none of the conditions in the group to be considered a match. The following group operators are available:
All of. An item must meet all the specified conditions.
Any of. An item must meet at least one of the specified conditions.
None of. An item must not meet any of the specified conditions.
Note:
You can nest a None of group within an All of group to look for certain condition matches while also excluding others. For example, to achieve the effect of "(condition X AND condition Y) BUT NOT condition Z", you would include the X and Y conditions in an All of group and the Z condition in a nested None of group.
n or more of. An item must meet the specified number of conditions.
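Nested groups behave like a small expression tree. The following Python sketch is an illustration with hypothetical names (`Group`, `evaluate`, `has`): each condition is modeled as a function that takes the item text and returns a Boolean, and the four group operators are evaluated recursively.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

Predicate = Callable[[str], bool]


@dataclass
class Group:
    operator: str  # "all", "any", "none", or "n_or_more"
    members: List[Union["Group", Predicate]]
    n: int = 0     # used only by "n_or_more"


def evaluate(node, text: str) -> bool:
    """Evaluate a condition (a callable) or a group of conditions."""
    if callable(node):
        return node(text)
    met = sum(evaluate(member, text) for member in node.members)
    if node.operator == "all":
        return met == len(node.members)
    if node.operator == "any":
        return met >= 1
    if node.operator == "none":
        return met == 0
    if node.operator == "n_or_more":
        return met >= node.n
    raise ValueError(node.operator)


def has(word: str) -> Predicate:
    return lambda text: word.lower() in text.lower()


# (X AND Y) BUT NOT Z: X and Y in an All of group, Z in a nested None of group.
policy = Group("all", [has("tax"), has("reform"), Group("none", [has("draft")])])
print(evaluate(policy, "Tax reform proposal"))
print(evaluate(policy, "Draft tax reform proposal"))
```

The second item contains the excluded word, so the nested None of group fails and the whole All of group fails with it.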
For an All of group only, you can choose to look for instances where the conditions occur within a specified number of characters of each other. For example, the following condition group looks for instances where the word Goodbye appears within 20 characters of the word Hello:
The text string "You say Goodbye and I say Hello" matches these conditions because there are fewer than 20 characters between the first character of Hello and the first character of Goodbye. Similarly, the string "You say Hello and I say Goodbye" also matches because there are fewer than 20 characters between the ends of the two words. In each case, the spaces count as characters.
Note:
When you conduct within nn characters proximity searches, take care not to duplicate the same search terms across multiple conditions. For example, suppose that you define one condition to look for the names Fred, Sue, and Bob, and a second to look for Joe, Bob, and Sarah. An item that contains a single instance of Bob would match these conditions.
Rather than choose the within nn characters option, you can choose the sliding window option. This option looks for instances where the conditions occur within any sequence of characters of the specified length. For example, a condition group that looks for instances where the word Goodbye appears within a 20-character sliding window of the word Hello does not match "You say Goodbye and I say Hello". There are 23 characters between the start of the word Goodbye and the end of the word Hello.
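The difference between the two character-based proximity modes can be checked with a short Python sketch. The measurement rules here are assumptions drawn from the Goodbye and Hello examples: the first mode compares the starting offsets of the two matches, and the sliding window mode requires both matches to fit entirely within a span of the given length. The function names are hypothetical.

```python
import re


def _spans(text, word):
    """Return (start, end) offsets of every occurrence of word in text."""
    return [m.span() for m in re.finditer(re.escape(word), text)]


def within_chars(text, a, b, n):
    """True if some a and b start fewer than n characters apart."""
    return any(abs(sa - sb) < n
               for sa, _ in _spans(text, a) for sb, _ in _spans(text, b))


def within_window(text, a, b, n):
    """True if some a and b both fit entirely inside an n-character span."""
    return any(max(ea, eb) - min(sa, sb) <= n
               for sa, ea in _spans(text, a) for sb, eb in _spans(text, b))


text = "You say Goodbye and I say Hello"
print(within_chars(text, "Goodbye", "Hello", 20))
print(within_window(text, "Goodbye", "Hello", 20))
print(within_window(text, "Goodbye", "Hello", 23))
```

The same string satisfies the first mode but not a 20-character sliding window, because the span from the start of Goodbye to the end of Hello is 23 characters.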