Data Sampling

Data sampling is a technique for classifying large volumes of data more efficiently, at the risk of silent failure. Rather than classifying every element of a data repository, sampling examines and classifies only a subset of the data. This approach can work well for structured or semi-structured data, where the data follows the same format from the beginning of the file through to the end. For these types of data, we can examine a few representative samples and be reasonably confident that the rest of the data is substantially similar.

Sampling can have significant accuracy issues when applied to unstructured data. In unstructured files, the kinds of information at the start of the file often differ significantly from those in the middle and at the end. Employing sampling for unstructured data therefore brings a high risk of false negatives: pieces of data that contain valid findings are never examined, so the findings are missed without any error being reported.
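
The contrast between the two kinds of data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the SSN-like regular expression stands in for a real classifier, `classify` and `classify_sampled` are hypothetical helper names, and "sampling" is simplified to reading only the first few lines of a file.

```python
import re

# Hypothetical detector: an SSN-like pattern stands in for a real classifier.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def classify(lines):
    """Return the set of finding labels detected anywhere in `lines`."""
    return {"ssn"} if any(SSN_PATTERN.search(line) for line in lines) else set()


def classify_sampled(lines, sample_size):
    """Classify only the first `sample_size` lines (simple head sampling)."""
    return classify(lines[:sample_size])


# Structured data: every row follows the same format, so any sample is representative.
structured = [f"{i},Jane Doe,123-45-6789" for i in range(1000)]

# Unstructured data: a single finding buried in the middle of otherwise benign text.
unstructured = ["meeting notes, nothing sensitive"] * 1000
unstructured[500] = "employee SSN is 123-45-6789"

print(classify_sampled(structured, 10))    # {'ssn'}: the sample agrees with a full scan
print(classify_sampled(unstructured, 10))  # set(): a silent false negative
print(classify(unstructured))              # {'ssn'}: only the full scan finds it
```

In this sketch the sampled scan of the unstructured data returns an empty result rather than an error, which is what makes the failure silent. Random or stratified sampling would lower the chance of a miss but could not eliminate it for a finding that occurs only once.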