    What is data sampling?

    Data sampling is a statistical analysis technique in which we take a representative sample of a larger data set for further understanding and pattern recognition. When there is a huge amount of data to analyze, we analyze a sample instead of the entire data set. If the sample is representative, it describes and explains the entire data set.

    Types of sampling?

    Random sampling: Randomly choose data points from the larger data set, so that every data point has the same chance of being picked. In Python we can use random.sample(); in pandas we have df.sample().
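
    As a minimal sketch of the two calls named above: the toy data frame, its column names, and the seed values are assumptions made up for illustration.

```python
import random

import pandas as pd

# Hypothetical toy data: 100 rows with an id and a value column.
df = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

# Plain Python: draw 5 distinct items from a list, without replacement.
rng = random.Random(42)  # seeded only so the run is repeatable
picked = rng.sample(list(range(100)), k=5)

# pandas: draw a fixed number of rows, or a fraction of the rows.
sample_n = df.sample(n=10, random_state=42)
sample_frac = df.sample(frac=0.1, random_state=42)
```

    Passing a seed (random_state in pandas) is optional; it only makes the same sample come back on every run.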

    Stratified sampling: The data set is divided into small subgroups (strata) based on some common factor, and samples are randomly collected from each subgroup. Examples: sklearn.model_selection.train_test_split(*arrays, **options) with the stratify option, and sklearn.model_selection.StratifiedShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None).
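
    The idea can also be sketched with pandas alone, using groupby(...).sample() in place of the scikit-learn helpers named above. The data, the "label" column, and the 25% fraction are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: 80 rows of class "a" and 20 rows of class "b".
df = pd.DataFrame({
    "label": ["a"] * 80 + ["b"] * 20,
    "value": range(100),
})

# Sample 25% of the rows within each label group, so the 80/20 class
# balance of the full data carries over to the sample.
stratified = df.groupby("label").sample(frac=0.25, random_state=0)

# Count how many rows of each class ended up in the sample.
counts = stratified["label"].value_counts().to_dict()
```

    Here counts holds 20 rows of "a" and 5 rows of "b", the same 4:1 ratio as the full data.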

    Cluster sampling: The larger data set is divided into clusters (buckets) based on some factor, then a random sample of whole clusters is selected and analyzed.
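
    A rough sketch of cluster sampling in plain Python follows. The records, the use of a store id as the clustering factor, and the choice of 3 clusters are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical records: (store_id, sale_amount). store_id defines the cluster.
records = [(store, amount) for store in range(10) for amount in range(5)]

# Group the records into clusters by store_id.
clusters = defaultdict(list)
for store, amount in records:
    clusters[store].append((store, amount))

# Randomly pick 3 whole clusters and keep every record inside them.
rng = random.Random(0)
chosen_stores = rng.sample(sorted(clusters), k=3)
cluster_sample = [rec for store in chosen_stores for rec in clusters[store]]
```

    Note that the randomness applies to which clusters are chosen, not to the records inside a chosen cluster.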

    Multistage sampling: A more complicated version of cluster sampling. In this method we first divide the larger population into a number of clusters, as in cluster sampling. In the next stage, the selected clusters are broken into smaller clusters based on another factor, and these new clusters are then sampled and analyzed. This process can continue for further stages.
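
    One possible three-stage version is sketched below. The region/store/sale hierarchy and the sample sizes at each stage are invented for illustration.

```python
import random

rng = random.Random(1)

# Hypothetical hierarchy: 4 regions, each with 5 stores, each with 20 sales.
population = {
    f"region_{r}": {
        f"store_{r}_{s}": [f"sale_{r}_{s}_{i}" for i in range(20)]
        for s in range(5)
    }
    for r in range(4)
}

# Stage 1: randomly choose 2 regions.
regions = rng.sample(sorted(population), k=2)

multistage_sample = []
for region in regions:
    # Stage 2: within each chosen region, randomly choose 2 stores.
    stores = rng.sample(sorted(population[region]), k=2)
    for store in stores:
        # Stage 3: within each chosen store, randomly choose 5 sales.
        multistage_sample.extend(rng.sample(population[region][store], k=5))
```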

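    Systematic sampling: We set a predefined interval and pull data from the population at that interval, usually from a random starting point. Example: every 10th row in a data frame or in Excel. Simply taking the first 100 rows does not spread the sample across the data set.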
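
    Systematic sampling is commonly defined as taking every k-th element after a random start. A minimal sketch, assuming an interval of k = 10 over 100 hypothetical row indices:

```python
import random

# Hypothetical population of 100 row indices.
population = list(range(100))

k = 10                     # sampling interval (assumed for illustration)
rng = random.Random(3)
start = rng.randrange(k)   # random start within the first interval

# Take every k-th element, beginning at the random start.
systematic_sample = population[start::k]
```

    With a pandas data frame, the same slice could be written as df.iloc[start::k].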

    The methods above are probability sampling methods. We can also use nonprobability sampling, in which the analyst decides, based on knowledge and experience, which data is important.

    Nonprobability data sampling methods include:

    Convenience sampling: Data is collected from a convenient, easily available group.

    Consecutive sampling: Every data point that meets a given criterion is collected, in order, until the predetermined sample size is met.

    Purposive sampling: The data is selected based on the analyst's judgment.

    Quota sampling: The analyst sets a quota for each group so that the groups selected from the data represent the population, then fills each quota without random selection.
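
    Quota sampling can be sketched as filling a fixed number of slots per group in arrival order, with no random selection. The respondents, the age groups, and the quota sizes below are hypothetical.

```python
# Hypothetical respondents in arrival order: (name, age_group).
respondents = [
    ("Ana", "18-34"), ("Ben", "18-34"), ("Cy", "18-34"),
    ("Dee", "35-54"), ("Eli", "55+"), ("Fay", "35-54"), ("Gus", "55+"),
]

# Quotas chosen by the analyst to mirror the assumed population makeup.
quotas = {"18-34": 2, "35-54": 2, "55+": 1}

selected = []
filled = {group: 0 for group in quotas}
for name, group in respondents:
    # Accept a respondent only while that group's quota is still open.
    if filled.get(group, 0) < quotas.get(group, 0):
        selected.append((name, group))
        filled[group] += 1
    # Stop as soon as every quota has been filled.
    if filled == quotas:
        break
```

    Because respondents are taken in arrival order rather than at random, the result is a nonprobability sample even though the group proportions match the quotas.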

