How To Find A Missing Thing

R Find Missing Values (6 Examples for Data Frame, Column & Vector)

Let's face it:

Missing values are an issue in virtually every raw data set!

If we don't handle our missing data in an appropriate way, our estimates are likely to be biased.

However, before we can deal with missingness, we need to identify in which rows and columns the missing values occur.

In the following, I will show you several examples of how to find missing values in R.

Example 1: One of the most common ways in R to find missing values in a vector

expl_vec1 <- c(4, 8, 12, NA, 99, -20, NA)   # Create your own example vector with NAs

is.na(expl_vec1)                            # The is.na() function returns a logical vector. The vector is TRUE in case
                                            # of a missing value and FALSE in case of an observed value

which(is.na(expl_vec1))                     # The which() function returns the positions with missing values in your vector.
                                            # In our case there are NAs at positions 4 & 7
### [1] 4 7

You can find a more detailed explanation for this example in the following video:

Example 2: Find missing values in a column of a data frame

expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),                              # Numeric variable with one missing value
                         x2 = c(4, 1, NA, NA, 4),                             # Numeric variable with two missing values
                         x3 = c(1, 4, 2, 9, 6),                               # Numeric variable without any missing values
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA),  # Factor variable with two missing values
                         stringsAsFactors = TRUE)                             # Needed since R 4.0.0 to store x4 as a factor

expl_data1                                                                    # This is what our data with missing values looks like

Table 1: Example Data Frame with Missing Values

which(is.na(expl_data1$x1))   # Same procedure as in Example 1, but this time with the column of a data frame;
                              # Missing value in x1 at position 1

which(is.na(expl_data1$x2))   # Variable x2 has missing values at positions 3 and 4

which(is.na(expl_data1$x3))   # The variable x3 in column 3 has no missing values

which(is.na(expl_data1$x4))   # Our factor variable x4 in column 4 has missing values at positions 3 and 5;
                              # The same procedure can be applied to factors

Example 3: Identify missing values in an R data frame

# As in Example 1, you can create a logical matrix of TRUE and FALSE values,
# indicating missing (TRUE) and observed (FALSE) values
is.na(expl_data1)

apply(is.na(expl_data1), 2, which)   # In order to get the positions of each column in your data set,
                                     # you can use the apply() function
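If one table of row and column positions is more convenient than a list per column, the arr.ind argument of which() can be used. The following is only a minimal sketch; it recreates the example data of Example 2 so that it runs on its own, and the object name na_positions is made up for illustration.

```r
# Recreate the example data of Example 2 so that this snippet is self-contained
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

# arr.ind = TRUE returns one line per missing cell with its row and column index
na_positions <- which(is.na(expl_data1), arr.ind = TRUE)   # na_positions is a hypothetical name
na_positions
```

Each line of the result describes one missing cell, so the number of lines equals the overall number of NAs.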

Example 4: Find missing values in a column of an R matrix

# Create matrix on the basis of the first three columns of our example data of Example 2
expl_matrix1 <- as.matrix(expl_data1[ , 1:3])
expl_matrix1

which(is.na(expl_matrix1[ , 1]))   # The $ operator is invalid for columns of matrices.
                                   # Therefore we have to select our matrix columns by square brackets

which(is.na(expl_matrix1[ , 2]))   # Besides the change from the $ operator to square brackets,
                                   # we can use the same functions as in the other examples

which(is.na(expl_matrix1[ , 3]))   # Once again, no missing values in x3

Example 5: Identify NA values in a matrix

# We can check the missing values of the whole matrix with the same procedure as in Example 3
apply(is.na(expl_matrix1), 2, which)

Example 6: Find missing values in R with the complete.cases() function

# An alternative to the is.na() function is the function complete.cases(),
# which searches for observed values instead of missing values
which(complete.cases(expl_vec1))            # Identify observed values (opposite result as in Example 1)

which(complete.cases(expl_vec1) == FALSE)   # Reproduce result of Example 1 by adding == FALSE

complete.cases(expl_data1)                  # If a data frame or matrix is checked by complete.cases(),
                                            # the function returns a logical vector indicating whether a row is complete
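Because complete.cases() works row by row on a data frame, negating it gives the rows that contain at least one NA. This is a small sketch under the assumption that the example data of Example 2 is used; the name incomplete_rows is only illustrative.

```r
# Recreate the example data of Example 2 so that this snippet is self-contained
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

# Row numbers with at least one missing value (incomplete_rows is a hypothetical name)
incomplete_rows <- which(!complete.cases(expl_data1))
incomplete_rows

# Show only the incomplete rows of the data frame
expl_data1[incomplete_rows, ]
```

In this data only the second row is fully observed, so all other rows are returned.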

Video Example – Find Missing Values in a Real Data Set

The following video of my YouTube channel shows in a live example how to find NA, how to count NA, how to omit NA, and how to remove missing values.

Have a look at minute 1:05.

I'm showing here the same approach that I have explained in Example 1.

R – Count Missing Values per Row and Column

Besides the position of your missing data, the question might arise of how to count missing values per row, by column, or in a single vector. Let's check how to do this based on our example data above:

# With the sum() and the is.na() functions you can find the number of missing values in your data
sum(is.na(expl_vec1))      # Two missings in our vector

sum(is.na(expl_data1))     # The same method works for the whole data frame; five missings overall

sum(is.na(expl_matrix1))   # The procedure also works for matrices; the NA count is three in our case
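The counts per row and per column can be obtained in a similar way with rowSums() and colSums(). The snippet below is a minimal sketch that recreates the example data of Example 2; the names na_per_column and na_per_row are chosen only for illustration.

```r
# Recreate the example data of Example 2 so that this snippet is self-contained
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

# is.na() returns a logical matrix; TRUE counts as 1 when summed
na_per_column <- colSums(is.na(expl_data1))   # Number of NAs in each column
na_per_row    <- rowSums(is.na(expl_data1))   # Number of NAs in each row

na_per_column
na_per_row
```

The same two calls also work for a matrix such as expl_matrix1.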

How to Handle Missing Data in R?

Once we have found and located missing values and their index positions in our data, the question arises of how we should treat these not available values. Complete case data is needed for most data analyses in R!

The default method of many functions in the R programming language (for example lm()) is listwise deletion, which deletes all rows with missing values in one or more columns.

Basic data manipulations can be done with the na.omit command or with the is.na R function.
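As a rough illustration of both commands, the following sketch applies them to the example data of Example 2 (recreated here so that the code runs on its own). The object names complete_data and x1_observed are assumptions made for this example.

```r
# Recreate the example data of Example 2 so that this snippet is self-contained
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

# Listwise deletion: drop every row with at least one missing value
complete_data <- na.omit(expl_data1)
complete_data

# Drop only the rows in which one specific column (here x1) is missing
x1_observed <- expl_data1[!is.na(expl_data1$x1), ]
x1_observed
```

Note how much data listwise deletion costs here: only one of the five rows survives, whereas the column-wise filter keeps four.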

A more sophisticated approach – which is ordinarily preferable to a complete case analysis – is the imputation of missing values.

Very simple imputation approaches would be mean imputation (mode imputation in case of categorical variables) or the replacement of NAs with 0.
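These simple approaches could look roughly as follows. The sketch uses the values of x2 from Example 2 and a small categorical vector x_cat that is invented purely for illustration; all object names are assumptions.

```r
x2 <- c(4, 1, NA, NA, 4)                              # Values of x2 from Example 2

# Mean imputation: replace NAs by the mean of the observed values
x2_mean <- x2
x2_mean[is.na(x2_mean)] <- mean(x2, na.rm = TRUE)
x2_mean

# Replacement of NAs with 0
x2_zero <- x2
x2_zero[is.na(x2_zero)] <- 0
x2_zero

# Mode imputation for a categorical variable (x_cat is hypothetical example data)
x_cat <- c("a", "b", NA, "a", NA)
mode_value <- names(which.max(table(x_cat)))          # table() ignores NAs by default
x_cat_mode <- x_cat
x_cat_mode[is.na(x_cat_mode)] <- mode_value
x_cat_mode
```

If several categories are tied for the highest frequency, which.max() picks the first one, so the mode chosen in this sketch is not unique in that case.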

However, in order to create a more reasonable complete data set, missing data imputation usually replaces missing values with estimates that are based on statistical models (e.g. via regression imputation or predictive mean matching).
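To give an idea of the model-based approach, here is a minimal sketch of deterministic regression imputation in base R. The data frame toy and all object names are invented for this illustration, and the data are perfectly linear only to keep the result easy to check; it is not meant as a full imputation workflow.

```r
# Hypothetical toy data: y depends linearly on x and has two missing values
toy <- data.frame(x = 1:8,
                  y = c(3, 5, NA, 9, 11, NA, 15, 17))

# Fit the regression on the observed cases (lm() drops rows with NAs by default)
fit <- lm(y ~ x, data = toy)

# Replace the missing y values by the predictions of the model
y_missing <- is.na(toy$y)
toy$y_imp <- toy$y
toy$y_imp[y_missing] <- predict(fit, newdata = toy[y_missing, ])
toy
```

Imputing the plain predictions places every imputed value exactly on the regression line, which is one reason why stochastic variants or predictive mean matching are often preferred in practice.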

Now It's Your Turn

So that is how I'm checking for missing values in my data sets.

Now I'd like to hear about your thoughts: What's your favorite approach?

Are you going to use the is.na function of Example 1? Or will you find NAs by searching for complete cases?

Let me know by leaving a comment below. I will respond to every question!

Appendix

How to create the graphic of the header of this page

The header graphic shows a simple dotplot created with the R package ggplot2.

The dark blue values indicate observed values; the light blue values indicate missingness.

Since the missing values appear more often in the upper right part of the plot, they cannot be considered as Missing Completely At Random anymore.

library("ggplot2")                                           # Load the ggplot2 package

set.seed(8765)                                               # Reproducibility

var1 <- rnorm(2000, 10, 3)                                   # Normal distribution
var2 <- var1 + rnorm(2000)                                   # Correlated normal distribution

range01 <- function(x){(x - min(x)) / (max(x) - min(x))}     # Rescale values to the range between 0 and 1,
                                                             # so that they can be used as probabilities of missingness

var2_miss <- rbinom(2000, 1, range01(var1^3)) == 1           # Draw missingness indicator for var2 depending on var1

data_ggplot_missings <- data.frame(var1, var2)               # Store var1 and var2 in a data frame

colours <- rep(1, 2000)                                      # Set colours
colours[var2_miss] <- 2

ggplot_missings <- ggplot(data_ggplot_missings, aes(x = var1, y = var2)) +   # Create ggplot
  geom_point(aes(col = colours, size = 1.1)) +
  theme(legend.position = "none")

ggplot_missings                                              # Draw the plot