The aim of this post is to provide a short guide for analyzing nominal data.
The first and most important step in data analysis is to identify the type of measure of each variable and the tests that are applicable to each type. Data can be classified into two major groups: Continuous and Categorical. These groups are further divided into Interval and Ratio for Continuous, and Nominal and Ordinal for Categorical.

Types of Measures
Interval: Also known as scale data. This refers to an actual measurement that can be a whole number (e.g. 0, 2, 45) or a fractional number (e.g. 4.3, -8.55). The classic example of interval data is temperature in Celsius or Fahrenheit. Both positive and negative numbers can exist, as well as zero, but the zero point is arbitrary and does not mean the absence of the quantity. Interval data are commonly used in research because standard descriptive statistics can be applied, such as the mean and mode for central tendency and the standard deviation for spread.
Ratio: Ratio variables have the same properties as interval data but also have a clearly defined zero, which makes the ratio between two values meaningful. For example, in the number of goals scored in a football match, 2 goals are double 1 goal. Negative values can technically exist but are rare in practice.
Nominal: Nominal data are discrete categories that do not overlap. A nominal variable can take two or more categories that have no hierarchy and cannot be ordered in any way; no category is better or worse than another. For example:
Gender; Male / Female
Color; Red / Yellow / Blue / …
Nationality; European / American / Asian / …
The only measure of central tendency that is applicable is the mode, i.e. the category with the highest frequency.
Ordinal: This type of categorical data places the values on a linear scale, but the gaps between the values cannot be considered equal. A typical example of ordinal data is education level, with the categories Primary School, Secondary School, Bachelor Degree, Master’s Degree and Doctoral Degree. There is a clear order, but the distance between Primary and Secondary School, or between Bachelor and Master, is neither equal nor measurable.
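As a small illustration of the difference between the two categorical types, the sketch below computes a frequency table and the mode for a nominal variable, and a median category for an ordinal one. The responses are invented for the example and the function names are hypothetical helpers, not part of any library.

```python
from collections import Counter

# Hypothetical responses, for illustration only (not the post's data).
colors = ["Red", "Blue", "Blue", "Green", "Red", "Blue", "Yellow"]
education = ["Secondary School", "Primary School", "Bachelor Degree",
             "Secondary School", "Doctoral Degree"]

# The order of the ordinal categories must be supplied explicitly.
LEVELS = ["Primary School", "Secondary School", "Bachelor Degree",
          "Master's Degree", "Doctoral Degree"]


def frequency_table(values):
    """Return (category, count) pairs, most frequent first."""
    return Counter(values).most_common()


def mode(values):
    """Return the most frequent category (first one seen on ties)."""
    return Counter(values).most_common(1)[0][0]


def ordinal_median(values, levels):
    """Return the middle category after sorting by rank.

    For an even number of responses this picks the upper middle one.
    """
    ranks = sorted(levels.index(v) for v in values)
    return levels[ranks[len(ranks) // 2]]


print(frequency_table(colors))
print(mode(colors))
print(ordinal_median(education, LEVELS))
```

The median is only available for the ordinal variable because it relies on the order in LEVELS; for the colors no such order exists, so the mode is as far as we can go.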
Preparing Nominal Data
Since Nominal data refer to named data and can often take a large variety of answers, it is recommended to organize the data before the analysis, if needed and if possible. Answers that carry the same meaning can be grouped, but the researcher should be careful not to lose important information.
This is mainly applicable to questions where some answers receive very few responses, for example “Favorite Color”. Responses of rare colors such as Silver, Violet or Tan could be grouped as “Others”. Grouping can also be done into parent categories, for example “Country of Origin” could be grouped into “Continent of Origin” if the analysis would still be relevant. The aim is to have a sufficient number of responses in each group to enable proper statistical analysis and meaningful interpretation.
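Both kinds of grouping can be sketched in a few lines of Python. The threshold of 3 responses, the sample answers and the country-to-continent mapping are assumptions made for this example; the right cut-off depends on the sample size and the research question.

```python
from collections import Counter


def group_rare(values, min_count, other_label="Others"):
    """Replace categories with fewer than min_count responses by other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def group_to_parent(values, parent_of, unknown_label="Unknown"):
    """Map each value to its parent category using a lookup table."""
    return [parent_of.get(v, unknown_label) for v in values]


# Hypothetical survey answers, for illustration only.
colors = ["Blue"] * 5 + ["Red"] * 4 + ["Silver", "Violet", "Tan"]
grouped_colors = group_rare(colors, min_count=3)
print(Counter(grouped_colors))

# Hypothetical, deliberately incomplete mapping.
CONTINENT_OF = {"France": "Europe", "Greece": "Europe", "Japan": "Asia"}
countries = ["France", "Japan", "Greece", "Japan"]
print(Counter(group_to_parent(countries, CONTINENT_OF)))
```

Keeping the original column next to the grouped one makes it possible to go back to the detailed answers if the grouping turns out to hide something relevant.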
Standard Approach: Chi-square Test
The standard way of analyzing Nominal data is to apply the Chi-square test. The test examines the relationship between two categorical variables and assesses whether they are associated, i.e. whether the distribution of one differs across the categories of the other. An example is to examine whether color preference differs between Males and Females. To do so, the test first counts the frequency of each color separately for Males and Females. Then it calculates the count that would be expected in each cell if there were no differences between the genders. Lastly, it compares the expected counts with the actual counts and provides a Chi-square value and a p-value that indicate whether the difference is statistically significant.
In SPSS the Chi-square test can be accessed through Analyze > Descriptive Statistics > Crosstabs, where one variable is selected as Row and the other as Column (the order does not matter).
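The three steps described above (observed counts, expected counts, comparison) can be written out by hand. The following is a minimal sketch using only the Python standard library; the contingency table is invented for illustration and is not the dataset used in this post, and chi_square is a hypothetical helper name.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom, expected counts).

    observed is a list of rows, each a list of cell counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count under no association: row total * column total / n.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

    # Sum of (observed - expected)^2 / expected over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts. Rows: Male, Female. Columns: Blue, Green, Red.
observed = [
    [20, 15, 15],
    [10, 20, 20],
]
statistic, dof, expected = chi_square(observed)
print(round(statistic, 3), dof)
print(expected)
```

If SciPy is installed, scipy.stats.chi2_contingency carries out the same calculation and also returns the p-value; note that for 2x2 tables it applies a continuity correction by default, so its statistic can differ from the hand calculation.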
In the Statistics tab select “Chi-square”.
In the Cells tab select “Expected” for Counts.



For an example dataset with Color and Gender as Nominal variables, the Chi-square test gives the following output.


The null and alternative hypotheses would be:
H0: Best Color selection is not associated with Gender, and
H1: Best Color selection is associated with Gender
Results: The Chi-Square value is 4.975 with a p-value of 0.419. Since the p-value is above 0.05 we cannot reject the null hypothesis and hence we cannot conclude that Best Color selection is associated with Gender.
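The reported p-value can be cross-checked from the Chi-square value alone once the degrees of freedom are known. The sketch below assumes 5 degrees of freedom, i.e. six color categories (Blue, Green, Red, Purple, Black and Other, as they appear in the regression part of this post) and two genders; the post itself does not state the degrees of freedom, so treat that number as an assumption. The function chi2_sf is a hypothetical helper that uses the closed-form tail probability for integer degrees of freedom.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) of a chi-square variable.

    Uses the exact finite-sum formulas that hold for integer dof >= 1.
    """
    if dof < 1:
        raise ValueError("dof must be a positive integer")
    if x <= 0:
        return 1.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{j=0}^{dof/2 - 1} (x/2)^j / j!
        term = 1.0
        total = 1.0
        for j in range(1, dof // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd dof: erfc(sqrt(x/2)) plus a finite sum of half-integer powers of x.
    total = 0.0
    term = math.sqrt(x)
    for j in range(1, (dof - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


# Assumed: (6 colors - 1) * (2 genders - 1) = 5 degrees of freedom.
p_value = chi2_sf(4.975, 5)
print(round(p_value, 3))
print("reject H0" if p_value < 0.05 else "cannot reject H0")
```

Under that assumption the result should land close to the 0.419 reported by SPSS, which supports the reading that the test was run on a 6 x 2 table.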
Dynamic Approach: Create dummies
What if, however, we want to add more variables to the analysis and use Color preference as an independent variable? How can we use Nominal data in a regression analysis, where each variable is assigned a coefficient? The answer is to recode the categorical variable into several separate, dichotomous variables and use a dynamic approach!
Dichotomous variables are binary dummy variables that take the value 0 for No and 1 for Yes. In our example there would be one dummy variable for each color. That is, a variable for Blue, another for Green, another for Red etc. It would be as if the question was “Is Blue your favorite color?” and the answer would be Yes or No. Then “Is Green your favorite color?”, “Is Red your favorite color?” etc.
Let us use these dummy variables in a regression analysis in order to examine how useful color preference is for predicting a psychological depression score. The score measures the level of depression from 0 to 100, where the higher the score, the higher the depression.
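The recoding itself is mechanical. Here is a minimal sketch in Python with invented responses; make_dummies is a hypothetical helper written for this example.

```python
def make_dummies(values, categories):
    """Return one 0/1 column per category, keyed by category name."""
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical answers to "What is your favorite color?"
favorite = ["Blue", "Red", "Green", "Blue", "Red"]
dummies = make_dummies(favorite, ["Blue", "Green", "Red"])

for name, column in dummies.items():
    print(name, column)
```

Each column answers one Yes/No question, and every respondent has a 1 in exactly one of them.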
In SPSS we select Analyze > Regression > Linear

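In the dialog that opens we select the variable “Score” as dependent and the variables of Gender and the newly formed dummies as independent. The “Other” color is excluded because it is already defined by the zeros of the remaining dummies. That is, if all the values are zero then the favorite color is “Other”. “Other” therefore acts as the reference category; including its dummy as well would make the dummies sum to 1 for every respondent, duplicating the constant term of the regression.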
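The reason one category can be left out without losing information can be checked directly. The sketch below uses the color names from this example; encode, decode and encode_full are hypothetical helpers, and the sample answers are invented.

```python
KEPT = ["Blue", "Green", "Red", "Purple", "Black"]  # dummies entered in the model
REFERENCE = "Other"                                  # excluded category


def encode(color):
    """Dummy row without the reference category."""
    return [1 if color == c else 0 for c in KEPT]


def decode(row):
    """Recover the color from a dummy row; all zeros means the reference."""
    if sum(row) == 0:
        return REFERENCE
    return KEPT[row.index(1)]


def encode_full(color):
    """Dummy row that also includes a dummy for the reference category."""
    return [1 if color == c else 0 for c in KEPT + [REFERENCE]]


sample = ["Red", "Other", "Black", "Blue"]
rows = [encode(c) for c in sample]
print(rows)
print([decode(r) for r in rows])

# With all six dummies, every row sums to 1, exactly like the constant term.
print([sum(encode_full(c)) for c in sample])
```

Since the five kept dummies already identify every respondent's answer, the sixth adds no information and would only make the predictors perfectly collinear with the intercept.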

The output of the regression is shown below. We are interested in the coefficients and their significance.

The variables with a significance level (Sig.) below 0.05 have statistically significant coefficients. These are Green, Red, Purple and Black. The Gender and Blue coefficients are not statistically significant, so we cannot conclude that they differ from zero.
It is interesting to see that “Red” takes a negative coefficient of -14.910. Because “Other” is the excluded category, each color coefficient is a comparison against “Other”: people who selected “Red” as their favorite color score on average 14.910 points lower than people who selected “Other”, holding Gender constant. On the other hand, “Black” has the highest positive coefficient, indicating that people who selected Black as their favorite color tend to receive a higher depression score. The same applies to Green and Purple, which are also associated with a higher score. Separate dummies for each color therefore make the variable easy to handle in the analysis and the coefficients easy to interpret.
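To see why a dummy coefficient reads as a difference from the excluded category, here is a minimal least-squares sketch in plain Python. It is a simplified model with only two color dummies and no Gender variable, the scores are invented, and solve and ols are hypothetical helpers, so the numbers do not reproduce the SPSS output above.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: favorite color and depression score.
colors = ["Other", "Other", "Red", "Red", "Black", "Black"]
scores = [50, 54, 36, 40, 70, 74]

# Columns: constant, Red dummy, Black dummy ("Other" is the reference).
X = [[1, int(c == "Red"), int(c == "Black")] for c in colors]
constant, b_red, b_black = ols(X, scores)
print(constant, b_red, b_black)
```

In this simplified setting the constant equals the mean score of the “Other” group, and each dummy coefficient equals that color's mean minus the “Other” mean. Once further predictors such as Gender are added, the coefficients become adjusted differences rather than raw differences in means.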
Conclusion
Nominal variables are an effective way of recording attributes, since they can carry vast amounts of information. Analyzing such data can be tricky, however, because there are often many categories to deal with. It is important to recode the data into meaningful groups, up to the level of detail that the research question requires. The Chi-square test can be applied to Nominal data, but if we need to go deeper and use regression analysis, then the data should be recoded into multiple dummy variables.
For further information and thoughts just contact me.

