Vast amounts of data are generated every second. Understanding data means learning how to collect, organise, analyse, and visualise it in order to extract meaningful insights and make informed decisions. This chapter lays the foundation of data science and statistical thinking for computer science students.
What is Data?
Data is a collection of raw, unorganised facts and figures. On its own, data may not be meaningful. When data is processed, organised, or structured to be useful in a specific context, it becomes information.
- Examples:
- Data: 78, 85, 92, 67, 55 (a list of numbers)
- Information: "These are the marks of 5 students in a maths test, with an average of 75.4"
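The step from data to information can be sketched in a few lines of Python. The variable names `marks` and `average` are chosen here for illustration; the numbers are the ones from the example above.

```python
# Raw data: five numbers with no context attached.
marks = [78, 85, 92, 67, 55]

# Processing: summarise the numbers and attach a meaning to them.
average = sum(marks) / len(marks)

print(f"Average mark of {len(marks)} students: {average}")  # Average mark of 5 students: 75.4
```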
Types of Data
- Quantitative (Numerical) Data — data that can be measured and expressed as numbers.
- Discrete: countable values (number of students, number of cars).
- Continuous: any value within a range (height, weight, temperature).
- Qualitative (Categorical) Data — data that describes categories or labels.
- Nominal: categories with no natural order (colour, gender, city).
- Ordinal: categories with a natural order (poor / fair / good / excellent).
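As a rough illustration of the four types, the sketch below stores one value of each kind in a made-up survey record. The field names and the `RATING_ORDER` list are assumptions for this example only. It also shows why ordinal data needs its order stated explicitly: sorting the labels alphabetically would not give the natural order.

```python
# Hypothetical survey record, one field per data type.
record = {
    "siblings": 2,        # quantitative, discrete (a count)
    "height_cm": 162.5,   # quantitative, continuous (a measurement)
    "city": "Pune",       # qualitative, nominal (no natural order)
    "rating": "good",     # qualitative, ordinal (ordered categories)
}

# The natural order of the ordinal categories, lowest to highest.
RATING_ORDER = ["poor", "fair", "good", "excellent"]

ratings = ["good", "poor", "excellent", "fair", "good"]

# Sort by position in RATING_ORDER rather than alphabetically.
sorted_ratings = sorted(ratings, key=RATING_ORDER.index)
print(sorted_ratings)  # ['poor', 'fair', 'good', 'good', 'excellent']
```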
Data Collection
- Data can be collected from:
- Primary sources: surveys, questionnaires, direct observation, experiments.
- Secondary sources: books, internet, published reports, databases.
Statistical Measures
Measures of Central Tendency describe the "centre" of a data set.
- Mean (Arithmetic Mean): sum of all values / count of values.
- Mean = (x1 + x2 + ... + xn) / n
- Median: the middle value when data is sorted.
- If n is odd: median = value at position (n+1)/2.
- If n is even: median = average of values at positions n/2 and n/2 + 1.
- Mode: the value that appears most frequently. A dataset can have no mode, one mode (unimodal), or multiple modes.
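The definitions above can be written out by hand to see how they work. This is a minimal sketch: `mean` and `median` are helper functions written for this example, not library functions, while `multimode` comes from the standard `statistics` module (Python 3.8 or later).

```python
from statistics import multimode


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)  # always sort first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # position (n+1)/2 when counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n/2 and n/2 + 1


marks = [78, 85, 92, 67, 55]
print(mean(marks))                  # 75.4
print(median(marks))                # 78
print(median([10, 20, 30, 40]))     # 25.0
print(multimode([1, 1, 2, 2, 3]))   # [1, 2]
```

Python lists are indexed from 0, so position (n+1)/2 in the formula corresponds to index `n // 2` in the code.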
Measures of Spread (Dispersion) describe how far the values lie from each other and from the centre.
- Range: maximum value - minimum value.
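- Variance: average of the squared differences from the mean. Dividing by n gives the population variance; the sample variance divides by n - 1 instead.
- Variance = ((x1 - mean)² + (x2 - mean)² + ... + (xn - mean)²) / n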
- Standard Deviation (SD): square root of variance. SD = √(Variance). Indicates how spread out values are around the mean.
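A worked example may make the formulas concrete. The sketch below computes the range, variance, and standard deviation step by step for a small made-up data set, dividing by n as in the formula above, and then compares the result with the `statistics` module.

```python
import math
import statistics

data = [10, 20, 20, 30, 40]
n = len(data)

value_range = max(data) - min(data)                   # 40 - 10
mean = sum(data) / n                                  # 120 / 5
squared_diffs = [(x - mean) ** 2 for x in data]       # 196, 16, 16, 36, 256
variance = sum(squared_diffs) / n                     # 520 / 5, dividing by n
sd = math.sqrt(variance)

print(value_range)       # 30
print(variance)          # 104.0
print(round(sd, 2))      # 10.2

# statistics.pvariance() also divides by n, so it agrees with the manual result.
print(statistics.pvariance(data) == variance)  # True

# statistics.variance() divides by n - 1 (sample variance): 520 / 4 = 130.
sample_variance = statistics.variance(data)
```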
Data Visualisation
Charts and graphs help humans understand patterns in data quickly.
| Chart Type | Best Used For |
|------------|---------------|
| Bar Chart | Comparing categories |
| Pie Chart | Showing proportions of a whole |
| Line Graph | Showing trends over time |
| Histogram | Showing frequency distribution of continuous data |
| Scatter Plot | Showing relationship between two variables |
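To see what a histogram actually counts, the sketch below groups some made-up heights (rounded to whole centimetres) into bins of width 10 and prints a text-only bar for each bin. It uses only the standard library; a plotting library such as matplotlib would draw the same counts as a proper chart. The data and the bin width are assumptions for illustration.

```python
from collections import Counter

heights_cm = [148, 152, 155, 157, 161, 163, 164, 168, 171, 179]  # made-up sample
bin_width = 10

# Map each value to the lower edge of its bin, e.g. 157 -> 150.
counts = Counter((h // bin_width) * bin_width for h in heights_cm)

# Print one row per bin, lowest bin first.
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * counts[lower]}")

# 140-149: #
# 150-159: ###
# 160-169: ####
# 170-179: ##
```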
Python Libraries for Data Analysis
- numpy: numerical operations, array handling.
- pandas: data manipulation, DataFrames, reading CSV/Excel.
- matplotlib: creating charts and graphs.
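- statistics module: mean(), median(), mode(), stdev(). Note that stdev() returns the sample standard deviation (divides by n - 1); pstdev() uses the population formula (divides by n).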
Example using statistics module:
```
import statistics
data = [10, 20, 20, 30, 40]
print(statistics.mean(data)) # 24
print(statistics.median(data)) # 20
print(statistics.mode(data)) # 20
```
Data Interpretation
- After computing statistics and creating visualisations, the key step is interpretation — explaining what the data tells us. Ask:
- Is there a trend? (increasing/decreasing over time)
- Are there outliers? (values far from others)
- Is there a relationship between two variables? (correlation)
- Which category dominates? (modal class)
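The question about a relationship between two variables can be given a number. One common choice, assumed here, is the Pearson correlation coefficient, which ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship). The helper `pearson_r` and the data are written for this sketch only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of the deviations from each mean.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations of each variable.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


hours_studied = [1, 2, 3, 4, 5]      # made-up data
marks = [52, 58, 65, 70, 78]

r = pearson_r(hours_studied, marks)
print(round(r, 3))  # 0.998, a strong positive relationship
```

A value close to +1 or -1 shows that the two variables move together; it does not by itself show that one causes the other.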
Common mistakes
- Confusing mean and median — mean is affected by extreme values (outliers); median is more robust.
- Incorrect median calculation — always SORT the data first before finding the median.
- Using a pie chart for non-proportional data — pie charts only make sense when parts sum to a meaningful whole (100%).
- Ignoring data types — applying numerical operations to categorical data leads to meaningless results.
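The first two mistakes are easy to demonstrate. In the sketch below, the salary figures are invented for illustration: adding one extreme value shifts the mean a great deal but barely moves the median, and taking the middle element of an unsorted list gives the wrong median.

```python
import statistics

salaries = [30, 32, 35, 31, 34]      # hypothetical values, in thousands
with_outlier = salaries + [400]      # the same data plus one extreme value

print(statistics.mean(salaries), statistics.median(salaries))
# 32.4 32

print(round(statistics.mean(with_outlier), 1), statistics.median(with_outlier))
# 93.7 33.0

# Mistake: taking the middle element without sorting first.
unsorted_middle = salaries[len(salaries) // 2]
print(unsorted_middle)               # 35, but the true median is 32
```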
Summary
Data is raw facts; information is processed data. Data types include quantitative (discrete/continuous) and qualitative (nominal/ordinal). Key statistical measures are mean, median, mode (central tendency) and range, variance, standard deviation (dispersion). Visualisation tools (bar, pie, line, histogram, scatter) make patterns visible. Python libraries like numpy, pandas, and matplotlib support data analysis programmatically.