How to Approach Data Analysis

What is Data Analysis?

Data analysis transforms data into information by placing the data in context. For an overview, start by skimming the “Data analysis” article on Wikipedia.

Alternatively, the chapter titled “The Scope of Statistics” in the classic textbook Statistical Methods in Medical Research (Armitage, Berry et al. 2002) provides a gentle introduction to data structures, diagrams, tabulation and data processing, summarizing numerical data, means and other measures of location, and measures of variation. There are many free online tutorials that help with key concepts in statistics and data visualization.
For further information, Intuitive Biostatistics (Motulsky 2018) is a very useful introductory reference book on biostatistics written for non-statisticians.

How do I start?

Researchers who do not have considerable biostatistical training are very strongly advised to engage qualified biostatistical advice from the planning stage and then throughout any research project. This will help you choose the most appropriate data analysis technique for your particular research question. To proceed without sufficient knowledge is to risk a poor research result (Altman 1994; Lee, Moreno-Betancur et al. 2019). Finding such statistical support may not be easy: ask your senior colleagues and check whether your hospital department has a university affiliation.

The consulting statistician will be pleased if you have an idea of the question(s) you are exploring, and of the size and type of data you might collect and present to generate evidence. Try to be as specific as possible when choosing the primary and secondary outcomes within your research plan. Adapting the output of the freely downloadable interactive power and sample size calculator PS (Dupont and Plummer 2018) may help you formulate the wording of your research hypotheses.
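To give a feel for the inputs a sample size calculation asks for, the following is a minimal Python sketch of the standard normal-approximation formula for comparing two group means. It is not the method implemented in the PS program; the function name and the example values (a difference of 5 units and a standard deviation of 10) are hypothetical, and any real calculation should be confirmed with a statistician.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two means.

    Normal approximation: n = 2 * (z_(1 - alpha/2) + z_power)^2 * sd^2 / delta^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sd ** 2 / delta ** 2
    return math.ceil(n)


# Hypothetical inputs: detect a 5-unit difference when the outcome SD is 10.
n_example = n_per_group(delta=5, sd=10)
print(n_example)
```

The sketch shows why specific outcomes matter: the calculation cannot be run until the size of the difference of interest, the expected variability, the significance level and the desired power have all been stated.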

In the design phase of research, internationally recommended checklists of the items to be included in scientific reports may help to identify necessary or desirable data elements that were initially overlooked. These lists are available for a wide range of research types, including randomised trials, observational studies, diagnostic/prognostic studies and qualitative research (the EQUATOR Network: Enhancing the QUAlity and Transparency Of health Research). A rapid browse through another international guide, Preparing a Manuscript for Submission to a Medical Journal, might also help to consolidate the overall data structure and analysis plan.

How to Organise your Data

Most data for analysis form a rectangular table containing numeric values (numbers such as -4, 107 or 3.625) and “string” values (text such as “B”, “NZ” or “Jane”). Each row holds the values of all variables for one observation unit (usually one person), and each column holds the values of one variable across all observation units (people). A variation on this layout is described below as the “long” format.

Analysis is easier if the collected data are already “cleaned”, i.e., free of errors or inconsistencies in the data fields, and with short but logical column headings. The appropriate rounding of numeric data requires consideration (Barnett 2018). Upper and lower case characters are often not interchangeable when data are analysed, so “b” and “B” are not the same observation. Likewise, “ICU” may not be equivalent to “iCU”.
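The effect of inconsistent case can be seen in a short Python sketch. The raw entries and the function name below are hypothetical; the point is only that software counts each spelling as a separate category until the text is standardised.

```python
def clean_code(value):
    """Strip surrounding spaces and convert a text code to upper case."""
    return value.strip().upper()


# Hypothetical raw entries typed by different people.
raw_values = ["ICU", "iCU", " icu ", "B", "b"]

cleaned = [clean_code(v) for v in raw_values]

distinct_before = sorted(set(raw_values))
distinct_after = sorted(set(cleaned))

print(len(distinct_before), distinct_before)
print(len(distinct_after), distinct_after)
```

Standardising in this way is only safe when upper and lower case are not meant to carry different meanings, which is something the data dictionary should state.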

Make a data dictionary, which could be a simple text document, in which all data column headings are listed and explained in terms of the data type contained and their units of measurement. Carefully define any underlying codes. For example, the column labelled “TEST” contains only two codes, I = intervention and C = control; and the column labelled “FRAIL” records baseline clinical frailty within a 9-point ordinal scale ranging from 1 = “very fit” to 9 = “terminally ill”.
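A data dictionary can also be written in a form that software can check against the data. The following Python sketch encodes the two example columns described above; the layout, the description text and the function name are illustrative assumptions, not a standard.

```python
# Illustrative data dictionary for the TEST and FRAIL columns.
data_dictionary = {
    "TEST": {
        "description": "Allocated study group",  # hypothetical wording
        "type": "string",
        "allowed": {"I", "C"},
        "labels": {"I": "intervention", "C": "control"},
    },
    "FRAIL": {
        "description": "Baseline clinical frailty, 9-point ordinal scale",
        "type": "integer",
        "allowed": set(range(1, 10)),
        "labels": {1: "very fit", 9: "terminally ill"},
    },
}


def invalid_values(column, values):
    """Return the non-missing values that are not permitted for this column."""
    allowed = data_dictionary[column]["allowed"]
    return [v for v in values if v is not None and v not in allowed]


# Hypothetical entries: "c" (lower case) and 10 fall outside the defined codes.
print(invalid_values("TEST", ["I", "C", "c", None]))
print(invalid_values("FRAIL", [1, 5, 9, 10]))
```

Missing values (shown here as None) are deliberately not reported as invalid, so that they can be handled separately by the analyst.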

Spreadsheet programs such as Microsoft Excel or the free LibreOffice Calc are often used to organise and codify data in rows and columns. These programs can also generate summary statistics such as means, standard deviations, medians, quartiles, ranges and the proportions of various categories, which might be sufficient for presentations of simple data. Missing data are best left as an empty spreadsheet cell, which the analyst can recode later if needed.
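The same summary statistics can be produced outside a spreadsheet. The following Python sketch uses only the standard library and a hypothetical column of ages in which one cell was left empty. Quartile definitions differ between programs, so the quartiles reported here may not match those from a spreadsheet exactly.

```python
import statistics

# Hypothetical column as exported from a spreadsheet; "" marks an empty cell.
raw_ages = ["40", "50", "", "22", "61"]

# Keep only the non-missing cells and convert the text to numbers.
ages = [float(v) for v in raw_ages if v.strip() != ""]

summary = {
    "n": len(ages),
    "missing": len(raw_ages) - len(ages),
    "mean": statistics.mean(ages),
    "sd": statistics.stdev(ages),  # sample standard deviation
    "median": statistics.median(ages),
    "quartiles": statistics.quantiles(ages, n=4),
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(name, value)
```

Reporting the number of missing values alongside each summary makes it clear how many observations each statistic is based on.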

Data held in an Excel spreadsheet (filename.xlsx) can be saved or exported in several “plain” text formats for easy import into dedicated analytical software. The most common of these delimited text formats separates each field with the comma character (,), producing a comma-separated values file (filename.csv).
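As a minimal sketch of what importing such a file involves, the following Python example reads comma-separated text with the standard csv module. The column names and values are fictional, and the text is held in a string so that the example runs without a file on disk.

```python
import csv
import io

# Stand-in for the contents of "filename.csv"; the values are fictional,
# and the outcome for person B has deliberately been left empty.
csv_text = """id,person,age,day1
1,A,40,10
2,B,50,
3,C,22,20
"""

# For a real file, replace io.StringIO(csv_text) with open("filename.csv", newline="").
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    # Every field arrives as text, so numeric columns must be converted.
    row["age"] = int(row["age"])
    # An empty cell arrives as "" and is kept as missing (None).
    row["day1"] = float(row["day1"]) if row["day1"] else None

for row in rows:
    print(row)
```

Note that the empty cell is preserved as a missing value rather than being silently converted to zero.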

In the early research planning stage, it is very helpful to draft the data spreadsheet that will be used for the project. For example, measuring the primary outcome once only for each individual (e.g., survival to day 30 post randomization) will require only one outcome column for all rows (patients), containing only two symbols for the two possible values (e.g., “0” for a non-survivor and “1” for a survivor). A different project with repeated measurements collected within each individual over time will call for additional columns for each time point. This can become complex (e.g., 48 columns for serial outcome data if measuring the highest FiO2 hourly for two days).
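When many repeated measurements are planned, the column headings can be generated rather than typed by hand. The following Python sketch builds the 48 hourly headings for the FiO2 example; the naming pattern is a hypothetical choice.

```python
# Hypothetical column names for the highest FiO2 recorded hourly over two days.
HOURS_PER_DAY = 24
DAYS = 2

outcome_columns = [
    f"fio2_h{hour:02d}" for hour in range(1, HOURS_PER_DAY * DAYS + 1)
]
header = ["id"] + outcome_columns

print(len(outcome_columns))
print(header[:4])
```

Zero-padded hour numbers keep the columns in the correct order when they are sorted alphabetically.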

As an illustration, the following fictional data show observations of the outcome variable of interest captured on two days per person.

The first format (Table 1) is “wide” (one row per person), which is the usual format of an analytic dataset for simple studies. In this example, the serial observations of a hypothetical numeric outcome variable on each of two days are recorded in adjacent columns. Table 2 shows the equivalent “long” format, which is sometimes created within analytical software. Here each row represents one observation of one person at a specific time, so each person has as many rows as they have observations.

Table 1 Fictional data in “wide” format, with observations of a numeric outcome of interest on two separate days for each of three patients

id number   person   age   day1   day2
1           A        40    10     12
2           B        50    3      4
3           C        22    20     10

Table 2 The equivalent “long” format of the “wide” format data in Table 1

id number   day   person   age   observation
1           1     A        40    10
1           2     A        40    12
2           1     B        50    3
2           2     B        50    4
3           1     C        22    20
3           2     C        22    10
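
The conversion from Table 1 to Table 2 can be sketched in a few lines of Python. This is an illustration only: dedicated analytical software usually provides its own reshaping commands, and the function name here is hypothetical.

```python
# The "wide" data from Table 1, one dictionary per person.
wide = [
    {"id": 1, "person": "A", "age": 40, "day1": 10, "day2": 12},
    {"id": 2, "person": "B", "age": 50, "day1": 3, "day2": 4},
    {"id": 3, "person": "C", "age": 22, "day1": 20, "day2": 10},
]


def wide_to_long(rows, day_columns):
    """Return one row per person per day, repeating the person-level values."""
    long_rows = []
    for row in rows:
        for day_number, column in enumerate(day_columns, start=1):
            long_rows.append({
                "id": row["id"],
                "day": day_number,
                "person": row["person"],
                "age": row["age"],
                "observation": row[column],
            })
    return long_rows


long_data = wide_to_long(wide, ["day1", "day2"])
for row in long_data:
    print(row)
```

The person-level values (person and age) are repeated on every row, while the two outcome columns collapse into a single observation column indexed by day.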

Back up your data frequently. Save a separate, distinctly named file at every important step of data cleaning and data recoding. If you discover a mistake later, you can then easily return to a dataset saved before the error and move forward quickly.
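
One way to keep such step-by-step copies is sketched below in Python, using only the standard library. The function name, the folder name "backups" and the file naming pattern are hypothetical choices, and a copy on the same disk does not replace a proper backup to a separate, secure location.

```python
import shutil
from datetime import datetime
from pathlib import Path


def save_snapshot(data_file, step_label, backup_dir="backups"):
    """Copy data_file into backup_dir under a name recording the step and time."""
    source = Path(data_file)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Example result: trial_data_after_cleaning_20240131_142500.csv
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = target_dir / f"{source.stem}_{step_label}_{stamp}{source.suffix}"

    shutil.copy2(source, target)  # copies the contents and the file timestamps
    return target


# Example call (the file name is hypothetical):
# save_snapshot("trial_data.csv", "after_cleaning")
```

Because each copy carries the step name and a timestamp, earlier versions are never overwritten and the order of the cleaning steps remains visible in the folder.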

What Software do I use?

There are many choices of data analytic software. These will often be decided by your statistical advisor as some programs require substantial specialist experience.

However, there are more accessible options, at least for initial exploratory analyses. Your university or hospital might have an annual subscription to GraphPad Prism, an excellent and easy-to-use scientific graphing program that also provides some data manipulation options and basic statistical analyses. Prism also provides useful free advice on some principles of statistics and data management.

Beyond the calculations offered by spreadsheet programs like Excel (as mentioned above) and user-friendly packages like Prism, there are numerous higher-level data and statistical analysis programs. These comprise commercial packages such as Stata, SAS and SPSS, and open-source options such as Python and R, the latter often used with the free RStudio interface.

The very powerful features of R as a system for statistical computation and graphics have led to it fast becoming an international standard, but the manuals and help files (see the CRAN manuals at r-project.org) tend to be too technical for beginners. Remember, avoiding errors in the scientific analysis of data may require substantial knowledge, so it is important for most early researchers to seek informed advice rather than naively using analytical software or published statistical analysis code.



References and Resources

  1. Altman, D. G. (1994). "The scandal of poor medical research." BMJ 308 (6924): 283-284.
  2. Armitage, P., G. Berry and J. N. S. Matthews (2002). Statistical methods in medical research. Oxford, Blackwell Scientific Publications.
  3. Barnett, A. G. (2018). "Missing the point: are journals using the ideal number of decimal places?" F1000Res 7: 450.
  4. Dupont, W. D. and W. D. Plummer, Jr. (2018). PS: Power and Sample Size Calculation, https://biostat.app.vumc.org/wiki/Main/PowerSampleSize.
  5. Lee, K. J., M. Moreno-Betancur, J. Kasza, I. C. Marschner, A. G. Barnett and J. B. Carlin (2019). "Biostatistics: a fundamental discipline at the core of modern health data science." Med J Aust 211(10): 444-446.e1.
  6. Motulsky, H. (2018). Intuitive biostatistics: a nonmathematical guide to statistical thinking. New York, Oxford University Press.

Section Author

Associate Professor Jeffrey Presneill
Jeff is an ICU physician at Royal Melbourne Hospital. He believes biostatisticians are the missing specialist service in teaching hospitals. Uncertainty with analyses of research data led Jeff to a Biostatistics Collaboration of Australia postgraduate study program. For relaxation, Jeff would choose to read a good biostatistics textbook, drive a sports car, or sail a boat.

Trainee Reviewer

Dr Ariel Ho
Ariel is an ICU trainee currently based on the Gold Coast. She enjoys scuba diving and putting the try in triathlon.