Prepare and automate data for analysis

What we can do together:

  • Reduce data preparation time and better estimate these deadlines;
  • Transfer data preparation tasks from Excel to PowerQuery;
  • Help users understand the need for good data structure.

    Let’s get started!

“Mind the Gap”

Data access and preparation costs are almost always higher than estimated.

Preparing data for analysis is not a sexy task; it is a thankless, slow task full of exasperating detail. Because it is labor-intensive, any variation in data quality can have a significant impact on deadlines. It is common to overestimate the quality of data and therefore underestimate the cost of preparing it.

This “mind the gap” warns of the difference between data as we imagine it and data as it actually exists: already collected, transformed, and structured. Underestimating these tasks may cost no more than a few hours of extra work, but it can also derail an entire project (incompatible tables, data quality issues that require manual evaluation…).

From Excel to PowerQuery

*PowerQuery is essential for anyone working with data, enabling data transformations with an efficiency that Excel cannot match.*

Excel is a spreadsheet, and, as a spreadsheet, it is good enough for many tasks. Using its functions, more or less manually, it is possible to clean and transform data. But Excel is far from being an ETL tool (Extract, Transform, Load).

In a sense, PowerQuery and PowerBI are extensions of Excel: PowerQuery addresses the harder data preparation problems (collecting and processing a growing volume of data), while PowerBI addresses problems at the opposite end (large data volumes, report distribution). For an Excel user, the experience with PowerQuery makes us wonder how we could live so long without this tool, as it solves problems like:

  • Rearrange rows and columns (pivot/unpivot)
  • Filter tables, join tables by columns, add rows from one table to another
  • Multiple options for handling text, numbers, and dates
  • Calculated columns
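To make the first of these concrete, here is a minimal sketch in plain Python (not PowerQuery's M) of what an unpivot does conceptually: a wide table with dates spread across columns becomes a long table of one row per value. The table and column names are illustrative, not taken from any real workbook.

```python
# Conceptual sketch of PowerQuery's "Unpivot Columns" step in plain Python.
# A wide table with dates as columns becomes a long (product, date, value) table.
# All names here are illustrative.

wide = [
    {"product": "A", "2023-01": 10, "2023-02": 12},
    {"product": "B", "2023-01": 7,  "2023-02": 9},
]

def unpivot(rows, id_column):
    """Turn every non-id column into its own (date, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                long_rows.append({id_column: row[id_column],
                                  "date": column, "value": value})
    return long_rows

long_table = unpivot(wide, "product")
# long_table[0] == {"product": "A", "date": "2023-01", "value": 10}
```

In PowerQuery this is a single recorded step; the point of the sketch is only to show the reshaping that step performs.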

Record and repeat processes

*PowerQuery automates repeated processes, eliminating the time wasted on updates and on replicating steps across multiple files.*

Unlike data preparation in Excel, PowerQuery has the additional advantage of allowing steps to be recorded and reproduced in files of similar structure. Those hours of work every month preparing an update? Now it’s just time to press a button. Repeat the same steps on dozens or hundreds of files? Put them in a folder, and PowerQuery does it automatically.
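The "put them in a folder" pattern can be sketched in plain Python (again, not PowerQuery's M): one recorded transformation applied to every file found in a folder. The folder, file names, and columns below are invented for illustration; the sketch creates its own sample files so it is self-contained.

```python
# Sketch of PowerQuery's "combine files from a folder" pattern:
# apply one recorded transformation to every CSV in a folder.
# Paths, file names, and columns are illustrative.
import csv
import tempfile
from pathlib import Path

folder = Path(tempfile.mkdtemp())

# Two sample files with the same structure (stand-ins for monthly exports).
(folder / "jan.csv").write_text("region,sales\nnorth,10\nsouth,20\n")
(folder / "feb.csv").write_text("region,sales\nnorth,30\nsouth,40\n")

def transform(path):
    """The 'recorded steps': read, type the sales column, tag the source file."""
    with path.open(newline="") as f:
        return [{**row, "sales": int(row["sales"]), "source": path.name}
                for row in csv.DictReader(f)]

# One pass over the folder replaces repeating the steps by hand for each file.
combined = [row for p in sorted(folder.glob("*.csv")) for row in transform(p)]
```

Adding a third monthly file to the folder requires no new work: the same loop picks it up on the next run, which is exactly the button-press refresh PowerQuery gives Excel users.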

This automation is not entirely new, since it was already possible to record some tasks with macros. But macros were outside the comfort zone of many users. In PowerQuery it is possible to carry out sophisticated transformations of data files without looking under the hood and editing the M language used to record each step.

Side effects on Excel users

*PowerQuery makes Excel users more aware of the need for a better data structure.*

One of the most positive effects of using PowerQuery is not strictly technical: it encourages a better data management culture among Excel users. Excel’s flexibility is one of its advantages, but it becomes a disadvantage as the volume of data grows and its structure becomes more complex.

The traditional way of using Excel tends not to differentiate between storage and presentation. A data table is designed to be easy to read:

  • Dates in columns.
  • Headers and titles using multiple rows.
  • Blank rows and columns to group data.

Such a layout is hard to maintain and update, and it handcuffs us to a single perspective on the data.

By better understanding the need to separate data tables from their many potential clients (presentation tables, pivot tables, charts…), it is possible to create spreadsheets with less risk of error that are easier to handle and update, and that allow greater freedom of analysis.
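The storage-versus-presentation separation can be sketched in plain Python: one tidy table of records is the single source of truth, and the easy-to-read "dates in columns" layout is merely a view derived from it on demand. All names below are illustrative.

```python
# Sketch of separating storage from presentation: one tidy table of records,
# from which a "dates in columns" report view is derived on demand.
# Names are illustrative.

storage = [
    {"product": "A", "date": "2023-01", "value": 10},
    {"product": "A", "date": "2023-02", "value": 12},
    {"product": "B", "date": "2023-01", "value": 7},
]

def pivot_view(rows):
    """Derive the easy-to-read wide layout without ever storing it."""
    view = {}
    for r in rows:
        view.setdefault(r["product"], {})[r["date"]] = r["value"]
    return view

report = pivot_view(storage)
# report["A"] == {"2023-01": 10, "2023-02": 12}
```

Because the readable layout is computed rather than stored, a chart, a pivot table, and a presentation table can each take a different view of the same storage table without anyone retyping data.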