What Is Data Engineering?

6 years ago

Even if you don’t work within the field of data processing, you’ve probably heard about data engineers. You may have the vague idea that they somehow make sense of the enormous amount of data floating around in our world. But you may still be left with the question, “What is data engineering?”

Simply put, data engineering is a method of making all that data accessible to analysts and data scientists. It takes a giant pile of junk and turns it into something we can use. Read on to learn more about data engineering and why it’s so crucial to our modern world.

What Is Data Engineering?

In our Information Age world, being able to manage, track, and interpret data is crucial. It allows us to do everything from checking up on old friends on Facebook to calling a ride through Uber. Data scientists can also help companies better understand their customers and how to serve their needs.

But while data scientists are up front working with those numbers, data engineers are in the background making that analysis possible. Data engineers build the systems that allow data to be moved and stored so it can be analyzed. If data is running water in your house, data engineers are the plumbers who installed the pipelines that bring that water in and out of your home.

The Hierarchy of Analytics

There are a few different levels of data engineering that build on each other, like a pyramid. In order to have the highest level of data engineering and science happening (artificial intelligence and deep learning), you have to build the foundation levels first.

The most basic level of data science and engineering is collection, followed by movement and storage. Then you can begin to explore and transform the data before you aggregate and label it. Finally, you can begin to learn from and optimize the data, and at the highest levels, you can begin teaching computers to do that work for you.

Building Data Warehouses

One of the first things data engineers have to do when they’re setting up a system is build data warehouses. Like real warehouses, these function to store data in organized systems. This makes it much easier to manage data and to scale systems up as a company or research project grows.

A modern computer can carry out billions of operations per second, but when it comes to sorting through data, that speed may not be enough. Without proper warehousing, trying to get a computer to find a specific piece of data is like trying to find one specific grain of sand on a mile-long beach. Data warehousing lets the computer go directly to where that grain of sand is located rather than having to check each individual grain.

Building and Maintaining Data Pipelines

Data pipelines are how that data gets moved around once it’s found. Think of them as conveyor belts in the data warehouses. Once the computer finds the piece of data it’s looking for, it drops it on a “conveyor belt” to head back to the location it needs to go to.

There are three basic steps in a data pipeline: extract, transform, and load. In the extract step, the pipeline picks up data from an upstream source, such as a sensor, an application log, or another database, and moves it to a location where it can be transformed. There, the system transforms that data into something that can be read and used. Finally, it loads the result into its destination, such as a table in the data warehouse.
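The three steps can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the raw CSV text, the field names, and the in-memory list standing in for a warehouse table are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a log, API, or sensor feed.
RAW_CSV = """order_id,item,price_cents
1,hamburger,499
2,Chicken Nuggets ,599
3,hamburger,499
"""

def extract(raw_text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: convert types and clean values so the rows are usable."""
    return [
        {
            "order_id": int(row["order_id"]),
            "item": row["item"].strip().lower(),
            "price_dollars": int(row["price_cents"]) / 100,
        }
        for row in rows
    ]

def load(rows, destination):
    """Load: write the cleaned rows to the destination (a list stands in for a table)."""
    destination.extend(rows)
    return len(rows)

warehouse_table = []
loaded_count = load(transform(extract(RAW_CSV)), warehouse_table)
```

Keeping each step in its own function means one step can be tested or swapped out without touching the other two.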

Choosing Frameworks

There are different frameworks you can use to manage this extract, transform, load (ETL) process. There are a few factors companies need to consider when they’re selecting their ETL frameworks.

The first thing you’ll want to look at is the configuration; you want a configuration that will meet your needs. You also want to make sure your framework will monitor and alert you when problems arise. You should also check in on how your framework processes historical data.

Different Paradigm Options

There are two different paradigm options that you can choose from for your ETL: JVM-centric and SQL-centric. JVM-centric ETLs are built in languages like Java and Scala. Engineers often prefer this option because transformations are written imperatively, step by step, which gives them more direct control over how the data is processed.

SQL-centric ETLs are defined in more declarative ways than JVM models. They are centered around SQL and tables, and data scientists like this model because SQL is much easier to learn than Java or Scala. That allows you to focus your time on the actual data rather than on the programming language used to process it.
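To make the imperative versus declarative difference concrete, here is a rough sketch that computes the same result both ways. Python stands in for the JVM language on the imperative side, and SQLite (through Python’s built-in sqlite3 module) stands in for the SQL engine; the sample orders are invented for the example.

```python
import sqlite3

# Hypothetical sample data: (item, price_cents)
orders = [("hamburger", 499), ("nuggets", 599), ("hamburger", 499)]

# Imperative style: spell out each step of the computation yourself.
revenue_imperative = {}
for item, price_cents in orders:
    revenue_imperative[item] = revenue_imperative.get(item, 0) + price_cents

# Declarative style: describe the result you want and let the SQL engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT, price_cents INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
revenue_declarative = dict(
    conn.execute("SELECT item, SUM(price_cents) FROM orders GROUP BY item")
)
conn.close()
```

Both approaches arrive at the same totals. The trade-off is how much of the "how" you write yourself.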

Data Modeling and Normalization

When you go to set up a database, you need to think about the kind of information you’re going to need from that table further down the line. For example, if you’re gathering names in a data set, you may not think to gather the first and last names in separate columns. That becomes a problem when you decide you want to organize your list of people in alphabetical order by last name.

Data modeling helps you design a system that will be able to generate the reports you need from the data you have. One factor to consider during this process is whether you want your data tables normalized (each fact stored once, split across several smaller related tables) or denormalized (related data combined into fewer, wider tables). Normalized data has less redundancy and is easier to keep consistent, but querying it takes more joins; denormalized data is quicker to query and can provide a wider view into whatever you’re analyzing.

Fact and Dimension Tables

It is possible to build denormalized data tables from fact tables and dimension tables, two smaller normalized tables. Fact tables are simple tables that contain point-in-time transactional data. So when you buy a hamburger at a fast-food restaurant, their system might record the time and cost of your transaction.

Dimension tables hold the descriptive attributes of specific entities, such as customers or menu items, and those attributes tend to change slowly over time. So going back to our fast-food example, a dimension table might record each menu item’s name, category, and current price. Joining a fact table to its dimension tables on a shared key is how you answer questions such as whether you used to order a hamburger every week and now order chicken nuggets instead.
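A small sketch of the fast-food example shows how a fact table and a dimension table join into one denormalized view. The table names, columns, and rows below are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes of each menu item.
    CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Fact table: one row per transaction, pointing at the dimension by key.
    CREATE TABLE fact_order (
        order_id INTEGER PRIMARY KEY,
        item_id INTEGER REFERENCES dim_item(item_id),
        ordered_at TEXT,
        price_cents INTEGER
    );
    INSERT INTO dim_item VALUES (1, 'hamburger', 'sandwich'), (2, 'chicken nuggets', 'sides');
    INSERT INTO fact_order VALUES
        (100, 1, '2020-03-09 12:01', 499),
        (101, 2, '2020-03-09 12:05', 599),
        (102, 1, '2020-03-10 18:30', 499);
""")

# Joining the fact table to the dimension table yields a wide, denormalized view.
denormalized = conn.execute("""
    SELECT f.order_id, f.ordered_at, d.name, d.category, f.price_cents
    FROM fact_order AS f
    JOIN dim_item AS d ON d.item_id = f.item_id
    ORDER BY f.order_id
""").fetchall()
conn.close()
```

Each row of the result carries both the transaction details and the item’s descriptive attributes, so an analyst can read it without writing a join.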

Data Partitioning

One great way to improve the efficiency of your database is to partition your data. This is a way of chopping up large data sets into chunks so you don’t have to manage the whole unwieldy system at once.

Think of it as having file folders in a filing cabinet drawer. Instead of having to pull out the whole drawer to sort through the information, you can lift out one folder.

One common way to partition data is by datestamp. New date partitions may be created for each daily run of data, so you can go look at what the data for a particular day was. Our fast-food restaurant could see how many hamburgers were purchased worldwide on March 9 of the previous year.
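As a rough illustration of the filing-cabinet idea, the sketch below keeps one "folder" per datestamp and answers a question by opening only the folder it needs. The order records and the count_item helper are invented for the example; a real warehouse would do this with partitioned tables rather than a Python dictionary.

```python
from collections import defaultdict

# Hypothetical order records: (datestamp, item)
orders = [
    ("2020-03-09", "hamburger"),
    ("2020-03-09", "hamburger"),
    ("2020-03-09", "chicken nuggets"),
    ("2020-03-10", "hamburger"),
]

# Build one partition per datestamp, like one folder per day.
partitions = defaultdict(list)
for ds, item in orders:
    partitions[ds].append(item)

def count_item(ds, item):
    """Count an item by reading only the partition for one day."""
    return partitions.get(ds, []).count(item)

hamburgers_on_march_9 = count_item("2020-03-09", "hamburger")
```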

Backfilling Historical Data

One of the challenges of setting up a new data system is that you want to be able to look at all the data you’ve gathered previously, not just what you have moving forward. But it becomes difficult to organize that data with your new system; this is where backfilling historical data comes in. Backfilling allows you to take existing data and sort it into the new system so it can be accessed as easily and quickly as the new data.

Datestamps can be extremely helpful here, too. The new system can go through the old data and partition it by what date it was gathered on. From there, it becomes much easier to sort that data out into other subcategories. You can also use dynamic partitions to perform multiple insertions at once so backfilling becomes even quicker and easier.
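The idea behind dynamic partitions is that the partition for each row is worked out from the row itself, so one pass over the old data can fill many date partitions at once. The sketch below imitates that idea in plain Python; it is not any particular warehouse’s syntax, and the legacy timestamp format and records are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical legacy records with timestamps in an older format.
legacy_orders = [
    ("03/09/2019 12:01", "hamburger"),
    ("03/09/2019 12:05", "chicken nuggets"),
    ("03/10/2019 18:30", "hamburger"),
]

def backfill(records):
    """Route every legacy record to its date partition in a single pass."""
    partitions = defaultdict(list)
    for raw_timestamp, item in records:
        parsed = datetime.strptime(raw_timestamp, "%m/%d/%Y %H:%M")
        ds = parsed.strftime("%Y-%m-%d")  # partition key derived from the data itself
        partitions[ds].append(item)
    return dict(partitions)

backfilled = backfill(legacy_orders)
```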

From Pipelines to Frameworks

Pipelines are a crucial piece of data engineering, but they get extremely complicated in very short order. If you’re using single, standalone pipelines, you’ll have to construct a new pipe for every function you want to perform in your database. This would be a little like running an individual pipe from a water tower to your house and another one to your neighbor’s house and another one to their neighbor’s house, as well as individual sewer lines running to the sewage treatment plant.

Data frameworks allow you to generate data pipelines and directed acyclic graphs on the fly. This can automate data workflows, allowing you to manage your data much more efficiently. Rather than performing the same tasks over and over again, you find patterns that can be automated and delegate those tasks to your computer.
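A directed acyclic graph is just a set of tasks plus the rule that each task runs only after the tasks it depends on. As a minimal sketch, Python’s standard graphlib module (Python 3.9 and later) can work out a valid run order; the task names here are hypothetical, and a real workflow framework would also schedule, retry, and monitor the tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract_orders": set(),
    "extract_menu": set(),
    "transform_orders": {"extract_orders"},
    "join_orders_menu": {"transform_orders", "extract_menu"},
    "load_report": {"join_orders_menu"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
run_order = list(TopologicalSorter(workflow).static_order())
```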

Incremental Computation Framework

There are several different frameworks you can use during data engineering. Incremental computation frameworks can allow you to look at information like how many customers ever engaged with a new product without having to look at your entire database of all the interactions every customer has ever had with your company.

In an incremental computation framework, a script builds each day’s summary table by combining the summary table from the previous datestamped partition with the new data from the current day. It updates the expensive metrics and produces a table where those metrics can be queried from a single date partition of the summary table.
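Here is a small sketch of that pattern, assuming a summary keyed by customer with a first-seen date and an interaction count. The customer names, event data, and the update_summary function are all made up for illustration.

```python
def update_summary(previous_summary, todays_events, ds):
    """Combine the previous summary partition with today's events only."""
    # Copy the previous partition so earlier summaries stay unchanged.
    summary = {cust: dict(stats) for cust, stats in previous_summary.items()}
    for customer_id in todays_events:
        stats = summary.setdefault(customer_id, {"first_seen": ds, "interactions": 0})
        stats["interactions"] += 1
    return summary

# Hypothetical daily event partitions: customer ids that engaged with the product.
events_by_day = {
    "2020-03-09": ["alice", "bob"],
    "2020-03-10": ["bob", "carol", "carol"],
}

summaries = {}
previous = {}
for ds in sorted(events_by_day):
    previous = update_summary(previous, events_by_day[ds], ds)
    summaries[ds] = previous

# "How many customers ever engaged?" is answered from the latest partition alone.
ever_engaged = len(summaries["2020-03-10"])
```

Each day’s run touches only the previous summary and one day of events, no matter how many years of history sit behind them.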

Backfill Framework

Even with multiple partition insertions running at once, backfilling can still be a cumbersome process. You’re taking potentially years’ worth of data and trying to wrangle it into a new organization system. A backfill framework can automate those workflows and make the process even simpler.

Users start by specifying how many parallel processes the backfill should use and how many days each process should cover. Then the framework creates a pipeline that will parallelize those backfill tasks, perform sanity checks, and swap staging tables with production tables. At the end of the process, you’ll have a fully backfilled table ready to go.
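The workflow above might look something like the following sketch. The function names, the thread pool, and the simple completeness check are assumptions chosen to keep the example short; the placeholder backfill_chunk only returns datestamps where a real task would rewrite partitions.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def split_date_range(start, end, days_per_task):
    """Split [start, end] into consecutive chunks of at most days_per_task days."""
    chunks = []
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=days_per_task - 1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks

def backfill_chunk(chunk):
    """Placeholder for the real work: return the datestamps this task filled."""
    start, end = chunk
    return [(start + timedelta(days=i)).isoformat() for i in range((end - start).days + 1)]

def run_backfill(start, end, num_workers, days_per_task):
    chunks = split_date_range(start, end, days_per_task)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(backfill_chunk, chunks)  # results keep chunk order
        staging = [ds for filled in results for ds in filled]
    # Sanity check before the swap: every day in the range appears exactly once.
    expected_days = (end - start).days + 1
    if len(staging) != expected_days or len(set(staging)) != expected_days:
        raise ValueError("backfill is incomplete; keeping the existing production table")
    return staging  # stands in for swapping the staging table into production

production = run_backfill(date(2020, 3, 1), date(2020, 3, 10), num_workers=3, days_per_task=4)
```

Writing to a staging table first means a failed sanity check leaves the production table untouched.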

Global Metrics Framework

In many businesses, different parts of the company need different key performance metrics. Most of the time, these groups need a large number of fact tables joined to a much smaller number of dimension tables. You can create a global metrics framework that will make it easier for everyone to access the specific combination of tables they need.

You start by inputting a number of metrics from an atomic fact table, dimensions you want in your final table, primary and foreign keys to be used for joins, and more. Then the framework automatically generates denormalized tables created from the appropriate dimension and atomic fact tables.
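One way to picture this is a function that turns a short specification into a join query. The sketch below is a simplified guess at the shape of such a framework, not a description of any real one: build_metrics_query, the table names, and the sample rows are all hypothetical, and SQLite is used only to show the generated query running.

```python
import sqlite3

def build_metrics_query(fact_table, metrics, dimension_table, dimensions, join_key):
    """Generate a denormalized metrics query from a declarative specification.

    Identifiers are interpolated directly, so a real framework would need to
    validate them against a known schema first.
    """
    dim_cols = ", ".join(f"d.{col}" for col in dimensions)
    metric_cols = ", ".join(f"{expr} AS {name}" for name, expr in metrics.items())
    return (
        f"SELECT {dim_cols}, {metric_cols} "
        f"FROM {fact_table} AS f "
        f"JOIN {dimension_table} AS d ON f.{join_key} = d.{join_key} "
        f"GROUP BY {dim_cols} ORDER BY {dim_cols}"
    )

# Hypothetical specification a team might submit.
query = build_metrics_query(
    fact_table="fact_order",
    metrics={"order_count": "COUNT(*)", "revenue_cents": "SUM(f.price_cents)"},
    dimension_table="dim_item",
    dimensions=["category"],
    join_key="item_id",
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_order (order_id INTEGER PRIMARY KEY, item_id INTEGER, price_cents INTEGER);
    INSERT INTO dim_item VALUES (1, 'sandwich'), (2, 'sides');
    INSERT INTO fact_order VALUES (100, 1, 499), (101, 2, 599), (102, 1, 499);
""")
metrics_table = conn.execute(query).fetchall()
conn.close()
```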

Experimentation Reporting Framework

The key to growing a company is to experiment with different ideas, but you need to know which ideas work and which don’t. This is where an experimentation reporting framework comes in. These frameworks can allow data scientists to run hundreds or thousands of experiments without each experiment needing one dedicated scientist.

Users start by specifying what kind of experiment they want to run, which metrics to track, and any other relevant information about the experiment. The framework then computes the specified metrics and the corresponding dimensions. From there, depending on how complex your framework is, it might do some downstream processing to make the data more manageable.
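In miniature, that could look like the sketch below: a specification names the metrics and the dimension to break them out by, and one generic function produces the report. The experiment name, metric definitions, and observations are invented for the example, and a real framework would also handle statistical testing, which is left out here.

```python
from collections import defaultdict

# Hypothetical experiment spec: which metrics to compute, broken out by which dimension.
experiment = {
    "name": "new_checkout_button",
    "metrics": {
        "conversion_rate": lambda rows: sum(r["converted"] for r in rows) / len(rows),
        "avg_spend": lambda rows: sum(r["spend"] for r in rows) / len(rows),
    },
    "dimension": "variant",
}

# Hypothetical per-user observations logged during the experiment.
observations = [
    {"variant": "control", "converted": 0, "spend": 0.0},
    {"variant": "control", "converted": 1, "spend": 8.0},
    {"variant": "treatment", "converted": 1, "spend": 10.0},
    {"variant": "treatment", "converted": 1, "spend": 6.0},
]

def build_report(spec, rows):
    """Compute every requested metric for each value of the chosen dimension."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[spec["dimension"]]].append(row)
    return {
        group: {name: fn(members) for name, fn in spec["metrics"].items()}
        for group, members in groups.items()
    }

report = build_report(experiment, observations)
```

Because the report logic is generic, adding another experiment means writing a new specification rather than new analysis code.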

Learn More About Data Engineering

These days, everything in our world runs on data, from the kinds of ads we see to the way we use apps on our phones to even our medical records. None of this data can go anywhere without data engineering. Data engineering makes it possible for us to store, transport, and interpret the massive amount of data circulating in our world.

If you’d like to answer more of the question, “What is data engineering?” check out the rest of our site at Boost Labs. We love creating ways to view data, interact with it, and share it in ways that empower our clients and their stakeholders. Learn more about our services and start making the best use of your data today.