A valuable asset for anyone looking to break into the Data Engineering field is understanding the different types of data and the Data Pipeline.
How do you become a better data leader, one that data engineers love?
Meet The Entrepreneur: Alon Lev, CEO, Qwak
This is the first completed webinar of our “Great Expectations 101” series. The goal of this webinar is to show you what it takes to deploy and run Great Expectations successfully.
Do we need a radical new approach to data warehouse technology? An immutable data warehouse starts with the data consumer SLAs and pipes data in pre-modeled.
Processing large datasets, e.g. for cleansing, aggregation, or filtering, is blazingly fast with the Polars DataFrame library in Python, thanks to its design.
In this article, I will talk about how I improved overall data processing efficiency by optimizing the choice and usage of data warehouses.
Self-serve systems are a big priority for data leaders, but what exactly does it mean? And is it more trouble than it's worth?
Noom helps you lose weight. We help you get a job at Noom. In today’s article, we’ll show you one of Noom’s hard SQL interview questions.
Location-based information makes the field of geospatial analytics so popular today. Collecting useful data requires some unique tools covered in this blog.
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
The worst nightmare of analytics managers is accidentally blowing up the data warehouse cost. How can we avoid receiving unexpectedly expensive bills?
Migrating from Convox to Nomad, and some AWS performance issues we uncovered along the way thanks to Datadog.
This is not really an article, but rather some notes about how we use dbt in our team.
Too lazy to scrape NLP data yourself? In this post, I’ll show you a quick way to scrape NLP datasets using YouTube and Python.
Write efficient and flexible data-pipelines in Python that generalise to changing requirements.
In this post, I discuss the nested loop join, hash join, and merge join algorithms in Python.
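The three join strategies named above can be sketched in a few lines each. This is a minimal illustration, not the article's own code: the function names, the list-of-dicts row format, and the single shared `key` column are all assumptions made for the example, and only inner joins are covered.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Build a hash table on one input, then probe it with the other."""
    buckets = defaultdict(list)
    for r in right:
        buckets[r[key]].append(r)
    return [{**l, **r} for l in left for r in buckets.get(l[key], [])]


def merge_join(left, right, key):
    """Sort both inputs on the key, then advance two cursors in step."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ...and pair it with every left row sharing the same key.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical sample data for illustration only.
users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}, {"id": 3, "name": "Sam"}]
orders = [{"id": 1, "item": "disk"}, {"id": 1, "item": "cable"}, {"id": 3, "item": "fan"}]
```

All three functions return the same matched rows for the sample data; they differ in cost. The nested loop does quadratic work, the hash join trades memory for a single pass over each input, and the merge join pays for sorting unless the inputs are already ordered.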
Implementing tracking code based on an outdated version of your organization's data plan can result in time-consuming debugging, dirty data pipelines, an
Best practices for building a data team at a hypergrowth startup, from hiring your first data engineer to IPO.
Ask anyone in the data industry what’s hot and chances are “data mesh” will rise to the top of the list. But what is a data mesh and is it right for you?
Here are six important steps for setting goals for data teams.
In this article, we cover how to use pipeline patterns in Python data engineering projects. Create a functional pipeline, install fastcore, and other steps.
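As a rough idea of what a functional pipeline looks like, here is a minimal sketch that uses only the standard library rather than fastcore. The `pipeline` helper and the three step functions are hypothetical names chosen for illustration, and the row format is an assumption.

```python
from functools import reduce


def pipeline(*steps):
    """Compose steps left to right into a single callable."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def drop_missing(rows):
    # Discard rows whose amount is absent or None.
    return [r for r in rows if r.get("amount") is not None]


def to_cents(rows):
    # Convert amounts to integer cents.
    return [{**r, "amount": round(r["amount"] * 100)} for r in rows]


def total(rows):
    # Sum the amounts of all remaining rows.
    return sum(r["amount"] for r in rows)


clean_and_sum = pipeline(drop_missing, to_cents, total)
```

Because each step is a plain function that takes data and returns data, steps can be tested in isolation and reordered or swapped without touching the others.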
Data teams come in all different shapes and sizes. How do you build data observability into your pipeline in a way that suits your team structure? Read on.
Bridging the gap between Application Developers and Data Scientists, the demand for Data Engineers rose by up to 50% in 2020, especially due to an increase in investments in AI-based SaaS products.
With each day, enterprises increasingly rely on data to make decisions.
Predictive Modeling in Data Science is more like the answer to the question “What is going to happen in the future, based on known past behaviors?”
80% of Issues Aren't Caught by Testing Alone: Build Your Data Reliability Stack to Reduce Downtime
After speaking to hundreds of teams, I discovered ~80% of data issues aren’t covered by testing alone. Here are 4 layers to building a data reliability stack.
2021 Noonies Nominee General Interview with Veronika. Read for more on cloud services, data engineering, and Python.
Find out how to set up and work locally with the most granular demographics dataset that is out there.
Applying machine learning models at scale in production can be hard. Here are the four biggest challenges data teams face and how to solve them.
Standard Audiences: A product that extends the functionality of regular Audiences, one of the most flexible, powerful, and heavily leveraged tools on mParticle.
Put your organization on the path to consistent data quality by adopting these six habits of highly effective data.
Data trust starts and ends with communication. Here’s how best-in-class data teams are certifying tables as approved for use across their organization.
The art of building a large catalog of connectors is thinking in onion layers.
PyTorch Geometric Temporal is a deep learning library for neural spatiotemporal signal processing.
In the previous article, I described the concept and design of the Structured Data Service in the Alluxio 2.1.0 release. This article will go through an example to demonstrate how it helps SQL and structured data workloads.
[Towards Open Options Chains: A Data Pipeline Solution - Part I](https://hackernoon.com/towards-open-options-chains-a-data-pipeline-solution-for-options-data-part-i) In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
Governance is the Gordian Knot to all Your Business Problems.
Learning about the best data visualisation tools may be the first step in utilising data analytics to your advantage and the benefit of your company.
From simplifying data collection to enabling data-driven feature development, Customer Data Platforms (CDPs) have far-reaching value for engineers.
Why we chose to finally buy a unified data workspace (Atlan), after spending 1.5 years building our own internal solution with Amundsen and Atlas
Learn how Airflow affects data quality checks and why you should look for an alternative tool.
The benefits that come with using Docker containers are well known: they provide consistent and isolated environments so that applications can be deployed anywhere - locally, in dev / testing / prod environments, across all cloud providers, and on-premise - in a repeatable way.
See how a hybrid architecture marries the best of the SaaS world and on-prem world for modern data stack software.
Influenza Vaccines and Data Science in Biology
The 5 things every data analyst should know and why it is not Python, nor SQL
It doesn’t matter if you are running background tasks, preprocessing jobs or ML pipelines. Writing tasks is the easy part. The hard part is the orchestration: managing dependencies among tasks, scheduling workflows, and monitoring their execution is tedious.
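To make the dependency-management part concrete, here is a minimal sketch that orders tasks with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). It is not any particular orchestrator's API: `run_workflow`, the task names, and the convention that each task receives a dict of its upstream results are assumptions for illustration, and scheduling, retries, and monitoring are left out.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run each task once, after all of its upstream tasks have finished.

    tasks: maps a task name to a callable taking a dict of upstream results.
    dependencies: maps a task name to the set of task names it depends on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# Hypothetical three-step workflow: extract -> transform -> load.
tasks = {
    "extract": lambda upstream: [3, 1, 2],
    "transform": lambda upstream: sorted(upstream["extract"]),
    "load": lambda upstream: len(upstream["transform"]),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

order, results = run_workflow(tasks, dependencies)
```

`TopologicalSorter` raises `graphlib.CycleError` when the dependencies contain a cycle, which is the kind of mistake that is easy to make once a workflow grows past a handful of tasks.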
mParticle & HackerNoon are excited to host a Growth Marketing Writing Contest. Here’s your chance to win money from a whopping $12,000 prize pool!
Multi-part series that will take you from beginner to expert in Delta Lake
See mParticle data events and attributes displayed in an eCommerce UI, and experiment with implementing an mParticle data plan yourself.
HarperDB is more than just a database, and for certain users or projects, HarperDB is not serving as a database at all. How can this be possible?
I've worked on teams building ML-powered product features, everything from personalization to propensity paywalls. Some days, meetings to find and get access to data consumed my time; other days, it was consumed by building ETLs to get and clean that data. The worst situations were when I had to deal with existing microservice-oriented architectures. I wouldn't advocate that we stop using microservices, but if you want to fit an ML project into an existing, strict microservice-oriented architecture, you're doomed.
As the third largest e-commerce site in China, Vipshop processes large amounts of data collected daily to generate targeted advertisements for its consumers. In this article, guest author Gang Deng from Vipshop describes how to meet SLAs by improving struggling Spark jobs on HDFS by up to 30x, and optimize hot data access with Alluxio to create a reliable and stable computation pipeline for e-commerce targeted advertising.
This case study describes how we built a custom library that combines data housed in disparate sources to acquire the insights we needed.
Today, I am going to cover why I consider data science a team sport.
This is a collaboration between Baolong Mao's team at JD.com and my team at Alluxio. The original article was published on Alluxio's blog. This article describes how JD built an interactive OLAP platform combining two open-source technologies: Presto and Alluxio.
handoff is a serverless data pipeline orchestration framework that simplifies the process of deploying ETL/ELT tasks to AWS Fargate.
In this article, we’ll investigate use cases for which data engineers may need to interact with NoSQL databases, as well as the pros and cons.
Is the data engineer still the "worst seat at the table?" Maxime Beauchemin, creator of Apache Airflow and Apache Superset, weighs in.
This blog post is a refresh of a talk that James and I gave at Strata back in 2017. Why recap a 3-year-old conference talk? Well, the core ideas have aged well, we’ve never actually put them into writing before, and we’ve learned some new things in the meantime. Enjoy!
In this blog, guest writer Derek Tan, Executive Director of Infra & Simulation at WeRide, describes how engineers leverage Alluxio as a hybrid cloud data gateway for applications on-premises to access public cloud storage like AWS S3.
Delight is an open-source and cross-platform monitoring dashboard for Apache Spark with memory & CPU metrics complementing the Spark UI and Spark History Server.
See how to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from reaching your data pipelines.
Learn how to build an n8n workflow that processes text, stores data in two databases, and sends messages to Slack.
In this first post in our 2-part ML Ops series, we are going to look at ML Ops and highlight how and why data quality is key to ML Ops workflows.
Sometimes, we might not be able to afford a paid subscription on Slack. Here's a tutorial on how you can save and search through your Slack history for free.
This post explains what a data connector is and provides a framework for building connectors that replicate data from different sources into your data warehouse
Tiered Locality is a feature led by my colleague Andrew Audibert at Alluxio. This article dives into the details of how tiered locality helps provide optimized performance and lower costs. The original article was published on Alluxio’s engineering blog.
This article covers 7 data engineering gotchas in an ML project. The list is sorted in descending order based on the number of times I've encountered each one.
Since the big bang in the data technology landscape a decade and a half ago gave rise to technologies like Hadoop, which cater to the four ‘V’s (volume, variety, velocity, and veracity), there has been an uptick in the use of databases with specialized capabilities to cater to different types of data and usage patterns. You can now see companies using graph databases, time-series databases, document databases, and others for different customer and internal workloads.
This article introduces Structured Data Management (Developer Preview) available in the latest Alluxio 2.1.0 release, a new effort to provide further benefits to SQL and structured data workloads using Alluxio. The original concept was discussed on Alluxio’s engineering blog. This article is part one of the two articles on the Structured Data Management feature my team worked on.
In this listicle, you'll find some of the best data engineering courses, and career paths that can help you jumpstart your data engineering journey!
This article presents the collaboration of Alibaba, Alluxio, and Nanjing University in tackling the problem of Deep Learning model training in the cloud. Various performance bottlenecks are analyzed with detailed optimizations of each component in the architecture. This content was previously published on Alluxio's Engineering Blog, featuring Alibaba Cloud Container Service Team's case study. Our goal was to reduce the cost and complexity of data access for Deep Learning training in a hybrid environment, which resulted in over 40% reduction in training time and cost.
A brief description of the difference between Data Science and Data Engineering.
What Deep Data Observability is and how it differs from shallow data observability.
Writing ML code as pipelines from the get-go reduces technical debt and increases velocity of getting ML in production.
Congratulations, you’ve successfully implemented data testing in your pipeline!
Metabase is a business intelligence tool for your organisation that plugs in various data-sources so you can explore data and build dashboards. I'll aim to provide a series of articles on provisioning and building this out for your organisation. This article is about getting up and running quickly.
When it comes to Big Data infrastructure on Google Cloud Platform, the most popular choices data architects need to consider today are Google BigQuery, a serverless, highly scalable, and cost-effective cloud data warehouse; Cloud Dataflow, which is based on Apache Beam; and Dataproc, a fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
This tutorial shows how the Alibaba Cloud Container team runs PyTorch on HDFS using Alluxio in a Kubernetes environment. The original Chinese article was published on Alibaba Cloud's engineering blog, then translated and published on Alluxio's Engineering Blog.
Migrating Presto workloads from a fully on-premise environment to cloud infrastructure has numerous benefits, including alleviating resource contention and reducing costs by paying for computation resources on an on-demand basis. In the case of Presto running on data stored in HDFS, the separation of compute in the cloud and storage on-premises is apparent since Presto’s architecture enables the storage and compute components to operate independently. The critical issue in this hybrid environment of Presto in the cloud retrieving HDFS data from an on-premise environment is the network latency between the two clusters.
Data lakes are an essential component in building any future-proof data platform. In this article, we round up 7 reasons why you need a data lake.
How I learned to stop using pandas and love SQL.
Overview of the modern data stack after interviewing 200+ data leaders. Decision Matrix for Benchmark (DW, ETL, Governance, Visualisation, Documentation, etc.)
This blog covers real-world use cases of businesses embracing the machine learning and data engineering revolution to optimize their marketing efforts.
How to detect, capture, and propagate changes in source databases to target systems in a real-time, event-driven manner with Change Data Capture (CDC).
Goldman Will Dominate Consumer Banking
Maximizing efficiency is about knowing how the data science puzzles fit together and then executing them.
Data Version Control (DVC) is a data-focused version of Git. In fact, it’s almost exactly like Git in terms of features and workflows associated with it.
Data augmentation is a technique used by practitioners to increase the amount of data by creating modified copies of the existing data.