Making Sense of Big Data

How can we use BigQuery to handle tables with many columns? Here’s how, using scripting and table metadata.

BigQuery is great, but how can it handle tables with many columns? Especially when we don’t even know how many columns there are? (Photo Credit: Ilse Orsel)

BigQuery (BQ) is Google’s proprietary, petabyte-ready data warehouse product. Pretty much every day in my work, I use BQ to process billions of rows of data. But sometimes, tricky situations arise. For example, what if my table has a large number of rows and columns? How does BQ deal with that?

As with other SQL dialects, apart from a few syntactic shortcuts such as * and EXCEPT, every query requires us to manually enter the names of all the desired columns. But what if I have a table with hundreds of columns, and I want to compute, say, all…
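To make the problem concrete, here is a minimal sketch of building the column list programmatically instead of typing it by hand. It is written in Python rather than BigQuery scripting, and it is not the article’s method: the function name `build_aggregate_query`, the table name, the column names, and the choice of `AVG` are all hypothetical, picked only for illustration. In practice the column names would come from table metadata rather than a hard-coded list.

```python
def build_aggregate_query(table: str, columns: list[str], agg: str = "AVG") -> str:
    """Build one SELECT that applies `agg` to every column in `columns`."""
    if not columns:
        raise ValueError("need at least one column")
    # Backticks quote identifiers in BigQuery Standard SQL.
    select_list = ",\n  ".join(
        f"{agg}(`{col}`) AS `{agg.lower()}_{col}`" for col in columns
    )
    return f"SELECT\n  {select_list}\nFROM `{table}`"


# Hypothetical column names; a real list would be read from table metadata.
example_columns = ["col_a", "col_b", "col_c"]
example_query = build_aggregate_query("my_dataset.wide_table", example_columns)
print(example_query)
```

The generated string grows with the number of columns, but the code that produces it does not, which is the point when the column count is in the hundreds or unknown in advance.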


Quantum physics doesn’t have to be weird. Here, we explore a modern and intuitive interpretation of how quantum systems really work.

Can a (Schrödinger’s) cat in a box be both dead and alive? Common sense says no, and the Consistent Histories Interpretation elaborates how this works (credit: Gerd Altmann)

The word “quantum” is often associated with complicated equations and unintuitive physical phenomena. Yet, our world is unapologetically quantum; quantum physics governs all that exists in the Universe. This fact generates a lot of confusion amongst the general public: how can something so unintuitive — quantum physics — describe the intuitive world?

Well, one source of vexation comes from the old-school way of discussing quantum physics: that there are some magical wave-functions, such that objects can be in a mixture of two different places/situations, and that these “wave-functions” can change (or collapse) instantaneously across vast distances, when someone takes a…


Learn how to supercharge your aggregation queries using Materialized Views

Learn how to supercharge your aggregation queries using Materialized Views! (Photo by Nana Smirnova on Unsplash)

BigQuery (BQ) is Google’s proprietary data warehouse product, advertised as ready for the petabyte (PB) scale. However, it’s not immediately obvious how to scale to PB. In fact, looking at the cost structure of BQ ($5/TB), running PB-scale analytics seems prohibitively expensive: querying 1 PB of data racks up a bill of $5,000! Of course, dedicated slots are available to keep costs down, but the question remains: how does one leverage BQ for lightning-fast analytics that scale affordably to PB?
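The arithmetic behind that figure can be checked with a short Python sketch. It assumes the $5/TB rate quoted above and decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the function name `query_cost_usd` is hypothetical, and real billing rules may differ.

```python
BYTES_PER_TB = 10**12    # decimal terabyte; an assumption for this sketch
PRICE_PER_TB_USD = 5.0   # rate quoted in the text


def query_cost_usd(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of a query billed purely by the amount of data scanned."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


one_pb_cost = query_cost_usd(10**15)  # full scan of 1 PB
one_gb_cost = query_cost_usd(10**9)   # scan of 1 GB, for comparison
print(one_pb_cost, one_gb_cost)
```

Because the bill is linear in bytes scanned, any approach that lets a query read far less data reduces the cost proportionally.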

Well, the solution hinges on one crucial fact: PB-scale analytics usually only include aggregated…


Many people think the arrow of time can be explained by the increase of entropy, but that’s inadequate. We’ll explore why.

The irreversibility of time is ubiquitous, but it’s not so obvious why that has to be the case (Photo by Aron Visuals)

“An inch of time is an inch of gold”: The transience of time has been known since time immemorial. Unlike trekking across an open plain or diving into the ocean’s depths, one can neither explore around nor remain stationary in time. With every tick of the clock, we are faced with the relentless advancement of time.

Time’s irreversibility seems contradictory: Einstein’s theory of Relativity unified space and time, and the laws of physics are (approximately) the same in all directions (including forward and backward in time). Yet, our Universe seems to have picked out a unique direction — the arrow…


The story of discovery from the perspective of a physicist in training.

During the presentation of the discovery of a “Higgs-like boson” (Credit: CERN Photo, Maximilien Brice, Laurent Egli)

I can still vividly recall the events of July 4th, 2012. The alarm clock woke me up around 3 a.m. I was slightly groggy, but I quickly collected myself, remembering the significance of the moment.

I was still a graduate student at Princeton at the time. There was a celebration “party” event that I decided not to attend (given that it was a 40-minute commute for me)—as I would later find out, I might have missed a chance to be in a few frames of the acclaimed documentary Particle Fever.

In the darkness of my room, I quietly turned on…


Unlike slot machines, BigQuery lets you generate random numbers for FREE! Here’s how to leverage this to perform free computations

Unlike slot machines, BigQuery lets you generate random numbers and run computations for FREE! (Credit: Amit Lahav)

BigQuery provides a convenient and cheap serverless framework to run data analytics and algorithms at scale (you can sign up for a free account here). However, its SQL frontend might seem like a rather stringent constraint.

No worries, there is a way around this: custom User-Defined Functions (UDFs) in JavaScript.

With the power of these UDFs, one can run all sorts of massively parallelized algorithms at scale. What better way to illustrate this than running Monte Carlo (MC) simulations?
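For readers unfamiliar with Monte Carlo methods, here is a minimal sketch of the idea in plain Python. It is not the article’s BigQuery UDF: it estimates π by sampling random points in the unit square and counting how many fall inside the quarter circle. The function name `estimate_pi` and the fixed seed are choices made for this illustration only.

```python
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi from the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of the quarter circle over area of the unit square is pi / 4.
    return 4.0 * inside / n_samples


pi_estimate = estimate_pi(100_000)
print(pi_estimate)
```

Each sample is independent of the others, which is why this kind of simulation parallelizes so naturally.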

Here’s a pro-tip: BigQuery’s cost structure depends only on the amount of data queried. What about generating random numbers? It doesn’t…


Entropy is often treated as synonymous with chaos and disorder. But what is it really? In this article, we explore how entropy is more about ignorance.

From the “invisible force” to the “harbinger of chaos,” you may have heard quite a few sensational phrases describing entropy.

But what is it really? The equations—frequently misunderstood—tell a more humbling story.

In many cases, entropy doesn’t capture anything particularly deep about a physical system. In fact, it says more about our understanding of the system than the system itself.

The main punchline is:

entropy measures our ignorance of a system

Let’s dissect how and why this is the proper way to understand entropy.
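As a quick numerical illustration of the punchline, the sketch below uses Shannon’s information entropy, measured in bits. This is a swapped-in definition chosen for the example and may differ from the one the article goes on to present; the function name `shannon_entropy_bits` is hypothetical.

```python
import math


def shannon_entropy_bits(probabilities: list[float]) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# We know the state exactly: no ignorance.
known_state = shannon_entropy_bits([1.0])
# Eight states, all equally likely: maximal ignorance for eight options.
unknown_state = shannon_entropy_bits([1 / 8] * 8)
print(known_state, unknown_state)
```

With full knowledge of the state the entropy is 0 bits, while eight equally likely states give 3 bits. The system is the same in both cases; only what we know about it has changed.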

Understanding Entropy’s Definition

Starting from the beginning, the classical definition of entropy in physics, S, is given by the…


Learn how to create multiple tables (concatenated) in one query

Learn how to create concatenated tables in BigQuery

BigQuery (BQ) has become a popular way of managing large databases and running ad-hoc queries. BQ can be very cost-efficient, as it charges by the amount of data queried ($5/TB), and not the amount of computation time. Thus, it can be far cheaper to run computations in BQ compared to running jobs on Hadoop or Spark.

However, the SQL frontend comes with restrictions. A common computational task involves creating multiple outputs. How does BQ deal with storing output data?

This can be achieved through Data Manipulation Language (DML), which allows us to create tables to store results of a computation…

Tim Lou, PhD

Data Scientist @ LiveRamp | ex Particle Physics Postdoc @ Berkeley | Podcast host @ quirkcast.org
