Data et. al - Data, Software, Tech, & Productivity

You have power over your mind - not outside events. Realize this, and you will find strength. -Marcus Aurelius

Habits and identity

The first three chapters of Atomic Habits discuss achieving goals by thinking about results, processes, and identities. That is, many people start with outcomes in mind when setting goals. I want to run a marathon. Or, I want to make more money. The way to actually achieve these sorts of outcomes is to think about the identity of an individual who has more money or who runs marathons.

If I want to stop eating garbage food so frequently I need to think about the identity of an individual who does not eat garbage food. That is, the identity of a health-minded individual.

An identity is thus composed of the many decisions you make throughout your day. Each decision is an opportunity to ask what an individual with your chosen goal would do. A sort of secular WWJD...

I want to be healthier, so I need to take micro-actions throughout the day that embody the identity of a healthy person. It seems obvious, but constantly focusing on results rather than the process stemming from an identity leads to impatience and burnout. So I will try to WWJD myself into making better decisions for my health.

Flattening a nested dictionary in Python

I've always hated recursive logic because it makes my brain hurt.

This is a relatively straightforward approach to flattening a nested dictionary: the function recursively calls itself on any value that is itself a dictionary.

d = {
    'a':1,
    'b':2,
    'c':{
        'a':1
    }
}


def flatten(d, parent_key=''):
    # Recursively flatten a nested dictionary, prefixing child keys
    # with their parent key, joined by a dot.
    output = []
    for k, v in d.items():
        new_key = parent_key + '.' + k if parent_key else k
        if isinstance(v, dict):
            # Recurse into the nested dictionary and collect its items
            output.extend(flatten(v, new_key).items())
        else:
            output.append((new_key, v))
    return dict(output)


test = flatten(d)
print(test)  # {'a': 1, 'b': 2, 'c.a': 1}

How to become a data engineer

(this blog post is auto-generated via OpenAI)

Here is a guide on how to become a data engineer:

What are the Requirements for Becoming a Data Engineer?

The requirements for becoming a data engineer vary depending on the company. However, most companies look for experience with SQL and at least one of Python, Java, or Scala, along with distributed computing frameworks such as Apache Spark or Apache Hadoop, and machine learning tools such as TensorFlow or Keras.

What are the Education Requirements for Becoming a Data Engineer?

The education requirements for becoming a data engineer also vary depending on the company. However, most companies require a bachelor's degree in computer science, mathematics, statistics, or engineering, combined with the hands-on experience described above.

  1. Learn Python

Python is a popular programming language that is used for data engineering. It is a general-purpose programming language that can be used for many different purposes. It is also one of the most popular languages in the world, so it will be easy to find help if you need it.

  2. Learn SQL

SQL stands for Structured Query Language and it is a programming language that is used to interact with databases. It is important to learn SQL because it will allow you to query databases and extract information from them. You can also use SQL to create tables and insert data into them.
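
As a quick illustration, here is a minimal sketch using Python's built-in sqlite3 module and a made-up users table; the table and columns are just placeholders.

import sqlite3

# In-memory database, purely for illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table and insert a couple of rows
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT)")
cur.executemany(
    "INSERT INTO users (name, signup_date) VALUES (?, ?)",
    [("Ada", "2021-01-01"), ("Grace", "2021-02-15")],
)

# Query the data back out
for row in cur.execute("SELECT name, signup_date FROM users ORDER BY signup_date"):
    print(row)

conn.close()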

  3. Learn Hadoop

Hadoop is an open-source software framework that allows you to store and process large amounts of data in a distributed computing environment. Hadoop has become one of the most popular tools for data engineering because it allows you to store large amounts of data on inexpensive hardware, which makes it easier to process large amounts of data quickly and efficiently.
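
To make this a bit more concrete, here is a hedged sketch of a distributed word count using PySpark, Spark's Python API, which commonly runs alongside or on top of Hadoop clusters; the HDFS path is hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("word-count").getOrCreate()

# Read a text file from a hypothetical HDFS location
lines = spark.read.text("hdfs:///data/sample_logs.txt")

# Split each line into words and count occurrences across the cluster
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
word_counts = words.groupBy("word").count().orderBy(F.desc("count"))

word_counts.show(10)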

Using Airflow to orchestrate data pipelines

(this blog post is auto-generated via gpt-neo)

Using airflow to orchestrate your data pipelines

In many organizations, data pipelines are used to transfer data with as little manual intervention as possible to keep the time to value of the process as low as possible. The following video demonstrates how to set up and orchestrate a data pipeline on AWS with AWS CloudFormation.

Note

It’s important to pay attention to AWS documentation when setting up or updating any AWS resource. For example, using CloudFormation to update resource tags is not recommended. The new resource will be created at an unpredictable moment, or incorrectly updated, as resources that AWS recommends that you use to create resources are not created by CloudFormation.

While setting up a CloudFormation resource that can be updated after creation is a recommended, highly-guarded practice, AWS recommends against creating new resources, rather than simply updating existing ones, as the new resource will not be able to be deleted when the existing resource is deleted.

In this example, you will use an AWS Data Pipeline resource (called Source) to transfer data from one database to another database. This resource will transfer and maintain your data in your source database. After the data has been transferred, it will be dropped from the source database.

Note

The example in this post will use the PostgreSQL database, although any database is acceptable when using the Amazon RDS service.

The best laptops for business

(this blog post is auto-generated via gpt-neo)

The best laptop for business, gaming, media etc.

Let us break it down.

So you have been looking at laptops for so long; just what is it that you need to buy? I want you to think again and take it from the top down. For a while, I have been looking at the laptop market, a lot like how people look at cars. And the same sort of “Top 10” type of question arises. What is the perfect machine for me to use? And then, what is the perfect machine for others, to show off? It all sort of boils down to the same question.

I am going to try and break it down in as much detail as I possibly can to see if we can discover what the perfect system is for each and every individual in this world.

I will break this down into a few general categories and if we find out where there is a gap for improvement, I’ll attempt to fix it in the article.

But it will just be a starting point.

I am not going to worry about the spec sheet at all. If you want to check their specs then go ahead.

But first, let us focus on the functionality of the system and how it can help you.

This is probably the most critical element for the user and for the company itself. It is this which makes or breaks a business or company.

Awkwardness, Microsoft Surface Laptop 4, Airflow & Composer

thought

Sometimes I feel like I need to say something in order for common courtesy to take place; even if I don't have anything to say or don't want to say anything. This can't be at all unusual, right?

Anyway, it seems forcing it ends up with me saying things that make little sense since they're so forced. But to say nothing is often considered rude. So is it better to just say something and risk coming off as aloof?

bought

Microsoft Surface Laptop 4 was announced today. I impulsively placed an order. I have enjoyed the WSL2 experience when combined with Windows Terminal. Makes developing on Windows machines not a complete pain in my ass.

Why did I get this? I'm currently typing on a behemoth 17" MSI gaming laptop that gets a solid 3-4 hours off the charger. And while it kicks ass for gaming, it's sort of a terrible choice for general productivity, browsing, and portability if I don't need the power. I'm looking forward to something I can keep off the cord for hours at a time and charge at night like my phone.

Opted for the AMD 512GB with 16GB RAM.

worked on

More work with GCP's managed Airflow service, Composer. Getting weird memory limit errors under heavy loads on wide columnar datasets. Scaled up node memory limits within its Kubernetes cluster and still nothing. Fine for now because I got what I need, but part of the fun of BigQuery is analytical datastores with wide columnar datasets!

After messing with that for a bit, ended up creating a few more SQL queries to bring interesting views of Salesforce data into Data Studio.

Note to self: Diagram a set of view schemas so I don't have hundreds of custom SQL queries powering hundreds of reports. Reduce, reuse, recycle.
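
Something like this minimal sketch with the BigQuery Python client is what I have in mind; the project, dataset, and query are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Define one reusable view that many Data Studio reports can point at
view = bigquery.Table("my-project.reporting.sfdc_opportunities_view")
view.view_query = """
    SELECT account_id, stage_name, amount, close_date
    FROM `my-project.salesforce.opportunity`
    WHERE is_deleted = FALSE
"""

client.create_table(view, exists_ok=True)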

That's all I think.

Getting started on Cloud Run & Kubernetes

I'm getting more and more familiar with GCP and its capabilities as I deploy applications through things like Cloud Run. This has been sort of a gateway into wanting to learn about the underlying infrastructure behind it, which has in turn led me to start learning about Kubernetes setups.

I think one difficult thing about all of these options is that there are so many that you have to really understand each of them in order to decide which one you should be using.

I think if all you care about is saying, "give me this CPU size in this region and serve this image," then Cloud Run is great.

I suppose the only real reason I might want to go through the extra work of managing my own Kubernetes clusters would be if I wanted to be a bit more platform-agnostic and not be tied to one provider.

I think I might take one of my Cloud Run jobs and deploy a container to a Kubernetes cluster so I get a bit more end-to-end experience with it vs just sort of checking logs of pods that the infra team has procured and set up on their own clusters.

That's one of the things about data engineering; it's difficult sometimes to tell where infrastructure, devops, and data eng should differentiate and overlap. We're working through it live at work since our structure is sort of maturing over time. So far, so good.

That's all for now.

Easing into natural language processing with Hugging Face Transformers

Advancements in AI have brought a lot of attention to a number of subdomains within this vast field. One interesting one is natural language processing.

What is a Hugging Face Transformer?

Why don’t we let their pretrained models answer this question?

Transformers provides general-purpose architectures for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages. The library currently contains PyTorch, Tensorflow and Flax implementations, pretrained model weights, usage scripts and conversion utilities for the following models.

Not bad, AI. Not bad at all. The above quote is what a pretrained model using a summarization pipeline produced when applied to the contents of the Hugging Face Transformers documentation.

Using these pipelines allows pretty much anybody to get started down the road of natural language processing without much insight into the PyTorch or TensorFlow back end.

How to use Hugging Face Text Summarization

First you have to install the transformers package for Python.

pip3 install transformers

Once you have this installed, it is a simple matter of importing the pipeline, specifying the type of model you want to run (in this case, summarization), and then passing it your content to summarize.


from transformers import pipeline

# Any long block of text to condense
text = "Insert a wall of text here"
# Build a default summarization pipeline (downloads a model on first run)
summarization = pipeline("summarization")
# The pipeline returns a list of dicts; pull out the summary string
summary_text = summarization(text)[0]['summary_text']
print(summary_text)

For beginners and experts

The simplicity of these libraries means you can get started quickly. You can do a lot out of the gate, but you’ll quickly notice the limitations of the vanilla models. Don’t get me wrong, they are amazing, but if you want to do fine-tuning, expect to do some reading of the documentation.

I’d suggest identifying a community-contributed model that seems interesting and then reverse engineering it if you want to see how these come together.

Ultimately, I believe Hugging Face brings a democratization of NLP for developers in a sense. It is much easier to apply pretrained models to accomplish common tasks such as sentiment analysis, text summarization, and even question generation!
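
For instance, swapping the pipeline type is about all it takes to go from summarization to sentiment analysis; the input sentence here is just a placeholder.

from transformers import pipeline

# Downloads a default sentiment model the first time it runs
sentiment = pipeline("sentiment-analysis")

# Returns a list of dicts with a label and a confidence score
result = sentiment("I can't believe how easy these pipelines are to use!")[0]
print(result['label'], round(result['score'], 3))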

It also opens the door for NLP and AI practitioners to get involved by contributing models and improving the quality of the output that enthusiasts such as myself can enjoy without poring over documentation and tuning parameters when that isn’t my day job!

Give these transformers and pretrained models a try and let me know what you think! Have you found interesting uses for these on any projects?

Hello world

Standard Notes & Listed as a blog

This is my first post using Listed.to and Standard Notes.

Let's see how it works!

def hello():
  print("Hello World")

if __name__ == '__main__':
  hello()

Building an attribution model with markov chains

A short while ago I published a rather technical post on the development of a Python-based attribution model that leverages a probabilistic graphical modeling concept known as a Markov chain.

I realize what might serve as better content is actually the motivation behind doing such a thing, as well as providing a clearer understanding of what is going on behind the scenes. So to that end, in this post I'll be describing the basics of the Markov process and why we would want to use it in practice for attribution modeling.

What is a Markov Chain?

A Markov chain is a type of probabilistic model. This means that it is a system for representing different states that are connected to each other by probabilities.

The state, in the example of our attribution model, is the channel or tactic that a given user is exposed to (e.g. a nonbrand SEM ad or a Display ad). The question then becomes, given your current state, what is your next most likely state?

Well one way to estimate this would be to get a list of all possible states branching from the state in question and create a conditional probability distribution representing the likelihood of moving from the initial state to each other possible state.

So in practice, this could look like the following:

Let our current state be SEM in a system containing the possible states of SEM, SEO, Display, Affiliate, Conversion, and No Conversion.

After we look at every user path in our dataset we get conditional probabilities that resemble this.

P(SEM | SEM) = .1
P(SEO | SEM) = .2
P(Affiliate | SEM) = .05
P(Display | SEM) = .05
P(Conversion | SEM) = .5
P(No Conversion | SEM) = .1

This can be graphically represented.
[Figure: transition probabilities branching out from the SEM state]

Notice how the probabilities extending from the SEM state sum to one. This is an important property of a Markov process and one that will arise organically if you have engineered your dataset properly.
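
If it helps to see it in code, here is a minimal sketch of how the outbound SEM probabilities might be represented as a plain Python dict; the other states would get rows of their own.

# Outbound transition probabilities for the SEM state only
transitions = {
    'SEM': {
        'SEM': 0.10,
        'SEO': 0.20,
        'Affiliate': 0.05,
        'Display': 0.05,
        'Conversion': 0.50,
        'No Conversion': 0.10,
    },
    # ...rows for SEO, Display, Affiliate, etc. would follow
}

# Each state's outbound probabilities must sum to one
assert abs(sum(transitions['SEM'].values()) - 1.0) < 1e-9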

Connect all the nodes

Above we only identified the conditional probabilities for the scenario in which our current state was SEM. We now need to go through the same process for every other possible scenario to build a networked model that you can follow indefinitely.

[Figure: the fully connected Markov chain across all channel states]

Intuition

Up to this point I've written a lot about defining and constructing a Markov chain, but it is helpful to explain why I like these models over standard heuristic-based attribution models.

Look again at the fully constructed network we have created, but pay special attention to the outbound Display vectors that I've highlighted in blue below.
[Figure: the full network with the outbound Display transitions highlighted in blue]

According to the data, a user in the Display state has a high likelihood of not converting, at about 75%, and only a 5% chance of converting. However, that user has a 20% probability of proceeding to SEM as the next step. And SEM has a 50% chance of converting!

This means that when it comes time to do the "attribution" portion of this model, Display is very likely to increase its share of conversions.

Attributing the Conversions

Now that we have constructed the system that represents our user behavior, it's time to use it to re-allocate the total number of conversions that occurred over a period of time.

What I like to do is take the entire system's probability matrix and simulate thousands of runs through the system that end when our simulated user arrives at either conversion or null. This allows us to use a rather small sample to generalize because we can simulate the random walk through the different stages of our system with our prior understanding of the probability of moving from one stage to the next. Since we pass a probability distribution into the mix we are allowing for a bit more variation in our simulation outcomes.
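
Here is a hedged sketch of that simulation step, assuming a transitions dict shaped like the one sketched earlier plus a synthetic 'Start' state whose outbound probabilities reflect how often each channel is the first touch.

import random

def simulate_conversion_rate(transitions, n_runs=10000, start='Start'):
    # Random-walk the chain until it lands in an absorbing state
    conversions = 0
    for _ in range(n_runs):
        state = start
        while state not in ('Conversion', 'No Conversion'):
            next_states, probs = zip(*transitions[state].items())
            state = random.choices(next_states, weights=probs)[0]
        if state == 'Conversion':
            conversions += 1
    return conversions / n_runs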

After getting the conversion rates of the system we can simulate what occurs when we remove channels from the system one by one to understand their overall contribution to the whole.

We do this by calculating the removal effect [1], which is defined as the probability of reaching a conversion when a given channel or tactic is removed from the system.

In other words, if we create one new model for each channel where that channel is set to 100% no conversion, we will have a new model that highlights the effect that removing that channel entirely had on the overall system.

Mathematically speaking, we'd take the percent difference between the conversion rate of the overall system with a given channel set to NULL and the conversion rate of the overall system. We would do this for each channel. Then we would divide each channel's removal CVR by the sum of all removal CVRs to get a weighting for each of them, and finally multiply that weighting by the total number of conversions to arrive at the fractionally attributed number of conversions.
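
Reusing the hypothetical simulate_conversion_rate helper from the sketch above, the removal effect and the final re-allocation might look roughly like this.

def removal_effect(transitions, channel):
    # Send all of the channel's outbound probability straight to 'No Conversion'
    base_cvr = simulate_conversion_rate(transitions)
    modified = {state: dict(probs) for state, probs in transitions.items()}
    modified[channel] = {'No Conversion': 1.0}
    return (base_cvr - simulate_conversion_rate(modified)) / base_cvr

def attribute_conversions(transitions, channels, total_conversions):
    # Normalize removal effects into weights, then split the conversions
    effects = {c: removal_effect(transitions, c) for c in channels}
    total_effect = sum(effects.values())
    return {c: total_conversions * e / total_effect for c, e in effects.items()}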

If the above paragraph confuses you, head over to here and scroll about a third of the way down for a clear removal effect example. I went and made my example system too complicated for me to want to manually write out the removal effect CVRs.

That's it

Well, by now you have a working attribution model that leverages a Markov process for allocating fractions of a conversion to multiple touchpoints! I have also built a proof-of-concept in Python that employs the above methodology to perform Markov-model-based attribution given a set of touchpoints [2].


  1. Anderl, Eva and Becker, Ingo and Wangenheim, Florian V. and Schumann, Jan Hendrik, Mapping the Customer Journey: A Graph-Based Framework for Online Attribution Modeling (October 18, 2014). Available at SSRN: https://ssrn.com/abstract=2343077 or http://dx.doi.org/10.2139/ssrn.2343077 

  2. https://github.com/jerednel/markov-chain-attribution