Because The Air Is Free -- Pablo Duboue's Blog

RSS Feed

How to age in technical roles as an individual contributor

Mon, 19 Feb 2024 09:35:15 -0800

Tags: personal, technical, work


A natural progression for people in technical roles is to grow into management roles. As organizations seek to better compensate senior personnel, promoting them into management enables a multiplier effect through the work of their reports. And, as the programmer base doubles every five years, managers are always in demand. Other paths into which individual contributors (ICs) evolve include technical sales (which is very lucrative), co-founder, and instructor.

The management and IC paths are not comparable. Managers do something different from ICs, which in itself is very hard and not necessarily that appreciated by technical contributors. Understanding the inner state of other humans is a task where experience helps a lot. Experience as a technical IC does not help that much directly in our blazingly fast world of technology.

Even if there is pressure to move into management or other roles, it is possible to stay technical and find satisfaction and fulfillment working as an IC. To me, it hinges on cherishing the process of creating great things together, loving what has been built, and helping others (team members, end users). It comes from the concept of willingness to serve.

This is a rather long post with related thoughts on the matter, joining personal introspection with some pieces of advice. It is divided into three sections: technical skills, soft skills, and life lessons.

An insider critique of the Jeopardy! QA System

Wed, 27 Jul 2022 15:43:52 -0700

Tags: retrospective, business, technical


Last year marked the 10-year anniversary of the IBM/Jeopardy! show and this year marks 12 years since I left the company and immigrated to Canada.

Now that the dust has clearly settled on the show, it is a good time to share my thoughts on the matter. The show gathered a lot of press at the time of airing and has experienced continuous academic backlash ever since. Here is a piece from the perspective of an individual contributor to the project.

This post echoes some of the criticism but highlights many positive lessons that might have been overlooked. Stick to the end if you want the full perspective.

On Ephemeral Software Development

Sat, 12 Feb 2022 16:58:26 -0800

Tags: philosophical, technical


Programming to me was about building something: a program or a system of programs that does something useful. It echoed the construction of physical artifacts. Engineered artifacts have a lifespan; they are constructed to last a certain amount of time: years, decades, centuries. But it is a construction activity.

Since I started programming, the number of programmers has doubled about five times. This reality, plus the philosophy of some companies to "move fast and break things" and the advent of Software-as-a-Service, has brought about a new state of affairs. It seems to me that the best way to intellectually comprehend what computer programming is these days is to consider it as a kind of performance. Drawing a parallel with art, we used to make a series of paintings, retouching or painting from scratch, but they were artifacts. Each version of a piece of software was an artifact. Continuing with the metaphor, these days programming seems closer to a performance. There is rehearsing at the dev and QA desks, then the software is deployed to an audience that experiences it for a very short period of time before a new performance arrives. There is no artifact anymore. Just performances.

What this means day to day is that there is very little interest in supporting old software. Libraries are in a continuous state of flux and nothing runs unless it is continuously maintained. Lack of support for old users is the new norm. Creating new functionality is preferred over fixing things or supporting existing users. The software grows fast in an almost organic manner. Continuous rewriting is the permanent revolution of the software world.

Is this good or bad? I believe we have to live in the times as they are; it doesn't really matter whether things are better or worse if we cannot change them. Given my background, I'd rather work on software projects that have more of an artifact-construction mentality than a performative one. But the current reality implies that learning technologies and programming languages is an activity that needs to be exercised immediately. It is similar to growing produce: either you harvest it or it rots. This sense of urgency in the learn, code, deploy cycle might be the reason Google-programming is such a successful strategy.

On intuitions, Latin translations and prior distributions

Tue, 31 Aug 2021 07:14:58 -0400

Tags: philosophical, political, academic


Back in my teens, I attended a 7-year secondary school ("high school"?) with a strong focus on the humanities. From the time I was 11 until I was 16, I had 5 hours a week of Latin instruction plus a similar amount of time spent on translation assignments at home. One of those years, the professor said "you can ask me anything, including why we are studying Latin". I duly obliged. He surprised us by saying that teaching us to do Latin translations was the humanities equivalent of teaching us math: the way to teach structured thinking using words. Over time I started to see his point: translating languages with a strong case system is akin to solving a system of equations.

Fast forward a few years and I was finishing my undergraduate diploma thesis on Spanish parsing using Lexical Functional Grammars (LFG), where this whole "system of equations" really came to fruition. LFG poses a two-level process for parsing: the first level is a shallow process and the second is (of course) a system of equations. But that still didn't quite capture my experience translating neverending diatribes against political adversaries. The equations were there, yes, but the solution was driven by intuitions about the roles and meanings of the words. Around that time, I went to graduate school.

At Columbia, I got into statistical NLP and the intuitions I mentioned became clear as probability distributions, particularly as priors over the attachment of different words (is this word modifying the verb or the noun? well, if you have a large corpus of parsed text, you might find it usually modifies nouns and seldom verbs, so chances are in this case it also modifies the noun). The beauty of doing it by hand is connecting your (freshly forming, in the case of my 12-year-old self) intuitions with those of somebody who has been dead for 20 centuries. Intuitions come from experience, and the experience being shared goes beyond the text itself. But I digress; the point here is that, as humans, we need intuitions to do the translation, and in statistical systems these are provided by statistics computed over the data.
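A toy sketch of how such a prior falls out of corpus counts (the numbers and the two-way noun/verb choice are invented for illustration, not taken from any real treebank):

```python
from collections import Counter

# Hypothetical counts of how often a given word attached to a noun
# versus a verb across a parsed corpus (made-up numbers).
attachment_counts = Counter({"noun": 830, "verb": 170})

# The prior is just the relative frequency of each attachment site.
total = sum(attachment_counts.values())
prior = {site: count / total for site, count in attachment_counts.items()}

# Absent other evidence, bet on the most probable attachment site.
best_guess = max(prior, key=prior.get)
print(prior, best_guess)  # noun wins with probability 0.83
```

The same relative-frequency idea scales up to the conditional distributions a statistical parser actually uses; this is only the simplest possible version.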

In his book "Thinking, Fast and Slow", Daniel Kahneman tells the story of firefighters trying to extinguish a fire in a kitchen, when the commander realizes something is wrong and gets everybody out before the floor collapses. Kahneman cites it as an example of intuition taking control. In our modern society, however, intuitions are somewhat shunned, because they are close to impossible to teach. "Do it until you develop enough intuitions to get better at it" does not sound like an actionable concept. In my book, I argue that feature engineering operations are intuition driven (otherwise they get folded into the actual machine learning algorithm): intuitions both in the realm of general operations and hyper-specific intuitions regarding the domain and dataset being worked on. Some of the criticism the book has received resembles my earlier comment, and I feel the pain: teaching intuitions is hard.

It might all start with valuing intuitions more. North American society is not necessarily very keen on them. Latin Americans seem more connected to their intuitive selves in general, or at least "this feels right" can be considered a reasonable justification in my culture of origin. In the meantime, if you or your child were considering taking a Latin class, go for it. It teaches structured thinking and intuitions!

Theory-first vs. Reality-first people and the EM algorithm

Mon, 23 Aug 2021 00:50:34 -0700

Tags: philosophical, political, academic


As the years pass by, I am starting to realize that people's minds, when approaching a problem, seem to have a bias towards being either theory-first or reality-first. Theory-first people are more mathematically driven and equate the properties of things with the things themselves. If the model is appealing or beautiful enough, they will fight reality to have it change and come closer to the model. The manifestation of some religious, economic or political beliefs sometimes falls into this category.

On the other hand, we have people who are more observational and describe reality in all its infinite glory. While I see myself more within this camp, I'd say it can lead to a certain navel gazing. As full reality is too complex for our limited cognition to understand, it is difficult to reach actionable conclusions from such a mare magnum of data.

This model-driven gross oversimplification of people's cognition is, however, quite useful: at the personal level to understand our own biases, at the interpersonal level to understand the biases of our collaborators, and even at the community level. I'd argue that Leo Breiman's The Two Cultures hinges precisely on this point. (Now, you can be an empiricist who reaches actionable conclusions if you are not constrained by the limits of your own cognition and resort to computers to do the trick, but then the belief in the computers themselves is theory-driven, oops.)

But continuing with my musings mixing computational decision-making and politics, there is a very popular algorithm in statistics and machine learning that intermixes these two views: the expectation-maximization algorithm (EM for short). In this algorithm, we alternate two steps that improve a model being built. In the general case, EM makes it possible to solve equations with two sets of interlocking unknowns, and it does so by solving each set using the values for the other set from the previous step. This algorithm can be proved to converge to a local maximum of the likelihood.

In the case of clustering, however, an interpretation of EM is that the E step "believes" the model (the centers of the clusters computed so far, reclustering the data using them) and the M step believes the data (recomputing the centers of the clusters using the reclustered data). For the current discussion, the point is that without both types of mindsets, progress cannot be achieved. The theory-driven people will push for changes to reality, while the empiricists will force updates to the theory. The parallelism with EM might be far-fetched but I find it quite satisfying.
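A minimal sketch of that alternation, for one-dimensional clustering with hard assignments (a k-means-style simplification of full EM; the data points and starting centers below are invented for illustration):

```python
def hard_em(points, centers, iterations=10):
    """Alternate an E step (believe the centers: reassign the points)
    with an M step (believe the data: recompute the centers)."""
    for _ in range(iterations):
        # E step: assign each point to its nearest current center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # M step: move each center to the mean of its assigned points
        # (keeping the old center if a cluster ends up empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
print(hard_em(data, centers=[0.0, 5.0]))  # converges near [1.0, 10.0]
```

Even in this toy version, neither step alone gets anywhere: trusting only the model never moves the centers, and trusting only the data has no centers to recluster with.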

Google-programming Considered Harmful

Sun, 25 Apr 2021 14:17:28 -0700

Tags: technical, education


Perl has been my go-to language for scripts that need to be written fast and executed only once. I have become very effective at writing Perl one-liners that do things such as computing the most frequent URLs in Wikipedia:

bzcat enwiki-20180901-pages-articles-multistream.xml.bz2 | \
  perl -ne '@a=split(/https?\:\/\//,$_);@a=map{s/\/.*//;s/\|.*//;$_}@a;shift@a;print join("\n",@a)."\n" if@a'| \
  sort|uniq -c|sort -n|head

(Using a parallel implementation of bzip2, that takes about 5 minutes; the most popular websites include factfinder and the BBC, in case you were curious.)
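For non-Perl speakers, here is a rough Python sketch of what the one-liner and the shell pipeline do together: pull the host out of every http(s) URL and count the most frequent ones. The regex is my approximation of the Perl substitutions, not an exact port:

```python
import re
from collections import Counter

# Capture the host: everything after http(s):// up to a slash, pipe,
# whitespace or angle bracket (wiki markup delimiters).
host_re = re.compile(r"https?://([^/|\s<>]+)")

def top_hosts(lines, n=10):
    """Count the most frequent URL hosts across lines of wiki markup."""
    counts = Counter()
    for line in lines:
        counts.update(m.lower() for m in host_re.findall(line))
    return counts.most_common(n)

sample = [
    'See <ref>http://example.org/page|label</ref> and https://example.org/other',
    'Another link: http://en.wikipedia.org/wiki/Perl',
]
print(top_hosts(sample))  # [('example.org', 2), ('en.wikipedia.org', 1)]
```

Here `Counter.most_common` replaces the `sort | uniq -c | sort -n | head` part of the pipeline.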

But that wasn't always the case. The thing with Perl is that I remember very well when and how I learned it. It was 1999 and I had just started my PhD. The topic was information extraction from biomedical texts using statistical techniques. We were processing ten years of the journal of the European Molecular Biology Organization. Coming from my undergraduate thesis using Haskell for parsing Spanish texts, I started doing the detagging in Haskell. That didn't run fast enough. My other choice was Java, but by then I was behind on the work; I needed something fast to code that would show results. My advisor at the time, Dr. Hatzivassiloglou (who was actually my second advisor and not my thesis advisor; I think that deserves another post at some point), was very keen on Perl, so I decided to learn the language of the camel.

To learn it, I got two books: Learn Perl in 24 Hours and Effective Perl Programming. One was the quick intro to get going while the other dealt with more obscure advanced topics. With the basic book, I was hacking Perl furiously by the end of the week, and with the advanced book I managed to write more complex integration scripts. By the end of my PhD I had put in the 10,000 hours and was very proficient in the language, a skill that remains to this day.

The path Book => Practice => Mastery has stayed with me ever since, for the few new languages I have picked up since then (Scala, Python, Elm, Node.js).

These days, I'm learning Go and I'm trying to avoid the problem that has plagued my Python learning (I also started learning Python by reading a book, but things went south from there).

It has to do with Google-programming, the practice of doing web search upon web search to build the code. This way of programming is how things work now, and it gets the job done. But I find it ineffective. This post discusses my thoughts on the topic.

Why do I find it ineffective?

  • It involves a mindset change, going from "speaking programming in my head" to "speaking English in my head", which gets me out of "the flow".

  • The example code found online needs adapting to the codebase I'm writing. This adaptation can be quite onerous, again taking my concentration away from what I'm coding.

  • Example code in the Web might have security and/or stability issues introduced by the very nature of simplifying the code to fit an example.

  • Leaving the IDE for the Web browser exposes you to being ambushed by well-trained algorithms intent on stealing your attention. The advertisements you see will be very relevant to you and might derail your thought process for a long time.

  • This is a minor issue for most people, but it leaks information to the world and uses a lot of traffic and resources, both at Google and at the target websites such as Stack Overflow.

  • Most importantly, the learning doesn't happen in the form of "programming need -> code structure" but instead in the form of "programming need -> Google query". This is not a minor issue. Google-programming makes you dependent on using Google over and over again. It makes programming offline impossible, which is something desirable for a variety of reasons (long travel, living in areas with spotty internet, going off the grid to concentrate on a complex piece of code).

As such, for Go, I'm trying something else. Again, I started with two books (Go in Action and The Go Programming Language; great books, I highly recommend them). Then I'm using Google-programming to bootstrap working examples (you can see one here) and resisting the temptation to Google-program new code, instead searching my existing code base for working examples (that originated from Google-programming).

By the way, I'm not saying that searching on Google is bad for learning to program or that there is no place within the programming craft for web searches. The consensus seems clear that you are not a bad programmer if you do it. My points here might also be a generational issue. Other people seem to be fine with Googling (but I wonder how that author would have fared having to program a second Netty server). There is a quote in there attributed to Albert Einstein: "Never memorize something that you can look up". I don't think it applies to what we are talking about here. Einstein used some heavy tensor mathematics in relativity; I don't think he would advocate not internalizing such tools. But as with everything in life, YMMV!

Modern human nature is artificial

Sun, 06 Dec 2020 00:57:43 -0800

Tags: personal


Almost a year ago, my son was born. He came about one month earlier than expected and almost crashed NeurIPS, which I was attending together with my very pregnant wife.

The months that followed have been a lot of adaptation for the three of us. Annie went back to work in April and I had been taking care of the little boy since then, up until last week when we got a spot in daycare. Parenting has been a lot of work, and the pandemic has crushed some of our earlier plans of enjoying more time outdoors with the baby (but we have done fine, and people have had whole lives crushed by the pandemic, so there is nothing we can really complain about).

Besides mentioning that I'm now a very old new dad, this blog post is about some cognitive dissonance brought about by raising the boy. In my mind, I had a dichotomy of "natural vs. artificial". We are "nature" and then, at some point, I thought, we learn the artificial. But nature comes first, because we are animals. Thus the narrative of "returning to nature", of "going back to where we belong", put forth by people who love the outdoors (just in case we never talked about it: I grew up alone in a forest, I had enough outdoors for the rest of my life, so they are not my thing).

Seeing the boy grow and learn to make sense of the world, I realized that narrative is incorrect. His world is 100% artificial and synchronized with our current technology. The technology of the generation he is born into is glued to the neural connections he makes as he grows. For him, a light switch is part of the world the same way dew is to a spider. And given the pandemic and the fact that we live downtown, actual nature is odd to him. The first time he touched the bark of a tree he was truly puzzled. It was something new to him that felt... unnatural.

Thus, we can talk about "incorporating more nature into our lives" or "moving into nature", but the idea of "returning" only makes sense in an ancestral manner. And even then it is returning in a very limited sense (nobody is advocating living without an abode; even ancestral humans favored caves for a reason). But the point remains: the nature of modern humans is artificial. The connections generated in our brains at a very early age are there to make sense of carpets and furniture, of lighting fixtures and keyboards. That (seemingly obvious) fact had escaped me and I wanted to share it with you all.

I will be sharing more parenting stories as they come my way, trying to respect his privacy as much as possible. Cheers.

Introducing P.A.P.E.R.

Sat, 21 Nov 2020 22:39:37 -0800

Tags: floss, academic


I got my first paper published in 1996, at a conference in Antofagasta, Chile (the bus trip there was gruelling; that might be worth talking about in another post). It was in modeling and simulation, joint work with Nicolas Bruno (we both ended up at Columbia University for our PhDs, but that's yet another blog post). From there I went on to do my undergraduate thesis on Spanish parsing using LFGs in Haskell. Later on I continued working on modeling and simulation before starting the PhD. During the PhD I went through three advisers, two in my first year, during which I worked on word sense disambiguation before moving to natural language generation in the medical domain. My final years of PhD were dedicated to natural language generation for intelligent analysis. At IBM Research, I moved into question answering, initially in the Human Resources domain, with a detour into expert search, before settling on the Watson Jeopardy! project. Each change of topic and domain involved extensive background research to get myself up to speed. After I left IBM and started doing consulting work, it got even worse, so I won't bore you with the details. How to keep track of all that information in my head?

In 2012, I came to terms with the fact that I had to spend less time reading research and more time tracking what I read. Every time we read something, it is for a purpose. Without keeping extra metadata, it starts to become akin to not having read the papers at all. It is not that I have a particularly bad memory, but after a few hundred papers, the names of the authors and the titles start escaping me. Interestingly, I do remember under which circumstances I found a paper and where (the place, the device or the printout) I read it. Therefore, I decided to use a tool to keep track of such metadata.

After a search through the various available tools, I decided to write my own, which I did, and I have been using it for many, many years. I credit surviving the extensive number of sources I had to go through when writing my book to having this tool, which I gave the silly name P.A.P.E.R. (Pablo's Artifacts and Papers Environment plus Repository). This month I open sourced the tool at a lightning talk at Vancouver's Learn Data Science meetup. This post describes the tool, which is seeking beta testers, users and contributors (your help, basically).

I wrote a book!

Sun, 08 Nov 2020 20:09:07 -0800

Tags: academic, personal


I haven't been blogging much as I was busy writing The Art of Feature Engineering, a book that came out this summer from Cambridge University Press. Here are some thoughts about the book-writing process itself, condensed and expanded from its introduction, along with some thoughts on earlier criticism it has encountered.

My interest in feature engineering started while working together with David Gondek and the rest of the Jeopardy! team at the IBM T.J. Watson Research Center in the late 2000s. The process and ideas in the book draw heavily from that experience. The error analysis sessions chaired by David Ferrucci were grueling two-day affairs of looking at problem after problem and brainstorming ideas of how to address them. It was a very stressful time; hopefully this book will help you profit from that experience without having to endure it. Even though it has been years since we worked together, the book exists thanks to their efforts, which transcend the show itself.

After leaving IBM, during the years I have been doing consulting, I have seen countless professionals abandon promising paths due to a lack of feature engineering tools. I wrote the book for them.

Summary for Leo Breiman: The Two Cultures

Sat, 23 Mar 2019 08:13:20 -0700

Tags: academic


Vancouver has a great machine learning paper reading meetup, aptly named Learn Data Science. Every two weeks, the group gets together and discusses a paper voted on by the group at the previous meeting. There are no presentations; we just go person by person around the room discussing the paper. I find the format very positive and I try to attend whenever the selected paper is aligned with my interests and I have time (lately I have not been able to attend much, as I have been writing a feature engineering book, but I'll blog about that another day). At the last meetup the group picked a paper I had proposed, so this blog post is a short summary of the paper, to help people read it.

In 2005, Prof. Leo Breiman (of Random Forests fame) passed away after a long battle with cancer. He was 77 years old. This information comes from his obituary, which also highlights a very varied life that is captured in the paper for the Learn DS meetup of March 27th. Four years before passing away, when he was well established and well regarded in the field, he wrote Statistical Modeling: The Two Cultures, a journal article for Statistical Science, Volume 16, Issue 3 (2001), pages 199-231. The paper captures the field at a pivotal moment in its history, as it is about that time that "Data Science" took off (for example, Columbia University started publishing The Journal of Data Science in 2003). The paper itself is written in a clear tone with an unambiguous message, but it was published with four letters and a rejoinder. The subtleties in the letters and rejoinder present the most interesting material, particularly as the letter writers included well-known statisticians such as D.R. Cox and Emanuel Parzen. I will summarize this content in turn.

The Value of Outliers

Tue, 03 Oct 2017 02:49:30 -0700

Tags: academic, philosophical


Moving coast to coast has taken most of my energy since the last post, but we're finally established in Vancouver, so I can get back to blogging more regularly. This post is about a piece of advice from the classic book How To Lie With Statistics, extended to groups and society in general. It continues ideas from two other blog posts (What Random Forests Tell Us About Democracy and Have Humans Evolved to be Inaccurate Decision Makers?) regarding statistics, decision making and politics.

Let's sample 10 (pseudo)random numbers from a normal distribution centered around 100:

       >>> import random
       >>> [int(random.normalvariate(100, 10)) for _ in range(10)]
       [85, 99, 97, 78, 87, 93, 91, 112, 90, 91]

Now, if you look at these numbers, you'll be tempted to conclude that 112 is just... wrong. A measurement error. That it does not belong there. However, the average of the 10 numbers is 92.3, still far from the real mean (100) but within one sigma of the distribution we used to generate the sample (10). If we drop 112, the average of the remaining numbers goes down to 90.1, making it a worse estimator than before.
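The arithmetic is easy to check yourself (a short sketch reproducing the two averages above):

```python
sample = [85, 99, 97, 78, 87, 93, 91, 112, 90, 91]

# Average with the suspected "outlier" included.
mean_all = sum(sample) / len(sample)

# Average after dropping 112.
without_outlier = [x for x in sample if x != 112]
mean_trimmed = sum(without_outlier) / len(without_outlier)

print(round(mean_all, 1), round(mean_trimmed, 1))
# 92.3 90.1: dropping the "outlier" moves us further from the true mean of 100
```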

I believe the same happens in the realm of ideas. If each person has a piece of the truth, shutting down their contributions, irrespective of how far from the truth they might sound, will lead you farther away from the truth. A similar concept in the business world is Groupthink. This of course does not mean that outliers need to dominate, just that they should not be eliminated completely.

And if you haven't read the 1954 book by Darrell Huff, it is very short and makes for a great read. Its starting premise, that democracy needs voters informed on basic statistical matters, is as up to date in our data-driven world as ever.

More Like This Queries on SQLite3 FTS

Mon, 12 Dec 2016 03:05:48 -0500

Tags: open source


SQLite is a great embedded SQL engine, and part of the Android platform. It has an extension, FTS ("Full Text Search"), that enables Boolean search queries (that is, mostly unranked). For small collections of documents (like a blog), Boolean searches can be a viable temporary solution until a full solution (like Elasticsearch) can be deployed.

A common type of query supported by Elasticsearch is the More Like This (MLT) query, which allows you to find documents similar to given ones. This type of query is also very useful for blogs, for example. If you're using SQLite FTS, you can construct a query that approximates MLT by issuing an OR for all terms in a document (in FTS, the terms are lowercase, and uppercase 'OR' or 'AND' are considered logical Boolean operators). The only issue is obtaining the terms as assigned by FTS. To do so, it is necessary to access the FTS tokenizer by creating a virtual table, for example:

CREATE VIRTUAL TABLE tok1 USING fts3tokenize('simple');

Then, given a document, its terms can be extracted by doing (in PHP):

$tokens = $db->query("SELECT token FROM tok1 WHERE input='" . SQLite3::escapeString($all_text) ."';");

The query itself can be assembled by taking an OR of the set of terms:

$query = "";
$query_array = array();
while($row = $tokens->fetchArray()){
    $token = $row['token'];
    if(!isset($query_array[$token])){
        $query_array[$token] = 1;
        if(strlen($query) > 0){
            $query = $query . " OR ";
        }
        $query = $query . $token;
    }
}

This query can then be run against the FTS table to provide makeshift MLT functionality.
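The same query assembly translates directly to other languages. A Python sketch mirroring the PHP logic above (it assumes you already have the token list, however it was obtained from the tokenizer table):

```python
def build_mlt_query(tokens):
    """Deduplicate tokens (preserving first-seen order) and join them
    into an FTS Boolean OR query approximating More Like This."""
    seen = set()
    unique = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique.append(token)
    return " OR ".join(unique)

print(build_mlt_query(["sqlite", "full", "text", "search", "sqlite"]))
# sqlite OR full OR text OR search
```

As with the PHP version, deduplication matters: repeating a term in a Boolean OR adds nothing and bloats the query.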

Some ideas for the A.I. XPRIZE

Wed, 23 Nov 2016 00:31:29 -0500

Tags: research


As I'm married to a current IBM employee, I'm disqualified from participating in the AI XPRIZE sponsored by IBM. So I'm putting my ideas in this blog post; may they help inspire other people.

The XPRIZE follows the path of other multi-year challenges that have resulted in great accomplishments such as commercial rockets. The AI challenge diverges from previous challenges by being completely open ended: any major AI breakthrough can win the 3M USD prize.

What I'd like to see is a team tackling the improvement of scientific communication by leveraging recent advances in machine reading and taking them to the next level. I would like to see some work on scientific metadata (possibly in the direction of the Semantic Web) that captures the main discoveries in a scientific paper. This metadata should be feasible for a human to produce; the machine reading aspect is there just to bring enough value to the metadata during the transition to entice humans to self-annotate.

The case for this improvement lies in the number of researchers in many key fields who physically have no time to keep up to date with published results. A high-level summary, or the possibility to query "has anybody applied method X to problem Y", would be invaluable. Moreover, this type of setting allows for very constrained inference, simplifying scientific discovery for sometimes obvious, sometimes overlooked new findings.

I'm no stranger to this approach. My most cited paper came to be through my contribution to a multidisciplinary project on automatic extraction and inference in the genomics domain (some form of automated inference was realized many years after I left the project).

This is further simplified by the reporting standards of many scientific disciplines. Take, for example, the one from the American Psychological Association (thanks to Emily Sheepy for pointing me to that report). Such standards specify the types of contributions and the information to be expected in them, even down to the headers of each section.

I believe all these pieces together have a chance of, if not winning, at least doing well in the competition. And irrespective of the competition, this technology deserves to exist and help accelerate human discovery.

Regarding the business aspects, it would be nice if the metadata format were an open format and the commercialization centered on extracting metadata from existing publications and on authoring tools. Extracting metadata and doing inference for profit is somewhat contrary to the goal of accelerating research, but that's speaking as a scientist, not a business person.

Let me summarize the concept with an example:

  1. Given an existing paper, for example Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network (Toutanova et al., 2003), produce metadata of this type:

       • Part-of-Speech tagging (as a link to an ontology)

       • Conditional Markov Model (or CMM with features x, y, z; all linked to an ontology)

       • 97.24% accuracy over the Penn Treebank WSJ (the metric and the corpus are also links)

  2. These entries can be further populated by the scientists upon publication, maybe with the help of an authoring tool.

  3. From this metadata, a system can answer "what is the best performance for POS and what technique does it use?" but also "POS and role labeling are similar problems (fictional fact): both use similar techniques and both rank their performance similarly; however, the best performance in POS is using skip decision lists (also a fictional fact) but that technique has never been attempted on role labeling"

I wish the participants the best of luck and look forward to seeing great technology developed as a result of the challenge!

Older Posts