VISION: consuming big data – lots of algos…side of fries…one summer internship

tresata

Sep 9, 2013

My summer internship with Tresata began on Memorial Day when I met Davis Dulin at the Smelly Cat Café. There, Davis introduced me to some of the software I would be working with that summer, including the underlying technologies – Linux and Scala. It was certainly a little frightening at first: I had never worked in Linux before, and Java was the only language I had ever coded in. Fortunately for me, as I learned, there is a lot of crossover between Java and Scala, so understanding the broad ideas of the language was not difficult. As for the finer points of the language, that’s where Koert’s tutelage proved invaluable.

Once I finally made it to the Tresata HQ the next day, I was introduced to the Tresata team, and “team” is not a word I use lightly. The community Tresata has built is truly a great one. That week, Jack and Will did their best to explain to me how Hadoop works and to answer the million other questions I had about distributed computing. I also started working on my ping pong game around the same time. I am proud to say that my game steadily improved throughout the summer (although not enough to withstand a three-game thrashing by Abhi on my last day).

Over the next few weeks, I continued to learn the software built by Tresata, and soon I was introduced to my summer project. My goal was to analyze six data files from multiple sources, extract the pertinent information from them, run Tresata’s matching algorithms on them, and finally analyze the output.

This process started with me exploring samples of the data files to find inconsistencies in the data: entries that made no sense (like AADFE for a first name) or other peculiarities (like April being the most popular name in one file when James was the most popular name in every other file). During this phase of my project, I created a slide show of eye-catching graphics that I later presented during Tresata Talks (Friday’s excuse to back away from the computers for an hour and learn something about what other people are working on, ranging from grocery pricing to personal finance).
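As a rough illustration of this kind of sanity check (not the code used during the internship), here is a minimal Python sketch. The sample files, the `KNOWN_NAMES` reference list, and the helper names are all hypothetical: the sketch flags first names that are missing from a reference list and reports any file whose most popular name differs from the usual one.

```python
from collections import Counter

# Hypothetical samples: each "file" is reduced to its first-name column.
files = {
    "file_a": ["James", "Mary", "James", "AADFE", "John"],
    "file_b": ["James", "Linda", "James", "April"],
    "file_c": ["April", "April", "James", "April", "Mary"],
}

# Hypothetical reference list of plausible first names.
KNOWN_NAMES = {"james", "mary", "john", "linda", "april"}


def unknown_names(names):
    """Return entries that do not appear in the reference list."""
    return [n for n in names if n.strip().lower() not in KNOWN_NAMES]


def top_name(names):
    """Return the most frequent name in one file."""
    return Counter(names).most_common(1)[0][0]


def outlier_files(files):
    """Return files whose most frequent name differs from the usual one."""
    tops = {fname: top_name(names) for fname, names in files.items()}
    # The "usual" top name is the one that wins in the most files.
    usual = Counter(tops.values()).most_common(1)[0][0]
    return {fname: top for fname, top in tops.items() if top != usual}


suspicious = {fname: unknown_names(names) for fname, names in files.items()}
outliers = outlier_files(files)
print(suspicious)
print(outliers)
```

On real data the reference list would be far larger, and a flagged entry is only a candidate for a closer look, not proof of an error.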

Next up was extracting the information we wanted out of the files. This phase of the project is also called “That Time I Messaged Koert A Lot And He Had Extreme Patience To Help Me Through This” (of course I’m kidding! I don’t think I was that bad, but nevertheless I learned an invaluable amount from Koert, who was a great help). He helped me run the software to convert, map, and match data from across the right files. Using this output, I then invoked a Scalding program that ran Tresata software directly in Hadoop. Once that was done, the most exciting phase of my project began: running Tresata’s very powerful Data Fusion Engine – TREE. For this part, my guide was Andy, who I knew I would enjoy working with when I made a Power Rangers reference (I speak almost exclusively in movie and TV references) and he knew exactly what I was talking about. For the next two and a half weeks, Andy trained me to get the algorithms in TREE tuned and running on data TREE had never seen before. We ran into many unexpected twists and turns (what they don’t teach you in school is that as sexy as Big Data is, writing software to work on really large data isn’t easy), but with a week left to go in the summer, we got the results we desired.

Like life, my summer came full circle: I ended it the way I started it, looking at data to make sure it made sense given the problem I was meant to solve. The task here was much the same as my first task: to look for any inconsistencies and create graphics representing the output.

To all observers, save for Richard and me, it would seem that getting output out of a piece of software so powerful that only one of them exists on Hadoop was my crowning achievement of the summer.

But Richard and I know that the pinnacle of my summer experience was fulfilling my role as the office “Lunch Coach” and putting down a bacon cheeseburger and a side of fries at Five Guys for my final meal. It was a test only the best interns can survive! (And for the record, I did all interns proud.) In all seriousness, though, my summer internship at Tresata taught me more than I could have hoped for and gave me a chance to be part of a great, hard-working, and positive team trying to build something special.