Popular Programming Languages in Data Engineering: Insights From Two Industry Professionals
The data world is constantly evolving, and with this change comes a growing need for knowledgeable professionals. How much talent, exactly, is needed to fill the demand?
Last year, the demand for data scientists increased by an average of 50% across healthcare, telecommunications and media/entertainment, as well as within the banking, financial services and insurance (BFSI) sectors, according to a recent Dice Report.
As companies continue to ramp up their capacity for big data and attempt to streamline digitized business models, they’re turning to individuals with the knowledge and skill set needed to help them make sense of all that information. If you’re looking to transition into the data field or hoping to upskill, you might be wondering which programming languages and tools to focus on.
To help you get started on the right foot, Trilogy Education Services, a 2U, Inc. brand, recently hosted a Tech Talk on popular programming languages in data engineering. 2U’s Alison Abbington moderated a conversation between two industry leaders, Nelson Kandeya, Data Engineer at Labatt Breweries of Canada, and Dan Patwardhan, Senior Data Engineer at Scotiabank, about the coding languages they’ve mastered to be successful in their roles and the resources you can use to strengthen your own skills.
Read their key insights from the event, as well as useful data-related resources recommended by the speakers.
Data science, data analytics and data engineering — what’s the difference and how do they overlap?
Dan Patwardhan: A lot of people try to put everything under the umbrella of data analytics, but data analytics means making sense of data and presenting it. Most of the time data analysts, or data strategists as the industry now calls them, tell stories with data; but you cannot make sense of data until you have it. To provide reliable data, you need data engineers. To run scripts on that data and extract meaningful information, you need data scientists. Very rarely do you see data engineers running that kind of code. Usually they clean the data and make it reliable so data scientists don’t have to worry about it.
Nelson Kandeya: In data analytics, you create front-end, user-facing applications and dashboards, present data, and get to interact more with stakeholders.
What does your day-to-day look like and what programming languages do you use the most?
Nelson Kandeya: My day-to-day opens with daily scrums. My company has been trying to understand consumer behavior, and especially with COVID-19, we’ve been collaborating more on a daily basis. Our scrums look at the tasks that need to be completed. We usually go through the tasks that have been defined, for example working on new development or improving existing processes, and work with different data sources to assess the feasibility of new projects. In almost every case, we’re trying to automate tasks and move them into the production life cycle.
Regarding programming languages, the past two years have reshaped the data landscape, with cloud computing narrowing down the languages we use. In our company, we commonly use SQL, Python, R and Scala. The main drivers behind these choices are security, cost, efficiency and the ability to collaborate across programs. We need an environment that allows us to work together with international team members while building scalable applications.
Dan Patwardhan: We also start with daily scrums and then check our tasks for the day. We take input from data analysts or data strategists and start working on it; most of the time it’s essentially a data engineering group whose task is to bring in data from different databases. Bringing all this data together is always a challenge, as is getting the right tools and maintaining security. Since we’re a bank, security is always going to be a big thing. Second, we do data profiling to see whether the data is correct or has missing values, clean it up and build a pipeline. We don’t want any process to run under an individual’s ID; everything has to be automated and shouldn’t depend on a single person. Building these automated pipelines and socializing them with data scientists is the final step.
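To make the profiling and cleanup step concrete, here is a minimal sketch of the kind of check a data engineer might run before data enters an automated pipeline. The file names, column names and the negative-balance rule are illustrative assumptions for this example, not the speakers’ actual process.

```python
# A minimal data-profiling sketch: flag missing values, duplicates and
# out-of-range records before the data is staged for a pipeline.
# The file names, columns and business rule below are illustrative assumptions.
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Return a few simple data-quality metrics for a transactions table."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
        # Hypothetical business rule: balances should not be negative.
        "negative_balances": int((df["balance"] < 0).sum()),
    }

if __name__ == "__main__":
    df = pd.read_csv("transactions.csv")  # raw extract from a source system (assumed)
    print(profile(df))

    # Simple cleanup before the load step: drop exact duplicates and
    # rows missing a customer identifier.
    clean = df.drop_duplicates().dropna(subset=["customer_id"])
    clean.to_csv("transactions_clean.csv", index=False)  # staged for the pipeline
```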
As far as programming languages go, SQL remains the bread and butter. We’re trying to find alternatives, but it’s not going away, simply because it’s so easy to use and non-technical people can adapt to it quickly. We also use Python and SAS, since most of the machine learning done in the 90s was done in SAS, but more and more we’re trying to find open-source alternatives to SAS and other data science tools. R is a good language, but Python is so multi-faceted. We would like to go to the cloud, but security is a big concern when you’re putting data in: the provider, Google for instance, will have access to it, and banking is a heavily regulated industry. For data visualization, we use Power BI and Tableau; the former integrates well with other Microsoft Office tools and is cheaper than Tableau.
What are the pros and cons to using Python, SQL and other data tools?
Dan Patwardhan: I don’t have a favorite tool, because the choice is driven by requirements. Python is fast and efficient but doesn’t have storage; it still needs a third party to store data. Even so, Python is the next best thing. In my day-to-day role, I come across some statistical modeling, and SQL is not very efficient there. If you want descriptive analytics, in Python all you have to do is run two or three lines of code, whereas in SQL your lines of code begin to grow. SQL is good for structured data and easy to learn, but when you want to do more complex work with the data you need to integrate it with Python, which is the front runner. Some people have biases when it comes to R, and many shy away from it because of its complexity. In Python, you need to create your own niche; in banking, for example, Python is used differently than in other industries. SQL is driven by the vendors who push it out, but once you know one dialect of SQL you can adapt and adjust to other environments.
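As a sense of the contrast Dan describes, here is a hedged example of descriptive analytics in a few lines of Python; the CSV file and its columns are assumptions made for illustration.

```python
# Descriptive analytics in a couple of lines of Python.
# "sales.csv" and its columns are illustrative assumptions.
import pandas as pd

sales = pd.read_csv("sales.csv")

# One call returns count, mean, std, min, quartiles and max for every numeric
# column; the SQL equivalent needs a separate aggregate (COUNT, AVG, STDDEV,
# MIN, MAX, percentiles) written out for each column.
print(sales.describe())
```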
My background is in engineering. I have worked in banks all of my career, and I’ve worked in the data field as a data analyst, data engineer and so on. The only way to improve your skills is by doing. You can do a lot of theoretical study; nowadays knowledge is free and you can find many technical lectures and seminars on YouTube. The only way you really improve, though, is by practicing your logic. Websites like Kaggle let you pick up any project and put your own spin on it, which will also help when it comes time to market your resume. If something looks very easy in a book, it’s going to be that much more difficult when you actually implement it. That’s the mindset you have to develop, and it will be key to success in your job.
Nelson Kandeya: My background is in computer science and statistics, and my early data work was mostly around trend analysis. Different organizations have different names for different roles, and this can be confusing for applicants. When I was looking for a role that fit, I looked toward data engineering. Each time I tried to match myself to a posting, I would look at the job description and the role itself; you need to do this to see which focus areas the company is particularly looking for.
My role is focused on data engineering with some aspects of data science, so I needed an understanding and appreciation of the work in both areas. I currently have a core role: I push data to the data science team, who run complex models and send data back to us. We then integrate that data into the system and make sure it feeds into the stakeholders’ deliverables; we take feedback and consolidate it to get a high-level view of what stakeholders require.
How do you keep up with ever-evolving technology? Are you constantly having to learn new programming languages?
Dan Patwardhan: I’m a subscriber to Medium.com, a website with articles on almost every subject out there. In a day-to-day job, a requirement might come up that your existing set of tools can’t fulfill, so you’ll have to go off and find a solution. Once you start working in a role, sometimes your knowledge won’t be sufficient and you’ll have to go out and find what works; the problem is the mother of knowledge in this case. If I come across an obstacle, the first thing I do is ask around on my team. I’m lucky to work with brilliant people. If that doesn’t work, you can always use Google; we’re lucky enough to live in an era where knowledge is so abundant. I don’t try to reinvent the wheel. In this field, you’re always working to a deadline and you have to find the fastest solution; it might not be the best, but it will be fast.
Nelson Kandeya: My employer pushes my skill set. I’ve been fortunate enough to work with a team where I’m allowed to experiment. My employers will say, “This is what we have, and this is the problem we need to solve.” If you then encounter hurdles and know of an external application that could resolve them, you can discuss it and figure out how best to close the gap. Some migrations have pushed us to upskill and change our mindsets. When a problem comes up, you might have to build a workaround. For example, we have an internal data engineering guild of about 25 people, so when you have a problem, chances are someone in the group has encountered something similar and has suggestions. If that fails, we use Google. One of the silver linings of COVID-19 is that, since we’re all working from home, we’ve been able to ramp up our collaborative efforts.
Do you feel you have the freedom to fail in your roles? Can you name a time when things weren’t as successful for you — what happened?
Nelson Kandeya: As long as you’re not dropping any production tables, you have the freedom to fail. Failure itself is a lesson. There’s no micromanagement in our company; instead, you get a high-level requirement and are expected to resolve it. The things that hang over us are efficiency and cost. Code is like math: if you do the wrong calculation it will still produce a result, but that result might not bring value to the business requirement.
My failures have usually been the result of trying to follow in someone else’s footsteps. At first, I was timid about speaking up. If you’re confident and know the ways of working around a particular process, you need to say, “This is inefficient.” But likewise, you need to propose a way that does work. Most of our work happens in development environments where you test, so you don’t affect production processes that run day to day. You need to accept that the output might be deemed incorrect and have the patience to understand where it failed and where you got it right. The thing about data and coding is that you’re prone to making mistakes and misinterpreting data.
Dan Patwardhan: Most of our testing is done in development environments as well, so you’re not at risk of making a huge mistake. As long as you’re not deleting production data, you’re good. Making mistakes is fine, but you have to speak up if you’re not clear on something. At the beginning of your career, you might think others will think less of you for asking certain questions, but you have to try to get past that because it will help you in the long term. In two years, you will be helping someone else.