What is Data Science? Become a Data Scientist | Microsoft Azure (2024)

Data scientists follow a similar process to complete their projects:

1. Define the business problem

The data scientist works with stakeholders to clearly define the problem they want to solve or question they need to answer, along with the project's objectives and solution requirements.

2. Define the analytic approach

Based on the business problem, the data scientist decides which analytic approach to follow:

  • Descriptive for more information about the current status.
  • Diagnostic to understand what is happening and why.
  • Predictive to forecast what will happen.
  • Prescriptive to understand how to solve the problem.

3. Obtain the data

The data scientist identifies and acquires the data needed to achieve the desired result. This could involve querying databases, extracting information from websites (web scraping), or obtaining data from files. The data might be internally available, or the team might need to purchase the data. In some cases, organizations might need to collect new data to be able to successfully run a project.
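As a minimal sketch of two of the sources mentioned above (a file and a database), the following Python example loads rows from CSV text and from a SQL query. The `orders` table, its columns, and the sample values are hypothetical, and an in-memory SQLite database stands in for a real one so the example is self-contained.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; in practice this would be read from a file on disk.
CSV_TEXT = "order_id,amount\n1,19.99\n2,5.00\n"


def load_from_csv(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_from_database(connection):
    """Query the (hypothetical) orders table and return rows as dictionaries."""
    connection.row_factory = sqlite3.Row
    cursor = connection.execute(
        "SELECT order_id, amount FROM orders ORDER BY order_id"
    )
    return [dict(row) for row in cursor]


# Build an in-memory database so the sketch needs no external setup.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
connection.executemany("INSERT INTO orders VALUES (?, ?)", [(3, 42.5), (4, 7.25)])

csv_rows = load_from_csv(CSV_TEXT)
db_rows = load_from_database(connection)
connection.close()
```

Note that the CSV values arrive as strings while the database values arrive as numbers, which is one reason the cleaning step that follows is usually needed.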

4. Clean the data, also known as scrubbing

Typically, this step is the most time-consuming. To create the dataset for modeling, the data scientist converts all the data into the same format, organizes the data, removes what's not needed, and replaces any missing data.
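The cleaning operations described above can be sketched in a few lines of Python. The records and column names here are hypothetical, and filling gaps with the mean of the observed values is only one possible strategy for replacing missing data, chosen for illustration.

```python
from statistics import mean

# Hypothetical raw records: inconsistent formats, an unneeded column, a missing value.
raw_rows = [
    {"city": " Seattle ", "delay_minutes": "12", "notes": "n/a"},
    {"city": "seattle", "delay_minutes": "", "notes": ""},
    {"city": "BOSTON", "delay_minutes": "30", "notes": "storm"},
]


def clean(rows):
    """Return rows with consistent formats, only the needed columns, and no gaps."""
    cleaned = []
    for row in rows:
        delay = row["delay_minutes"].strip()
        cleaned.append({
            # Same format everywhere: trimmed, title-cased city names.
            "city": row["city"].strip().title(),
            # Numeric delays; empty strings become None for now.
            "delay_minutes": float(delay) if delay else None,
            # The "notes" column is not needed for modeling, so it is dropped.
        })
    # Replace missing delays with the mean of the observed ones (an assumption).
    observed = [r["delay_minutes"] for r in cleaned if r["delay_minutes"] is not None]
    fill_value = mean(observed)
    for r in cleaned:
        if r["delay_minutes"] is None:
            r["delay_minutes"] = fill_value
    return cleaned


cleaned_rows = clean(raw_rows)
```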

5. Explore the data

Once the data is cleaned, the data scientist explores it and applies statistical techniques to reveal relationships among the data features, and between those features and the value they predict (known as a label). The label can be a quantitative value, such as the future financial value of something or the duration of a flight delay in minutes.

Exploration and preparation typically involve a great deal of interactive data analysis and visualization—usually using languages such as Python and R in interactive tools and environments that are specifically designed for this task. The scripts used to explore the data are typically hosted in specialized environments such as Jupyter Notebooks. These tools enable data scientists to explore the data programmatically while documenting and sharing the insights they find.
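One simple way to look for a relationship between a feature and a label is the Pearson correlation coefficient. The sketch below computes it by hand in Python; the feature names and values are hypothetical, and in a notebook this would normally be paired with plots rather than used alone.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical features and label (flight delay in minutes).
features = {
    "distance_km": [300, 800, 1200, 2000, 2500],
    "departure_hour": [6, 9, 13, 18, 21],
}
label = [5, 10, 20, 35, 50]

# Correlation of each feature with the label, as a first exploratory check.
correlations = {name: pearson(values, label) for name, values in features.items()}
```

A value near 1 or -1 suggests a strong linear relationship; a value near 0 suggests little linear relationship, although a nonlinear one may still exist.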

6. Model the data

The data scientist builds and trains prescriptive or descriptive models, then tests and evaluates each model to make sure it answers the question or addresses the business problem. At its simplest, a model is a piece of code that takes an input and produces an output. Creating a machine learning model involves selecting an algorithm, providing it with data, and tuning hyperparameters. Hyperparameters are adjustable parameters that let data scientists control the model training process. For example, with neural networks, the data scientist decides the number of hidden layers and the number of nodes in each layer. Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of hyperparameters that results in the best performance.
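To illustrate hyperparameter tuning, the sketch below runs a small grid search in Python. It assumes a k-nearest-neighbors regressor, where the number of neighbors `k` is the hyperparameter, and picks the value with the lowest error on a held-out validation set. The data and the candidate values are hypothetical; real projects would typically rely on a machine learning library instead of hand-written code.

```python
def knn_predict(train, x, k):
    """Predict y for x as the mean label of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, validation, k):
    """Average squared prediction error on the validation set for a given k."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in validation]
    return sum(errors) / len(errors)


def grid_search(train, validation, candidate_ks):
    """Try every candidate k and return the best one along with all scores."""
    scores = {k: mean_squared_error(train, validation, k) for k in candidate_ks}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Hypothetical data that roughly follows y = 2x.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.0)]
validation = [(2.5, 5.0), (4.5, 9.0)]

best_k, scores = grid_search(train, validation, candidate_ks=[1, 2, 3])
```

The model is scored on data it was not trained on, so the chosen hyperparameter reflects how well the model generalizes rather than how well it memorizes the training set.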

A common question is "Which machine learning algorithm should I use?" A machine learning algorithm turns a dataset into a model. The algorithm the data scientist selects depends primarily on two different aspects of the data science scenario:

  • What is the business question the data scientist wants to answer by learning from past data?
  • What are the requirements of the data science scenario, including the accuracy, training time, linearity, number of parameters, and number of features?

To help answer these questions, Azure Machine Learning provides a comprehensive portfolio of algorithms, such as multiclass decision forest, recommendation systems, neural network regression, multiclass neural network, and K-Means clustering. Each algorithm is designed to address a different type of machine learning problem. In addition, the Azure Machine Learning Algorithm Cheat Sheet helps data scientists choose the right algorithm to answer the business question.

7. Deploy the model

The data scientist delivers the final model with documentation and, after testing, deploys the model into production so it can play an active role in the business. Predictions from a deployed model can then be used for business decisions.
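At its simplest, deploying a model can mean saving the trained parameters so that a separate production process can load them and make predictions. The Python sketch below assumes a hypothetical linear model stored as JSON in a temporary folder; the file name and parameter values are illustrative only, and managed platforms handle this step very differently.

```python
import json
import os
import tempfile


def save_model(params, path):
    """Write model parameters to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle)


def load_model(path):
    """Read model parameters back, as a production service might."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def predict(params, x):
    """Apply the (hypothetical) linear model to a single input."""
    return params["slope"] * x + params["intercept"]


# Hypothetical parameters produced by an earlier training step.
trained = {"slope": 2.0, "intercept": 0.5}

with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "model.json")
    save_model(trained, model_path)
    deployed = load_model(model_path)

prediction = predict(deployed, 4.0)
```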

8. Visualize and communicate the results

Visualization tools like Microsoft Power BI, Tableau, Apache Superset, and Metabase make it easy for the data scientist to explore the data and present the findings in visualizations that non-technical audiences can readily understand.

Data scientists might also use web-based data science notebooks, such as Zeppelin Notebooks, throughout much of the process for data ingestion, discovery, analytics, visualization, and collaboration.

Data science methods

Data scientists use statistical methods such as hypothesis testing, factor analysis, regression analysis, and clustering to unearth statistically sound insights.
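As an example of one of these methods, the Python sketch below performs simple linear regression with ordinary least squares, fitting a straight line to pairs of values. The data points are hypothetical and were chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations lying on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
```

With real, noisy data the fitted line will not pass through every point, and the data scientist would also examine the residuals before trusting the fit.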

Data science documentation

Although data science documentation varies by project and industry, it generally includes documentation that shows where the data comes from and how it was modified. This helps other members of the data team effectively use the data moving forward. For example, documentation helps business analysts use visualization tools to interpret the dataset.

Types of data science documentation include:

  • Project plans to define the project's business objectives, evaluation metrics, resources, timeline, and budget.
  • Data science user stories to generate ideas for data science projects. The data scientist writes the story from the stakeholder's point of view, describing what the stakeholder would like to achieve and the reason the stakeholder is requesting the project.
  • Data science model documentation to document the dataset, the experiment's design, and the algorithms.
  • Supporting systems documentation including user guides, infrastructure documentation for system maintenance, and code documentation.