Defensive Python Coding for Data Scientists

Do modern developers need to be trained in secure coding techniques? This topic was raised during a recent round table hosted by CDOTrends when a participant asked if they should even hire developers who can’t write secure code.

As news of cybersecurity breaches and hacking incidents continues to grab headlines globally, there is no doubt that a programmer would do well to adopt secure coding practices.

But should data scientists care? After all, many use Python to manage and manipulate data as part of their daily work. And while the code is generally not accessible to the public, there is no guarantee that it will remain in the corporate network. Plus, writing secure code, like brushing your teeth or driving defensively, is a valuable lifelong skill. So why not?

Tips for writing secure Python code

In a blog post released last month, cloud-native application security provider Snyk outlined various security best practices for Python as part of its updated 2021 cheat sheet. Which of these best practices should data scientists know? I highlight four below and explain why.

Sanitize external data

According to Frank Fischer of Snyk, external data is an attack vector for any application. While this is less of a problem for data scientists than for a developer whose Python application runs on the company’s website, it is entirely plausible that an injection attack could arrive through data poisoned as part of a watering hole or spear phishing attempt.

The best defense against this is to carefully sanitize data from external sources and ensure that the entries conform to the expected data structures. Bleach is a popular HTML sanitizer library for content pulled from a website, while major frameworks like Flask or Django come with their own sanitizer functions in the form of flask.escape() and django.utils.html.escape(). Use them.
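As a minimal sketch of this idea, the snippet below checks one incoming record against an expected structure and escapes its text fields. It uses html.escape() from the Python standard library as a stand-in for the framework functions mentioned above; the EXPECTED_FIELDS schema and the clean_record() helper are hypothetical names chosen for illustration.

```python
import html

# Hypothetical schema: the fields we expect and the type each must have.
EXPECTED_FIELDS = {"name": str, "age": int}


def clean_record(raw):
    """Validate one external record and HTML-escape its text fields."""
    if set(raw) != set(EXPECTED_FIELDS):
        raise ValueError(f"unexpected fields: {sorted(raw)}")

    cleaned = {}
    for field, expected_type in EXPECTED_FIELDS.items():
        value = raw[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"{field!r} should be {expected_type.__name__}")
        # Escape <, >, & and quotes so markup in the data is rendered inert.
        cleaned[field] = html.escape(value) if isinstance(value, str) else value
    return cleaned
```

Rejecting a record outright when its shape is wrong is a deliberate choice here: silently dropping or coercing unexpected fields can hide exactly the kind of tampering the check is meant to catch.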

Be careful with downloaded packages

Data scientists learning Python for the first time will likely recall the incredible experience of typing a single line to automatically download packages with a plethora of new features. Developers typically use the standard package installer for Python (pip), explained Fischer, which pulls from the Python Package Index (PyPI). In short, the possibility of malicious packages on PyPI exists, especially under names that are common misspellings of popular packages. So be sure to spell the name of the package correctly.
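One way to catch a misspelled name before installing is to compare it against a list of packages the team already trusts. The sketch below does this with difflib from the standard library; the APPROVED set, the 0.8 similarity cutoff, and the check_package_name() helper are all assumptions made for illustration, not part of pip or PyPI.

```python
import difflib

# Hypothetical allowlist of packages the team has already vetted.
APPROVED = {"numpy", "pandas", "requests", "scikit-learn"}


def check_package_name(name):
    """Classify a requested package name as ok, suspicious, or unknown."""
    name = name.strip().lower()
    if name in APPROVED:
        return "ok"
    # A near miss to an approved name is a likely typo, and a likely
    # target for a look-alike malicious package.
    close = difflib.get_close_matches(name, sorted(APPROVED), n=1, cutoff=0.8)
    if close:
        return f"suspicious: did you mean {close[0]!r}?"
    return "unknown"
```

A name that comes back as "unknown" is not necessarily malicious; it simply has not been vetted yet and deserves a closer look before installation.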

An alternative for data scientists to work around this problem is to use something like Anaconda, which comes with most of the best Python packages already bundled. It’s free for individual use, or your organization may already be licensed for the Commercial Edition.

Set DEBUG = False in production

For data scientists, it makes sense to set DEBUG to False once the code is written and verified to work properly. After all, the code could be reused by other team members, who might deploy it in more public environments without much thought.

According to Fischer, most frameworks have debugging enabled by default. So be sure to turn off debugging in your preferred frameworks to prevent accidental leakage of sensitive application information to attackers.
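One hedged way to make the safe setting the default is to read the flag from the environment and treat anything other than an explicit opt-in as False. In the sketch below, the APP_DEBUG variable name and the debug_enabled() helper are hypothetical; each framework has its own way of consuming the resulting value.

```python
import os


def debug_enabled(environ=os.environ):
    """Return True only when APP_DEBUG is explicitly set to a truthy value."""
    value = environ.get("APP_DEBUG", "")
    return value.strip().lower() in {"1", "true", "yes"}


# Debugging stays off unless someone deliberately turns it on,
# so code copied into a new environment starts in the safe state.
DEBUG = debug_enabled()
```

On a development machine, setting APP_DEBUG=1 in the shell switches debugging on without touching the code that colleagues may later reuse.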

Carefully deserialize

One of Python’s strengths is that it infers the type of a variable from context, without requiring an explicit type declaration. This makes Python friendly for beginners, as does the ease of loading data from various sources.

But if you plan to use the pickle module to serialize or deserialize a Python object structure, Fischer notes that the module is considered insecure and should only be used on trusted data sources. Use YAML instead, he suggests, and use SafeLoader rather than Loader as the loader.
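The sketch below illustrates both halves of that advice. The first part shows why unpickling untrusted bytes is risky, using a hypothetical Payload class whose harmless arithmetic stands in for attacker code. The second part parses YAML with SafeLoader; it assumes the third-party PyYAML package is installed, and load_config() is a hypothetical helper name.

```python
import pickle

try:
    import yaml  # PyYAML, a third-party package; assumed to be installed
except ImportError:
    yaml = None  # the pickle half of the sketch still runs without it


class Payload:
    """Hypothetical object standing in for a tampered pickle file."""

    def __reduce__(self):
        # pickle calls eval("6 * 7") when this object is loaded.
        # A real attacker would substitute something far less harmless.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
# Loading runs the expression instead of rebuilding a Payload object.
unpickled = pickle.loads(blob)


def load_config(text):
    """Parse YAML text with SafeLoader, which only builds plain data types."""
    if yaml is None:
        raise RuntimeError("PyYAML is required for this part of the sketch")
    return yaml.load(text, Loader=yaml.SafeLoader)
```

With SafeLoader, a document that tries to use Python-specific tags to construct arbitrary objects is rejected with an error rather than executed.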

You can download the Python 2021 Security Best Practices Checklist here (pdf).

Paul Mah is the editor of DSAITrends. A former systems administrator, programmer, and computer teacher, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto / gorodenkoff
