Fourteen things you need to know about collaborating with data scientists


Think of your relationship as a partnership, rather than as a transaction, say data scientists. Credit: Morsa Images/Getty Images

Data science is increasingly an integral part of research. But data scientists can wear many hats: interdisciplinary translator, software engineer, project manager and more. Fulfilling all these roles is challenging enough, but the difficulty can be exacerbated by differing expectations and, frankly, an undervaluing of data scientists’ contributions.

For example, although our primary role is data analysis, researchers often approach data scientists for help with data acquisition and wrangling as well as software development. Although in one sense this is ‘technical’ work, which perhaps only a data scientist can do, thinking of it as such overlooks its deep connection with reproducible research. The work also involves elements of data management, project documentation and adherence to best practices. Solely emphasizing a project’s technical requirements can lead collaborators to view the work as a transaction rather than as a partnership. This misunderstanding, in turn, poses obstacles to communication, project management and reproducibility.

As data scientists with a collective 17 years of experience across dozens of interdisciplinary projects, we have seen at first hand what does and doesn’t work in collaborations. Here, we offer tips for how to make working relationships more productive and rewarding. To our fellow data scientists: this is how we strike the balance. To the general audience: these are the parts of data science with which everyone on the team should engage.

1. Develop a communication plan

Set boundaries and norms for how communication will happen. Do members want to meet virtually or in person? When, how often, and on what platform should they meet? Decide how you will record tasks, project history and decisions. Make sure all members of the team have access to the project records so that everyone is kept abreast of its status and goals. And identify any limitations due to IT policies or privacy concerns. For example, many US government agencies restrict employees to an approved list of software tools.

2. Communicate openly

Err on the side of over-communicating by including everyone on communications and making the project’s repositories available to all members of the team. Involve collaborators in technical details, even if they are not directly responsible for these aspects of the project.

3. Learn the lingo

Different disciplines can attach very different meanings to the same term. ‘Map’, for example, means different things to geographers, geneticists and database engineers. When discrepancies arise, ask for clarification. Learn about the other disciplines on your team and be prepared to learn their jargon and methods.

4. Encourage questions

Questions from people outside your domain can reveal important workflow difficulties, illuminate misunderstandings or expose new lines of enquiry. Don’t allow questions to linger; if you need time to consider an answer, acknowledge the question and follow up later. Address all questions with respect.

5. Communicate creatively

Diagrams, screenshots, process descriptions, and summary statistics can serve as a unifying language for team members and emphasize the bigger picture, avoiding unnecessary detail. Use them when you can.

6. Establish a timeline

Before starting the research, identify the goals and expected outputs of the collaboration. As a team, create a project timeline with concrete milestones, making sure to allow time for project set-up and data exploration. Ensure all team members are aware of the timeline and address any concerns before proceeding.

7. Avoid ‘scope creep’

One potential pitfall of working collaboratively is that a project’s scope can easily expand. To guard against this, when new ideas emerge, decide as a team if the new task helps you to meet the original goal. You might need to set the idea aside to stay on target. Perhaps this idea is the source of the next collaboration or grant application. A clear red flag is the question, “You know what would be cool?”

8. Plan for data storage and distribution

Agree early on about how and where the team will share files. This might involve your own servers, cloud storage, shared document-editing platforms, version-control platforms or a combination of these. Everyone should have appropriate levels of access. If there’s a chance that the project will produce code or data for public use, develop a written plan for long-term storage, distribution, maintenance, and archiving. Discuss licensing early.

9. Prioritize reproducibility

Develop a data-processing pipeline that extends from raw data to final outputs. Wherever possible, replace hard-to-reproduce graphical interfaces and ad hoc steps with coded alternatives written in languages such as Python, R and Bash. Use a version-control system, such as Git, to track changes to the project files, and an environment manager, such as conda, to track software versions.
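As a rough illustration of what a coded pipeline can look like, here is a minimal Python sketch that goes from raw data to a summary in named, rerunnable steps. Everything in it is an assumption made for the example: the column names `site` and `value`, the inline sample data and the function names are hypothetical, and a real project would read from files kept under version control rather than from a string.

```python
import csv
import io
import statistics
import sys

# Hypothetical raw data; in a real project this would be a file on disk.
RAW = "site,value\nA,1.0\nA,3.0\nB,not recorded\nB,4.0\n"


def load_raw(text):
    """Parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Keep only rows whose 'value' field can be read as a number."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"site": row["site"], "value": float(row["value"])})
        except (KeyError, TypeError, ValueError):
            # Skip rows with missing or non-numeric values.
            continue
    return cleaned


def summarize(rows):
    """Compute the mean value for each site."""
    by_site = {}
    for row in rows:
        by_site.setdefault(row["site"], []).append(row["value"])
    return {site: statistics.mean(values) for site, values in sorted(by_site.items())}


def run_pipeline(raw_text):
    """Run every step, in order, from raw text to final summary."""
    return summarize(clean(load_raw(raw_text)))


if __name__ == "__main__":
    # Record the interpreter version alongside the results.
    print("Python", sys.version.split()[0])
    for site, mean in run_pipeline(RAW).items():
        print(site, mean)
```

Because each step is a function rather than a manual action, any team member can rerun the whole analysis from the raw data and check that they get the same result.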

10. Document everything

Be proactive about documenting technical steps. Before you begin, write draft documentation to reflect your plan. Edit and expand the documentation as you progress, to clarify details. Maintain the documentation after the project concludes so that it serves as a reference. Write in plain language and keep jargon to a minimum. If you must use jargon, define it.

11. Develop a publishing plan

Although you can’t anticipate all project outputs in advance, discuss attribution, authorship and publication responsibilities as early as possible. This clarity provides a point of reference for reassessing participants’ roles if the project direction changes.

12. Embrace creativity

Collaborating with people who have diverse backgrounds and skill sets often sparks creativity. Be open to ideas, but be willing to put them on the back burner or discard them if they don’t fit the project scope and timeline. Working with domain experts in one-on-one advice sessions, incubator projects, and in-the-moment data-analysis sessions often surfaces new data sources or potential modelling applications, for example. More than a few of our current grant projects have their roots in what was at first an improvisational exercise.

13. Share the knowledge

Disciplines are vast, and knowing when to defer to others’ expertise is essential for project momentum and keeping contributions equitable. Striking this balance is especially important around project infrastructure. Not everyone needs to write or run code, for example, but learning how to use technical platforms, such as code repositories or data storage, rather than relying on others to do so, balances the workload. If collaborators want to be involved in technical details, or if the project will be handed over to them in the long term, data scientists might need to teach collaborators as well.

14. Stop gracefully

Recognize when a project has run its course, whether it has been successful or not. Ongoing requests for work such as new analyses often weigh unequally on those responsible for project infrastructure. If the project didn’t achieve its stated goals, look for a silver lining: falling short doesn’t mean failure if there are insights, results or new lines of enquiry to explore. Above all, respect the timeline and the fact that you and your collaborators have other responsibilities.

Interdisciplinary collaborations that integrate data science can be challenging, but we have found these guidelines to be effective. Many involve skills that you can develop and refine over time. Thoughtful communication, careful project organization and equitable working relationships transform projects into genuine collaborations, yielding research that would not otherwise be possible.
